Approving this for integration into JDK 18.
We have identified a trivial-to-fix performance issue in Constructor that accounted for a significant part of the remaining overhead.
We also observed that allocation overhead has been reduced in several microbenchmarks with the proposed changes, which means that GC can behave differently. The microbenchmarks should probably specify -Xmx1g or similar to reduce variance stemming from GC and heap sizing ergonomics.
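One way to pin the heap in JMH is an explicit fork configuration. This is an illustrative sketch, not the actual benchmark code or settings; the class and values here are hypothetical:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// Fix min and max heap so GC and heap-sizing ergonomics don't vary
// between machines or runs; values are illustrative only.
@Fork(value = 3, jvmArgsAppend = {"-Xms1g", "-Xmx1g"})
public class ReflectionBench {
    @Benchmark
    public Object newInstance() throws Exception {
        return StringBuilder.class.getConstructor().newInstance();
    }
}
```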
Remaining throughput overhead - now less than 1.5ns/op across the microbenchmarks - might be due to C2 tripping on subtle differences in code shape. Cleaning out debug scaffolding and dropping legacy code should help ensure optimal performance.
As follow-ups we should consider implementing method handle intrinsics so that we can reduce the current MH combinator build process to a single-step lambda form creation. Such LFs could then easily be pre-generated at build/jlink time, allowing them to be both archived in CDS and more AOT-friendly. This would mostly help startup/warmup, but might in practice also improve throughput, since custom LFs might be easier for JITs to optimize.
01-06-2021
Performance evaluation has been ongoing during the prototyping stage, and the implementation as of today has performance mostly on par with the baseline w.r.t. throughput in microbenchmarks. A few exceptions are being investigated, such as a ~20% regression when calling a Constructor. There are also slight startup regressions on a variety of applications.
To balance startup considerations against peak performance, the current prototype uses a tiered implementation - similar to the baseline: a generic MH is created up front, and then specialized by spinning a class that roots the MH in a static field. This allows the invoke to inline very well in most cases, but the increased complexity of the code shape seems to cause a slight degradation in some microbenchmarks compared to the baseline.
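The rooting trick can be sketched outside the JDK. The class and method names below are hypothetical (the real implementation spins a class per member rather than writing one by hand); the point is only that a static final MethodHandle is a constant to the JIT, so invokeExact through it can inline like a direct call:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class StaticRootSketch {
    // Rooting the handle in a static final field lets C2 treat it
    // as a compile-time constant and inline through the invocation.
    static final MethodHandle STRING_LENGTH;
    static {
        try {
            STRING_LENGTH = MethodHandles.lookup().findVirtual(
                    String.class, "length", MethodType.methodType(int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int len(String s) throws Throwable {
        // invokeExact requires the call site to match the handle's type exactly.
        return (int) STRING_LENGTH.invokeExact(s);
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(len("hello")); // 5
    }
}
```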
There is some startup overhead due to one-off initialization and the need to create non-trivial MH combinators. Not a huge cost amortized over time, but it still adds up to a 2-10% total overhead on smaller, reflection-heavy applications.
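As a rough illustration of the kind of combinator shaping involved (hypothetical example, not the actual JDK code): a direct handle has to be adapted to the erased (Object, Object[]) -> Object shape that a generic reflective invoker needs, which takes several adaptation steps today.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CombinatorSketch {
    static final MethodHandle GENERIC;
    static {
        try {
            // Direct handle: (String, String) -> String
            MethodHandle target = MethodHandles.lookup().findVirtual(
                    String.class, "concat",
                    MethodType.methodType(String.class, String.class));
            // Spread a trailing Object[] into the argument, then erase
            // all types to Object - two combinators per member today.
            GENERIC = target
                    .asSpreader(Object[].class, 1)
                    .asType(MethodType.methodType(
                            Object.class, Object.class, Object[].class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws Throwable {
        Object result = GENERIC.invokeExact((Object) "foo", new Object[]{"bar"});
        System.out.println(result); // foobar
    }
}
```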
There are some ideas for reducing these overheads, including implementing method handle intrinsics for the relatively simple combinators needed. Potentially this could also do away with the need for a staged approach, which might end up net neutral w.r.t. implementation complexity, would very likely reduce startup overheads, and might help remove some remaining rough edges in throughput microbenchmarks.
Overall I think we have a good understanding of how to address the performance concerns in the JDK 18 time frame, and further improvements could be considered as high priority follow-up RFEs/bugs.
Some minor enhancements are applicable regardless and have been spun off and integrated into the mainline, so the magnitude of the regressions should be smaller by the time this JEP is targeted.