JDK-8231349 : Lazy stub generation
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11,12,13,14,19
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2019-09-23
  • Updated: 2022-03-01
Other: tbd (Unresolved)
Description
During JVM bootstrap we go through StubGenerator::generate_all() to generate a number of specialized routines, many of which are used by the interpreter and compiled code alike. As we add more such specialized routines, we incur a growing bootstrap cost, as evidenced by JDK-8270323.

During C2 initialization, we also spend ~10 ms in a single thread generating stubs, most of which are not used early on. This happens in OptoRuntime::generate, which generates 14 different stubs. By making this initialization happen on demand, we would spend less CPU early on things that might not be used, and useful C2 compilations could happen sooner. Both effects could benefit startup time.

Both cases could benefit from a lazy initialization scheme where uninitialized stubs are generated on demand.
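
A minimal sketch of what "generated on demand" could look like, outside of HotSpot. Everything here is an assumption for illustration: the StubId values, the StubEntry type and the three stand-in routines are hypothetical, and a plain function pointer stands in for the address of assembled code. The sketch is single-threaded; thread safety is a separate concern.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stub identifiers; real HotSpot stubs are named differently.
enum class StubId : std::size_t { Crc32, ArrayCopy, AesGcm, Count };

// Stand-in for the entry address of generated machine code.
using StubEntry = int (*)(int);

// Stand-ins for the routines a real generator would assemble.
inline int crc32_stub(int x) { return x * 2; }
inline int arraycopy_stub(int x) { return x + 1; }
inline int aesgcm_stub(int x) { return x ^ 0x5a; }

class LazyStubTable {
 public:
  // Returns the entry for `id`, generating it on the first request only.
  StubEntry get(StubId id) {
    const auto idx = static_cast<std::size_t>(id);
    assert(idx < entries_.size());
    if (entries_[idx] == nullptr) {
      entries_[idx] = generate(id);
      ++generated_;
    }
    return entries_[idx];
  }

  // Number of stubs generated so far; unused stubs cost nothing.
  std::size_t generated_count() const { return generated_; }

 private:
  static StubEntry generate(StubId id) {
    switch (id) {
      case StubId::Crc32:     return crc32_stub;
      case StubId::ArrayCopy: return arraycopy_stub;
      case StubId::AesGcm:    return aesgcm_stub;
      case StubId::Count:     break;
    }
    return nullptr;
  }

  // Value-initialized, so every entry starts out as nullptr (not generated).
  std::array<StubEntry, static_cast<std::size_t>(StubId::Count)> entries_{};
  std::size_t generated_ = 0;
};
```

With this shape, a caller that only ever asks for StubId::Crc32 pays for one generation, and repeated requests return the same entry.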
Comments
Would it be possible to move stubs only used by C2 from StubGenerator::generate_all to OptoRuntime::generate? That might be a low-effort way to eliminate much of the regression observed in JDK-8270323, since we would move work from the bootstrap thread to the initial C2 thread, and it would ensure that modes that don't run C2 at all don't pay the cost. (If Graal uses the same stubs, they could be factored out in a way that makes it easy to call the same initializer when initializing JVMCI.)
13-08-2021

Good point about -XX:+PrintStubCode. The only downside of an uncommon trap is that when the path in compiled code is taken, deoptimization and recompilation are triggered while the called stub is still not assembled. That may lead to a few cycles of recompilation/deoptimization. That is why I talked about performance testing of both approaches:
1. Get the compiled method fast, but have uncommon traps for called stubs that are not ready. This could be beneficial when, based on profiling, the paths that call stubs are not taken.
2. During compilation, assemble the called stubs, which delays when the compiled method is ready. This could be beneficial when the paths are definitely taken.
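
For experimenting with the two approaches, the choice could be expressed as a small per-call-site decision. This is only a sketch under stated assumptions: StubCallPlan, plan_stub_call, taken_count and hot_threshold are hypothetical names, not HotSpot code or flags, and whether profile counts are the right signal is exactly what the performance testing would have to show.

```cpp
#include <cstdint>

// Hypothetical outcomes for a call site that targets a lazily generated stub.
enum class StubCallPlan {
  CallDirectly,          // stub already assembled: emit a normal call
  EmitUncommonTrap,      // approach 1: compile fast, deoptimize if the path is hit
  GenerateDuringCompile  // approach 2: assemble the stub now, delaying the method
};

// `taken_count` stands in for a profile counter of the path that calls the stub.
// `hot_threshold` is a hypothetical tuning knob, not an existing JVM flag.
inline StubCallPlan plan_stub_call(bool stub_ready,
                                   std::uint64_t taken_count,
                                   std::uint64_t hot_threshold) {
  if (stub_ready) {
    return StubCallPlan::CallDirectly;
  }
  if (taken_count >= hot_threshold) {
    // The path is known to be taken, so a trap would likely fire right away.
    return StubCallPlan::GenerateDuringCompile;
  }
  // Rarely or never taken: avoid paying for a stub that may not be needed.
  return StubCallPlan::EmitUncommonTrap;
}
```

Setting the threshold to 0 always picks approach 2 for missing stubs, and setting it to the maximum value picks approach 1 for any realistic count, which makes it easy to measure both extremes.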
12-08-2021

[~kvn] Stub code printing through the JVM flag -XX:+PrintStubCode may also need appropriate handling. Also, introducing an uncommon trap is safe but has a penalty on generated code size. Since there is no guarantee that the stub call path will be taken at run time, preventing compilation of the entire method may turn out to be costly. Maybe one could add a runtime option that delays compilation of the entire method until its dependent stubs are assembled; in this mode uncommon traps could be avoided altogether, with zero code size impact.
12-08-2021

[~kvn]: true, the ones generated in OptoRuntime::generate() would only indirectly affect startup unless you're running on a single CPU thread (which appears to be more common than expected in cloud environments). Did anyone in the compiler team volunteer to take this little project on? Dave asked me to take a look and I'm happy to attempt a solution, but it's definitely outside of my comfort zone - especially if you think we'll need to emit new uncommon traps.
11-08-2021

Note that OptoRuntime::generate() produces very small wrappers, which should not affect startup time. But I am fine if you investigate that too. What affects startup are the intrinsic stubs in `generate_all()`: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp#L7802
We had a discussion today and a few items came up:
1. Some stubs have to be generated during startup, as we do now, because they are used from the start.
2. Lazily generate a stub when the corresponding Java method is called in the interpreter. The interpreter should wait for stub generation; it is not performance critical.
3. Make sure only one thread generates a stub; other threads should wait. This is the case when several threads running in the interpreter request a stub.
4. Create an uncommon trap in the code if a JIT compiler (C1 or C2) sees that a stub is not generated yet. At the same time, trigger the stub generation (maybe by a vmOperation).
5. We should have a testing mode that requests generation of all stubs at startup (as now), to check that we have enough space in the CodeCache for all of them.
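
Items 2, 3 and 5 could be sketched roughly as follows, using standard C++ rather than HotSpot primitives. All names here (ConcurrentStubTable, kStubCount, sample_stub, request_from_threads, the generate_all_at_startup parameter) are hypothetical, and std::call_once is a stand-in for whatever locking the VM would actually use: it lets exactly one thread run the generator while the other requesters block until the entry is ready.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kStubCount = 3;  // hypothetical number of lazy stubs
using StubFn = int (*)(int);           // stand-in for a generated code address

// Stand-in for the code a real generator would assemble.
inline int sample_stub(int x) { return x + 1; }

class ConcurrentStubTable {
 public:
  // Item 5: a testing mode that generates every stub up front, as today.
  explicit ConcurrentStubTable(bool generate_all_at_startup = false) {
    if (generate_all_at_startup) {
      for (std::size_t i = 0; i < kStubCount; ++i) entry(i);
    }
  }

  // Items 2 and 3: the first requester generates the stub; concurrent
  // requesters wait inside call_once until generation has finished.
  StubFn entry(std::size_t i) {
    Slot& slot = slots_.at(i);
    std::call_once(slot.once, [this, &slot] {
      slot.fn = sample_stub;  // "generation" is simulated here
      generated_.fetch_add(1);
    });
    return slot.fn;
  }

  int generated_count() const { return generated_.load(); }

 private:
  struct Slot {
    std::once_flag once;
    StubFn fn = nullptr;
  };
  std::array<Slot, kStubCount> slots_;
  std::atomic<int> generated_{0};
};

// Has `threads` threads request the same stub at once and returns how many
// of them received a usable entry.
inline int request_from_threads(ConcurrentStubTable& table, std::size_t stub,
                                int threads) {
  std::atomic<int> usable{0};
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&table, &usable, stub] {
      if (table.entry(stub)(41) == 42) usable.fetch_add(1);
    });
  }
  for (auto& th : pool) th.join();
  return usable.load();
}
```

Item 4 (the uncommon trap emitted by the JIT) is not covered by this sketch, and a real implementation would also have to handle the CodeCache running out of space during a late generation, which is what the eager testing mode is meant to catch early.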
11-08-2021

AES-GCM is another large intrinsic that would benefit.
11-08-2021