JDK-8219233 : 3x performance drop for some Clojure applications
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 8u201,11.0.2,12,13
  • Priority: P2
  • Status: Resolved
  • Resolution: Other
  • OS: generic
  • CPU: generic
  • Submitted: 2019-02-15
  • Updated: 2019-08-15
  • Resolved: 2019-06-11
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 13
Related Reports
Relates :  
Relates :  
Relates :  
Sub Tasks
JDK-8219492 :  
JDK-8219969 :  
JDK-8219974 :  
Mac OS X 10.13.6 on 2018 Macbook Pro

I am part of the Clojure core team. I have received at least 4 reports in the last day of significantly degraded performance for Clojure when using the latest Java 1.8 and 11 releases (1.8.0_201 and 11.0.2) vs the prior releases (1.8.0_192 and 11.0.1), with no other changes.

REGRESSION : Last worked in version 11.0.1

You can find a reproducible case at https://github.com/hiredman/weird-user-load-performance.

1) Install the Clojure clj dependency manager / runner (see https://clojure.org/guides/getting_started)
2) Clone the repo https://github.com/hiredman/weird-user-load-performance.git
3) cd weird-user-load-performance
4) clj -A:user

clj is used basically just to download the dependency jars from Maven, form the classpath, and start the Clojure runtime to run a script, which just loads a Clojure namespace (and subsequently all downstream namespaces). The "work" happening here is loading .clj files from jar files on the classpath, running the compiler, generating bytecode, and loading it in a classloader.

The first run of clj will download all of the transitive deps so that run should be discarded. All subsequent runs should use the cached jars in ~/.m2/repository and a cached classpath, so should not use the network.

This is a narrowed down use case from a more complicated library, but one aspect that seems to aggravate things is that there is very macro-heavy code, which is single-thread CPU-intensive.

The program will print a timing of loading the namespaces. The expected result is that the performance is approximately the same in 1.8.0_192 vs 1.8.0_201 or 11.0.1 vs 11.0.2.
1.8.0_192:   2834.550371 msecs
1.8.0_201:   9545.118397 msecs   (3.3x performance degradation)

11.0.1:   2945.72692 msecs
11.0.2: 10817.427187 msecs   (3.7x performance degradation)

---------- BEGIN SOURCE ----------
Unfortunately, I have been unable to narrow this down enough to be able to reproduce it in a Java-only environment. The Clojure source files are at the root of the repo - there are only 3 lines of Clojure code that just trigger a lot of compilation. I am hoping there is enough information or that this report can be combined with others to discover the cause.
---------- END SOURCE ----------

Clojure users are currently downgrading to the prior version to avoid the issue.

FREQUENCY : always

RT Triage: the issue was fixed by the linked bugs.

x86-specific fix for 13 is tracked by JDK-8223213. Webrev: http://cr.openjdk.java.net/~vlivanov/8223213/webrev.00/ RFR: https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-May/037903.html

Interperter/C1/C2 clinit barrier support on x86_64: http://cr.openjdk.java.net/~vlivanov/8219233/webrev.02/ $ java -version openjdk version "13-internal" 2019-09-17 OpenJDK Runtime Environment (build 13-internal+0-adhoc.vlivanov.jdk) OpenJDK 64-Bit Server VM (build 13-internal+0-adhoc.vlivanov.jdk, mixed mode, sharing) $ for ((I=0; I<10; I++)); do clj -A:user; done "Elapsed time: 3492.346187 msecs" "Elapsed time: 3266.464127 msecs" "Elapsed time: 3262.715666 msecs" "Elapsed time: 3374.897899 msecs" "Elapsed time: 3156.932682 msecs" "Elapsed time: 3259.29953 msecs" "Elapsed time: 3223.25165 msecs" "Elapsed time: 3278.636153 msecs" "Elapsed time: 3168.031023 msecs" "Elapsed time: 3230.07922 msecs"

Updated prototype: http://cr.openjdk.java.net/~vlivanov/8219233/webrev.01/ Benchmark: $ clj -A:user (1) baseline (jdk13 + JDK-8219492) (-XX:-UseNewCode -XX:-UseNewCode2) "Elapsed time: 13316.689099 msecs" "Elapsed time: 13046.727283 msecs" "Elapsed time: 12091.244153 msecs" "Elapsed time: 11348.779809 msecs" "Elapsed time: 11871.087797 msecs" (2) clinit barriers on nmethod entry & in c2i adapters (-XX:+UseNewCode -XX:-UseNewCode2) "Elapsed time: 3826.157429 msecs" "Elapsed time: 3713.880772 msecs" "Elapsed time: 3932.874569 msecs" "Elapsed time: 3719.124583 msecs" "Elapsed time: 3869.528133 msecs" (3) skip clinit checks in compiled code (-XX:-UseNewCode -XX:+UseNewCode2) "Elapsed time: 3806.31077 msecs" "Elapsed time: 3757.50956 msecs" "Elapsed time: 3817.134897 msecs" "Elapsed time: 3890.512896 msecs" "Elapsed time: 3988.982547 msecs" (4) no clinit checks both in compiled & interpreter code(-XX:-UseNewCode -XX:+UseNewCode2 -XX:+UseNewCode3) "Elapsed time: 3512.457131 msecs" "Elapsed time: 3182.654609 msecs" "Elapsed time: 3271.498659 msecs" "Elapsed time: 3313.055795 msecs" "Elapsed time: 3112.831206 msecs" #2 vs #3 demonstrates the cost of clinit barriers in generated code (nmethods entry + c2i adapters) which is negligible on the particular benchmark ("clj -A:user "startup). The rest of the overhead (#2/#3 vs #4) comes from interpreter - CP cache entry reresolution.

This main bug is now to be used for the JIT changes. The simple runtime changes are handled in the sub-task.

Here's the prototype of #3 approach: http://cr.openjdk.java.net/~vlivanov/8219233/webrev.00/ (x86_64 only, applies on top of http://cr.openjdk.java.net/~dholmes/8219233/webrev.v2/ ) Some performance & results & diagnostic data: * baseline (reresolution): -XX:-UseNewCode -XX:-UseNewCode2: * fixed: -XX:+UseNewCode -XX:-UseNewCode2 * disabled checks (no reresolution): -XX:-UseNewCode -XX:+UseNewCode2 ======================================== -XX:-UseNewCode -XX:-UseNewCode2 "Elapsed time: 13896.292601 msecs" real 0m14.705s user 0m34.613s sys 0m0.696s -XX:+CountRuntimeCalls 5817122 SharedRuntime::resolve_static_call_C(JavaThread *thread ) 348097 InterpreterRuntime::resolve_from_cache(JavaThread* thread, Bytecodes::Code bytecode) 1637 SharedRuntime::handle_wrong_method_ic_miss(JavaThread* thread) ======================================== -XX:+UseNewCode -XX:-UseNewCode2 "Elapsed time: 5027.573052 msecs" real 0m5.831s user 0m17.084s sys 0m0.544s -XX:+CountRuntimeCalls 18777 SharedRuntime::resolve_static_call_C(JavaThread *thread ) 489072 InterpreterRuntime::resolve_from_cache(JavaThread* thread, Bytecodes::Code bytecode) 1914 SharedRuntime::handle_wrong_method_ic_miss(JavaThread* thread) ======================================== -XX:-UseNewCode -XX:+UseNewCode2 "Elapsed time: 4072.403491 msecs" real 0m4.923s user 0m14.604s sys 0m0.450s -XX:+CountRuntimeCalls 4128 SharedRuntime::resolve_static_call_C(JavaThread *thread ) 113648 InterpreterRuntime::resolve_from_cache(JavaThread* thread, Bytecodes::Code bytecode) 1222 SharedRuntime::handle_wrong_method_ic_miss(JavaThread* thread)

I've been experimenting with different approaches, but haven't found good enough variant yet. I have a working fix for C2, but slowdown is still ~2x and the bottleneck is in C1 code. There are 3 modes in play here (interpreter, C1, and C2) and I consider the following approaches: (1) barrier on caller side: leave interpreter as is (reresolution), put explicit check in C1/C2 around static calls to methods which holder is being initialized; It fits interpreter & C2 well, but requires significant (+ platform-specific) changes in C1. The main complication is that C1 compiles calls to unloaded methods and uses patching to fix the code on the first invocation, so there's no way to insert a barrier before the call. Requires either unconditional deoptimization support or extended support for Klass* constant patching (load holder class on the first check and patch the check). (2) barrier on callee side: put init barrier on nmethod entries for not-yet-initialized classes and normal method entry in interpreter. For generated code (C1/C2), barrier triggers deoptimization when holder is being initialized and current thread doesn't match init thread. Then, during deoptimization the execution is blocked until initialization is over. But for interpreter it requires platform-specific code on normal method entry. (3) mixed (best of 2 prev. approaches): check on caller side in interpreter and on callee side in C1/C2. But it leaves 1 invocation mode not covered - compiled-to-interpreter. It's possible to put a barrier into c2i adapter, but (again) requires platform-specific changes. So far, I'm in favor of #1 & #3.

That essentially sounds like a kind of per-thread resolution, which I've had some discussions with Claes about. We already have class initialization "barriers" - that's what an unresolved method call is. What we lack is a way to treat the call-site as resolved only for the thread performing the clinit. The idea that you compile a method a special way while the class is uninitialized, then throw it away and recompile, may address the current problem, but sounds like it will instead have a negative impact on other usecases.

The only way I see to fix the performance regression is to insert class initialization barriers (cheap ones!) in generated code (either at call sites or at nmethod entry). Relevant discussion (w/ some details) during JDK-8188133 review: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-February/032615.html

Excerpt from diagnostic VM output (-XX:+TraceCallFixup -XX:+PrintCompilation -XX:+PrintInlining): 6363 132 b clojure.core$first__5369::invokeStatic (7 bytes) @ 3 clojure.lang.RT::first (35 bytes) failed initial checks resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x0000000113751510 resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x0000000113751510 resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x0000000113751510 resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x0000000113751510 resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x0000000113751510 ... resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x00000001137cb410 resolving static (invokestatic) call to clojure.lang.RT::next at pc: 0x00000001137a13d4 to code: 0x00000001137e5090 resolving static (invokestatic) call to clojure.lang.RT::seq at pc: 0x00000001137bcfd4 to code: 0x00000001137ecd10 resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x00000001137cb410 resolving static (invokestatic) call to clojure.lang.RT::next at pc: 0x00000001137a13d4 to code: 0x00000001137e5090 resolving static (invokestatic) call to clojure.lang.RT::seq at pc: 0x00000001137bcfd4 to code: 0x00000001137ecd10 resolving static (invokestatic) call to clojure.lang.RT::first at pc: 0x00000001137513d4 to code: 0x00000001137cb410 resolving static (invokestatic) call to clojure.lang.RT::next at pc: 0x00000001137a13d4 to code: 0x00000001137e5090

As expected, most of the overhead comes from call site reresolutions in compiled code: 10 iterations: $ clj -A:user 8219233/webrev (CPCache resolution and nmethod call site patching DISABLED): Elapsed time: 11.28s +-0.16 real 11.99s +-0.16 user 28.79s +-0.97 sys 0.61s +-0.02 8219233/webrev w/ CPCache resolution ENABLED (cpCache.cpp): Elapsed time: 9.34s +-0.16 real 10.13s +-0.18 user 19.36s +-0.68 sys 0.45s +-0.01 8219233/webrev w/ call site patching ENABLED (sharedRuntime.cpp): Elapsed time: 3.45s +-0.07 real 3.81s +-1.21 user 8.93s +-0.13 sys 0.32s +-0.01 baseline: CPCache resolution and nmethod call site patching ENABLED Elapsed time: 3.10s +-0.01 real 3.49s +-0.09 user 8.22s +-0.13 sys 0.35s +-0.02

JDK-8188133 doesn't make a difference, because it helps only in the case when inlining happens (and call site goes away, thus effectively eliminating initializaiton barrier).

There is a simple adjustment to the fix for JDK-820574, which continues to mark the call as resolved when the caller class is the same as the callee class (in which case the current thread must have <clinit> on its callstack). This restores performance of the ClinitCalls microbenchmark. > ../jdk-dev2/build/linux-x64/images/jdk/bin/java -Xint ClinitCalls Nanos/call: 72 > java11 -Xint ClinitCalls Nanos/call: 74 > ../jdk-dev2/build/linux-x64/images/jdk/bin/java ClinitCalls Nanos/call: 3 > java11 ClinitCalls Nanos/call: 4 > ../jdk-dev2/build/linux-x64/images/jdk/bin/java -Xcomp ClinitCalls Nanos/call: 2 > java11 -Xcomp ClinitCalls Nanos/call: 3 http://cr.openjdk.java.net/~dholmes/8219233/webrev/

Thanks David for looking into this issue, i just verified your patch, it does restores performance caused by JDK-8205714 for ClinitCalls. I also verified with Clojure application, but still it doesn't look ok. Below are the verification results 1) jdk-12-ea+27 without JDK-8205714 "Elapsed time: 10583.491926 msecs" 2) jdk-12-ea+28 with inclusion of JDK-8205714 "Elapsed time: 26956.274453 msecs" 3) jdk-13-internal with your patch "Elapsed time: 24325.168679 msecs" There is slight improvement, but still 2.3x drop.

I had a simple optimization to avoid unnecessary calls to initialize() when we're already initializing but this leads to (inexplicable) JCK test failures.

Additional Information from submitter: After posting this bug report, more work was done to narrow down the problem. It appears that the slowdown is specifically related to any work done in the static initializer. Clojure's runtime is loaded in clojure.lang.RT. Clojure has support for a well-known "user.clj" file which developers can put initialization code for development (loading common utilities, etc). This file is typically not used in production setups but is often used in development, sometimes to load a large number of namespaces and set up system state (this might require loading and compiling hundreds of files). I now have a simpler repro at https://github.com/puredanger/slow-user-load, which just does some meaningless busy-work either in or out of user.clj. The user.clj file is loaded during the RT static initializer. It seems that work running in the static initializer is now running at interpreted speeds as of these releases, which has a dramatic effect on the performance of this use case in Clojure. This seems to be due to the fix for JDK-8215634 / JDK-8205714 (not public).

We will look at this from a number of angles. I'm taking this for runtime as the RE for the fix that introduced the regression. There may be something we can do in the runtime code to mitigate the overhead associated with the static calls. I will file a sub-task for the compiler team to look into what optimizations may be possible on their side. Possible work-around: off-load the static method of the current class to a static method of a nested helper class.

Having finally gotten clj/clojure up and running it appears that JDK-8188133 doesn't help for the case reported here.

JDK-8188133 (13-b07) is the reason we see less of a slowdown on ClinitCalls in recent builds. It applied cleanly to jdk12 (but not jdk11u) where I could verify a similar improvement.

This little program show a 64x increase in the printed elapsed time going from 12-b27 to 12-b28: public class ClinitCalls { public static void main(String[] args) { } static { long start = System.nanoTime(); for (int i = 0; i < 1000000; i++) { callIt(); } long end = System.nanoTime(); System.out.println("Elapsed: " + (end - start)); } static void callIt() { } } I'd note that recent jdk/jdk builds sees the cost in this example drop from ~64x to ~8x relative to 12-b27, due some as yet unidentified change. So there might be some things we can do to recuperate here.

Issue is discussed a little here (thanks [~redestad]): https://dev.clojure.org/jira/browse/CLJ-2484 But I'm having a great deal of trouble seeing how the fix relating to static initialization could result in such significant performance reduction. The ability to compile and optimize the clinit method, and anything it calls should not be affected, other than any invokestatic back to the current class (being initialized) will need to take a trip into the class-initialization logic. Is that what happens?

I have heard that the code involved uses a static initializer that continues to run rather than completing as a static initializer is expected to do. While <clinit> is executing the class remains in the "initializing" state and all access to it by other threads is prohibited - they will block until static initialization completes. A bug that has been fixed in the versions showing the current problem, previously allowed access to the incompletely initialized class - but that is prohibited.

There is performance drop observed in all the versions of jdk. Observed performance drop - 2.5x to 3x Below are the results 8u192 GA - Pass 8u201 b01 - Fail jdk 11.0.1 GA - Pass jdk 11.0.2 b01 - Fail jdk 12 ea b27 - Pass jdk 12 b28 - Fail