JDK-8302314 : JDK-8231895 may cause severe performance degradation, at least for VMWare ESX Hosts/Windows Guests
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17,20
  • Priority: P4
  • Status: Closed
  • Resolution: External
  • OS: windows
  • CPU: x86_64
  • Submitted: 2023-02-13
  • Updated: 2023-07-13
  • Resolved: 2023-07-13
Description
Below is the full problem description from our customer, following JDK-8231895 (Avoid String allocations in JVM_FindLoadedClass):

------------------------------------------------------------------------------------

For a few months we have been shipping the latest Java 17 Zulu JDK with our product. Those updates are now slowly reaching production systems, and those systems show an increased failure rate. Here are the symptoms and what we have already tried:

a) The symptom is that during startup of our large (OSGi-based) application server the CPU is fully used for a long(er) time, most of it as high kernel CPU time on Windows.

b) It happens only on Windows, but on different versions (Server 2016, Server 2019).

c) It happens only on ESX hosts with those Windows x64 guests (different systems). We could not yet reproduce it on a KVM-based OpenStack host.

d) It happens in smaller (4 vCore) and larger (16 vCore) guests.

e) (We might have seen it with Win11 on bare hardware as well, but that's not 100% certain.)

f) When this happens the machine becomes hard to debug remotely, so the information we could gather so far is limited:

a. In Sysinternals Process Explorer we see that the CPU usage comes from the java process.

b. When looking at thread details we see multiple threads with 0.1 CPU usage. Those threads show Windows stack traces like JVM_findClass or JVM_registerSignal -> KeWaitForSingleObject -> KeMutex. Unscientifically speaking, it looks as if synchronized blocks are not pausing the CPU.

c. Sometimes the system recovers after an hour, and then it might continue to use a single core full time in kernel mode (here, for example, 26648).

d.

e.

f. We did a lot of starts/stops on test systems, so it is not clear whether there is a correlation or not, but some of those starts do not hang as described above but crash. If they crash they do not write an hs_err file, but I see crash messages in the Windows event log.
Faulting application name: java.exe, version: 17.0.4.1, time stamp: 0x55db5af9
Faulting module name: ntdll.dll, version: 10.0.17763.3232, time stamp: 0xd6e0b8e1
Exception code: 0xc0000005
Fault offset: 0x00000000000dae5d
Faulting process id: 0x510
Faulting application start time: 0x01d8ce85298bd720
Faulting application path: H:\BIS\runtime\jvm64\bin\java.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: 69e99d94-02e8-48af-a977-e274e288a064
Faulting package full name:
Faulting package-relative application ID:

g) When we keep everything the same (OS, app, load, hypervisor) but replace the JDK 17 with a JDK 11, the problem is gone (or at least the few tries we did did not show it).

h) We tried different Java 17 versions, especially 17.34+19, 17.36+14-SA and 17.36+18-SA.

i) We did not see it on Linux or on other hypervisors; however, our test base is small. These Windows installations make up the majority of users of this new Java version.

j) The problem affects only system startup. So if an instance is in this state and we terminate it and start it multiple times until it does not get into this state (or start components manually after global startup), it normally behaves fine.

k) We ran automated startup tests on the aforementioned KVM system, and 50 startups with Java 11 and Java 17 did not show such a problem on those Windows machines. We plan to repeat this test on ESX/VMware hosts later next week to see whether it can be reproduced outside of production systems.

------------------------------------------------------------------------------------

Investigating the problem, we found that the regression was introduced exactly between 13 and 14. Since we were limited in the use of debug tools, I added some extra logging to capture system performance metrics such as CPU consumption by processes/threads, memory usage (page faults, RAM/swap usage), I/O rates, etc. From this we figured out that typically 10-15 of the user's threads consume all available CPU time, and the rest of the system (including other processes and the VM/JFR threads) is starved of timeslices. The consumers may differ from one time mark to another, but they are all worker threads.

We captured several core dumps using the HandshakeTimeout option to crash the JVM and found that a lot of worker threads are in JVM_FindLoadedClass(), which is how we came to suspect JDK-8231895 (the only change to JVM_FindLoadedClass between 13 and 14). After reverting JDK-8231895 the problem was gone. The only explanation we have is that the removed allocation possibly hit a synchronization primitive and so forced scheduler arbitration, whereas now code like this

https://github.com/apache/felix-dev/blob/master/framework/src/main/java/org/apache/felix/framework/BundleWiringImpl.java#L1488

faces no restrictions (a problem in the Windows scheduler?) and consumes all available CPU time.
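For readers unfamiliar with the linked code, here is a purely illustrative sketch of this kind of delegation pattern (hypothetical code, not the actual Felix BundleWiringImpl source): every probe ends up in the loader's findLoadedClass()/loadClass() path, and with the allocation removed there is little left on that path that blocks or yields.

import java.util.List;

// Hypothetical sketch only, not Felix code: many threads running a loop like this
// concurrently spend nearly all their time in JVM_FindLoadedClass/JVM_DefineClass.
final class DelegationSketch {
    static Class<?> searchDelegates(List<ClassLoader> delegates, String name) {
        for (ClassLoader cl : delegates) {
            try {
                return Class.forName(name, false, cl); // consults findLoadedClass first
            } catch (ClassNotFoundException e) {
                // not visible through this delegate; try the next one
            }
        }
        return null;
    }
}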

We also tried to limit the number of threads running JVM_FindLoadedClass() simultaneously, using Semaphore::wait_with_safepoint_check() with a semaphore capacity of <number of cores> - 1, and that even boosted system performance. So we will propose limiting the number of threads concurrently calling the Felix API as a workaround, but it still looks like a JDK issue.
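Purely as an illustration of the proposed application-level workaround (the experiment above was done inside the VM with Semaphore::wait_with_safepoint_check), a minimal Java-side sketch; the class and method names are hypothetical:

import java.util.concurrent.Semaphore;

// Hypothetical sketch of an application-level throttle: at most (cores - 1) threads
// may be on the class-loading path at once; the rest block on the semaphore instead
// of spinning in JVM_FindLoadedClass.
final class ThrottledClassLoading {
    private static final Semaphore PERMITS =
            new Semaphore(Math.max(1, Runtime.getRuntime().availableProcessors() - 1));

    static Class<?> load(ClassLoader loader, String name) throws ClassNotFoundException {
        PERMITS.acquireUninterruptibly();
        try {
            return Class.forName(name, false, loader);
        } finally {
            PERMITS.release();
        }
    }
}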
Comments
After backporting JDK-8300926 to zulu17 the problem is gone: in 5-10% of starts the system still shows high CPU consumption for 2-3 minutes, but then it settles down. That is nothing compared to 30-40 minutes of 100% CPU load without JDK-8300926. As I see it, we had a positive feedback loop here: all READY threads were running exactly class loading, which caused deoptimizations and prevented compilation for the rest of the threads, so the scheduler simply had no alternative and kept running the SeeHFWorker* threads longer and longer. The customer fixed the problem by limiting the number of threads running class loading in parallel, but this still looks like a good argument for backporting JDK-8300926 to jdk17.
21-04-2023

There are over 200 runnable threads that are not safepoint safe. On Windows this means that, in the best case, each of them gets to run once every 250 ms (16 vCPUs). When a thread is context-switched out in a non-safepoint-safe state, the handshake needs to wait until that thread gets to execute again (on-proc) and reaches such a state (either by blocking or by hitting a safepoint poll). During class loading we need to deoptimize those compiled methods that made assumptions about the class graph. We do this by walking the stacks and invalidating returns into no-longer-valid methods (patching the return PC). In this scenario I would not be surprised if a handshake under this load takes a second or so. We need a CPU to drive the handshake, there are other processes using CPU, and other Java threads (that have already performed the handshake) may become runnable, so we are not only running "un-handshaked" threads.

JDK-8300926 may even make this worse, since it handshakes without holding any locks. Here we have 78 threads waiting for that lock, so it will reduce the amount of CPU available to the "un-handshaked" threads. The reason deoptimization behaves better for this workload on older JDKs is that they safepoint, thus stopping all threads, which means there are fewer and fewer threads competing for CPU time. In a balanced workload this means there is a lot of free CPU time during a safepoint, which hurts throughput and latency, since the first thread to stop needs to wait until the last thread has stopped.

Why is there a SeeHFWorker173? It seems really bad to have 174 such workers on such a small system. They seem to be the ones doing all the SymbolTable lookups. Since most of them are running equals in the SymbolTable, there must be multiple items per bucket. Increasing the table size will avoid the equals, i.e. the chance that a bucket is empty increases, so the truncated hash becomes 'more' unique. While a thread is iterating over the buckets and doing equals we cannot handshake it, so this may help a small amount. Setting SeeHFWorker to 8 or so will help a lot.
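For illustration only, a minimal sketch of that last suggestion, assuming the SeeHFWorker threads come from a standard executor (how the customer's pool is actually configured is not known from this report):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: cap the application's worker pool well below the vCPU count
// so fewer threads sit runnable in safepoint-unsafe states at any one time.
final class WorkerPoolSizing {
    static ExecutorService createSeeHfPool() {
        int workers = Math.min(8, Runtime.getRuntime().availableProcessors());
        return Executors.newFixedThreadPool(workers, r -> new Thread(r, "SeeHFWorker"));
    }
}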
21-02-2023

OK, I'll try to backport the fix and create an evaluation build for the customer.
20-02-2023

Okay, so JDK-8300926 may well be the fix then. Paging [~rehn]!
20-02-2023

[~dholmes] The customer raised the issue against 17, but then we figured out that the problem appeared in 14, exactly when deopt started using handshakes.
20-02-2023

[~apavlyutkin] what JDK version was that analysis done on? Those stacks look to me like the issue in JDK-8300926 caused by a bad interaction between handshakes and deopt.
20-02-2023

I tried simulating what you might be seeing by playing with the existing Class.forName micros, and found a bottleneck in URLClassPath that means threads will synchronize when getting hold of the class loader. I've prototyped a fix and see a decent improvement to scalability in the micros: https://github.com/openjdk/jdk/pull/12619 The other patches above had little to no effect in the ClassForName microbenchmarks run with thread counts above 1, but this had a significant effect. Can you give it a try? This is probably a worthwhile fix regardless, assuming I got the concurrency right.
17-02-2023

Sorry if I'm grasping at straws here since I have no reproducer. If you could provide one then I might be able to gain more insight into what's happening here. as_utf8_string doesn't do much other than allocate a buffer in a resource area and copy into it. I can't see how as_utf8_string or the (*p != ..) loop can stall for seconds. Tangentially related work is in flight:
- https://github.com/openjdk/jdk/pull/12585 will reduce the need to grab the Compile_lock
- https://github.com/openjdk/jdk/pull/12562 reduces allocations when adding entries to the Symbol table, which might have some effect when combined with #12585.
As for java_lang_String::as_utf8_string, there's a small deficiency in that we unnecessarily call Thread::current() when doing the resource area allocation. This patch would reduce that: https://github.com/openjdk/jdk/compare/master...cl4es:jdk:as_utf8_thread?expand=1 - though I'm not sure it matters at all for what you're seeing.
17-02-2023

[~redestad] I doubt that the problem is about Symbol lookup, because so many threads are in as_utf8_string() or in while (*p != '\0') {...}; that does not look like a collision on Symbol lookup. IMHO the problem is that most of the READY threads are frequently running exactly JVM_FindLoadedClass/JVM_DefineClass (https://github.com/apache/felix-dev/blob/1bdd64eff5cd818f31953e193f49014c3124d968/framework/src/main/java/org/apache/felix/framework/BundleWiringImpl.java#L2137), while the rest of the system (probably intending to do something else) is in the BLOCKED state. A lot of threads are blocked in CompileBroker::compile_method_base() on CompileQueue_lock. That lock is owned by C2 CompilerThread0, which in turn is blocked on Compile_lock along with C1 CompilerThread0, and Compile_lock is held by a user thread:

.454 Id: 2134.1a38 Suspend: 0 Teb: 00000023`4a226000 Unfrozen "StartupThread#AS2 Client"
Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)> (00007ffe`2af02650)
Priority: 0 Priority class: 32 Affinity: f
 # Child-SP RetAddr Call Site
00 00000023`307fd0f8 00007ffe`2deb87c2 ntdll!NtWaitForAlertByThreadId+0x14
01 00000023`307fd100 00007ffe`2a56cdd9 ntdll!RtlSleepConditionVariableCS+0xd2
02 00000023`307fd170 00007ffd`f545f8d9 KERNELBASE!SleepConditionVariableCS+0x29
03 00000023`307fd1a0 00007ffd`f5430e94 jvm!os::PlatformMonitor::wait+0x29 [s\src\hotspot\os\windows\os_windows.cpp @ 5623]
04 00000023`307fd1d0 00007ffd`f5621de5 jvm!Monitor::wait+0xb4 [s\src\hotspot\share\runtime\mutex.cpp @ 244]
05 (Inline Function) --------`-------- jvm!MonitorLocker::wait+0xe [s\src\hotspot\share\runtime\mutexLocker.hpp @ 263]
06 (Inline Function) --------`-------- jvm!VMThread::wait_until_executed+0x10e [s\src\hotspot\share\runtime\vmThread.cpp @ 365]
07 00000023`307fd260 00007ffd`f51ccc03 jvm!VMThread::execute+0x1a5 [s\src\hotspot\share\runtime\vmThread.cpp @ 535]
08 00000023`307fd2d0 00007ffd`f50d373d jvm!Handshake::execute+0x83 [s\src\hotspot\share\runtime\handshake.cpp @ 342]
09 00000023`307fd340 00007ffd`f559236f jvm!Deoptimization::deoptimize_all_marked+0xed [s\src\hotspot\share\runtime\deoptimization.cpp @ 916]
0a (Inline Function) --------`-------- jvm!SystemDictionary::add_to_hierarchy+0x2e [s\src\hotspot\share\classfile\systemDictionary.cpp @ 1594]
0b 00000023`307fd390 00007ffd`f5593f71 jvm!SystemDictionary::define_instance_class+0x23f [s\src\hotspot\share\classfile\systemDictionary.cpp @ 1455]
0c 00000023`307fd4e0 00007ffd`f5595da1 jvm!SystemDictionary::find_or_define_helper+0x1b1 [s\src\hotspot\share\classfile\systemDictionary.cpp @ 1539]
0d (Inline Function) --------`-------- jvm!SystemDictionary::find_or_define_instance_class+0x11 [s\src\hotspot\share\classfile\systemDictionary.cpp @ 1557]
0e 00000023`307fd550 00007ffd`f5595e34 jvm!SystemDictionary::resolve_class_from_stream+0x181 [s\src\hotspot\share\classfile\systemDictionary.cpp @ 934]
0f 00000023`307fd600 00007ffd`f528e7eb jvm!SystemDictionary::resolve_from_stream+0x24 [s\src\hotspot\share\classfile\systemDictionary.cpp @ 962]
10 00000023`307fd640 00007ffd`f5294a09 jvm!jvm_define_class_common+0x28b [s\src\hotspot\share\prims\jvm.cpp @ 878]
11 00000023`307fd770 00007ffe`1e9c17ec jvm!JVM_DefineClassWithSource+0xb9 [s\src\hotspot\share\prims\jvm.cpp @ 1042]
12 00000023`307fd7c0 0000013a`20d44efb java!Java_java_lang_ClassLoader_defineClass1+0x140 [s\src\java.base\share\native\libjava\ClassLoader.c @ 132]
13 00000023`307fdce0 00000007`65e603f8 0x0000013a`20d44efb
14 00000023`307fdce8 00000007`6a5c8300 0x00000007`65e603f8
15 00000023`307fdcf0 00000005`cbb4be68 0x00000007`6a5c8300
16 00000023`307fdcf8 0000013a`190811e4 0x00000005`cbb4be68
17 00000023`307fdd00 00000023`307fdd38 0x0000013a`190811e4
18 00000023`307fdd08 00000000`00000000 0x00000023`307fdd38

which is waiting for the deoptimization handshake to complete. The handshake operation starts at

Event: 163.970 Executing VM operation: HandshakeAllThreads

and gets timed out:

[173.972s][error][handshake ] Handshake timeout: Deoptimize(0x00000023307fd308), pending threads: 2
[173.972s][error][handshake ] JavaThread 0x0000013a6156d520 has not cleared handshake op: 0x00000023307fd308
[173.972s][error][handshake ] JavaThread 0x0000013a6156fba0 has not cleared handshake op: 0x00000023307fd308

These are the threads that have not cleared the handshake:

.513 Id: 2134.22d0 Suspend: 0 Teb: 00000023`4a2a6000 Unfrozen "pool-82-thread-1"
Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)> (00007ffe`2af02650)
Priority: 0 Priority class: 32 Affinity: f
 # Child-SP RetAddr Call Site
00 (Inline Function) --------`-------- jvm!UNICODE::utf8_length+0x2a [s\src\hotspot\share\utilities\utf8.cpp @ 433]
01 00000023`2e3fded0 00007ffd`f520aaf3 jvm!UNICODE::as_utf8<signed char>+0x3f [s\src\hotspot\share\utilities\utf8.cpp @ 440]
02 (Inline Function) --------`-------- jvm!java_lang_String::as_utf8_string+0xa9 [s\src\hotspot\share\classfile\javaClasses.cpp @ 691]
03 00000023`2e3fdf00 00007ffd`f5296435 jvm!java_lang_String::as_utf8_string+0xb3 [s\src\hotspot\share\classfile\javaClasses.cpp @ 680]
04 00000023`2e3fdf30 0000013a`20b75f8b jvm!JVM_FindLoadedClass+0x135 [s\src\hotspot\share\prims\jvm.cpp @ 1052]
05 00000023`2e3fdf90 00000000`000003d8 0x0000013a`20b75f8b
06 00000023`2e3fdf98 00000007`4f742dd0 0x3d8
07 00000023`2e3fdfa0 0000013a`61a54f40 0x00000007`4f742dd0
08 00000023`2e3fdfa8 0000013a`61a55318 0x0000013a`61a54f40
09 00000023`2e3fdfb0 00000005`ca58f3e0 0x0000013a`61a55318
0a 00000023`2e3fdfb8 00000007`4f742da8 0x00000005`ca58f3e0
0b 00000023`2e3fdfc0 00000007`5133bd70 0x00000007`4f742da8
0c 00000023`2e3fdfc8 f8008379`00000005 0x00000007`5133bd70
0d 00000023`2e3fdfd0 ea2677ae`b94b1faa 0xf8008379`00000005
0e 00000023`2e3fdfd8 00000007`50930d40 0xea2677ae`b94b1faa
0f 00000023`2e3fdfe0 00000007`4f7438d8 0x00000007`50930d40
10 00000023`2e3fdfe8 00000007`000004a6 0x00000007`4f7438d8
11 00000023`2e3fdff0 00000007`5133bd70 0x00000007`000004a6
12 00000023`2e3fdff8 0000013a`20e84294 0x00000007`5133bd70
13 00000023`2e3fe000 00000005`ca58f3e0 0x0000013a`20e84294
14 00000023`2e3fe008 00000007`4f743e68 0x00000005`ca58f3e0
15 00000023`2e3fe010 00000005`ca58f3e0 0x00000007`4f743e68
16 00000023`2e3fe018 00000005`c1d53f88 0x00000005`ca58f3e0
17 00000023`2e3fe020 00000005`c1bd46c0 0x00000005`c1d53f88
18 00000023`2e3fe028 00000007`4f743e68 0x00000005`c1bd46c0
19 00000023`2e3fe030 00000007`4f743930 0x00000007`4f743e68
1a 00000023`2e3fe038 00000007`4f742da8 0x00000007`4f743930
1b 00000023`2e3fe040 00000007`5133bd70 0x00000007`4f742da8
1c 00000023`2e3fe048 00000005`c81ae5b8 0x00000007`5133bd70
1d 00000023`2e3fe050 00000000`00000001 0x00000005`c81ae5b8
1e 00000023`2e3fe058 00000000`00000009 0x1
1f 00000023`2e3fe060 00000000`00000000 0x9

.514 Id: 2134.2174 Suspend: 0 Teb: 00000023`4a2a8000 Unfrozen "pool-83-thread-1"
Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)> (00007ffe`2af02650)
Priority: 0 Priority class: 32 Affinity: f
 # Child-SP RetAddr Call Site
00 (Inline Function) --------`-------- jvm!UNICODE::as_utf8+0x34 [s\src\hotspot\share\utilities\utf8.cpp @ 470]
01 00000023`2e7fe270 00007ffd`f520aaf3 jvm!UNICODE::as_utf8<signed char>+0x88 [s\src\hotspot\share\utilities\utf8.cpp @ 442]
02 (Inline Function) --------`-------- jvm!java_lang_String::as_utf8_string+0xa9 [s\src\hotspot\share\classfile\javaClasses.cpp @ 691]
03 00000023`2e7fe2a0 00007ffd`f5296435 jvm!java_lang_String::as_utf8_string+0xb3 [s\src\hotspot\share\classfile\javaClasses.cpp @ 680]
04 00000023`2e7fe2d0 0000013a`20b75f8b jvm!JVM_FindLoadedClass+0x135 [s\src\hotspot\share\prims\jvm.cpp @ 1052]
05 00000023`2e7fe330 00000000`000003d8 0x0000013a`20b75f8b
06 00000023`2e7fe338 00000007`4f236498 0x3d8
07 00000023`2e7fe340 0000013a`61a5c950 0x00000007`4f236498
08 00000023`2e7fe348 0000013a`61a5cd28 0x0000013a`61a5c950
09 00000023`2e7fe350 00000005`ca58f3e0 0x0000013a`61a5cd28
0a 00000023`2e7fe358 00000007`4f236470 0x00000005`ca58f3e0
0b 00000023`2e7fe360 00000007`50fe27c8 0x00000007`4f236470
0c 00000023`2e7fe368 f8008379`00000005 0x00000007`50fe27c8
0d 00000023`2e7fe370 ea1fc4f9`b94b1faa 0xf8008379`00000005
0e 00000023`2e7fe378 00000007`50930d40 0xea1fc4f9`b94b1faa
0f 00000023`2e7fe380 00000007`4f236fa0 0x00000007`50930d40
10 00000023`2e7fe388 00000008`000004a6 0x00000007`4f236fa0
11 00000023`2e7fe390 00000007`50fe27c8 0x00000008`000004a6
12 00000023`2e7fe398 0000013a`20e84294 0x00000007`50fe27c8
13 00000023`2e7fe3a0 00000005`ca58f3e0 0x0000013a`20e84294
14 00000023`2e7fe3a8 00000007`4f237530 0x00000005`ca58f3e0
15 00000023`2e7fe3b0 00000005`ca58f3e0 0x00000007`4f237530
16 00000023`2e7fe3b8 00000005`c1d53f88 0x00000005`ca58f3e0
17 00000023`2e7fe3c0 00000005`c1bd46c0 0x00000005`c1d53f88
18 00000023`2e7fe3c8 00000007`4f237530 0x00000005`c1bd46c0
19 00000023`2e7fe3d0 00000007`4f236ff8 0x00000007`4f237530
1a 00000023`2e7fe3d8 00000007`4f236470 0x00000007`4f236ff8
1b 00000023`2e7fe3e0 00000007`50fe27c8 0x00000007`4f236470
1c 00000023`2e7fe3e8 00000005`c81ae5b8 0x00000007`50fe27c8
1d 00000023`2e7fe3f0 00000000`00000001 0x00000005`c81ae5b8
1e 00000023`2e7fe3f8 00000000`00000009 0x1
1f 00000023`2e7fe400 00000000`00000000 0x9

It does not seem like they do any syncing.
17-02-2023

Noting that there's an extremely large number of threads in flight in these logs (694) compared to the number of physical cores. It doesn't seem unreasonable that we're seeing contention in the Symbol table lookups, exacerbated by the optimization in JDK-8231895 as Ioi says. There should be plenty of synchronization points and safepoint polls as threads transition from Java to native to VM and back. [~rehn] suggested an experiment to increase Symbol table size - which could reduce contention on individual buckets in the symbol table. For example "-XX:+UnlockExperimentalVMOptions -XX:SymbolTableSize=131072"
15-02-2023

It could also be that a faster JVM_FindLoadedClass(), which doesn't allocate a Java object, revealed a race condition in the application, resulting in a pathological case where JVM_FindLoadedClass() gets called much more frequently so that it shows up in the profile. JVM_FindLoadedClass() is a very fast operation. If you see it a lot, that's usually not because it's slow, but because you call it more often. I would suggest adding an atomic counter and reading its value from a debugger.
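The suggestion above is about a counter inside the VM, read from a debugger. As a rough application-level approximation (hypothetical code, not part of this report), the call frequency can also be sampled with a delegating class loader:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: count how often the findLoadedClass path is reached.
// ClassLoader.loadClass() consults findLoadedClass() first, so the counter tracks
// roughly how often the application hits JVM_FindLoadedClass.
final class CountingClassLoader extends ClassLoader {
    static final AtomicLong FIND_LOADED_CALLS = new AtomicLong();

    CountingClassLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        FIND_LOADED_CALLS.incrementAndGet(); // read from a debugger or log periodically
        return super.loadClass(name, resolve);
    }
}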
14-02-2023

Just got an update from the customers: the problem is reproducible on bare HW
14-02-2023

It is certainly possible that removing a potential "synchronization" point could change the control flow such that we get threads contending in a different way that is worse than in the original case. However, it is very suspicious that this only happens on one OS, and even more suspicious that it happens only with one particular hypervisor. I also would not be able to classify this as a JDK problem. The code is correct, but one particular execution environment shows a scheduling pathology when running specific application code that happens to be affected by this change in JDK code. [~redestad] do you have any thoughts/comments?
14-02-2023