Below is full problem description from our customer after JDK-8231895 - Avoid String allocations in JVM_FindLoadedClass
------------------------------------------------------------------------------------
Since a few month we are shipping the latest Java 17 Zulu JDK with our product. Those updates are not slowly reaching production systems and those systems have an enhanced rate of failure. Here are the symptoms and what we already tried:
a) The symptom is that during startup of our large (OSGi based) Application Server a long(er) time the CPU us fully used and most of the time also with high kernel cpu time in windows.
b) It happens only on Windows, but different Versions (Server 2016, Server 2019)
c) It happens only on ESX Hosts with those Windows x64 Guests (different systems). We could not yet reproduce it on a KVM based Openstack Host.
d) It happens in smaller (4 vCore) and larger (16 vCore) Guests.
e) (We might have seen it with Win11 on bare Hardware as well but that's not 100% sure)
f) When this happens the machine becomes hard to debug remotely so the information we could gather so far is limited:
a. In Sysinternals ProcessExplorer we see the cpu usage is by the java process
b. When looking at thread details we see multiple threads with 0.1CPU usage. Those threads show Windows Stack traces like JVM_findClass or JVM_registerSignal -> KeWaitForSingleObject -> KeMutex. - non scientifically speaking it looks like synchronized blocks are not pausing the CPU.
c. Some time the system recovers after an hour and then it might continue to use a single core full time in kernel mode. (here for example 26648)
d.
e.
f. We did a lot of start/stops on test systems, so it is not clear if there is a correlation ir not but some of those starts do not hang as described above but crash. If they crash they do not write a hs_err file but I see crash mesasages in Windows Event log.
Faulting application name: java.exe, version: 17.0.4.1, time stamp: 0x55db5af9
Faulting module name: ntdll.dll, version: 10.0.17763.3232, time stamp: 0xd6e0b8e1
Exception code: 0xc0000005
Fault offset: 0x00000000000dae5d
Faulting process id: 0x510
Faulting application start time: 0x01d8ce85298bd720
Faulting application path: H:\BIS\runtime\jvm64\bin\java.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: 69e99d94-02e8-48af-a977-e274e288a064
Faulting package full name:
Faulting package-relative application ID:
g) Wenn we keep everything the same (OS, App, Load, Hypervisor) but replace the JDK17 with a JDK11 the problem is gone (or at least the few tries we did did not show it).
h) We tried different Java17 versions, especially 17.34+19 and 17.36+14-SA and 17.36+18-SA
i) We did not see it on Linux or other hypervisors however our test-base is small. These Windows Installations are the majority of users of this new Java version.
j) The problem affects only system startup. So if a instance is in this state and we terminate it and start it multiple time until it does not get into this state (or by manually starting components after global startup) it normally behaves fine.
k) We did on the aforementioned KVM system automated startup tests and 50 startups with Java 11 and Java 17 did not show such a problem on those Windows machines. We plan to repeat this test on ESX/VMWARE hosts later next week to see if it can be reproduced outside of productive systems.
------------------------------------------------------------------------------------
Investigating the problem we found that the regression was introduced exactly between 13 and 14. Regarding we were limited in using of debug tools I added few exra logging to capture system performance metrics like CPU consumption by processes/threads, memory usage (page faults, RAM/swap usage), I/O rates, etc., so we figured out that typicaly 10-15 of user's threads consume all available CPU time, and rest of the system (including other processes, VM/JFR threads) is in lack of timeslices. The consumers may be different from timemark to timemark, but they all are working threads.
We captured several coredumps using HandshakeTimeout option to crash JVM and found that a lot of working threads are in JVM_FindLoadedClass(), and that is how we suspected JDK-8231895 (the only change to JVM_FindLoadedClass between 13 and 14). After reverting of JDK-8231895 the problem has gone. The only explanation we have is that removed allocation possibly ran a synchronization primitive and so forced the scheduler arbitrage, but now code like this
https://github.com/apache/felix-dev/blob/master/framework/src/main/java/org/apache/felix/framework/BundleWiringImpl.java#L1488
faces no restrictions (a problem in Windows scheduler?) and completely consumes all CPU time.
We also tried to limit number of threads running JVM_FindLoadedClass() simultaneously with Semaphore::wait_with_safepoint_check() with the semaphore capacity equals to <number of cores> - 1 and that even boosted the system performance. So we will propose the limiting of number of threads concurrently calling Felix API as a workaround, but it still looks as JDK issue.