JDK-8296463 : Memory leak in JVM_StartThread with the integration of Virtual threads
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 18,19
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: windows_10
  • CPU: x86_64
  • Submitted: 2022-11-02
  • Updated: 2022-12-13
  • Resolved: 2022-11-18
JDK 19: 19.0.2 (Fixed)
JDK 20: 20 b25 (Fixed)
Description
ADDITIONAL SYSTEM INFORMATION :
Microsoft Windows [Version 10.0.19044.2006]

openjdk version "19.0.1" 2022-10-18
OpenJDK Runtime Environment (build 19.0.1+10-21)
OpenJDK 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)

A DESCRIPTION OF THE PROBLEM :
See https://youtrack.jetbrains.com/issue/IDEA-305072 for a related discussion.

The attached testcase runs fine without a debugger, but if you run it in debug mode the JVM gets progressively slower until, roughly two hours later, it hangs indefinitely.

Sometimes the hang is so bad that jstack and jmap will refuse to connect (they are probably unable to spawn a new thread).

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Start a debugging session of the attached testcase.
2. After roughly 2 to 2.5 hours, output will cease and the JVM will hang indefinitely.
3. If you run the same testcase without a debugging session, it will not hang.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
There are about 500 live threads running at a time and, as far as VisualVM shows, no threads or memory appear to be leaking, so I am not expecting a hang or a crash.
ACTUAL -
JVM hangs. Often when this happens, jstack (and related tools) cannot attach to probe the state of the JVM.

---------- BEGIN SOURCE ----------
pom.xml:
----------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>org.example</groupId>
	<artifactId>ThreadsHang</artifactId>
	<version>1.0-SNAPSHOT</version>

	<properties>
		<maven.compiler.release>19</maven.compiler.release>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<dependencies>
		<dependency>
			<groupId>com.google.guava</groupId>
			<artifactId>guava</artifactId>
			<version>31.1-jre</version>
		</dependency>
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-api</artifactId>
			<version>2.0.3</version>
		</dependency>
		<dependency>
			<groupId>ch.qos.logback</groupId>
			<artifactId>logback-classic</artifactId>
			<version>1.4.4</version>
			<exclusions>
				<exclusion>
					<groupId>javax.activation</groupId>
					<artifactId>activation</artifactId>
				</exclusion>
				<exclusion>
					<groupId>com.sun.mail</groupId>
					<artifactId>javax.mail</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.10.1</version>
				<configuration>
					<showWarnings>true</showWarnings>
					<showDeprecation>true</showDeprecation>
					<compilerArgs>
						<arg>-Xlint:all,-preview,-try</arg>
						<arg>-Xdiags:verbose</arg>
						<arg>-Werror</arg>
						<arg>--enable-preview</arg>
					</compilerArgs>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>

PlatformThreadsTestcase:
---------------------------------------
import com.google.common.util.concurrent.ThreadFactoryBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.LongAdder;

public class PlatformThreadsTestcase
{
	private final ThreadFactory blockingTasksThreadFactory = new ThreadFactoryBuilder().
		setNameFormat("BlockingTask-%d").
		build();
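	// Creates a new platform thread for each submitted task (no pooling).
	// Executors.newThreadPerTaskExecutor is a preview API in JDK 19.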
	private final ExecutorService blockingTasksExecutor = Executors.newThreadPerTaskExecutor(
		blockingTasksThreadFactory);
	private final Logger log = LoggerFactory.getLogger(PlatformThreadsTestcase.class);

	public static void main(String[] args) throws InterruptedException
	{
		PlatformThreadsTestcase testcase = new PlatformThreadsTestcase();
		testcase.run();
	}

	public void run() throws InterruptedException
	{
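		// Bound the number of in-flight tasks (and hence live platform threads) to 500.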
		Semaphore semaphore = new Semaphore(500);
		LongAdder adder = new LongAdder();
		while (true)
		{
			semaphore.acquire();
			blockingTasksExecutor.submit(() ->
			{
				adder.increment();
				long sum = adder.sum();
				if ((sum % 1000) == 0)
					log.info("Progress: {}", sum);
				try
				{
					Thread.sleep(50);
				}
				catch (InterruptedException e)
				{
					throw new RuntimeException(e);
				}
				finally
				{
					semaphore.release();
				}
			});
		}
	}
}
---------- END SOURCE ----------

FREQUENCY : always



Comments
Fix Request First, my apologies for not following the process and integrating without approval. This is a trivial fix for a memory leak that was introduced by the Project Loom preview that went into JDK 19, hence it is important to fix in 19u ASAP. The fix is extremely low risk - it only releases storage at thread termination that was not being released. There are no specific testing issues related to this fix. Verification of the fix requires manually checking used memory over time as threads are created and destroyed. The fix does not apply cleanly due to code refactoring in mainline JDK, but the change itself is identical in the context where it applies. The fix has been reviewed - see PR link above.
29-11-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk19u/pull/97 Date: 2022-11-28 06:16:21 +0000
28-11-2022

[~coleenp] there is - JDK-8297168
18-11-2022

Changeset: 7b3984cb Author: David Holmes <dholmes@openjdk.org> Date: 2022-11-18 00:46:09 +0000 URL: https://git.openjdk.org/jdk/commit/7b3984cb5a08edb99a233c28331c00b25457d664
18-11-2022

Maybe there should be a new RFE to add these all at once.
17-11-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/11198 Date: 2022-11-16 22:10:18 +0000
16-11-2022

[~alanb] releasing the _extentLocalCache OopHandle fixed the problem. I no longer see any malloc/handle leak. I will file a separate RFE to address the ServiceThread interaction. Thanks.
16-11-2022

[~alanb] Good point! It may not be a termination bottleneck, just this missing OopHandle release. I will check that tomorrow (and still look at improving the interaction with the service thread). Though I note that this extra release also adds to the potential for that bottleneck.
16-11-2022

[~dholmes] Yes. Implementation would not be trivial. No need to further discuss it in this bug. I'll create an issue. (https://bugs.openjdk.org/browse/JDK-8297115)
16-11-2022

[~dholmes] Are you going to repeat your test with the code changed to release _extentLocalCache?
16-11-2022

[~alanb] I am considering a bulk add, and also moving as much work as possible to the callee to take some load off the service thread. The OopHandleList management makes things a little awkward. [~rehn] are you suggesting we try to define a handshake that is effectively a request for any JavaThread to do some work on behalf of the terminating thread? Spreading the load like that may work in one sense, but stealing arbitrary threads to do some internal VM book-keeping like this seems a questionable idea in general. Worth thinking about though - but as a significant future enhancement. I've also no idea how you say "here is a handshake for anyone" as normally handshakes need a target thread. ??
15-11-2022

I brought up the issue of releasing _scopedValueCache (_extentLocalCache in main line) with aph and he has fixed it in the PR for JEP 429: https://github.com/openjdk/jdk/commit/21eb011771ce0232d69706b88a102a93461fd6f5
15-11-2022

If we have something like 10 JavaThreads exiting at the same time, this can become very slow. Since the service thread normally has a really low run quota, when we have one JavaThread owning the Service_lock and nine JavaThreads + the ServiceThread waiting on that lock, it could bounce around to the service thread multiple times before all the JavaThreads get through. We added a bunch of work to the service thread which is essentially "a JavaThread should do this" (but I don't want to take the cost). All these 'extra' threads just make it harder to tune your benchmark, and you may end up with two choices: over- or under-provision.

If we instead spread the work out over the application JavaThreads, for example: growing the stringtable is done in segmented work (not prolonging TTS), and this segmented work can perfectly well be done by multiple threads. So instead of the service thread spinning up and burning 100% for hundreds of milliseconds, we could just emit handshakes with the segmented work until completed. In this case we need a JavaThread to release oopStorage, so if we had a new type of handshake, "asynchronous with target any", we could just emit that with the oop storage release inside. Handshakes are already installed with CAS, with a per-handshake closure list; we don't need a global one at all, so this would be done with a single CAS. (Not saying implementing handshake "any" is trivial, since we don't want that 'any' to be a thread in native, never returning to the VM.)

We used to have a test which created many thousands of threads until out of resources, then joined them; if you did something bad in the exit path of a JavaThread this test would time out. I saw some changes to that test a while back; I think the original version would time out. EDIT: I misremembered some code in ServiceThread.
15-11-2022

[~dholmes] I was surprised to see that JavaThread::~JavaThread doesn't also release _extentLocalCache. For ServiceThread, would a bulk add, so that Service_lock is only locked/notified once by the JavaThread destructor, help reduce the overhead?
15-11-2022

One potential issue is that vthreads tripled the amount of work being delegated to the service thread:

  // Ask ServiceThread to release the threadObj OopHandle
  ServiceThread::add_oop_handle_release(_threadObj);
+ ServiceThread::add_oop_handle_release(_vthread);
+ ServiceThread::add_oop_handle_release(_jvmti_vthread);

That in itself would not be an issue, but this is actually a very heavyweight operation and we will wake up the service thread for each oop!

void ServiceThread::add_oop_handle_release(OopHandle handle) {
  MutexLocker ml(Service_lock, Mutex::_no_safepoint_check_flag);
  OopHandleList* new_head = new OopHandleList(handle, _oop_handle_list);
  _oop_handle_list = new_head;
  Service_lock->notify_all();
}

So if we have a lot of threads we can be introducing a lot of context switching with the service thread and a lot of contention on the Service_lock.
15-11-2022

FTR it seems the original IDEA issue reported above is a memory leak in IDEA.
15-11-2022

I think the key thing here is:

(malloc=69124KB +66742KB #112041 +106649)

showing that we have an increasing number of Handles (which will consume memory of course). If I change the test to a simple:

Thread t = new Thread() ... t.start(); t.join();

then there is no handle leak. I think we may have a thread termination bottleneck in the VM.
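For reference, a minimal sketch of such a simplified test (the class name and sleep duration are illustrative assumptions, not the exact code used):

public class StartJoinLoop
{
	public static void main(String[] args) throws InterruptedException
	{
		// Repeatedly start and join a plain platform thread; no executor involved.
		while (true)
		{
			Thread t = new Thread(() ->
			{
				try
				{
					Thread.sleep(50);
				}
				catch (InterruptedException e)
				{
					throw new RuntimeException(e);
				}
			});
			t.start();
			t.join();
		}
	}
}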
15-11-2022

As I did not observe any leak with my investigation of JDK-8294228:

Thread (reserved=39038KB +7KB, committed=1242KB +47KB) (thread #0) (stack: reserved=38952KB, committed=1156KB +40KB) (malloc=53KB +7KB #196 +11) (arena=33KB #58)

(this is after several minutes of continually attaching and detaching the current thread), it suggests that the leak is somewhere in the JVM_StartThread path, rather than the actual JavaThread creation and destruction (which also happens with JNI AttachCurrentThread).
14-11-2022

Hi David, Alan, I think I could reproduce 2 kinds of leak here.
1. With the usage of new Thread(). Reproducible on 19 ea b22 and not on 19 ea b21 or any other previous release, so possibly caused by JDK-8284161.
2. Another slow leak with Executors.newThreadPerTaskExecutor/newCachedThreadPool, using the testcase I have with third-party dependencies. Reproducible with jdk19 and also jdk18. Didn't try with lower versions yet.
Let me know if the second slow leak needs to be tracked with a separate bug. Thanks.
14-11-2022

Hi [~dholmes], yes, exactly the same thing:

Thread (reserved=459715KB +98515KB, committed=93175KB +68779KB) (thread #0) (stack: reserved=390144KB +31744KB, committed=23604KB +2008KB) (malloc=69124KB +66742KB #112041 +106649) (arena=447KB +29 #764 +50)
14-11-2022

[~fmatte] are you using the NMT diff option to see the usage or are you also using jcmd's VM.native_memory command to look at the memory usage?
13-11-2022

I am also running the same program against jdk11.0.12 (Oracle) but didn't observe the leak.

Start - Thread (reserved=512259KB +157390KB, committed=28935KB +6978KB) (thread #504 +159) (stack: reserved=509952KB +156672KB, committed=26628KB +6260KB) (malloc=1717KB +534KB #2802 +725) (arena=590KB +184 #1007 +313)

After 90 mins - Thread (reserved=458751KB +103883KB, committed=28563KB +6607KB) (thread #446 +101) (stack: reserved=456704KB +103424KB, committed=26516KB +6148KB) (malloc=1524KB +341KB #2677 +600) (arena=523KB +117 #894 +200)

There is no leak in jdk11, so the regression may be somewhere between jdk11 and jdk18; is it worth checking that?
11-11-2022

Hi [~dholmes], I am using NMT to track the memory. I also used new Thread() instead of ExecutorService, and the leak still exists.

Start - Thread (reserved=128887KB -217701KB, committed=12427KB -10505KB) (thread #0) (stack: reserved=122880KB -220160KB, committed=6420KB -12964KB) (malloc=5744KB +2591KB #10152 +3634) (arena=263KB -132 #578 -98)

After 2 hours 45 mins - Thread (reserved=748207KB +401619KB, committed=558571KB +535639KB) (thread #0) (stack: reserved=200704KB -142336KB, committed=11068KB -8316KB) (malloc=547267KB +544114KB #874334 +867816) (arena=236KB -159 #404 -272)
11-11-2022

I got an update from Gil that removes the ExecutorService from the picture; it just uses `new Thread(Runnable).start()` so keeps creating platform threads. The interesting thing is that he's duplicating the issue/leak with JDK 18 and no debugger. This is a bit different to the original description, which suggested the issue was only observed when running with the debugger. He's running with -XX:NativeMemoryTracking=detail and then using jcmd's VM.native_memory command to look at the memory usage.
11-11-2022

[~fmatte] How are you observing the memory usage? I was chasing phantom leaks in JDK-8294228 that appeared to be caused by the monitoring tools themselves.
11-11-2022

Just to confirm, I am not using a debugger; there are no IDEs/VisualVM involved.
11-11-2022

So it uses a ThreadPerTaskExecutor and we see Thread native memory increasing over time. How many tasks are there? If the debugger is somehow stopping tasks or threads from completing then the number of threads will increase over time and eventually exhaust the ability to create native threads (though you should see some kind of error/warning).
10-11-2022

It may help to just replace the use of Executors in the test with new Thread(runnable).start(). Also, just to confirm, the issue only occurs when running with the debugger agent. The original description says it doesn't happen when there is no debugger (I think).
10-11-2022

Using jdk18 without preview, there is no memory leak issue.

Start - Thread (reserved=545205KB -1027KB, committed=36577KB -43KB) (thread #0) (stack: reserved=543744KB -1024KB, committed=35116KB -40KB) (malloc=841KB -2KB #3722 -6) (arena=620KB -1 #1060 -2)

After 25 minutes - Thread (reserved=546232KB, committed=36796KB +176KB) (thread #0) (stack: reserved=544768KB, committed=35332KB +176KB) (malloc=842KB #3728) (arena=622KB #1062)

I got another update from Gili:
==
I've got another interesting discovery for you. I ran the JDK 18 and JDK 19 testcases I sent you yesterday. JDK 18 and JDK 19 without --enable-preview did not hang. Now for the interesting news... When running JDK 19 with --enable-preview, testcases involving Executors.newCachedThreadPool() never hang, but Executors.newThreadPerTaskExecutor(Thread.ofPlatform().factory()) does. This executor hangs even if I remove all 3rd-party libraries and just use System.out.println() to print out the progress. This makes me believe that the bug has something to do with Executors.newThreadPerTaskExecutor(), as opposed to the type of Thread being used.
==
Thanks Gili for that update.
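For reference, a rough sketch contrasting the two executor setups described in that update (the class name, task body, and submission loop are illustrative assumptions, not Gili's actual code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class ExecutorComparison
{
	public static void main(String[] args) throws InterruptedException
	{
		// Reported to hang on JDK 19 with --enable-preview: a new platform
		// thread is created for every submitted task (preview API in JDK 19).
		ExecutorService perTask =
			Executors.newThreadPerTaskExecutor(Thread.ofPlatform().factory());

		// Reported not to hang: worker threads are pooled and reused.
		ExecutorService cached = Executors.newCachedThreadPool();

		runWith(perTask); // swap in 'cached' to compare behaviour
	}

	private static void runWith(ExecutorService executor) throws InterruptedException
	{
		Semaphore semaphore = new Semaphore(500); // bound the number of in-flight tasks
		long count = 0;
		while (true)
		{
			semaphore.acquire();
			long current = ++count;
			executor.submit(() ->
			{
				try
				{
					if ((current % 1000) == 0)
						System.out.println("Progress: " + current);
					Thread.sleep(50);
				}
				catch (InterruptedException e)
				{
					throw new RuntimeException(e);
				}
				finally
				{
					semaphore.release();
				}
			});
		}
	}
}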
10-11-2022

Looks like it is a native memory leak in the Thread area which is consuming the entire memory over a period of time. You can see the malloc area is increasing over a period of time (on 19.0.1):

Start - Thread (reserved=549976KB +80719KB, committed=40304KB +9415KB) (thread #0) (stack: reserved=540672KB +75776KB, committed=31000KB +4472KB) (malloc=8687KB +4858KB #16237 +8098) (arena=617KB +84 #1054 +144)

After 4 mins - Thread (reserved=533290KB +64032KB, committed=47722KB +16832KB) (thread #0) (stack: reserved=515072KB +50176KB, committed=29504KB +2976KB) (malloc=17628KB +13799KB #30410 +22271) (arena=590KB +57 #1008 +98)

After 10 mins - Thread (reserved=591090KB +121832KB, committed=81422KB +50532KB) (thread #0) (stack: reserved=540672KB +75776KB, committed=31004KB +4476KB) (malloc=49801KB +45972KB #81400 +73261) (arena=617KB +84 #1054 +144)

After 30 mins - Thread (reserved=646635KB +177378KB, committed=136979KB +106090KB) (thread #0) (stack: reserved=540672KB +75776KB, committed=31016KB +4488KB) (malloc=105346KB +101517KB #169738 +161599) (arena=617KB +84 #1054 +144)

After 1 hour - Thread (reserved=540089KB +70832KB, committed=223589KB +192700KB) (thread #0) (stack: reserved=334848KB -130048KB, committed=18348KB -8180KB) (malloc=204739KB +200910KB #326881 +318742) (arena=502KB -30 #858 -52)

After 2 hours - Thread (reserved=788189KB +318932KB, committed=424097KB +393208KB) (thread #0) (stack: reserved=386048KB -78848KB, committed=21956KB -4572KB) (malloc=401699KB +397870KB #639276 +631137) (arena=442KB -90 #756 -154)

I will execute the same test against jdk18 tomorrow.
10-11-2022

The testcase takes 3 to 4 hours to hang; these are my observations:
19.0.1 with --enable-preview -> JVM hangs
19.0.1 without --enable-preview -> JVM still hangs
One more observation: we only get the hang when we launch VisualVM and start observing the monitor window; without that, the application keeps running. I am now running the above two cases against 19 GA.
09-11-2022

[~fmatte] Can you change the test to use Executors.newCachedThreadPool? It should give you an equivalent that works with older JDK releases, as the test doesn't use virtual threads. It would be interesting to try on JDK 18, JDK 19, and JDK 20 EA, with and without --enable-preview. That would give information on whether this is a regression, and whether it is an issue only when running with --enable-preview.
08-11-2022

Please ignore my earlier comment. It was my lack of knowledge of Maven: the reproducer is a Maven project and I didn't change it to jdk18; I just added jdk18 in the IDE and the project didn't pick that up. After understanding how to configure a Maven project properly, it looks like the project cannot be run on jdk18 (still not sure why).
08-11-2022

Likely related to changes made for virtual threads, if it only happens in 19.
08-11-2022

Thanks [~ddamodaran] for the reproducer testcase, I am able to reproduce the issue in 19.0.1 and 19 GA release, but the issue is not reproducible on 18 GA. Debugging more...
07-11-2022