Bug ID: JDK-6490770 Supposedly fixed memory leak leads to 100% VM CPU usage

JDK 7
7Resolved

EVALUATION Doug Lea is providing a fix for this. Hopefully we will get it right this time. The changes are being tracked in 6460501: Synchronizer timed acquire still leaks memory which had fallen off our radar. However, hotspot's reaction to the memory leak in Bug3 also appears unsatisfactory. I've just filed two bugs against hotspot/garbage_collector 6493287: Unproductive CPU-spinning GCs on heap exhaustion; please throw OOME instead 6493335: Mismatch between -Xm[sx] and verbose:gc output Our first attempt to fix this only worked if one thread was polling. With multiple polling threads, a more sophisticated fix is necessary. Doug Lea is providing the fix. Closing this bug as a dup of 6460501
14-11-2006
EVALUATION Repeating the above experiment with $ jver 6 jr $flag -Xmx3m -XX:$collector -verbose:gc Bug3.java 10 10 for values of flag in (-server -client) and values of collector in (UseSerialGC UseParallelGC) all show unproductive GCs, indicative of a leak in either the libraries or the GC mechanism.
11-11-2006
EVALUATION Here's a reliable variant of reproducing this: Use a Linux system (old RH AS 2.1 single CPU) (I use "check6") Test program: ----------------------------------------------- import java.util.concurrent.; public class Bug3 { private static int intArg(String[] args, int i, int defaultValue) { return args.length > i ? Integer.parseInt(args[i]) : defaultValue; } public static void main(String[] args) { final BlockingQueue<Object> q = new LinkedBlockingQueue<Object>(); final int threads = intArg(args, 0, 10); final int millis = intArg(args, 1, 100); for (int i = 0; i < threads; i++) new Thread() { public void run() { while (true) { try { q.poll(millis, TimeUnit.MILLISECONDS); } catch (Throwable t) { throw new Error(t); }}}}.start(); } } ----------------------------------------------- Run like this: (After about half an hour the heap fills up, and VM goes into a frenzy trying to perform full unproductive GCs) $ jver 6 jr -Xmx3m -XX:+UseParallelGC -verbose:gc Bug3.java ==> javac -source 1.6 -Xlint:all Bug3.java ==> java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 [GC 768K->620K(4992K), 0.0151570 secs] [GC 1388K->972K(4992K), 0.0200720 secs] [GC 1740K->1420K(4992K), 0.0201680 secs] [GC 2188K->1992K(4992K), 0.0225560 secs] [GC 2760K->2588K(4992K), 0.0233750 secs] [GC 3356K->3064K(4608K), 0.0217680 secs] [GC 3448K->3380K(4800K), 0.0180340 secs] [GC 3764K->3704K(4800K), 0.0215990 secs] [GC 4088K->4024K(4800K), 0.0326170 secs] [Full GC 4024K->4023K(5824K), 0.1142040 secs] [GC 4407K->4347K(5824K), 0.0168690 secs] [GC 4731K->4671K(5824K), 0.0335550 secs] [GC 5055K->4991K(5824K), 0.0201900 secs] [Full GC 4991K->4990K(5824K), 0.1204090 secs] [Full GC 5374K->5296K(5824K), 0.1371490 secs] [Full GC 5504K->5388K(5824K), 0.1725220 secs] [Full GC 5503K->5493K(5824K), 0.1504360 secs] [Full GC 5503K->5493K(5824K), 0.1401190 secs] [Full GC 5503K->5493K(5824K), 0.1408060 secs] [Full GC 5503K->5475K(5824K), 0.1412240 secs] [Full GC 5503K->5501K(5824K), 0.1400280 secs] [Full GC 5503K->5501K(5824K), 0.1416890 secs] [Full GC 5503K->5501K(5824K), 0.1411700 secs] [Full GC 5503K->5483K(5824K), 0.1414830 secs] [Full GC 5503K->5503K(5824K), 0.1520990 secs] [Full GC 5503K->5496K(5824K), 0.1403970 secs] [Full GC 5503K->5503K(5824K), 0.1402680 secs] ...... GC Frenzy! Uptime during the initial heap eating phase is around 0.6, while during the GC frenzy stabilizes at about 2.5. $ while true; do jver ~/promoted jmap -histo `jver ~/promoted jps \| awk '/Bug3/ {print $1}'` \| g 'Ab.Node'; uptime; sleep 10; done 1: 19452 622464 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12:08pm up 36 days, 23:25, 3 users, load average: 0.17, 0.90, 0.80 1: 20591 658912 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12:08pm up 36 days, 23:25, 3 users, load average: 0.37, 0.90, 0.81 .... 1: 172512 5520384 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12:42pm up 36 days, 23:58, 3 users, load average: 2.30, 1.73, 1.05 1: 172512 5520384 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12:42pm up 36 days, 23:59, 3 users, load average: 2.38, 1.77, 1.07 1: 172512 5520384 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12:42pm up 36 days, 23:59, 3 users, load average: 2.63, 1.85, 1.10 This would explain why when running with 100 threads instead of 10, the system becomes completely unresponsive. I speculate that the load average goes to 25 and most Linux systems simply can't handle that kind of load. Looking at the "ps" output, during the initial phase none of the Java threads appear to be using significant CPU. Only one thread shows 1 second of CPU, all others are at 0 seconds. Older Linux systems show each thread as a separate line in the "ps" output, which is usually annoying, but very useful here. Once the GC frenzy starts, we see output like this: mb29450 16437 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16438 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16439 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16440 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16441 7.7 5.9 126028 15028 pts/4 R 12:05 2:54 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16442 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16443 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16444 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16445 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16446 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16447 0.0 5.9 126028 15028 pts/4 S 12:05 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16448 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16449 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16450 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16451 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16452 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16453 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16454 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16455 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16456 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16457 0.3 5.9 126028 15028 pts/4 S 12:05 0:08 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 mb29450 16550 0.0 5.9 126028 15028 pts/4 S 12:08 0:00 java -Xmx3m -XX:+UseParallelGC -verbose:gc -esa -ea Bug3 This is very interesting. One thread, presumably the GC thread, is very busy performing unproductive GCs. This is easy to understand. (But why aren't we getting OOME?) More interesting is the fact that exactly 10 of these threads started consuming significant CPU (but much less than the one very busy GC thread) when the GC frenzy started. It is easy to explain the increased load average as a result of these threads.
11-11-2006
EVALUATION On check6, a single CPU Linux system with 2.4.9 kernel, When running with 100 threads and a timeout of 100 milliseconds, I see eventually that the "ps" command hangs, the "uptime" command hangs, but I am fortunately still able to kill the process. $ while true; do uptime; sleep 10; done ..... 3:27pm up 36 days, 2:44, 5 users, load average: 2.08, 0.66, 0.29 3:27pm up 36 days, 2:44, 5 users, load average: 2.22, 0.74, 0.32 3:27pm up 36 days, 2:44, 5 users, load average: 2.34, 0.81, 0.35 3:27pm up 36 days, 2:44, 5 users, load average: 2.44, 0.88, 0.37 3:28pm up 36 days, 2:45, 5 users, load average: 2.53, 0.95, 0.40 3:28pm up 36 days, 2:45, 5 users, load average: 2.60, 1.02, 0.43 3:28pm up 36 days, 2:45, 5 users, load average: 2.66, 1.08, 0.46 3:28pm up 36 days, 2:45, 5 users, load average: 2.71, 1.15, 0.48 3:28pm up 36 days, 2:45, 5 users, load average: 2.76, 1.21, 0.51 3:28pm up 36 days, 2:45, 5 users, load average: 2.79, 1.26, 0.54 3:29pm up 36 days, 2:46, 5 users, load average: 2.82, 1.32, 0.56 3:29pm up 36 days, 2:46, 5 users, load average: 2.92, 1.39, 0.59 3:29pm up 36 days, 2:46, 5 users, load average: 2.94, 1.45, 0.62 3:29pm up 36 days, 2:46, 5 users, load average: 3.02, 1.54, 0.66 3:29pm up 36 days, 2:46, 5 users, load average: 3.01, 1.58, 0.69 3:29pm up 36 days, 2:46, 5 users, load average: 3.01, 1.63, 0.71 # At this point "uptime" hangs for many minutes. # When I kill the java program, the ouptut continues: 4:03pm up 36 days, 3:20, 5 users, load average: 25.88, 24.02, 19.31 4:03pm up 36 days, 3:20, 5 users, load average: 22.93, 23.46, 19.18 4:03pm up 36 days, 3:20, 5 users, load average: 19.41, 22.68, 18.97 4:03pm up 36 days, 3:20, 5 users, load average: 16.42, 21.94, 18.77 4:04pm up 36 days, 3:21, 5 users, load average: 13.90, 21.21, 18.57 4:04pm up 36 days, 3:21, 5 users, load average: 11.76, 20.51, 18.37
11-11-2006
EVALUATION Doug Lea writes: Martin: This looks like a failure of GC to reach a safepoint. Would you like me to contact someone in GC/runtime who can better diagnose? .... One further oddity I've seen is that, even though I can't get this to fail with OOME (at least not within 10min or so of running) , under some values for poll times I do see occasional cases where verbosegc output stops for a second or two but perfbar shows that something (or several somethings) are running. And when they resume, the reported GC times are still small. Suggesting again missed safepoints or missed wakeups or maybe overruns during Full-GC, where threads wake up even though GC thinks they are all sleeping, or something like that. Holger writes: >Holger: Could you run this with -verbosegc? This will show if heap >> is growing steadily versus some other problem. I ran it for 90 >> minutes on linux with 1 ms timeout, and got stable small footprint. >> I do see a load of around 1 CPU's worth when running with 1ms timeout >> and 500 threads, which is about what you might expect. 50 threads, 10ms poll timeout, Windows with b94, -Xmx32m -verbosegc: [GC 896K->175K(5056K), 0.0035468 secs] [GC 1071K->640K(5056K), 0.0042653 secs] [GC 1536K->1248K(5056K), 0.0050593 secs] [GC 2144K->2055K(5056K), 0.0061893 secs] <spin> Without -Xmx, only -verbosegc: [GC 896K->176K(5056K), 0.0040974 secs] [GC 1072K->639K(5056K), 0.0075856 secs] [GC 1535K->1249K(5056K), 0.0060013 secs] <spinning> [GC 2145K->1523K(5056K), 0.0045911 secs] .. same on Linux: holger>java -Xmx32m -verbosegc -cp . BlockingQueueLeak [GC 896K->405K(5056K), 0.0058620 secs] [GC 1301K->861K(5056K), 0.0046020 secs] [GC 1757K->1015K(5056K), 0.0021780 secs] [GC 1911K->1015K(5056K), 0.0008340 secs] <spin> So both app and GC continue to run even after spinning starts; I did not notice that earlier. I did manage to get jconsole attached after the spinning started by lowering the process priority; see run-with-continuous-spin-after-short-time statistics. There is the usual sawtooth-staircase pattern and full gc seems to work, even permspace is collected, classes unloaded etc. However the CPU is still maxed and does not go down again. But wait, there's more! There does seem to be an element of randomness here. While the first run on Windows did start to spin quickly, a second run without -verbosegc but initially attached jconsole shows a completely different profile and behaviour. Here the heap grows continually, it does start to spin after a longer time but then goes down again after a few seconds to its usual idle polling. The interval between "temporary" spins goes up over time, until it eventually spins always. See run-without-verbosegc/cpu.csv for the spin-down periods. I can also readily reproduce all this on my 600MHz P3 Laptop running Windows2000 & Mustang b97, so it's neither machine, OS or CPU. I'll rule out possession by evil spirits for the moment. It just seems that out of nowhere something grabs the CPU and has trouble letting go. No idea whether that is the reason for the leak or related at all. .... Doug Lea wrote: >> Holger -- could you check if you still get this when running >> with either >> java -XX:-UseParallelGC ... >> or >> java -Xincgc 1) -XX:-UseParallelGC -Xmx24m -verbosegc: - heap grows - starts to spin eventually - once it does, verbosegc output stops (unlike before!) 2) -Xincgc -Xmx24m -verbosegc: - no unexpected spinning! - heap grows more quickly than before, at ~100k/sec until -Xmx is reached (~33MB according to TaskManager) - it runs for quite a long time with full heap and expectedly high CPU load, trying really hard to catch the train doing full collections. Eventually the VM complains properly about OOM.
10-11-2006
EVALUATION I ran the submitter's program on a Linux-i586 single CPU Ubuntu system (new thread library) with "-Xmx10m" and observed that load average was close to ZERO for 10 minutes until the heap was full, then load average varied between 2 and 20, eventually settling down to 5 or so. An hour later the machine became completely unresponsive and had to be rebooted.
10-11-2006
EVALUATION Something strange is happening here. If we go by "jmap -histo:live" then indeed the memory leak has not been fixed. Objects of type AbstractQueuedSynchronizer$Node accumulate and are not found by System.gc(). Here's a modified test program: --------------------------------------------- import java.util.concurrent.; public class Bug { public static void main(String[] args) { final BlockingQueue<Object> q = new LinkedBlockingQueue<Object>(); for (int i = 0; i < 10; i++) new Thread() { public void run() { while (true) { try { q.poll(1, TimeUnit.SECONDS); } catch (Throwable t) { throw new Error(t); }}}}.start(); new Thread() { public void run() { while (true) { try { System.gc(); Thread.sleep(101000); } catch (Throwable t) { throw new Error(t); }}}}.start(); } } --------------------------------------------- While "java Bug" is running, we get: $ while true; do jver 6 jmap -histo:live 16139 \| grep 'Abs.*Node'; sleep 10; done 16: 184 5888 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 16: 229 7328 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 16: 306 9792 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 16: 354 11328 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 16: 398 12736 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 14: 479 15328 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 14: 560 17920 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12: 645 20640 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node 12: 702 22464 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node So these Node objects appear to accumulate forever, __EXCEPT__ that when max heap size is hit, threads do not get OOME! Instead the VM memory stays pegged at the max heap size; collections do not reduce the number of live objects, at least as recorded by jconsole.
09-11-2006

Duplicate :	JDK-6460501 - Synchronizer timed acquire still leaks memory
Relates :	JDK-6493335 - Mismatch between -Xm[sx] and verbose:gc output
Relates :	JDK-6493287 - Unproductive CPU-spinning GCs on heap exhaustion; please throw OOME instead