JDK-8217170 : gc/arguments/TestUseCompressedOopsErgo.java timed out
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 12,13
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux,windows
  • CPU: x86_64
  • Submitted: 2019-01-15
  • Updated: 2022-06-22
  • Resolved: 2019-07-08
Fixed in:
  JDK 11: 11.0.17-oracle (Fixed)
  JDK 14: 14 b05 (Fixed)
Description
Error. Program `/scratch/opt/mach5/mesos/work_dir/jib-master/install/jdk12-jdk.1221/linux-x64-debug.jdk/jdk-12/fastdebug/bin/java' timed out (timeout set to 480000ms, elapsed time including timeout handling was 527414ms).
Comments
Fix request [11u] I backport this for parity with 11.0.17-oracle. No risk, only a test change. I had to do one trivial resolve; one patched test was omitted as it is not in 11. SAP nightly testing passed.
21-06-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1159 Date: 2022-06-20 09:56:07 +0000
20-06-2022

Git URL: https://github.com/openjdk/jdk/commit/b85ebc19e0c593595ee1ece211755ee2be19860f
20-06-2022

The tests have been updated to disable ZapUnusedHeapArea and VerifyBeforeExit. There's still a lingering problem that sometimes the child processes take an unusually long time (3-5 seconds) to complete (or 20+ seconds for CMS; don't know what that's about), but disabling those two options strongly mitigates the timeout noise in mach5 CI testing. JDK-8227414 tracks that lingering problem.
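For illustration, a minimal sketch of how the two quieting flags might be prepended to the child VM's argument list; the helper and class names here are hypothetical, and the real change is in the linked changeset:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  // Hypothetical helper, not the actual fix: prepend the two flags to whatever
  // arguments the test would otherwise pass to the child java process.
  class QuietChildArgs {
      static String[] withQuietingFlags(String... args) {
          List<String> all = new ArrayList<>();
          all.add("-XX:-ZapUnusedHeapArea"); // don't fill unused heap with junk in debug builds
          all.add("-XX:-VerifyBeforeExit");  // skip heap verification during VM shutdown
          all.addAll(Arrays.asList(args));
          return all.toArray(new String[0]);
      }
  }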
08-07-2019

URL: http://hg.openjdk.java.net/jdk/jdk/rev/47fdb5b0fa41 User: kbarrett Date: 2019-07-08 21:26:16 +0000
08-07-2019

On the last confidential report by [~jwilhelm], jtreg was executed with concurrency=4. Mdash reported that the host (?????ico.??.oracle.com) has 60.55GB RAM (is this really the case? [~iignatyev] could you verify?). The gc/arguments tests would tend to execute concurrently (because jtreg runs tests alphabetically). It's possible that the concurrent VMs have excessively large, committed and dirtied heap regions. That would cause excessive swapping and a pathological slowdown. Note that all these tests go through TestMaxHeapSizeTools.java, which calls ProcessTools.createJavaProcessBuilder() with addTestVmAndJavaOptions=false. So even though the test executor has added -J-XX:MaxRAMPercentage=6 to the jtreg command line (as a heuristic to ensure the concurrent tests will not use more than the physical RAM size), this is not passed to the child JVM processes (sketched below). So by default (I think), they are executed with a 16GB max heap, which is much larger than the 0.06 * 60 = 3.6GB max heap size set by the test executor. This increases the risk of excessive swapping, especially if the reported physical RAM size (60.55GB) is wrong (i.e., incorrect accounting of virtual hosts). To avoid future problems with excessive swapping, I think we should make sure that the gc/arguments tests will only allocate the heap regions virtually and will not do anything to dirty the heap regions (such as ZapUnusedHeapArea). I would suggest adding a self-check in all the test cases:
- at test exit, check the amount of all dirtied heap regions (is this possible through some API in the C++ "os" class?)
- fail the test if the amount is larger than a fixed size (128MB??)
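A minimal sketch of the situation described above, assuming the createJavaProcessBuilder(boolean, String...) overload of the test library's ProcessTools (the exact signature may differ):

  import jdk.test.lib.process.ProcessTools;

  class ChildLaunchSketch {
      static ProcessBuilder launchWithoutTestVmOptions() throws Exception {
          // With addTestVmAndJavaOptions = false, flags that jtreg passed to the parent
          // test JVM, such as -XX:MaxRAMPercentage=6, are NOT inherited by the child:
          return ProcessTools.createJavaProcessBuilder(
              false,                        // addTestVmAndJavaOptions
              "-XX:+PrintFlagsFinal", "-version");
      }

      static ProcessBuilder launchWithExplicitCap() throws Exception {
          // Hypothetical mitigation: the test forwards a heap cap to the child itself.
          return ProcessTools.createJavaProcessBuilder(
              false,
              "-XX:MaxRAMPercentage=6",
              "-XX:+PrintFlagsFinal", "-version");
      }
  }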
29-06-2019

I found out about ZapUnusedHeapArea with JDK-8210040 "TestOptionsWithRanges is very slow". From the bug description: -XX:-ZapUnusedHeapArea should be specified for fastdebug VMs. Otherwise some GCs such as CMS will spend lots of time filling the heap with junk (which might be useful for some GC tests, but it's really unnecessary for this test).
29-06-2019

Nice find [~iklam]! I ran tier1 on one of the machines that has been reliably failing for that, but this time with -XX:-ZapUnusedHeapArea added to TestUseCompressedOopsErgo tests, and it passed. This suggests some thread might be continuing to zap memory even though we want to shut down. This also makes JDK-8225373 much more likely to be a duplicate, even though it occurred on Linux rather than Windows. However, looking at the log for TestUseCompressedOopsErgo, it seems this may have just been a coincidental pass. There are still very long waitFor times, typically in the 10-20 second range for CMS (with some much higher). So perhaps -XX:+ZapUnusedHeapArea is contributing (hard to tell yet), but there's something else going on. And for G1 the times are instead in the 3-5 second range (which is still surprisingly long, but much better, which explains why we don't see any history of failures with G1).
28-06-2019

For this kind of timeout, are the failures (mostly) limited to gc/arguments/* tests? Or, specifically, to test cases that use TestMaxHeapSizeTools? There are many test cases that use ProcessTools to launch subprocesses. For example, the appcds tests do a lot of that -- some appcds tests would launch dozens of subprocesses. Yet we don't see similar failures with the appcds tests. One thing I noticed is that the elapsed time seems to be highly related to the initial heap size. See the attached files:

gc.arguments.ErgoArgsPrinter.log -- log file from one of the timeouts of TestMinInitialErgonomics.java
gc.arguments.ErgoArgsPrinter.odc -- spreadsheet of the data and graph
gc.arguments.ErgoArgsPrinter.png -- graph captured by the above spreadsheet
parse.tcl -- getting heap size and timing data from the log

$ cat gc.arguments.ErgoArgsPrinter.log | tclsh parse.tcl
pid    mx          ms          InitialHeapSize  MinHeapSize  MaxHeapSize  Elapsed
9200   0           0           0                0            0            23273
8864   2036334592  0           0                0            0            24471
6192   2036334592  0           0                508559360    0            26361
8520   2036334592  0           0                2034237440   0            38691
5872   2036334592  0           0                0            0            25374
4000   2036334592  0           508559360        0            0            13670
11296  2036334592  0           2034237440       0            0            42584
5484   2036334592  0           0                0            0            21367
8644   2036334592  0           508559360        508559360    0            11711
2720   2036334592  0           2034237440       508559360    0            41056
2376   2036334592  0           2034237440       2034237440   0            75812
4892   2036334592  0           0                0            0            34339
2520   2036334592  0           0                508559360    0            23133
9864   2036334592  0           0                2034237440   0            36865
11696  2036334592  0           0                0            0            22589
8284   2036334592  0           508559360        0            0            12934
9336   2036334592  0           2034237440       0            0            54957
7380   2036334592  0           0                0            0            27079
5188   2036334592  0           508559360        508559360    0            15604
5304   2036334592  0           2034237440       508559360    0            50573
3232   2036334592  0           2034237440       2034237440   0            57945
10836  2036334592  508559360   0                0            0            21406
8720   2036334592  508559360   0                508559360    0            21021
6120   2036334592  508559360   0                0            0            19818
12136  2036334592  508559360   508559360        0            0            22254
6920   2036334592  508559360   2034237440       0            0            80864
5952   2036334592  508559360   0                0            0            39192
5380   2036334592  2034237440  0                0            0            77836

(JIRA messes up the formatting; the same data with proper tabulation is at https://bugs.openjdk.java.net/secure/attachment/83172/gc.arguments.ErgoArgsPrinter.txt )

The close correlation of heap size and elapsed time is apparent in this graph: https://bugs.openjdk.java.net/secure/attachment/83174/gc.arguments.ErgoArgsPrinter.png "Size" in the graph is the max of (ms, MinHeapSize and InitialHeapSize), as that's likely (?) to be the amount of data zapped by GC when -XX:+ZapUnusedHeapArea is enabled (which it is by default).
27-06-2019

Based on the above data, I believe the bug is in the test case or GC, as the elapsed time scales with the (initial) heap size. I would suggest running with -XX:-ZapUnusedHeapArea so that we don't have a pathological slowdown when the test machine is running with high swapping. Reassigning back to the GC subcategory.
27-06-2019

Here's an experiment to try. (Note that I don't (yet) know how to do some of these steps on Windows; if it were Linux...) Change the call to waitFor to use the form with a timeout, using something like a 10sec timeout. If it times out then extract stack dumps and the like from the child process. (On Linux I'd hit it with a signal that would cause it to crash and dump core.) Examine that information for further clues. Perhaps the child process is somewhat stuck somewhere? Or perhaps the child process is actually already dead, but waitFor is taking a long time to recognize and report that? The waitFor in question is in test/lib/jdk/test/lib/process/OutputBuffer.java:getExitValue. Running tier1 on one of the usual machines discussed in earlier comments pretty reliably provokes these problems.
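A rough sketch of that experiment, assuming the unbounded waitFor() in OutputBuffer.getExitValue is replaced by the bounded overload; the class and method names below are illustrative, not the library's actual code:

  import java.util.concurrent.TimeUnit;

  class BoundedWaitSketch {
      // Wait at most 10 seconds; if the child hasn't terminated (or its termination
      // hasn't been observed), this is the point where stack dumps / core dumps
      // of the child would be collected before failing the test.
      static int waitWithDiagnostics(Process child) throws InterruptedException {
          boolean finished = child.waitFor(10, TimeUnit.SECONDS);
          if (!finished) {
              long pid = child.pid();
              System.err.println("child " + pid + " still not terminated after 10s; collect diagnostics");
              child.destroyForcibly().waitFor(); // make sure it is gone before reading the exit value
          }
          return child.exitValue();
      }
  }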
26-06-2019

Passing this off to the runtime team - there are no apparent GC issues here, just a GC test playing canary.
26-06-2019

Saw what looks like another example of this in a personal mach5 job, in gc/arguments/TestMinInitialErgonomics.java, with the test run on a different machine, i.e. not one of my favorite machines for reproducing these timeouts. I'm finding the "slow and overloaded machines" explanation less and less compelling.
19-06-2019

Another experiment - I added a timestamp to the printout of the command line before creating the subprocesses, to see if process creation might also be slow. Ran tier1 on one of my favorite "problem" machines. And nope. The time from just before subprocess creation until the start of waiting for subprocess termination is generally on the order of 10's to small 100's of milliseconds. It's the wait for the subprocess to terminate that always takes a long time, e.g. 10s of seconds. So while the timeouts are strongly correlated with load on these "slow" machines, the long times for subprocess termination (or perhaps recognition of such) suggest there may still be some product problem that is the root cause. I think the next step is to try to figure out what's going on in the subprocess during that period.
19-06-2019

As an experiment, I changed the test to use driver mode for the top-level test process. That didn't change the behavior to any noticeable extent.
18-06-2019

See also JDK-8225624. These machines seem to be both ridiculously slow and demonstrating very high variance in performance. Yet even then they don't often trigger timeouts. So there has to be something else at play here. I remain completely skeptical about the performance of individual VMs being satisfactory when other VMs on the same host can be inducing their own arbitrary load.
12-06-2019

There is one more commonality between this test and the tests in JDK-8225624 and JDK-8225187: they all spawn a new JVM under test and wait till it exits.
11-06-2019

I'm no longer sure that something about those specific machines is the root cause here, knowing that they are known to be "slow". I'm finding other, passing, tests on other machines that exhibit long wait times for subprocesses to complete, just not as long as on the machines where the timeouts are occurring. Instead of 20-40+ seconds per wait in the failures, there are successful tests where the subprocess waits are typically around 10 seconds. That still seems like a really long time for subprocesses that do very little (many are just -version with -XX:+PrintFlagsFinal).
11-06-2019

Running TestUseCompressedOopsErgo repeatedly on one of the problematic machines, there were 7 failed tests out of 100 tasks (each running 4 tests). However, looking at some of the passed tests, it looks like the waitFor times are higher than I've seen on other machines, just not cumulatively high enough to exceed the timeout. Running it 100 times on randomly scheduled machines had no failures.
11-06-2019

"test-only" is quite common; it doesn't indicate any problem w/ doing builds, it just means that a host is slow and doing builds on it will have an impact on CI throughput.
10-06-2019

I didn't spot anything unusual about the config of those two machines either, other than they are both marked as "test-only" so not used for builds. I don't know if that's common or indicative of some prior problem with doing builds on these machines.
10-06-2019

Some mach5 statistics about TestUseCompressedOopsErgo.java#id{0-3}:
- run 94 times for CI or sbr testing on two machines where errors have been seen.
- 81 are since the initial reporting of JDK-8217170 on 7-May-2019.
- 22 of those are failures.
- No other failures on those two machines.
- First run of this test on these machines is 30-April-2019.
- Only one failure (7-May-2019) of 28 between 30-April-2019 and 12-May-2019.
- Remaining 21 failures are in the last 66 runs (since 12-May-2019).
- Only one failure of this test on a different machine (JDK-8225373, 5-June-2019).
09-06-2019

Running tier1 on one of the two Windows VMs where we've seen problems had failures in 6/6 jobs. Running tier1 on a Windows VM on a different hypervisor had 0/5 jobs fail. So the problem seems to be associated with the two VMs on that one hypervisor.
09-06-2019

All of the failures since May 8 (when the current spate started, there are some similar but much older failures of this test) occurred on one of two Windows machines, both of which are VMs on the same hypervisor (and the only two Windows VMs currently on that hypervisor). In searching through the failures, I also noticed a few other similar looking test failures on the same machines, such as for JDK-8212630. Perhaps there is some kind of configuration error on those two machines? Doing an entire tier1 job on one of them seems to reliably provoke these failures. Maybe they are configured to be able to handle more tasks than they are actually capable of?
09-06-2019

Running tier1 on a specific machine reliably provoked failures from these and some other gc/arguments tests (cf. JDK-8225373), with or without VerifyBeforeExit. Failures were like I described earlier, with the main test process's waitFor on a subprocess taking an unusually long time, and that happening for *all* of the subprocesses created by the main test process. And it continues to be all but the G1 test that (may) fail; I've yet to see the TestUseCompressedOopsErgo G1 subtest fail. There *could* be a machine configuration issue; the machine I picked for running tier1 was one on which this failure has occurred in the CI (since I happened to know its name). But I'm pretty sure I've seen these failures on multiple machines. I'm going to go through the failures more carefully looking at that question.
08-06-2019

Here's the CMS run on Windows-x64-debug:

jib > TEST: gc/arguments/TestUseCompressedOopsErgo.java#id3
jib > build: 0.0 seconds
jib > build: 0.0 seconds
jib > driver: 0.047 seconds
jib > build: 0.906 seconds
jib > compile: 0.906 seconds
jib > main: 138.51 seconds
jib > build: 0.016 seconds
jib > main: 122.696 seconds

Overall the test is much, much slower, and VerifyBeforeExit has a much larger cost, ~15 seconds. I suppose that could be enough to push us over the timeout in some circumstances. But it's obviously more specific than just using CMS - it must be machine related as well.
07-06-2019

I took this test and duplicated each of the @run lines to add -XX:-VerifyBeforeExit to undo the effects of JDK-8222534, and to allow a comparison of both variants. I ran it on Linux and Windows debug, 5 times each. There were no timeouts in that basic configuration. First observation is that there is a significant variance in runs on different machines. The fastest Windows run was 4m51, while the slowest 9m08. Linux was less extreme: 2m14 and 5m28. Looking at the runs, the results were somewhat surprising. Here's a Windows example that shows what I expected to see:

jib > TEST: gc/arguments/TestUseCompressedOopsErgo.java#id2
jib > build: 6.413 seconds
jib > compile: 6.402 seconds
jib > build: 0.792 seconds
jib > compile: 0.791 seconds
jib > driver: 3.151 seconds
jib > build: 5.286 seconds
jib > compile: 5.282 seconds
jib > main: 14.491 seconds
jib > build: 0.001 seconds
jib > main: 9.527 seconds
jib > TEST RESULT: Passed. Execution successful
jib > --------------------------------------------------
jib > TEST: gc/arguments/TestUseCompressedOopsErgo.java#id0
jib > build: 11.826 seconds
jib > compile: 11.809 seconds
jib > build: 1.198 seconds
jib > compile: 1.197 seconds
jib > driver: 5.681 seconds
jib > build: 3.752 seconds
jib > compile: 3.75 seconds
jib > main: 15.258 seconds
jib > build: 0.0 seconds
jib > main: 13.717 seconds
jib > TEST RESULT: Passed. Execution successful
jib > --------------------------------------------------
jib > TEST: gc/arguments/TestUseCompressedOopsErgo.java#id3
jib > build: 9.871 seconds
jib > compile: 9.86 seconds
jib > build: 0.771 seconds
jib > compile: 0.771 seconds
jib > driver: 3.612 seconds
jib > build: 4.903 seconds
jib > compile: 4.899 seconds
jib > main: 22.511 seconds
jib > build: 0.001 seconds
jib > main: 18.801 seconds
jib > TEST RESULT: Passed. Execution successful
jib > --------------------------------------------------
jib > TEST: gc/arguments/TestUseCompressedOopsErgo.java#id1
jib > build: 8.041 seconds
jib > compile: 8.029 seconds
jib > build: 0.773 seconds
jib > compile: 0.772 seconds
jib > driver: 4.384 seconds
jib > build: 7.389 seconds
jib > compile: 7.377 seconds
jib > main: 15.929 seconds
jib > build: 0.001 seconds
jib > main: 14.114 seconds
jib > build: 0.001 seconds
jib > main: 13.241 seconds
jib > build: 0.0 seconds
jib > main: 13.222 seconds

For each pair of "main" executions the second is quicker than the first as it disables VerifyBeforeExit. In some cases up to 5 seconds quicker, in others only a fraction of a second. But here is an extract from another Windows run:

jib > TEST: gc/arguments/TestUseCompressedOopsErgo.java#id0
jib > build: 28.21 seconds
jib > compile: 28.197 seconds
jib > build: 0.612 seconds
jib > compile: 0.61 seconds
jib > driver: 4.263 seconds
jib > build: 4.128 seconds
jib > compile: 4.124 seconds
jib > main: 27.663 seconds
jib > build: 0.001 seconds
jib > main: 33.144 seconds

Here the second run is SLOWER by 5 seconds! That makes no logical sense. On Linux I see it happen even more frequently that the run with VerifyBeforeExit disabled is slower! But despite the oddities there's no sign that VerifyBeforeExit would slow things down enough to cause timeouts as we have seen. That said at least some of the timeouts occur when CMS is being used. So I will submit a tier2 job to see what happens there. AFAICS jtreg will run everything serially so there should not be any conflict between different @run's or different @test groupings.
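As a purely illustrative example (not the test's actual header), a duplicated @run pair would look something like this in the jtreg tag block:

  /*
   * Illustrative only; the GC selection and test arguments of the real test differ.
   * @run main/othervm gc.arguments.TestUseCompressedOopsErgo -XX:+UseSerialGC
   * @run main/othervm -XX:-VerifyBeforeExit gc.arguments.TestUseCompressedOopsErgo -XX:+UseSerialGC
   */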
07-06-2019

Possibly of interest, JDK-8222534 was pushed on 7-May-2019 (hg:54729), the same day and shortly before this category of failures first appeared (hg:54738, mach5 CI jdk-13-968).
06-06-2019

The test contains several subtests (different @test blocks), one for each tested GC. Each of these subtests launches 12 pairs of subprocesses with varying arguments and examines the output for information reported by the subprocess, roughly as sketched below. The first of each pair reports the UseCompressedOops heap size limit for the configuration. The second is invoked with -Xmx -XX:+PrintFlagsFinal -version, plus per-process configuration, and the subtest verifies the UseCompressedOops line in the flag output is as expected. Looking at test history in mach5, failures are always on Windows, first occurring circa 7-May-2019 (jdk-13-968). (There are occasional older unrelated failures of this test.) Failures are seen with Serial, Parallel, and (most commonly) CMS; no failures seen with G1 in normal CI testing (I didn't check personal jobs). Looking at the logs for this test from a recent mach5 job containing a failure (jdk-13-1216), a subtest typically waits for a subprocess (of either kind) to complete for less than 2 seconds. Even on a slow machine the wait times are at worst circa 5 seconds. But for the failing subtest, the *shortest* wait time is 21 seconds, wait times are typically in the 21-40+ second range, and there is one 89 second wait. Similarly in the several failures (Serial, Parallel, and CMS) in the jdk-13-1254 job. Because the failures are occurring with several different GCs, I've removed the #id0 from this bug's summary.
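A minimal sketch of the second half of such a pair, assuming the test library's ProcessTools/OutputAnalyzer classes; the helper name, flag values, and the exact expected pattern are assumptions, not the test's actual code:

  import jdk.test.lib.process.OutputAnalyzer;
  import jdk.test.lib.process.ProcessTools;

  class CompressedOopsCheckSketch {
      // Launch a child VM with the given max heap and verify what
      // -XX:+PrintFlagsFinal reports for UseCompressedOops.
      static void checkUseCompressedOops(long maxHeapBytes, boolean expected) throws Exception {
          ProcessBuilder pb = ProcessTools.createJavaProcessBuilder(false,
              "-Xmx" + maxHeapBytes,
              "-XX:+PrintFlagsFinal", "-version");
          OutputAnalyzer output = new OutputAnalyzer(pb.start());
          output.shouldHaveExitValue(0);
          output.shouldMatch("UseCompressedOops.*=.*" + expected);
      }
  }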
06-06-2019

All of the failures in the last few weeks have been on Windows not Linux. Which in itself seems to be a recent trend. Kim's additional logging seems to show the tests actually completing fine and they don't seem particularly slow - though there is one subtest that seems a lot slower than the others. But in general (unless you can show these subtests should be taking a few seconds each) this just looks like too much test for too little time.
04-06-2019

This test seems to fail a lot all of a sudden.
03-06-2019

We managed to get a list of test processes, but by the time the failure handler started to analyze them, they were all dead. The log suggests that the child exited w/ 131, so the main process threw the exception; I guess it all happened concurrently with the failure handler's work.

"... stderr: []
exitValue = 131

java.lang.RuntimeException: Expected to get exit value of [0]
at jdk.test.lib.process.OutputAnalyzer.shouldHaveExitValue(OutputAnalyzer.java:419)
at gc.arguments.TestUseCompressedOopsErgoTools.expect(TestUseCompressedOopsErgoTools.java:174)
at gc.arguments.TestUseCompressedOopsErgoTools.expectValid(TestUseCompressedOopsErgoTools.java:179)
at gc.arguments.TestUseCompressedOopsErgoTools.checkUseCompressedOops(TestUseCompressedOopsErgoTools.java:156)
at gc.arguments.TestUseCompressedOopsErgoTools.checkCompressedOopsErgo(TestUseCompressedOopsErgoTools.java:144)
at gc.arguments.TestUseCompressedOopsErgo.main(TestUseCompressedOopsErgo.java:85)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
at java.base/java.lang.Thread.run(Thread.java:835)

JavaTest Message: Test threw exception: java.lang.RuntimeException: Expected to get exit value of [0]
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Expected to get exit value of [0]
----------rerun:(31/6957)*----------"
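For context, the RuntimeException in that trace is simply the test library's exit-code assertion firing; a usage sketch (class and variable names are illustrative):

  import jdk.test.lib.process.OutputAnalyzer;

  class ExitValueCheckSketch {
      // The child exited with 131 instead of 0, so shouldHaveExitValue(0) throws
      // RuntimeException("Expected to get exit value of [0]") in the main test process.
      static void expectCleanExit(ProcessBuilder child) throws Exception {
          OutputAnalyzer output = new OutputAnalyzer(child.start());
          output.shouldHaveExitValue(0);
      }
  }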
11-02-2019

I have re-opened this bug because the test might still time out.
11-02-2019

Because of JDK-8218197, there is no information here with which to understand what might have gone wrong with the child process(es), so closing this. Once JDK-8218197 is fixed, any new failure can be reported as a new bug with (hopefully) useful information.
04-02-2019

given our latest finding (see JDK-8218197 and comments in JDK-8176214), I'm moving this one back to hotspot/gc.
02-02-2019

A possible explanation for the observed data is that the sub-process exited or was killed while the timeout handler was at work: you still see the "main" test process waiting for it (and it is still alive at that point), but while the timeout handler was gathering other data from the "main" process, the child process terminated, so the timeout handler wasn't able to connect to it.
23-01-2019

What we can see from the available logs is:
- the stack trace above, with the call to getExitValue() in the process that started the subprocess using ProcessBuilder and OutputAnalyzer, which was then killed by the timeout handler
- the second process had already exited; the tools that are called on it all report that there is no such process (any more).
It is unclear how this process terminated (normally, shut down externally, or whatever).
23-01-2019

[~tschatzl] Thomas, if it's believed that `j.t.l.process.OutputAnalyzer::shouldHaveExitValue` continues to wait for an already exited process, it would mean the actual problem is in `java.lang.Process::waitFor`, as OutputAnalyzer does no magic there. Moving to core-libs.
23-01-2019

Kindly asking the maintainers of the test library to look into this issue; apart from the test itself it does not seem to have any connection to GC, but it is at least related to the use of the test library (I hope I picked the correct component/sub-component for the CR). There is also JDK-8211366, which is very similar to this one.
23-01-2019

The G1 process (453) is waiting for a parallel GC test run process (1161) to exit - that process has already exited though. Stack trace:

at java.lang.Object.wait(java.base@12-internal/Native Method)
- waiting on <0x000000071c719f50> (a java.lang.ProcessImpl)
at java.lang.Object.wait(java.base@12-internal/Object.java:326)
at java.lang.ProcessImpl.waitFor(java.base@12-internal/ProcessImpl.java:495)
- locked <0x000000071c719f50> (a java.lang.ProcessImpl)
at jdk.test.lib.process.OutputBuffer$LazyOutputBuffer.getExitValue(OutputBuffer.java:113)
at jdk.test.lib.process.OutputAnalyzer.getExitValue(OutputAnalyzer.java:511)
at jdk.test.lib.process.OutputAnalyzer.shouldHaveExitValue(OutputAnalyzer.java:417)

Output e.g. from the tool getting the Parallel GC process information:
----------------------------------------
[/bin/gdb, --pid=1166, -batch, -ex, thread apply all backtrace] timeout=20000
----------------------------------------
warning: unable to open /proc file '/proc/1166/status'
warning: unable to open /proc file '/proc/1166/status'
ptrace: No such process.

In addition to that, two more java threads are waiting on stdout and stderr of that process (presumably). There is nothing that indicates that a GC thread is abnormally waiting on something.
16-01-2019