JDK-8221967 : InternTest.java timed out
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 13
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris
  • CPU: sparc
  • Submitted: 2019-04-04
  • Updated: 2021-04-01
  • Resolved: 2019-05-20
JDK 13: Fixed in b22
Related Reports
  Relates:
Sub Tasks
  JDK-8222226
Description
runtime/7158800/InternTest.java timed out in CI (jdk13-jdk-764, tier3, solaris-sparcv9).

Error. Program `/scratch/opt/mach5/mesos/work_dir/jib-master/install/jdk13-jdk.764/solaris-sparcv9-debug.jdk/jdk-13/fastdebug/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 984799ms)
Comments
Thanks [~coleenp] for digging into this.
20-05-2019

[~iignatyev] Thank you for the changes above. It helped a lot! This test reads 30,000 Strings from a file that all hash to the same value in java.lang.String.hashCode. At some bucket-length threshold, the StringTable sets a global flag to request rehashing, and the rehashing is done during safepoint cleanup actions. The test hung for two reasons:

1. The fix for JDK-8216049 added a lookup(char*) call before the do_intern call. The do_intern call also looks up the string, then does an insert that looks up the string again and inserts if it isn't found. So for each of these unbalanced buckets, there were three traversals of the unreasonably long bucket. Rearranging the do_intern() lookup call to after the insert makes the test no longer appear to hang.

2. But there was a second problem. [~kbarrett] and I were wondering why the timeout handler was triggering the test to run to completion. It was because the jcmd triggers a safepoint, and the table is rehashed in the safepoint cleanup action. Once the table is rehashed, the buckets are shorter and the test runs to completion while the other jtreg timeout handlers are running. The second piece of the fix is to report that the tables need rehashing in SafepointSynchronize::is_cleanup_needed(), so that the GuaranteedSafepointInterval safepoint finds this reason to run and rehashes the tables.
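As an illustration of the collision premise (the test itself reads precomputed 0-hash strings from a file; the class and method names below are hypothetical): because hash(x + y) = hash(x) * 31^y.length() + hash(y), concatenating n two-character blocks drawn from "Aa" and "BB" (which both hash to 2112) yields 2^n distinct strings with identical hash codes, all landing in the same StringTable bucket. A minimal sketch:

import java.util.ArrayList;
import java.util.List;

public class CollidingStrings {
    // "Aa" and "BB" share a String.hashCode (2112). Concatenating n
    // equal-hash, equal-length blocks keeps every combination colliding.
    static List<String> collisions(int blocks) {
        List<String> result = new ArrayList<>(List.of(""));
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>(result.size() * 2);
            for (String s : result) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> strings = collisions(15);   // 2^15 = 32768 strings
        long distinct = strings.stream().map(String::hashCode).distinct().count();
        System.out.println(strings.size() + " strings, " + distinct + " distinct hashCode value(s)");
    }
}

With a single overlong bucket, every lookup during intern() degrades to a linear scan of that bucket, which is why each extra traversal per intern mattered so much here.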
17-05-2019

Please note: the build under test will have to be built from source with the aforementioned changes to failure_handler.
16-05-2019

The order of tools to run is defined in the test/failure_handler/src/share/conf/*.properties files. First, jtreg runs the commands defined in the `onTimeout` property of `common.properties`, and then the commands from the `onTimeout` property of the OS-specific file. Something similar to the following patch should help you:

diff -r 9cfbe22bcdf8 test/failure_handler/src/share/conf/common.properties
--- a/test/failure_handler/src/share/conf/common.properties  Thu May 16 12:16:47 2019 -0700
+++ b/test/failure_handler/src/share/conf/common.properties  Thu May 16 16:16:06 2019 -0700
@@ -28,15 +28,7 @@
 # process info to gather
 ################################################################################
 onTimeout=\
-  jinfo \
-  jcmd.compiler.codecache jcmd.compiler.codelist \
-  jcmd.compiler.queue \
-  jcmd.vm.classloader_stats jcmd.vm.stringtable \
-  jcmd.vm.symboltable jcmd.vm.uptime jcmd.vm.dynlibs \
-  jcmd.vm.system_properties \
-  jcmd.gc.heap_info jcmd.gc.class_stats jcmd.gc.class_histogram jcmd.gc.finalizer_info \
-  jstack
-
+  jinfo
 jinfo.app=jinfo
 jcmd.app=jcmd
diff -r 9cfbe22bcdf8 test/failure_handler/src/share/conf/solaris.properties
--- a/test/failure_handler/src/share/conf/solaris.properties  Thu May 16 12:16:47 2019 -0700
+++ b/test/failure_handler/src/share/conf/solaris.properties  Thu May 16 16:16:06 2019 -0700
@@ -30,8 +30,6 @@
 # prepareProcess info to gather
 ################################################################################
 onTimeout=\
-  native.pmap \
-  native.pfiles \
   native.stack native.core
 ################################################################################
 # solaris specific
16-05-2019

I was going to add -XX:+SafepointTimeout and the crash-on-safepoint-timeout flags to the test. What prints this out?

Timeout refired 1440 times
16-05-2019

I added these flags and it's not crashing. Honestly, I think the function is swapped out for 20 minutes, but that seems odd. In the processes.html file it appears that jtreg does this for the timeout:

[2019-05-16 19:39:54] [/scratch/opt/mach5/mesos/work_dir/jib-master/install/2019-05-16-1556553.coleen.phillimore.13intern-test/solaris-sparcv9-debug.jdk/jdk-13/fastdebug/bin/jcmd, 24621, Compiler.codecache] timeout=20000

which works, then this:

[2019-05-16 19:39:56] [/scratch/opt/mach5/mesos/work_dir/jib-master/install/2019-05-16-1556553.coleen.phillimore.13intern-test/solaris-sparcv9-debug.jdk/jdk-13/fastdebug/bin/jcmd, 24621, Compiler.codelist] timeout=20000

which has lots of output, then this:

[2019-05-16 19:39:59] [/scratch/opt/mach5/mesos/work_dir/jib-master/install/2019-05-16-1556553.coleen.phillimore.13intern-test/solaris-sparcv9-debug.jdk/jdk-13/fastdebug/bin/jcmd, 24621, Compiler.queue] timeout=20000

then:

[2019-05-16 19:40:02] [/scratch/opt/mach5/mesos/work_dir/jib-master/install/2019-05-16-1556553.coleen.phillimore.13intern-test/solaris-sparcv9-debug.jdk/jdk-13/fastdebug/bin/jcmd, 24621, VM.classloader_stats] timeout=20000
----------------------------------------
24621: java.io.IOException: Premature EOF

then:

[2019-05-16 19:40:15] [/bin/pmap, -F, 24621] timeout=20000
----------------------------------------
pmap: cannot examine 24621: no such process or core file

And then the process is gone and the test says it passes. Can I get a special jtreg that does pstack first?

All the failures have this jump in time between safepoints, and between initiated safepoints:

[0.607s][info ][safepoint,cleanup] safepoint cleanup tasks, 0.0001769 secs
[0.607s][info ][safepoint        ] Safepoint "RevokeBias", Time since last: 628050 ns, Reaching safepoint: 233910 ns, At safepoint: 48105 ns, Total: 282015 ns
Thu May 16 21:18:31 UTC 2019: interned 10000 0-hash strings - last 10000 took 165.322s (0.0165322s per String)
Thu May 16 21:26:50 UTC 2019: interned 20000 0-hash strings - last 10000 took 499.064s (0.0499064s per String)
Timeout refired 1440 times
[1452.032s][debug][safepoint        ] Safepoint synchronization initiated using semaphore wait barrier. (11 threads)
[1452.043s][info ][safepoint,cleanup] deflating global idle monitors, 0.0000016 secs
16-05-2019

If -XX:+SafepointTimeout and/or any of the crash flags can help with getting to the bottom of it, sure, use them by all means. I was just explaining how we try to get as much info as possible from timed-out tests in the general case.
16-05-2019

The 'Timeout refired' messages are printed by jtreg.
16-05-2019

> I wish it would crash the test at this point so we can see what it's doing.

As we don't suspend a test process when a timeout occurs, the test and the timeout handler run concurrently, and there is a race between them. One of the things we run on a test timeout (actually the last one) is 'kill -ABRT', so *if* the test process is still running by the time we get there, there should be a dump generated. processes.html should also contain several stack trace dumps.
16-05-2019

This test has been timing out almost daily since April 3. About time to problemlist?
16-05-2019

This test reads 30,000 lines from another file. It might be hung up on I/O on Solaris, which is causing the timeouts. I verified that the rehashing actually takes place on Linux. When these tests hang, a processes.html file is written, but the one from this failure has nothing in it. Some of these failures look like the test completed. I did notice that the test doesn't close the BufferedStream. I think this "Timeout refired" message comes from the test harness. I wish it would crash the test at this point so we can see what it's doing.
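On the unclosed-stream observation: a minimal sketch, assuming the test reads its strings line by line (the class name and input file name are placeholders, not the actual InternTest.java source), of doing the read with try-with-resources so the reader is always closed:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ReadStrings {
    // try-with-resources closes the reader even if readLine() throws,
    // so no file descriptor is left dangling if the test exits abnormally.
    static List<String> readAll(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file)) {
            for (String line; (line = in.readLine()) != null; ) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "badstrings.txt" is a hypothetical input file name.
        System.out.println(readAll(Path.of("badstrings.txt")).size() + " lines read");
    }
}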
15-05-2019

Dan: try setting a much larger timeout and then see if we eventually hit the "failed to reach a safepoint" guarantee failure. It seems to me that if we are going long stretches without a safepoint, then it must be the VMThread that is hung/stuck.
11-04-2019

Sigh... another timeout with the longer default timeout value in place. Snippets from the log file:

start=Wed Apr 10 06\:22\:22 UTC 2019
<snip>
#section:main
----------messages:(6/361)----------
command: main -Xmixed -XX:+PrintStringTableStatistics -Xlog:safepoint+cleanup=info InternTest bad
reason: User specified action: run main/othervm/timeout=180 -Xmixed -XX:+PrintStringTableStatistics -Xlog:safepoint+cleanup=info InternTest bad
Mode: othervm [/othervm specified]
Timeout information:
---
Timeout information end.
elapsed time (seconds): 1467.852
----------configuration:(0/0)----------
----------System.out:(93/7084)----------
<snip>
[2.285s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0007007 secs
Wed Apr 10 06:27:12 UTC 2019: interned 10000 0-hash strings - last 10000 took 284.818s (0.0284818s per String)
Wed Apr 10 06:36:08 UTC 2019: interned 20000 0-hash strings - last 10000 took 535.862s (0.0535862s per String)
Timeout refired 1440 times
[1451.916s][info][safepoint,cleanup] deflating global idle monitors, 0.0000012 secs
<snip>
[1452.329s][info][safepoint,cleanup] safepoint cleanup tasks, 0.4128765 secs
Wed Apr 10 06:46:36 UTC 2019: interned 30000 0-hash strings - last 10000 took 628.379s (0.0628379s per String)
[1452.355s][info][safepoint,cleanup] deflating global idle monitors, 0.0000017 secs
[1452.373s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0002346 secs
SymbolTable statistics:
Number of buckets : 32768 = 262144 bytes, each 8
Number of entries : 2580 = 41280 bytes, each 16
Number of literals : 2580 = 94576 bytes, avg 36.657
Total footprsize_t : = 398000 bytes
Average bucket size : 0.079
Variance of bucket size : 0.079
Std. dev. of bucket size: 0.281
Maximum bucket size : 3
StringTable statistics:
Number of buckets : 16777216 = 134217728 bytes, each 8
Number of entries : 31394 = 502304 bytes, each 16
Number of literals : 31394 = 1768864 bytes, avg 56.344
Total footprsize_t : = 136488896 bytes
Average bucket size : 0.002
Variance of bucket size : 0.002
Std. dev. of bucket size: 0.043
Maximum bucket size : 2
----------System.err:(1/15)----------
STATUS:Passed.
----------rerun:(35/3874)*----------

So the test ran for 1467.852 seconds, which is almost 28 seconds over the revised total timeout value (3 minute timeout value * 8 timeout factor = 24 minutes == 1440 seconds). The three "interned ... 0-hash strings" lines report a total time of 1449.059 seconds.

This gap in "[safepoint,cleanup]" log messages is interesting:

[2.285s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0007007 secs
[1451.916s][info][safepoint,cleanup] deflating global idle monitors, 0.0000012 secs

I didn't notice it before in the previous logs, but how could we go for 24 minutes without a single "[safepoint,cleanup]" log message? I just reread all the comments and this one from David sticks out:

> That suggests to me that the test is actually hung and it is only the actions of
> the timeout handler that actually allow the test to make progress again - quickly
> completing and "passing", but also being marked as timed-out.

So while my increase of the default timeout value has reduced the number of sightings in the CI, there is clearly something else going on here.
10-04-2019

[~coleenp] - When you fix this issue, please don't forget to undo the workaround added by JDK-8222226.
09-04-2019

Here are snippets from a failure today:

start=Tue Apr 09 16\:31\:09 UTC 2019
[1.273s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0002731 secs
Tue Apr 09 16:35:35 UTC 2019: interned 10000 0-hash strings - last 10000 took 263.134s (0.0263134s per String)
Tue Apr 09 16:43:39 UTC 2019: interned 20000 0-hash strings - last 10000 took 484.249s (0.0484249s per String)
Timeout refired 960 times
[972.040s][info][safepoint,cleanup] deflating global idle monitors, 0.0000011 secs
[972.040s][info][safepoint,cleanup] updating inline caches, 0.0000011 secs
[972.040s][info][safepoint,cleanup] compilation policy safepoint handler, 0.0000006 secs
[972.438s][info][safepoint,cleanup] rehashing string table, 0.3988909 secs
[972.439s][info][safepoint,cleanup] purging class loader data graph, 0.0000002 secs
[972.439s][info][safepoint,cleanup] resizing system dictionaries, 0.0000014 secs
[972.439s][info][safepoint,cleanup] deflating per-thread idle monitors, 0.0000075 secs, monitors=0
[972.439s][info][safepoint,cleanup] safepoint cleanup tasks, 0.3992152 secs
Tue Apr 09 16:47:23 UTC 2019: interned 30000 0-hash strings - last 10000 took 223.462s (0.0223462s per String)
[972.528s][info][safepoint,cleanup] deflating global idle monitors, 0.0000011 secs

Here are the interesting time points:
- the test started at 16:31:09
- the [info][safepoint,cleanup] log message came 1.273 seconds after VM startup
- it took 263 seconds (4 min, 23 secs) to intern the first 10000 0-hash strings
- it took 484 seconds (8 min, 4 secs) to intern the second 10000 0-hash strings
- we hit the 16 minute timeout (default 2 minute timeout X default timeout factor of 8)
- it took 223 seconds (3 min, 43 secs) to intern the third 10000 0-hash strings

Here's the end of the log:

SymbolTable statistics:
Number of buckets : 32768 = 262144 bytes, each 8
Number of entries : 2393 = 38288 bytes, each 16
Number of literals : 2393 = 90688 bytes, avg 37.897
Total footprsize_t : = 391120 bytes
Average bucket size : 0.073
Variance of bucket size : 0.073
Std. dev. of bucket size: 0.271
Maximum bucket size : 3
StringTable statistics:
Number of buckets : 16777216 = 134217728 bytes, each 8
Number of entries : 31240 = 499840 bytes, each 16
Number of literals : 31240 = 1760608 bytes, avg 56.357
Total footprsize_t : = 136478176 bytes
Average bucket size : 0.002
Variance of bucket size : 0.002
Std. dev. of bucket size: 0.043
Maximum bucket size : 2
----------System.err:(1/15)----------
STATUS:Passed.
----------rerun:(35/3868)*----------

Again, the test passed, but I don't think it was hung. The StringTable looks very big:

Number of buckets : 16777216 = 134217728 bytes, each 8

That's a huge number of buckets and 128MB in space...

Update: Here are the stats for a recent passing run on MacOSX:

SymbolTable statistics:
Number of buckets : 32768 = 262144 bytes, each 8
Number of entries : 2370 = 37920 bytes, each 16
Number of literals : 2370 = 89984 bytes, avg 37.968
Total footprsize_t : = 390048 bytes
Average bucket size : 0.072
Variance of bucket size : 0.073
Std. dev. of bucket size: 0.270
Maximum bucket size : 3
StringTable statistics:
Number of buckets : 16777216 = 134217728 bytes, each 8
Number of entries : 31240 = 499840 bytes, each 16
Number of literals : 31240 = 1760608 bytes, avg 56.357
Total footprsize_t : = 136478176 bytes
Average bucket size : 0.002
Variance of bucket size : 0.002
Std. dev. of bucket size: 0.043
Maximum bucket size : 2

so it looks like the large number of buckets in the string table and the large size are normal... sigh...
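A self-contained sketch of the kind of batched intern timing reported in the log lines above (the string generation uses the "Aa"/"BB" collision trick as a stand-in for the test's 0-hash input file; all names here are illustrative, not the actual InternTest.java source):

import java.util.ArrayList;
import java.util.List;

public class InternTiming {
    public static void main(String[] args) {
        // Build 2^15 = 32768 strings sharing one hashCode.
        List<String> strings = new ArrayList<>(List.of(""));
        for (int i = 0; i < 15; i++) {
            List<String> next = new ArrayList<>(strings.size() * 2);
            for (String s : strings) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            strings = next;
        }

        // Intern in batches and time each batch, mirroring the test's output.
        final int batch = 10_000;
        long start = System.nanoTime();
        for (int i = 0; i < strings.size(); i++) {
            strings.get(i).intern();
            if ((i + 1) % batch == 0) {
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("interned %d colliding strings - last %d took %.3fs (%.7fs per String)%n",
                                  i + 1, batch, secs, secs / batch);
                start = System.nanoTime();
            }
        }
    }
}

On a table with one overlong bucket, each batch should take longer than the previous one, which is exactly the pattern in the failing logs.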
09-04-2019

It's possible that the first tier8 failure on Feb 7 matches the changes pushed for JDK-8213753 ("SymbolTable is double walked during class unloading and clean up table timing in do_unloading") on Jan 31.
07-04-2019

The failure pattern is odd here. We have one failure on Feb 7 and then one on March 11, both of which took 32 minutes before timing out (tier8 Xcomp runs, so double the timeout?). Then we have 6 CI failures at the 16-minute mark since April 4. So something seems very wrong with this code on SPARC, but why has the frequency of this problem changed so significantly? There must be some code change that has exacerbated this. Each failure occurs on a different machine, so it's not a machine issue. (Though they all may be 'slow'.)
07-04-2019

The failure mode is very strange. As Dan notes, the tests actually "pass" during the timeout handling - in all cases. In fact, all the statistics appear the same for each timeout case, including when the timeout is 32m as well as 16m. That suggests to me that the test is actually hung and it is only the actions of the timeout handler that actually allow the test to make progress again - quickly completing and "passing", but also being marked as timed-out. We need to run this test locally on a SPARC system and examine the state of things after a few minutes.
07-04-2019

Is this a hint from the test itself?

title=This test performs poorly if alternate hashing isn't used for string table.
07-04-2019

Here's a snippet from the log file for a recent failure:

[1.431s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0004594 secs
Fri Apr 05 11:51:38 UTC 2019: interned 10000 0-hash strings - last 10000 took 226.35s (0.022635s per String)
Fri Apr 05 11:58:44 UTC 2019: interned 20000 0-hash strings - last 10000 took 425.689s (0.0425689s per String)
Timeout refired 960 times
[970.115s][info][safepoint,cleanup] deflating global idle monitors, 0.0000014 secs
[970.115s][info][safepoint,cleanup] updating inline caches, 0.0000011 secs
[970.115s][info][safepoint,cleanup] compilation policy safepoint handler, 0.0000006 secs
[970.515s][info][safepoint,cleanup] rehashing string table, 0.4001926 secs
[970.515s][info][safepoint,cleanup] purging class loader data graph, 0.0000003 secs
[970.515s][info][safepoint,cleanup] resizing system dictionaries, 0.0000008 secs
[970.515s][info][safepoint,cleanup] deflating per-thread idle monitors, 0.0000099 secs, monitors=0
[970.515s][info][safepoint,cleanup] safepoint cleanup tasks, 0.4005450 secs
Fri Apr 05 12:04:00 UTC 2019: interned 30000 0-hash strings - last 10000 took 316.613s (0.0316613s per String)
[970.593s][info][safepoint,cleanup] deflating global idle monitors, 0.0000010 secs
[970.593s][info][safepoint,cleanup] updating inline caches, 0.0000007 secs
[970.593s][info][safepoint,cleanup] compilation policy safepoint handler, 0.0000003 secs
[970.593s][info][safepoint,cleanup] purging class loader data graph, 0.0000003 secs
[970.593s][info][safepoint,cleanup] resizing system dictionaries, 0.0000010 secs
[970.593s][info][safepoint,cleanup] deflating per-thread idle monitors, 0.0000062 secs, monitors=0
[970.593s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0002235 secs
[970.594s][info][safepoint,cleanup] deflating global idle monitors, 0.0000004 secs
[970.594s][info][safepoint,cleanup] updating inline caches, 0.0000002 secs
[970.594s][info][safepoint,cleanup] compilation policy safepoint handler, 0.0000001 secs
[970.594s][info][safepoint,cleanup] purging class loader data graph, 0.0000001 secs
[970.594s][info][safepoint,cleanup] resizing system dictionaries, 0.0000002 secs
[970.594s][info][safepoint,cleanup] deflating per-thread idle monitors, 0.0000034 secs, monitors=1
[970.594s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0001241 secs
[970.609s][info][safepoint,cleanup] deflating global idle monitors, 0.0000011 secs
[970.609s][info][safepoint,cleanup] updating inline caches, 0.0000011 secs
[970.609s][info][safepoint,cleanup] compilation policy safepoint handler, 0.0000007 secs
[970.609s][info][safepoint,cleanup] purging class loader data graph, 0.0000003 secs
[970.609s][info][safepoint,cleanup] resizing system dictionaries, 0.0000006 secs
[970.609s][info][safepoint,cleanup] deflating per-thread idle monitors, 0.0000088 secs, monitors=1
[970.609s][info][safepoint,cleanup] safepoint cleanup tasks, 0.0002017 secs
SymbolTable statistics:
Number of buckets : 32768 = 262144 bytes, each 8
Number of entries : 2578 = 41248 bytes, each 16
Number of literals : 2578 = 94480 bytes, avg 36.649
Total footprsize_t : = 397872 bytes
Average bucket size : 0.079
Variance of bucket size : 0.079
Std. dev. of bucket size: 0.281
Maximum bucket size : 3
StringTable statistics:
Number of buckets : 16777216 = 134217728 bytes, each 8
Number of entries : 31394 = 502304 bytes, each 16
Number of literals : 31394 = 1768864 bytes, avg 56.344
Total footprsize_t : = 136488896 bytes
Average bucket size : 0.002
Variance of bucket size : 0.002
Std. dev. of bucket size: 0.043
Maximum bucket size : 2
----------System.err:(1/15)----------
STATUS:Passed.
----------rerun:(35/3874)*----------

In this particular test run, the test timed out, but during timeout handling the test managed to pass. It is using the default timeout of 120 seconds, and with the default timeout factor of 4 used by the CI, that's 8 minutes. This instance of the test ran for:

elapsed time (seconds): 984.983

or 16.41 minutes. If the test's timeout value is changed to 300 (5 minutes), then with the default timeout factor of 4, this would give the test 20 minutes. Correction: it looks like the SPARC tests are given -J-Dtest.timeout.factor=8.0, so changing the test's timeout value to 180 will give it a total timeout value of 24 minutes on SPARC and 12 minutes on other platforms.
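The timeout arithmetic above, as a tiny runnable check (the numbers are taken from this comment; jtreg's effective timeout is the per-test timeout multiplied by test.timeout.factor):

public class TimeoutMath {
    public static void main(String[] args) {
        int perTestTimeout = 180;     // seconds, from main/othervm/timeout=180
        double sparcFactor = 8.0;     // -J-Dtest.timeout.factor=8.0 on SPARC
        double defaultFactor = 4.0;   // default factor used by the CI
        System.out.printf("SPARC: %.0f s = %.0f min%n",
                perTestTimeout * sparcFactor, perTestTimeout * sparcFactor / 60);     // 1440 s = 24 min
        System.out.printf("other: %.0f s = %.0f min%n",
                perTestTimeout * defaultFactor, perTestTimeout * defaultFactor / 60); // 720 s = 12 min
    }
}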
05-04-2019