JDK-8328831 : MoCrazy.java stress test hangs on linux-x64 with 1024 threads at 7H+
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 23
  • Priority: P4
  • Status: Closed
  • Resolution: Withdrawn
  • OS: linux_ubuntu
  • CPU: x86_64
  • Submitted: 2024-03-22
  • Updated: 2024-04-01
  • Resolved: 2024-04-01
Description
I resurrected [~rehn]'s MoCrazy.java test program as part of
my stress testing of the following fix:

    JDK-8319251 [REDO] Change LockingMode default from LM_LEGACY to LM_LIGHTWEIGHT

Robbin and I both used MoCrazy.java to shake out monitor deflation
bugs some number of years ago when we were updating that code.

My stress kit runs MoCrazy for 8 hours on linux-x64 with 1024
threads, and the test program hangs between 7 and 8 hours into the
run on all build configs: release, fastdebug, and slowdebug.

I did a baseline stress run with jdk-23+14-1012 and the three hangs
also reproduced in that run, so the hangs are not related to JDK-8319251.
I'm currently stress testing jdk-23+15 (which includes the fix for
JDK-8319251) and the hangs are reproducing there also (no surprise).
Comments
Withdrawing this bug since the fix is in a pre-integration version of the test.
01-04-2024

Rewrote the outer loop to look like this:

    long count = 0;
    long start_time = System.currentTimeMillis();
    while (System.currentTimeMillis() < start_time + (n_seconds * 1000)) {
        count++;
        do_run = true;
        // doit() loops in the "stress" portion for 60 seconds:
        long sum = mc.doit();
        System.out.println("Score-" + count + ": " + sum);
    }

Now we loop for 1-minute durations of work plus any overhead and stop looping when we have run for the specified number of seconds. This version of the test was used in jdk-23+16 stress testing and we no longer had any timeouts of the MoCrazy test.
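As a standalone illustration of the termination behavior (not part of the attached test; the class name, budget, and stubbed doit() below are all placeholders), the wall-clock-bounded loop guarantees the total run time is the budget plus at most one iteration's overrun:

    // Hypothetical standalone sketch of the wall-clock-bounded loop;
    // doit() is stubbed to 2 seconds so the sketch finishes quickly.
    public class TimedLoopSketch {
        static long doit() throws InterruptedException {
            Thread.sleep(2_000);   // stand-in for MoCrazy's 60s stress window
            return 42;             // stand-in score
        }

        public static void main(String[] args) throws InterruptedException {
            long n_seconds = 7;    // total budget; the real test passes this in
            long count = 0;
            long start_time = System.currentTimeMillis();
            // Don't start a new iteration once the budget is spent, so the
            // total wall time is the budget plus at most one iteration.
            while (System.currentTimeMillis() < start_time + (n_seconds * 1000)) {
                count++;
                long sum = doit();
                System.out.println("Score-" + count + ": " + sum);
            }
        }
    }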
01-04-2024

I executed 3 slowdebug runs of MoCrazy for 8 hours each without JTREG. The load average was up over 100 for the entire duration of the runs so I expected the "hang" to reproduce.

Here are snippets from MoCrazy_1024_480_slowdebug.log.save.02a:

    java version "23-internal" 2024-09-17
    Java(TM) SE Runtime Environment (slowdebug build 23-internal-2024-03-25-1814497.dcubed...)
    Java HotSpot(TM) 64-Bit Server VM (slowdebug build 23-internal-2024-03-25-1814497.dcubed..., mixed mode, sharing)
    INFO: n_threads=1024
    INFO: n_minutes=480
    Score-1: 749315072
    Score-2: 619958272
    Score-3: 580575232
    <snip>
    Score-478: 604822528
    Score-479: 586295296
    Score-480: 606539776
    real    522m10.321s
    user    3727m8.864s
    sys     396m31.602s

Here are snippets from MoCrazy_1024_480_slowdebug.log.save.02b:

    java version "23-internal" 2024-09-17
    Java(TM) SE Runtime Environment (slowdebug build 23-internal-2024-03-25-1814497.dcubed...)
    Java HotSpot(TM) 64-Bit Server VM (slowdebug build 23-internal-2024-03-25-1814497.dcubed..., mixed mode, sharing)
    INFO: n_threads=1024
    INFO: n_minutes=480
    Score-1: 649732096
    Score-2: 633641984
    Score-3: 620092416
    <snip>
    Score-478: 593620992
    Score-479: 621213696
    Score-480: 651915264
    real    522m4.890s
    user    3667m28.101s
    sys     397m6.213s

Here are snippets from MoCrazy_1024_480_slowdebug.log.save.02c:

    java version "23-internal" 2024-09-17
    Java(TM) SE Runtime Environment (slowdebug build 23-internal-2024-03-25-1814497.dcubed...)
    Java HotSpot(TM) 64-Bit Server VM (slowdebug build 23-internal-2024-03-25-1814497.dcubed..., mixed mode, sharing)
    INFO: n_threads=1024
    INFO: n_minutes=480
    Score-1: 619231232
    Score-2: 586404864
    Score-3: 609130496
    <snip>
    Score-478: 592411648
    Score-479: 630879232
    Score-480: 933356544
    real    522m29.173s
    user    3736m35.525s
    sys     397m22.610s

Each of the three runs executed 480 iterations and finished in 522 real/wall clock minutes. Wait... that's 42 minutes longer than 8 hours...

Here's the StressWrapper that I've been using:

    $ cat closed/test/hotspot/jtreg/dummy_subdir/StressWrapper_MoCrazy_8H.java
    /**
     * @test id=linux
     * @summary Stress test multi-threaded monitor locking on linux for 8 hours.
     * @requires os.family == "linux"
     * @run main/othervm/timeout=29100
     *      -ms256m -mx256m
     *      MoCrazy 1024 480
     */

    /**
     * @test id=mac
     * @summary Stress test multi-threaded monitor locking on mac 8 hours.
     * @requires os.family == "mac"
     * @run main/othervm/timeout=29100
     *      -ms256m -mx256m
     *      MoCrazy 32 480
     */

and it sets the timeout to 29100 seconds/485 minutes... Well there's your problem right there...

Here's our stress test doit() function:

    private long doit() {
        threads = new Thread[n_threads];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(this);
        }
        locks = new Object[1024];
        cnt = new Long[locks.length];
        for (int i = 0; i < locks.length; i++) {
            locks[i] = new Object();
            cnt[i] = Long.valueOf("0");
        }
        for (Thread t : threads) {
            t.start();
        }
        try {
            Thread.sleep(60 * 1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        do_run = false;
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        long sum = 0;
        for (Long l : cnt) sum += l;
        return sum;
    }

It launches 1024 threads and delays for 60 seconds before switching the volatile `do_run` variable to `false`. And then we do cleanup work: join all the worker threads and sum up all the counters. That's some work but I didn't imagine it to be a lot of work...
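The worker side of the test isn't quoted above. As a rough, hypothetical sketch of the pattern the description implies (workers spin on the volatile `do_run` flag, contend on randomly chosen monitors, and bump per-lock counters), a self-contained stand-in could look like this; the names and sizes are illustrative, and it uses a plain long[] where MoCrazy uses boxed Long[] counters:

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical stand-in for the MoCrazy worker loop; the real run()
    // body may differ. Small thread count so the sketch runs quickly.
    public class WorkerSketch implements Runnable {
        static volatile boolean do_run = true;    // flipped by the driver
        static final Object[] locks = new Object[1024];
        static final long[] cnt = new long[locks.length];
        static {
            for (int i = 0; i < locks.length; i++) {
                locks[i] = new Object();
            }
        }

        @Override
        public void run() {
            // Contend on randomly chosen monitors until the driver stops us.
            while (do_run) {
                int i = ThreadLocalRandom.current().nextInt(locks.length);
                synchronized (locks[i]) {
                    cnt[i]++;   // per-lock counter, summed by the driver
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[8];     // MoCrazy uses 1024
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(new WorkerSketch());
                threads[i].start();
            }
            Thread.sleep(2_000);                  // MoCrazy sleeps 60s
            do_run = false;
            for (Thread t : threads) t.join();
            long sum = 0;
            for (long c : cnt) sum += c;
            System.out.println("Score: " + sum);
        }
    }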
Here's our outer loop:

    for (int m = 1; m <= n_minutes; m++) {
        do_run = true;
        long sum = mc.doit();
        System.out.println("Score-" + m + ": " + sum);
    }

We simply call the stress `doit()` function for `n_minutes` iterations and _assume_ that we're only running for that many minutes...

    522 - 480 == 42
    42 / 480 == 0.0875
    60 * 0.0875 == 5.2500

So we have 5.25 seconds of overhead per iteration and:

    5.25 * 480 == 2520
    2520 / 60 == 42

and for 8 hours/480 runs that costs us 42 minutes. Okay. This simplistic loop doesn't scale:

    for (int m = 1; m <= n_minutes; m++) {
        do_run = true;
        long sum = mc.doit();
        System.out.println("Score-" + m + ": " + sum);
    }

so I'm going to have to tweak the test a bit. Also, all of my other stress tests take n_seconds as the parameter rather than n_minutes so that it is more easily matched up with the `/timeout=NNNN` value.
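One way to make that ~5.25 seconds visible per iteration would be to time each doit() call and report the overrun beyond the fixed stress window. This instrumentation is mine, not part of the test; the stubbed doit() and the shortened window are placeholders:

    // Hypothetical instrumentation sketch: measure the per-iteration
    // overhead beyond the fixed stress window. Stubbed to run quickly.
    public class OverheadProbe {
        static final long WINDOW_MS = 1_000;   // MoCrazy's window is 60_000

        static long doit() throws InterruptedException {
            Thread.sleep(WINDOW_MS + 50);      // window plus some "cleanup"
            return 0;                          // stand-in score
        }

        public static void main(String[] args) throws InterruptedException {
            int n_iterations = 5;              // MoCrazy's n_minutes is 480
            for (int m = 1; m <= n_iterations; m++) {
                long t0 = System.currentTimeMillis();
                long sum = doit();
                long overhead = (System.currentTimeMillis() - t0) - WINDOW_MS;
                // In the runs above the overhead averaged ~5.25 s/iteration,
                // i.e. 42 extra minutes across 480 iterations.
                System.out.println("Score-" + m + ": " + sum
                        + " (overhead " + overhead + " ms)");
            }
        }
    }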
01-04-2024

Here are the log files from my jdk-23+15 stress run sightings on linux-x64:

    $ unzip -l jdk-23+15_linux-x64.8328831.zip
    Archive:  jdk-23+15_linux-x64.8328831.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
        96012  2024-03-21 22:28   jdk-23+15_1/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.release
       107453  2024-03-22 06:35   jdk-23+15_1/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.fastdebug
       106823  2024-03-22 14:42   jdk-23+15_1/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.slowdebug
       113180  2024-03-23 00:39   jdk-23+15_2/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.release
       104945  2024-03-23 08:46   jdk-23+15_2/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.fastdebug
       108903  2024-03-23 16:53   jdk-23+15_2/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.slowdebug
       115075  2024-03-24 02:41   jdk-23+15_3/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.release
       105852  2024-03-24 10:48   jdk-23+15_3/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.fastdebug
       107525  2024-03-24 18:55   jdk-23+15_3/failures.linux-x86_64/StressWrapper_MoCrazy_8H_linux.jtr.slowdebug
    ---------                     -------
       965768                     9 files
25-03-2024

I've attached the current version of MoCrazy and the StressWrapper that I'm using to run the test for 8 hours. I'm using this current version in jdk-23+15 stress testing right now:

    $ unzip -l MoCrazy.zip
    Archive:  MoCrazy.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
         3822  2024-03-21 12:48   closed/test/hotspot/jtreg/dummy_subdir/MoCrazy.java
          427  2024-03-21 12:48   closed/test/hotspot/jtreg/dummy_subdir/StressWrapper_MoCrazy_8H.java
    ---------                     -------
         4249  2 files

I happen to put the new test on the closed side while I'm shaking it out, but it can also be dropped into open/test/hotspot/jtreg/dummy_subdir. The dummy_subdir thing is to work around a weird bug where the two `@requires` cause problems when the test is not in a subdirectory. I'll try to create a simpler standalone test case for that bit of weirdness and file another bug.
22-03-2024

I did a single baseline run using jdk-23+14-1012 bits and reproduced the same hangs on linux-x64, so the hangs are also present in the baseline system without the changes for JDK-8319251. Again, the monitor stress tests were only partially integrated into my stress kit. I have the output logs from my do_monitor_stress_tests.ksh script, which includes both MoCrazy results and Inflate2 results. The Inflate2 runs all passed.

    $ unzip -l jdk-23+14-1012_linux-x64_monitor_stress_logs.zip
    Archive:  jdk-23+14-1012_linux-x64_monitor_stress_logs.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
       331959  2024-03-20 11:52   jdk-23+14-1012_1/logs.linux-x86_64/do_monitor_stress_tests.log
    ---------                     -------
       331959  1 file
22-03-2024

When I did JDK-8319251.v2 testing, the monitor stress tests were only partially integrated into my stress kit. I have the output logs from my do_monitor_stress_tests.ksh script, which includes both MoCrazy results and Inflate2 results. The Inflate2 runs all passed.

    $ unzip -l JDK-8319251.v2_linux-x64_monitor_stress_logs.zip
    Archive:  JDK-8319251.v2_linux-x64_monitor_stress_logs.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
       331825  2024-03-17 01:08   JDK-8319251.v2_1/logs.linux-x86_64/do_monitor_stress_tests.log
       327845  2024-03-18 03:20   JDK-8319251.v2_2/logs.linux-x86_64/do_monitor_stress_tests.log
       332428  2024-03-19 05:30   JDK-8319251.v2_3/logs.linux-x86_64/do_monitor_stress_tests.log
    ---------                     -------
       992098  3 files
22-03-2024