JDK-8277345 : investigate if specific failed tests are causing Mach5 task timeouts
  • Type: Task
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 18
  • Priority: P3
  • Status: Closed
  • Resolution: External
  • OS: generic
  • CPU: generic
  • Submitted: 2021-11-17
  • Updated: 2022-05-04
  • Resolved: 2022-05-04
Version
JDK 19: Resolved
Related Reports
Relates :  
Relates :  
Sub Tasks
JDK-8277346 :  
JDK-8277576 :  
Description
We're investigating task timeouts in Mach5 via:

MACH5-5002 getting a number of "timeouts in execution" for test tasks on macOS again

MACH5-5014 Post execution timeout on Linux aarch64

I've filed this JBS issue to coordinate the investigation into whether
specific failed tests are causing the Mach5 task timeouts.

For the macosx-x64 task timeouts, there are a couple of related
bugs that are tracking some failure modes:

    JDK-8267433 Core dumps on OSX sometimes take a very long time

    JDK-8265037 serviceability/sa/ClhsdbPmap.java#id1 failed with "RuntimeException: Process is still alive. Can't get its output."

The current operational theory for the macosx-x64 failures is that a test
that core dumps can somehow leave the test machine in a state that slows
down subsequent test execution, especially for other tests that core dump,
and sometimes for tests in general. We have even seen kernel panics on the
macosx-x64 machines when we reboot them after a machine has gotten slow.
Here's an example:

panic(cpu 2 caller 0xffffff80207fe9fe): watchdog timeout: no checkins from watchdogd in 301 seconds (216979 totalcheckins since monitoring last enabled), shutdown in progress


Comments
I'm closing this bug as "External" since the hunt for specific test cases ended a long time ago and all of the "TimeoutException in EXECUTION" sightings are being tracked by MACH5-5002.
04-05-2022

Here's the current summary from the spreadsheet that I'm using to track the MDash "TimeoutException in EXECUTION" task failures:

Summary of the Data:
# of tasks: 188
# of failed tests: 737
# of unique hostnames: 83
# of unique testsuites: 22
# of unique testnames: 109

Phase 1 Sightings Info:
Earliest Phase 1 Sighting: jdk-18+16-874-tier2-20210920-1507-24636438
Last Phase 1 Sighting: jdk-18+24-1609-tier8-20211120-0143-26415360
Phase 1 ProblemListings: JDK-8277346 and JDK-8277351 are in jdk-18+24-1615
# of Phase 1 Build-IDs: 108
# of Phase 1 Sightings: 116
Phase 1 Sightings Average: 1.0740740741

Phase 2 Sightings Info:
Earliest Phase 2 Sighting: jdk-18+25-1621-tier6-20211119-0607-26381320
Last Phase 2 Sighting: jdk-18+25-1670-tier8-20211125-0140-26585195
Phase 2 ProblemListings: JDK-8277576 is in jdk-18+25-1678
# of Phase 2 Build-IDs: 63
# of Phase 2 Sightings: 5
Phase 2 Sightings Average: 0.0793650794

Phase 3 Sightings Info (PRELIMINARY):
Earliest Phase 3 Sighting: jdk-18+26-1750-tier3-20211130-0357-26705670
Last Phase 3 Sighting:
Phase 3 ProblemListings:
# of Phase 3 Build-IDs: 328
# of Phase 3 Sightings: 67
Phase 3 Sightings Average: 0.2042682927

The phased analysis never completed Phase 3 because we never found another group of tests that was useful to ProblemList. At this point the spreadsheet just tracks the "TimeoutException in EXECUTION" task failures along with host, testsuite, and testname data. All of the sightings are being tracked in MACH5-5002 so this bug doesn't really serve any purpose at this point.
04-05-2022
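
For reference, each "Sightings Average" in the summary above is simply the number of sightings divided by the number of Build-IDs for that phase. A minimal sketch of that arithmetic in Python, with the counts copied from the summary (the script itself is illustrative and not part of the tracking spreadsheet):

    # Reproduce the per-phase "Sightings Average" values quoted above:
    #   average = (# of sightings) / (# of Build-IDs)
    phases = {
        "Phase 1": (116, 108),  # (sightings, Build-IDs)
        "Phase 2": (5, 63),
        "Phase 3": (67, 328),   # Phase 3 counts are preliminary
    }

    for name, (sightings, build_ids) in phases.items():
        print(f"{name} Sightings Average: {sightings / build_ids:.10f}")

    # Prints:
    #   Phase 1 Sightings Average: 1.0740740741
    #   Phase 2 Sightings Average: 0.0793650794
    #   Phase 3 Sightings Average: 0.2042682927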

I've been tracking macosx-x64 task timeouts in MACH5-5002 starting with jdk-18+16-875-tier1-20210920-1506-24636410. I've created a spreadsheet to track and analyze the data and here's the current summary:

Summary of the Data:
# of tasks: 110
# of unique hostnames: 42
# of unique test suites: 15
# of unique testnames: 31

Here are the current TOP8 failing tests:
runtime/jni/checked/TestPrimitiveArrayCriticalWithBadParam.java  18
serviceability/sa/ClhsdbCDSCore.java                             21
serviceability/sa/ClhsdbFindPC.java#no-xcomp-core                19
serviceability/sa/ClhsdbFindPC.java#xcomp-core                   19
serviceability/sa/ClhsdbPmap.java#core                           21
serviceability/sa/ClhsdbPstack.java#core                         21
serviceability/sa/TestJmapCore.java                              13
serviceability/sa/TestJmapCoreMetaspace.java                      9

The serviceability/sa tests make sense because those tests generate core files. runtime/jni/checked/TestPrimitiveArrayCriticalWithBadParam.java makes less sense; I think that particular test might be trying to core dump, but I haven't proven that yet.
17-11-2021
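
"ProblemListing" a test, as referenced in the comments above (JDK-8277346, JDK-8277351, JDK-8277576), means adding an entry to a jtreg ProblemList file in the JDK source tree so that the test is skipped on the affected platforms while the underlying problem is investigated. A hypothetical sketch of such entries, using test names from the TOP8 list above; the exact test/bug pairings and platforms in the real changesets may differ:

    # test/hotspot/jtreg/ProblemList.txt (illustrative entries only)
    serviceability/sa/ClhsdbPmap.java#core        8277346   macosx-x64
    serviceability/sa/ClhsdbPstack.java#core      8277346   macosx-x64
    serviceability/sa/ClhsdbCDSCore.java          8277346   macosx-x64

Each line names the test, the bug ID it is being excluded under, and the platform list on which it should be skipped.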