Bug ID: JDK-8227695 assert(pss->trim_ticks().seconds() == 0.0) failed: Unexpected partial trimming during evacuation

JDK-8227695 : assert(pss->trim_ticks().seconds() == 0.0) failed: Unexpected partial trimming during evacuation

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 13,14,15,16,17

Priority: P2
Status: Closed
Resolution: Fixed

Submitted: 2019-07-15
Updated: 2024-12-20
Resolved: 2021-01-21

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 16	JDK 17
16 b34Fixed	17Fixed

Related Reports

Blocks :	JDK-8230794 - Improve assert to get more information about the JDK-8227695 failure
Duplicate :	JDK-8233350 - vmTestbase/nsk/jdi/ObjectReference/referringObjects/referringObjects003/referringObjects003.java failed with nsk.share.Failure: Caught IOException while writing an object to IOPipe connection
Relates :	JDK-8218668 - Clean up evacuation of optional collection set
Relates :	JDK-8232771 - Revert JDK-8230794 because of environment changes
Relates :	JDK-8213890 - Implementation of JEP 344: Abortable Mixed Collections for G1

Description

Test failed with

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (t:/workspace/open/src/hotspot/share/gc/g1/g1CollectedHeap.cpp:3656), pid=23808, tid=38764
#  assert(pss->trim_ticks().seconds() == 0.0) failed: Unexpected partial trimming during evacuation
#
# JRE version: Java(TM) SE Runtime Environment (13.0) (fastdebug build 13-ea+0-1461)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 13-ea+0-1461, mixed mode, sharing, tiered, compressed oops, g1 gc, windows-amd64)
# Core dump will be written. Default location: T:\testoutput\test-support\jtreg_closed_test_hotspot_jtreg_applications_microbenchmarks_other_Test_20_java\scratch\0\hs_err_pid23808.mdmp
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---------------  S U M M A R Y ------------

Command Line: -XX:MaxRAMPercentage=3 -XX:+CreateCoredumpOnCrash -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -XX:+TieredCompilation -XX:+IgnoreUnrecognizedVMOptions -XX:-DoEscapeAnalysis --add-opens=java.base/java.io=ALL-UNNAMED -Djmh.ignoreLock=true -XX:CompileCommandFile=C:\cygwin\tmp\jmh12803977271919436643compilecommand org.openjdk.jmh.runner.ForkedMain 127.0.0.1 27005

Host:  AMD EPYC 7551 32-Core Processor                , 16 cores, 63G,  Windows Server 2012 R2 , 64 bit Build 9600 (6.3.9600.19358)
Time: Mon Jul 15 07:38:41 2019 /GM elapsed time: 4 seconds (0d 0h 0m 4s)

---------------  T H R E A D  ---------------

Current thread (0x000000d50b73b000):  GCTaskThread "GC Thread#5" [stack: 0x000000d50d3e0000,0x000000d50d4e0000] [id=38764]

Stack: [0x000000d50d3e0000,0x000000d50d4e0000]
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0xadadb1]  os::platform_print_native_stack+0xf1  (os_windows_x86.cpp:369)
V  [jvm.dll+0xcdf8ef]  VMError::report+0xf0f  (vmerror.cpp:716)
V  [jvm.dll+0xce11ae]  VMError::report_and_die+0x8ae  (vmerror.cpp:1518)
V  [jvm.dll+0xce1834]  VMError::report_and_die+0x64  (vmerror.cpp:1308)
V  [jvm.dll+0x50d9c2]  report_vm_error+0x102  (debug.cpp:264)
V  [jvm.dll+0x5e3d97]  G1EvacuateRegionsBaseTask::evacuate_live_objects+0x217  (g1collectedheap.cpp:3656)
V  [jvm.dll+0x5e3de5]  G1EvacuateRegionsTask::evacuate_live_objects+0x25  (g1collectedheap.cpp:3706)
V  [jvm.dll+0x5ecc46]  G1EvacuateRegionsBaseTask::work+0xc6  (g1collectedheap.cpp:3689)
V  [jvm.dll+0xd20f97]  GangWorker::loop+0xb7  (workgroup.cpp:345)
V  [jvm.dll+0xc7c592]  Thread::call_run+0x192  (thread.cpp:405)
V  [jvm.dll+0xad98be]  thread_native_entry+0x10e  (os_windows.cpp:471)
C  [ucrtbase.DLL+0x203ba]
C  [KERNEL32.DLL+0x13f2]
C  [ntdll.dll+0x154f4]


---------------  P R O C E S S  ---------------

Comments

The fix for this bug is integrated in jdk-16+34-2207.
22-01-2021
Changeset: ede1beae Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2021-01-21 18:21:07 +0000 URL: https://git.openjdk.java.net/jdk16/commit/ede1beae
21-01-2021
Fix Request for JDK 16 approved.
21-01-2021
Fix Request Motivation: This assert fails on tests randomly, cluttering the CI results Risk: (Very) Low. The fix is straight forward, changing an FP comparison to an integer comparison in an assert. That exact assert with integer comparison is in use in another place in the code already and has never been failing. There is also no known product issue behind it. Patch: https://github.com/openjdk/jdk16/pull/128 Testing: Tier1-5 on all Oracle platforms. Testing with 4k iterations of that patch on two tests that seem to be failing more often than others. Testing with 4k iterations of the same two tests with this assert placed in 10+ other locations that were used during the investigation found out sometimes failing.
21-01-2021
jdk-15-defer-request justification: This is an assert that is supposed to fire if the timing information for GC work had not been flushed properly into global time measurement data structures at the end of GC. Latest investigations show that the value has indeed been flushed (i.e. this is a duration that is zero), but the assert fails anyway. I.e. the calculation of "duration * ((double)1 / 1000000)" where "duration" is 0 is unequal the floating point constant of "0.0". Moving around this assertion shows that this may occur in seemingly random places during GC related to reading the Windows performance counter/calling os::elapsedCounter(). Other than that recent investigations have not brought up really new insights in how this is caused. Impact: There is no known significant bad behavior caused by this failing assert in these circumstances in the product. The reported duration does have the value 0, i.e. has been flushed properly. Anyway, at most a single data point in the heuristics to size the young generation could be off if the asserted value did not contain zero. G1 always aggregates and smooths out multiple data points in its heuristics, so a single one slightly off. In this case, an actual nonzero value in that duration would result in slightly smaller young generations. Overall, since this issue occurs very very infrequently, the impact on the (smoothed) prediction will be extremely small. I.e. risk/impact is low. It also only occurs on one particular type of machine (AMD Epyc) and platform (Windows). This should not imply that the platforms are buggy, but only this particular combination causes the situation to arise which seems timing dependent. It seems to occur less often now, but the cause may be frequent host rotation in CI.
18-07-2020
Has anyone tried reproducing this outside the VM with just calls to QPC and some FP logic?
09-04-2020
Always use fp_value < machine_epsilon when comparing floating point numbers to zero. The assignment x = 0.0 isn't guaranteed to result in a binary value of only zeros in x. The same goes for constants. If you compare to the constant 0.0 you are not guaranteed to compare to a binary value of all zeros. So even if your left value is 0x000.0000 the constant 0.0 on your right may not be - and as it fails most likely isn't.
02-04-2020
This is a problem, but not this problem. The problem as stated above that the FP value 0x0000..00 is != 0.0 (0x000...000). I.e. even when augmenting the assert to print the binary value, it is 0.0 (exact) and still fails. In different places in the evacuation loop, i.e. if you add above code snippet that adds the compared values to the assert in various places in the evacuation loop, it randomly fails in different locations, always with the left value of the comparisons being 0x000.0000. Most frequently in close proximity to calls to QueryPerformanceCounter/os:elapsedCount().
20-03-2020
Doesn't trying to compare an exact floating-point value of 0.0 violate the basic rules when it comes to FP comparisons. FP is imprecise so your zero value may have a tiny epsilon delta from exact zero. If you expect a truly zero value then cast to int and check == 0
20-03-2020
jdk-14-defer-request justification: This is an assert that is supposed to fire if the timing information for GC work had not been flushed properly into global time measurement data structures at the end of GC. Latest investigations show that the value has indeed been flushed (i.e. this is a duration that is zero), but the assert fails anyway. I.e. the calculation of "duration * ((double)1 / 1000000)" where "duration" is 0 is unequal the floating point constant of "0.0". Moving around this assertion shows that this may occur in seemingly random places during GC (it may also occur at other times, but we did not investigate). Impact: There is no known significant bad behavior caused by this failing assert in these circumstances in the product. The reported duration does have the value 0, i.e. has been flushed properly. Anyway, at most a single data point in the heuristics to size the young generation could be off if the asserted value did not contain zero. G1 always aggregates and smooths out multiple data points in its heuristics, so a single one slightly off. In this case, an actual nonzero value in that duration would result in slightly smaller young generations. Overall, since this issue occurs very very infrequently, the impact on the (smoothed) prediction will be extremely small. I.e. risk/impact is low. It also only occurs on one particular type of machine (AMD Epyc) and platform (Windows). This should not imply that the platforms are buggy, but only this particular combination causes the situation to arise which seems timing dependent.
14-01-2020
Found in mach5-one-jdk-14+21-923-tier: /nsk/jdi/ObjectReference/referringObjects/referringObjects003/referringObjects003.java job:https://java.se.oracle.com:10065/mdash/jobs/mach5-one-jdk-14+21-923-tier5-20191030-1043-6277699 log:https://mach5.us.oracle.com:10060/api/v1/results/mach5-one-jdk-14+21-923-tier5-20191030-1043-6277699-tier5-svc-vmTestbase_nsk_jdi-windows-x64-debug-385-625/log
12-11-2019
Also remove the debug code introduced in JDK-8230794 when fixing this.
10-09-2019
Occurs on a young normal GC, so not related to mixed gc (i.e. abortable mixed gc) changes; also seen on jdk 14 and so not (accidentally) fixed by JDK-8213108
12-08-2019
Deferral Request: This is an assert that fires if the timing information for some GC work has not been flushed into global time taking data structures at the end of GC. This issue could not be reproduced after 8500 executions of one of the failing tests in our CI. Impact: I am not aware of any significant bad behavior caused by this (crash, ...). The assert checks whether all GC work has been accounted for in internal timing measurements. The GC work (copying objects around) has been done, but G1 did not add the time this took, to the total amount of object copy work. This total amount of copying work and the time it took is used to generate a "bytes copied/ms" rate. This value is (e.g. after temporal smoothing) used to predict how much time evacuating a particular region would take to evacuate; G1 uses this value for predicting the number of young regions the mutator is allowed to fill until the next gc. The time is used in the denominator, resulting in a higher cost for copying bytes. So in the product, this may result in slightly smaller edens; note that this issue occurs very very infrequently, the impact on the (smoothed) prediction will be extremely small. I.e. imo, risk/impact is low. It also only seem to occur on one particular type of machine (AMD Epyc) and platform (Windows). This should not imply that the platforms are buggy, but only this particular combination causes the situation to arise which seems timing dependent. The assert has been introduced in JDK-8213890 (12) and moved around a bit in JDK-8218668 (13)
02-08-2019
Assuming this is not some random write to that variable storing the data that should be zero: it means that we did some garbage collection work (copying objects) that is not properly collected in the internal GC statistics. Depending on how big these are, this might or might not affect predictions. There is no direct issue with functionality as the work itself has been performed; there is also no correspondence to particular slowness of the machines I can think of (and these EPYC processors are quite beefy anyway). However it could be caused by extreme variations in GC thread scheduling as exhibited in virtualized environments (This is only a guess, although code review did not find an obvious candidate for this situation occurring, ie. a part of GC that might skip flushing this timing information into the global statistics). The assert is overly cryptic and should be fixed.
30-07-2019
What is the assertion actually trying to check? If this is non-zero is that because of slow machines?
29-07-2019
The CTW tests fail with this error: # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (t:/workspace/open/src/hotspot/share/gc/g1/g1CollectedHeap.cpp:3610), pid=12352, tid=40652 # assert(pss->trim_ticks().seconds() == 0.0) failed: Unexpected partial trimming during evacuation # # JRE version: Java(TM) SE Runtime Environment (14.0+7) (fastdebug build 14-ea+7-189) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 14-ea+7-189, mixed mode, sharing, compressed oops, g1 gc, windows-amd64) # Core dump will be written. Default location: T:\testoutput\test-support\jtreg_closed_test_hotspot_jtreg_applications_ctw_jars\scratch\2\hs_err_pid12352.mdmp # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # --------------- S U M M A R Y ------------ Command Line: -XX:MaxRAMPercentage=3 -XX:+CreateCoredumpOnCrash -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -Djava.awt.headless=true -Xbatch -XX:-UseCounterDecay -XX:-ShowMessageBoxOnError -XX:+UnlockDiagnosticVMOptions -XX:+DisplayVMOutputToStderr -DCompileTheWorldStartAt=0 -DCompileTheWorldStopAt=291 -XX:+WhiteBoxAPI -Xbootclasspath/a:. --add-exports=java.base/jdk.internal.jimage=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED --add-exports=java.base/jdk.internal.reflect=ALL-UNNAMED --add-exports=java.base/jdk.internal.access=ALL-UNNAMED -XX:+LogCompilation -XX:LogFile=hotspot_xom-1.1.jar_0_%p.log -XX:ErrorFile=hs_err_xom-1.1.jar_0_%p.log -XX:ReplayDataFile=replay_xom-1.1.jar_0_%p.log -XX:CompileCommand=exclude,java/lang/invoke/MethodHandle.* sun.hotspot.tools.ctw.CompileTheWorld C:\ade\mesos\work_dir\jib-master\install\org\spec\SPECjvm2008\1.00\SPECjvm2008-1.00.zip\SPECjvm2008\lib\xom-1.1.jar Host: AMD EPYC 7551 32-Core Processor , 16 cores, 63G, Windows Server 2012 R2 , 64 bit Build 9600 (6.3.9600.19358) Time: Fri Jul 19 04:08:31 2019 /GM elapsed time: 31 seconds (0d 0h 0m 31s) --------------- T H R E A D --------------- Current thread (0x000000a1b4f07000): GCTaskThread "GC Thread#6" [stack: 0x000000a1b6330000,0x000000a1b6430000] [id=40652] Stack: [0x000000a1b6330000,0x000000a1b6430000] Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [jvm.dll+0xadde81] os::platform_print_native_stack+0xf1 (os_windows_x86.cpp:369) V [jvm.dll+0xce2ddf] VMError::report+0xf0f (vmerror.cpp:716) V [jvm.dll+0xce469e] VMError::report_and_die+0x8ae (vmerror.cpp:1524) V [jvm.dll+0xce4d94] VMError::report_and_die+0x64 (vmerror.cpp:1308) V [jvm.dll+0x50e0b2] report_vm_error+0x102 (debug.cpp:264) V [jvm.dll+0x5e4617] G1EvacuateRegionsBaseTask::evacuate_live_objects+0x217 (g1collectedheap.cpp:3610) V [jvm.dll+0x5e4665] G1EvacuateRegionsTask::evacuate_live_objects+0x25 (g1collectedheap.cpp:3660) V [jvm.dll+0x5ed556] G1EvacuateRegionsBaseTask::work+0xc6 (g1collectedheap.cpp:3643) V [jvm.dll+0xd244f7] GangWorker::loop+0xb7 (workgroup.cpp:345) V [jvm.dll+0xc7f9d2] Thread::call_run+0x192 (thread.cpp:405) V [jvm.dll+0xadc94e] thread_native_entry+0x10e (os_windows.cpp:471) C [ucrtbase.DLL+0x203ba] C [KERNEL32.DLL+0x13f2] C [ntdll.dll+0x154f4]
19-07-2019