Bug ID: JDK-8231779 crash HeapWord*ParallelScavengeHeap::failed_mem_allocate

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 8u202,11,14,15

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2019-10-03
Updated: 2020-08-14
Resolved: 2020-03-23

Versions (Unresolved/Resolved/Fixed)

JDK 11: 11.0.8-oracle (Fixed)
JDK 13: 13.0.4 (Fixed)
JDK 15: 15 b16 (Fixed)
JDK 8: 8u251 (Fixed)
Other: openjdk8u272 (Fixed)

Comments

Fix Request (8u) I would like to backport this to 8u for parity with Oracle 8u271. The original patch does not apply cleanly. Code review: https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-July/012108.html (reviewed)
21-07-2020
Fix request (13u): The original change applies cleanly, tier1 tests pass.
05-06-2020
Fix request (11u) -- will label after testing completed. I would like to downport this for parity with 11.0.8-oracle. The change needed some trivial resolves: http://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-March/002931.html
27-03-2020
URL: https://hg.openjdk.java.net/jdk/jdk/rev/fcbd54a2c2d9 User: poonam Date: 2020-03-23 18:02:43 +0000
23-03-2020
From core files provided by the customer facing this crash, it was found that the gcc compiled code performing 'double' to 'float' conversion was corrupting some of the values in the floating point registers. The crash was seen with JDK8 and not with 6 or 7.

Stack trace of the crashing thread:

--- called from signal handler with signal 11 (SIGSEGV) ---
ffffffff7e838754 __ftoul (a5400, f5c, 3d, 1002e9858, ffffffff7df93c60, 1002e9800) + 34
ffffffff7db1e66c __1cKPSScavengeQinvoke_no_policy6F_b_ (78400, 100114320, 10011e520, ffffffff7e033130, 10011e820, ffffffff7df93c60) + 1334
ffffffff7db1d14c __1cKPSScavengeGinvoke6F_b_ (100114320, 2c0, 0, ffffffff7e041de0, ffffffff7df93c60, 51) + 44
ffffffff7dabe518 __1cUParallelScavengeHeapTfailed_mem_allocate6ML_pnIHeapWord__ (10011e520, 12, 10011e520, 4d57b4, e, ffffffff7df93c60) + 70
ffffffff7dcc2ea0 __1cbDVM_ParallelGCFailedAllocationEdoit6M_v_ (fffffffed33fd970, 10011e520, 0, fffffffedeaff87f, ffffffff7df93c60, c) + 98
ffffffff7dcca504 __1cMVM_OperationIevaluate6M_v_ (fffffffed33fd970, 1002d7000, 8, ffffffff7df93c60, 2c97a4, 0) + 4c
ffffffff7dcc851c __1cIVMThreadSevaluate_operation6MpnMVM_Operation__v_ (1a, fffffffed33fd970, c3d20, 100243800, ffffffff7df93c60, 3d8) + 114
ffffffff7dcc8b98 __1cIVMThreadEloop6M_v_ (1002d7000, ffffffffffead716, ffffffff7e057840, 0, ffffffff7df93c60, ffffffff7e057980) + 428
ffffffff7dcc80cc __1cIVMThreadDrun6M_v_ (1002d7000, 7f, 99220, ffffffff7df93c60, 2cbc2c, 99000) + 9c
ffffffff7da91c18 java_start (1002d7000, ffffffff7e030118, 0, 1002a6650, ffffffff7df93c60, c7c14) + 378
ffffffff7e8d8c40 _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 19 / thread# 19 --------------------

Crash happened in __ftoul():

   0xffffffff7e838748 <__ftoul+40>: or %g5, %o0, %o0
   0xffffffff7e83874c <__ftoul+44>: retl
   0xffffffff7e838750 <__ftoul+48>: add %sp, 0xc0, %sp
=> 0xffffffff7e838754 <__ftoul+52>: fstox %f1, %f2
   0xffffffff7e838758 <__ftoul+56>: std %f2, [ %sp + 0x8af ]

Frame that called __ftoul() was:

0xffffffff7dc21d34: resize_all_tlabs+0x003c: ldx [%i0], %i5 //java thread
0xffffffff7dc21d38: resize_all_tlabs+0x0040: brz,pn %i5, resize_all_tlabs+0x204 ! 0xffffffff7dc21efc
0xffffffff7dc21d3c: resize_all_tlabs+0x0044: sethi %hi(0x0), %l5
0xffffffff7dc21d40: resize_all_tlabs+0x0048: ld [%i5 + 164], %f3 //load AdaptiveWeightedAverage::_average into f3
0xffffffff7dc21d44: resize_all_tlabs+0x004c: xor %l5, 704, %l3
0xffffffff7dc21d48: resize_all_tlabs+0x0050: mov %i5, %o1
0xffffffff7dc21d4c: resize_all_tlabs+0x0054: ldx [%i4 + %l3], %o7
0xffffffff7dc21d50: resize_all_tlabs+0x0058: sethi %hi(0x100000), %l6
0xffffffff7dc21d54: resize_all_tlabs+0x005c: ldx [%o7], %o0
0xffffffff7dc21d58: resize_all_tlabs+0x0060: st %f3, [%sp + 2223] //store AdaptiveWeightedAverage::_average to [%sp + 2223]
...
0xffffffff7dc21e0c: resize_all_tlabs+0x0114: ld [%sp + 2223], %f1 //load AdaptiveWeightedAverage::_average in f1
0xffffffff7dc21e10: resize_all_tlabs+0x0118: sethi %hi(0xc4400), %l3
0xffffffff7dc21e14: resize_all_tlabs+0x011c: mov 12, %l4
0xffffffff7dc21e18: resize_all_tlabs+0x0120: xor %l3, 760, %l1
0xffffffff7dc21e1c: resize_all_tlabs+0x0124: fabss %f0, %f2
0xffffffff7dc21e20: resize_all_tlabs+0x0128: call _PROCEDURE_LINKAGE_TABLE_+0xc0 [PLT] ! 0xffffffff7df97ec0 //call __ftoul

Register values:

f1 +1.836710e-39
f2 +1.310720e+06
f3 +1.401298e-45

The generated code for a double to float conversion seems to be doing the conversion wrong. From the disassembly, we can see that different parts of the stack value are used when storing from and loading into the floating point registers.

On stack, we had:

(dbx) x 0xfffffffedeafe641+2223
0xfffffffedeafeef0: 0x0014000000000001

The value 0x0014000000000001 is 2.78134232313400222292843673791E-308 as a double-precision floating point number, which is within the limits of -1024 to 1023 for the exponent for doubles.

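To relate the register values to the stack slot shown by dbx, here is a minimal sketch (not part of the original analysis) that reinterprets the 64-bit slot and each of its 32-bit halves as IEEE 754 values. The helper names (float_from_bits, double_from_bits, high_word, low_word, check_slot_interpretations) are hypothetical and chosen only for this illustration. The sketch assumes a host with 32-bit float and 64-bit double in IEEE 754 format and default subnormal handling (no flush-to-zero).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a single-precision float (hypothetical helper).
inline float float_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Reinterpret a 64-bit pattern as a double-precision float (hypothetical helper).
inline double double_from_bits(std::uint64_t bits) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// The 8-byte stack slot value reported by dbx in the bug report.
constexpr std::uint64_t kStackSlot = 0x0014000000000001ULL;

// Split by numeric significance. On big-endian SPARC the high word
// sits at the lower address of the slot.
inline std::uint32_t high_word(std::uint64_t v) {
    return static_cast<std::uint32_t>(v >> 32);
}
inline std::uint32_t low_word(std::uint64_t v) {
    return static_cast<std::uint32_t>(v & 0xFFFFFFFFu);
}

// Compare the three interpretations with the values quoted in the report.
// Comparisons are done in double so that subnormal float literals do not
// round onto the value under test.
inline void check_slot_interpretations() {
    const double low   = float_from_bits(low_word(kStackSlot));   // bits 0x00000001
    const double high  = float_from_bits(high_word(kStackSlot));  // bits 0x00140000
    const double whole = double_from_bits(kStackSlot);

    assert(low > 1.40e-45 && low < 1.41e-45);          // matches f3 (+1.401298e-45)
    assert(high > 1.8367e-39 && high < 1.8368e-39);    // matches f1 (+1.836710e-39)
    assert(whole > 2.7813e-308 && whole < 2.7814e-308); // the double quoted above
}
```

Calling check_slot_interpretations() from a test or main function exercises the three interpretations. Under these assumptions, the low word read as a float matches f3 and the high word read as a float matches f1, which is consistent with the observation that the store and the load touched different halves of the slot.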
Now when this double value on stack is converted to a float, a bad float ends up in the f1 register, which causes the crash:

0xffffffff7e838754: __ftoul+0x0034: fstox %f1, %f2 <-- crash here

This is what happens during the conversion:

0xffffffff7dc21d40: resize_all_tlabs+0x0048: ld [%i5 + 164], %f3 //load _average (0x01) into f3
0xffffffff7dc21d58: resize_all_tlabs+0x0060: st %f3, [%sp + 2223] //0x1 gets stored on lower 32 bits on stack
0xffffffff7dc21e0c: resize_all_tlabs+0x0114: ld [%sp + 2223], %f1 //upper 32bits from stack get loaded into f1

It looks like the 'ld' at resize_all_tlabs+0x0114 reads the wrong 32 bits from the stack. That resulted in a bad value in f1.

The code that computes the _allocation_fraction changed in JDK8 (in ThreadLocalAllocBuffer::accumulate_statistics()):

double alloc_frac = MIN2(1.0, (double) allocated_since_last_gc / used);
_allocation_fraction.sample(alloc_frac);

In 6 and 7, we have:

size_t allocation = _number_of_refills * desired_size();
double alloc_frac = allocation / (double) used;
_allocation_fraction.sample(alloc_frac);

This change was made with https://bugs.openjdk.java.net/browse/JDK-8030177: G1: Enable TLAB resizing.

The _allocation_fraction field is a float, but we compute a double value in alloc_frac and then convert it to a float in the call to sample(). Instead, we can make alloc_frac a float and avoid this double to float conversion.

Suggested change:

--- a/src/share/vm/memory/threadLocalAllocBuffer.cpp
+++ b/src/share/vm/memory/threadLocalAllocBuffer.cpp
@@ -88,7 +88,7 @@
 // The result can be larger than 1.0 due to direct to old allocations.
 // These allocations should ideally not be counted but since it is not possible
 // to filter them out here we just cap the fraction to be at most 1.0.
- double alloc_frac = MIN2(1.0, (double) allocated_since_last_gc / used);
+ float alloc_frac = MIN2(1.0f, allocated_since_last_gc / (float) used);
 _allocation_fraction.sample(alloc_frac);
 }
 global_stats()->update_allocating_threads();
@@ -205,7 +205,7 @@
 // this thread is redone in startup_initialization below.
 if (Universe::heap() != NULL) {
 size_t capacity = Universe::heap()->tlab_capacity(myThread()) / HeapWordSize;
- double alloc_frac = desired_size() * target_refills() / (double) capacity;
+ float alloc_frac = desired_size() * target_refills() / (float) capacity;
 _allocation_fraction.sample(alloc_frac);
 }
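
The effect of the suggested change can be sketched outside HotSpot as follows. This is only an illustration under stated assumptions: FractionSink is a hypothetical stand-in for the object behind _allocation_fraction (it just records the last float sample), std::min stands in for HotSpot's MIN2, and the two function names are invented for this example. The sketch assumes used is non-zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the sampled average: it takes a float, as the
// report says sample() does, and only remembers the last sample.
struct FractionSink {
    float last = 0.0f;
    void sample(float new_sample) { last = new_sample; }
};

// Shape of the JDK 8 code before the fix: the fraction is computed as a
// double and narrowed to float when it is passed to sample().
inline void sample_fraction_double(FractionSink& sink,
                                   std::size_t allocated_since_last_gc,
                                   std::size_t used) {
    double alloc_frac =
        std::min(1.0, static_cast<double>(allocated_since_last_gc) / used);
    sink.sample(static_cast<float>(alloc_frac));  // double to float conversion
}

// Shape of the suggested change: the fraction is computed directly as a
// float, so no double to float conversion is needed at the call site.
inline void sample_fraction_float(FractionSink& sink,
                                  std::size_t allocated_since_last_gc,
                                  std::size_t used) {
    float alloc_frac =
        std::min(1.0f, allocated_since_last_gc / static_cast<float>(used));
    sink.sample(alloc_frac);
}
```

For example, with 512 words allocated out of 1024 used, both variants sample 0.5, and both cap the result at 1.0 when the allocated amount exceeds used. The float variant trades some precision for avoiding the conversion that the compiler was observed to mishandle.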
19-03-2020