JDK-8078904 : CMS: Assert failed: Ctl pt invariant
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u60,9
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2015-04-29
  • Updated: 2017-07-26
  • Resolved: 2015-08-08
  • Fix Version: JDK 9 b79 (Fixed)
Description
Test: jdk/test/java/lang/management/MemoryMXBean/ResetPeakMemoryUsage.java

Crashed on Linux 64 and Solaris 64. Same stacktrace in both cases.

#  Internal Error (/hotspot/src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp:4992), pid=1822, tid=140235732694784
#  assert(_cursor[j] == _survivor_plab_array[j].end()) failed: Ctl pt invariant
#
# JRE version: Java(TM) SE Runtime Environment (9.0) (build 1.9.0-internal-fastdebug-20150428213602.jesper.8073204-b00)

Current thread (0x00007f8bb4264800):  VMThread [stack: 0x00007f8b2cf83000,0x00007f8b2d084000] [id=1878]

Stack: [0x00007f8b2cf83000,0x00007f8b2d084000],  sp=0x00007f8b2d0822e0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x10558a1]  VMError::report_and_die()+0x151
V  [libjvm.so+0x7687bb]  report_vm_error(char const*, int, char const*, char const*)+0x7b
V  [libjvm.so+0x72a038]  CMSCollector::merge_survivor_plab_arrays(ContiguousSpace*, int)+0x168
V  [libjvm.so+0x72a3d3]  CMSCollector::initialize_sequential_subtasks_for_young_gen_rescan(int)+0x63
V  [libjvm.so+0x72cb23]  CMSCollector::checkpointRootsInitialWork()+0x543
V  [libjvm.so+0x72ce41]  CMSCollector::checkpointRootsInitial()+0xe1
V  [libjvm.so+0x743c97]  CMSCollector::do_CMS_operation(CMSCollector::CMS_op_type, GCCause::Cause)+0x3f7
V  [libjvm.so+0x1053047]  VM_CMS_Initial_Mark::doit()+0xf7
V  [libjvm.so+0x107fef3]  VM_Operation::evaluate()+0xa3
V  [libjvm.so+0x107d78e]  VMThread::evaluate_operation(VM_Operation*)+0x14e
V  [libjvm.so+0x107e0c3]  VMThread::loop()+0x4b3
V  [libjvm.so+0x107e2f9]  VMThread::run()+0xb9
V  [libjvm.so+0xda9bc2]  java_start(Thread*)+0xf2

VM_Operation (0x00007f8b2db18dc0): CMS_Initial_Mark, mode: safepoint, requested by thread 0x00007f8bb4188000


elapsed time: 2 seconds (0d 0h 0m 2s)

Comments
Bug found by nightly testing. Verified by a passing nightly run.
26-07-2017

The question is rather: why does it assert, then?
- _survivor_plab_arrays contains the PLABs for a given thread within the survivor space.
- _survivor_chunk_array should contain all PLABs within the survivor space.
_survivor_chunk_array should be large enough, as it is sized to hold the maximum number of PLABs possible in the survivor space, i.e. max_plab_samples should be sufficient (for some reason it allocates twice that).

The problem seems to be the discrepancy between the initial PLAB size (in the constructor of ParNewGeneration, _plab_stats is initialized with a PLAB size of YoungPLABSize) and the *minimum* PLAB size used to calculate the lengths of these buffers, which is based on MinTLABSize (plab_sample_minimum_size(), which should equal PLAB::min_size(), but does not). G1 implements the same thing, but there it only results in too-small PLABs being used initially. I assume you can create problems if Young/OldPLABSize is larger than the humongous object threshold, or really small (less than the alignment requirement).

Suggested fix (there may be much better options): when constructing the PLAB, make sure that the passed desired_plab_sz is within the bounds of PLAB::min_size()/PLAB::max_size() - or alternatively, in the getter for PLAB::desired_sz(), always clamp the internal value to [min_size(), max_size()) before returning it. Then the clamping in adjust_desired_sz() can also go away. (A rough sketch of this clamping follows after this comment.)

Also:
- Make sure that PLAB::min_size() and PLAB::max_size() are used consistently throughout.
- Likewise for TLAB::min_size() and TLAB::max_size().
- I think CMSCollector::plab_sample_minimum_size() should simply use PLAB(Stats)::min_size() instead of an additional ad-hoc formula.
- Add argument checking for Min/MaxTLABSize, YoungPLABSize, and OldPLABSize.
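
To make the suggested fix concrete, here is a minimal, self-contained sketch of the clamping idea. This is not the actual HotSpot code; the class name, the placeholder sizes and the clamp helper are made up for illustration, and only the bounding to [min_size(), max_size()] mirrors the suggestion above.

  #include <algorithm>
  #include <cstddef>

  // Hypothetical stand-in for a PLAB sizing policy: the desired size is
  // clamped so callers never see a value outside [min_size(), max_size()].
  class PLABSketch {
   public:
    static size_t min_size() { return 256; }    // placeholder, in heap words
    static size_t max_size() { return 65536; }  // placeholder, in heap words

    // Option 1: clamp once when the PLAB (stats) object is constructed.
    explicit PLABSketch(size_t desired_plab_sz)
        : _desired_plab_sz(clamp(desired_plab_sz)) {}

    // Option 2: always clamp in the getter, so an out-of-range internal
    // value can never leak into sizing calculations.
    size_t desired_plab_sz() const { return clamp(_desired_plab_sz); }

   private:
    static size_t clamp(size_t v) {
      return std::min(std::max(v, min_size()), max_size());
    }
    size_t _desired_plab_sz;
  };

With either option in place, the separate clamping mentioned above for the adjust function would become redundant.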
12-05-2015

I commented out the assert to see the downstream effect. It results in a really bad distribution of work over the from space in CMSParMarkTask::do_young_space_rescan(), which happens during VM_CMS_Initial_Mark. For example, with -XX:MinTLABSize=32K and 18 PGC threads, 134 tasks are created; 133 of them cover the first 1.2M of the from space, and the last task gets the remaining ~900K of a ~2200K from space. The whole from space is still scanned correctly, just in an unbalanced way.
11-05-2015

Thanks Stefan, I can reproduce it with your instructions above, even with the backout applied.
08-05-2015

Possibly MinTLABSize * number of GC threads is larger than the available Survivor/Old gen, which might be the actual problem here. (Just thinking about possible problems, given the MinTLABSize=1M parameter in that example.)
08-05-2015

Eric, I don't think this bug should be closed. The backout in JDK-8079556 only hides the problem; the real problem is still there. Running my reproducer above reproduces the assert with the latest jdk9/hs-gc.
07-05-2015

This was resolved by the backout in JDK-8079556.
07-05-2015

The problem seems to be that with the change in JDK-8073204, the size returned by PLABStats::desired_plab_sz() can drop below the minimum PLAB size, due to truncating integer division in that function, and the result is also not properly aligned. Several places then get wrong results because of this broken invariant. By the way, it may also cause problems in G1, as it uses the same code.
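
A generic illustration of that failure mode, assuming nothing about the real PLABStats formula (the function names and parameters below are made up):

  #include <cstddef>

  // Made-up example: integer division can truncate the computed PLAB size
  // below the minimum and leave it unaligned.
  size_t naive_desired_plab_sz(size_t allocated_words, unsigned gc_workers) {
    return allocated_words / gc_workers;  // truncates; may be < minimum, unaligned
  }

  // One possible repair: enforce the minimum and round up to the alignment.
  size_t bounded_desired_plab_sz(size_t allocated_words, unsigned gc_workers,
                                 size_t min_words, size_t align_words) {
    size_t sz = allocated_words / gc_workers;
    if (sz < min_words) {
      sz = min_words;
    }
    // round up to the next multiple of align_words
    return ((sz + align_words - 1) / align_words) * align_words;
  }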
07-05-2015

We have several changes in hs-gc that would be good to get into main asap. I'll push as much as possible of hs-gc to main tomorrow morning PT. It would be good if JDK-8073204 could be backed out before then unless this bug can be fixed by tomorrow.
06-05-2015

This suddenly started happening because it appears to be a side effect of the fix for JDK-8073204, which changed the PLAB sizes. I created a build with that patch rolled back and the test passes. I'll have to look closer at that change and at Stefan's comment above to decide what to do.
06-05-2015

A workaround is to run without the parallel initial mark and parallel remark: -XX:-CMSParallelInitialMarkEnabled -XX:-CMSParallelRemarkEnabled
05-05-2015

merge_survivor_plab_arrays() iterates over [0, _survivor_chunk_capacity) and selects *one* _survivor_plab_array entry per iteration. At the end of the iteration it assumes that all _survivor_plab_array entries have been selected. However, _survivor_chunk_capacity is less than the number of _survivor_plab_array entries:

  _survivor_chunk_capacity = 2*max_plab_samples;
  _survivor_chunk_array = NEW_C_HEAP_ARRAY(HeapWord*, 2*max_plab_samples, mtGC);

and

  _survivor_plab_array = NEW_C_HEAP_ARRAY(ChunkArray, ParallelGCThreads, mtGC);
  for (uint i = 0; i < ParallelGCThreads; i++) {
    HeapWord** vec = NEW_C_HEAP_ARRAY(HeapWord*, max_plab_samples, mtGC);
    ChunkArray* cur = ::new (&_survivor_plab_array[i]) ChunkArray(vec, max_plab_samples);

so we can have ParallelGCThreads * max_plab_samples entries across all the _survivor_plab_arrays, but only 2 * max_plab_samples slots in the _survivor_chunk_array.
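
A small numeric illustration of that mismatch; the values below are made up, only the relation between the two sizes matters:

  #include <cstddef>
  #include <cstdio>

  int main() {
    const unsigned parallel_gc_threads = 18;   // e.g. -XX:ParallelGCThreads=18
    const size_t   max_plab_samples    = 128;  // made-up value

    size_t chunk_capacity = 2 * max_plab_samples;                    // 256 merge slots
    size_t total_samples  = parallel_gc_threads * max_plab_samples;  // up to 2304 samples

    // merge_survivor_plab_arrays() runs chunk_capacity iterations and consumes
    // one sample per iteration, so many samples can be left behind, which is
    // exactly the state the assert _cursor[j] == _survivor_plab_array[j].end()
    // complains about.
    std::printf("capacity=%zu, possible samples=%zu\n", chunk_capacity, total_samples);
    return 0;
  }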
05-05-2015

Happened for me while running SPECjvm98: cd /localhome/tests/gc-test-suite/specjvm98 ; /home/stefank/hg/jdk9/hs-gc/build/linux-x86_64-normal-server-fastdebug/jdk//bin/java -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -Xms32m -Xmx64m SpecApplication -s100 -g _209_db
05-05-2015

I can reproduce this with JDK 8: (cd /localhome/tests/gc-test-suite/specjvm98 ; /localhome/java/jdk-8-fcs-bin-b132/fastdebug/bin/java -XX:MinTLABSize=1m -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+CMSParallelInitialMarkEnabled -XX:+CMSParallelRemarkEnabled -XX:+VerifyBeforeGC -XX:+VerifyAfterGC -XX:+VerifyDuringGC -XX:+ShowMessageBoxOnError -Xms32m -Xmx64m SpecApplication -s100 -g _209_db)
05-05-2015

Eric, have you investigated which change caused this? Since this bug seems to be confined to hs-gc, it is currently blocking pushes from jdk9/hs-gc to jdk9/hs.
04-05-2015

I can reproduce this on a local box with a slowdebug build.
30-04-2015

There is something suspicious here: the test calls mbean.gc() to do a heap calculation, but it was run with -XX:+ExplicitGCInvokesConcurrent.
29-04-2015

ILW = High (crash), Low (happened twice), High (none) = P2
29-04-2015