The graph (ext_root_scan_avg_max.png) shows average and maximum ext root scan times for a baseline (b99 runs, yellow and red) and for runs with the nmethod patch (green and blue).
Note that for the baseline, ext root scan contains both the code cache scan and the system dictionary scan (and interned strings), while with the nmethod patch the code cache scan has been moved out of this phase.
I.e. what remains is basically the system dictionary scan.
Unfortunately, in the baseline run there was a full GC during which parts of the system dictionary were unloaded. That point in time has been marked.
The graph still shows that particularly the max ext root scan time (presumably the system dictionary scan) grows linearly; and if there had been no full GC in the baseline run, this growth would at some point also show up in its max ext root scan time (which is dominated by the code cache scan there).
I.e. using: java -jar target/microbenchmarks.jar ".*octane.*" -wi 20 -i 100 -f 5
(The NavierStokes test looks slower here, but rerunning showed that the variance of the means is too high; i.e. a rerun basically showed the reverse result.)
After testing with a small version of the program that the original CR (JDK-8004888) was reported against:
- there was an issue with unsynchronized access to the per-region code cache remembered set: the call to unregister_method() was not synchronized via the CodeCache_lock. This resulted in various corruptions of the data structure.
- code root handling no longer increases almost linearly with code cache size.
Code root marking, done during an initial mark phase, settles after some time as expected; see the blue line in code_root_handling.png. Other additional (serial) components that are always performed are comparatively small, e.g. scanning the code roots themselves (yellow, so small it is not even visible).
Code root fixup (updating non-evacuated regions, green) and code root migration (updating evacuated regions, red) may need to be parallelized.
There is a big imbalance in the threading of code root marking that probably needs some more work.
- there are still linearly growing components in the GC time, probably the system dictionary scanning as expected; this is JDK-8003420. See ext_root_scan.perf.png, where the graph tends to grow linearly until it is interrupted by a full GC. Total time is also very small, but that is because we are not running a "big" test.
- no performance regressions compared to builds not containing this patch
- The graph in object_copy_update_rs_scan_rs.png shows object copy, update RS and scan RS times; as seen in the figure, remembered set handling already takes a significant fraction of the GC pause, even more than object copying. See object_copy_update_rs_scan_rs.detail.png for detail. (Another significant contributor of the same magnitude seems to be the new code root marking, btw.)
- the remembered set approaches 10% of the live heap size in this application (e.g. a 106M remembered set for a 1.3G live heap, i.e. about 8%), which seems to be the cause of the above problem of remembered set management taking a lot of time.
- sometimes there is a lot of general "Other" time; this is probably JDK-8019342 ("G1: High "Other" time most likely due to card redirtying") with higher impact than described in the CR, i.e. this "Other" time sometimes takes >20% of total GC time.
- logs about memory consumption show a big imbalance in the references in the code root remembered sets: sometimes a few regions have ~30k nmethods to scan while others are almost empty. The problem seems to be that work is distributed on a per-region basis, which does not work well with such skew.
- total memory overhead of the new remembered set is 1.15MB in total, which is negligible in light of the other remembered set overhead.
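The unregister_method() race described above can be illustrated with a minimal, generic sketch. This is plain Java, not HotSpot code, and all names (CodeRootSet, CODE_CACHE_LOCK) are hypothetical; the point is only that every mutation of the per-region code root set must happen under the same lock as the rest of the code cache bookkeeping:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the fix, not the actual HotSpot data structure.
// A shared lock (standing in for the CodeCache_lock) guards every
// mutation of the per-region code root set.
class CodeRootSet {
    private static final Object CODE_CACHE_LOCK = new Object();
    private final Set<Long> nmethods = new HashSet<>();

    void registerMethod(long nmethod) {
        synchronized (CODE_CACHE_LOCK) { // registration already held the lock
            nmethods.add(nmethod);
        }
    }

    void unregisterMethod(long nmethod) {
        // The bug: this mutation ran WITHOUT the lock, racing with
        // concurrent registration/iteration and corrupting the set.
        synchronized (CODE_CACHE_LOCK) { // the missing synchronization
            nmethods.remove(nmethod);
        }
    }

    int size() {
        synchronized (CODE_CACHE_LOCK) {
            return nmethods.size();
        }
    }
}
```

Without the lock in unregisterMethod(), a remove racing with an add on the underlying hash table can leave it in an inconsistent state, matching the "various corruptions of the data structure" seen above.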
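One way the per-region imbalance noted above could be addressed is chunked work distribution: instead of handing whole regions to workers (leaving threads idle when one region holds ~30k nmethods and others are almost empty), split each region's code root list into fixed-size chunks that workers claim atomically. A minimal sketch, with hypothetical names and not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: chunked claiming of code roots across GC workers,
// so a single crowded region is shared instead of serialized on one thread.
class ChunkedCodeRootScan {
    static final int CHUNK = 64;                 // roots per work unit
    final List<long[]> chunks = new ArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    // Split one region's code roots into CHUNK-sized work units.
    void addRegion(long[] regionRoots) {
        for (int i = 0; i < regionRoots.length; i += CHUNK) {
            int end = Math.min(i + CHUNK, regionRoots.length);
            long[] c = new long[end - i];
            System.arraycopy(regionRoots, i, c, 0, c.length);
            chunks.add(c);
        }
    }

    // Each worker thread calls this in a loop until it returns null.
    long[] claimNextChunk() {
        int i = cursor.getAndIncrement();
        return i < chunks.size() ? chunks.get(i) : null;
    }
}
```

With this scheme a region with 30k nmethods becomes ~470 independent work units, while empty regions contribute none, so load balancing no longer depends on the per-region distribution.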
Summary comments from Thomas, Jenny and Bengt (concerning 25k runs):
1. The log output for the new nmethod phases is there: 'Code Root Scanning', 'Code Root Marking' and 'Code Root Migration'.
2. It seems the average object copy time is slightly better in the nmethod build, but this is offset by the code root handling.
3. There are 2 to-space exhausted events in the base runs, which gives larger pause time variation.
4. Given your comments about the pmap output and the finished aurora testing with no regressions, I think we can conclude that the additional memory consumption is negligible (at least for this benchmark).
5. It sounds like this change is beneficial to overall performance, so it looks good.