JDK-7145569 : G1: optimize nmethods scanning
  • Type: Enhancement
  • Status: Resolved
  • Resolution: Fixed
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P3
  • Affected Version: 8
  • OS: generic
  • CPU: generic
  • Submit Date: 2012-02-14
  • Updated Date: 2014-02-20
  • Resolved Date: 2013-08-15
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

7u60 : Fixed
8    : Resolved
The current mechanism that maintains the list of scavenge roots in nmethods is unsuitable for G1. The list implies that there are two types of collections: young (after which the list should be pruned) and full (during which the whole code cache is scanned). Currently G1 does not prune this list at all, so it can grow quite large. We need something more flexible. The best approach I see is to have an RSet for each G1 region that contains the nmethods pointing into that region. This would allow us to scan only the required nmethods and also to parallelize the scanning process.
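The per-region idea above can be sketched in standalone C++. This is a minimal illustration, not HotSpot code: the class and type names (CodeRootRemSet, RegionId, NMethodId) are hypothetical stand-ins, and std::unordered_set stands in for whatever compact set the real implementation would use. The point is the shape of the data structure: each region records the nmethods pointing into it, so a young GC scans only the recorded nmethods for the regions in the collection set instead of the whole code cache.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-ins: a "region" is just an index, an "nmethod" an id.
using RegionId = int;
using NMethodId = uint64_t;

// Per-region code-root remembered set: for each heap region, the set of
// nmethods that contain embedded pointers into that region.
class CodeRootRemSet {
  std::unordered_map<RegionId, std::unordered_set<NMethodId>> roots_;
public:
  // Called when a compiled method with an oop pointing into `region`
  // is registered with the code cache.
  void add_code_root(RegionId region, NMethodId nm) {
    roots_[region].insert(nm);
  }
  // Called when an nmethod is unloaded; it must be removed from the
  // remembered set of every region it pointed into.
  void remove_code_root(NMethodId nm) {
    for (auto& entry : roots_) entry.second.erase(nm);
  }
  // The scan set for one region of the collection set: only these
  // nmethods need scanning, and different regions can be scanned by
  // different worker threads.
  const std::unordered_set<NMethodId>& code_roots(RegionId region) {
    return roots_[region];
  }
};
```

Because the scan work is naturally keyed by region, parallelization falls out of the structure; the per-region keying is also what later causes the work imbalance noted in the evaluation below, since a few regions may accumulate most of the code roots.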

The graph (ext_root_scan_avg_max.png) shows average and max ext root scan times for a baseline (b99 runs, yellow and red) and with the nmethod patch (green and blue). Note that for the baseline, ext root scan contains both the code cache scan and the system dictionary scan (and interned strings), while with the nmethod patch the code scan has been moved out of this phase, i.e. what remains is basically the system dictionary scan. Unfortunately, in the baseline run there was a full GC during which parts of the system dictionary were unloaded; that point has been marked on the graph. The graph still shows that the max ext root scan time in particular (presumably the system dictionary scan) grows linearly; if there had been no full GC in the baseline run, this would at some point also show in its max ext root scan time (which is dominated by the code cache scan there).

Nashorn benchmark results before/after patch

Attached are performance results from the Nashorn JavaScript benchmarks with and without the patch, following https://wiki.openjdk.java.net/display/Nashorn/Monitoring+Nashorn+Performance (a JavaScript engine that does a lot of compilation, so it stresses registering and unregistering nmethods), i.e. using java -jar target/microbenchmarks.jar ".*octane.*" -wi 20 -i 100 -f 5. No difference. (The NavierStokes test looks slower here, but rerunning showed that the variance of the means is too high; a rerun basically showed the reverse result.)

After testing with a small version of the program the original CR was reported for (JDK-8004888):

- There was an issue with unsynchronized access to the per-region code cache remembered set: the call to unregister_method() was not synchronized via the CodeCache_lock. This corrupted the data structure in various ways.
- Code root handling no longer grows almost linearly with code cache size. Code root marking, done during an initial mark phase, settles after some time as expected; see the blue line in code_root_handling.png. The other additional (serial) components that always run are comparatively cheap: scanning the code roots themselves (yellow; so small it is not even visible). Code root fixup (updating non-evacuated regions, green) and code root migration (updating evacuated regions, red) may need to be parallelized. There is a big threading imbalance in code root marking that probably needs some more work.
- There are still linearly growing components in the GC time, probably the system dictionary scanning as expected; this is JDK-8003420. See ext_root_scan.perf.png, where the graph tends to grow linearly until interrupted by a full GC. Total time is also very small, but only because we are not running a "big" test.
- No performance regressions with builds not containing this patch.
- The graph in object_copy_update_rs_scan_rs.png shows object copy, update RS and scan RS times; as seen there, remembered set handling already takes a significant share of the GC pause, even more than object copying. See object_copy_update_rs_scan_rs.detail.png for detail. (Another contributor of the same magnitude seems to be the new code root marking, by the way.)
- The remembered set approaches 10% of the live heap size in this application (e.g. 106M remembered set for a 1.3G live heap), which seems to be the cause of the above problem of remembered set management taking a lot of time.
- Sometimes there is a lot of general "Other" time; this is probably JDK-8019342 ("G1: High "Other" time most likely due to card redirtying") with higher impact than described in that CR, i.e. this "Other" time sometimes takes >20% of total GC time.
- Logs about memory consumption show a big imbalance in the code root remembered sets: sometimes a few regions have ~30k nmethods to scan, while others are almost empty. The problem seems to be that work is distributed on a per-region basis, which does not work very well.
- Total memory overhead of the new remembered set is 1.15MB, negligible compared to the other remembered set overhead.
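The first finding above, the unsynchronized unregister path, can be illustrated with a small standalone C++ sketch. This is not the HotSpot code: the class name and std::mutex are hypothetical stand-ins (in HotSpot the guard is the CodeCache_lock). The fix amounts to taking the same lock on the unregister path that the register path already holds, so concurrent mutation cannot corrupt the set.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Both the registration and the unregistration path mutate the code-root
// set, so they must run under the same lock; std::mutex stands in for
// HotSpot's CodeCache_lock here.
class LockedCodeRootSet {
  std::mutex lock_;                     // stand-in for CodeCache_lock
  std::unordered_set<long> nmethods_;   // ids of registered nmethods
public:
  void register_nmethod(long nm) {
    std::lock_guard<std::mutex> guard(lock_);
    nmethods_.insert(nm);
  }
  void unregister_nmethod(long nm) {
    // The bug described above: this path mutated the structure WITHOUT
    // taking the lock, racing with concurrent registrations.
    std::lock_guard<std::mutex> guard(lock_);
    nmethods_.erase(nm);
  }
  size_t size() {
    std::lock_guard<std::mutex> guard(lock_);
    return nmethods_.size();
  }
};
```

With both paths serialized on one lock, register/unregister pairs from compiler threads and the sweeper can interleave freely without leaving the set in an inconsistent state.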

Summary comments from Thomas, Jenny and Bengt (concerning the 25k runs):
1. The log entries for the new nmethod handling are there: 'Code Root Scanning', 'Code Root Marking' and 'Code Root Migration'.
2. The average object copy time seems slightly better in the nmethod build, but this is offset by the code root handling.
3. There are two to-space exhaustions in the base runs, which causes larger pause time variation.
4. Given the comments about the pmap output and the finished Aurora testing with no regressions, we can conclude that the additional memory consumption is negligible (at least for this benchmark).
5. Overall this change sounds beneficial to performance, so it looks good.

See description.