JDK-8143321 : Reduce the C2 compiler's memory usage
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 8,9,10
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • Submitted: 2015-11-19
  • Updated: 2018-03-08
  • Resolved: 2018-03-08
Related Reports
Description
The C2 compiler's memory usage increased significantly starting with JDK8. The increased memory usage is most noticeable when executing JavaScript applications on top of the VM. For example, for an application that consists of a set of JavaScript scripts (the reproducer attached to JDK-8129847), the VM's and the application's performance is described by the following numbers:

JDK version | LiveNodeCountInliningCutoff | RSS (MB) | Total runtime | Compilation time | Application time
==========================================================================================================
7u80        | 20'000 (default)            | 163      | 60s           | 11s              | 49s
8u60        | 20'000                      | 522      | 166s          | 127s             | 39s
8u60        | 40'000 (default)            | 976      | 414s          | 371s             | 43s


The measurement was executed on a Linux x86_64 machine with -Xbatch and a maximum heap size of 100 MB. The LiveNodeCountInliningCutoff column shows the value of the flag with the same name. As the numbers suggest, the VM's memory usage increases by around 3.2X from 7u80 to 8u60. The most likely reason for the increase is that the Nashorn JavaScript engine is used by default in 8u60. The VM's memory usage increases further (by around 1.9X) when the 8u60 VM is executed with the default value of the LiveNodeCountInliningCutoff flag; the flag's value was increased from 20'000 to 40'000 by JDK-8058148. Another likely reason for the increased memory usage is the change of the MaxNodeLimit flag's default value by JDK-8014959 and JDK-8058148.

In total, the VM's memory usage for the application considered increases by 6X from 7u80 to 8u60. JDK9 is similar to JDK8 and is also affected by this problem.

A number of issues have targeted reducing the VM's memory usage (JDK-8011858, JDK-8137160, JDK-8129847). The patches for the first two bugs result in a slight reduction of memory usage; the patch for JDK-8129847 reduces memory usage by 20-30%. However, the VM's memory usage should be reduced further.

The goal of this enhancement is to further reduce the memory usage of the compiler. This issue covers investigating the following ways in which the compiler's memory usage can be reduced.

(1) Change arrays directly addressed with node IDs (the _idx field of every compiler node) to use hash tables instead. This change should target arrays with a high impact on the compiler's memory usage.
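A minimal sketch of the idea behind (1), not HotSpot code: a mapping keyed by node `_idx` that is stored in a hash table only pays for the entries actually populated, whereas a directly-indexed array must be sized to `unique()` even when most slots stay untouched. The class name and the dominator-depth use case are hypothetical, chosen for illustration.

```cpp
#include <unordered_map>
#include <cstdint>
#include <cstddef>
#include <cassert>

// Hypothetical sketch: a sparse node-ID -> depth map. A direct array
// indexed by _idx must be allocated with unique() slots; this hash table
// only allocates storage for the node IDs that are actually written.
class SparseDomDepth {
  std::unordered_map<uint32_t, uint32_t> _depth;  // key: node _idx
public:
  void set(uint32_t idx, uint32_t d) { _depth[idx] = d; }
  // Absent entries read as 0, mirroring a zero-initialized array.
  uint32_t get(uint32_t idx) const {
    auto it = _depth.find(idx);
    return it == _depth.end() ? 0 : it->second;
  }
  size_t entries() const { return _depth.size(); }
};
```

The trade-off is constant-factor overhead per lookup, so this is only attractive for the sparsely populated, high-footprint arrays the item above targets.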

(2) For compilations with a large number of nodes, introduce an additional chunk size (in addition to the existing sizes: tiny, init, medium, size, non_pool_size). The new chunk size should be larger than the existing chunk sizes and should allow the reuse of large memory chunks that are currently allocated with the operating system's memory allocator.
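The mechanism behind (2) can be sketched as follows. This is a simplified stand-in, not the HotSpot arena allocator: chunks of a pooled size are cached on a free list and reused, while in the current code sizes outside the pooled set fall through to the OS allocator on every request.

```cpp
#include <cstdlib>
#include <vector>
#include <cassert>

// Hypothetical sketch of pooling one extra, larger chunk size. Released
// chunks are kept on a free list so a later compilation of the same
// size class reuses them instead of going back to malloc/free.
struct ChunkPool {
  const size_t _chunk_size;
  std::vector<void*> _free;            // cached chunks awaiting reuse
  explicit ChunkPool(size_t sz) : _chunk_size(sz) {}
  void* allocate() {
    if (!_free.empty()) {              // reuse a cached chunk if possible
      void* p = _free.back();
      _free.pop_back();
      return p;
    }
    return std::malloc(_chunk_size);   // fall back to the OS allocator
  }
  void release(void* p) { _free.push_back(p); }
  ~ChunkPool() {
    for (void* p : _free) std::free(p);
  }
};
```

For large-node compilations the win is that a big chunk, once obtained from the OS, stays cached across arena lifetimes instead of being returned and re-requested.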

(3) Incremental (or post-parse) inlining in C2 produces lots of dead nodes (observed on Octane/Nashorn). Multiple PhaseRenumberLive passes during incremental inlining can help further reduce peak memory usage in that scenario. Since the pass can be expensive, it can be triggered when the gap between unique and live node counts becomes too large and performed with PhaseIdealLoop (see Compile::inline_incrementally). 
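The trigger condition described in (3) can be sketched as a simple predicate. This is an illustration, not HotSpot code; the ratio threshold is an assumption chosen for the example, not an existing flag.

```cpp
#include <cassert>

// Hypothetical sketch of the trigger described above: PhaseRenumberLive
// is expensive, so only run it when the gap between allocated node IDs
// (unique) and live nodes has grown large, i.e. dead nodes dominate.
static bool should_renumber(unsigned unique_count, unsigned live_count,
                            unsigned ratio = 2) {
  // Renumber when the allocated ID space exceeds `ratio` times the
  // number of live nodes.
  return unique_count > ratio * live_count;
}
```

In the scenario mentioned in the comments below, where the unique-to-live ratio can reach 10:1 after the first IGVN, such a check would fire early in incremental inlining and keep the ID space (and everything sized by it) compact.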
 
(4) PhaseRemoveUseless and PhaseIterGVN are performed too frequently (that problem is targeted by JDK-8059241). 

Here are some notes related to (1):

Code locations that use directly-referenced arrays:
- PhaseIdealLoop::Dominators -- allocates dfsorder and ntarjan arrays of size unique();
- PhaseIdealLoop::dom_depth and PhaseIdealLoop::_idom -- proportional to unique();
- PhaseCFG::global_code_motion -- recalc_pressure_nodes -- could be large, but size not necessarily proportional to unique();
- PhaseChaitin::stretch_base_pointer_live_ranges -- derived_base_map is allocated with malloc, size proportional to unique();
- PhaseIdealLoop::_preorders -- size proportional to unique();
- Compile::_node_bundling_base
- PhaseRegAlloc::_node_regs -- size proportional to unique();
- Scheduling::_node_bundling_base, _node_latency, _uses, _current_latency -- size most likely proportional to unique();
- Compile::fill_buffer -- allocates node_offsets array of size unique(), used only in fastdebug. 

Data structures that use directly-referenced arrays:
- GrowableArray -- example usages: ConnectionGraph::nodes, DepGraph::_map, Compile::_node_note_array, LiveRangeMap::_names, LiveRangeMap::_uf_map, PhaseCFG::_node_latency
- Node_Array -- example usages: ConnectionGraph::_node_map, Matcher::_old2new_map (debug only), Matcher::_new2old_map (debug only), PhaseTransform::_nodes, Type_Array::_types
- Node_List -- example usages: Invariance::_old_new, PhaseCFG::schedule_local, Scheduling::_scheduled, Scheduling::_available
- Block_Array -- used in PhaseCFG::_node_to_block_mapping
- VectorSet -- uses _idx for checks -- already compressed, but it could perhaps be optimized further.
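To illustrate why the structures above scale with unique() rather than with the live node count, here is a minimal bitset keyed by node _idx, in the spirit of VectorSet. This is a simplified sketch for illustration, not the HotSpot implementation.

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>
#include <cassert>

// Hypothetical sketch of a node-ID-keyed bitset: one bit per possible
// _idx, so the footprint grows with the highest node ID ever inserted
// (i.e. with unique()), not with the number of live members.
class NodeBitSet {
  std::vector<uint64_t> _words;
public:
  void insert(uint32_t idx) {
    size_t w = idx / 64;
    if (w >= _words.size()) _words.resize(w + 1, 0);  // grow lazily
    _words[w] |= uint64_t(1) << (idx % 64);
  }
  bool contains(uint32_t idx) const {
    size_t w = idx / 64;
    return w < _words.size() && ((_words[w] >> (idx % 64)) & 1);
  }
};
```

With many dead nodes, most of those bits correspond to IDs that no longer name a live node, which is exactly the waste that renumbering (item 3) reclaims.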
Comments
We don't support 32-bit VMs. Memory usage is not critical for 64-bit. And we are moving from C2 to Graal.
08-03-2018

Here is an example of a test running out of memory on 32-bit windows where this (or these) change(s) could help: JDK-8162907
12-08-2016

Yes, that is correct. We are searching for other possibilities to reduce memory usage. We will push the JDK-8129847 changes anyway and we will investigate other possibilities -- that is what this RFE is for. Using hash tables instead of linear arrays is one such possibility. We may also try renumbering during parsing to reduce the initial graph. Nashorn compilation is a good testing tool which may show what we should optimize.
20-11-2015

Vladimir K, my understanding is that regular node renumbering during compilation should solve the problem with using _idx. My reading of the previous comments is that we already tried that in JDK-8129847, and it gave us a 30% improvement. But it is not enough, so we are looking for other optimization opportunities. Does my understanding correlate with reality? :-)
20-11-2015

Vladimir I., we are not suggesting to reduce the current default node limits. We are trying to figure out how to reduce the size of structures and other things which depend on the node count. As I remember from experiments, the ratio between the total number of nodes (_idx) and live nodes may reach 10 to 1 after the first IGVN in some cases. LiveNodeCountInliningCutoff is now set to 40K, which is even less than the original MaxNodeLimit of 65K. So using the live node count, or something based on it, may help a lot I think.
20-11-2015

In that case using hash tables instead of arrays for sparsely populated data structures looks promising. Not sure we have many of them though, and with the current limit the IR size itself is bigger.
20-11-2015

Hi Vladimir, thank you for clarifying. The goal of JDK-8011858, JDK-8137160, and JDK-8129847 was/is to reduce memory consumption of the compiler without causing a performance regression. Therefore, all of the patches leave the thresholds MaxNodeLimit and LiveNodeCountInliningCutoff unchanged. The current issue is similar -- we hope to get some reduction by modifying data structures without negatively influencing application performance (incl. Octane/Nashorn). Thank you and best regards, Zoltan
20-11-2015

Some background, since I initiated the default value increase. The LambdaForm caching changes (JDK-8046703) significantly changed MethodHandle chain shapes and hence the generated bytecode for compiled LambdaForms. This led to an increase in IR size during compilations with extensive invokedynamic/j.l.i usage and caused severe performance regressions (e.g. on Octane/Nashorn). As I understand, JDK-8129847 is about improving the handling of a large number of unique and total IR nodes in C2. It shows a 30% reduction in memory usage, but it doesn't address the increase in unique IR nodes due to the LF caching changes. Not sure we can do much about it w/o significantly regressing Nashorn peak performance. The new defaults improved overall peak performance even w/o the LF caching changes due to the increased amount of inlining. Nashorn produces humongous methods, and their compilation quickly reaches the limits.
20-11-2015

Hi Vladimir, thank you for pointing out that JDK-8014959 already increased the number of nodes! Here is a summary of the changes by both JDK-8014959 and JDK-8058148: - JDK-8014959: MaxNodeLimit: 65'000 -> 80'000 - JDK-8058148: LiveNodeCountInliningCutoff: 20'000 -> 40'000, MaxNodeLimit: 80'000 -> 240'000 for methods that use JSR292. Also, I updated the description of the bug to make it clearer and include all the comments. Thank you and best regards, Zoltan
20-11-2015

This RFE is filed to investigate further reduction of memory usage by C2. The 30% reduction from the JDK-8129847 changes is not enough.
19-11-2015

As I remember, the first time we increased the number of nodes was with the following change: https://bugs.openjdk.java.net/browse/JDK-8014959
19-11-2015

Hi Vladimir, yes, Vladimir K also suggested renumbering in JDK-8129847. I have a working implementation of renumbering that reduces the memory usage of the reproducer in JDK-8129847 by around 30%. I will send out the RFR once I have the results of the performance evaluation ready (measurements are currently running). Thank you and best regards, Zoltan
19-11-2015

Zoltan, have you considered node renumbering instead of hash tables?
19-11-2015