JDK-8129847 : Compiling methods generated by Nashorn triggers high memory usage in C2
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 8,9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2015-06-25
  • Updated: 2016-04-27
  • Resolved: 2015-12-01
Fixed in: 8u101, 9 b97
We saw an issue where out-of-the-box native memory usage on large machines with many cores had increased a lot between JDK 7 and JDK 8:

JDK 7 uses about 550 MB of memory
JDK 8 goes up to 1.5 GB of memory

The issue seemed to be that with tiered compilation and many compiler threads, we compiled many more methods in JDK 8, which led to a much larger native memory footprint.

The native memory usage was reduced by running with:

-XX:CICompilerCount=2 -XX:-TieredCompilation -XX:CompileThreshold=80000 -XX:ReservedCodeCacheSize=48m

We never tried each flag separately to find out which one had the biggest impact.

We should review whether any of our default values grows too large on big machines with many CPUs and a lot of memory.
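For reference, the flags above combine into a launch command along these lines. This is only a sketch: the jar name is a placeholder, and the values are the ones quoted in the report, not tuned recommendations.

```shell
# Reduced-footprint configuration from the report:
# fewer compiler threads, no tiering, a higher compile threshold,
# and a smaller code cache.
java -XX:CICompilerCount=2 \
     -XX:-TieredCompilation \
     -XX:CompileThreshold=80000 \
     -XX:ReservedCodeCacheSize=48m \
     -jar app.jar
```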
Vladimir K and I were working on a solution. Here is the latest pre-RFR webrev that we are currently discussing:

[9]: http://cr.openjdk.java.net/~zmajo/code/8129847/9/webrev.03/
[8]: http://cr.openjdk.java.net/~zmajo/code/8129847/8u/webrev.03/

Some data showing the effect of the newly added RenumberLiveNodes flag is here: http://cr.openjdk.java.net/~zmajo/code/8129847/node_count.html (also attached to this bug).

Hi, here is an intermediary report of my understanding of (and progress with) the issue.

Short summary
=============
The high memory usage of the compiler is caused by incremental inlining and by optimizations executed after incremental inlining. The impact of incremental inlining on memory usage is around 5X as large as the impact of the optimizations following it. As a workaround, the memory usage of the compiler can be reduced by changing the LiveNodeCountInliningCutoff threshold from 40'000 to 20'000 (the value used by JDK 7).

Experiments
===========
I executed the reproducer as suggested by Vladimir K. I used a locally-built OpenJDK build based on a recent checkout from: http://hg.openjdk.java.net/jdk8u/jdk8u-dev/

For the investigation, I traced the memory usage of the following two methods while they were compiled with C2:

(1) jdk.nashorn.internal.scripts.Script$Recompilation$190$341313AZ$r::L:100$loadLib$L:8141$parse$subscripts (no on the methods compiled)
(2) jdk.nashorn.internal.scripts.Script$Recompilation$181$345772AA$r::L:100$loadLib$L:8141$prog1 (no restrictions on the methods compiled)

For both methods I observe two "jumps" in memory usage: (1) while performing incremental inlining (for the first time) and (2) after escape analysis, but still in the optimization phase (I have not precisely identified the location yet).
Here are some numbers about the jumps in memory usage:

Jump (1) | Usage (RSS in MB)
Method | Before incremental inlining | After incremental inlining | Difference
========================================================================================
(1)    | 524                         | 779                        | 255
(2)    | 204                         | 470                        | 266

Jump (2) | Usage (RSS in MB)
Method | Just after escape analysis | At the end of the | Difference
====================================================================================
(1)    | 779                        | 825               | 46
(2)    | 484                        | 528               | 44

Jump 1 (the jump during incremental inlining) is around 5X larger than Jump 2 (the jump that happens after escape analysis).

As Vladimir pointed out, the compiler releases the memory it uses. The problem is that most memory pages are returned to the memory allocator, but not to the OS. So it seems we are not leaking (a lot of) memory. Here is some output generated using the mallinfo() call after the last compilation (in CompileBroker::invoke_compiler_on_method):

...
Looking into mallinfo
Non-mmapped space allocated:                                      907 407 360 bytes
Space allocated in mmapped regions:                                18 440 192 bytes
The total number of bytes in fastbin free blocks:                      73 552 bytes
The total number of bytes used by in-use allocations:              51 851 344 bytes
The total number of bytes in free blocks:                         855 556 016 bytes
The total amount of releasable free space at the top of the heap:  38 162 144 bytes
...

Problems
========
There are two problems that must be addressed:

(1) the high memory usage of incremental inlining;
(2) the high memory usage of the optimization phases following escape analysis.

Problems (1) and (2) are closely related. As I understand it, incremental inlining produces a large number of nodes, and subsequent phases occupy an amount of memory proportional to the number of nodes it generates. Problem 1 is currently tracked by JDK-8131702. The fix proposed earlier by Vladimir (see above) should reduce the impact of Problem 2.
Workaround
==========
The LiveNodeCountInliningCutoff flag sets a bound for incremental inlining. In JDK 7, the flag was set by default to 20'000. In JDK 8 and JDK 9 the flag is set to 40'000 (see JDK-8058148 for details).

I ran the reproducer with 7u80 and 8u60. For 8u60 I used both LiveNodeCountInliningCutoff=20'000 and LiveNodeCountInliningCutoff=40'000. Here are the results:

JDK version | LiveNodeCountInliningCutoff | RSS (MB) | Total runtime | Compilation time
===================================================================================
7u80        | 20'000                      | 163      | 1m0s          | 11s
8u60        | 20'000                      | 522      | 2m46s         | 127s
8u60        | 40'000                      | 976      | 6m54s         | 371s

Running 8u60 with a reduced LiveNodeCountInliningCutoff reduces the peak memory usage by a factor of ~2. The difference in runtime between the two 8u60 configurations is due to the different amount of time spent in the compiler (the measurement was executed with -Xbatch). Also, 7u80 and 8u60 use different JavaScript engines.

Reducing the LiveNodeCountInliningCutoff might be a workaround in cases where memory limitations cause the compiler to exit with an out-of-memory error. I'll continue the investigation by looking into ways to reduce the memory usage of C2. Thank you and best regards, Zoltan
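As a sketch, the workaround measured above amounts to launching 8u60 with the JDK 7 cutoff restored. The jar name is a placeholder; -Xbatch and -XX:-TieredCompilation mirror the measurement setup described in the comment.

```shell
# Restore the JDK 7 incremental-inlining cutoff (default 40000 in 8/9).
java -Xbatch -XX:-TieredCompilation \
     -XX:LiveNodeCountInliningCutoff=20000 \
     -jar app.jar
```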

Hi Vladimir, I'll start working on this issue once I finish the bugs/enhancements that are currently in progress. Thank you and best regards, Zoltan

To get NMT statistics, add the flags: -XX:NativeMemoryTracking=detail -XX:+PrintNMTStatistics
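Assuming a standard JDK, the NMT data can also be queried while the process is still running, via jcmd. A sketch; <pid> is a placeholder for the java process id and app.jar for the target:

```shell
# Start the VM with NMT enabled; PrintNMTStatistics dumps a report at VM exit.
java -XX:NativeMemoryTracking=detail -XX:+PrintNMTStatistics -jar app.jar &
# While it runs, query the live NMT data ("summary" or "detail").
jcmd <pid> VM.native_memory summary
```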

In another window, watch the 'top' command output, or use other tools that show the used (reserved) memory of the java process:

PID   USERNAME NLWP PRI NICE SIZE RES  STATE  TIME CPU   COMMAND
13460 vkozlov  15   0   0    971M 934M cpu/16 0:36 3.16% java

Use an 8 update JDK.

tar xfz reproduce-jdk-bug.tar.gz
cd reproduce-jdk-bug
export JAVA_HOME=/net/koori/onestop/jdk/8u60/latest/binaries/solaris-x64
$JAVA_HOME/bin/java -Dapplication.home=$JAVA_HOME -XX:-TieredCompilation -Xms8m -Xbatch -Xmx100m -XX:CICompilerCount=1 -XX:ParallelGCThreads=4 -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintCompilation2 -cp $JAVA_HOME/lib/tools.jar:$JAVA_HOME/classes com.sun.tools.script.shell.Main require/r.js -o opt_test.js

Zoltan, when you finish your current bugs/RFEs, please look at this.

Something like the following:

  // No more new expensive nodes will be added to the list from here
  // so keep only the actual candidates for optimizations.
  cleanup_expensive_nodes(igvn);

+ // Reset nodes indexing and GVN data.
+ if (!failing()) {
+   initial_gvn()->replace_with(&igvn);
+   PhaseRemoveDead prd(initial_gvn());
+   igvn = PhaseIterGVN(initial_gvn());
+ }

  // Perform escape analysis
  if (_do_escape_analysis && ConnectionGraph::has_candidates(this)) {

Looks like the main problem is that with a huge (> 100K) node graph, the max size of chunks (32k) in Arena is too small:

[0xfffffd7f9d0cdd3a] void*Chunk::operator new(unsigned long,AllocFailStrategy::AllocFailEnum,unsigned long)+0x21a
[0xfffffd7f9d0ceb08] void*Arena::grow(unsigned long,AllocFailStrategy::AllocFailEnum)+0x88
[0xfffffd7f9d0d0054] void*Arena::Amalloc(unsigned long,AllocFailStrategy::AllocFailEnum)+0x124
[0xfffffd7f9dbfe659] char*ResourceArea::allocate_bytes(unsigned long,AllocFailStrategy::AllocFailEnum)+0x159
[0xfffffd7f9dc12090] char*resource_allocate_bytes(unsigned long,AllocFailStrategy::AllocFailEnum)+0x40
[0xfffffd7f9d559351] void PhaseIdealLoop::Dominators()+0x61
[0xfffffd7f9d9af045] void PhaseIdealLoop::build_and_optimize(bool,bool)+0x725
[0xfffffd7f9d3f4031] PhaseIdealLoop::PhaseIdealLoop #Nvariant 1(PhaseIterGVN&,bool,bool)+0xd1
******* os::malloc(18814592) = 0xfffffd7f70010030

As a result we use OS malloc and free for such allocations. But by default, freed memory is not returned to the system except when we use libumem.so. I tried using huge chunks (up to 16 MB) to avoid the mallocs, but it did not help much.

Note, C2 uses unique() to create a lot of tables whose elements are referenced by Node::_idx. It can't use live_nodes() for that. Note 2, live_nodes() could be just 10% of unique() nodes.

So the only solution I think we can do is to reset the _idx of all live nodes after incremental inlining, so that the unique() value becomes the current live_nodes(). We have to create a new PhaseIterGVN with all its tables (_types, _table, _worklist, etc.) recreated from scratch.

ILW = big memory usage, possible memory leak; in case of lots of compiler threads; none => MMH => P3

Another observation: it could be loop opts, since the method has a lot of loops:

Loop: N0/N0
Loop: N348429/N8259 limit_check predicated
Loop: N348430/N17961 limit_check predicated
Loop: N348431/N69230 limit_check predicated
Loop: N348432/N72658 limit_check predicated
Loop: N348433/N66843 limit_check predicated
Loop: N348434/N94292 limit_check predicated
Loop: N348435/N125477 limit_check predicated
Loop: N348436/N146052 limit_check predicated
Loop: N348437/N178160 limit_check predicated
Loop: N348438/N214461 limit_check predicated
Loop: N348439/N246489 limit_check predicated
Loop: N348440/N278034 limit_check predicated
Loop: N348441/N313971 limit_check predicated
Loop: N348442/N344761 limit_check predicated

Looks like C2 compilation of Nashorn methods is the culprit indeed. I ran with -Xbatch -XX:-TieredCompilation and observed the RES memory of java (with the 'top' command). During compilation, between the following two lines of PrintCompilation output, RES memory jumped by 500 MB:

42300 1789 b jdk.nashorn.internal.scripts.Script$Recompilation$212$341313AZ$r::L:100$loadLib$L:8141$parse$subscripts (315 bytes)
70846 1789 size: 702776(269472) time: 28544 inlined: 75834 bytes

It also looks like NMT does not track that usage.

Looks like memory usage scales with the number of compiler threads (and we have a lot of them on sparc in the default tiered mode):

> no arguments             - RSS 3 GB
> -J-XX:CICompilerCount=2  - RSS 780 MB
> -J-XX:CICompilerCount=4  - RSS 1.53 GB
> -J-XX:CICompilerCount=12 - RSS 3 GB (default on his machine)

NMT shows that there are requests from the compiler for huge memory in an arena:

[0x00007fa294497cb6] Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum)+0x46
[0x00007fa294b1681d] PhaseIterGVN::PhaseIterGVN(PhaseGVN*)+0x39d
[0x00007fa29465143d] Compile::inline_incrementally_one(PhaseIterGVN&)+0x13d
[0x00007fa294654beb] Compile::inline_incrementally(PhaseIterGVN&)+0x34b
(malloc=426040KB #325)

Log compilation shows that there are compilations with 300K and more nodes. Note, only 40K nodes are live:

<parse_done nodes='311444' live='40074' memory='82825032' stamp='10.398'/>

But even with 300K nodes and, say, 100 bytes per node, we should use only 30 MB. Looks like we are leaking memory in the Arena. Note, after compilation that memory is released, so memory usage only spikes during compilation.

I think I found the leak. In inline_incrementally_one() we do:

igvn = PhaseIterGVN(gvn);

which overwrites the IGVN created in the caller Compile::Optimize(). Mostly this is fine because before that we did:

gvn->replace_with(&igvn);

so that _table and _types still point to the same data arrays. But gvn is a PhaseGVN, while PhaseIterGVN has an additional structure, Node_Stack _stack, and PhaseIterGVN(gvn) allocates a new one each time:

PhaseIterGVN::PhaseIterGVN( PhaseGVN *gvn ) : PhaseGVN(gvn),
  _worklist(*C->for_igvn()),
  _stack(C->unique() >> 1),
  _delay_transform(false)

Note, we also use C->unique() instead of live_nodes(). So in theory, if we have a lot of incremental inlining, it may add up. INode size is 16 bytes: 300K / 2 * 16 = 2.4 MB.
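The closing arithmetic can be checked mechanically. This sketch just reproduces the estimate from the comment above; the 300K unique() count and the 16-byte INode size are the figures quoted in the text, not measured values.

```shell
# Node_Stack is sized from C->unique() >> 1, and each INode is 16 bytes,
# so every PhaseIterGVN(gvn) copy allocates roughly this many bytes.
unique=300000      # unique() node count from the compilation log
inode_bytes=16     # INode size quoted in the comment
stack_bytes=$(( (unique / 2) * inode_bytes ))
echo "${stack_bytes} bytes per PhaseIterGVN _stack allocation"   # ~2.4 MB
```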