JDK-6316156 : C2 method-size tuning parameters need update
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 6,9,10
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • OS: generic
  • CPU: generic
  • Submitted: 2005-08-25
  • Updated: 2020-07-28
  • Resolved: 2018-03-08
Related Reports
Relates :  
Relates :  
Relates :  
Certain tunable parameters are sensitive to the size of compiled methods.
They need revisiting, since machines are larger than when the parameters
were last tuned (Tiger or before).

In particular, certain newer optimizations (bimorphic inlining) create larger
methods which in turn fall foul of the restrictively turned parameters.

Parameters which may need inflation include:

Vladimir reports that changes optimizing JVM98 run into InlineSmallCode limits below 2200.

No yet. Instead of tuning these flags we may consider different approaches for inlining.

We are moving from C2 to Graal which has different parameters.

Some inlining heuristics estimate the potential benefit of inlining a method based on metrics derive from that method in a context-free manner. The simplest (probably oldest) metric is the number of bytes in the bytecode encoding of the method body. More complex metrics could be collected in the same phase the generates EA summaries; see bcEscapeAnalyzer.hpp. As noted above, bytecode size should be weighted or adjusted according to the likely native code size of the bytecodes, and also the likelihood of those bytecodes being on actual hot code paths. For example, data movement instructions, and/or instructions on dead-end paths, can generally be ignored. A summarization phase can spend a little more time computing and storing metrics that reflect these realities. A summarization phase can also look for opportunities for in-context strength reduction or dead code elimination. For example, if an integral method parameter is compared against some value, and the result used to gate a block of code in a potentially inlined method, the summary for that method could include that fact, something to the effect that "if arg#2 is (or is not) of range-type T, then the size metric decreases by N". That would cover the common case of methods with optional parameters, where the default value of the optional disables some complex side-behavior of the method, and effectively simplifies that method's code. Similarly, a method argument whose dynamic type is used for polymorphic dispatch (e.g., Object.hashCode called by HashMap.get) can be noted in a method summary, and trigger increased inlining effort when the caller is known to pass a monomorphically typed reference for that argument. These argument sensitivities are approximately transitive across multiple levels of inlining, which means that recursive method summarization is a good pattern for collecting this information. E.g., HashMap.get calls the helper routine hash, which then calls Object.hashCode, but the same optimization for monomorphic key references applies regardless of which method actually makes the call to Object.hashCode. Thus, a recursive method summarization algorithm for inlining should look at HashMap.get and HashMap.hash as a unit. The summary for HashMap.get should record an assumption that HashMap.hash *must* be inlined as well, or else the predicted benefit from a monomorphic reference will *not* materialize. Thus, the summary of incremental benefit from inlining HashMap.get depends on a provisional inlining decision *inside* the summarizer. In other words, such smart inlining heuristics include a chicken-vs-egg ordering problem. Warning: This is typical of graph-tiling algorithms, and contributes to their intractable, unstable nature. But it's worth a try, if we can get some clear visibility into the effects of our attempts.

Stability is a huge problem, and not just stability of profile inputs, but stability of the final result. Choosing an inlining of a dynamic call graph is a kind of graph-tiling problem, which seems to have no efficient solution (probably NP-complete). To make decisions, we use an irregular suite of ad hoc heuristics. The implementation complexity is high, making it hard to add more heuristics. Worse, the end result is unstable. It is common for us to try to a promising heuristic that makes many programs work faster, to find some important customer workload is made suddenly slower. It's typical of NP problems that attempts at incremental solutions produce such sudden regressions. Even adjusting the existing tuning parameters is dangerous. They were set at the current points after benchmark-driven experiments about 15 years ago. They are almost certainly out of data, but even incremental changes to them are likely to cause sudden regressions, for some small percentage of users. Dealing with those risks would require a large amount of testing and ad hoc remediation. We can tinker more with the existing fragile C2 code, but I think the best payoff to improve the heuristics will be in a C2 replacement, for two reasons: 1. a more workable (Java-on-Java) implementation, and more importantly 2. a cleaner slate of customer expectations. Another approach that can coexist successfully with the C2 code would be some sort of meta-heuristic that retunes the existing parameters in exceptional circumstances, while leaving them mostly stable for legacy code. In fact, the reason method handles and invokedynamic are able to transcend the limitations of the C2 inliner is they have a substantially different set of heuristics, which are tuned for the workloads that use method handles and invokedynamic. It would have been awesome to find a larger meta-framework that would subsume the existing C2 heuristics while optimizing method handles also, but that's a research problem, and meanwhile we have engineering to do. Feedback-driven approaches (such as Marcus Lagergren suggests) are going to be much more stable than "dead reckoning" retunings. Again, that is easier to do in a fresher (Java-based) IR than that of C2. In addition to feedback, some sort of "smoothing" of a larger number of discrete heuristics (by a trainable neural network) is probably the next big thing to try.

More comments from Vitaly Davidovich: ... The inlining stuff is just a nuisance - you can even see, and I'm sure you know already, that even core libs folks are contorting code to play nice with the inliner (see the recent mail thread from Martin Buchholz there). http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-October/044224.html ... I think a lot of the current issues are due to rather naive heuristic of bytecode size. It's particularly disappointing because the JIT has profile info, so if it were to peek inside a candidate inlinee it would likely see which bytecodes are actually called and get a more accurate view of the "effective" size of the method. It would still be a heuristic, but a better one. It also doesn't seem like it would cost a lot in compile time. Few more ideas: 1) Basic blocks terminating in a throw of an exception should not be counted towards inlining. People that use exceptions for control flow can pay the price if inlining hurts. Most code will benefit. You typically see people working around this by moving that basic block into a helper method. Workable, but awkward and most people don't know about this. 2) switch statements balloon bytecode tremendously, even though the assembly is tight. Maybe discount them similarly. 3) if a callee is receiving JIT-time constants as args, bump its inlining "bonus". Those constants may fold a bunch of code. This is also important for when you breach InlineSmallCode limits - inlining the callee, even though it's been inlined eslewhere, may be beneficial. Still a heuristic, but a decent one I think. 4) Does InlineSmallCode consider how many times the callee has been inlined elsewhere? If not, I think it should. I think the big one here is being smarter about what bytecode to actually count and how to count it (e.g. switch statements).

Recent discussions about hand-splitting of varargs APIs (via switch) are complicated by this problem. Ref: http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-November/036382.html Example: @SafeVarargs static <E> List<E> varargs_switch10(E... ea) { switch (ea.length) { case 0: return explicit(); case 1: return explicit(ea[0]); case 2: return explicit(ea[0], ea[1]); case 3: return explicit(ea[0], ea[1], ea[2]); case ... default: return varargs(ea); } } Suggestion #1: Support hand-split switches by more accurate measurement of the inline weight of switch-heavy code. Most switch cases will be unreached, and should not contribute to the inline cost estimate. Suggestion #2: If a switch is made on a parameter or a value easily derived from a parameter (ea.length above), and if the value is a compile-time constant, use an inlining cost estimate that takes the constant value into account. In this case, all but one of the switch-cases would disappear from the inlining cost. Suggestion #3: Provide users with more appropriate notations than hand-expanded switches, when hand-splitting is desired, and optimize the simpler notation more simply.

Vitaly Davidovich notes an issue with dense switches that get compiled to multi-way jumps. A switch should not be measured by the raw size of the instruction. If several adjacent keys branch to the same successor, surely that should count as a single test and branch. http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-June/018165.html

This problem torments well-intentioned Core Libs programmers who want to sprinkle "asserts" in their short methods. Although an assert adds exactly zero overhead to native-compiled code, it can kick the bytecode instruction size of a method over the inlining cliff (at 35 bytes, whether executed or not). This makes asserts appear to have a performance cost, even though they should not. If we fix this bug in a way where non-executed code is not charged to an inlining heuristic budget, we will stop penalizing programmers for adding asserts.

Also: assertions count as code weight, which is very bad. So instead of using byte code size at all - can't HotSpot generate the inline candidate and throw it away if it's too large or something? Ideally it'd do something like this do { generate ir for hot child if child ir too big trivial check break splice ir into graph apply shrinking transforms } while (parent ir not too big && enough time left for optimisation) I've spent significant time tuning Nashorn hot methods to be as small as possible so that they will be considered for inlining. Sometimes, by splitting a method into two, with half the logic in each, I reach my goal - and we should try to think about how to abstract this away from the Java programmer Regards Marcus

List of rt. jar methods that are above the byte code inlining limit: https://groups.google.com/forum/#!topic/jitwatch/KJKEgVLTGg8

ILW = {Impact: Med, Likelihood: Med, Workaround: Med} Impact: Users report sudden unacceptable performance loss. Difficult to analyze. Likelihood: Recurring reports. Probably also underreported. Workaround: When recognized, split methods into smaller methods. Time-consuming and error-prone. Severity = 3

For better visibility into generated code decisions, see https://wiki.openjdk.java.net/display/HotSpot/PrintAssembly and search the web for PrintAssembly. See also LogCompilation and "jitwatch".

Surprising inline failures are a common problem. This is a deepening issue for dynamic languages like Nashorn, since they consist of small bits of simple "plumbing" joined together. If 5% of the joints in the plumbing fail to inline, they can dominate performance. Such surprises may also crop up in new releases when Java programmers adopts new language features. For example, this happened when the "assert" keyword was introduced. Seemingly simple assertions in hot code can disturb inlining and performance, even when assertions are not enabled. A fix to this, and other fast/slow idiom performance, should discount unreached code, in both inline heuristics and parsing (IR generation, JDK-8030976).

Untaken paths should not contribute to inline-limiting metrics.

MaxInlineSize (default = 35 bytecode bytes) is used to detect bytecoded methods which are small enough to to inline almost always. It should be compared against a better metric than Method::code_size, which is the textual size of the bytecodes. A better metric would be a weighted instruction count, with low or zero weight given to data movement instructions and heavier weight given to invocations and control transfers. Unreached instructions (including those never reached so far according to the profile) should not contribute to the weight. This will allow methods to contain unused slow paths (e.g., for exception throws) that do not interfere with the inlining of fast paths. A similar metric would be a weighted IR node count (immediately after parsing), but that would more difficult to derive. InlineSmallCode (default = a few thousand native instruction bytes) is similarly used to detect native-compiled n-methods which are too large to inline. The problem with it is that the metric is compared against native code bytes, and many of those bytes are "cold" slow paths which are never executed (and thus never hit the instruction cache). Slow paths (to uncommon traps) tend to be numerous and burdened with trivial data motion code. Also, this metric is deeply machine-dependent, and so needs complete re-tuning for each CPU architecture and (even worse) for each change in the JIT back end or middle end. A better metric would be something related to a machine-independent view of the hot path, such as the weighted bytecode instruction count (see above) or IR size. Background: Given a strongly interconnected control flow between inline-able methods A->B->C->���->A, the InlineSmallCode limit tends to reduce the number of native copies of A, B, etc. The pathology it interrupts occurs if an application makes hot entries into the graph at multiple points A, B, C, ���, which (except for InlineSmallCode) would tend to create inlined n-methods containing A->B->C->���->A, B->C->���->A->B, C->���->A->B->C, etc., which triggers compilation work quadratic in the size of the graph cycle, and instruction cache traffic potentially quadratic.

EVALUATION See description.