JDK-8227745 : Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: jvmti
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-07-16
  • Updated: 2024-06-14
  • Resolved: 2020-10-20
  • Fixed in: JDK 16 (build 16 b21)
Description
Escape analysis (EA) should remain enabled, for better performance, when the VM is running with
JVMTI agents loaded.

The main intent is to be able to start a production system in a mode that allows
a debugging session to be initiated at any later time, if necessary or desired,
without the need to disable escape analysis at start-up. In most cases debugging
will never be activated, and the production system should run at the best
possible performance while still being ready for debugging. The enhancement will
also improve performance when a debugger has attached to the VM.

Another important scenario for the enhancement is heap diagnostics. Agents with
that purpose need not be loaded at start-up. They can be loaded into a running
system whenever necessary or desired. Unfortunately the current JVMTI
implementation does not and cannot give access to scalar replaced objects, which
can hinder diagnostics. JDK-8233915 is an example of this issue that will also
be fixed by this enhancement.

Currently EA is disabled if a JVMTI agent added the capability
can_access_local_variables, because an access to a local reference variable
potentially changes the escape state of the referenced object and thereby
invalidates optimizations based on it.
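
As an illustration of the kind of code affected, here is a minimal sketch of a method whose allocation is a candidate for scalar replacement. The class and method names are hypothetical and not taken from the patch; whether C2 actually eliminates the allocation depends on inlining and compilation state.

```java
/** Illustrates an allocation that C2's escape analysis may eliminate. */
class ScalarReplacementDemo {

    /** Small value holder; instances never leave sumOfSquares(). */
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            // 'p' is reachable only from this frame. Once the method is compiled,
            // C2 may keep p.x and p.y in registers and skip the heap allocation.
            // An agent reading 'p' through JVMTI GetLocalObject needs a real heap
            // object, so such an access changes the escape state of 'p'.
            Point p = new Point(i, i + 1);
            sum += (long) p.x * p.x + (long) p.y * p.y;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000_000));
    }
}
```

With EA disabled, every loop iteration allocates a Point on the heap; with EA enabled the allocation can disappear entirely, which is the gain this enhancement wants to keep while agents are loaded.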

There are more JVMTI capabilities that allow agents to acquire object references
from stack frames:

1. can_access_local_variables
2. can_get_owned_monitor_info
3. can_get_owned_monitor_stack_depth_info
4. can_tag_objects
   This allows for example to walk the object graph beginning at its roots,
   which include local variables.

JDK-8230677 switches EA off if capabilities 2. or 3. are taken. This workaround is not possible for
4., as can_tag_objects is an always-available capability. JDK-8233915 tracks this issue.

In addition EA is disabled if

5. can_pop_frame

is added. Not because it gives access to local variables, but because the
implementation of PopFrame interferes with object reallocation during
deoptimization of compiled frames.

It is likely a bug that EA is not disabled if

6. can_force_early_return

is added as ForceEarlyReturn has the same issues with deoptimization.

This enhancement shall allow the JVM to run with escape analysis enabled even if any of the
capabilities 1. to 6. is requested by a JVMTI agent.

Summary of Proposed Implementation
----------------------------------

The JVMTI implementation is changed to revert EA-based optimizations just before
objects escape through JVMTI. At runtime there is no escape information for each
object in scope. Instead, each scope is annotated with whether non-escaping
objects exist in it and whether any of them are passed as parameters. If a JVMTI
agent accesses a reference on the stack, then the owning compiled frame C is
deoptimized if any non-escaping object is in scope. Scalar replaced objects are
reallocated on the heap and objects with eliminated locking are relocked. This
is called "deoptimizing objects" for short.

If the agent accesses a reference in a callee frame of C and C is passing any
non-escaping object as argument then C and its objects are deoptimized as well.
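
The decision described above can be sketched as a toy model. This is not HotSpot code: the Frame class, its field names and the method framesToDeoptimize are hypothetical, and the sketch assumes that the rule for C applies to every caller below the accessed frame, not only the direct one.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the object-deoptimization decision; not HotSpot code. */
class ObjectDeoptModel {

    /** Hypothetical stand-in for one frame and its two per-scope annotations. */
    static final class Frame {
        final String name;
        final boolean compiled;
        final boolean nonEscapingInScope;   // some non-escaping object is in scope
        final boolean passesNonEscapingArg; // some non-escaping object is passed to the callee

        Frame(String name, boolean compiled,
              boolean nonEscapingInScope, boolean passesNonEscapingArg) {
            this.name = name;
            this.compiled = compiled;
            this.nonEscapingInScope = nonEscapingInScope;
            this.passesNonEscapingArg = passesNonEscapingArg;
        }
    }

    /**
     * stack.get(0) is the youngest frame. Returns the names of the frames whose
     * objects must be deoptimized before an agent reads a reference held in
     * stack.get(accessedDepth).
     */
    static List<String> framesToDeoptimize(List<Frame> stack, int accessedDepth) {
        List<String> result = new ArrayList<>();
        for (int depth = accessedDepth; depth < stack.size(); depth++) {
            Frame f = stack.get(depth);
            if (!f.compiled) {
                continue; // interpreted frames only hold real heap objects
            }
            boolean needed = (depth == accessedDepth)
                    ? f.nonEscapingInScope     // the accessed frame itself
                    : f.passesNonEscapingArg;  // a caller of the accessed frame
            if (needed) {
                result.add(f.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Frame> stack = List.of(
                new Frame("callee", true, true, false),
                new Frame("C", true, true, true),
                new Frame("entry", false, false, false));
        System.out.println(framesToDeoptimize(stack, 0));
    }
}
```

The point of the two separate annotations is that a caller which merely has non-escaping objects in scope is left alone; only a caller that hands such an object down the stack is affected by an access in a callee.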

Deoptimizing Objects
---------------------

Early reallocation of scalar replaced (aka virtual) objects, where reallocation
is done independently of and potentially long before replacing the owning
compiled frame with equivalent interpreter frames, is a preexisting
functionality that is leveraged by the enhancement (see
materializeVirtualObjects).

Reallocating and relocking objects is called "deoptimizing objects".
Deoptimized objects are kept as deferred updates (preexisting
JavaThread::_deferred_locals_updates). Either all objects of a compiled frame
are deoptimized or none. Whether this has already happened is recorded with the
corresponding deferred updates, so that it is not done twice.

EscapeBarrier
-------------

The class EscapeBarrier is the interface to synchronize and trigger
deoptimization before objects escape.

C2 Changes
----------

During EA, C2 annotates each safepoint with whether it has non-escaping
objects in scope, and each Java call with whether it has non-escaping objects
in its parameter list.
This information is persisted in the CompiledMethod's debug information.

Escape Information at Runtime
-----------------------------

There is preexisting information about scalar replaced objects and eliminated
locking (note that locks are not only eliminated based on EA, but
also nested locks are omitted).

The implementation adds information about non-escaping objects in scope and in
argument lists at call sites:

compiledVFrame::not_global_escape_in_scope()
compiledVFrame::arg_escape()
ScopeDesc::not_global_escape_in_scope()
ScopeDesc::arg_escape()

Synchronization
---------------

Competing agents use the new flag '_obj_deopt' in Thread::_suspend_flags and
the new Monitor EscapeBarrier_lock to synchronize and to suspend their
target thread.

Deoptimization can be concurrent for different target threads.

A self deoptimization cannot be concurrent with other deoptimizations.

Deoptimizing everything (e.g. before heap walks) cannot be concurrent with other
deoptimizations.

See EscapeBarrier::sync_and_suspend_one() and EscapeBarrier::sync_and_suspend_all().
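
The three concurrency rules above can be modelled with a small admission check. This is only a sketch under assumptions: the class and method names are hypothetical, the check is non-blocking where the VM waits on EscapeBarrier_lock, and it assumes that two deoptimizations of the same target thread are serialized.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy admission model for the concurrency rules; not the HotSpot implementation. */
class EscapeBarrierSyncModel {
    // Target threads that currently have a single-target deoptimization running.
    private final Set<String> busyTargets = new HashSet<>();
    // True while a self deoptimization or a deoptimize-everything operation runs.
    private boolean exclusiveActive;

    /** Single-target deoptimizations may overlap if their targets differ. */
    synchronized boolean tryBeginSingle(String target) {
        if (exclusiveActive || busyTargets.contains(target)) {
            return false;
        }
        busyTargets.add(target);
        return true;
    }

    synchronized void endSingle(String target) {
        busyTargets.remove(target);
    }

    /** Self deoptimization and deoptimize-everything exclude all other deoptimizations. */
    synchronized boolean tryBeginExclusive() {
        if (exclusiveActive || !busyTargets.isEmpty()) {
            return false;
        }
        exclusiveActive = true;
        return true;
    }

    synchronized void endExclusive() {
        exclusiveActive = false;
    }

    public static void main(String[] args) {
        EscapeBarrierSyncModel model = new EscapeBarrierSyncModel();
        System.out.println(model.tryBeginSingle("T1"));  // admitted
        System.out.println(model.tryBeginSingle("T2"));  // admitted, different target
        System.out.println(model.tryBeginExclusive());   // refused while T1 and T2 are busy
    }
}
```

A caller that is refused would retry or wait; in the VM this role is played by waiting on the monitor and by the '_obj_deopt' suspend flag on the target thread.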

PopFrame and ForceEarlyReturn
-----------------------------

Objects are deoptimized before the PopFrame/ForceEarlyReturn operation and
JVMTI_ERROR_OUT_OF_MEMORY is returned if reallocations fail. This avoids
reallocation failures during the operation.
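
The ordering can be sketched as follows. The model is hypothetical: the stack is a plain Deque of frame names and the deoptimizeObjects callback stands in for reallocation and relocking; only the two error constants are real JVMTI names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BooleanSupplier;

/** Toy model of "deoptimize objects first, then pop"; not HotSpot code. */
class PopFrameOrderingModel {

    enum Result { JVMTI_ERROR_NONE, JVMTI_ERROR_OUT_OF_MEMORY }

    /**
     * deoptimizeObjects returns false if a reallocation failed. The stack is
     * modified only after all objects have been deoptimized successfully.
     */
    static Result popFrame(Deque<String> stack, BooleanSupplier deoptimizeObjects) {
        if (!deoptimizeObjects.getAsBoolean()) {
            // Nothing has been popped yet, so the thread state is unchanged.
            return Result.JVMTI_ERROR_OUT_OF_MEMORY;
        }
        stack.pop();
        return Result.JVMTI_ERROR_NONE;
    }

    public static void main(String[] args) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("caller");
        stack.push("callee");
        System.out.println(popFrame(stack, () -> false) + " " + stack);
        System.out.println(popFrame(stack, () -> true) + " " + stack);
    }
}
```

Doing the fallible step first means the agent either gets an error with the stack untouched or a completed operation, never a half-popped frame with a failed reallocation.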

Performance
-----------

Performance should not be affected if no JVMTI agent is loaded.

If a JVMTI agent is loaded that adds any of the capabilities listed above, but
remains inactive, then there should be a performance gain as high as the gain of
EA.

The performance impact is expected to be still positive if debugging interactively.

jvm2008 results are attached to the RFE.

Microbenchmark results: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbenchmark/

Testing
-------

The proposed implementation comes with a significant amount of dedicated test
code.

The new develop flag DeoptimizeObjectsALot allows for stress testing, where
internal threads are started that deoptimize frames and objects at the
millisecond interval given with DeoptimizeObjectsALotInterval. The number of
threads started is given with DeoptimizeObjectsALotThreadCountAll and
DeoptimizeObjectsALotThreadCountSingle. The former targets all existing threads
whereas the latter operates on a single thread selected round robin.


Comments
On Linux ppc64le, the _arg_escape field of MachCallJavaNode seems to be sometimes uninitialized, please see https://bugs.openjdk.org/browse/JDK-8332903. Any idea why we do not see the issue on other platforms? There was a comment: "I think Matcher::match_sfpt is supposed to initialize the fields". Is this not called for some reason on Linux ppc64le?
14-06-2024

Changeset: 40f847e2 Author: Richard Reingruber <rrich@openjdk.org> Date: 2020-10-20 15:31:55 +0000 URL: https://git.openjdk.java.net/jdk/commit/40f847e2
20-10-2020

Posted new webrev: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-August/032685.html I moved the webrev directory up one level: Webrev.7: http://cr.openjdk.java.net/~rrich/webrevs/8227745/webrev.7/ Delta: http://cr.openjdk.java.net/~rrich/webrevs/8227745/webrev.7.inc/
18-08-2020

Posted new webrev: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-July/032247.html Microbenchmark comparing webrev.6 with jdk-16+4: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbenchmark/
13-07-2020

Sounds promising. Is there a JBS item for tracking? I would be keen to have a look once you are ready to share your work.
13-12-2019

Yes, I think you could use handshakes that are executed only by the target thread. I have a patch for this, but it also introduces a per-thread handshake queue. So you send a handshake with a mutex wait in it. It will not be executed by the VM thread. If the thread is in native/blocked you can continue, since it will execute the handshake when it does a transition and thus will be waiting on that mutex. If the thread is not safe, you wait until it is either waiting on that mutex or native/blocked.
13-12-2019

Re VM_ThreadSuspendAllForObjDeopt: I agree. Will try.
TL;DR: maybe the new mechanism to suspend a single thread could be used instead of _suspend_flags? Could I build on it? Will it have a recursion count? I think this would be required.
I'd love to use handshakes. Alas, I think it's not possible, because the VM thread cannot (re-)allocate on the Java heap. Direct handshakes (JDK-8230594) look promising to me, but I cannot use these either, because nested VM operations (e.g. a GC triggered by allocation failure) are prohibited. I could imagine, though, allowing nested VM operations for the reallocation of scalar replaced objects.
This is the scenario:
  - JVMTI agent A is about to acquire all non-escaping locals
  - of a target thread T that is not suspended
  - via JVMTI function F, which could be implemented as a VM operation
The challenge is to
  1. reallocate the scalar replaced objects of T
  2. collect the result set of F
while preventing T from pushing new frames with scalar replaced objects between 1. and 2.
Another alternative could be to switch T to interpreted execution as step 0. VM_EnterInterpOnlyMode makes all nmethods on stack not_entrant. I guess this is for historical reasons and not necessary anymore.
13-12-2019

Please use handshakes, if possible, instead of _suspend_flags. We are trying to get rid of _suspend_flags and only have one mechanism for stopping a single thread. Also, I believe VM_ThreadSuspendAllForObjDeopt can be a handshake-all-threads operation instead of a safepoint.
13-12-2019

tier1-8 testing passed with good results - no new failures.
10-12-2019

So far testing looks very good.
06-12-2019

http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/ It applied cleanly to the latest jdk/jdk sources and I have started testing it.
04-12-2019

Rebased again. Most recent version can be found at http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.most.recent/ Please notify me, if it cannot be applied.
04-12-2019

New webrev at http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.2/ Resolved TODOs. One is remaining, for which I would like to request comments.
28-11-2019

It looks like I can't apply this patch as is. Please rebase it.
23-11-2019

New webrev: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.1/ JDK-8226705 factored out code from Deoptimization::fetch_unroll_info_helper(). This helped to make the patch smaller and clearer. Also, some refactoring that touched cpu-specific files was removed.
30-10-2019

Hello Vladimir,
currently I don't have more benchmark results of my own, but I'd like to refer to the paper "Escape Analysis for Java" [Choi99], Jong-Deok Choi, et al. https://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Readings/choi99escape.pdf C2's EA is based on it. It contains some performance results, and I'd like to claim the same speed-up, e.g. when running with the jdwp agent loaded.
This enhancement has been included in SAP's JVM for many years. The maintenance effort is really small. It would be extremely small compared to the effort spent on EA itself, where I read about issues almost on a daily basis.
Please note that I added the main motivation behind this to the description (I should have done so in the first version): the goal is not directly better performance while debugging, but the best possible performance until a debugger attaches. We would like to run production systems in a mode we like to call debugging-on-demand. In this mode we can tell the VM to open a port at any time and then attach a debugger. For most systems this never happens and, being production systems, they should run at the best possible performance. Feedback from our users and SAP support is that this is one of the most valuable features, as it allows issues to be analysed when they occur, especially because they are often hard to reproduce.
Thanks, Richard.
26-09-2019

Thank you, Richard. The only noticeable case is monte_carlo with very small hot loop and Random object (if I remember correctly) which does not escape. I am not convinced that one case justifies complexity of changes. Do you have other cases which you think are important? Note, I am grateful to you for finding incorrect behavior when JVMTI is present. They should be fixed.
24-09-2019

Hi Vladimir, sorry for the delayed response. I haven't received a notification for your comment. I've rebased on rev. 56100 (updated webrev in place), and conducted a couple of SPECjvm2008 runs overnight:

Benchmark               Baseline @ 56100   With JDK-8227745   Speed-Up
                        AVG of 3 Runs      AVG of 3 Runs
compiler.compiler          529.9 ops/m        535.3 ops/m       1.01
compiler.sunflow           215.4 ops/m        215.8 ops/m       1.00
compress                   199.6 ops/m        198.1 ops/m       0.99
crypto.aes                  71.7 ops/m         63.8 ops/m       0.89
crypto.rsa                1210.2 ops/m       1217.2 ops/m       1.01
crypto.signverify          477.8 ops/m        494.0 ops/m       1.03
derby                      330.4 ops/m        311.2 ops/m       0.94
mpegaudio                  134.5 ops/m        135.1 ops/m       1.00
scimark.fft.large           98.7 ops/m        103.3 ops/m       1.05
scimark.sor.large           41.5 ops/m         42.3 ops/m       1.02
scimark.lu.small           652.0 ops/m        648.3 ops/m       0.99
scimark.sparse.small       160.5 ops/m        155.0 ops/m       0.97
scimark.monte_carlo        171.0 ops/m        343.8 ops/m       2.01
serial                     161.2 ops/m        156.4 ops/m       0.97
sunflow                    205.6 ops/m        203.4 ops/m       0.99
xml.transform              295.3 ops/m        290.3 ops/m       0.98
xml.validation             408.7 ops/m        419.1 ops/m       1.03
Geomean                                                         1.03

Server:       20x Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz, 128GB RAM
              Linux lu0486 4.4.0-159-generic #187-Ubuntu SMP
Command Line: java
              -Djava.io.tmpdir=./tmp
              -Xmx1g
              --limit-modules java.base,java.desktop,java.sql,java.naming,java.xml
              -agentlib:jdwp=transport=dt_socket,address=9000,server=y,suspend=n
              -jar ./SPECjvm2008.jar
              -wt 120
              -it 240
              -bt 4

See attached jvm2008.zip for more details. I'll open separate bugs as you suggested. Thanks, Richard.
29-08-2019

Hi [~Reingruber]. I think you should push the cases where we should disable EA (ciEnv::jvmti_state_changed()) as a separate bug. There is also a small fix in escape.cpp (last block). Can you give an example of how your changes improved debugging performance?
06-08-2019

http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.0/
06-08-2019

This is a multi-faceted bug affecting serviceability, compiler and runtime. It will be difficult to find people from each area to go through this in detail. So far the RFR has attracted no responses.
02-08-2019

Moving from hotspot/runtime to hotspot/jvmti. This RFE seems primarily in the JVM/TI area so the Serviceability team should triage this.
16-07-2019