JDK-8229895 : Avoid GC lock when no array parameters are passed to critical native
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 14
  • Priority: P3
  • Status: Resolved
  • Resolution: Won't Fix
  • Submitted: 2019-08-19
  • Updated: 2021-01-07
  • Resolved: 2020-10-22
Description
A way to improve JNI performance, proposed by Ioannis Tsakpinis <iotsakp@gmail.com> in a discussion on the shenandoah-dev mailing list (https://mail.openjdk.java.net/pipermail/shenandoah-dev/2019-August/010422.html):

It's true that CriticalJNINatives were added as an efficient way to
access Java arrays from JNI code. However, the overhead of JNI calls
affects all methods, especially methods that accept or return primitive
values only and the JNI code does nothing but pass the arguments to
another native function.

There are thousands of JNI functions in LWJGL and almost all are like
that: they simply cast arguments to the appropriate types and pass them
to a target native function. Libraries like JNR and other JNI binding
generators look the same.

The major benefit of using CriticalJNINatives for such functions is the
removal of the first two standard JNI parameters: JNIEnv* and jclass.
Normally that would only mean less register pressure, which may help in
some cases. In practice though, native compilers are able to optimize
away any argument shuffling and convert everything to a simple
tail-call, i.e. a single jump instruction.

We go from this for standard JNI:

Java -> shuffle arguments -> JNI -> shuffle arguments -> native call

to this for critical JNI:

Java -> shuffle arguments -> JNI -> native call

Example code and assembly output: https://godbolt.org/z/qZRIi1

This has a measurable effect on JNI call overhead and becomes more
important the simpler the target native function is. With Project Panama
there is no JNI function and it should be possible to optimize the first
argument shuffling too. Until then, this is the best we can do, unless
there are opportunities to slim down the JNI wrapper even further for
critical native methods (e.g. remove the safepoint polling if it's safe
to do so).

To sum up, the motivation is reduced JNI overhead. My argument is that
primitive-only functions could benefit from significant overhead
reduction with CriticalJNINatives. However, the GC locking effect is a
major and unnecessary disadvantage. Shenandoah does a perfect job here
because it supports region pinning and there's no actual locking
happening in primitive-only functions. Every other GC though will choke
hard with applications that make heavy use of critical natives (such as
typical LWJGL applications). So, two requests:

- PRIMARY: Skip check_needs_gc_for_critical_native() in primitive-only
functions, regardless of GC algorithm and object-pinning support.

- BONUS: JNI call overhead is significantly higher (3-4ns) on Java 10+
compared to Java 8 (with or without critical natives). I went through
the timeline of sharedRuntime_x86_64.cpp but couldn't spot anything that
would justify such a difference (thread-local handshakes maybe?). I was
wondering if this is a performance regression that needs to be looked
into.

Comments
We are deprecating CriticalNative functionality in favor of Project Panama. That said, the deprecation work in JDK-8233343 does improve performance of native functions if CriticalNatives are used until then.
22-10-2020

The case of passing no array parameters to a CriticalNative will be superseded by Project Panama. https://openjdk.java.net/projects/panama/
13-10-2020

I have a prototype that removes the thread_in_native transition completely and performs even better.
09-09-2020

Benchmark results from Ioannis' improved version (JDK-8229895_prototype_skip_native_trans.patch):

Benchmark               Mode  Cnt   Score   Error  Units
JNIBenchmark.func0      avgt    3  10.090 ± 2.139  ns/op
JNIBenchmark.func0Crit  avgt    3   9.860 ± 1.187  ns/op
JNIBenchmark.func1      avgt    3  10.173 ± 0.526  ns/op
JNIBenchmark.func1Crit  avgt    3   9.840 ± 0.991  ns/op
JNIBenchmark.func2      avgt    3  10.500 ± 0.307  ns/op
JNIBenchmark.func2Crit  avgt    3  10.202 ± 0.120  ns/op
JNIBenchmark.func3      avgt    3  10.518 ± 2.543  ns/op
JNIBenchmark.func3Crit  avgt    3   9.960 ± 0.208  ns/op
JNIBenchmark.func4      avgt    3  10.611 ± 2.757  ns/op
JNIBenchmark.func4Crit  avgt    3  10.501 ± 0.353  ns/op
22-08-2019

It turns out the initial prototype is incorrect: it cannot completely elide safepoint checking. JDK-8229895_prototype_fix.patch fixes that, and the "After" benchmarks above were updated accordingly.
21-08-2019

A quick prototype on x86_64.

Before (two runs):

Benchmark               Mode  Cnt   Score   Error  Units
JNIBenchmark.func0      avgt    3  10.938 ± 4.137  ns/op
JNIBenchmark.func0Crit  avgt    3  10.633 ± 0.152  ns/op
JNIBenchmark.func1      avgt    3  10.938 ± 0.788  ns/op
JNIBenchmark.func1Crit  avgt    3  10.684 ± 2.462  ns/op
JNIBenchmark.func2      avgt    3  11.241 ± 1.798  ns/op
JNIBenchmark.func2Crit  avgt    3  10.871 ± 2.377  ns/op
JNIBenchmark.func3      avgt    3  11.330 ± 1.408  ns/op
JNIBenchmark.func3Crit  avgt    3  11.005 ± 0.474  ns/op
JNIBenchmark.func4      avgt    3  11.460 ± 0.370  ns/op
JNIBenchmark.func4Crit  avgt    3  11.135 ± 2.583  ns/op

Benchmark               Mode  Cnt   Score   Error  Units
JNIBenchmark.func0      avgt    3  10.867 ± 0.699  ns/op
JNIBenchmark.func0Crit  avgt    3  10.552 ± 0.562  ns/op
JNIBenchmark.func1      avgt    3  10.889 ± 0.614  ns/op
JNIBenchmark.func1Crit  avgt    3  10.335 ± 3.353  ns/op
JNIBenchmark.func2      avgt    3  11.176 ± 0.024  ns/op
JNIBenchmark.func2Crit  avgt    3  10.682 ± 0.514  ns/op
JNIBenchmark.func3      avgt    3  11.141 ± 0.199  ns/op
JNIBenchmark.func3Crit  avgt    3  10.726 ± 1.239  ns/op
JNIBenchmark.func4      avgt    3  11.262 ± 0.632  ns/op
JNIBenchmark.func4Crit  avgt    3  10.981 ± 1.534  ns/op

After (two runs):

Benchmark               Mode  Cnt   Score   Error  Units
JNIBenchmark.func0      avgt    3  10.968 ± 4.395  ns/op
JNIBenchmark.func0Crit  avgt    3  10.238 ± 0.438  ns/op
JNIBenchmark.func1      avgt    3  10.810 ± 1.577  ns/op
JNIBenchmark.func1Crit  avgt    3  10.240 ± 1.131  ns/op
JNIBenchmark.func2      avgt    3  11.053 ± 2.185  ns/op
JNIBenchmark.func2Crit  avgt    3  10.599 ± 0.712  ns/op
JNIBenchmark.func3      avgt    3  11.332 ± 3.552  ns/op
JNIBenchmark.func3Crit  avgt    3  10.467 ± 1.017  ns/op
JNIBenchmark.func4      avgt    3  11.411 ± 1.326  ns/op
JNIBenchmark.func4Crit  avgt    3  11.387 ± 3.873  ns/op

Benchmark               Mode  Cnt   Score   Error  Units
JNIBenchmark.func0      avgt    3  10.833 ± 0.691  ns/op
JNIBenchmark.func0Crit  avgt    3  10.244 ± 0.408  ns/op
JNIBenchmark.func1      avgt    3  10.656 ± 0.726  ns/op
JNIBenchmark.func1Crit  avgt    3  10.244 ± 1.791  ns/op
JNIBenchmark.func2      avgt    3  11.161 ± 0.316  ns/op
JNIBenchmark.func2Crit  avgt    3  10.427 ± 1.718  ns/op
JNIBenchmark.func3      avgt    3  11.151 ± 0.456  ns/op
JNIBenchmark.func3Crit  avgt    3  10.600 ± 0.870  ns/op
JNIBenchmark.func4      avgt    3  11.494 ± 0.033  ns/op
JNIBenchmark.func4Crit  avgt    3  10.530 ± 0.940  ns/op
21-08-2019

[~zgu] What benchmark was used to measure the JNI call overhead?
21-08-2019

There have been changes to some of the validation and checks performed by JNI calls, which likely account for the increased cost. See, for example, JDK-8147451 and follow-up work, though some of that may only apply when run with -Xcheck:jni.
21-08-2019

Benchmark provided by Ioannis: I have prepared a benchmark similar to what I've used in my testing, here: https://github.com/Spasi/JDK-8229895 The README includes instructions and some performance results that I'm seeing locally. The called native functions do nothing at all and they are tested both with and without CriticalJNINatives. The slowdown when going from JDK 8 to JDK 10+ is obvious on Windows & Linux, but I was not able to reproduce it on macOS. - Ioannis
21-08-2019