A discussion on the shenandoah-dev mailing list (https://mail.openjdk.java.net/pipermail/shenandoah-dev/2019-August/010422.html) about a way to improve JNI performance, proposed by Ioannis Tsakpinis <iotsakp@gmail.com>:
It's true that CriticalJNINatives were added as an efficient way to
access Java arrays from JNI code. However, the overhead of JNI calls
affects all methods, especially methods that accept and return only
primitive values and whose JNI code does nothing but pass the arguments
to another native function.
There are thousands of JNI functions in LWJGL and almost all are like
that: they simply cast arguments to the appropriate type and pass them
to a target native function. Libraries like JNR and other JNI binding
generators produce similar code.
The major benefit of using CriticalJNINatives for such functions is the
removal of the first two standard JNI parameters: JNIEnv* and jclass.
Normally that would only mean less register pressure, which may help in
some cases. In practice though, native compilers are able to optimize
away any argument shuffling and convert everything to a simple
tail-call, i.e. a single jump instruction.
We go from this for standard JNI:
    Java -> shuffle arguments -> JNI -> shuffle arguments -> native call
to this for critical JNI:
    Java -> shuffle arguments -> JNI -> native call
Example code and assembly output: https://godbolt.org/z/qZRIi1
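For readers who don't want to follow the link, here is a minimal sketch of
the two entry points being compared. It is not the code behind the godbolt
link. The class name demo.Bindings, the method add and the target function
target_add are hypothetical, and the typedefs are stand-ins so the sketch
compiles without a JDK (a real binding would include <jni.h> instead). As far
as I know, HotSpot expects the standard Java_ entry point to exist alongside
the JavaCritical_ one.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in declarations (assumption): replace with #include <jni.h>
   when building against a real JDK. */
typedef int32_t jint;
typedef struct JNIEnv_ JNIEnv;  /* opaque here, never dereferenced */
typedef void *jclass;

/* Hypothetical target function. In a real binding this would live in
   another library, so the wrapper cannot simply be inlined into it. */
jint target_add(jint a, jint b) {
    return a + b;
}

/* Standard JNI entry point for a hypothetical
   "static native int add(int a, int b)" in class demo.Bindings.
   The two leading parameters shift a and b into later argument slots,
   so they have to be moved back before calling the target. */
jint Java_demo_Bindings_add(JNIEnv *env, jclass clazz, jint a, jint b) {
    (void)env;
    (void)clazz;
    return target_add(a, b);
}

/* Critical-native entry point: no JNIEnv* and no jclass. The arguments
   already sit where target_add expects them, so an optimizing compiler
   can reduce this wrapper to a tail call. */
jint JavaCritical_demo_Bindings_add(jint a, jint b) {
    return target_add(a, b);
}
```

Both wrappers return the same result. The difference only shows up in the
generated assembly, and only with optimizations enabled.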
This has a measurable effect on JNI call overhead and becomes more
important the simpler the target native function is. With Project Panama
there is no JNI function and it should be possible to optimize the first
argument shuffling too. Until then, this is the best we can do, unless
there are opportunities to slim down the JNI wrapper even further for
critical native methods (e.g. remove the safepoint polling if it's safe
to do so).
To sum up, the motivation is reduced JNI overhead. My argument is that
primitive-only functions could benefit from significant overhead
reduction with CriticalJNINatives. However, the GC locking effect is a
major and unnecessary disadvantage. Shenandoah does a perfect job here
because it supports region pinning and there's no actual locking
happening in primitive-only functions. Every other GC though will choke
hard with applications that make heavy use of critical natives (such as
typical LWJGL applications). So, two requests:
- PRIMARY: Skip check_needs_gc_for_critical_native() in primitive-only
functions, regardless of GC algorithm and object-pinning support.
- BONUS: JNI call overhead is significantly higher (3-4ns) on Java 10+
compared to Java 8 (with or without critical natives). I went through
the timeline of sharedRuntime_x86_64.cpp but couldn't spot anything that
would justify such a difference (thread-local handshakes maybe?). I was
wondering if this is a performance regression that needs to be looked
into.
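To make the PRIMARY request concrete, here is a rough sketch of the
"primitive-only" test it depends on, written against JNI method descriptors
such as "(II)I". This is not HotSpot code. The helper name is_primitive_only
is hypothetical, and the sketch assumes that a critical native with no array
parameters gives the GC nothing to pin or lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of HotSpot: returns true when the
   parameter list of a JNI method descriptor contains only primitive
   types, i.e. no arrays ('[') and no object references ('L'). */
bool is_primitive_only(const char *descriptor) {
    if (descriptor == NULL || descriptor[0] != '(') {
        return false;  /* not a method descriptor */
    }
    for (const char *p = descriptor + 1; *p != '\0'; ++p) {
        if (*p == ')') {
            return true;   /* reached the end of the parameters */
        }
        if (*p == '[' || *p == 'L') {
            return false;  /* array or object parameter */
        }
    }
    return false;  /* malformed: no closing parenthesis */
}
```

Under this assumption, a wrapper generator would only need the GC check for
methods where is_primitive_only returns false.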