JDK-8011102 : Clear AVX registers after return from JNI call
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: x86
  • Submitted: 2013-03-29
  • Updated: 2015-04-17
  • Resolved: 2013-04-03
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 7 JDK 8 Other
7u40Fixed 8Fixed hs24Fixed
Related Reports
Relates :  
Relates :  
Description
A native library may use wide 256bit YMM registers and does not clean them after that.
Add vzeroupper instruction after return from JNI call to avoid SSE <-> AVX transaction penalty.

From customer report:

Hello,
I've got a question related to my project. It is Java wrapper for Libav libraries and I have some performance issues with it.

If I compile the libraries with AVX instructions enabled the whole testing application uses approximately 130% of CPU time in comparison to the same libraries with AVX disabled. The problem is definitely in "bad" transitions between SSE and AVX instructions. These transitions are costly in case the upper part of YMM registers is not zeroed using VZEROUPPER or VZEROALL instruction before using SSE.

There is no problem with those libraries, if they are not used from Java. I used Intel's Software Developer Emulator to find those bad AVX <-> SSE transitions and I found thousands of them. The origin of almost all bad transitions from AVX -> SSE (I mean the code that uses AVX 256 instructions and does not call VZEROUPPER) is somewhere inside anonymous memory blocks (according to Intel's SDE and pmem).

Libav mixes SSE and AVX 128 instructions a lot. It cannot cause any trouble if the upper part of YMM registers is zeroed. But in case it is not zeroed it would oscillate between B and C states (according to the Agner's terminology). Both of these transitions costs quite a lot of CPU cycles.

So here is my question: Is it possible that JIT compiler compiles some bytecode into native instructions, uses some AVX 256 instrucitons, does not use VZEROUPPER and puts the result into some anonymous memory block?

Ondrej Perutka

Comments
We may need to add VZEROUPPER before JNI call if compiled code may use wide vectors (MaxVectorSize >= 32 || UseAVX >= 2). (UseAVX >= 2) is used in arraycopy stubs to guards 256bit move instructions which use YMM registers. An other place is before calls into Runtime in 64bit VM (it uses SSE instructions).
30-03-2013

http://software.intel.com/en-us/forums/topic/301853 What we decided to do was to optimize for the common scenario of uniform blocks of 128-bit (or all scalar) code separated from uniform blocks of 256-bit code. We maintain an internal record of when we transition between states where the upper bits contain something nonzero - to a point where the state is guaranteed to be zero. We give you a fast (1* cycle throughput) way to the second state VZEROUPPER (though VZEROALL, XRSTOR and reboot also work). Once you're in that state of Zeroed-upperness, you can execute 128-bit code (or scalar) - VEX prefixed or not - and you pay no transition penalty. You can also transition back to executing 256-bit instructions also with no penalty. You can transition freely between any VEXed instruction of any width and pay no penalty. The downside is that if you try to move from 256-bit instructions to legacy 128-bit instructions without that VZEROUPPER, you're going to pay. The way we chose to make you pay is to optimize for the common use of long blocks of SSE code you pay once during the transition to legacy 128-bit code instead of on every instruction. We do it by copying the upper 128-bits of all 16 regis ters to a special scratchpad and this copying takes time (something like 50 cycles - still TBD). Then the legacy SSE code can operate as long as it wants with no penalty. When you transition back to a VEX-prefixed instruction, you have to pay the penalty again to restore state from that scratchpad. The solution for this problem is for software to use VZEROUPPER prior to leaving your basic block of 256-bit code. Use it prior to calling any (ABI compliant) function, and prior to any blind return jump.
29-03-2013