JDK-8209670 : CompilerThread releasing code buffer in destructor is unsafe
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11,12
  • Priority: P1
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2018-08-19
  • Updated: 2018-08-29
  • Resolved: 2018-08-21
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11 JDK 12
11 b28Fixed 12Fixed
Related Reports
Duplicate :  
Relates :  
Relates :  
Description
I believe it is unsafe for CompilerThread to release code buffer in its destructor.

Before CompilerThread reaches its destructor, it has to be removed from Threads list, which made it no longer participates safepoints. It opens up a race window that can result safepoint scanning to stumble over the code buffer just deleted by CompilerThread's destructor.

Although, the destructor takes CodeCache_lock, but safepoint walk does not take this lock, so it is a race.

I have seen this problem during Shenandoah tests on a big machine with many cores.




Comments
Fix request approved.
21-08-2018

Interestingly, it looks like we now also hit this in our testing (JDK-8209767).
21-08-2018

Fix Request: It's important to fix this bug because failures are very hard to identify/track and therefore it's not easy to apply the correct workaround (-XX:-UseDynamicNumberOfCompilerThreads). Part of the fix is reverting the code to before JDK-8205499 and the other part only affects code that was added by JDK-8205499 and is guarded by the UseDynamicNumberOfCompilerThreads flag (that means even if the fix would fail, the workaround still applies). The fix is therefore low risk. I've tested the fix with hs-tier1-3, jdk-tier1-3 and hs-precheckin-comp and will run higher level tiers as well before pushing. The fix was reviewed by Vladimir Kozlov and Zhengyu Gu (waiting for more reviews). http://cr.openjdk.java.net/~thartmann/8209670/webrev.00/
21-08-2018

Zhengyu, thanks a lot for verifying!
21-08-2018

Okay, then we need formal request for approval for push into jdk 11. I will leave it to Tobias.
20-08-2018

It turned out that it was Shenandoah's bug that failed Tobias' webrev.00. After fixing Shenandoah, I reran tests three times, all came back clean. I think webrev.00 is good to go. Thanks, -Zhengyu
20-08-2018

[~mdoerr] We are considering to fix it by disabling dynamic compiler threads feature by default (Tobias's webrev.01) if we don't find final fix. We have only tomorrow to push a fix and currently proposed fix webrev.00 is not working for Zhengyu, as I understand from side conversation. I want to know if you are okay with this. Thanks, Vladimir
20-08-2018

Well, my understanding is that at the new location(s), the CodeCache::free can not be executed concurrently with the safepoint walking code. Or am I missing something here? Edit: Okay, thanks for looking at this.
20-08-2018

I don't see how moving the free where you did changes the race condition ?? Edit: yes I do, it's now done before the thread is removed from the threads-list.
20-08-2018

Thanks Zhengyu! I'll take care of the fix. EDIT: Out for review: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2018-August/030178.html
20-08-2018

Hi Tobias, The hs_err files I got were from shenandoah specific tests (attached one), the problem eventually turned out to be none shenandoah specific. I was about to purpose different approach (patch attached), but you have better understanding of this particular problem, so I am going to leave the decision to you (reassigned to you). This bug is extremely hard to track, if without some lucks, so if possible, please do fix to root cause.
20-08-2018

Here's a proper fix for this issue (we have to be careful to not re-introduce JDK-8205499): http://cr.openjdk.java.net/~thartmann/8209670/webrev.00/ Or, as I've suggested above, we go with disabling the UseDynamicNumberOfCompilerThreads feature for now: http://cr.openjdk.java.net/~thartmann/8209670/webrev.01/
20-08-2018

Zhengyu, thanks for all the details! Could you please also attach the hs_err and/or stack traces? Depending on how we define the likelihood of this (since it's a race condition it's usually very intermittent but could still affect many applications), it's either a P1 or P2: ILW = Crash during safepoint scanning due to released code blob in code cache, race condition, -XX:-UseDynamicNumberOfCompilerThreads = H(M/H)M = P1/P2 I tend towards making this a P1 because the issue will be very hard to track if it happens with a customer application (and the workaround only helps if one can identify the problem). Since we are very late in the JDK 11 release cycle, I would suggest to disable UseDynamicNumberOfCompilerThreads and fix this properly in JDK 12: diff -r f729ca27cf9a src/hotspot/share/runtime/globals.hpp --- a/src/hotspot/share/runtime/globals.hpp Fri Aug 17 09:51:28 2018 -0700 +++ b/src/hotspot/share/runtime/globals.hpp Mon Aug 20 09:23:30 2018 +0200 @@ -1138,7 +1138,7 @@ range(0, max_jint) \ constraint(CICompilerCountConstraintFunc, AfterErgo) \ \ - product(bool, UseDynamicNumberOfCompilerThreads, true, \ + experimental(bool, UseDynamicNumberOfCompilerThreads, false, \ "Dynamically choose the number of parallel compiler threads") \ \ diagnostic(bool, ReduceNumberOfCompilerThreads, true, \
20-08-2018

Poisoned content (debug only) helped to catch the problem, I think it is real in product build too. I am also puzzled why this has not been caught earlier ... I think SMR may prolong the race window ? I don't understand why you think it is CodeCache::free() issue, caller does take CodeCache_lock, it should be safe in normal circumstance. The race window is created after removing Compiler thread from Threads list, made it no longer honors safepoints, while it is still holding and manipulating code cache, but safepoint scan is assuming code cache should be stable, cause all compiler threads should already be stopped. We (Shenandoah) did see codecache scan issues periodically, but did not get enough information to find out the root causes. I happened to run some shenandoah stress tests on one of our big test machines over weekend, somehow, it became very reproducible, that helped me to catch this problem. The fix that I am currently testing, suggests it is the root cause. If you think it is not P1, please lower the priority ... I am ok with it, since Shenandoah is targeted to JDK12. However. I do think it also affects other GCs, since they all do codecache scan during safepoints.
20-08-2018

I'm trying to understand the issue. You said the problem is the content is poisoned - we only call CodeCache::free() ergo somewhere in that path I assumed we were doing the poisoning when we should not. I see what you are saying now. We return the code buffer to the codecache asynchronously wrt. the scanning that can occur at a safepoint. The free uses the lock but the scanning does not as it assumes nothing can be interacting with the code cache during a safepoint. That assumption is violated now that we have dynamic compiler threads which can terminate and so return their code buffer.
19-08-2018

Is this only an issue in debug builds? You've filed a P1 bug against JDK 11 when we are at the final stages of producing a release candidate.
19-08-2018

That seems to be a bug with the CodeCache::free logic then. We are freeing the buffer so it should be reusable, so unclear why it should contain "poison content". ??
19-08-2018

Moved to compiler sub-component This relates to the introduction of dynamic compiler threads in JDK 11.
19-08-2018

It is not really *deleting* the code buffer, but returning it back to code heap: CompilerThread::~CompilerThread() { .... if (get_buffer_blob() != NULL) { MutexLockerEx mu(CodeCache_lock, Mutex::_no_safepoint_check_flag); CodeCache::free(get_buffer_blob()); } ... } In debug build, it poisons content with badCodeHeapFreeVal, which triggers crash during safepoint scan.
19-08-2018

How can a safepoint scan find a deleted code buffer if the CompilerThread has been removed from the Threads list?
19-08-2018