JDK-8168914 : Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8, 9
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • CPU: generic
  • Submitted: 2016-10-28
  • Updated: 2018-03-26
  • Resolved: 2017-02-25
Fix Versions:
JDK 10: 10 (Fixed)
JDK 8: 8u131 (Fixed)
JDK 9: 9 b161 (Fixed)
Description
Rafael Winterhalter reported a crash in JNIHandleBlock::oops_do() during concurrent marking which only happened if massive class redefinition was performed in parallel. He supplied two hs_err files which had similar stack traces (see attached hs_err files for full details):

#  SIGSEGV (0xb) at pc=0x00007f80940887eb, pid=3809, tid=0x00007f8090870700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 1.8.0_102-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x4c87eb]  oopDesc::size()+0x2b

V  [libjvm.so+0x4c87eb]  oopDesc::size()+0x2b
V  [libjvm.so+0x4ca770]  CMTask::make_reference_grey(oopDesc*, HeapRegion*)+0xc0
V  [libjvm.so+0x70cf6a]  JNIHandleBlock::oops_do(OopClosure*)+0x6a
V  [libjvm.so+0x468b50]  ClassLoaderData::oops_do(OopClosure*, KlassClosure*, bool)+0x70
V  [libjvm.so+0x656ad0]  InstanceMirrorKlass::oop_oop_iterate_nv(oopDesc*, G1CMOopClosure*)+0x40
V  [libjvm.so+0x4cbb44]  CMBitMapClosure::do_bit(unsigned long)+0x74
V  [libjvm.so+0x4c522e]  CMTask::do_marking_step(double, bool, bool)+0xa9e
V  [libjvm.so+0x4cc483]  CMConcurrentMarkingTask::work(unsigned int)+0x293
V  [libjvm.so+0xae6e0f]  GangWorker::loop()+0xcf
V  [libjvm.so+0x9249c8]  java_start(Thread*)+0x108

#  SIGSEGV (0xb) at pc=0x00007f370686df20, pid=12573, tid=0x00007f370411e700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 1.8.0_102-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x70cf20]  JNIHandleBlock::oops_do(OopClosure*)+0x20
V  [libjvm.so+0x70cf20]  JNIHandleBlock::oops_do(OopClosure*)+0x20
V  [libjvm.so+0x468b50]  ClassLoaderData::oops_do(OopClosure*, KlassClosure*, bool)+0x70
V  [libjvm.so+0x63d4cf]  InstanceClassLoaderKlass::oop_oop_iterate_nv(oopDesc*, G1CMOopClosure*)+0x3f
V  [libjvm.so+0x4c421b]  CMTask::drain_local_queue(bool)+0x14b
V  [libjvm.so+0x4c49ac]  CMTask::do_marking_step(double, bool, bool)+0x21c
V  [libjvm.so+0x4cc483]  CMConcurrentMarkingTask::work(unsigned int)+0x293
V  [libjvm.so+0xae6e0f]  GangWorker::loop()+0xcf
V  [libjvm.so+0x9249c8]  java_start(Thread*)+0x108

From a quick look at the code in classLoaderData.cpp I found that ClassLoaderData::oops_do() iterates over the class loader's handles:

  if (_handles != NULL) {
    _handles->oops_do(f);
  }

while they can be concurrently updated by:

jobject ClassLoaderData::add_handle(Handle h) {
  MutexLockerEx ml(metaspace_lock(),  Mutex::_no_safepoint_check_flag);
  if (handles() == NULL) {
    set_handles(JNIHandleBlock::allocate_block());
  }
  return handles()->allocate_handle(h());
}

Notice that updating the handles in ClassLoaderData::add_handle() is done under the 'metaspace_lock', while iterating over the handles in ClassLoaderData::oops_do() isn't synchronized at all.

Incidentally, ClassLoaderData::add_handle() is called during class redefinition:

ClassLoaderData::add_handle(Handle h)
  ConstantPool::initialize_resolved_references
    Rewriter::make_constant_pool_cache
      Rewriter::Rewriter
        Rewriter::rewrite(instanceKlassHandle klass, TRAPS)
          VM_RedefineClasses::load_new_class_versions
            VM_RedefineClasses::doit_prologue()

This code path, according to Rafael, is what triggers the bug at his customer's site.

I've tried to build a simple test case which reproduces the crash but haven't succeeded so far. With a simple test case we could verify whether also taking the 'metaspace_lock' that's already used in ClassLoaderData::add_handle() during the iteration in ClassLoaderData::oops_do() would actually solve the problem.

Then, of course, we'd also have to evaluate the performance cost of such an additional lock and maybe come up with another, cheaper solution (e.g. a dedicated lock that only serializes ClassLoaderData::add_handle() against ClassLoaderData::oops_do(), instead of the more heavy-weight 'metaspace_lock').
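For illustration, here is a minimal, self-contained C++11 model of that idea (all names are invented for this sketch; std::mutex plays the role of the 'metaspace_lock' and a std::deque stands in for the JNIHandleBlock chain):

#include <deque>
#include <mutex>

struct Oop {};  // stand-in for an oop

class HandleListModel {
  std::deque<Oop*> _handles;   // deque: push_back keeps element addresses
                               // stable, like handles in a JNIHandleBlock
  std::mutex       _lock;      // plays the role of the metaspace_lock

 public:
  // Writer side: already serialized by the lock today (cf. add_handle()).
  Oop** add_handle(Oop* h) {
    std::lock_guard<std::mutex> ml(_lock);
    _handles.push_back(h);
    return &_handles.back();
  }

  // Reader side: the contemplated fix is to take the *same* lock here, so
  // that iteration can never observe a half-finished add.
  template <typename Closure>
  void oops_do(Closure f) {
    std::lock_guard<std::mutex> ml(_lock);
    for (Oop*& h : _handles) {
      f(&h);
    }
  }
};

The point is that locking only the writer, as the current code does, cannot protect the unlocked reader; both sides have to agree on the lock, or on a memory-ordering protocol.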

Comments
Can you use src/share/utilities/chunkedList.hpp for the thread-safe list? It's similar to what you added.
02-02-2017

Sure, I'll take care of it
01-02-2017

[~ehelin] My approach above doesn't work because of lock ordering issues with CMS (no surprise!) and then it hangs in some tests. I think your approach is best, except can the ChunkHandleList be in the .cpp file, since it's a special implementation for CLD? There is also no remove function; one isn't needed right now, but we do have an existing (unfiled) bug where we leak resolved references with redefinition. This would prevent fixing that by adding a remove, wouldn't it? Or should remove just zero the oop in the block? Can you take this bug back? I don't want it anymore!
31-01-2017

--- a/src/share/vm/classfile/classLoaderData.cpp
+++ b/src/share/vm/classfile/classLoaderData.cpp
@@ -144,6 +144,8 @@
     return;
   }
 
+  // G1 traverses this concurrently with mutator threads.
+  MutexLockerEx ml(metaspace_lock(), Mutex::_no_safepoint_check_flag);
   f->do_oop(&_class_loader);
   _dependencies.oops_do(f);
   if (_handles != NULL) {

I'm not entirely convinced that walking dependencies has all the proper order access operations either, and this shouldn't be a hot lock. The only concern I had was running into a lock-ordering problem, but running the gc-test-suite didn't uncover any assertions wrt. lock ranks. I'll run more tests (please suggest!). Some of this code is going to change in jdk10, so it doesn't seem like adding a new type would be a good idea.
25-01-2017

[~coleenp] sure, no problem. I did write a patch and uploaded it to http://cr.openjdk.java.net/~ehelin/8168914/00/ (passes JPRT). If you prefer another approach, then go for that, I don't mind :) Please remember that JNIHandleBlock is also used "thread-locally", so take care not to add synchronization costs when a JNIHandleBlock is used by just one thread. I've assigned the bug to you and moved it to the runtime sub-component.
25-01-2017

[~ehelin] I do not think you should implement another type of oop block mechanism, and give the GC another type of oop thing to walk. That's what JNIHandleBlock is. [~kbarrett] and I were discussing a simple fix to this yesterday. Can I take this bug since it is in the Runtime code?
25-01-2017

[~ehelin] Do we have other concurrent chunked list implementations in the JVM that can be reused? I feel like we keep adding code like this. This also ties into our plans to not use jobjects for metadata. Wouldn't it be better to fix JNIHandleBlock to do memory-ordered adds? JNIHandleBlocks really are chunked list implementations. It seems like for jdk9 you should just add the requisite load_ptr_acquire and store releases to JNIHandleBlock and do a larger change in 10.
20-01-2017

[~ehelin] you probably refer to this code in JNIHandleBlock::allocate_handle:

  // Try last block
  if (_last->_top < block_size_in_oops) {
    oop* handle = &(_last->_handles)[_last->_top++];
    // XXX _last->_handles[_last->_top] now points to unassigned memory until the next assignment
    *handle = obj;
    return (jobject) handle;
  }

This is also the place we've recently found with [~tonyp]. I added a delay between incrementing _top and assigning the new handle but still couldn't reproduce so far.
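To make the suspected race and a possible memory-ordered fix concrete, here is a self-contained C++11 model (a sketch only, not the actual HotSpot change; std::atomic stands in for the OrderAccess primitives mentioned above, and the writer is assumed to be serialized by the metaspace_lock):

#include <atomic>
#include <cstddef>

struct Oop {};  // stand-in for an oop

struct BlockModel {
  static const size_t block_size_in_oops = 32;
  Oop* _handles[block_size_in_oops];
  std::atomic<size_t> _top{0};

  // The code quoted above bumps _top *before* storing the oop, so a
  // concurrent reader can see the new _top while the slot still holds
  // garbage. The fixed ordering is: fill the slot, then publish _top.
  Oop** allocate_handle(Oop* obj) {
    size_t top = _top.load(std::memory_order_relaxed);  // single writer
    if (top < block_size_in_oops) {
      _handles[top] = obj;                              // 1. fill the slot
      _top.store(top + 1, std::memory_order_release);   // 2. then publish it
      return &_handles[top];
    }
    return nullptr;  // spilling into a fresh block is elided here
  }

  // Concurrent marking side: an acquire load of _top pairs with the release
  // store above, so every slot below 'top' is guaranteed fully written.
  template <typename Closure>
  void oops_do(Closure f) {
    size_t top = _top.load(std::memory_order_acquire);
    for (size_t i = 0; i < top; i++) {
      f(&_handles[i]);
    }
  }
};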
20-01-2017

[~stefank] and I read through the code. There is no synchronization in JNIHandleBlock when adding an oop. Hence, if a concurrent marking thread is iterating over the oops in a JNIHandleBlock while _top is being incremented by a class loading thread, then exactly this kind of crash can happen. I think the best way to solve this is to just rip out JNIHandleBlock from ClassLoaderData and use a lock-free chunked linked list that can handle concurrent iteration and appending. I can take a look at that.
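Roughly the kind of list I have in mind, as a self-contained C++11 sketch (illustrative only: all names are invented here, std::atomic stands in for HotSpot's ordering primitives, and appends are assumed to remain serialized by the metaspace_lock so that only iteration has to be lock-free):

#include <atomic>
#include <cstddef>

struct Oop {};  // stand-in for an oop

class ChunkedListModel {
  struct Chunk {
    static const size_t CAPACITY = 30;
    Oop* _data[CAPACITY];
    std::atomic<size_t> _size{0};  // published with release semantics
    Chunk* const _next;            // older chunk; immutable once set
    explicit Chunk(Chunk* next) : _next(next) {}
  };
  std::atomic<Chunk*> _head{nullptr};  // newest chunk

 public:
  // Append: externally serialized (e.g. by the metaspace_lock), so plain
  // loads suffice on this side; only the stores that make new data
  // reachable to concurrent readers need release semantics.
  Oop** add(Oop* o) {
    Chunk* head = _head.load(std::memory_order_relaxed);
    if (head == nullptr ||
        head->_size.load(std::memory_order_relaxed) == Chunk::CAPACITY) {
      Chunk* c = new Chunk(head);                 // fully constructed first,
      _head.store(c, std::memory_order_release);  // then published
      head = c;
    }
    size_t n = head->_size.load(std::memory_order_relaxed);
    head->_data[n] = o;                                   // fill the slot,
    head->_size.store(n + 1, std::memory_order_release);  // then publish it
    return &head->_data[n];
  }

  // Iteration: safe to run concurrently with add(); a reader sees some
  // prefix of the handles, and every slot below a chunk's published _size
  // is guaranteed complete.
  template <typename Closure>
  void oops_do(Closure f) {
    for (Chunk* c = _head.load(std::memory_order_acquire);
         c != nullptr; c = c->_next) {
      size_t n = c->_size.load(std::memory_order_acquire);
      for (size_t i = 0; i < n; i++) {
        f(&c->_data[i]);
      }
    }
  }

  // A remove, if ever needed, could simply zero the slot in place so that
  // concurrent iteration never walks freed memory (the closure would then
  // have to tolerate NULL entries).
  static void remove(Oop** handle) { *handle = nullptr; }
};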
20-01-2017

I reread the description and yes, I don't see how VM_RedefineClasses needs to be involved. Any class loading will add a handle for resolved_references in parallel with concurrent marking. I guess the question is: when walking the JNIHandleBlock, is adding an oop reference to it atomic? The concurrent thread should see either nothing or a real oop. We should study that code. It's probably not atomic.
20-01-2017

[~coleenp]: thanks :) Yes, I've experimented with delays, without success so far. I'm however pretty sure now that the potential race between VM_RedefineClasses and the CMConcurrentMarkingTask which I described in the bug report cannot happen. That's because VM_RedefineClasses is a VM operation which is executed at a safepoint, where all the concurrent marking threads are stopped. Maybe the application also triggers normal class loading of classes in new class loaders or something similar?
20-01-2017

This will need to go through the deferral process if not fixed in 9.
20-01-2017

Oh, you can have it back. :) I think even adding the metaspace_lock to handles->oops_do() should go into jdk10 first and be backported. [~ehelin] is also working on a reproducer. Did you try adding delays in CLD::add_handle() and the G1 code that calls oops_do? There's a stress test for redefinition in jdk/test/java/lang/instrument/RedefineBigClass.sh. We have others that should be moved to the open once we have time for it.
19-01-2017

[~coleenp]: I agree. I'm still working on this, but I can't guarantee that I'll fix it before ZBB. If I manage to find a simple solution by then, we can still re-target this to 9 and fix it right there. Coleen, you've assigned this bug to yourself. Does that mean you've started working on it? Just asking to avoid duplicating work :)
19-01-2017

Given the rarity of this bug, I think it should be fixed in 10 and backported to 9 after more testing. [~simonis] do you disagree?
19-01-2017

I'd be surprised if the metaspace_lock gets very hot unless there is a lot of class loading. I think your crash would reproduce more frequently if there was a lot of contention on this lock. I worry more about lock ranking here. [~simonis], did you try this and what did you find? We're getting close to the jdk9 ZBB code freeze.
09-12-2016