United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-6419370 : new jmethodID code has tiny holes in synchronization

Details
Type:
Bug
Submit Date:
2006-04-28
Status:
Resolved
Updated Date:
2011-09-22
Project Name:
JDK
Resolved Date:
2009-09-30
Component:
hotspot
OS:
generic
Sub-Component:
jvmti
CPU:
generic
Priority:
P4
Resolution:
Fixed
Affected Versions:
6
Fixed Versions:
hs17 (b02)

Related Reports
Backport:
Backport:
Relates:
Relates:
Relates:

Sub Tasks

Description
We (Robert, Dan and Serguei) after a email discussion agreed that there are
some holes in syncronization of the following new jmethodID code:

src/share/vm/oops/instanceKlass.cpp:

913 // Lookup or create a jmethodID
914 jmethodID instanceKlass::jmethod_id_for_impl(instanceKlassHandle ik_h, methodHandle method_h) {
915   size_t idnum = (size_t)method_h->method_idnum();
916   jmethodID* jmeths = ik_h->methods_jmethod_ids_acquire();
917   size_t length = 0;                            // first assignment of length is debugging crumb 
918   jmethodID id = NULL;
919   // array length stored in first element, other elements offset by one
920   if (jmeths == NULL ||                         // If there is no jmethodID array,
921       (length = (size_t)jmeths[0]) <= idnum ||  // or if it is too short,
922       (id = jmeths[idnum+1]) == NULL) {         // or if this jmethodID isn't allocated
923     MutexLocker ml(JNIIdentifier_lock);         // then do double check locking
924     // Retry lookup after we got the lock
925     jmeths = ik_h->methods_jmethod_ids_acquire();
926     if (jmeths == NULL || (length = (size_t)jmeths[0]) <= idnum) {
927       size_t size = MAX2(idnum+1, (size_t)ik_h->idnum_allocated_count());
928       jmethodID* new_jmeths = NEW_C_HEAP_ARRAY(jmethodID, size+1);
929       memset(new_jmeths, 0, (size+1)*sizeof(jmethodID));
930       if (jmeths != NULL) {
931         // We have grown the array: copy the existing entries, and delete the old array
932         for (size_t index = 0; index < length; index++) {
933           new_jmeths[index+1] = jmeths[index+1];
934         }
935         FreeHeap(jmeths);
936       }
937       new_jmeths[0] =(jmethodID)size;
938       ik_h->release_set_methods_jmethod_ids(jmeths = new_jmeths);
939     } else {
940       id = jmeths[idnum+1];
941     }
942     if (id == NULL) {
943       if (method_h->is_old() && !method_h->is_obsolete()) {
944         // The method passed in is old (but not obsolete), we need to use the current version
945         methodOop current_method = ik_h->method_with_idnum(idnum);
946         assert(current_method != NULL, "old and but not obsolete, so should exist");
947         methodHandle current_method_h(current_method == NULL? method_h() : current_method);
948         id = JNIHandles::make_jmethod_id(current_method_h);
949       } else {
950         // It is the current version of the method or an obsolete method, 
951         // use the version passed in
952         id = JNIHandles::make_jmethod_id(method_h);
953       }
954       jmeths[idnum+1] = id;
955     }
956   }
957   return id;
958 }

961 // Lookup a jmethodID, NULL if not found.  Do no blocking, no allocations, no handles
962 jmethodID instanceKlass::jmethod_id_or_null(methodOop method) {
963   size_t idnum = (size_t)method->method_idnum();
964   jmethodID* jmeths = methods_jmethod_ids_acquire();
965   size_t length;                                // length assigned as debugging crumb 
966   jmethodID id = NULL;
967   if (jmeths != NULL &&                         // If there is a jmethodID array,
968       (length = (size_t)jmeths[0]) > idnum) {   // and if it is long enough,
969     id = jmeths[idnum+1];                       // Look up the id (may be NULL)
970   }
971   return id;
972 }

975 // Cache an itable index
976 void instanceKlass::set_cached_itable_index(size_t idnum, int index) {
977   int* indices = methods_cached_itable_indices_acquire();
978   if (indices == NULL ||                         // If there is no index array,
979       ((size_t)indices[0]) <= idnum) {           // or if it is too short
980     // Lock before we allocate the array so we don't leak
981     MutexLocker ml(JNIIdentifier_lock);      
982     // Retry lookup after we got the lock
983     indices = methods_cached_itable_indices_acquire();
984     size_t length = 0;
985     // array length stored in first element, other elements offset by one
986     if (indices == NULL || (length = (size_t)indices[0]) <= idnum) {
987       size_t size = MAX2(idnum+1, (size_t)idnum_allocated_count());
988       int* new_indices = NEW_C_HEAP_ARRAY(int, size+1);
989       // Copy the existing entries, if any
990       size_t i;
991       for (i = 0; i < length; i++) {
992         new_indices[i+1] = indices[i+1];
993       }
994       // Set all the rest to -1
995       for (i = length; i < size; i++) {
996         new_indices[i+1] = -1;
997       }
998       if (indices != NULL) {
999         FreeHeap(indices);  // delete any old indices
1000       }
1001       release_set_methods_cached_itable_indices(indices = new_indices);
1002     }
1003   }
1004   // This is a cache, if there is a race to set it, it doesn't matter
1005   indices[idnum+1] = index; 
1006 }

1009 // Retrieve a cached itable index
1010 int instanceKlass::cached_itable_index(size_t idnum) {
1011   int* indices = methods_cached_itable_indices_acquire();
1012   if (indices != NULL && ((size_t)indices[0]) > idnum) {
1013      // indices exist and are long enough, retrieve possible cached
1014     return indices[idnum+1]; 
1015   }
1016   return -1;
1017 }

Please, see some fragments of our email discussion below.

-------- Original Message --------
Subject: 	Re: [Fwd: Re: OK FOLKS. I really need reviewers. TODAY! 
PLEASE! [Method add/delete on redefine]]
Date: 	Wed, 26 Apr 2006 17:00:01 -0700
Wrom: MNNSKVFVWRKJVZCMHVIBGDADRZFSQHYUCDDJBLVLMH
To: 	Serguei Spitsyn <###@###.###>, Serviceability Group 
<###@###.###>
References: 	<###@###.###>


Serguei Spitsyn wrote:

> Hi Robert,
>
> You have not replied on this email.
> I just want to be sure you remember about this.

Rats! I had let this fall through the cracks, thanks for re-sending.

> Integrating your fixes is urgent and more important,
> so I'm Ok if you reply or resolve the sync issues later.

Good.  That is the approach we have decided to take.  See below.

> Let me know if you disagree with my observation.

I don't.

>
> Thanks,
> Serguei
>
> -------- Original Message --------
> Subject: Re: OK FOLKS. I really need reviewers. TODAY! PLEASE! [Method add/delete on redefine]
> Date: Tue, 25 Apr 2006 23:35:33 -0700
> Wrom: AALPTCXLYRWTQTIPWIGYOKSTTZRCLBDXRQBGJSNBOHMKHJY
> To: Robert Field <###@###.###>
>
> Robert Field wrote:
>> Serguei Spitsyn wrote:
>>
>>> Hi Robert,
> ...
>>> 5. Setting the index must be protected by locking the JNIIdentifier_lock.
>>
>> It doesn't need to be protected, I have added this comment:
>>       // This is a cache, if there is a race to set it, it doesn't matter
>> The whole point of having it is the speed of JNI static invokes, so 
>> we don't want to lock unless it is critical to do so.
>
> Sure, performance is important.
> But let's consder this:
>   Thread T1 executes the line (deallocates current indeces):
>     987         FreeHeap(indices);  // delete any old indices
>   After that thread T2 allocates new heap array with NEW_C_HEAP_ARRAY(),
>   gets the same range of addresses and fill in new values.
>   Thread T3 concurrently with T1 and T2 executes the line:
>     992   indices[idnum+1] = index;
>
>   There is still a small probability that T3 sets the new index into
>   newly allocated heap array. There can be a crash or even something 
> worse.
>   Such errors are very hard to debug.
>   Please, let me know if you think such a scenario is not possible.
>
>>> src/share/vm/oops/instanceKlass.cpp:
>>>  992   indices[idnum+1] = index;
> ...
>
>>> 7. Depending on the usage we may want to protect this function with the
>>>    JNIIdentifier_lock as well. (Need to check if it is really needed).
>>>
>> See above
>
>
> I'm afraid a similar scenario is possible as considered above.
> The T1 and T2 threads do the same as above:
>   - T1 deallocates the 'indeces' array
>   - T2 allocates new heap array, gets the same range of addresses
>     and set new values into it
>   - T3 (cached_itable_index) returns wrong value from the 
> indices[idnum+1].

There is indeed a possible crash here and same in the jmethodID code.  
Fortunately, it would be very very rare: it would require (I'll use 
jmethodID)--

    * A jmethodID is created in class C [array is created]
    * Then C is redefined with an added or obsoleted method
      [idnum_allocated_count increased]
    * T1 wants a jmethodID, and between the time it retrieves the
      jmethodID array and it dereferences it (a few simple instructions) ...
    * T2 comes along and wants a jmethodID for one of the new or
      obsolete methods (idnum larger than before) [jmethodID grows]
    * Somebody scribbles on the released array
    * Then T1 resumes and uses the scribbled on array - boom

If you can write a test that makes that happen, I'll buy you a ticket to Hawaii!

The right way to fix this would be to have redefine always grow the 
array (if it exists) while at a safepoint.   Doing this would require a 
full run of tests.

We decided that compared to the known and reported jmethodID crash: get 
a jmethodID, unload its class, use the jmethodID -- boom
that this fixes (as a side-effect) that it was OK to submit now.  And 
fix this later.

------------------------------------------------------------------------------------
This is an observation from Dan:

Greetings,

Okay, we have really "smart" double check locking going on and
that has opened a potential race or has it? Before we go tweaking
the code, I want to make sure we understand the problem from all
the different angles.

The core problem pointed out by Serguei's diligent code review is:

- _methods_jmethod_ids refers to an array of jmethodIDs or NULL
- _methods_cached_itable_indices refers to an array of ints or NULL
- because of our double check locking algorithm, Thread1 can be
  accessing one of the above arrays while Thread2 has just freed
  it with FreeHeap

Each of the two arrays goes through two types of transitions:

- from NULL to non-NULL
- from non-NULL to bigger non-NULL

The transition from NULL to non-NULL is safe for multi-threaded use
when the idnum for indexing the array is within the initial expected
size of the array (initial method count):

- the initial thread down the path sees the NULL, grabs the
  JNIIdentifier_lock and starts the array allocation procedure
- if a competing thread sees the NULL value, it heads down the
  JNIIdentifier_lock path also and blocks
- the newly allocated arrays are not advertised via the
  release_set_*() calls until after the array length element @0
  and the array values have been initialized.
- if a competing thread sees a non-NULL value:
  - for _methods_jmethod_ids array
    - if the requested idnum is out of range (it won't be for this
      case) or if the jmethodID value for requested idnum is NULL,
      then the competing thread heads down the JNIIdentifier_lock path
    - otherwise, the fetched jmethodID value is returned
  - for_methods_cached_itable_indices
    - if the requested idnum is out of range (it won't be for this
      case), then the competing thread heads down the
      JNIIdentifier_lock path
    - otherwise, the specified index value is cached

The above transition is safe because competing threads either see
stable data that they can use or they head down the locking path
when there is more work to do.

The transition from NULL to non-NULL when one of the competing
threads is looking for an idnum slot outside the initial expected
size of the array (initial method count) is the same as a non-NULL
to bigger non-NULL transition.

Finally let's look at non-NULL to bigger non-NULL transitions.
Thread1 in this case is looking for an idnum slot that is past the
current size of the array so it heads down the JNIIdentifier_lock
path. Thread2 in this case is looking for an idnum slot that is
within the current size of the array. Here is a transaction diagram:

  Thread1                                      Thread2
  ===========================================  ===================
  allocates new array                          sees non-NULL array
  copies old array contents to new array       :
  frees old array                              :
  uses release_set_*() to advertise new array  fetches element 0

Just the fetch of the length element (array element 0) is a problem.
After the FreeHeap() call, the memory could be:

- still there with all the old element 0 value
- reused for something else so a bogus element 0 value

For _methods_jmethod_ids array, we are going to fetch a potentially
bogus jmethodID value.
For _methods_cached_itable_indices array, we are going to store
in memory that isn't ours anymore.

Looks bad and it is bad. There is only one possible saving grace.
If the call that causes the array to grow comes from code that
is called only at a safepoint, then I don't think there can be
other competing threads.

jmethodID_for_impl() is only called by methodOopDesc::jmethod_id().
set_cached_itable_index() is only called by
methodOopDesc::set_cached_itable_index(). I don't think we can
guarantee that jmethod_id() and set_cached_itable_index() are only
called from a safepoint. So the existing code indeed has this
wonderful race. For those of you that were already convinced, sorry
I took you on a long hike...

We can't guarantee all the calls happen safely, but what if the first
jmethod_id() call and set_cached_itable_index() call happens in
RedefineClasses() by the doit() code path, then things are safe.

On further thought, these "initializing" calls only need to be made if
the instanceKlass already has a non-NULL _methods_jmethod_ids array
and/or _methods_cached_itable_indices array. If they don't exist, then
we are in the NULL to non-NULL transition case and things are safe.

This is all kind of stream of consciousness so I have no idea whether
there are hole here also or not.

Comments, questions, corrections?

Dan

                                    

Comments
EVALUATION

As for the cached itable stuff, the only code changes between
the code in the description and HSX-16 is the following:

% diff -w cached_itable_1.306.1.6 cached_itable_hsx-16
1,2d0
<
<
9c7
<     MutexLocker ml(JNIIdentifier_lock);
---
>     MutexLocker ml(JNICachedItableIndex_lock);
30a29,30
>   } else {
>     CHECK_UNHANDLED_OOPS_ONLY(Thread::current()->clear_unhandled_oops());

Thread #1 sees that it needs to allocate a new indices cache
and heads down that path. The new cache is allocated on line
1087, the old cache is freed on line 1098 and the new cache
is advertised for use on line 1100.

Meanwhile thread #2 sees the non-NULL indices cache value on
line 1077 (before thread #1 reaches line 1100), sees that its
idnum is in range on line 1078, and gets ready to update the
cache on line 1106. Thread #2 decides to go for coffee...

While thread #2 is between line 1078 and line 1106, thread #1
has made it all the way to line 1098 and has freed the old
indices cache. Thread #3 allocates some memory and gets the
block that contained the old indices cache and writes stuff
into it.

Thread #2 returns from coffee and finally decides to execute
line 1106 and writes into the old indices cache and chaos
ensues...


   1077   if (indices == NULL ||                         // If there is no index array,
   1078       ((size_t)indices[0]) <= idnum) {           // or if it is too short

<snip>

   1087       int* new_indices = NEW_C_HEAP_ARRAY(int, size+1);

<snip>

   1098         FreeHeap(indices);  // delete any old indices

<snip>

   1100       release_set_methods_cached_itable_indices(indices = new_indices);

<snip>

   1105   // This is a cache, if there is a race to set it, it doesn't matter
   1106   indices[idnum+1] = index;
                                     
2009-08-22
EVALUATION

In comment note #1 dated 2006-04-28 04:20:40 MDT, Serguei
documents the idea that the jmethodID code can deadlock. I
believe that the deadlock was fixed by the work for the
following bug:

    6447640 2/2 methodOopDesc::jmethod_id() can deadlock when
                used by the VMThread

However, I believe that the stale jmethodID fetch race is
still present in the current code base (as of HSX-16-B08).
In that race, thread #1 sees that it needs to allocate a
new jmeths array and heads down that path. The new array
is allocated on line 991, advertised for use on line 1039
and the old jmeths array is freed on line 1052.

Meanwhile thread #2 sees the non-NULL jmeths value on line 979
(before thread #1 reaches line 1039), sees that its idnum is
in range on line 980, and gets ready to fetch the jmethodID
value on line 981. Thread #2 decides to go for coffee...

While thread #2 is between line 980 and line 981, thread #1
has made it all the way to line 1052 and has freed the old
jmeths array. Thread #3 allocates some memory and gets the
block that contained the old jmeths array and writes stuff
into it.

Thread #2 returns from coffee and finally decides to execute
line 981 and fetches a now stale jmethodID value. Bad jmethodID
is returned and chaos ensues...


    979   if (jmeths == NULL ||                         // If there is no jmethodID array,
    980       (length = (size_t)jmeths[0]) <= idnum ||  // or if it is too short,
    981       (id = jmeths[idnum+1]) == NULL) {         // or if this jmethodID isn't allocated

<snip>

    991       new_jmeths = NEW_C_HEAP_ARRAY(jmethodID, size+1);
    992       memset(new_jmeths, 0, (size+1)*sizeof(jmethodID));

<snip>

   1039     ik_h->release_set_methods_jmethod_ids(jmeths = new_jmeths);

<snip>

   1052   FreeHeap(to_dealloc_jmeths);
                                     
2009-08-22
EVALUATION

The two caches of interest to this bug are defined in
src/share/vm/oops/instanceKlass.hpp: class instanceKlass:

    jmethodID* _methods_jmethod_ids;
    int*       _methods_cached_itable_indices;

There are reader races with both caches and there is a writer race with
_methods_cached_itable_indices. The stale JNI itable index write race
described in Evaluation note #2 is the nasty one of the races because
it can corrupt the malloc memory pool. See the following bug for
details on the malloc memory pool corruption:

    6875393 2/3 JNI itable index cache is broken

Here are the JNI itable index cache issues:

1a - the writer race as described in Evaluation note #2
1b - deallocation of the old cache while JNICachedItableIndex_lock is held
1c - stale cache read on old line 1078
1d - missing cache size initialization (fixed by 6875393)

Issue 1a)
    The writer race does not occur in the regular system since the
    cache only goes from NULL to non-NULL so the regular system is
    unaffected by this part of the fix.

    When JVM/TI RedefineClasses() is used, the cache can go from
    non-NULL to bigger non-NULL and that is when the writer race comes
    into play. The solution is for all cache accesses to acquire the
    lock when JVM/TI RedefineClasses() is used. See new line 1187.

Issue 1b)
    The cache is never resized by the regular system so this issue only
    comes into play when JVM/TI RedefineClasses() is used.

    The JNICachedItableIndex_lock does not have to be held when the old
    cache is freed so that operation should be moved outside of the
    locked region. In fact, if FreeHeap() is considered safepointable
    because of a possible malloc subsystem lock grab, then the VMThread
    might still deadlock with the FreeHeap() call where it is.

    The solution is to save a reference to the old cache in the locked
    region and free it outside the locked region. See new lines 1170,
    1210 and 1232.

Issue 1c)
    When JVM/TI RedefineClasses() is used, the cache size check on old
    line 1078 can access the old cache instead of the new cache which
    can lead to an incorrect assumption about the existing cache size.
    The solution (from 1a) is for all cache accesses to acquire the
    lock when JVM/TI RedefineClasses() is used. See new line 1187.

Issue 1d)
    Fixed by 6875393 on new line 1197 so nothing more to say here.

Onward to the other cache...


The stale jmethodID fetch race described in Evaluation note #1 is
generally considered less nasty since it doesn't directly cause a
crash. However, is stale and/or corrupted data that is not easily
detected better or worse than a crash? I guess that depends on your
personal point of view...

Here are the jmethodID cache issues:

2a - the stale jmethodID fetch race described in Evaluation note #1
2b - stale cache reads on old lines 980 and 981
2c - incomplete jmethodID object on old line 981
2d - deallocation of the old cache or new jmethodID while
     JmethodIdCreation_lock is held

Issue 2a)
    The stale jmethodID fetch race does not occur in the regular system
    since the cache only goes from NULL to non-NULL so the regular
    system is unaffected by this part of the fix.

    When JVM/TI RedefineClasses() is used, the cache can go from
    non-NULL to bigger non-NULL and that is when the stale jmethodID
    fetch race comes into play. The solution is for all cache accesses
    to acquire the lock when JVM/TI RedefineClasses() is used (see
    footnote 1). See new lines 1003-1011.

Issue 2b)
    When JVM/TI RedefineClasses() is used, the cache size check on old
    line 980 and the cache element fetch on old line 981 can access the
    old cache instead of the new cache which can lead to an incorrect
    assumption about the existing cache size or lead to a fetch of a
    stale jmethodID value.

    The solution (from 2a) is for all cache accesses to acquire the
    lock when JVM/TI RedefineClasses() is used (see footnote 1). See
    new lines 1003-1011.

Issue 2c)
    The cache element fetch on old line 981 or new line 1002 will see a
    new jmethodID's address that was stored on old line 1046. However,
    because the cache read does not acquire a lock, there is no
    guarantee the jmethodID object referred to by that address is
    complete. This issue affects both the regular system and when JVM/TI
    RedefineClasses() is used.

    The solution is to call OrderAccess::release_store_ptr() on new
    line 1126. That call will make sure the jmethodID object is
    complete before the jmethodID object's address is advertised.

Issue 2d)
    The cache is never resized by the regular system so that half of
    this issue only comes into play when JVM/TI RedefineClasses() is
    used. However, jmethodID allocation is done by both the regular
    system and when JVM/TI RedefineClasses() is used so that half of
    this issue always applies.

    The JmethodIdCreation_lock does not have to be held when the old
    cache or the new jmethodID is freed so those operations should be
    moved outside of the locked region. In fact, if FreeHeap() or
    destroy_jmethod_id() are considered safepointable because of a
    possible malloc subsystem lock grab or JNI handle subsystem lock
    grab, then the VMThread might still deadlock with the FreeHeap()
    and/or destroy_jmethod_id() calls where they are.

    The solution is to save a reference to the old cache and the new
    jmethodID in the locked region and free them outside the locked
    region. See new lines 1027 and 1069 for the old cache resolution.
    See new lines 1028 and 1073 for the new jmethodID resolution.


Footnote 1:
    The above evaluation talks about acquiring the lock for "all cache
    accesses" which is true and safe for the JNI itable index cache.
    However, the jmethodID cache is also used by the VMThread so
    "acquiring the lock" is done more cautiously. If we are single
    threaded or at a safepoint, then the lock is not acquired because
    it is not needed (single threaded case) or it is not safe (at a
    safepoint case).

Update: updated line numbers to match code review round 1.
                                     
2009-09-01
SUGGESTED FIX

See proposed fix in 6419370-webrev-cr0.tgz attachment.
                                     
2009-09-01
SUGGESTED FIX

See revised proposed fix in 6419370-webrev-cr1.tgz attachment.

I made some tweaks to the comments and code based on feedback
from the round 0 reviewers. In particular, Karen asked if we
could narrow the locking in the JVM/TI RedefineClasses() case
down to just classes that have actually been redefined. It
turned out to be easier to do it that way instead of the big
hammer way that I was doing before.
                                     
2009-09-11
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-rt/hotspot/rev/74a5db69c1fe
                                     
2009-09-21



Hardware and Software, Engineered to Work Together