JDK-8008766 : G1: Non-sensical negative target marking step times
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • Submitted: 2013-02-23
  • Updated: 2013-03-04
  • Resolved: 2013-03-04
Related Reports
Relates :  
Description
The issue seen in 8008546 was a guarantee failure in the number sequences that are used for tracking the differences between the specified marking step time and the elapsed time of a marking task that aborts due to a time out.

When a marking task starts it uses this number sequence to generate a time difference that is subtracted from the given task time. The result is the target time of the marking task. The marking task will time out if a regular_clock_call() (called frequently during the marking task) detects that the elapsed time has exceeded the target.

The number sequence keeps track of the difference between the elapsed task time and the target time.

We do this to try to avoid going over the specified time target. For example, if the given time target is 10ms, we subtract a value predicted from the number sequence (say 0.5ms) to give a marking step target time of 9.5ms. The first regular clock call after an elapsed time of 9.5ms will cause the task to time out. Suppose the task times out after 9.7ms; the difference (9.7 - 9.5) is added to the number sequence. When the marking task restarts, we generate a new prediction of how much earlier than the given target we have to abort the marking step.

In 8008546, the user-specified value of G1ConfidencePercent caused this number sequence to generate increasingly divergent values (after the first incidence of exceeding the target), resulting in larger and larger values being stored in the number sequence.

The values started to diverge when the marking task exceeded its target time:

[0] >>>>>>>>>> START, call = 3,
     _num = 7, _sum =   1.595, _sum_of_squares =   0.519
     _davg =   0.242, _dvariance =   0.012, _alpha =   0.700
         _length = 10, _next = 7

        [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 [4]=  0.382
        [5]=  0.189    [6]=  0.226    [7]=  0.000    [8]=  0.000 [9]=  0.000
davg: 0.2421194, sig: 3.0000000, dsd: 0.1100790, first: 0.5723563
davg: 0.2421194, num: 7, cf: 1.0000000, second: 0.2421194
Result: 0.5723563
[0] >>>>>>>>>> START, call = 3, target = 9.43ms >>>>>>>>>>
[0] Adding 16.8998093 to counter
[0] <<<<<<<<<< ABORTING, target = 9.43ms, elapsed = 26.33ms <<<<<<<<<< 

In the instrumentation above, the target marking step time was 9.43ms. The task actually aborted after an elapsed time of 26.33ms. This resulted in 16.9ms being added into the number sequence.

When the marking task restarted:

[0] >>>>>>>>>> START, call = 4,
     _num = 8, _sum =  18.495, _sum_of_squares = 286.122
     _davg =   5.239, _dvariance =  40.798, _alpha =   0.700
         _length = 10, _next = 8

        [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 [4]=  0.382
        [5]=  0.189    [6]=  0.226    [7]= 16.900    [8]=  0.000 [9]=  0.000
davg: 5.2394264, sig: 3.0000000, dsd: 6.3873188, first: 24.4013827
davg: 5.2394264, num: 8, cf: 1.0000000, second: 5.2394264
Result: 24.4013827
[0] >>>>>>>>>> START, call = 4, target = -14.40ms >>>>>>>>>>
[0] Adding 14.5933617 to counter
[0] <<<<<<<<<< ABORTING, target = -14.40ms, elapsed = 0.19ms <<<<<<<<<<

The recently added large value skewed the prediction of how early we have to abort marking to avoid exceeding the specified marking step time: the prediction came back as 24.40ms. Since the specified marking step time was 10ms, we would have to abort marking 14.40ms BEFORE the marking step started.

As a result, the first regular_clock_call() aborted the marking step after only 0.19ms. Since the task timed out, we add the difference (0.19 - (-14.40)) to the number sequence, and this new large value skews the prediction yet again.

There are a couple of ways to deal with this: either stop the marking task's target time from becoming negative, or do not add large values to the number sequence.
Comments
Non issue.
04-03-2013

Fix is easy but isn't necessary.
04-03-2013

Even with the fix for 8008546, we can still get negative target times when the elapsed time for the marking task is significantly larger than the target time:

[0] >>>>>>>>>> START, call = 4,
     _num = 10, _sum =   1.416, _sum_of_squares =   0.402
     _davg =   0.139, _dvariance =   0.009, _alpha =   0.700
         _length = 10, _next = 7

        [0]=  0.034    [1]=  0.330    [2]=  0.004    [3]=  0.040 [4]=  0.319
        [5]=  0.129    [6]=  0.094    [7]=  0.007    [8]=  0.400 [9]=  0.060
davg: 0.1391693, sig: 1.0000000, dsd: 0.0924721, first: 0.2316414
davg: 0.1391693, num: 10, cf: 1.0000000, second: 0.1391693
Result: 0.2316414
[0] >>>>>>>>>> START, call = 4, target = 9.77ms >>>>>>>>>>
[0] Adding 17.9910714 to counter
[0] <<<<<<<<<< ABORTING, target = 9.77ms, elapsed = 27.76ms <<<<<<<<<<

[0] >>>>>>>>>> START, call = 5,
     _num = 10, _sum =  19.400, _sum_of_squares = 324.081
     _davg =   5.495, _dvariance =  46.853, _alpha =   0.700
         _length = 10, _next = 8

        [0]=  0.034    [1]=  0.330    [2]=  0.004    [3]=  0.040 [4]=  0.319
        [5]=  0.129    [6]=  0.094    [7]= 17.991    [8]=  0.400 [9]=  0.060
davg: 5.4947399, sig: 1.0000000, dsd: 6.8449599, first: 12.3396998
davg: 5.4947399, num: 10, cf: 1.0000000, second: 5.4947399
Result: 12.3396998
[0] >>>>>>>>>> START, call = 5, target = -2.34ms >>>>>>>>>>
[0] Adding 2.5294458 to counter
[0] <<<<<<<<<< ABORTING, target = -2.34ms, elapsed = 0.19ms <<<<<<<<<<

[0] >>>>>>>>>> START, call = 6,
     _num = 10, _sum =  21.529, _sum_of_squares = 330.319
     _davg =   4.605, _dvariance =  34.090, _alpha =   0.700
         _length = 10, _next = 9

        [0]=  0.034    [1]=  0.330    [2]=  0.004    [3]=  0.040 [4]=  0.319
        [5]=  0.129    [6]=  0.094    [7]= 17.991    [8]=  2.529 [9]=  0.060
davg: 4.6051517, sig: 1.0000000, dsd: 5.8386642, first: 10.4438159
davg: 4.6051517, num: 10, cf: 1.0000000, second: 4.6051517
Result: 10.4438159
[0] >>>>>>>>>> START, call = 6, target = -0.44ms >>>>>>>>>>
[0] Adding 0.6238039 to counter
[0] <<<<<<<<<< ABORTING, target = -0.44ms, elapsed = 0.18ms <<<<<<<<<<

[0] >>>>>>>>>> START, call = 7,
     _num = 10, _sum =  22.093, _sum_of_squares = 330.704
     _davg =   3.411, _dvariance =  26.193, _alpha =   0.700
         _length = 10, _next = 0

        [0]=  0.034    [1]=  0.330    [2]=  0.004    [3]=  0.040 [4]=  0.319
        [5]=  0.129    [6]=  0.094    [7]= 17.991    [8]=  2.529 [9]=  0.624
davg: 3.4107473, sig: 1.0000000, dsd: 5.1179210, first: 8.5286684
davg: 3.4107473, num: 10, cf: 1.0000000, second: 3.4107473
Result: 8.5286684
[0] >>>>>>>>>> START, call = 7, target = 1.47ms >>>>>>>>>>
[0] Adding 0.4176504 to counter
[0] <<<<<<<<<< ABORTING, target = 1.47ms, elapsed = 1.89ms <<<<<<<<<<

But the target times quickly revert to a positive value because the value of sigma() (sig in the output above) doesn't artificially boost the prediction values coming from the NumberSeq. Though a negative target time for the marking task does not make sense, it doesn't seem to hurt and the values quickly revert to positive. Closing as non-issue.
04-03-2013

We should prevent the target marking task times from becoming negative. Regardless of whether the target time is negative or zero, the first regular_clock_call() made by the task will see that the elapsed time is above the target and abort the task. A zero value for the task's target time will prevent the entries in the NumberSeq from becoming divergent.
04-03-2013

Log from a specjbb2005 run with G1ConfidencePercent=300. The entries in the number sequence grow larger and larger until we end up trying to add an infinity into the number sequence, causing the guarantee failure.
23-02-2013

Attaching log from test run.
23-02-2013

Email in response to the code review for 8008546:

Hi Bengt, Vladimir,

Bumping up the value of G1ConfidencePercent eventually caused the guarantee to fail. I believe I know what's going on here now. The following is tracing and instrumentation from CMTask::do_marking_step(), from the marking cycle where it starts to go wrong...

3.788: [GC concurrent-mark-start]
[0] >>>>>>>>>> START, call = 1,
     _num = 5, _sum =   1.180, _sum_of_squares =   0.432
     _davg =   0.275, _dvariance =   0.023, _alpha =   0.700
         _length = 10, _next = 5

        [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 [4]=  0.382
        [5]=  0.000    [6]=  0.000    [7]=  0.000    [8]=  0.000 [9]=  0.000
davg: 0.2747654, sig: 3.0000000, dsd: 0.1517040, first: 0.7298773
davg: 0.2747654, num: 5, cf: 1.0000000, second: 0.2747654
Result: 0.7298773
[0] >>>>>>>>>> START, call = 1, target = 9.27ms >>>>>>>>>>
[0] Adding 0.1885053 to counter
[0] <<<<<<<<<< ABORTING, target = 9.27ms, elapsed = 9.46ms <<<<<<<<<<

[0] >>>>>>>>>> START, call = 2,
     _num = 6, _sum =   1.369, _sum_of_squares =   0.467
     _davg =   0.249, _dvariance =   0.017, _alpha =   0.700
         _length = 10, _next = 6

        [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 [4]=  0.382
        [5]=  0.189    [6]=  0.000    [7]=  0.000    [8]=  0.000 [9]=  0.000
davg: 0.2488873, sig: 3.0000000, dsd: 0.1311627, first: 0.6423756
davg: 0.2488873, num: 6, cf: 1.0000000, second: 0.2488873
Result: 0.6423756
[0] >>>>>>>>>> START, call = 2, target = 9.36ms >>>>>>>>>>
[0] Adding 0.2263276 to counter
[0] <<<<<<<<<< ABORTING, target = 9.36ms, elapsed = 9.58ms <<<<<<<<<<

[0] >>>>>>>>>> START, call = 3,
     _num = 7, _sum =   1.595, _sum_of_squares =   0.519
     _davg =   0.242, _dvariance =   0.012, _alpha =   0.700
         _length = 10, _next = 7

        [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 [4]=  0.382
        [5]=  0.189    [6]=  0.226    [7]=  0.000    [8]=  0.000 [9]=  0.000
davg: 0.2421194, sig: 3.0000000, dsd: 0.1100790, first: 0.5723563
davg: 0.2421194, num: 7, cf: 1.0000000, second: 0.2421194
Result: 0.5723563
[0] >>>>>>>>>> START, call = 3, target = 9.43ms >>>>>>>>>>
[0] Adding 16.8998093 to counter
[0] <<<<<<<<<< ABORTING, target = 9.43ms, elapsed = 26.33ms <<<<<<<<<<

The step time difference prediction we obtain for call #1 was 0.73ms, resulting in a target step time of 9.27ms. The marking step aborts after 9.46ms and adds 0.189ms to the number sequence.

This causes the step time difference prediction for call #2 to be calculated as 0.642ms, giving a target step time of 9.36ms. The marking step aborts after an elapsed 9.58ms, and the difference (9.58 - 9.36) of 0.226ms is added to the number sequence.

This causes the step time difference prediction for call #3 to be calculated as 0.572ms, giving a target step time of 9.43ms. The marking step aborts after an elapsed time of 26.33ms. We add the difference (26.33 - 9.43) of 16.90ms to the number sequence.

This is where it gets interesting. In call 4:

[0] >>>>>>>>>> START, call = 4,
     _num = 8, _sum =  18.495, _sum_of_squares = 286.122
     _davg =   5.239, _dvariance =  40.798, _alpha =   0.700
         _length = 10, _next = 8

        [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 [4]=  0.382
        [5]=  0.189    [6]=  0.226    [7]= 16.900    [8]=  0.000 [9]=  0.000
davg: 5.2394264, sig: 3.0000000, dsd: 6.3873188, first: 24.4013827
davg: 5.2394264, num: 8, cf: 1.0000000, second: 5.2394264
Result: 24.4013827
[0] >>>>>>>>>> START, call = 4, target = -14.40ms >>>>>>>>>>
[0] Adding 14.5933617 to counter
[0] <<<<<<<<<< ABORTING, target = -14.40ms, elapsed = 0.19ms <<<<<<<<<<

the predicted step time difference (because of the large value added to the number sequence on the previous call) is calculated as 24.40ms. This gives a NEGATIVE target step time of -14.40ms. The marking step aborts after an extremely short 0.19ms. We add the difference (0.19 - (-14.40)) of 14.59ms to the number sequence.

From this point on the predicted step time difference gets larger and larger until we start seeing infinities for dsd, the sum of squares, etc. This gives a negative infinity as the step time target, and we attempt to add infinity into the number sequence, which results in a variance value of infinity.
Why was the elapsed time for call #3 so much longer than the target step time? My guess is that the marking step claimed a large enough object that it simply took that long to scan. (Note this was with a debug binary.)

Why does a higher G1ConfidencePercent value cause the problem? The confidence percent is used to calculate the value of G1CollectorPolicy::sigma(), which is used in the expression that predicts a new value from a given number sequence:

  return MAX2(seq->davg() + sigma() * seq->dsd(),
              seq->davg() * confidence_factor(seq->num()));

It's a multiplier of the decaying standard deviation, so a value larger than 100 will result in larger and larger values in the LHS of the MAX2.

Vladimir: your fix is good and a valid fix for this issue, but I wonder if we should also be preventing the target step time from becoming negative, e.g.:

[0] >>>>>>>>>> START, call = 17,
     _num = 10, _sum =   1.816, _sum_of_squares =   0.554
     _davg =   0.177, _dvariance =   0.009, _alpha =   0.700
         _length = 10, _next = 7

        [0]=  0.057    [1]=  0.541    [2]=  0.245    [3]=  0.084 [4]=  0.248
        [5]=  0.231    [6]=  0.085    [7]=  0.030    [8]=  0.035 [9]=  0.258
davg: 0.1774224, sig: 3.0000000, dsd: 0.0971153, first: 0.4687682
davg: 0.1774224, num: 10, cf: 1.0000000, second: 0.1774224
Result: 0.4687682
[0] >>>>>>>>>> START, call = 17, target = 9.53ms >>>>>>>>>>
[0] Adding 20.1697202 to counter
[0] <<<<<<<<<< ABORTING, target = 9.53ms, elapsed = 29.70ms <<<<<<<<<<

[0] >>>>>>>>>> START, call = 18,
     _num = 10, _sum =  21.955, _sum_of_squares = 407.371
     _davg =   6.175, _dvariance =  58.761, _alpha =   0.700
         _length = 10, _next = 8

        [0]=  0.057    [1]=  0.541    [2]=  0.245    [3]=  0.084 [4]=  0.248
        [5]=  0.231    [6]=  0.085    [7]= 20.170    [8]=  0.035 [9]=  0.258
davg: 6.1751117, sig: 3.0000000, dsd: 7.6655934, first: 29.1718919
davg: 6.1751117, num: 10, cf: 1.0000000, second: 6.1751117
Result: 29.1718919
[0] >>>>>>>>>> START, call = 18, target = 0.00ms >>>>>>>>>>
[0] Adding 0.1926280 to counter
[0] <<<<<<<<<< ABORTING, target = 0.00ms, elapsed = 0.19ms <<<<<<<<<<

This should prevent the values in the sequence diverging and reaching Inf. Or perhaps we shouldn't even add large step time differences to the number seq. Whatever we decide to do (if anything) should be done as a separate CR.

Regards,

JohnC

On 2/21/2013 10:50 AM, John Cuthbertson wrote:
> Hi Bengt,
>
> Thanks for the stack trace. It helps. The counter is obviously _marking_step_diffs_ms and the confidence value feeds into this via G1CollectorPolicy::get_new_prediction(). I ran jbb2005 with the flags you suggest and still no failure. I'm still a little confused though. If we look at TruncatedSeq::add() and AbsSeq::variance():
>
> void TruncatedSeq::add(double val) {
>   AbsSeq::add(val);
>
>   // get the oldest value in the sequence...
>   double old_val = _sequence[_next];
>   // ...remove it from the sum and sum of squares
>   _sum -= old_val;
>   _sum_of_squares -= old_val * old_val;
>
>   // ...and update them with the new value
>   _sum += val;
>   _sum_of_squares += val * val;
>
>   // now replace the old value with the new one
>   _sequence[_next] = val;
>   _next = (_next + 1) % _length;
>
>   // only increase it if the buffer is not full
>   if (_num < _length)
>     ++_num;
>
>   guarantee( variance() > -1.0, "variance should be >= 0" );
> }
>
> the guarantee can only trip if the result of variance() is -1.0 or less. Correct? What value is being returned by variance()?
>
> Now look at AbsSeq::variance():
>
> double AbsSeq::variance() const {
>   if (_num <= 1)
>     return 0.0;
>
>   double x_bar = avg();
>   double result = _sum_of_squares / total() - x_bar * x_bar;
>   if (result < 0.0) {
>     // due to loss-of-precision errors, the variance might be negative
>     // by a small bit
>
>     // guarantee(-0.1 < result && result < 0.0,
>     //           "if variance is negative, it should be very small");
>     result = 0.0;
>   }
>   return result;
> }
>
> How can this return a result of -1.0 or less? If result is negative then the return value should be 0.0 or a small-magnitude negative; not -1.0 or less.
> If the value of result is close enough to 0.0 to make the "result < 0.0" check evaluate to false, it shouldn't be able to make "variance() > -1.0" evaluate to false.
>
> Vladimir: I still think your change is good but I'm not sure it's a true fix for the problem.
>
> Regards,
>
> JohnC
>
> On 2/21/2013 1:56 AM, Bengt Rutisson wrote:
>>
>> John,
>>
>> The bug report lists a guarantee in numberSeq.cpp. You need a concurrent cycle for the issue to reproduce.
>>
>> I can repro it with this command line using SPECjbb2005:
>>
>> java -XX:+UseG1GC -XX:G1ConfidencePercent=300 -Xmx256m -XX:InitiatingHeapOccupancyPercent=1 -XX:+PrintGC -cp jbb.jar:check.jar spec.jbb.JBBmain
>>
>> Here is the stack trace from the hs_err file I get:
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (/Users/brutisso/repos/hs-gc-test-deprecated/src/share/vm/utilities/numberSeq.cpp:166), pid=34072, tid=16131
>> # guarantee(variance() > -1.0) failed: variance should be >= 0
>> #
>> # JRE version: Java(TM) SE Runtime Environment (8.0-b68) (build 1.8.0-ea-b68)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b15-internal-jvmg mixed mode bsd-amd64 compressed oops)
>> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.sun.com/bugreport/crash.jsp
>> #
>>
>> --------------- T H R E A D ---------------
>>
>> Current thread (0x00007feda9038000): ConcurrentGCThread [stack: 0x0000000129587000,0x0000000129687000] [id=16131]
>>
>> Stack: [0x0000000129587000,0x0000000129687000], sp=0x0000000129686190, free space=1020k
>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
>> V [libjvm.dylib+0xd48e80] VMError::report(outputStream*)+0x1262
>> V [libjvm.dylib+0xd4a57b] VMError::report_and_die()+0x88f
>> V [libjvm.dylib+0x5328a7] report_vm_error(char const*, int, char const*, char const*)+0xc7
>> V [libjvm.dylib+0xac6dd2] TruncatedSeq::add(double)+0x170
>> V [libjvm.dylib+0x4d389a] CMTask::do_marking_step(double, bool, bool)+0x15ba
>> V [libjvm.dylib+0x4d9d5a] CMConcurrentMarkingTask::work(unsigned int)+0x192
>> V [libjvm.dylib+0xda8c8e] GangWorker::loop()+0x666
>> V [libjvm.dylib+0xda7fc0] GangWorker::run()+0x3a
>> V [libjvm.dylib+0xafef34] java_start(Thread*)+0x1d4
>> C [libsystem_c.dylib+0x14742] _pthread_start+0x147
>> C [libsystem_c.dylib+0x1181] thread_start+0xd
>>
>> Bengt
>>
>> On 2/20/13 9:13 PM, John Cuthbertson wrote:
>>> Hi Vladimir,
>>>
>>> The change looks good to me. But I have a question: which counter was tripping the guarantee? I just ran jvm98 with a confidence percent value of 200 and 300 and the guarantee didn't fire for me.
>>>
>>> I kind of agree with Bengt that moving this kind of error checking earlier would be better. But that can be done as a separate CR.
>>>
>>> Thanks,
>>>
>>> JohnC
>>>
>>> On 2/20/2013 7:27 AM, vladimir kempik wrote:
>>>> Hi Bengt,
>>>>
>>>> Thanks for looking at this!
>>>>
>>>> Here is an updated webrev based on your feedback:
>>>>
>>>> http://cr.openjdk.java.net/~mcherkas/vladimir/8008546/webrev.01/
>>>>
>>>> I applied what you suggested.
>>>>
>>>> Thanks,
>>>> Vladimir.
>>>>
>>>> On 20.02.2013 17:54, Bengt Rutisson wrote:
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> This looks very similar to how we treat G1ReservePercent, so I think it looks good. An alternative would have been to check this earlier in the initialization phase and update the flag G1ConfidencePercent so that PrintFlagsFinal would have printed the actual value. But for consistency I think it looks good this way.
>>>>>
>>>>> I think you can change G1ConfidencePercent to be a uintx instead of an intx (in g1_globals.hpp). In that case you don't need the second if statement since it can't be less than 0. It is also more consistent with G1ReservePercent, which is a uintx.
>>>>>
>>>>> Thanks,
>>>>> Bengt
>>>>>
>>>>> On 2/20/13 2:31 PM, vladimir kempik wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Could I have a couple of reviews for this change?
>>>>>>
>>>>>> http://cr.openjdk.java.net/~mcherkas/vladimir/8008546/webrev.00/
>>>>>>
>>>>>> The input value for G1ConfidencePercent wasn't checked before use. This sometimes results in a crash if the -XX:+UseG1GC -XX:G1ConfidencePercent=200 flags are used. Now the value is checked the same way as for G1ReservePercent: increased to 0 if negative, decreased to 100 if more than 100.
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
23-02-2013