The issue seen in 8008546 was a guarantee failure in the number sequences that are used for tracking the differences between the specified marking step time and the elapsed time of a marking task that aborts due to a time out.
When a marking task starts it uses this number sequence to generate a time difference that is subtracted from the given task time. The result is the target time of the marking task. The marking task will time out if a regular_clock_call() (called frequently during the marking task) detects that the elapsed time has exceeded the target.
The number sequence keeps track of the difference between the elapsed task time and the target time.
We do this to try to avoid going over the specified time target. For example, if the given time target is 10ms, we subtract a value predicted from the number sequence (say 0.5ms) to give a step target time of 9.5ms. The first regular clock call after an elapsed time of 9.5ms will cause the task to time out. Suppose the task times out after 9.7ms: the difference (9.7 - 9.5 = 0.2ms) is added to the number sequence. When the marking task restarts, we generate a new prediction of how much earlier than the given target we have to abort the marking step.
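The arithmetic in this example can be sketched as follows. This is only an illustration of the bookkeeping described above; the function names are hypothetical and do not correspond to anything in the HotSpot sources.

```python
def step_target_ms(given_ms, predicted_diff_ms):
    """Target time for the step: given time minus the predicted overshoot."""
    return given_ms - predicted_diff_ms


def observed_diff_ms(elapsed_ms, target_ms):
    """Value recorded in the number sequence when the task times out."""
    return elapsed_ms - target_ms


target = step_target_ms(10.0, 0.5)    # 9.5ms
diff = observed_diff_ms(9.7, target)  # roughly 0.2ms, fed back into the sequence
print(f"target = {target:.2f}ms, recorded diff = {diff:.2f}ms")
```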
In 8008546, the user specified a value of G1ConfidencePercent that caused this number sequence to generate increasingly divergent values (after the first incidence of exceeding the target), resulting in larger and larger values being stored in the number sequence.
The values started to diverge when the marking task exceeded its target time:
[0] >>>>>>>>>> START, call = 3,
_num = 7, _sum = 1.595, _sum_of_squares = 0.519
_davg = 0.242, _dvariance = 0.012, _alpha = 0.700
_length = 10, _next = 7
[0]= 0.500 [1]= 0.137 [2]= 0.126 [3]= 0.036 [4]= 0.382
[5]= 0.189 [6]= 0.226 [7]= 0.000 [8]= 0.000 [9]= 0.000
davg: 0.2421194, sig: 3.0000000, dsd: 0.1100790, first: 0.5723563
davg: 0.2421194, num: 7, cf: 1.0000000, second: 0.2421194
Result: 0.5723563
[0] >>>>>>>>>> START, call = 3, target = 9.43ms >>>>>>>>>>
[0] Adding 16.8998093 to counter
[0] <<<<<<<<<< ABORTING, target = 9.43ms, elapsed = 26.33ms <<<<<<<<<<
In the instrumentation above, the target marking step time was 9.43ms. The task actually aborted after an elapsed time of 26.33ms. This resulted in 16.9ms being added into the number sequence.
When the marking task restarted:
[0] >>>>>>>>>> START, call = 4,
_num = 8, _sum = 18.495, _sum_of_squares = 286.122
_davg = 5.239, _dvariance = 40.798, _alpha = 0.700
_length = 10, _next = 8
[0]= 0.500 [1]= 0.137 [2]= 0.126 [3]= 0.036 [4]= 0.382
[5]= 0.189 [6]= 0.226 [7]= 16.900 [8]= 0.000 [9]= 0.000
davg: 5.2394264, sig: 3.0000000, dsd: 6.3873188, first: 24.4013827
davg: 5.2394264, num: 8, cf: 1.0000000, second: 5.2394264
Result: 24.4013827
[0] >>>>>>>>>> START, call = 4, target = -14.40ms >>>>>>>>>>
[0] Adding 14.5933617 to counter
[0] <<<<<<<<<< ABORTING, target = -14.40ms, elapsed = 0.19ms <<<<<<<<<<
The recently added large number skewed the prediction of how early we have to abort the marking to avoid exceeding the specified marking step time. The value came back as 24.40ms. The specified marking step time was 10ms. This means that we would have to abort marking 14.40 ms BEFORE the marking step started.
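The numbers in the two dumps are consistent with an exponentially decaying average and variance (weight _alpha = 0.7) and with a prediction equal to the larger of davg + sig * dsd and davg * cf. The sketch below assumes exactly that update rule, which is inferred from the logged values rather than quoted from the sources; the function names are hypothetical.

```python
import math

ALPHA = 0.7   # _alpha from the dumps
SIGMA = 3.0   # "sig" from the dumps
CF = 1.0      # "cf" from the dumps


def add_value(davg, dvariance, val, alpha=ALPHA):
    """Assumed decaying update of the average and variance."""
    new_davg = (1.0 - alpha) * val + alpha * davg
    diff = val - new_davg
    new_dvariance = (1.0 - alpha) * diff * diff + alpha * dvariance
    return new_davg, new_dvariance


def predict(davg, dvariance, sigma=SIGMA, cf=CF):
    """Assumed prediction: the larger of the 'first' and 'second' log values."""
    dsd = math.sqrt(dvariance)
    return max(davg + sigma * dsd, davg * cf)


# State before the overshoot (call = 3): davg and dsd as logged.
davg, dvariance = 0.2421194, 0.1100790 ** 2

# The task overshot its 9.43ms target by 16.8998093ms.
davg, dvariance = add_value(davg, dvariance, 16.8998093)

prediction = predict(davg, dvariance)
target = 10.0 - prediction
print(f"davg = {davg:.4f}, dsd = {math.sqrt(dvariance):.4f}")
print(f"prediction = {prediction:.2f}ms, target = {target:.2f}ms")
```

Under these assumptions, a single large sample is enough to push the prediction past the 10ms step time, because it inflates both the decaying average and the decaying deviation, and the deviation is multiplied by sig.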
As a result, the first regular_clock_call() aborted the marking step after only 0.19ms. Since the task timed out, we added the difference (0.19 - (-14.40) = 14.59ms) to the number sequence. This second large value skewed the prediction again, and so on.
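The feedback loop can be illustrated with a small simulation. It is a sketch under two assumptions: the decaying update rule inferred from the dumps (with cf = 1.0, so davg + sig * dsd is always the larger term), and every step after the first overshoot aborting at its first clock check after 0.19ms. The names are hypothetical.

```python
import math

ALPHA, SIGMA = 0.7, 3.0   # _alpha and "sig" from the dumps
GIVEN_MS = 10.0           # specified marking step time
FIRST_CHECK_MS = 0.19     # assumed: every later step aborts at its first clock check


def add_value(davg, dvariance, val):
    """Assumed decaying update of the average and variance."""
    davg = (1.0 - ALPHA) * val + ALPHA * davg
    diff = val - davg
    dvariance = (1.0 - ALPHA) * diff * diff + ALPHA * dvariance
    return davg, dvariance


def simulate(steps):
    davg, dvariance = 0.2421194, 0.1100790 ** 2   # state before the overshoot
    val = 16.8998093                              # the first large overshoot
    predictions = []
    for _ in range(steps):
        davg, dvariance = add_value(davg, dvariance, val)
        prediction = davg + SIGMA * math.sqrt(dvariance)
        predictions.append(prediction)
        target = GIVEN_MS - prediction
        val = FIRST_CHECK_MS - target   # elapsed minus the (negative) target
    return predictions


predictions = simulate(4)
for p in predictions:
    print(f"prediction = {p:.2f}ms, target = {GIVEN_MS - p:.2f}ms")
```

In this model each prediction is larger than the previous one, so the target moves further below zero on every restart.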
There are a couple of ways to deal with this. We either stop the marking task's target time from becoming negative, or we do not add large numbers to the number sequence.
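Both options can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the cut-off used for "large" (a difference greater than the given step time) is an assumption made for the example, not a value taken from the bug.

```python
def clamped_target_ms(given_ms, predicted_diff_ms):
    """Option 1: never let the step's target time go negative."""
    return max(given_ms - predicted_diff_ms, 0.0)


def should_record(diff_ms, given_ms):
    """Option 2: skip differences larger than the step itself (assumed cut-off)."""
    return diff_ms <= given_ms


# Option 1 with the skewed prediction from the second dump.
print(clamped_target_ms(10.0, 24.4013827))

# Option 2 with the overshoot that started the divergence, then a normal one.
print(should_record(16.8998093, 10.0))
print(should_record(0.2, 10.0))
```

With the first option the skewed prediction still enters the sequence but the recorded difference can no longer exceed the elapsed time; with the second, the outlier never reaches the sequence at all.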