Bug ID: JDK-6991377 G1: race between concurrent refinement and humongous object allocation

Fixed Versions:
hs20 (b02)


While testing another set of changes I got a few assertion failures related to the block offset table (BOT) when running the dacapo pmd benchmark. They always seemed to happen while the BOT was being set up as part of a humongous region / object allocation. The assertions check that the BOT has been set up correctly and fail when they detect inconsistencies. I also noticed that this always happened shortly after a cleanup.

Added instrumentation proved that the region we had just allocated to satisfy a humongous allocation request (and whose BOT was found to be inconsistent) had been freed during the last cleanup pause.

I'm still trying to confirm this with added instrumentation, but here is the race that I think is causing the failure. When we do the cleanup pause, some update buffers have entries that point into regions we are about to free. (This is definitely the case; I've proven it with instrumentation. I'm still trying to prove that a failure follows this scenario and that the regions involved are the same.) When we allocate one of those regions to satisfy the humongous allocation request, its top() is set to end() before the BOT is set up. The concurrent refinement thread might therefore try to refine parts of that region and make some of the BOT entries more fine-grained, concurrently with the thread that is allocating the humongous regions. So the BOT was becoming inconsistent not because the thread that set it up did so wrongly, but because the concurrent refinement thread was modifying it at the same time.
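The suspected interleaving can be replayed deterministically in a toy model. This is only a sketch under stated assumptions, not HotSpot code: `Region`, `refine_stale_card`, the 8-word card size, and the idea that refinement overwrites a BOT entry with a stale block start are all simplifications invented for illustration.

```python
CARD_WORDS = 8  # hypothetical card size, in words


class Region:
    """Toy region: word addresses, one BOT entry per card."""

    def __init__(self, bottom, end):
        self.bottom, self.end = bottom, end
        self.top = bottom  # empty region: nothing lies below top
        self.bot = [None] * ((end - bottom) // CARD_WORDS)

    def bot_is_consistent(self):
        # A humongous object starts at bottom, so every entry must say so.
        return all(entry == self.bottom for entry in self.bot)


def refine_stale_card(region, card_index, stale_start):
    """Refine an out-of-date card left over from before the cleanup pause."""
    card_addr = region.bottom + card_index * CARD_WORDS
    if card_addr >= region.top:
        return False  # card lies above top: ignored
    region.bot[card_index] = stale_start  # entry "refined" with old data
    return True


def allocate_humongous_top_first(region, refinement_step):
    """Ordering suspected above: top is set to end before the BOT is set up."""
    region.top = region.end
    n = len(region.bot)
    for i in range(n):
        region.bot[i] = region.bottom
        if i == n // 2:
            refinement_step()  # simulated refinement thread, mid-setup
    return region.bot_is_consistent()


region = Region(bottom=0, end=64)
ok = allocate_humongous_top_first(
    region, lambda: refine_stale_card(region, 3, stale_start=20))
print(ok)  # False: the stale card was below top, so its BOT entry changed
```

The allocating thread wrote every entry correctly; the inconsistency comes only from the simulated refinement step touching an entry that was already set up.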



See Description.

There are a few ways to deal with this issue.

1. We can drain all available update buffers at the beginning of the cleanup pause, so there will be no buffers that point to empty regions after the cleanup pause. This will increase the duration of the cleanup pause though.

2. We can try to timestamp the update buffers and the regions as they are being allocated, to prove that a region was allocated after a buffer was generated and, hence, that any cards on that region in said buffers should be ignored. We should be able to re-use the same mechanism for filtering out cards on young regions (which we have to explicitly check for today). I haven't fully thought through how this will work, but maybe this is something we can consider in the future.

3. Change the code that does the humongous allocation to set up the BOT before it updates top. That way, while the BOT is being set up, cards on the regions being allocated will be ignored, as they will lie above top. After the BOT is set up, we can then update top.
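As a rough illustration of option 2, the filter could compare a per-buffer stamp with a per-region allocation stamp. Everything here is hypothetical: the logical clock, the class names, and in particular the choice to stamp a buffer once when it is created, a detail the description above leaves open.

```python
import itertools

_clock = itertools.count(1)  # hypothetical global logical clock


class Region:
    def __init__(self, name):
        self.name = name
        self.alloc_stamp = next(_clock)  # stamped when first allocated

    def reallocate(self):
        self.alloc_stamp = next(_clock)  # freed, then handed out again


class UpdateBuffer:
    def __init__(self):
        self.stamp = next(_clock)  # assumption: stamped once, at creation
        self.cards = []            # (region, card_index) pairs

    def enqueue(self, region, card_index):
        self.cards.append((region, card_index))


def cards_to_refine(buffer):
    """Keep only cards whose region was allocated before the buffer existed."""
    return [(r, c) for r, c in buffer.cards if r.alloc_stamp < buffer.stamp]


old = Region("old")
reused = Region("reused")
buf = UpdateBuffer()
buf.enqueue(old, 1)
buf.enqueue(reused, 5)
reused.reallocate()  # e.g. freed at cleanup, then reused for a humongous object
print([(r.name, c) for r, c in cards_to_refine(buf)])  # [('old', 1)]
```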

I like that both 1 and 2 will avoid processing "out of date" cards, which will make those solutions a bit more robust. On the other hand, 3 is probably the easiest and most localized change (it only touches the humongous object allocation code, which is a relatively infrequent operation). I will try 3 first.
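A minimal sketch of the ordering in option 3, using a toy model rather than the real allocation code. The names (`Region`, `refine_stale_card`, `allocate_humongous_bot_first`) and the 8-word card size are assumptions made for illustration; real code would also need a store barrier so that other threads observe the BOT writes before the new top, which the sketch does not model.

```python
CARD_WORDS = 8  # hypothetical card size, in words


class Region:
    def __init__(self, bottom, end):
        self.bottom, self.end = bottom, end
        self.top = bottom  # empty region
        self.bot = [None] * ((end - bottom) // CARD_WORDS)


def refine_stale_card(region, card_index, stale_start):
    """Returns True if the card was processed, False if it was ignored."""
    card_addr = region.bottom + card_index * CARD_WORDS
    if card_addr >= region.top:
        return False  # above top: ignored
    region.bot[card_index] = stale_start
    return True


def allocate_humongous_bot_first(region, refinement_step):
    """Option 3: set up the whole BOT first, update top last."""
    n = len(region.bot)
    for i in range(n):
        region.bot[i] = region.bottom
        if i == n // 2:
            refinement_step()  # simulated refinement thread, mid-setup
    region.top = region.end  # publish the region only once the BOT is complete
    return all(entry == region.bottom for entry in region.bot)


region = Region(bottom=0, end=64)
processed = []
ok = allocate_humongous_bot_first(
    region,
    lambda: processed.append(refine_stale_card(region, 3, stale_start=20)))
print(ok, processed)  # True [False]: the stale card was above top and skipped
```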

Testing with extra instrumentation has revealed the smoking gun.

I added code at the end of the cleanup pause that goes through all the available update buffers (in the global queue and those currently in use by the Java threads) and checks whether there are any cards available that point to empty regions (hoping to catch the case where there are cards to be processed on regions we just freed up). Extra instrumentation also prints out the ranges of the regions we are freeing up during cleanup.
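The check described above might look roughly like the following. This is a sketch with made-up data structures (buffers as plain lists of card addresses, a region counted as empty when top equals bottom), not the instrumentation that was actually added.

```python
class Region:
    def __init__(self, bottom, end, top=None):
        self.bottom, self.end = bottom, end
        self.top = bottom if top is None else top

    def is_empty(self):
        return self.top == self.bottom

    def contains(self, addr):
        return self.bottom <= addr < self.end


def cards_on_empty_regions(global_queue, thread_buffers, regions):
    """Return (card_addr, region) for every card that points into an empty region."""
    hits = []
    for buffer in list(global_queue) + list(thread_buffers):
        for card_addr in buffer:
            for region in regions:
                if region.contains(card_addr) and region.is_empty():
                    hits.append((card_addr, region))
    return hits


def describe_freed(regions_freed):
    """Ranges of the regions freed during cleanup, for the log."""
    return ["freed [%s, %s)" % (hex(r.bottom), hex(r.end)) for r in regions_freed]


live = Region(0x000, 0x100, top=0x080)
freed = Region(0x100, 0x200)          # emptied by the cleanup pause
global_queue = [[0x010, 0x120]]       # completed buffers
thread_buffers = [[0x1F0], [0x040]]   # buffers still in use by Java threads
hits = cards_on_empty_regions(global_queue, thread_buffers, [live, freed])
print([hex(addr) for addr, _ in hits])  # ['0x120', '0x1f0']
print(describe_freed([freed]))          # ['freed [0x100, 0x200)']
```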

It took more than half a day for pmd to fail with the extra instrumentation, but it did fail with the data I expected. We have a particular region that is freed during cleanup, there are buffers that contain cards on that region, and it's the region whose BOT we find to be corrupted while we're trying to allocate it as a humongous region shortly after the cleanup pause. I think that's enough proof that the scenario I outlined in the Description is the one that's causing the failure.

