JDK-8046936 : JEP 270: Reserved Stack Areas for Critical Sections
  • Type: JEP
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P3
  • Status: Closed
  • Resolution: Delivered
  • Fix Versions: 9
  • Submitted: 2014-06-16
  • Updated: 2023-10-30
  • Resolved: 2016-09-29
Sub Tasks
JDK-8187927 :  
Description
Summary
-------

Reserve extra space on thread stacks for use by critical sections, so
that they can complete even when stack overflows occur.


Goals
-----

  - Provide a mechanism to mitigate the risk of deadlocks caused by the
    corruption of critical data such as `java.util.concurrent` locks
    (such as `ReentrantLock`) caused by a `StackOverflowError` being
    thrown in a critical section.

  - The solution must be mostly JVM-based in order not to require
    modifications to `java.util.concurrent` algorithms or published
    interfaces, or existing library and application code.

  - The solution must not be limited to the `ReentrantLock` case, and
    should be applicable to any critical section in privileged code.


Non-Goals
---------

  - The solution doesn't aim to provide robustness against stack
    overflows to non-privileged code.

  - The solution doesn't aim to avoid `StackOverflowError`s, but rather
    to mitigate the risk that such an error is thrown inside a critical
    section and thereby corrupts some data structures.

  - The proposed solution is a trade-off between solving some well-known
    corruption cases while preserving performance, with reasonable
    resource cost and relatively low complexity.


Motivation
----------

`StackOverflowError` is an asynchronous exception that can be thrown by
the Java Virtual Machine whenever the computation in a thread requires a
larger stack than is permitted (JVM spec §2.5.2 and
§2.5.6). The Java Language Specification permits a
`StackOverflowError` to be thrown synchronously by method invocation (JLS
§11.1.3). The HotSpot VM uses this property to implement a
"stack-banging" mechanism on method entry.

The stack-banging mechanism is a clean way to report that a stack
overflow has occurred while preserving the JVM's integrity, but it
doesn't provide a safe way for the application to recover from this
situation. A stack overflow could occur in the middle of a sequence of
modifications which, if not complete, could leave a data structure in an
inconsistent state.

For instance, when a `StackOverflowError` is thrown in a critical section
of the `java.util.concurrent.locks.ReentrantLock` class, the lock status
can be left in an inconsistent state, leading to potential deadlocks. The
`ReentrantLock` class uses an instance of `AbstractQueuedSynchronizer` to
implement its critical section. The implementation of its `lock()` method
is:

    final void lock() {
        if (compareAndSetState(0, 1))
            setExclusiveOwnerThread(Thread.currentThread());
        else
            acquire(1);
    }

The method tries to change the status word with an atomic operation. If
the modification is successful then the owner is set by invoking a setter
method, otherwise the slow path is invoked. The problem is that if a
`StackOverflowError` is thrown after the status word has been changed and
before the owner has been effectively set then the lock becomes unusable:
Its status word indicates it is locked but no owner has been set, so no
thread can unlock it. Because stack-size checks are performed at
method-invocation time (in HotSpot, at least), a `StackOverflowError` can
be thrown either when `Thread.currentThread()` is invoked or when
`setExclusiveOwnerThread()` is invoked. In either case it leads to a
corruption of the `ReentrantLock` instance, and all threads trying to
acquire this lock will be blocked forever.
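The failure window can be illustrated with a minimal sketch. `ToyLock` below is a hypothetical stand-in, not the `ReentrantLock` source: it only mirrors the two-step fast path (atomic state change, then owner assignment). The `simulateOverflow` flag is an assumption made for the demonstration. It injects the `StackOverflowError` deterministically at the point where a real JVM could raise it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy lock, for illustration only; not the JDK implementation.
final class ToyLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = free, 1 = held
    private volatile Thread owner;                            // set only after the CAS succeeds

    // Assumption for the demo: when true, simulate a stack overflow right after the CAS.
    boolean simulateOverflow;

    boolean tryLock() {
        if (state.compareAndSet(0, 1)) {
            if (simulateOverflow) {
                // Stand-in for the error a real JVM could throw when invoking
                // Thread.currentThread() or the owner setter.
                throw new StackOverflowError("simulated");
            }
            owner = Thread.currentThread();
            return true;
        }
        return false;
    }

    void unlock() {
        if (owner != Thread.currentThread()) {
            throw new IllegalMonitorStateException("caller is not the owner");
        }
        owner = null;
        state.set(0);
    }

    boolean isHeld() { return state.get() == 1; }

    Thread getOwner() { return owner; }
}

final class ToyLockDemo {
    public static void main(String[] args) {
        ToyLock lock = new ToyLock();
        lock.simulateOverflow = true;
        try {
            lock.tryLock();
        } catch (StackOverflowError e) {
            // The application "recovers", but the lock is now corrupted.
        }
        lock.simulateOverflow = false;
        System.out.println("held=" + lock.isHeld() + ", owner=" + lock.getOwner());
        System.out.println("tryLock again: " + lock.tryLock());
    }
}
```

After the simulated error the state word says "held" while no owner is recorded, so every later `tryLock()` fails and `unlock()` rejects every caller, which is the unusable-lock situation described above.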

This particular problem caused some serious issues in JDK 7 because
parallel class loading was implemented using a `ConcurrentHashMap` and,
at that time, the `ConcurrentHashMap` code used `ReentrantLock`
instances. If a `ReentrantLock` instance was corrupted because of a
`StackOverflowError` then the class-loading mechanism itself could
deadlock. (This happened in stress tests
([JDK-7011862](https://bugs.openjdk.java.net/browse/JDK-7011862)), but
could also happen in the field.)

The implementation of the `ConcurrentHashMap` class was completely
changed in June 2013. The new implementation uses `synchronized`
statements rather than `ReentrantLock` instances, so JDK 8 and later
releases are not subject to class-loading deadlock due to corrupted
`ReentrantLock`s. However, any code using `ReentrantLock` can still be
impacted and cause deadlock. Such issues have already been reported on
the `concurrency-interest@cs.oswego.edu` mailing list.

The problem is not limited to the `ReentrantLock` class.

Java applications or libraries often rely on the consistency of data
structures to work properly. Any modification of those data structures is
a critical section: Before the execution of the critical section the data
structures are consistent, and after its execution the data structures
are consistent too. During its execution, however, the data structure
could go through transient inconsistent states.

If a critical section is made of a single Java method containing no other
method invocation, the current stack overflow mechanism works well:
Either the available stack is sufficient and the method executes without
trouble, or it is not sufficient and so a `StackOverflowError` is thrown
before the first bytecode of the method is executed.

The problem occurs when a critical section is made of several methods,
for instance a method A which invokes a method B. The available stack can
be sufficient to let method A start its execution. Method A starts to
modify a data structure and then invokes method B, but the remaining
stack is not sufficient to execute B, causing a `StackOverflowError` to
be thrown. Because method B and the remainder of method A have not been
executed, the consistency of the data structure might have been
compromised.


Description
-----------

The main idea of the proposed solution is to reserve some space on the
execution stack for critical sections, to allow them to complete their
execution where regular code would have been interrupted by a stack
overflow. The assumption is that critical sections are relatively small
and do not require enormous space on the execution stack to complete
successfully. The goal is not to rescue a faulty thread which hits its
stack limit, but rather to preserve shared data structures that could be
corrupted if the `StackOverflowError` is thrown in a critical section.

The main mechanism will be implemented in the JVM. The only modification
required in the Java source code is the annotation that must be used to
identify the critical sections. This annotation, currently named
`jdk.internal.vm.annotation.ReservedStackAccess`, is a runtime method annotation that can
be used by any class of privileged code (see paragraph below about the
accessibility of this annotation).

In order to prevent the corruption of shared data structures, the JVM
will try to delay the throwing of a `StackOverflowError` until the thread
in question has exited all of its critical sections. Each Java thread has
a new zone defined in its execution stack, called the reserved zone. This
zone can be used only if the Java thread has a method annotated with
`jdk.internal.vm.annotation.ReservedStackAccess` in its current call stack. When a stack
overflow condition is detected by the JVM, and the thread has an
annotated method in its call stack, the JVM grants temporary access to
the reserved zone until no more annotated methods are present in the call
stack. When access to the reserved zone is revoked, a delayed
`StackOverflowError` is thrown. If the thread has no annotated method in
its call stack when the stack overflow condition is detected then the
`StackOverflowError` is thrown immediately (this is current JVM behavior).

Note that the reserved stack space is usable by annotated methods but
also by methods invoked, directly or transitively, from them. The nesting
of annotated methods is naturally supported, but there's a single shared
reserved zone per thread; that is, the invocation of an annotated method
does not add a new reserved zone. The sizing of the reserved zone must be
done according to the worst case of all annotated critical sections.

By default, the `jdk.internal.vm.annotation.ReservedStackAccess` annotation is applicable
only to privileged code (code loaded by the bootstrap or the extension
class loader). Both privileged code and non-privileged code can be
annotated with this annotation but by default the JVM will ignore it for
non-privileged code. The rationale behind this default policy is that the
reserved stack space for critical sections is a shared resource among all
critical sections. If any arbitrary code is able to use this space then
it is not a reserved space anymore, and this would defeat the whole
solution. A JVM flag is available, even in product builds, to relax this
policy and allow any code to be able to benefit from this feature.
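The shape of an annotated critical section can be sketched as follows. The real annotation lives in an internal package and is ignored for non-privileged code by default, so this sketch declares a local stand-in with the same simple name. The JVM does not recognize this stand-in. The `Account` class and the `isMarked` helper are hypothetical and exist only to show where the marker would go.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in for jdk.internal.vm.annotation.ReservedStackAccess.
// The JVM does NOT act on this local annotation; it only shows the shape.
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
@interface ReservedStackAccess {}

// Hypothetical class whose two fields must always change together.
final class Account {
    private long balance;
    private long version;

    // The marker goes on the outermost method of the critical section.
    @ReservedStackAccess
    void deposit(long amount) {
        balance += amount;
        bumpVersion(); // nested call: covered transitively, no marker needed
    }

    private void bumpVersion() {
        version++;
    }

    long balance() { return balance; }

    long version() { return version; }
}

final class AnnotationDemo {
    // Reports whether the named method carries the stand-in marker.
    static boolean isMarked(Class<?> type, String name, Class<?>... params) {
        try {
            return type.getDeclaredMethod(name, params)
                       .isAnnotationPresent(ReservedStackAccess.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deposit marked: " + isMarked(Account.class, "deposit", long.class));
        System.out.println("bumpVersion marked: " + isMarked(Account.class, "bumpVersion"));
    }
}
```

Only `deposit` carries the marker. Under the scheme described above, `bumpVersion` would still be able to use the reserved zone because it is invoked from an annotated method.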

### Implementation

In the HotSpot VM, each Java thread has two zones defined at the end of
its execution stack: the yellow zone and the red zone. Both memory areas
are protected against all accesses.

If, during its execution, a thread tries to use the memory in the yellow
zone, a protection fault is triggered, the protection of the yellow zone
is temporarily removed, and a `StackOverflowError` is created and
thrown. Before unwinding the thread execution stack to propagate the
`StackOverflowError`, the protection of the yellow zone is restored.

If the thread tries to use the memory in its red zone, the JVM
immediately branches to JVM error-reporting code, leading to the
generation of an error report and a crash dump of the JVM process.

The new zone defined by the proposed solution is placed just before the
yellow zone. This reserved zone will behave like regular stack space if
the thread has a `ReservedStackAccess`-annotated method in its call
stack, and like the yellow zone otherwise.

During the setup of the execution stack of a Java thread, the reserved
zone is protected the same way as the yellow and the red zones. If,
during its execution, the thread hits its reserved zone, a `SIGSEGV`
signal is generated and the signal handler applies the following
algorithm:

  - If the address of the fault is in the red zone, generate a JVM error
    report and a crash dump.

  - If the address of the fault is in the yellow zone, create and throw a
    `StackOverflowError`.

  - If the address of the fault is in the reserved zone, perform a stack
    walk to check if there's a method annotated with
    `jdk.internal.vm.annotation.ReservedStackAccess` on the call stack. If not, create and
    throw a `StackOverflowError`. If an annotated method is found, remove
    the protection of the reserved zone and store in the C++ `Thread`
    object the stack pointer of the outermost activation (frame) related
    to an annotated method.
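The three-way decision above can be modeled in a few lines of Java. This is an illustrative model, not HotSpot code: the class name, the zone sizes, and the boolean that stands in for the stack walk are all assumptions.

```java
// Illustrative model only: names, sizes, and layout are assumptions, not HotSpot code.
final class GuardZoneModel {
    enum Action {
        ERROR_REPORT_AND_CRASH,      // fault in the red zone
        THROW_STACK_OVERFLOW_ERROR,  // yellow zone, or reserved zone without an annotated frame
        GRANT_RESERVED_ACCESS,       // reserved zone with an annotated frame on the call stack
        NOT_A_GUARD_ZONE_FAULT       // address outside the three zones
    }

    // The stack grows downward, so the zones occupy the lowest addresses:
    // [stackEnd, redLimit) red, [redLimit, yellowLimit) yellow,
    // [yellowLimit, reservedLimit) reserved.
    private final long stackEnd;
    private final long redLimit;
    private final long yellowLimit;
    private final long reservedLimit;

    GuardZoneModel(long stackEnd, long redSize, long yellowSize, long reservedSize) {
        this.stackEnd = stackEnd;
        this.redLimit = stackEnd + redSize;
        this.yellowLimit = redLimit + yellowSize;
        this.reservedLimit = yellowLimit + reservedSize;
    }

    // annotatedFrameOnStack abstracts the result of the stack walk.
    Action onFault(long faultAddress, boolean annotatedFrameOnStack) {
        if (faultAddress < stackEnd || faultAddress >= reservedLimit) {
            return Action.NOT_A_GUARD_ZONE_FAULT;
        }
        if (faultAddress < redLimit) {
            return Action.ERROR_REPORT_AND_CRASH;
        }
        if (faultAddress < yellowLimit) {
            return Action.THROW_STACK_OVERFLOW_ERROR;
        }
        return annotatedFrameOnStack
                ? Action.GRANT_RESERVED_ACCESS
                : Action.THROW_STACK_OVERFLOW_ERROR;
    }
}

final class GuardZoneDemo {
    public static void main(String[] args) {
        long page = 4096;
        // Hypothetical layout: one red page, two yellow pages, one reserved page.
        GuardZoneModel stack = new GuardZoneModel(0x10000L, page, 2 * page, page);
        long inReservedZone = 0x10000L + 3 * page + 8;
        System.out.println(stack.onFault(inReservedZone, true));
        System.out.println(stack.onFault(inReservedZone, false));
    }
}
```

The model leaves out the bookkeeping that follows a grant (unprotecting the zone and recording the outermost annotated frame), which the text describes as part of the same handler step.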

If the protection of the reserved zone has been removed to allow a
critical section to complete its execution, the protection must be
restored and the delayed `StackOverflowError` thrown as soon as the
thread exits the critical section. The HotSpot interpreter has been
modified to check if the registered outermost annotated method is being
exited. The check is performed on every frame-activation removal by
comparing the value of the stack pointer being restored with the value
stored in the C++ `Thread` object. If the restored stack pointer is above
the stored value (stacks grow downward), a call to the runtime is
performed to change the memory protection and reset the stack pointer
value in the `Thread` object before jumping to the `StackOverflowError`
generation code. The two compilers have been modified to perform the same
check on method exit, but only for `ReservedStackAccess` annotated
methods or methods with annotated methods in-lined in their compiled
code.

When an exception is thrown, the control flow doesn't go through the
regular method-exit code, so there's a possibility that the protection of
the reserved zone will not be restored correctly if the exception is
propagated above the annotated method. To prevent this situation, the
protection of the reserved zone is restored and the stack pointer value
stored in the C++ `Thread` object is reset each time an exception starts
being propagated. In this scenario, the delayed `StackOverflowError` is
not thrown. The rationale is that the thrown exception is more important
than the delayed `StackOverflowError` because it indicates a cause and a
point where normal execution has been interrupted.
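The exit check and the exception path described above can be modeled as a small piece of per-thread state. This is a sketch under assumptions, not HotSpot code: the field names, the sentinel value, and the method names are invented for illustration, and addresses are modeled as plain positive `long` values.

```java
// Illustrative model only: field and method names are assumptions, not HotSpot code.
final class ReservedZoneState {
    // Sentinel for "no annotated activation registered". Modeled as the largest
    // value so that the exit check never fires in the common case.
    static final long NO_ACTIVATION = Long.MAX_VALUE;

    private long reservedActivationSp = NO_ACTIVATION;
    private boolean zoneProtected = true;

    // Signal-handler path: access is granted and the stack pointer of the
    // outermost annotated activation is recorded.
    void grantAccess(long outermostAnnotatedFrameSp) {
        zoneProtected = false;
        reservedActivationSp = outermostAnnotatedFrameSp;
    }

    // Frame-removal path. Returns true when the delayed StackOverflowError
    // must be thrown because the outermost annotated method is being exited.
    boolean onFrameExit(long restoredSp) {
        if (restoredSp <= reservedActivationSp) {
            return false; // fast path: still inside the critical section, or nothing registered
        }
        restoreProtection();
        return true;
    }

    // Exception path: protection is restored, but no delayed error is raised.
    void onExceptionPropagation() {
        restoreProtection();
    }

    private void restoreProtection() {
        zoneProtected = true;
        reservedActivationSp = NO_ACTIVATION;
    }

    boolean isZoneProtected() { return zoneProtected; }
}

final class ReservedZoneDemo {
    public static void main(String[] args) {
        ReservedZoneState thread = new ReservedZoneState();
        thread.grantAccess(5000);                     // annotated frame at SP 5000
        System.out.println(thread.onFrameExit(4000)); // callee returns: no delayed error yet
        System.out.println(thread.onFrameExit(6000)); // annotated method exits: delayed error due
        System.out.println(thread.isZoneProtected());
    }
}
```

Because stacks grow downward, a restored stack pointer above the recorded value means the recorded frame itself has been popped. Any return from a callee of that frame restores a pointer at or below the recorded value and takes the fast path.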

Throwing a `StackOverflowError` is the Java way to notify the application
that a thread reached its stack limits. However, exceptions and errors
are sometimes caught by Java code and the notification is lost or not
handled correctly, which can make the investigation of the issue really
hard. To ease troubleshooting of stack overflow errors in the presence of a
reserved stack area, the JVM provides two other notifications when access
to the reserved stack area is granted: One is a warning printed by the
JVM (on the same stream as all other JVM messages), and the second is a
JFR event. Note that even if the delayed `StackOverflowError` is not
thrown because another exception has been thrown in a critical section,
the JVM warning and the JFR event are generated and are available for
troubleshooting.

The reserved-stack feature is controlled by two JVM flags, one to
configure the size of the reserved zone (all threads use the same size),
and one to allow non-privileged code to use the feature. Setting the size
of the reserved zone to zero disables the feature entirely. When
disabled, interpreted code and compiled code do not perform the check on
method exit.

Memory cost of this solution: For each thread the cost is the virtual
memory of its reserved zone, as part of its stack space. The option to
implement the reserved zone in a different memory area, as an alternate
stack, has been considered. It would, however, significantly increase
the complexity of any stack-walking code, so this option has been
rejected.

Performance cost: measurements done with
[JSR-166 tests](http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/loops/)
on `ReentrantLock`s didn't show any significant impact on performance on
x86 platforms.

### Performance

Here's how this solution could impact performance.

The most costly operation in this solution is the stack walking performed
when looking for an annotated method in the call stack. This operation is
performed only when the JVM has detected a potential stack
overflow. Without this fix, the JVM would throw a
`StackOverflowError`. So even if the operation is relatively costly, it
is better than the current behavior since it will prevent data
corruptions. The most frequently-executed part of this solution is the
check performed when an annotated method exits, to check if the
protection of the reserved zone has to be re-enabled or not. The
performance-critical version of this check is in the compiler. The
current implementation adds the following code sequence to the compiled
code of an annotated method:

    0x00007f98fcef5809: cmp    rsp,QWORD PTR [r15+0x298]
    0x00007f98fcef5810: jle    0x00007f98fcef583c
    0x00007f98fcef5816: mov    rdi,r15
    0x00007f98fcef5819: test   esp,0xf
    0x00007f98fcef581f: je     0x00007f98fcef5837
    0x00007f98fcef5825: sub    rsp,0x8
    0x00007f98fcef5829: call   0x00007f9910f62670  ;   {runtime_call}
    0x00007f98fcef582e: add    rsp,0x8
    0x00007f98fcef5832: jmp    0x00007f98fcef583c
    0x00007f98fcef5837: call   0x00007f9910f62670  ;   {runtime_call}

This code is for the x86_64 platform. In fast cases (no need to re-enable
protection of the reserved zone) it adds two instructions including a
small jump. The version for x86_32 is bigger because it doesn't have the
address of the `Thread` object always available in a register. The
feature is also implemented for Solaris/SPARC.

### Open issues

The default size of the reserved zone is still an open issue. This size
will depend on the longest critical zone in JDK code that uses the
`ReservedStackAccess` annotation and will also depend on the platform
architecture. We could also consider different defaults depending upon
whether the JVM is running on a high-end server or in a
virtual-memory-constrained environment.

To mitigate the sizing issue a debug/troubleshooting feature has been
added. This feature is enabled by default on debug builds and available
as a diagnostic JVM option in product builds. When activated, it is run
when the JVM is about to throw a `StackOverflowError`: It walks the call
stack and if one or more methods annotated with the
`ReservedStackAccess` annotation are found, their names are printed with
a warning message on the JVM standard output. The name of the JVM flag
controlling this feature is `PrintReservedStackAccessOnStackOverflow`.

The default size of the reserved area is one page (4K) and experiments
have shown that this is sufficient to cover the critical sections of
`java.util.concurrent` locks that have been annotated so far.

The reserved stack area is not fully supported on Windows
platforms. During the development of the feature on Windows, a bug was
found in the way the stack's special zones are controlled
([JDK-8067946](https://bugs.openjdk.java.net/browse/JDK-8067946)). This
bug prevents the JVM from granting access to the reserved stack area. As
a consequence, when a stack overflow condition is detected on Windows,
and an annotated method is on the call stack, the JVM warning is printed,
the JFR event is fired, and a `StackOverflowError` is thrown
immediately. There's no change in the behavior of the JVM for the
application. However, the JVM warning and the JFR event can help
troubleshooting, indicating that a potentially-harmful situation
occurred.


Alternatives
------------

Several alternative approaches have been considered and some of them have
been implemented and tested. Here's a list of those approaches.

Language-based solutions:

  - `try`/`catch`/`finally` constructs: They don't solve anything, since
    there's no guarantee that the `finally` clause will not trigger a
    stack overflow too.

  - New constructs such as:

        new CriticalSection(() -> {
            // do critical section code
        }).enter();

     This construct might require significant work in `javac` and the
     JVM, and its usage is likely to have high impact on performance
     compared to the reserved stack area, even when not run in a
     stack-overflow condition.

Code-transformation solutions:

  - Avoid method calls (because stack overflow checks are performed at
    method invocation time) by forcing the JIT to inline all called
    methods: Inlining could require the loading and initialization of
    classes not used by the application, forcing inlining could conflict
    with compiler rules (code size, inlining depth), and inlining is not
    applicable to all code patterns (e.g., reflection).

  - Code refactoring to avoid method calls at source level: Refactoring
    would require the modification of already-complex code
    (`java.util.concurrent`), and this kind of refactoring would break
    encapsulation.

Stack-based solutions:

  - Extended stack banging (banging the stack further before entering a
    critical section): This solution has a performance cost, even when
    not in a stack-overflow condition, and it is hard to maintain with
    nested critical sections.

  - Extensible stacks: Build stacks from several non-contiguous memory
    chunks, adding a new chunk when a stack overflow is detected: This
    solution adds significant complexity to the JVM to manage
    non-contiguous stacks (including all the logic currently based on
    pointer comparisons in stack management); it could also require us to
    copy/move some section of the stack, and it puts more pressure on the
    memory-allocation backend due to fragmentation issues.


Testing
-------

This change comes with a reliable unit test able to reproduce the
`java.util.concurrent.locks.ReentrantLock` corruption caused by a stack
overflow.


Dependencies
------------

The reserved stack area relies on the "yellow pages" mechanism. This
mechanism is currently partly broken on Windows
([JDK-8067946](https://bugs.openjdk.java.net/browse/JDK-8067946)), so the
reserved stack area is not fully supported on this platform.

Comments
I already created a sub-task as you previously requested: JDK-8187927
25-09-2017

Thanks David - I opened JDK-8187930 and linked it to this JEP to capture the build number.
25-09-2017

What's the integration bug ID for this one?
15-02-2016

TOI may be very short, and could be done together with other features
16-12-2015

Regardless of whether we implement this JEP, I suggest making stack overflow much rarer on 64-bit systems where address space is abundant. On my Linux system the default native stack size is 8MB, but the JVM reduces this to 1MB for Java threads, which is going in the wrong direction! Why not default to 64MB, if the address space is available?
02-12-2015

Thanks, Karen. I've been a cheerleader for getting rid of stack overflow for a while, but I am not personally competent to do the hotspot work and it seems unlikely that Google would invest seriously in making it happen in hotspot - it's just not a big enough pain point for those big Java servers Google cares about. Just run on 64-bit, bump up the stack size until StackOverflowErrors are sufficiently rare, and make sure the VM is never actually touching the stack unless it needs to (as calloc might!) and maybe madvise the stack back to the OS when unused. Wearing my jsr166 hat, it's easy for us to add commented out annotations into our sources that will become uncommented in our upstreaming script, so it's not a big burden for us.
24-11-2015

Martin, Thank you for reading this proposal and offering this suggestion and the links. We agree that the extensible stack solution has many positive characteristics and is technically feasible (and this work does not in any way preclude that.) However, both the segmented stack and stack reallocation approaches would be a major and risky investment, and as a matter of practicality, we prefer to pursue this simpler lower-risk strategy at this time. We are not opposed to the approaches you suggest -- far from it. In fact, if Google can put its resources and JDK experience behind exploring and implementing one of these approaches, we would be very interested.
24-11-2015

I am opposed to this JEP, especially as a long term solution, and prefer that the JVM implementers work towards "Extensible stacks" instead, although it's surely a lot of work. Stack overflow is a symptom of "C programmer's disease" that Java should be in the forefront of eradicating. The Go, Rust, gcc and llvm implementer communities have forged ahead and created language runtimes free from stack overflow, and Java can follow the existing implementation strategies. The JVM already has extra checks on stack frame allocation in order to throw StackOverflowError, so Extensible Stacks should be implementable without any extra overhead in the fast path. On 64-bit platforms, which the industry is increasingly moving to (especially for the kind of high-reliability computing being targeted here) it's cheap to pre-allocate a very large address space for each thread, so there should be no need for implementing segmented stacks. On 32-bit, there is a choice between implementing segmented stacks and the "realloc strategy" of resizing and moving the stack to a different location (Go recently switched to the latter IIRC) and these are serious work to implement, but again prior art suggests it's not prohibitively difficult. http://cmr.github.io/blog/2013/10/21/on-stack-safety/ https://gcc.gnu.org/wiki/SplitStacks http://golang.org/s/contigstacks
23-11-2015

There would need to be a lot of compiler magic to eliminate the overhead implicit in creating a CriticalBlock and the lambda expression. That probably introduces more implicit allocations etc. :) I think an annotation is far less intrusive.
07-11-2014

Maybe a new, explicit construct like: final void lock() { new CriticalBlock(() -> { if (compareAndSetState(0, 1)) { setExclusiveOwnerThread(Thread.currentThread()); } else { acquire(1); } }).enter(); } The VM will ensure that there's enough stack and thread-local allocation area before entering the critical block, or else will throw an OOM without executing anything in the critical block. In debug mode, or with a special VM parameter, we can check that the critical block does not exceed a pre-set limit of stack/heap resources, or else generate a warning or an assert. For performance, the JIT compiler should be able to analyze small critical blocks (as in the above example which seems to not do any allocation, and all calls can be inlined so no extra stack is needed) and omit all checks.
07-11-2014

To pick up on John's final point, the generalization of this is the prevention of all "asynchronous exceptions" whilst in critical code. StackOverflowError and OutOfMemoryError are by far the two most likely causes - and ones that are relatively easy to cause - so it is reasonable to expend some effort to address them. I see implicit allocations becoming a bigger problem in the future, and something over which the programmer has little, if any, control.
07-11-2014

From the performance perspective, current proposal seems to have neutral performance characteristics, since it only complicates the failure path exercised near the end of stack space, not the regular code path. That is modulo unforeseen effects of allocating a little more space for reserved stack for each thread -- how much are we talking about, 4K (a single page) per thread?
05-11-2014

My 2 (skeptical) cents: StackOverflowError in user thread doing the ReentrantLock.lock() seems unrecoverable, easily detectable, and avoidable-with-larger-Xss error. Raising that error in the middle of critical section may and will leave the critical section-protected code in the inconsistent state, as well as any other Error raised there, e.g. OutOfMemoryError. Why should we care? If only the solution was to never raise the exceptions in critical sections, e.g. resizing the stacks, I would understand the solution. But that's not doable with this proposal: there would be cases when reserved space would be too low to accommodate excess stack requirements. This will get us back to square one: unrecoverable, easily detectable, and avoidable-with-larger-(Xss|ReservedStackSize) error. Therefore, I don't see how this actually helps, except for pushing back the inevitable. If your application code plays near the stack capacity limits, you are already playing with fire -- so maybe we are better off providing the diagnostics for these cases -- e.g. warn users when their code enters the danger zone? Implementation-wise, introduce the GreenYellowishZone before the YellowZone, and start to print warning messages when code reaches it? It would seem similar to ReservedStack proposal, but with one crucial difference: it will be exercised by any code entering the danger zone, not some point cases for code with given @ReservedStackAccess annotation. Isn't it better for testability and reliability? The warning message will and should prompt users to increase the stack sizes for their applications without pushing VM to (hopelessly) try to recover from taking-fire scenario.
05-11-2014

Aleksey: We can set the reserved space such that all direct uses of ReentrantLock in core library code will not encounter StackOverflow. So this does help to make the core libraries more robust - which was the motivation for this. It can't help general user code that uses ReentrantLock.
05-11-2014