Bug ID: JDK-8220293 Deadlock in JFR string pool

Type: Bug
Component: hotspot
Sub-Component: jfr
Affected Version: 11,13

Priority: P2
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2019-03-07
Updated: 2020-06-29
Resolved: 2019-05-21

JDK 11	JDK 13	Other
11.0.4-oracleFixed	13 b22Fixed	openjdk8u262Fixed

Hi,

there is a deadlock triggered by the JFR StringPool. The problem starts when the jfr_add_string_constant method tries to write a string and the called WriterHost::ensure_size() method find not enough memory in the current buffer. This leads to JfrStringPool::lease_buffer() being called indirectly, which then tries to get a suitable buffer from the memory space via mspace_get_free_lease_with_retry(). This will eventually try to get an already available buffer from the memory space via JfrMspaceRetrieval::get(size,..), which looks like this:

  static Type* get(size_t size, Mspace* mspace, typename Mspace::Iterator& iterator, Thread* thread) {
    while (iterator.has_next()) {
      Type* const t = iterator.next();
      if (t->retired()) continue;
      if (t->try_acquire(thread)) {
        assert(!t->retired(), "invariant");
        if (t->free_size() >= size) {
          return t;
        }
        t->set_retired();
        mspace->register_full(t, thread);
      }
    }
    return NULL;

When it finds a buffer in the list it does a try-lock on the buffer. Let's assume this works. Then it checks if the locked buffer can accomidate the data we want to write. If not, the buffer is set to retired and the register_full() method of the memory space is called. In this case it is the method JfrStringPool::register_full(), which looks like this:

void JfrStringPool::register_full(BufferPtr t, Thread* thread) {
  // nothing here at the moment
  assert(t->retired(), "invariant");
}

So the buffer we found is still locked. Later in the 'JFR Recorder Thread' a rotation is triggered. This causes JfrRecorderService::pre_safepoint_write() to eventually call the write StringPoolWriteOp::write() method of the locked buffer. This method first tries to get the lock of the buffer via acquire(), but this call will obviously never return.

Since the thread now hangs forever, the next safepoint request will block forever too. At this point the VM is more or less dead.

I could remove the deadlock by adding 't->release();' to the JfrStringPool::register_full() method, but since I don't have the best overview of the whole buffer handling, I'm not sure this is the correct fix.


You can reproduce the problem with the following program:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class FileIoStresser implements Runnable {

    public static void main(String[] args) {
        for (int i = 0; i < 3; ++i) {
            new Thread(new FileIoStresser()).start();
        }
    }

    @Override
    public void run() {
        while (true) {
            StringBuilder sb = new StringBuilder();

            for (int i = (int) (Math.random() * 40 + 16); i >= 0; --i) {
                sb.append((char) ('a' + Math.random() * 26));
            }

            File f = new File("aaaa" + sb.toString());

            try (FileOutputStream fos = new FileOutputStream(f)) {
                fos.write(1);
                fos.write(2);
            } catch (IOException e) {
                // Ignore.
            }

            f.delete();
        }
    }
}

This program creates small files in the current directory and deletes them directly (they all start with 'aaaa'). The sizes of the filenames is choosen, so that the string pool will be hit often. When you profile it with JFR and enable the File I/O Trace with a threshold of 0, the deadlock should occur in a few minutes. You can add -Xlog:safepoint=debug to see when the safepoint is requested but never finishes. And if you make the chunk size small via -XX:FlightRecorderOptions=maxchunksize=1M, the deadlock occurs faster.


Best regards,
Ralf





Here is an actual stack trace when the JfrStringPool::register_full() method is called:

JfrStringPool::register_full(JfrStringPoolBuffer * t, Thread * thread) Line 228
JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool>::register_full(JfrStringPoolBuffer * t, Thread * thread) Line 94
JfrMspaceRetrieval<JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> >::get(unsigned __int64 size, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, StopOnNullIterator<JfrDoublyLinkedList<JfrStringPoolBuffer>,StackObj> & iterator, Thread * thread) Line 56
JfrMspaceSequentialRetrieval<JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> >::get(unsigned __int64 size, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, Thread * thread) Line 84
JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool>::get(unsigned __int64 size, Thread * thread) Line 101
mspace_get_free<JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> >(unsigned __int64 size, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, Thread * thread) Line 212	C++
mspace_get_free_with_retry<JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> >(unsigned __int64 size, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, unsigned __int64 retry_count, Thread * thread) Line 218
mspace_get_free_lease_with_retry<JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> >(unsigned __int64 size, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, unsigned __int64 retry_count, Thread * thread) Line 263
JfrStringPool::lease_buffer(Thread * thread, unsigned __int64 size) Line 122
JfrStringPool::flush(JfrStringPoolBuffer * old, unsigned __int64 used, unsigned __int64 requested, Thread * thread) Line 111
JfrStringPoolFlush::JfrStringPoolFlush(JfrStringPoolBuffer * old, unsigned __int64 used, unsigned __int64 requested, Thread * t) Line 32
Adapter<JfrStringPoolFlush>::flush(unsigned __int64 used, unsigned __int64 requested) Line 84
StorageHost<Adapter<JfrStringPoolFlush>,StackObj>::accommodate(unsigned __int64 used, unsigned __int64 requested) Line 66
WriterHost<EncoderHost<BigEndianEncoderImpl,BigEndianEncoderImpl>,EncoderHost<Varint128EncoderImpl,BigEndianEncoderImpl>,AcquireReleaseMemoryWriterHost<Adapter<JfrStringPoolFlush>,StackObj> >::ensure_size(unsigned __int64 requested) Line 172
WriterHost<EncoderHost<BigEndianEncoderImpl,BigEndianEncoderImpl>,EncoderHost<Varint128EncoderImpl,BigEndianEncoderImpl>,AcquireReleaseMemoryWriterHost<Adapter<JfrStringPoolFlush>,StackObj> >::be_write<signed char>(const char * value, unsigned __int64 len) Line 128
WriterHost<EncoderHost<BigEndianEncoderImpl,BigEndianEncoderImpl>,EncoderHost<Varint128EncoderImpl,BigEndianEncoderImpl>,AcquireReleaseMemoryWriterHost<Adapter<JfrStringPoolFlush>,StackObj> >::_write(_jstring * string) Line 233
JfrStringPool::add(bool epoch, __int64 id, _jstring * string, JavaThread * jt) Line 138
jfr_add_string_constant(JNIEnv_ * env, _jclass * jvm, unsigned char epoch, __int64 id, _jstring * string) Line 304


And here is a stack trace for the stuck recorder thread:
OrderAccess::load_acquire<CompiledMethod *>(CompiledMethod * const volatile * p) Line 338
JfrBuffer::acquire(const void * id) Line 130
StringPoolWriteOp::_write(JfrStringPoolBuffer * buffer, const unsigned char * data, unsigned __int64 size) Line 154
ConcurrentWriteOp<StringPoolWriteOp>::process(JfrStringPoolBuffer * t) Line 51
JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool>::iterate<ConcurrentWriteOp<StringPoolWriteOp>,StopOnNullIterator<JfrDoublyLinkedList<JfrStringPoolBuffer>,StackObj> >(ConcurrentWriteOp<StringPoolWriteOp> & callback, bool full, jfr_iter_direction direction) Line 122
process_free_list_iterator_control<ConcurrentWriteOp<StringPoolWriteOp>,JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool>,StopOnNullIterator<JfrDoublyLinkedList<JfrStringPoolBuffer>,StackObj> >(ConcurrentWriteOp<StringPoolWriteOp> & processor, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, jfr_iter_direction direction) Line 325
process_free_list<ConcurrentWriteOp<StringPoolWriteOp>,JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> >(ConcurrentWriteOp<StringPoolWriteOp> & processor, JfrMemorySpace<JfrStringPoolBuffer,JfrMspaceSequentialRetrieval,JfrStringPool> * mspace, jfr_iter_direction direction) Line 344
JfrStringPool::_write() Line 175
ServiceFunctor<JfrStringPool,&JfrStringPool::_write>::process() Line 192
WriteCheckpointEvent<ServiceFunctor<JfrStringPool,&JfrStringPool::_write> >::process() Line 168
write_stringpool_checkpoint(JfrStringPool & string_pool, JfrChunkWriter & chunkwriter) Line 405
JfrRecorderService::pre_safepoint_write() Line 431
JfrRecorderService::_write() Line 385
JfrRecorderService::finalize_current_chunk() Line 378
JfrRecorderService::chunk_rotation() Line 372
JfrRecorderService::rotate(int msgs) Line 331
recorderthread_entry(JavaThread * thread, Thread * unused) Line 76
JavaThread::thread_main_inner() Line 1870
JavaThread::run() Line 1853
Thread::call_run() Line 405
thread_native_entry(Thread * thread) Line 470

requesting a backport of the fix into openjdk8u262, considered important for the stability of the jfr backport included into this release. the patch applies cleanly except for files layout changes
16-06-2020
fix incl in CPU - removing BPR label
27-06-2019
Fix request: This bugfix shall be taken to OpenJDK 11.0.4, since it is also part of Oracles 11.0.4. The patch applies cleanly. Testing in SAP's test environment.
31-05-2019
I was able to reproduce on 11.0.3 with testcase from the description.
19-04-2019
[~rschmelter] Thanks Ralf for a great bug report! The StringPool subsystem is less tested compared to for example the buffer subsystem. There are slightly different invariants; I think you are right on track with the analysis. I will double-check it using your reproducer. Thanks Markus
20-03-2019
ILW: HMM -> P2 I: High: hang L: Medium, reproducer exists, but this does not always show W: Medium, increase buffer size
12-03-2019