JDK-8133206 : "java.lang.OutOfMemoryError: unable to create new native thread" caused by upgrade to zlib 1.2.8
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.jar
  • Affected Version: 7u80
  • Priority: P2
  • Status: In Progress
  • Resolution: Unresolved
  • Submitted: 2015-08-07
  • Updated: 2017-07-06
Related Reports
Duplicate :  
Relates :  
Relates :  
Description
One of our customers started to see start-up failures of their application server caused by out of memory errors on native (non-Java) level. In the server logs they see many messages like this one:

java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
...

The error started to appear after they upgraded from Java 7u79 32 bit to Java 7u80 32 bit. The platform is Windows Server 2012 R2.
During the investigation it was discovered that the error is caused by zlib 1.2.3 => 1.2.8 library update. Replacing zip.dll in 7u80 with its version from 7u79 fixes the problem. So technically this issue is a regression of the following change:
https://bugs.openjdk.java.net/browse/JDK-8044725
Comments
code review mail thread : http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-October/035884.html
15-10-2015

I contacted Mark Adler, one of the authors of zlib library, and, instead of changing the library code, he recommended to change inflate() calls so that they use a flush value of Z_FINISH. This way we preserve the old behavior (as before the problematic change in zlib 1.2.6) with the only difference that Z_BUF_ERROR is returned instead of Z_OK if it can't finish the inflation in one inflate() call. This code is absolutely not fatal and is just an indication that inflation didn't finish even though it was claimed that it would, and so a 32K window was allocated before inflate() returned so that inflation can continue.
12-10-2015

In zlib repository there are the comments discovering the rationale behind the problematic change - adding support for Microsoft CAB file format. Here are 3 related changesets: https://github.com/madler/zlib/commit/77b47d55f14be032c0ee10da44bbe7591be0abd8 https://github.com/madler/zlib/commit/afe7cf78d51b819dcdc5b0f4cb85a25a52a9fcd0 https://github.com/madler/zlib/commit/421c7a61f08ff255bf898f0e687b7d7c8fbdae77
12-10-2015

The problem is a regression of the following small change in inflate.c file (I'm comparing zlib 1.2.5 version and 1.2.6 version; the corresponding file in OpenJDK is src/share/native/java/util/zip/zlib-1.2.8/inflate.c): @@ -1214,7 +1233,8 @@ */ inf_leave: RESTORE(); - if (state->wsize || (state->mode < CHECK && out != strm->avail_out)) + if (state->wsize || (out != strm->avail_out && state->mode < BAD && + (state->mode < CHECK || flush != Z_FINISH))) if (updatewindow(strm, out)) { state->mode = MEM; return Z_MEM_ERROR; Here is how I discovered this. Firstly, I took OpenJDK 7u80 and tried it with different versions of zlib code. The test scenario used only the following API: int inflateInit2(z_streamp strm, int windowBits); int inflate(z_streamp strm, int flush); int inflateReset(z_streamp strm); int inflateEnd(z_streamp strm); with "flush" parameter value always equal to Z_PARTIAL_FLUSH. The results were: 1.2.3 - PASSED 1.2.4 - PASSED 1.2.5 - PASSED 1.2.6 - FAILED Then, I searched for the minimal change between 1.2.5 and 1.2.6 that leads to the failure. It turned up to be the lines above. 1.2.5 plus these lines leads to the failure in the test scenario. After that, I prepared an instrumented build of OpenJDK 7u80 with its standard zlib 1.2.8, where I added the following debug statements: 1) Added debug prints to zcalloc() and zcfree() functions so that I could count the number of calls and the amount of allocated memory. 2) Added the following debug output to inflate.c: @@ -1254,11 +1254,22 @@ inf_leave: RESTORE(); if (state->wsize || (out != strm->avail_out && state->mode < BAD && - (state->mode < CHECK || flush != Z_FINISH))) + (state->mode < CHECK || flush != Z_FINISH))) { + if (state->mode >= CHECK && state->mode < BAD) { + sprintf(msg, "inflate: m=%d, f=%d, w=%s\n", state->mode, flush, + (state->window == Z_NULL) ? "n" : "y"); + stream = fopen("4444_log.txt", "a+t"); + if (stream != NULL) { + fwrite(msg, sizeof(char), strlen(msg), stream); + fclose(stream); + } + } + if (updatewindow(strm, strm->next_out, out - strm->avail_out)) { state->mode = MEM; return Z_MEM_ERROR; } + } in -= strm->avail_in; out -= strm->avail_out; strm->total_in += in; In the resulting log there were about 56K lines inflate: m=28, f=1, w=n It means, a lot of updatewindow() calls happen in DONE state with the window not created yet. Every such call results in allocation of 32KB. And the total number of zcalloc() calls was 146K for 2785MB in total. It means that in the test scenario the total amount of memory allocated via zcalloc() increased after the change in zlib library by 169%.
12-10-2015

I managed to get the native memory map for "good" and "bad" runs using the Native Memory Tracking feature. In the "good" run (7u80 + zlib 1.2.3) the 32 bit address space is enough to fit all VM structures, including Java heap and thread stacks for ~250 threads - the application starts successfully. It looks like there is still at least 300 MB of the address space available. In the "bad" run (7u80 + zlib 1.2.8) the address space is exhausted after creating ~100 threads, and all further thread creation attempts fail with OOM. In the address space the thread stacks are located much sparser that in the "good" run.
10-08-2015