JDK-6423457 : (coll) High cpu usage in HashMap.get()
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util:collections
  • Affected Version: 1.4.2_09
  • Priority: P3
  • Status: Closed
  • Resolution: Not an Issue
  • OS: solaris_9
  • CPU: sparc
  • Submitted: 2006-05-09
  • Updated: 2012-10-08
  • Resolved: 2006-05-12
Description
In WS6.0 with 1.4.2_09, a HashMap is loaded with 200 items at startup (in synchronized code) and is then accessed by multiple threads at a high rate (formatCell is called 600 times/sec) in unsynchronized code.
See source code in http://balabiot.italy/calls/37570730_borsait/0304/src/FieldFormatter.java
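The FieldFormatter source itself is only available at the link above. As a rough sketch of the reported pattern (all method and field names here are hypothetical, reconstructed from the description):

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class FieldFormatter {
        // ~200 entries, populated once at startup inside synchronized code.
        private static final Map FORMATS = new HashMap();

        // Startup only: writes are synchronized.
        public static synchronized void register(String field, String pattern) {
            FORMATS.put(field, pattern);
        }

        // Hot path (~600 calls/sec from many threads), unsynchronized.
        // If any write races with these reads, the map's internal entry
        // chains can be corrupted and get() may never return.
        public static String formatCell(String field, String value, Locale locale) {
            Object pattern = FORMATS.get(field);  // unsynchronized HashMap.get()
            return (pattern == null) ? value : pattern + ":" + value;
        }
    }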
After a few weeks, some threads start consuming all the CPU while running get():

http://balabiot.italy/calls/37570730_borsait/1004/prstat.out:

 PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID      
  4384 www      2911M 2604M cpu0    20   10   0:16:51  27% webservd/112
  4384 www      2911M 2604M run     20   10   0:16:51  24% webservd/157
  4384 www      2911M 2604M run     21   10   0:44:56  23% webservd/111
  4384 www      2911M 2604M sleep   29   10   0:00:02 1.3% webservd/177

When the problem occurs we have asked the customer to collect prstat output, multiple pstacks, a pmap, and a gcore; from the core we then extracted, via SA scripts, a Java thread dump, heap statistics, a class histogram, and the Java flags.
No truss available.
/net/balabiot.italy/calls/37570730_borsait/0304
/net/balabiot.italy/calls/37570730_borsait/1004

http://balabiot.italy/calls/37570730_borsait/1004/core.4384.mwsbit03.jstack

Deadlock Detection:
No deadlocks found.
....
Thread t@111: (state = IN_JAVA, current Java SP = null)
- java.util.HashMap.get(java.lang.Object) @bci=72, line=325, pc=0xf8d32788, methodOop=0xf0063b10 (Compiled frame; information may be imprecise)
- com.inferentiadnm.display.dao.FieldFormatter.formatCell(java.lang.String, java.lang.String, java.util.Locale) @bci=125, line=248, pc=0xf8903ac8, methodOop=0xf0b0dcc0 (Compiled frame)
...
Thread t@112: (state = IN_JAVA, current Java SP = null)
- java.util.HashMap.get(java.lang.Object) @bci=6, line=317, pc=0xf8d3264c, methodOop=0xf0063b10 (Compiled frame; information may be imprecise)
- com.inferentiadnm.display.dao.FieldFormatter.formatCell(java.lang.String, java.lang.String, java.util.Locale) @bci=125, line=248, pc=0xf8903ac8, methodOop=0xf0b0dcc0 (Compiled frame)
...
Thread t@157: (state = IN_JAVA, current Java SP = null)
- java.util.HashMap.get(java.lang.Object) @bci=6, line=317, pc=0xf8d3264c, methodOop=0xf0063b10 (Compiled frame; information may be imprecise)
- com.inferentiadnm.display.dao.FieldFormatter.formatCell(java.lang.String, java.lang.String, java.util.Locale) @bci=125, line=248, pc=0xf8903ac8, methodOop=0xf0b0dcc0 (Compiled frame)

The program counters are:
Thread t@111: pc=0xf8d32788
Thread t@112: pc=0xf8d3264c
Thread t@157: pc=0xf8d3264c

We see two threads (112 and 157) running the same optimized native code and using exactly the same amount of CPU time.

From the sequence of three pstacks, looking only at the program counters and the first arguments, we see that all the threads are running the same code with different arguments (e.g. f8d32790):

pstack1.out thread# 111: f8d32648 (4055b100, 4054a990, 4055b100, 72127600, 0, 4055b0d0)
pstack2.out thread# 111: f8d32790 (4055b100, 4054a990, 4055b100, 72127600, 0, 4055b0d0)
pstack3.out thread# 111: f8d3264c (4055b100, 4054a990, 4055b100, 72127600, 0, 4055b0d0)

pstack1.out thread# 112: f8d32790 (4e8401b0, 4e6bb890, 4e8401b0, 7212bd40, 0, 4e840180)
pstack2.out thread# 112: f8d3264c (4e8401b0, 4e6bb890, 4e8401b0, 7212bd40, 0, 4e840180)
pstack3.out thread# 112: f8d3264c (4e8401b0, 4e6bb890, 4e8401b0, 7212bd40, 0, 4e840180)

pstack1.out thread# 157: f8d32790 (4e19a448, 4e11aef0, 4e19a448, 7162c638, 0, 4e19a418)
pstack2.out thread# 157: f8d3279c (4e19a448, 4e11aef0, 4e19a448, 7162c638, 0, 4e19a418)
pstack3.out thread# 157: f8d32658 (4e19a448, 4e11aef0, 4e19a448, 7162c638, 0, 4e19a418)

In the previous set of data (http://balabiot.italy/calls/37570730_borsait/0304), two threads were consuming all the CPU running HashMap.get() and HashMap.eq(), which is called from the get() implementation:

Total: 69 processes, 474 lwps, load averages: 2.47, 2.77, 3.00
  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
 8090 www      2501M 2385M run      0   10  64:43:10  48% webservd/139
 8090 www      2501M 2385M cpu1     0   10   0:55:06  48% webservd/109
 4037 bitsys02 2848K 1704K sleep   59    0   0:00:03 0.2% top/1
 
 ----------------- t@139 -----------------
 0xf8a1e8d0 0xf8a1e8d0 * java.util.HashMap.get(java.lang.Object) bci:72 line:325 methodOop:0xf0063b10 (Compiled frame; information may be imprecise)
 0xf88652c8 0xf88652c8 * com.inferentiadnm.display.dao.FieldFormatter.formatCell(java.lang.String, java.lang.String, java.util.Locale) bci:125 line:248 methodOop:0xf0b10c80 (Compiled frame)
...
 ----------------- t@109 -----------------
 0xf8a1e8c8 0xf8a1e8c8 * java.util.HashMap.eq(java.lang.Object, java.lang.Object) bci:7 line:274 methodOop:0xf0063dd8 (Compiled frame; information may be imprecise)
 0xf88652c8 0xf88652c8 * com.inferentiadnm.display.dao.FieldFormatter.formatCell(java.lang.String, java.lang.String, java.util.Locale) bci:125 line:248 methodOop:0xf0b10c80 (Compiled frame)
...

http://balabiot.italy/calls/37570730_borsait/0304/trace/pstack1_dem.out
-----------------  lwp# 109 / thread# 109  --------------------
 f8a1e884 ???????? (f0009018, 93ecd750, fffbe4e7, 71483fe8, f0592648, 331d1c68)
-----------------  lwp# 139 / thread# 139  --------------------
 f8a1e8d4 ???????? (f0009018, 73f42118, fffbe4e7, 7180c7f8, f059b9f8, 36b199a0)

http://balabiot.italy/calls/37570730_borsait/0304/trace/pstack2_dem.out
-----------------  lwp# 109 / thread# 109  --------------------
 f8a1e884 ???????? (f0009018, 93ecd750, fffbe4e7, 71483fe8, f0592648, 331d1c68)
-----------------  lwp# 139 / thread# 139  --------------------
 f8a1e8c8 ???????? (f0009018, 73f42118, fffbe4e7, 7180c7f8, f059b9f8, 36b199a0)

http://balabiot.italy/calls/37570730_borsait/0304/trace/pstack3_dem.out
-----------------  lwp# 109 / thread# 109  --------------------
 f8a1e884 ???????? (f0009018, 93ecd750, fffbe4e7, 71483fe8, f0592648, 331d1c68)
-----------------  lwp# 139 / thread# 139  --------------------
 f8a1e8d4 ???????? (f0009018, 73f42118, fffbe4e7, 7180c7f8, f059b9f8, 36b199a0)

Comments
EVALUATION Doug Lea writes: "This is a classic symptom of an incorrectly synchronized use of HashMap. Clearly, the submitters need to use a thread-safe HashMap. If they upgraded to Java 5, they could just use ConcurrentHashMap. If they can't do this yet, they can use either the pre-JSR166 version, or better, the unofficial backport as mentioned by Martin. If they can't do any of these, they can use Hashtable or synchronizedMap wrappers, and live with poorer performance. In any case, it's not a JDK or JVM bug." I agree that the presence of a corrupted data structure alone does not indicate a bug in the JDK.
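For illustration, Doug Lea's first two options amount to a one-line change in how the map is constructed (class and field names below are hypothetical):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class MapChoices {
        // Java 5 and later: gets never block and never see a corrupted chain.
        static final Map CONCURRENT = new ConcurrentHashMap();

        // Any release, including 1.4.2: every access locks the wrapper.
        // Safe, but all readers contend on a single lock.
        static final Map SYNCHRONIZED = Collections.synchronizedMap(new HashMap());
    }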
10-05-2006

EVALUATION Brought up a core file in HSDB and found a broken HashMap. The first thread I looked at, t@157, is shown in the attached picture: its stack memory is being viewed, and there is a HashMap$Entry in the frame for HashMap.get, which I opened in the Inspector. Very quickly we can see that the entry chain is circular, which I would think is a problem.
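To see why a circular entry chain pins a CPU, here is an illustrative reimplementation (not the shipped JDK source) of the bucket walk that a 1.4.2-era HashMap.get() performs:

    // Simplified Entry node and bucket walk, mirroring the 1.4.2-era layout.
    final class Entry {
        final int hash; final Object key; Object value; Entry next;
        Entry(int hash, Object key, Object value, Entry next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    class BucketWalk {
        static Object get(Entry[] table, Object key) {
            int hash = key.hashCode();
            Entry e = table[hash & (table.length - 1)];
            while (e != null) {                          // a racy resize can make
                if (e.hash == hash && key.equals(e.key)) // the chain circular, so
                    return e.value;                      // e never becomes null:
                e = e.next;                              // get() spins at 100% CPU,
            }                                            // as in the prstat above
            return null;
        }
    }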
10-05-2006

EVALUATION If a HashMap is used in a concurrent setting with insufficient synchronization, the data structure can become corrupted in such a way that infinite loops appear in its entry chains, and get() can then loop forever. We believe this is very likely an error in the application code. If this is a real bug, we would need a small reproducible test case using only core JDK APIs; otherwise we will close this as Not Reproducible. One way to be more confident that the map is not modified during the "read-only" phase is to divide the application into two phases with two different maps. First populate the map through a thread-safe wrapper:
 m = Collections.synchronizedMap(new HashMap());
then make an immutable copy for the read-only phase:
 m2 = Collections.unmodifiableMap(new HashMap(m));
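A minimal, self-contained sketch of that two-phase approach (the map contents are hypothetical):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    class TwoPhase {
        static Map buildFormats() {
            // Phase 1: populate through a thread-safe wrapper.
            Map m = Collections.synchronizedMap(new HashMap());
            m.put("price", "###,##0.00");   // hypothetical entries
            m.put("qty",   "###,##0");

            // Phase 2: hand out an unmodifiable copy for the read-only phase.
            // The copy must still be published safely to reader threads,
            // e.g. assigned to a final field or under synchronization.
            return Collections.unmodifiableMap(new HashMap(m));
        }
    }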
09-05-2006

EVALUATION There is a 3rd party backport of java.util.concurrent, available at: http://dcl.mathcs.emory.edu/util/backport-util-concurrent/
09-05-2006

WORK AROUND We have suggested the following:
- Implement full synchronization in FieldFormatter.java by wrapping each new HashMap in Collections.synchronizedMap, as suggested in http://java.sun.com/j2se/1.4.2/docs/api/java/util/HashMap.html. The customer had problems with the code's static initializations, and also argued that this small HashMap (200 items) is already synchronized for writes and is effectively read-only: it is fully initialized within the first 10 minutes and never modified afterwards, so no synchronization on get() should be required.
- Run with a .hotspot_compiler file containing:
   exclude java/util/HashMap get
   exclude com/inferentiadnm/display/dao/FieldFormatter formatCell
  (to rule out any problem caused by compiler optimization and to get better information in pstack output and core files). The customer refused, to avoid any performance impact: formatCell is called 600 times per second and is used in 80% of loaded pages, with nearly 20 million page hits per month.
- Upgrade to WS6.1 and Java 5: the customer said testing the application on a new web server and Java version would take a lot of time, and they have an application outage every week.
- Upgrade to 1.4.2_11: they requested a full analysis and, if warranted, a fix for this problem first.
- Try the pre-JSR166 ConcurrentHashMap (suggested by Lew), http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/ConcurrentHashMap.html; they are currently using this workaround (a sketch follows below). Note that there is no support for ConcurrentHashMap in 1.4.2_XX, and there will not be any backport of ConcurrentHashMap to 1.4.2_XX, as these are new APIs.
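For reference, a sketch of the adopted workaround, assuming Doug Lea's util.concurrent jar from the URL above is on the classpath (the field name is hypothetical):

    import java.util.Map;
    import EDU.oswego.cs.dl.util.concurrent.ConcurrentHashMap;

    class FieldFormatterFixed {
        // The pre-JSR166 ConcurrentHashMap implements java.util.Map and runs
        // on 1.4.2. Concurrent get() is safe even against a racing put(), so
        // the corrupted-chain infinite loop cannot occur.
        static final Map FORMATS = new ConcurrentHashMap();
    }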
09-05-2006