JDK-4720694 : java apps crash on Solaris 9 Ultra-80 machine by using 1.4.1
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 1.3.1_16,1.4.1
  • Priority: P2
  • Status: Closed
  • Resolution: Won't Fix
  • OS: solaris_8,solaris_9
  • CPU: sparc
  • Submitted: 2002-07-25
  • Updated: 2006-12-01
  • Resolved: 2006-12-01
J2SE Version (please include all output from java -version flag):
  java version "1.4.1-rc"
  Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-rc-b16)
  Java HotSpot(TM) Server VM (build 1.4.1-rc-b16, mixed mode)

Does this problem occur on J2SE 1.3 or 1.4?  Yes / No (pick one)
  Not sure. It is apparently a threading bug, so it might be rare.

Operating System Configuration Information (be specific):  Solaris 9
Hardware Configuration Information (be specific):   4x450 Ultra-80
   Works fine on Ultra10 1-cpu machine.
Bug Description:  Crashes with a core file. Sometimes it just hangs at 1K classes/threads.

Steps to Reproduce (be specific):
1) unzip concurrent.zip (Get util.concurrent package from http://gee.cs.oswego.edu - also concurrent.zip attached here).
2) cd concurrent
3) mkdir classes
4) javac -d classes *.java
5) cp misc/* classes/.
6) cd classes
7) javac *.java
8) mkdir EDU/oswego/cs/dl/util/concurrent/misc
9) mv *.class EDU/oswego/cs/dl/util/concurrent/misc/.
10) mv *.java EDU/oswego/cs/dl/util/concurrent/misc/.
11) mv *.html EDU/oswego/cs/dl/util/concurrent/misc/.
12) java -server -Xmx128m EDU.oswego.cs.dl.util.concurrent.misc.SynchronizationTimer
13) The above command should launch an application window. Below are the steps the user
    needs to execute to reproduce the issue (also reflected in 'panel-operation.JPG'):
    NOTE: the user must set the PATH environment variable to a valid java executable before launching the GUI.
14) In the application GUI, click "no classes".
15) Click "waitfreeQueue".
16) Set "128k calls per thread" in the combo box.
17) Set "1M iterations per barrier" in the combo box.
18) Click "start".
19) You should get the following HotSpot error message:

# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
# Java VM: Java HotSpot(TM) Server VM (1.4.1-rc-b16 mixed mode)
# Error ID: 53484152454432554E54494D450E435050014F 01
# Problematic Thread: prio=4 tid=0x5ab228 nid=0x8b4 runnable 

And a core file is generated.

EVALUATION To clarify closing the bug: the intent is to re-run the test case against current builds and open explicit CRs (bugs) against any failures. This bug as filed is misleading — there are delivered fixes against 1.4.2 and 5.0 for the crash, yet the bug is explicitly against 1.4.1, and hangs have been reported in some testing even after those fixes. Closing with regard to the 1.3.1 aspect only.

EVALUATION The fix resolves the reported issue (meaning, the crash no longer occurs with the fix). However, there are still hangs in 1.3.1. The following JVM versions were used to test the fix:

$> java -version -server -Xmx128m EDU.oswego.cs.dl.util.concurrent.misc.SynchronizationTimer &
java version "1.5.0_07"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-b02)
Java HotSpot(TM) Client VM (build 1.5.0_07-b02, mixed mode)

$> java -version -server -Xmx128m EDU.oswego.cs.dl.util.concurrent.misc.SynchronizationTimer &
java version "1.3.1_18-internal"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1_18-internal-wsmgr_14_mar_2006_15_00)
Java HotSpot(TM) Client VM (build 1.3.1_18, mixed mode)

Here are the test results:
* costume.sfbay (SunBlade 1000, 1 x 900 MHz, Solaris 9, 1 GB memory):
  + 1.5.0_07: finishes correctly.
  + 1.3.1_18: hangs at 256 threads.
* tryout.sfbay (SunBlade 1000, 2 x 750 MHz, Solaris 8, 2 GB memory):
  + 1.5.0_07: finishes correctly.
  + 1.3.1_18: hangs at 256 threads.
* somerset.sfbay (SunBlade 2500, 2 x 1600 MHz, Solaris 9, 2 GB memory):
  + 1.5.0_07: finishes correctly.
  + 1.3.1_18: hangs at 256 threads.
* producer.sfbay (E4500, 14 x 400 MHz, Solaris 8, 8 GB memory):
  + 1.5.0_07: finishes correctly.
  + 1.3.1_18: hangs at 256 threads.
* scoot.sfbay.sun.com (E4500, 14 x 400 MHz, Solaris 10, 4 GB memory):
  + 1.5.0_07: hangs indefinitely when it gets to the last column (i.e., 1000 threads - please see attached image).
  + 1.3.1_18: hangs at 256 threads.
* jcteu80x2.sfbay (Ultra80, 4 x 450 MHz, 2 GB memory):
  + 1.5.0_07: hangs indefinitely (GUI grayed out).
  + 1.3.1_18: sometimes hangs indefinitely (doesn't seem to be dependent on the number of threads).
As I mentioned, no crash occurred on any of the above machines. prstat output is available here:
* prstat info for Tiger:
  o /net/nightsvr/export3/jpse/regress.library/4720694B/tests/test_results/1.5.0u7
* prstat info for 1.3.1:
  o /net/nightsvr/export3/jpse/regress.library/4720694B/tests/test_results/1.3.1

CONVERTED DATA BugTraq+ Release Management Values COMMIT TO FIX: 1.4.1_07 generic mantis-rc FIXED IN: 1.4.1_07 mantis-rc INTEGRATED IN: 1.4.1_07 mantis-b20 mantis-rc tiger-b05

EVALUATION ###@###.### 2002-07-25 Here are the test results for different machine configurations:
* Ultra10, 1 CPU, Solaris 8, 1024 MB memory:
  - b16: OK
  - b17: OK
* E3500, 6 x 400 MHz, Solaris 9, 3 GB memory:
  - b17: crash, core file generated
  - b16: crash, core file generated
  - 1.4: java.lang.OutOfMemoryError exception and hang
* SunBlade, 2 x 750 MHz, Solaris 8, 2 GB memory:
  - b16: hang on 256 classes/threads
  - 1.4: hang on 256 classes/threads
* Ultra80, 4 x 450 MHz, Solaris 9, 4 GB memory:
  - 1.4: hang on 256 classes/threads
  - b17: crash, core file generated
* Ultra80, 4 x 450 MHz, Solaris 8, 4 GB memory:
  - b17: OK

Tested on U80 4 x 450, Solaris 9, 2 GB memory:
  - Failed/hung with JDK 1.4.1_01
  - Passed with JDK 1.4.2-b07
Would like to close this out; please re-test with the latest JDK 1.4.2-b07 or greater. Awaiting your feedback... ###@###.### 2002-11-15
Will be closing this bug out by 2002-11-22. ###@###.### 2002-11-20
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###@###.### 2002-12-09 I removed irrelevant comments regarding -Xcheck:jni. The -Xcheck:jni checking mistakenly rejects a null argument in IsSameObject used by AWT.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I've run into another issue:

scharnhorst 116 => go.local
Wed Nov 27 12:58:54 EST 2002
[error occurred during error reporting]
#
# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.2-beta-b08 mixed mode)
#
# Error ID: 53484152454432554E54494D450E435050014F 01
#
# Problematic Thread: prio=4 tid=0x007361a0 nid=0x3fe runnable
#
Internal Error
Fatal: exception happened outside interpreter, nmethods and vtable stubs (1)
Do you want to debug the problem?
-----------------------
The above failure mode is the same as found in 4778176 (now closed as a duplicate of this bug) and 4674904, which is no longer reproducible.
It appears that when attempting to come to a safepoint, the function handle_illegal_instruction_exception is seeing a SEGV of its own because the ThreadCodeBuffer corresponding to the ThreadSafepointState is NULL. The ThreadCodeBuffer has been released by CompiledCodeSafepointHandler::setup because the call to reposition_and_resume_thread() failed. Evidently, though, the thread appears to get restarted in the ThreadCodeBuffer... This bug may be related to the fix for 4645393, since it was first reported after that putback. Only a hunch as of yet, though. ###@###.### 2003-01-17
---------------------
The smaller Java program t4720694 (attached) demonstrates the bug. It is not an SQE-quality test case, but the boiled-down remnants of Doug Lea's program. It runs an infinite loop, but will eventually hit an assert that indicates the problem. The assert is typically in safepoint.cpp around line 467, but due to the race-condition nature of the bug, I have seen about six different assertions fail. I believe that this is a Solaris-only bug. It may have been exposed by the fix for 4645393. For best results, use a fastdebug build. Run with +SafepointALot, CompileOnly=.take and -Xcomp. The compiler flags are necessary to ensure that C2 uses an implicit null check in the generated code for take(). If one forces C2 not to use implicit checks with -ImplicitNullChecks, the program runs quietly and, presumably, correctly (forever). ###@###.### 2003-03-21
---------------------
The race during safepointing goes something like this: The VM thread starts the safepoint synchronize procedure. The Java thread is in the midst of an implicit null pointer check; that is, the C2-generated code has SEGV'd. JVM_handle_solaris_signal has selected the stub handler_for_null_exception_entry() and reset the PC there. The thread stops. The VM thread uses get_top_frame() to query the PC of the Java thread. The query does not report the stub address, but the address of the instruction that SEGV'd.
The VM thread proceeds with moving the Java thread to a compiled safepoint, eventually calling reposition_and_resume_thread(). The Java thread is awakened by the callback and executes SetThreadPC_Callback. It can't validate the expected current PC, and resumes the thread without altering the PC. Now the race begins. If the VM thread can destroy the ThreadCodeBuffer before the Java thread gets very far, the program continues as expected. However, if the Java thread proceeds in handling the implicit null check, the shared runtime function compute_exception_return_address() will direct the thread to continue processing in the ThreadCodeBuffer right before it is destroyed. It is in this case that we fail. In debug VMs this failure exhibits itself as an assertion. In production VMs, the Java thread eventually takes a SIGILL executing in the destroyed ThreadCodeBuffer. The function handle_illegal_instruction_exception() is called, but the ThreadCodeBuffer is NULL, causing a second, fatal, signal. ###@###.### 2003-03-21
Upon further review, 4695690 is probably a duplicate of this bug. ###@###.### 2003-03-24
---------------------------------
Chuck is right. thread->safepoint_state()->_code_buffer is updated when the code buffer is first allocated. At that time, we don't know whether we can reposition the thread or not. If we fail to reposition the thread, we have to resume the thread and destroy the code buffer. Before 4645393, the thread was resumed after the VM deleted the code buffer. In order to fix 4645393, we have to resume the thread at the same time we attempt to reposition it. There is a chance that the thread is restarted before the VM deletes the thread code buffer. The thread might then see a non-null value of the thread code buffer first, but by the time it actually needs to access the buffer, the VM thread has deleted it and reset _code_buffer to NULL, causing the failures.
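The attached t4720694 is not reproduced here; purely as a hedged illustration (names and structure are guesses, not the actual attachment), a boiled-down loop of the kind described above might look like the following: a take() method whose single field dereference C2 can compile with an implicit null check, exercised while another thread intermittently nulls the field so that the SEGV-to-NullPointerException path (the path on which the safepoint race occurs) fires repeatedly.

```java
// Hypothetical reconstruction, NOT the actual attachment t4720694.
// Under -Xcomp -XX:CompileOnly=.take, C2 can compile the dereference in
// take() with an implicit null check: a null head causes a SEGV that the
// signal handler translates into a NullPointerException.
public class T4720694Sketch {
    static final class Node { int value; }

    private volatile Node head = new Node();

    // The method C2 would compile with an implicit null check.
    int take() {
        return head.value;
    }

    // Runs the loop for a bounded number of iterations (the original runs
    // forever) and returns how many NullPointerExceptions were observed.
    static long runLoop(long iterations) {
        T4720694Sketch q = new T4720694Sketch();
        Thread flipper = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                q.head = null;          // make the null check fire sometimes
                q.head = new Node();
            }
        });
        flipper.setDaemon(true);
        flipper.start();
        long npes = 0;
        for (long i = 0; i < iterations; i++) {
            try {
                q.take();
            } catch (NullPointerException e) {
                npes++;
            }
        }
        flipper.interrupt();
        return npes;
    }

    public static void main(String[] args) {
        System.out.println("NPEs observed: " + runLoop(1_000_000));
    }
}
```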
A possible fix is to update thread->safepoint_state()->_code_buffer only once we know we can successfully reposition the thread. I verified the change with t4720694b.java. It can also fix 4695690 for 1.4.1_02. Moving this bug to the runtime category and assigning it to myself. ###@###.### 2003-03-26
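HotSpot itself is C++, but the ordering change behind the proposed fix can be sketched as a Java analogy (all names here are hypothetical stand-ins, not HotSpot code): instead of publishing the buffer reference as soon as it is allocated and retracting it when repositioning fails, publish it only after repositioning is known to have succeeded, so a resumed thread can never observe a buffer that is about to be deleted.

```java
import java.util.concurrent.atomic.AtomicReference;

// Java analogy of the check-then-use hazard and its fix (names hypothetical).
public class SafepointBufferSketch {
    static final class ThreadCodeBuffer {
        final byte[] code = new byte[64];   // stands in for relocated code
    }

    // Shared slot playing the role of ThreadSafepointState::_code_buffer.
    static final AtomicReference<ThreadCodeBuffer> codeBuffer =
            new AtomicReference<>();

    // Buggy ordering: publish at allocation time, retract on failure.
    // A racing reader can see the buffer in the window before the retract.
    static void buggySetup(boolean repositionSucceeds) {
        codeBuffer.set(new ThreadCodeBuffer());   // published too early
        if (!repositionSucceeds) {
            codeBuffer.set(null);                 // "deleted"; a resumed thread
        }                                         // may still hold the stale ref
    }

    // Fixed ordering: publish only once repositioning is known to succeed,
    // so a failed reposition never makes the buffer visible at all.
    static void fixedSetup(boolean repositionSucceeds) {
        ThreadCodeBuffer buf = new ThreadCodeBuffer();
        if (repositionSucceeds) {
            codeBuffer.set(buf);                  // safe publication point
        }
        // On failure, buf is simply dropped, never visible to readers.
    }

    public static void main(String[] args) {
        fixedSetup(false);
        System.out.println("buffer visible after failed reposition: "
                + (codeBuffer.get() != null));    // prints false
    }
}
```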