JDK-4720694 : java apps crash on Solaris 9 Ultra-80 machine by using 1.4.1

Details
Type:
Bug
Submit Date:
2002-07-25
Status:
Closed
Updated Date:
2006-12-01
Project Name:
JDK
Resolved Date:
2006-12-01
Component:
hotspot
OS:
solaris_9,solaris_8
Sub-Component:
runtime
CPU:
sparc
Priority:
P2
Resolution:
Won't Fix
Affected Versions:
1.3.1_16,1.4.1
Fixed Versions:
1.3.1_20


Description
J2SE Version (please include all output from java -version flag):
  java version "1.4.1-rc"
  Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-rc-b16)
  Java HotSpot(TM) Server VM (build 1.4.1-rc-b16, mixed mode)

Does this problem occur on J2SE 1.3 or 1.4?  Yes / No (pick one)
  Not sure. It is apparently a threading bug, so it might be rare.

Operating System Configuration Information (be specific):  Solaris 9
Hardware Configuration Information (be specific):   4x450 Ultra-80
   Works fine on Ultra10 1-cpu machine.
Bug Description:  Crash with core file. Sometimes it just hangs at 1K classes/threads.

Steps to Reproduce (be specific):
=================================
1) unzip concurrent.zip (Get util.concurrent package from http://gee.cs.oswego.edu - also concurrent.zip attached here).
2) cd concurrent
3) mkdir classes
4) javac -d classes *.java
5) cp misc/* classes/.
6) cd classes
7) javac *.java
8) mkdir EDU/oswego/cs/dl/util/concurrent/misc
9) mv *.class EDU/oswego/cs/dl/util/concurrent/misc/.
10) mv *.java EDU/oswego/cs/dl/util/concurrent/misc/.
11) mv *.html EDU/oswego/cs/dl/util/concurrent/misc/.
12) java -server -Xmx128m EDU.oswego.cs.dl.util.concurrent.misc.SynchronizationTimer
13) The above command should launch an application window. The steps below
    reproduce the issue (also reflected in 'panel-operation.JPG').
    NOTE: the user must set PATH to a valid java executable before launching the GUI.
14) In application GUI, click "no classes".
15) Click "waitfreeQueue"
16) Set "128k calls per thread" in combo box
17) Set "1M iterations per barrier" in combo box.
18) Click "start"
19) You should get the following HotSpot error message: 

# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.1-rc-b16 mixed mode)
#
# Error ID: 53484152454432554E54494D450E435050014F 01
#
# Problematic Thread: prio=4 tid=0x5ab228 nid=0x8b4 runnable 
#

And a core file is generated.

                                    

Comments
EVALUATION

###@###.### 2002-07-25

Here are the testing results for different machine configurations:

  * Ultra10 1-cpu, Solaris 8, 1024MB Memory
    - b16: OK
    - b17: OK

  * E3500, 6x400MHz, Solaris 9, 3GB Memory:
    - b17: crash, core file generated
    - b16: crash, core file generated
    - 1.4: java.lang.OutOfMemoryError exception and hang

  * SunBlade, 2x750MHz, Solaris 8, 2GB Memory:
    - b16: Hang on 256 classes/threads
    - 1.4: hang on 256 classes/threads

  * Ultra80 4x450, Solaris 9, 4GB Memory:
    - 1.4: hang on 256 classes/threads
    - b17: crash, core file generated

  * Ultra80 4x450, Solaris 8, 4GB Memory:
    - b17: OK


Tested on U80 4x450, Solaris 9, 2GB Memory
Failed/Hung with JDK 1.4.1_01
Passed with JDK 1.4.2-b07

Would like to close this out; please re-test with the latest JDK (1.4.2-b07 or greater). 
Awaiting your feedback...
###@###.### 2002-11-15



Will be closing this bug out by 2002-11-22

###@###.### 2002-11-20 


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###@###.### 2002-12-09
I removed irrelevant comments regarding -Xcheck:jni.  The -Xcheck:jni
checking mistakenly rejects a null argument in IsSameObject used
by AWT.  
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



I've run into another issue,
scharnhorst 116 =>go.local
Wed Nov 27 12:58:54 EST 2002
[error occured during error reporting]
#
# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.2-beta-b08 mixed mode)
#
# Error ID: 53484152454432554E54494D450E435050014F 01
#
# Problematic Thread: prio=4 tid=0x007361a0 nid=0x3fe runnable 
#
Internal Error
Fatal: exception happened outside interpreter, nmethods and vtable stubs (1)

Do you want to debug the problem? 

-----------------------

The above failure mode is the same as found in 4778176 (now closed as a duplicate of this bug) and 4674904, which is no longer reproducible.

It appears that when attempting to come to a safepoint, the function handle_illegal_instruction_exception is seeing a SEGV of its own because the ThreadCodeBuffer corresponding to the ThreadSafepointState is NULL. 

The ThreadCodeBuffer has been released by CompiledCodeSafepointHandler::setup because the call to reposition_and_resume_thread() failed.  Evidently, though, the thread appears to get restarted in the ThreadCodeBuffer...

This bug may be related to the fix for 4645393, since it was first reported after that putback.  Only a hunch as of yet, though.

###@###.### 2003-01-17
---------------------

The smaller java program t4720694 (attached) demonstrates the bug.  It
is not an SQE-quality test case, but the boiled-down remnants of Doug
Lea's program.  It runs an infinite loop, but will eventually hit an
assert that indicates the problem.

The assert is typically in safepoint.cpp around line 467, but due
to the race condition nature of the bug, I have seen about 6 different
assertions fail.

I believe that this is a Solaris only bug.  It may have been exposed
with the fix for 4645393.

For best results, use a fastdebug build. Run with +SafepointALot,
CompileOnly=.take and -Xcomp. The compiler flags are necessary to
ensure that C2 uses an implicit null check in the generated code for
take(). If one forces C2 not to use implicit checks with
-ImplicitNullChecks, the program runs quietly and, presumably,
correctly (forever).
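The loop shape matters because of how C2 treats the unguarded field load. Below is a minimal, hedged sketch of that kind of hot path; the class and method names are illustrative only, not the attached t4720694 source:

```java
// Hedged sketch of the kind of loop the boiled-down test exercises.
// Names are illustrative; this is not the attached t4720694 program.
public class TakeLoopSketch {
    static final class Node {
        Object item;
        Node(Object item) { this.item = item; }
    }

    private Node head = new Node("x");

    // take() loads head.item with no explicit null test. With implicit
    // null checks enabled, C2 emits the load itself as the check: a null
    // 'head' raises SIGSEGV and the signal handler dispatches to a stub
    // that materializes the NullPointerException. A thread stopped in
    // that SEGV-to-stub window is exactly what the safepoint code must
    // handle correctly.
    Object take() {
        return head.item;
    }

    public static void main(String[] args) {
        TakeLoopSketch q = new TakeLoopSketch();
        for (int i = 0; i < 1_000_000; i++) {  // hot enough for C2 under -Xcomp
            if (q.take() == null) throw new AssertionError();
        }
        System.out.println("done");
    }
}
```

Run under -Xcomp with the flags above so C2 compiles take() with the implicit check in place.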

###@###.### 2003-03-21

---------------------

The race during safepointing goes something like this:

The VM thread starts the safepoint synchronize procedure.

The Java thread is in the midst of an implicit null pointer check.
That is, the C2 generated code has SEGV'ed. JVM_handle_solaris_signal
has selected the stub handler_for_null_exception_entry() and reset the
PC there. The thread stops.

The VM thread uses get_top_frame() to query the pc of Java thread.
The query does not report the stub address, but the address of the
instruction that SEGV'd.  The VM thread proceeds with moving the Java
thread to a compiled safepoint, eventually calling
reposition_and_resume_thread().

The Java thread is awakened by the callback and executes
SetThreadPC_Callback.  It can't validate the expected current pc, and
resumes the thread without altering the pc.

Now the race begins.

If the VM thread can destroy the ThreadCodeBuffer before the Java
thread gets very far, the program continues as expected.

However, if the Java thread proceeds in handling the implicit null
check, the shared runtime function compute_exception_return_address()
will direct the thread to continue processing in the ThreadCodeBuffer
right before it is destroyed. It is in this case that we fail.

In debug VMs this failure exhibits itself as an assertion. In
production VMs, the Java thread eventually takes a SIGILL executing in
the destroyed ThreadCodeBuffer.  The function
handle_illegal_instruction_exception() is called, but the
ThreadCodeBuffer is NULL, causing a second, fatal, signal.
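The interleaving above can be modeled deterministically. The following is a hedged, single-threaded sketch of the two outcomes; the ThreadCodeBuffer and the callbacks are reduced to plain fields, and none of the names are HotSpot source:

```java
// Hedged, single-threaded model of the race described above. The flag
// selects which side of the race "wins"; the real bug is of course
// nondeterministic. Names are illustrative, not HotSpot code.
public class SafepointRaceModel {
    private Object threadCodeBuffer;   // models ThreadSafepointState's buffer pointer

    String run(boolean vmDeletesFirst) {
        // VM thread: allocate and publish the buffer at allocation time,
        // then attempt to reposition the stopped Java thread.
        threadCodeBuffer = new Object();
        boolean repositioned = false;  // reposition_and_resume_thread() fails

        if (!repositioned) {
            if (vmDeletesFirst) {
                // VM thread destroys the buffer before the Java thread looks:
                threadCodeBuffer = null;
                return "ok";           // program continues as expected
            }
            // Java thread, already resumed, handles the implicit null check
            // and is directed to continue in the still-published buffer...
            Object target = threadCodeBuffer;  // non-null: looks usable
            // ...but the VM thread deletes it before the thread gets there:
            threadCodeBuffer = null;
            return (target != null) ? "crash" : "ok";  // uses a destroyed buffer
        }
        return "ok";
    }
}
```

In the "crash" branch the model mirrors the production failure: the thread proceeds on a pointer whose target has been destroyed, and the second fault is fatal.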


###@###.### 2003-03-21

Upon further review, 4695690 is probably a duplicate of this bug.

###@###.### 2003-03-24

---------------------------------

Chuck is right. thread->safepoint_state()->_code_buffer is updated when
the code buffer is first allocated. At that time, we don't know whether
we can reposition the thread or not. If we fail to reposition the thread,
we have to resume the thread and destroy the code buffer.

Before 4645393, the thread was resumed after the VM deleted the code buffer.
To fix 4645393, we have to resume the thread at the same time we attempt to
reposition it. There is a chance that the thread is restarted before the
VM deletes the thread code buffer. The thread might then see a non-null value
of the thread code buffer first, but by the time it actually needs to access
it, the VM thread has deleted the code buffer and reset _code_buffer to NULL,
causing the failures.

A possible fix is to update thread->safepoint_state()->_code_buffer only
when we know we can successfully reposition the thread. I verified the change
with t4720694b.java. It also fixes 4695690 for 1.4.1_02.
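The proposed ordering change can be sketched the same way: allocate the buffer, but publish it only once the reposition is known to have succeeded. This is a hedged model of the idea, not HotSpot code:

```java
// Hedged model of the proposed fix: the buffer is published to the
// shared field only on a successful reposition, so a thread resumed
// after a failed reposition can never observe a buffer that the VM
// thread is about to destroy. Names are illustrative.
public class SafepointFixModel {
    private Object threadCodeBuffer;   // models _code_buffer, initially null

    String run(boolean repositionSucceeds) {
        Object buffer = new Object();  // allocated, but not yet published

        if (!repositionSucceeds) {
            // The resumed thread reads the shared field and sees null,
            // so it takes the normal exception path, never the buffer.
            Object seen = threadCodeBuffer;
            buffer = null;             // safe to destroy: it was never published
            return (seen == null) ? "ok" : "crash";
        }
        threadCodeBuffer = buffer;     // publish only after success
        return "ok";
    }
}
```

Under this ordering the failure-path window from the race model disappears, because the shared field is never non-null while destruction is pending.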

Moving this bug to the runtime category and assigning it to myself.

###@###.### 2003-03-26
                                     
2003-03-26
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
1.4.1_07
generic
mantis-rc

FIXED IN:
1.4.1_07
mantis-rc

INTEGRATED IN:
1.4.1_07
mantis-b20
mantis-rc
tiger-b05


                                     
2004-06-14
EVALUATION

The fix resolves the reported issue (meaning, the crash no longer occurs with the fix).  However, there are still hangs in 1.3.1.  The following JVM versions were used to test the fix:

$>java -version -server -Xmx128m EDU.oswego.cs.dl.util.concurrent.misc.SynchronizationTimer &
   java version "1.5.0_07"
   Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-b02)
   Java HotSpot(TM) Client VM (build 1.5.0_07-b02, mixed mode)

$>java -version -server -Xmx128m EDU.oswego.cs.dl.util.concurrent.misc.SynchronizationTimer &
    java version "1.3.1_18-internal"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1_18-internal-wsmgr_14_mar_2006_15_00)
    Java HotSpot(TM) Client VM (build 1.3.1_18, mixed mode)

Here are the test results:
    * costume.sfbay (SunBlade 1000, 1 x 900 mhz, Solaris 9, 1 GB Memory):       
                + 1.5.0_07: finishes correctly.
                + 1.3.1_18: hangs at 256 threads.
    * tryout.sfbay (SunBlade 1000, 2 x 750 mhz, Solaris 8, 2 GB Memory):     
                + 1.5.0_07: finishes correctly.
                + 1.3.1_18: hangs at 256 threads.
    * somerset.sfbay (SunBlade 2500, 2 x 1600 mhz, Solaris 9, 2 GB Memory):
                + 1.5.0_07: finishes correctly.
                + 1.3.1_18: hangs at 256 threads.
    * producer.sfbay (E4500, 14 x 400 mhz, Solaris 8, 8GB Memory):      
                + 1.5.0_07: finishes correctly.
                + 1.3.1_18: hangs at 256 threads.
    * scoot.sfbay.sun.com (E4500, 14 x 400 mhz, Solaris 10, 4GB Memory):
                + 1.5.0_07: hangs indefinitely when it gets to the last column (i.e. 1000 threads - please see the attached image).
                + 1.3.1_18: hangs at 256 threads.
    * jcteu80x2.sfbay (Ultra80, 4 x 450 mhz, 2 GB Memory):
                + 1.5.0_07: hangs indefinitely (GUI grayed-out).
                + 1.3.1_18: sometimes hangs indefinitely (doesn't seem to be dependent on number of threads).

As I mentioned, no crash occurred on any of the above machines - prstat output is available here:
    * prstat info for Tiger: 
          o /net/nightsvr/export3/jpse/regress.library/4720694B/tests/test_results/1.5.0u7
    * prstat info for 1.3.1:
          o /net/nightsvr/export3/jpse/regress.library/4720694B/tests/test_results/1.3.1
                                     
2006-04-11
EVALUATION

To clarify closing the bug: the intent is to re-run the test case against
current builds and open explicit CRs (bugs) against any failures, as this
bug is misleading (there are delivered fixes against 1.4.2 and 5.0 for the
crash) while the explicit bug is against 1.4.1, and hangs have been reported
in some testing after the above fixes. Closing in regard to the 1.3.1
aspect only.
                                     
2006-12-01


