JDK-4472895 : zero out the instructions when threads are currently executing causes VM crash
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 1.4.0
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_7,solaris_8
  • CPU: generic,sparc
  • Submitted: 2001-06-21
  • Updated: 2001-07-18
  • Resolved: 2001-07-18
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

Other: 1.4.0 beta2 (Fixed)
Related Reports
Duplicate :  
Relates :  
Description
VM crash bug reported by one of the CAP members. Their tests are not 
very portable; it would take a lot of work to set them up and run them. 
The error message (hs_err_pid26796.log) and core files were attached instead.
They are willing to work with someone on the HotSpot team to narrow it down
and produce a test case outside their product code if possible.

----------------------------------------------------------------------------------
J2SE Version (please include all output from java -version flag):

java version "1.4.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b65)
Java HotSpot(TM) Client VM (build 1.4.0-beta-b65, mixed mode)

Does this problem occur on J2SE 1.3?  Yes / No (pick one)

No

Operating System Configuration Information (be specific):

Solaris 7 with the latest patches.

Hardware Configuration Information (be specific):

This was run on a Sun 420R quad processor with 1 GB RAM

Bug Description:
When using more than one thread in a section of code it is possible 
(it seems) for HotSpot to zero out the instructions which one of the 
two (or more) threads is currently executing (or will execute in the 
near future).  This appears to happen after executing the code several 
hundred times, as though HotSpot is coming back and reoptimizing the 
code segment by zeroing the old instructions out and then letting another 
thread kill the JVM because the SPARC processor cannot execute the 
instruction 0x0.
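The failure mode described above (a hot method whose code is replaced after several hundred executions, while two or more threads are still running it) suggests a reproduction of roughly this shape. This is a hedged sketch only, not the customer's test: the class and method names are invented, and on a correctly functioning JVM it simply runs to completion.

```java
// Hypothetical minimal stress shape (invented names, not the customer's
// MapXtreme test): several threads hammer one method so the JIT compiles
// -- and possibly later replaces -- its code while other threads are
// still executing it.
public class HotMethodRace {
    static volatile long sink;  // defeat dead-code elimination

    // The "victim" method: hot enough to be JIT-compiled after a few
    // hundred calls, matching the crash threshold described in the report.
    static long victim(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += i * 31L;
        }
        return acc;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable hammer = () -> {
            for (int i = 0; i < 10_000; i++) {
                sink = victim(1_000);
            }
        };
        Thread t1 = new Thread(hammer);
        Thread t2 = new Thread(hammer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("completed without SIGILL, sink=" + sink);
    }
}
```

On the buggy VM described here, the crash would appear as a SIGILL in one of the two threads partway through the loop; a fixed VM runs this to completion.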


Detail problem description from customer:
+++++++++++++++++++++++++++++++++++++++++

JVM was configured with:
Tests 1 - 7: -server -Xms128m -Xmx512m
Test 8:      -server -Xms512m -Xmx512m
Test 9:      -server -Xms512m -Xmx512m

Tests:
Test 1:
Running any test against MapXtreme Java causes the JVM to crash.  
Every crash is caused specifically by Solaris delivering a SIGILL 
to one of the threads running inside of the JVM.  The JVM traps 
the SIGILL (signal 4) and prints an error report to the console, 
and then calls abort() to create a core file.  (Note that abort() 
terminates its calling process by raising SIGABRT, signal 6.)

This "test" was actually run many times with differing user loads.  
The number of virtual users ranged from 50 down to 1, always running 
in stress mode (no think time).  In all tests involving more than 10 
users, 10 users were started at test start, and the number of users 
was ramped up at a rate of 10 users per minute.

Crashes produced during this test always had the same core image 
file "appearance".  Many threads were active at once (say 50 threads), 
but only one had received the SIGILL which caused the process to shut down.
Additionally, the stack frame for that thread had 24 HotSpot created 
functions on the stack frame.  (HotSpot created functions do not 
have symbol entries in the symbol tables and as a result appear 
"??" in GDB.) 

Oddly enough, all memory around the instruction which caused the 
SIGILL (and the instruction itself) is zeroed out in the core file.  
Is this a "feature" of the core dump facility on Solaris, a bug 
in GDB or really what happened to the JVM?  (i.e. did HotSpot or 
the GC zero out memory it wasn't supposed to zero out?)  I don't 
think this is a GDB bug, as other core files created due to SIGSEGV 
seem to have legal SPARC instructions at the reported PC.

Test 2:
What effect does single-threading the server have, by reconfiguring 
Silk to only issue one request at a time per user?  If the crash 
is thread-related, we should not see a crash in the single-threaded 
case.

One virtual user was configured to issue only one request at a time, 
but with no think time between requests.

Result: 3,559 requests in 28 minutes.  No errors.

Test 3:
If test #2 holds true (single threading works), then running only 
two threads may be able to reproduce the crash.

One virtual user was configured to issue two concurrent requests 
at a time, with no think time between requests.

Result: crash around 5,000 requests after 10 minutes of load.  Crash 
appears to be the same as observed in test #1.  (SIGILL delivered 
with 24 stack frames of HotSpot created functions.)

Test 4:
According to the help for the "java" command, -Xbatch should prevent 
HotSpot from replacing the code of a method at runtime.  Since the 
crash occurs only after processing several hundred requests successfully, 
the crash must be caused by either replacement code generated by 
HotSpot in the middle of the test or a periodic background cleanup 
task firing.  Since the single-user case held, I'm leaning 
towards the former case.  I'm also thinking that perhaps HotSpot is 
changing the code for a method while another thread is attempting to 
execute it, and the crash is because the executing thread is seeing 
the machine code in the middle of the change (an unstable state).

Result: Using -Xbatch just causes the JVM to suffer from many 
NoClassDefFoundError exceptions in com.ibm.xml.parser.Token.getName.  
Very odd error.  I can only conclude that -Xbatch does not work in 
this version of the JVM.  This test was run many times with the 
same result.  The good news is that the JVM does not SIGILL when using 
-Xbatch, but I don't think it gets far enough to reach the point where 
the SIGILL would occur.

Additionally, the stack trace created from the NoClassDefFoundError 
is 34 methods deep. Unless HotSpot was able to inline 10 methods to 
reduce the stack depth, this NoClassDefFoundError cannot be the problem 
we are seeing in the other 3 tests.  On top of this, absolutely no 
request completes successfully with -Xbatch enabled, so this is really 
not of any help.

Test 5:
I split the client servlet (SimpleMapTestPlus) into its own JVM, 
separate from the MapXtreme Java server servlet.  Both ran on the same 
machine underneath the same Apache server.

The result was the same as all other tests (except test #4).  The 
server JVM process died with a SIGILL and its stack trace (as reported 
by GDB) shows 24 HotSpot functions on the call stack.

Test 6:
Paul Jossman suggested using MapXtreme Java 4.0 build 20 as the XML 
parser has been switched away from IBM's XML parser to Apache's Xerces 
parser.  This test was run with a single virtual user allowed to 
make 4 concurrent connections to MapXtreme Java.  Both client and 
server servlets were in the same JVM, and no think time was allowed 
between requests.

The JVM made it through about 4,000 requests before it died.  Its 
death yielded the same SIGILL and 24 frame stack trace as every 
other crash.

Test 7:
MapXtreme Java 4.0 build 20 was tested with -Xbatch enabled to see 
if this cleared up the NoClassDefFoundError seen in Test 4.

After 186 successful requests, the JVM died with a SIGSEGV (signal 11).  
At least with the Xerces XML parser we do not see the NoClassDefFoundError.
Examining the core file in gdb reveals that the thread which caught 
the signal was executing in a JVM internal method:

int PhaseChaitin::stretch_base_pointer_live_ranges(ResourceArea*)

Some of the classes calling this method seemed to refer to the runtime 
HotSpot compiler.  Perhaps the reason for the crash is an invalid 
pointer dereference in the HotSpot compiler itself.

Test 8:
Under the assumption that test 7's error was a result of trying to 
increase the size of the heap during runtime, I resized the initial 
heap to be the same as the maximum.  However, since the call stack 
contained references to the "Compiler" object, I doubt this is the case.

The test ran successfully for an hour, completing 18,278 image requests 
for one virtual user, no think time and 4 concurrent connections.  
12 hours after the test completed (around 4:20 am), when there was no 
load on the server, the JVM randomly SIGSEGVed.

Conclusions:
It would seem as though this particular version of the JVM has some 
errors in its HotSpot code generator (the runtime compiler).  On Solaris, 
after a period of time we see the 32-bit JVM crash with a SIGILL 
having been delivered to the process.  The SIGILL is always issued at 
the same point in our software:  24 stack frames down on the runtime 
stack.  Each of these entries is a dynamically created method, with only 
the HotSpot error trapping code on one end of the stack frame and the 
JVM thread root functions on the other.  There appear to be 9 functions 
associated with the Tomcat call stack, leaving the last 15 functions 
to be (possibly) ones from MapXtreme Java's server servlet.

It would seem as though the crash occurs after handling about 533 
requests. Typically the easiest way to cause the crash is to run 
MapXtreme Java with 1 user stress test loading all 13 images in 
the MapXtreme Java 3.1 test Shawn B. created.  In this test, Silk 
is running 4 concurrent connections per user to the server.

Perhaps the issue with multiple threads is that HotSpot has recreated 
the instructions for the method, but has somehow erred in copying 
the new instructions into the method's storage in memory.  As a result, 
some other thread calls into the new method before the new method is 
truly ready for execution, and tries to execute a partially complete 
instruction, or something which was not an instruction at all (but rather 
older data lying in memory).  Is the invalid instruction a null word or 
something silly like that?
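The "not truly ready for execution" hypothesis is, at the Java level, an unsafe-publication race. The sketch below is an analogy only (illustrative names, not HotSpot code): a buffer of fake "instructions" is published through a plain field, so under the Java memory model another thread may legally observe the array reference before the element stores become visible and read 0 -- the moral equivalent of fetching instruction 0x0 and taking a SIGILL.

```java
// Java analogy for the hypothesised JIT race (illustrative names, not
// HotSpot code): publishing a buffer of "instructions" to another thread
// without a memory barrier.
public class UnsafePublication {
    static int[] code;  // plain (non-volatile) field: no happens-before edge

    // "Compiler" thread: build the new code, then publish the reference.
    static void install() {
        int[] buf = new int[4];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = 0x1000 + i;        // pretend these are instructions
        }
        code = buf;  // unsafe publish: element stores may not be visible yet
    }

    // "Executing" thread: under the Java memory model it may see the
    // reference but still read 0 from the elements.
    static int execute() {
        int[] c = code;
        return (c == null) ? -1 : c[0];
    }

    public static void main(String[] args) {
        install();
        System.out.println(Integer.toHexString(execute()));
    }
}
```

Declaring `code` volatile (or publishing under a lock) restores the ordering; a JIT patching real instruction memory has to achieve the same effect with hardware barriers and instruction-cache flushes.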

It would seem as though load has no bearing on when this crash will 
occur, as even one user (with no think time) can bring the JVM down with 
this error.  I now have 3 core dumps showing identical stack traces from 
a thread dying with this SIGILL.  Unfortunately, since HotSpot 
generates the code on the fly, there is no symbol table associated 
with the stack frames to uncover what method of MapXtreme Java is 
causing the error.  If we can identify the current method of the thread 
that received the SIGILL signal, perhaps we can give Sun a test case 
which can reproduce the error.

With newer releases of the 1.4 JVM, we have to wonder if this 
particular HotSpot bug has been fixed, or if the bug is still present 
but other changes to HotSpot's runtime compiler will cause the number 
of stack frames seen and their alignment in memory to be shifted such 
that it doesn't appear to be the same error.

All fingers point to MapXtreme Java as the software causing HotSpot 
to generate illegal machine code, as the NullServlet test with Tomcat 
did not suffer from this runtime problem.  Since we can run several 
hundred successful requests through the server before the crash, I am 
led to believe that either HotSpot recompiles the victim method 
improperly as a performance improvement, or that the victim method 
really is just called only infrequently by MapXtreme Java (and as it 
happens is only called once every few hundred requests as a background 
cleanup process for example).

Update:

After examining most of the core files from the tests, it would appear 
as though the memory has been zeroed around the instruction which is 
causing the illegal instruction.  I examined the stacks of every 
active thread visible in the core files, and it would appear as though 
the memory was zeroed while the victim thread was sleeping.  When it 
woke up it died with a SIGILL.  It is not known when the memory zeroing 
occurred - it may have occurred while the thread was waiting for an I/O, 
system, or AWT call to complete, and then it called a HotSpot'd 
method which had been zeroed over.  Or it was preempted, and while 
waiting for control one or more methods were zeroed behind its back.

This really looks like a race condition, and the method being zeroed 
looks like it is also a MapXtreme Java server-side method.


Comments
CONVERTED DATA BugTraq+ Release Management Values
COMMIT TO FIX: merlin-beta2
FIXED IN: merlin-beta2
INTEGRATED IN: merlin-beta2
14-06-2004

EVALUATION Will fix for Merlin.
david.spott@Eng 2001-06-27

Customer is helpful; the failure is in his public beta bits. He will attach the app plus install & run instructions. Still fails with the latest engineering build, with a "cannot compile from the compiler" assert in javaCalls (-Xcomp fastdebug build); it looks like the server compiler is attempting to run Java code. Might need a large multi-headed U3 box (lightside?).
cliff.click@eng 2001-06-27

Works well with main/baseline. Fails with C2! Also, with C2, it recompiles a few methods hundreds of times. Dies on my U80.
cliff.click@eng 2001-07-10

The recompiles are caused by a known C2 issue: no MDO in -Xcomp mode. Requires that the Apache webserver be restarted each time. Re-using the webserver a second time has connect times 100x slower than before (why?), and this hides the bug. The bug has only been seen when running in -Xcomp mode, fastdebug mode, with a new webserver and no MDOs on deopt. I varied lots of things but not the 'new webserver' thing, so now that I know about that one I have to retry the debug-only-with-MDO version and see if that fails.
cliff.click@eng 2001-07-11

Got it to die in debug mode. Requires 17000+ compiles before it settles down and waits on the socket. After running "bin/hit_svr 1" it does another 3000 compiles and then asserts. The bug is in the CI's will_link(); it calls resolve_field taking the default "update_pool" to be true. Updating the pool requires resolving the klass containing the field; this in turn runs Java code from the protection domain inside the compiler thread, and we choke. Fixed by (1) calling resolve_field with update_pool set to false, and (2) checking in resolve_field for the containing klass not resolving; this in turn throws a NoSuchFieldError, which IS caught by the CI, and will_link() correctly reports false.
cliff.click@eng 2001-07-11
11-07-2001