United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-4761344 : NullPointerException just before data loss

Details
Type:
Bug
Submit Date:
2002-10-10
Status:
Resolved
Updated Date:
2003-03-19
Project Name:
JDK
Resolved Date:
2003-03-04
Component:
hotspot
OS:
solaris_8,generic
Sub-Component:
compiler
CPU:
sparc,generic
Priority:
P1
Resolution:
Fixed
Affected Versions:
1.4.1,1.4.1_01a,1.4.2
Fixed Versions:
1.4.1_03 (03)

Related Reports
Backport:
Backport:
Duplicate:
Relates:
Relates:
Relates:
Relates:
Relates:

Sub Tasks

Description
===============================================
09/10/2002 11:17:47
===============================================
java version "1.4.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-b21)
Java HotSpot(TM) 64-Bit Server VM (build 1.4.1_01-b01, mixed mode)

com.adventnet.snmp.snmp2.agent.AgentRuntimeException: Failure in opening SnmpSession, errstat: 0, errvalue: 0

	at com.adventnet.snmp.snmp2.agent.SnmpAgent.addClientAndOpenSession(SnmpAgent.java:801)
	at com.adventnet.snmp.snmp2.agent.SnmpAgent.restartSnmpAgent(SnmpAgent.java:939)
	at com.adventnet.snmp.snmp2.agent.SnmpAgent.restartSnmpAgent(SnmpAgent.java:955)
	at com.adventnet.snmp.snmp2.agent.SnmpAgent.setPort(SnmpAgent.java:1034)
	at com.xacct.xacctusage.util.xmi.snmpagent.XacctUsage.initXacctUsage(XacctUsage.java:250)
	at com.xacct.xacctusage.util.xmi.snmpagent.XacctUsage.init(XacctUsage.java:320)
	at com.xacct.xacctusage.core.Application$1.run(Application.java:897)
	at java.lang.Thread.run(Thread.java:536)
CSVLogReader.CreateMapArguments() - Exception !!!
java.lang.NullPointerException

All files mentioned in the description are in the attachments

Problem description :
------------------------------

We have CSVLogReader.class that performs CSV file line parsing . After
reading the line from the file, the CreateMapArguments(String line)
method is called in order to put each field to the buffer according to its
type. Inside this method there is a try/catch (Exception E) block, that
sometimes
catches NullPointerException. In order to understand the reason, we've
inserted E.printStackTrace() call, and we've expected to see the stack trace
for this
Exception object in the standard output. But only
"java.lang.NullPointerException" was printed. So the first question is: why
we didn't see the full stack trace ?
Using
                Throwable X = E.fillInStackTrace();
                X.printStackTrace();
gives the same results.

Then we've concluded that the only way to see  what object caused the
NullPointerException is to produce a thread dump immediately after catching
the exception.
From the program we called to a script that sent kill -QUIT signal to the
process. The results of the thread point that the source of this Exception
is inside Java classes.

Examples
---------------

Here are some examples from several thread dumps that show the threads which
called to the CreateMapArguments() method:

1.  Thread dump from gath7_stdout.log

"Thread-24" prio=5 tid=0x1053d0680 nid=0x4a runnable
[fffffffec7a00000..fffffffec7a014a0]
	at
ISM.ISM_114_1_0.CSVFieldTokenizer.fieldAsString(CSVFieldTokenizer.java:161)
	at ISM.ISM_114_1_0.CSVLogReader.CreateMapArguments(CSVLogReader.java:426)
	at ISM.ISM_114_1_0.CSVLogReader.processRecord(CSVLogReader.java:333)
	at ISM.ISM_114_1_0.LogReader.readNewLines(LogReader.java:575)
	at ISM.ISM_114_1_0.LogReader.run(LogReader.java:484)
	- locked <fffffffee8b85858> (a ISM.ISM_114_1_0.CSVLogReader)

Here is the  ISM.ISM_114_1_0.CSVFieldTokenizer.fieldAsString() method with
line 161 specified:

    public String fieldAsString(int FieldNo)
    {
        char[] arr = new char[m_FieldLen[FieldNo]]; // line 161
        System.arraycopy(m_Buffer[FieldNo], 0, arr, 0, m_FieldLen[FieldNo]);
        return (new String(arr));
    }

As you can see, no our objects were called in this line.

The same situation we can see in the gath9_stdout.log :

"Thread-15" prio=5 tid=0x1055f2180 nid=0x30 runnable
[fffffffecae00000..fffffffecae014a0]
	at
ISM.ISM_114_1_0.CSVFieldTokenizer.fieldAsString(CSVFieldTokenizer.java:161)
	at ISM.ISM_114_1_0.CSVLogReader.CreateMapArguments(CSVLogReader.java:426)
	at ISM.ISM_114_1_0.CSVLogReader.processRecord(CSVLogReader.java:333)
	at ISM.ISM_114_1_0.LogReader.readNewLines(LogReader.java:575)
	at ISM.ISM_114_1_0.LogReader.run(LogReader.java:484)
	- locked <fffffffee8b86ef8> (a ISM.ISM_114_1_0.CSVLogReader)

2.  Thread dump from gath17_stdout.log

"Thread-26" prio=5 tid=0x1002eb170 nid=0x53 runnable
[fffffffec6600000..fffffffec66014a0]
	at java.lang.Integer.intValue(Integer.java:591)
	at Util.MapArguments.putString(MapArguments.java:889)
	at ISM.ISM_114_1_0.CSVLogReader.CreateMapArguments(CSVLogReader.java:426)
	at ISM.ISM_114_1_0.CSVLogReader.processRecord(CSVLogReader.java:333)
	at ISM.ISM_114_1_0.LogReader.readNewLines(LogReader.java:575)
	at ISM.ISM_114_1_0.LogReader.run(LogReader.java:484)
	- locked <fffffffee8b85070> (a ISM.ISM_114_1_0.CSVLogReader)

Here is the  Util.MapArguments.putString() method with line 889 specified:

    public void putString(Integer Key, String Value)
    {
	putString(Key.intValue(), Value); // line 889
    }

As you can see, the Integer.intValue() method is called before Exception was
thrown.

3. Thread dump from gath20_stdout.log

"Thread-24" prio=5 tid=0x100301430 nid=0x44 runnable
[fffffffec8400000..fffffffec84014a0]
	at java.util.Calendar.complete(Calendar.java:1095)
	at java.util.Calendar.get(Calendar.java:942)
	at ISM.ISM_114_1_0.SuperParser.parseDateTime(SuperParser.java:286)
	at ISM.ISM_114_1_0.CSVLogReader.CreateMapArguments(CSVLogReader.java:476)
	at ISM.ISM_114_1_0.CSVLogReader.processRecord(CSVLogReader.java:333)
	at ISM.ISM_114_1_0.LogReader.readNewLines(LogReader.java:575)
	at ISM.ISM_114_1_0.LogReader.run(LogReader.java:484)
	- locked <fffffffee81867c8> (a ISM.ISM_114_1_0.CSVLogReader)

The relevant line in the ISM.ISM_114_1_0.SuperParser.parseDateTime() is :
        int Year = m_CalObj.get(Calendar.YEAR); // line 286

The same situation we can see in the gath24_stdout.log :

"Thread-23" prio=5 tid=0x1002fa8e0 nid=0x43 runnable
[fffffffec8500000..fffffffec85014a0]
	at java.util.Calendar.get(Calendar.java:943)
	at ISM.ISM_114_1_0.SuperParser.parseDateTime(SuperParser.java:286)
	at ISM.ISM_114_1_0.CSVLogReader.CreateMapArguments(CSVLogReader.java:476)
	at ISM.ISM_114_1_0.CSVLogReader.processRecord(CSVLogReader.java:333)
	at ISM.ISM_114_1_0.LogReader.readNewLines(LogReader.java:575)
	at ISM.ISM_114_1_0.LogReader.run(LogReader.java:484)
	- locked <fffffffee8181e08> (a ISM.ISM_114_1_0.CSVLogReader)

As you can see, the Calendar m_CalObj method is not null here,  and the
problem is in the Calendar methods.

4. Thread dump from gath28_stdout.log

"Thread-30" prio=5 tid=0x105562140 nid=0x52 runnable
[fffffffec6f00000..fffffffec6f014a0]
	at Util.MapArguments.putString(MapArguments.java:903)
	at Util.MapArguments.putString(MapArguments.java:889)
	at ISM.ISM_114_1_0.CSVLogReader.CreateMapArguments(CSVLogReader.java:426)
	at ISM.ISM_114_1_0.CSVLogReader.processRecord(CSVLogReader.java:333)
	at ISM.ISM_114_1_0.LogReader.readNewLines(LogReader.java:575)
	at ISM.ISM_114_1_0.LogReader.run(LogReader.java:484)
	- locked <fffffffee8b814a0> (a ISM.ISM_114_1_0.CSVLogReader)

The relevant line in the Util.MapArguments.putString() is :
             m_StringErr [KeyDesc.Index] = NO_ERROR; // line 903

The m_StringErr is a array of booleans.

How do you explain this situation ? What object can cause the
NullPointerException ? And why we can't see the stack trace
by using printStackTrace() call ?

Attached with this email :
1. All mentioned thread dumps (gath7_stdout.log, gath9_stdout.log,
gath17_stdout.log, gath20_stdout.log, gath24_stdout.log, gath28_stdout.log)
.
2. The CSVLogReader  source file.


                                    

Comments
EVALUATION


According to the description, the customer is looking for the stack trace
on the standard output.  The printStackTrace() method prints the stack trace
on the standard error, not standard output.  However it does appear that the
log file do capture standard error becaue the traceback from the second
call to printStackTrace() does come out (the description implies that it
doesn't.)    He should modify his program to write his extra debugging
output to System.err instead of System.out and capture stderr and stdout separately when running the program.  The combined output to the 2 output
streams  may not be intermixed in the same order as written.

His invocation of the "printstacktrace" script does not do what he wants.
The execution will run concurrently with his program, so the thread he
is interested in could have continued execution far beyond his call
to Runtime.exec().  He needs to add a call to MyProcess.waitFor()
after the call to Runtime.getRuntime().exec().  However this will
show the location in the exception handling code, not where the original
exception occurred.  His analysis of where the NullPointerException
occurred from the thread dumps is incorrect,  the thread dumps show
where the program execution had avanced to by the time the SIGQUIT
was delivered.

It does seem to be a bug that the printStackTrace() call on the original
exception object does not print the stack trace.  This does work correctly
on a small test case I have written.  Without more information on how to
reproduce the bug I cannot evaluate it any further.

###@###.### 2002-10-16
----------------------------------------

###@###.### 2002-11-19

Initially, I suspected it to be a compiler issue, but after inspecting the 2 cores at the given location, using SA, I could find very few compiled frames:  
com.xacct.xacctusage.gatherer.flow.reliable.ReliableTriggerCompFlowStep.forwardPendingData()
java.util.Calendar.clear()
Util.Pulse.xacctRun().

And the method CreateMapArguments() that is seeing the data loss, wasn't compiled in any of the cores. 
I suspect the generated interpreter code right now. To make sure it's an interpreter issue & not a compiler problem, I would suggest a run with
java -Xcomp. And if that run fails, do another run with disabling compilation of above three methods. 


###@###.### 2002-11-20

New class file was developed as suggested and tested.  In summary,
it is getting worse :-(

The class file is under /dumps/CSVLogReader.class, if needed.

There have been several crashes due to Null pointer exception.
For a sample, look in the file /t3-5/gatherer/g5/Log/Gatherer_stdout.log
towards the end of the logfile.

There was also another crash with little information.  Pertinent details
are in /t3-1/gatherer/g29/Log/Gatherer_stdout.log and its corresponding
core file is /t3-1/gatherer/g29/Gatherer/core.

There were also Hashtable hang bug and its details are in
/t3-7/gatherer/g21/Log/Gatherer_stdout.log

Using -Xcomp is NOT an acceptable workaround for several reasons.
- We do not ask a customer to run with un-documented flags in production.
- We very well know that using it is deterimetal to performance and
  that's probably the reason we don't document it.
- This application's (ISVs) main competitive feature is performance,
  the main reason for which I (MDE) am working with them now but
  in the middle of these bugs.  let me stop with this.......

The system is on SWAN.  When I left I was told that there's some
maintanance work going on with the building router but that should
not affect this system.  However, I was not able to get to my mail
in sfbay domain; but, I can access that E10K from here now.
So, the connectivity may be intermittent but please don't give up.

> Date: Wed, 20 Nov 2002 19:41:53 -0500 (GMT)
> From: ###@###.###
> 
> Synopsis: NullPointerException just before data loss
> 
> ******************************************************
> 
> Progress:
> 
> ** Nov 20 2002  5:56AM R,Anupam **
> Event: The test with -Xcomp doesn't reproduce the problem. So most probably it's the generated interpreter code, that is the culprit. I am currently looking at the genereated codelets for bytecodes. My suspecion right now is on iconst_m1 opcode, which will put a -1(0xffffffffffffffff) on the operand stack.
> Looking in CreateMapArguments() the place where this opcode is coming from is,
> 
> boolean      Found  = false; 
> 
> Can you try a run, after making Found as "int" data type and using values of 1 & 2 as false & true respectively. This will avoid iconst_m1.
> 
> But since, -Xcomp doen't reproduce the problem that can be used as workaround for the time being.
> 
> I will continue to need the machine and I haven't made any modification to the set up on E10K today, as requested.
>  

-----------------------------------------
###@###.### 2002-11-21

Looked at every word in the stack frame of failing method's frame using SA. I would expect the local reference which is getting corrupted to be present there, but didn't find any oop of type Util.MapArguments. I did see a null value in the region where other local oops were present. Knowing how interpreter frames are laid out, would help get the exact offset for this local. I think an earlier bytecode execution corrupted(null) this location, so next iload_2 loading from that location puts null on the operand stack. 

----------------------------------------


###@###.### 2002-12-10

Test doesn't fail when compilation of CrateMapArguments() is disabled and in -Xcomp mode.

Looking at TraceDeoptimization outputs, CrateMapArguments() is both compiled and deoptimized twice:
1. First one is a normal compile, and is deopt due to failed optimized checkcast. Deopt here looks proper and execution proceeds normally.
2. Second compile is an OSR compile, due to the for loop in CrateMapArguments(), and is deoptimized due to "unresolved class: Util/KDSServer". Excecution after this deopt gives a NullPointerException. 
This points to a problem in setting up of interpreted frames after deopt.   

--------------------------------------------


###@###.### 2003-01-03

looking at .class and PrintOpto, it looks solely  a OSR compile only problem and deopt may not be the issue. Although, deopt packing showed map as NULL, deopt itself happened inside the catch block, and catch was executed because map was already null. PrintOpto shows that at calls to parse*() methods in the pseudo compiled code L2=#NULL(that's the place where local#2, map, will be kept). Whereas, for normal compiles(where there is no data corruption) L2 has good values. Compiler has assumed _L2(map) to be dead in a range where it should have been alive. 

--------------------------------------------

Attached file BadNullAssert.java shows a test case for a possible root cause of this bug.

The problem is subtle:  An OSR method is compiled using a model of typestates which are
derived from flow analysis over the bytecodes.  This flow analysis is necessarily incomplete
if the interpreter has not yet exercised all paths in the code, leaving some classes still
unloaded.  In effect the CFG for the method grows over time, as the interpreter exercises
new paths and loads new classes.  Exercising a new path must invalidate pre-existing OSR
methods, since they may have compiled into them assumptions about typestates that must
change as new CFG edges come into being.

The problem can manifest (in the test case above, for example) when an unloaded class is
the subject of a checkcast operation.  The flow analysis that precedes an OSR compilation
will simply assert that any value casted to an unloaded class must be null, and the
compilation proceeds on that assumption.  Later on, if the class loads, the interpreter
can checkast non-null values successfully, leading to typestates that were previously
proved impossible.  The OSR method can be invoked on a typestate created by the interpreter
containing non-null values, and yet the OSR code will assume that they are null, based
on the out-of-date type analysis.

Solution is to have OSR methods register dependencies on unloaded classes, and get flushed
if the classes ever load.  This is not an issue with regular (non-OSR) methods, since
a normal method creates all its own typestates, from the method's entry point forward.
There is no way for the interpreter dump surprising typestates into a non-OSR method.

A binary which fixes this bug is here: /net/lukasiewicz/export/lukasiewicz1/4761344
There are product, optimized, and fastdebug flavors.  It is based on the current 1.4.2
baseline.

To gather more information, testing should use the following extra flags:
  -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=hotspot.log

The file "hotspot.log" will be written with diagnostic information.
The name can be adjusted to any pathname.

###@###.### 2003-01-17

--------------------------------------------

The libjvm.so binaries containing the fix are under this directory:
  /net/lukasiewicz.sfbay/export/lukasiewicz1/4761344

The 32-bit VM builds are:
  fastdebug/libjvm_g.so  asserts enabled, compiled optimized, for java_g
  optimized/libjvm.so    debugging options compiled in
  product/libjvm.so      product build

The corresponding 64-bit VM builds are:
  sparcv9/fastdebug/libjvm_g.so
  sparcv9/optimized/libjvm.so
  sparcv9/product/libjvm.so

All of these were built from the current 1.4.2 baseline, but they
can run inserted into the 1.4.1 JRE.

To verify installation, start a VM with -Xinternalversion, and make
sure the version string includes "jrose in mantis".

If these binaries do not fix the problem, please re-run with the
previously suggested three options including +LogCompilation.

###@###.### 2003-01-22

--------------------------------------------

After some intensive analysis, we are more confident the root problem is the compiler
erroneously deciding that a particular class is unloaded, due to the fact that it is
defined in a delegate class loader and not directly visible in the current class loader.
We believe we can fix this cleanly and reliably by using the class loader constraints
recorded by the VM, to correctly report that the class in question is in fact loaded
in the VM, even though it has not yet been fetched from the local class loader.

I am working on a VM that includes this adjustment.  Meanwhile, run with logging
turned on, and the log file set to a non-existent file, such as "/no/such/file".
This will cause the VM to choose a unique file in /tmp/, avoiding the problem
with multiple VMs overwriting a single log.

###@###.### 2003-02-12

--------------------------------------------

There is an updated VM which fixes the problems mentioned above.
Please test with it.

  /net/lukasiewicz.sfbay/export/lukasiewicz1/4761344

The 64-bit VM builds are:
  sparcv9/fastdebug/libjvm_g.so
  sparcv9/optimized/libjvm.so
  sparcv9/product/libjvm.so

###@###.### 2003-02-14


--------------------------------------------

Recent JDC comments report that there is some variation of
this bug active in 1.4.2.  The particular bug fixed under
this report had to do with the compiler getting confused
by aliasing of classes through different clas loaders,
causing it to "know" that a given class was not yet loaded
and therefore all references "must" be null--leading to
embarassing spurious NPEs.

On top of this problem, up until 1.5 the compiler optimizes
the treatment of NPEs by using a preallocated NPE object,
rather than creating it every time.  The preallocated NPE
object works great, except that it lacks a backtrace.

A backtrace-free NPE makes it hard to diagnose any NPE bug,
whether it comes from user error (usually) or a JVM bug (rarely).
We have opened a separate bug 4292742 for the missing backtrace,
and fixed it in 1.5beta2.

This backtrace problem is probably bothering people on 1.4.2 release.
It is probably making it difficult to debug application errors in
in compiled code, forcing them to run -Xint or -client to debug.
If this is true for you, I suggest transferring your JDC vote to bug
4292742, so we can consider backporting a version of the 1.5 fix.

On the other hand, NPE exceptions that are truly spurious, and are
known to be due to JVM errors, should probably be reported as new bugs.
However, if they are clearly cases of the complex syndrome described
in this bug, we would appreciate an explicit note to this effect,
so we can re-evaluate the fix to the class loading logic.

###@###.### 2004-03-26
                                     
2004-03-26
PUBLIC COMMENTS

NullPointerExceptions are thrown by compiled code, but when the source
code is inspected, it is provably impossible for a null object to be
present.

The problem is exacerbated by the behavior of bug 4292742,
which causes NPE backtraces to be lost in server-compiled methods.

This particular JVM failure has been linked to the use of multiple
application-defined class loaders and aliasing of classes among them.
This confuses the compiler and causes it to "prove" to itself, wrongly,
that a certain reference must be null, leading to the NPE.

Customers suspecting occurrences of this bug should first verify that
it is not the simpler "missing backtrace" problem of 4292742.
                                     
2004-09-17
SUGGESTED FIX

###@###.### 2003-01-13

Record additional "dependence" information for OSR compiles so that the runtime 
system will invalidate/deoptimize the OSR code when the specified class is loaded.

Problem still exists in 1.4.2 but may be less common after the fix for 4778368
since this allows most type information to flow through ciTypeFlow's checkcast.
===== =====
                                     
2004-09-17
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
1.4.1_03
mantis-beta

FIXED IN:
1.4.1_03
mantis-beta

INTEGRATED IN:
1.4.1_03
mantis-b18
mantis-beta
tiger


                                     
2004-09-17



Hardware and Software, Engineered to Work Together