JDK-6732194 : Data corruption dependent on -server/-client/-Xbatch
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 6,6u10
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: linux,solaris_10
  • CPU: x86
  • Submitted: 2008-07-31
  • Updated: 2011-03-07
  • Resolved: 2011-03-07
JDK 6: 6u11 (Fixed)
JDK 7: 7 (Fixed)
Other: hs10 (Fixed)
Description
FULL PRODUCT VERSION :
java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Server VM (build 10.0-b22, mixed mode)


ADDITIONAL OS VERSION INFORMATION :
Linux 2.6.24-19-generic (Ubuntu 8.04) i686, also reported on Debian Lenny amd64.

A DESCRIPTION OF THE PROBLEM :
Downstream bug report: https://bugzilla.wikimedia.org/show_bug.cgi?id=14610

Data corruption is observed in our application when the JVM is run with -server, but not with -client or "-server -Xbatch". I suspect it's due to background compilation in the region of com.fluendo.jheora.Decode.ExtractToken(). It's a heisenbug: attempting to instrument this function with debugging statements changed the behaviour of the function in odd ways.

Reported to be a regression from 5.0; I haven't confirmed this personally.


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
* svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/cortado
* cd cortado
* ant applet-ovt
(alternatively get a jar from http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/OggHandler/cortado-ovt-stripped-wm_r31776.jar)
* wget http://pozimski.eu/itheora/data/bd1_pp.ogg

appletviewer test file test.html (adjust the .jar version number if necessary):

<html>
 <head>
 </head>
 <body>
   <applet code="com.fluendo.player.Cortado.class"
           archive="output/dist/applet/cortado-ovt-debug-wm_r36880.jar"
	   width="384" height="288">
     <param name="url" value="bd1_pp.ogg"/>
     <param name="local" value="true"/>
     <param name="duration" value="224"/>
     <param name="keepAspect" value="true"/>
     <param name="video" value="true"/>
     <param name="audio" value="false"/>
     <param name="debug" value="1"/>
   </applet>
 </body>
</html>

Then:
* appletviewer -J-Xbatch -J-server test.html
* appletviewer -J-server test.html
* appletviewer -J-client test.html


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
A video will play. Apparently some sort of German-speaking pirate.
ACTUAL -
With -J-server, it does a frame or two before the bug kicks in. Then you get corruption of the video frame, with blocks of colour appearing, and finally the corrupted token stream causes an ArrayIndexOutOfBoundsException in the application code.

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Details may vary, depending on the exact contents of the garbage returned by ExtractToken() post-bug.

java.lang.ArrayIndexOutOfBoundsException: 66
	at com.fluendo.jheora.DCTDecode.ExpandToken(DCTDecode.java:542)
	at com.fluendo.jheora.Decode.unpackAndExpandToken(Decode.java:460)
	at com.fluendo.jheora.Decode.unPackVideo(Decode.java:603)
	at com.fluendo.jheora.Decode.loadAndDecode(Decode.java:655)
	at com.fluendo.jheora.State.decodePacketin(State.java:74)
	at com.fluendo.plugin.TheoraDec$2.chainFunc(TheoraDec.java:212)
	at com.fluendo.jst.Pad.chain(Pad.java:257)
	at com.fluendo.jst.Pad.push(Pad.java:271)
	at com.fluendo.plugin.Queue$1.taskFunc(Queue.java:135)
	at com.fluendo.jst.Pad.run(Pad.java:339)
	at java.lang.Thread.run(Thread.java:619)


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Sorry, I haven't been able to isolate this any further.
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use -Xbatch.
Customer later clarified that the -Xbatch workaround is not really sufficient, as it's quite invasive. A workaround that can be used in the HTML tag, or some other simple approach, is needed.

Comments
EVALUATION In most cases using a multidef LRG doesn't cause a problem, but the cases where it definitely would can be detected by code like this:

diff --git a/src/share/vm/opto/reg_split.cpp b/src/share/vm/opto/reg_split.cpp
--- a/src/share/vm/opto/reg_split.cpp
+++ b/src/share/vm/opto/reg_split.cpp
@@ -318,6 +318,27 @@ Node *PhaseChaitin::split_Rematerialize(
         }
 
         if (lidx < _maxlrg && lrgs(lidx).is_multidef()) {
+#ifdef ASSERT
+          int defidx = 0;
+          for( uint i = 0; i < b->_nodes.size(); i++ ) {
+            if (b->_nodes[i] == def) {
+              defidx = i;
+              break;
+            }
+          }
+          for (uint i = defidx; i < insidx; i++) {
+            if (n2lidx(b->_nodes[i]) == lidx) {
+              in->dump();
+              spill->dump();
+              b->_nodes[i]->dump();
+              b->dump();
+              C->method()->print();
+              C->set_print_assembly(true);
+              break;
+            }
+          }
+#endif
+
           // walkThru found a multidef LRG, which is unsafe to use, so
           // just keep the original def used in the clone.
           in = spill->in(i);
28-10-2008

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/ea18057223c4
19-08-2008

WORK AROUND In FrArray.java, in getNextBBit, use NextBit = (byte) ( NextBit ^ 1) instead of NextBit = (byte) ( NextBit == 1 ? 0 : 1). This avoids the pattern that needs to rematerialize the compare. It's also more efficient. The compiler can't do this for you since it doesn't know that NextBit is always either 1 or 0.
05-08-2008
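
The workaround above relies on the two toggle expressions agreeing whenever NextBit is 0 or 1. A minimal sketch of that claim follows; the class and method names are hypothetical and are not taken from FrArray.java. It also shows a value outside {0, 1} where the two forms differ, which is why the compiler cannot make the substitution on its own.

```java
// Hypothetical illustration; not the actual FrArray.getNextBBit code.
class ToggleForms {
    // Original form: a compare followed by a conditional select.
    static byte toggleTernary(byte nextBit) {
        return (byte) (nextBit == 1 ? 0 : 1);
    }

    // Workaround form: a single XOR, with no compare involved.
    static byte toggleXor(byte nextBit) {
        return (byte) (nextBit ^ 1);
    }

    public static void main(String[] args) {
        // For the only values NextBit is expected to hold, both forms agree.
        for (byte b = 0; b <= 1; b++) {
            System.out.println(b + ": ternary=" + toggleTernary(b)
                    + " xor=" + toggleXor(b));
        }
        // For any other value the forms diverge (ternary gives 1, XOR gives 3).
        byte other = 2;
        System.out.println(other + ": ternary=" + toggleTernary(other)
                + " xor=" + toggleXor(other));
    }
}
```

The XOR form is only a valid replacement because the application guarantees the 0/1 invariant; the JIT has no way to prove it.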

EVALUATION While rematerializing a compI_eReg_imm, the wrong input is hooked into the clone, so the calculation goes wrong.

Before:

 103 compI_eReg_imm === _ 107 [[ 102 ]] #1
 401 loadConI === 18 [[ 102 ]] #1
 450 loadConI0 === 18 [[ 451 102 ]] #0
 451 MachProj === 450 [[]] #1 !orig=[106]
 102 cmovI_reg === _ 103 401 450 [[ 99 255 ]] eq !jvms: FrArray::getNextBBit @ bci:21 FrArray::test2 @ bci:29

After:

 501 MachSpillCopy === _ 107 [[ 103 ]]
 103 compI_eReg_imm === _ 501 [[]] #1
 401 loadConI === 18 [[ 102 502 ]] #1
 450 loadConI0 === 18 [[ 451 102 ]] #0
 451 MachProj === 450 [[]] #1 !orig=[106]
 502 compI_eReg_imm === _ 401 [[ 102 ]] #1 !orig=103
 102 cmovI_reg === _ 502 401 450 [[ 99 255 ]] eq !jvms: FrArray::getNextBBit @ bci:21 FrArray::test2 @ bci:29

501 was produced for use by 502 because 107 is part of a multidef LRG, so it needs a new LRG. Because walkThru is true in the call to split_Rematerialize, we search to find the original 107 and find the node to use from its LRG, thus defeating the whole purpose of creating 501. This is pretty much the same bug as 6207830, though in that case I assume that walkThru was passed as false. I've got a hack fix for this, but I think that to make it work correctly split_Rematerialize needs to be substantially rearranged. I'm going to do some more tests to make sure this is the only issue.
05-08-2008

EVALUATION I've constructed a test case that shows the problem with 1.7, though the same test case doesn't fail with 1.6, so it's possible there are multiple bugs, or it could just be that it manifests differently. It appears to be a bug in the register allocator. It clones a compare instruction, and when it's fixing up the inputs of the clone it grabs the wrong input, so that instead of comparing a load with 1 it compares 1 with 1.
05-08-2008
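
The effect of comparing 1 with 1 instead of a load with 1 can be modelled in plain Java. This is only a sketch of the symptom, assuming the compare feeds a toggle of the form "value == 1 ? 0 : 1"; it is not the constructed test case mentioned above, it does not trigger the JIT bug itself, and all names in it are hypothetical.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical model of the symptom only; it does not exercise the JIT bug.
class RematerializedCompareModel {
    // Intended semantics: compare the loaded value with 1.
    static int intended(int loaded) {
        return (loaded == 1) ? 0 : 1;
    }

    // Model of the bad clone: the compare sees the constant 1 instead of
    // the loaded value, so the condition is always true.
    static int miscompiled(int loaded) {
        int wrongInput = 1; // stands in for the constant hooked into the clone
        return (wrongInput == 1) ? 0 : 1;
    }

    // Applies the toggle repeatedly and records each resulting bit.
    static String sequence(IntUnaryOperator toggle, int start, int steps) {
        StringBuilder bits = new StringBuilder();
        int bit = start;
        for (int i = 0; i < steps; i++) {
            bit = toggle.applyAsInt(bit);
            bits.append(bit);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println("intended:    "
                + sequence(RematerializedCompareModel::intended, 0, 6));
        System.out.println("miscompiled: "
                + sequence(RematerializedCompareModel::miscompiled, 0, 6));
    }
}
```

In this model the intended toggle alternates between 1 and 0, while the miscompiled one gets stuck at 0, which is the kind of silently wrong bit stream that could later surface as corrupted tokens.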

EVALUATION I can reliably reproduce this on Solaris as well, going back to at least build 49 of JDK 1.6.0. The range check failure in ExpandToken appears to be caused by the arguments passed in or some other external state, since excluding it from compilation still shows a failure. Increasing the number of compiler threads seems to encourage the failure as well. I'm currently testing with:

appletviewer -J-server -J-XX:CompileOnly=com/fluendo/jheora -J-XX:CICompilerCount=4 -J-XX:+PrintCompilation -J-XX:-PrintInlining -J-XX:CompileCommand=exclude,com/fluendo/jheora/DCTDecode.ExpandToken test.html
31-07-2008