JDK-8038348 : Instance field load is replaced by wrong data Phi
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 7u25,8,9
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2014-03-25
  • Updated: 2019-01-28
  • Resolved: 2016-08-23
Fix Versions:
  • JDK 7: 7u171 (Fixed)
  • JDK 8: 8u121 (Fixed)
  • JDK 9: 9 b137 (Fixed)
Description
In C2, instance field loads may be replaced by a corresponding, already existing data Phi in LoadNode::Identity():
  // Search for an existing data phi which was generated before for the same
  // instance's field to avoid infinite generation of phis in a loop.

The problem is that we only check whether the Phi belongs to the same region and corresponds to the same instance field. We do not check whether it contains the actual "up to date" value from the last store to the load's input memory:
  if (phi->is_Phi() && phi != mem &&
      phi->as_Phi()->is_same_inst_field(this_type, this_iid, this_index, this_offset)) {
    return phi;
  }

This problem was reported by the Apache Lucene team because it causes "impossible" assertion failures in one of their tests:
https://issues.apache.org/jira/browse/LUCENE-5168

The problem is still reproducible with JDK 7, 8 and 9.
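
For illustration, the affected pattern has roughly the following shape (a hypothetical Java sketch, not the Lucene code and not a working reproducer; triggering the bug depends on inlining, escape analysis, and the order in which nodes are processed):

  // Hypothetical sketch (not from the Lucene sources): a non-escaping object
  // whose field is loaded, incremented, and stored back by a method that is
  // inlined into a hot loop. With the bug, a later load of 'upto' can be
  // replaced by a data Phi holding a stale (pre-increment) value.
  public class StaleFieldValueSketch {
      static class Reader {
          int upto;
          final byte[] buffer;

          Reader(byte[] buffer) { this.buffer = buffer; }

          byte readByte() {
              return buffer[upto++];   // field load of 'upto', then store of 'upto + 1'
          }
      }

      static int sum(byte[] data) {
          Reader r = new Reader(data); // non-escaping: eligible for scalar replacement
          int sum = 0;
          for (int i = 0; i < data.length; i++) {
              sum += r.readByte();     // inlined: must observe the latest 'upto'
          }
          return sum;
      }

      public static void main(String[] args) {
          byte[] data = new byte[1000];
          java.util.Arrays.fill(data, (byte) 1);
          long total = 0;
          for (int i = 0; i < 20_000; i++) {   // warm up so C2 compiles sum()
              total += sum(data);
          }
          System.out.println(total);
      }
  }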
Comments
The approach of webrev.00 is quite complicated and error-prone. I tried to record more information during Phi creation. Instead of using a CloneMap, I added an _inst_mem_id field to PhiNode to store the node index of the memory Phi node through which we split the LoadNode. This is similar to _inst_id, which records the node index of the allocation node. In LoadNode::Identity() we can now pass the index of the memory input to PhiNode::is_same_inst_field() to check whether the data Phi represents the instance field value corresponding to the same memory state. This solves the case where we accidentally select a data Phi that represents an old value of the same instance field.

Since we store the node index of a memory Phi, we need to be careful if this node is replaced or removed. To take care of replacement, I added code to PhaseIterGVN::subsume_node() to update the _inst_mem_id of "linked" data Phi nodes to the index of the new node. We could also add such code to Node::clone(), but then we would have to maintain a list of ids, and it is not guaranteed that the cloned Phi remains "valid" because the node may be changed and used in a different context. If a referenced memory Phi is removed, it is only a problem if its node index is reused by another memory Phi, because then data Phis may wrongly point to the new node. Currently, this is not a problem because PhaseRenumberLive is executed before EA. I added an assert for sanity.

New webrev: http://cr.openjdk.java.net/~thartmann/8038348/webrev.01/
19-08-2016

I added additional checks to verify that we use the correct value Phi (if any) to replace the LoadNode. This solves the bug with JDK 7, 8 and 9. I was not able to write a regression test because the failure is hard to reproduce and depends on the order in which nodes are processed. Prototype fix: http://cr.openjdk.java.net/~thartmann/8038348/webrev.00/
28-07-2016

It seems that the problem is that a LoadNode is replaced by a Phi node in LoadNode::Identity(), see:

  // Search for an existing data phi which was generated before for the same
  // instance's field to avoid infinite generation of phis in a loop.

The data Phi returns true for is_same_inst_field(), i.e., it contains a value of that same instance field, but the value is "old". There are multiple such data Phis for the instance field attached to the input Region node, but only one contains the value that corresponds to the memory Phi. Here's the interesting part of the graph:

  19995 Region ... [20004, 20007, 20010]
  20004 Phi 19995 19984 19985 [...]
  20007 Phi 19995 19988 19987 [20161]
  20010 Phi 19995 10758 19096
  10758 AndI ... [20010 19984]
  19096 Phi ... [20010 19985]
  19984 AddI 10758 ... [19988 20004]
  19985 AddI 19096 ... [19987 20004]
  19988 StoreI 19984 ... [20007]
  19987 StoreI 19985 ... [20007]
  20161 LoadI 20007 [...]

The 20161 LoadI is replaced by the 20010 Phi but should be replaced by the 20004 Phi. We have a Region connected to 3 Phis. One is a memory Phi (20007), merging the memory state of the two stores (19988, 19987); the other two Phis are data Phis representing the value of the instance field that we store to. One data Phi (20004) represents the value that is stored and the other data Phi (20010) represents an "old" value. We replace the instance field load with the wrong data Phi.
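
In Java terms, the shape of this graph is roughly the following (a hypothetical analogy, not the actual Lucene code; 'a' and 'b' stand in for the two "old" field values merged by the 20010 Phi, and the '+ 1' for the AddI nodes):

  // Hypothetical analogy of the graph shape above (not the Lucene sources).
  class PhiAnalogy {
      int f;

      int merge(boolean cond, int a, int b) {
          if (cond) {
              f = a + 1;   // one store of a new value (like StoreI 19988 of AddI 19984)
          } else {
              f = b + 1;   // the other store (like StoreI 19987 of AddI 19985)
          }
          // The load here (LoadI 20161) must be replaced by the data Phi of the
          // stored values, Phi(a + 1, b + 1) (20004), which matches the memory
          // Phi merging the two stores (20007) -- not by the data Phi of the
          // stale values, Phi(a, b) (20010), even though both describe the same
          // instance field.
          return f;
      }

      public static void main(String[] args) {
          System.out.println(new PhiAnalogy().merge(true, 1, 2));    // prints 2
          System.out.println(new PhiAnalogy().merge(false, 1, 2));   // prints 3
      }
  }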
22-07-2016

The problem is in the C2-compiled method 'org.apache.lucene.index.FreqProxTermsWriterPerField.flush'. An object of type 'ByteSliceReader' is created and the 'upto' field is updated (incremented) by inlined methods in a loop. Escape analysis determined that the object is non-escaping and the allocation is removed. We crash because 'upto' contains an incorrect value due to a missing increment. I looked at the generated code and it seems that a branch fails to write back the result of incrementing 'upto':

  0xf2b32396: mov 0xa4(%esp),%eax
  0xf2b3239d: inc %eax                          <-- the result of this increment is not written back to 0xa4(%esp)
  0xf2b3239e: mov 0xc0(%esp),%ebx
  0xf2b323a5: mov 0x8(%ebx),%ecx
  0xf2b323a8: mov 0xa4(%esp),%ebx
  0xf2b323af: cmp %ecx,%ebx
  0xf2b323b1: jae 0xf2b37f71
  0xf2b323b7: vmovd %edi,%xmm2
  0xf2b323bb: mov 0xc0(%esp),%ebp
  0xf2b323c2: movsbl 0xc(%ebp,%ebx,1),%ebx      ;*putfield field {reexecute=0 rethrow=0 return_oop=0}

As a result, subsequent iterations of the loop use the old value of 'upto'. I tried to trace this back in the C2 IR but the graph contains thousands of nodes and I wasn't able to find the root cause yet. It seems that there is a missing data dependency between writing the 'upto' field in one loop iteration and reading the field in the next iteration.

As reported by the Lucene team, the problem first appeared with JDK-7112478. I verified this and bisected the fix. The bug is triggered by the following change:

  --- a/src/share/vm/opto/escape.hpp  Tue Nov 22 09:45:57 2011 +0100
  +++ b/src/share/vm/opto/escape.hpp  Mon Nov 28 15:46:31 2011 -0800
  @@ -333,10 +339,9 @@
     }

     // Notify optimizer that a node has been modified
  -  // Node: This assumes that escape analysis is run before
  -  // PhaseIterGVN creation
     void record_for_optimizer(Node *n) {
       _igvn->_worklist.push(n);
  +    _igvn->add_users_to_worklist(n);
     }

Without the add_users_to_worklist(n) call, the problem does not show up with revision 2883 (JDK-7112478). However, removing the line from the latest JDK 9 version does not help (probably the nodes are added to the worklist through other paths).
19-07-2016

I'm able to reproduce this in 1/15 runs with a recent 32-bit JDK 9 build on Linux x86_64. Steps to reproduce:

  export JAVA_HOME=[...]
  svn co https://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/
  cd branch_5x
  svn update -r 1512729
  cd lucene/highlighter
  ant test -Dtests.failfast=true -Dtests.seed=CF -Dtests.jvms=1 -Dtests.dynamicAssignmentRatio=0 -Dargs="-server -XX:+UseG1GC -XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1"
04-07-2016

ILW = incorrect behaviour; C2 w/ G1GC; workaround: -XX:-DoEscapeAnalysis = HMM => P2
10-06-2015

Dawid Weiss wrote:

Tested with -XX:-OptimizePtrCompare, here are the results:

> changeset: 2874:8c57262447d3

ant test -Dtests.failfast=true -Dtests.seed=CF -Dtests.jvms=1 \
  -Dtests.dynamicAssignmentRatio=0 -Dargs="-server -XX:+UseG1GC \
  -XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 \
  -XX:-OptimizePtrCompare"

=> PASSED

advanced to one-before-the-problem:

> changeset: 2882:db2e64ca2d5a

ant test -Dtests.failfast=true -Dtests.seed=CF -Dtests.jvms=1 \
  -Dtests.dynamicAssignmentRatio=0 -Dargs="-server -XX:+UseG1GC \
  -XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 \
  -XX:-OptimizePtrCompare"

=> PASSED

but the next one again causes the bug.

> changeset: 2883:cc81b9c09bbb

ant test -Dtests.failfast=true -Dtests.seed=CF -Dtests.jvms=1 \
  -Dtests.dynamicAssignmentRatio=0 -Dargs="-server -XX:+UseG1GC \
  -XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 \
  -XX:-OptimizePtrCompare"

=> ASSERTION HIT

I'll send you access details to that vbox on prv.

Dawid

On Wed, Sep 11, 2013 at 7:52 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
> Now that I know which change triggers the problem I can do some more
> digging. For the record -- JDK8 also exhibits the same problem, but it
> only surfaces in 32-bit JVM (and the same thing applies to JDK7 --
> only 32 bit mode is affected). Perhaps the patch in question only
> helps another issue to become visible by changing the opto graph, I
> don't know.
>
> I'll do some more testing around the code fragments you indicated and
> will let you know.
>
> Dawid
>
>
> On Wed, Sep 11, 2013 at 3:46 AM, Vladimir Kozlov
> <vladimir.kozlov@oracle.com> wrote:
>> On 9/10/13 3:26 AM, Dawid Weiss wrote:
>>>
>>> No need to be sorry, David -- a hint too many is better than no hint at
>>> all
>>>
>>> Anyway. I did the backtracking -- it required some time because I had
>>> to set up an environment from two years back (old ubuntu in a virtual
>>> machine). I did some binary searching and probing and narrowed the
>>> problem to this commit:
>>>
>>> changeset: 2883:cc81b9c09bbb
>>> user: kvn
>>> date: Mon Nov 28 15:46:31 2011 -0800
>>> summary: 7112478: after 7105605 JRuby
>>> bench_define_method_methods.rb fails with NPE
>>>
>>> (Since 'kvn' seems to be Vladimir I believe I was bothering the right
>>> person all the time
>>
>>
>> Yes, you caught me
>>
>> What worries me is that you tested with -XX:-OptimizePtrCompare and still
>> got the problem.
>> Can you test 2874(7105605) changeset with -XX:-OptimizePtrCompare to see if
>> your problem was introduced by it but was hidden by NPE problem?
>>
>> And there was 6890673 changes in between. Which could also introduce the
>> problem. Test it is also with -XX:-OptimizePtrCompare if testing of 7105605
>> shows nothing.
>>
>> There was several followup fixes after 7112478 changes but all of them are
>> present in 7u40 (for example, 7129284 and 7146442). So I don't see
>> outstanding changes in escape.cpp in jdk8 which helped to fix the problem.
>> But it could be fix in an other file. You said before that you don't have
>> problem with jdk8. If you have spare testing cycles can you find which
>> changes fixed it?
>>
>> Next step will be bisecting 7112478 changes.
>>
>> Thanks,
>> Vladimir
>>
>>>
>>> After this commit the problem with assertion-tripping in Lucene starts
>>> to consistently show up (you already know it's not the assertion
>>> itself but the underlying issue of missed update causing it). The
>>> commits before 2883 -- everything after 2874 (7105605) all end up in a
>>> null pointer, probably as mentioned in the above commit summary. I
>>> have a strong feeling the fix introduced in 2883 fixed the NPE but
>>> introduced a regression somewhere else because 2873 works just fine.
>>>
>>> I looked at the patch but it goes too deep for me to understand given
>>> the limited time I have for this pet project. If there's anything else
>>> I can help you with, Vladimir, ask. I can provide an image of ubuntu
>>> 9.04 (virtualbox) which reproduces the issue every time I run the test
>>> suite (couldn't extract a smaller, isolated test case though).
>>>
>>> Dawid
>>>
>>> On Mon, Sep 9, 2013 at 7:08 PM, David Chase <david.r.chase@oracle.com>
>>> wrote:
>>>>
>>>>
>>>> On 2013-09-07, at 3:44 PM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>>>>
>>>>> David: thanks for the tip. I know about bisect, it's also present in
>>>>> git. ...
>>>>
>>>>
>>>> Oh, sorry, I just assumed you were as ignorant of hg as I am of git.
>>>> I know that one of the gripes about our why-have-just-one-tree repository
>>>> design is that it gets in the way of exactly what you want to do.
>>>>
>>>> David
>>>>
>>
25-03-2014

Dawid Weiss wrote: My quest continues I thought it'd be interesting to see how far back I can trace this issue. I fetched the official binaries for jdk17 (windows, 32-bit) and did a binary search with the failing Lucene test command. The results show that, in short: ... jdk1.7.0_03: PASSES jdk1.7.0_04: FAILS ... and are consistent before and after. jdk1.7.0_04, 64-bit does *NOT* exhibit the issue (and neither does any version afterwards, it only happens on 32-bit; perhaps it's because of smaller number of available registers and the need to spill?). jdk1.7.0_04 was when G1GC was "officially" made supported but I don't think this plays a big difference. I'll see if I can bsearch on mercurial revisions to see which particular revision introduced the problem. Anyway, the problem has to be a long-standing issue and not a regression. Which makes it even more interesting I guess. Dawid On Mon, Sep 2, 2013 at 11:31 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote: > Oh, I forgot to mention - that incremented value cannot be saved to > [esp + #136] immediately because there's a value read based on the > unincremented value (buffer[upto++]). This is all fine, but the > problem is that the incremented spill slot [esp + #144] is never > copied to that other spill slot that is read from the other inline. > > Again -- I have no clue what makes this particular code path so > special but I can see it in the debugger clearly. > > Dawid > > On Mon, Sep 2, 2013 at 11:17 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote: >> I've traced the assembly on a running instance. The problem is quite >> interesting -- the code in question is purely computational, I still >> don't see how G1GC can be of importance here. In short what's >> happening is a method gets inlined in a larger loop more than once and >> a field access is placed on the esp spill. The problem is both of >> these inlines use a different spill slot and one of the updates is not >> properly propagated -- it is saved to one spill slot, then the code >> execution flow returns to the top of the loop and another inline reads >> a stale value from another spill slot. When you disable escape >> analysis everything works as expected. >> >> The crucial block is: >> >> 7e25 B1020: # B405 B1021 <- B1017 B1019 Freq: 0.00811184 // XXX >> jump from b1017 >> 7e25 MOV EBX,[ESP + #136] // ebx = 80e >> 7e2c MOV ECX,EBX >> 7e2e INC ECX // ecx = 80f, increment. >> 7e2f MOV [ESP + #144],ECX // this is the >> incremented spill but it never gets propagated to [esp + #136] or >> re-read from memory. >> 7e36 MOV EBX,[ESP + #132] >> 7e3d MOV [EBX + #12],ECX ! Field: >> org/apache/lucene/index/ByteSliceReader.upto >> 7e40 MOV EBX,[ESP + #152] >> 7e47 TEST EBX,EBX >> 7e49 Jne B405 P=0.999999 C=-1.000000 // taken, null check >> >> >> The incremented local value of upto gets saved to [ESP + #144] and to >> memory but upon reiterating the loop it is re-read from [ESP + #136] >> so it looks like a dependency is missed somewhere in the graph causing >> escape analysis to fail. >> >> Ehm, obviously I have no idea how or where to fix this in the JVM code >> >> Full optoassembly dump and annotated opto assembly analysis is here: >> http://www.cs.put.poznan.pl/dweiss/tmp/hotspot-bug-analysis.zip >> >> Let me know if I can do anything else. >> >> Dawid >> >> On Tue, Aug 27, 2013 at 9:15 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote: >>> Thanks Vladimir. I had the impression I may be looking at the >>> interpreter. 
Just for completeness, I also thought/tried the following >>> ways to stop at the jitted code: >>> >>> 1) call Windows debugger API endpoints from kernel32 -- >>> >>> http://msdn.microsoft.com/en-us/library/windows/desktop/ms679297(v=vs.85).aspx >>> >>> via JNI/ JNA. This worked surprisingly well -- you can even wait for >>> the debugger to attached to the code in a loop, works like a charm. >>> Although once you break into the debugger I again couldn't make sense >>> of the stack -- probably the infrequent call causes a deopt again and >>> we're back in the interpreter. >>> >>> 2) use Unsafe to force a null pointer dereference This should send >>> a debugger signal, right? I didn't try it though, didn't have the time >>> yet. I suspect it may be moved to an infrequent trap too and we're >>> back where we started, eh. >>> >>>> I still did not have time to look on your logs. >>> >>> No worries, it's been a very very educational experience for me. >>> Besides, I used to write tons of assembly (not on x86) and it actually >>> feels very good to be back at the assembly mnemonic level >>> >>> Don't spoil my weekend pet project And seriously -- if you want to >>> check out those logs let me know in advance and I'll prepare an update >>> with hsx-tip if needed. >>> >>> Dawid >>> >>> On Tue, Aug 27, 2013 at 3:07 AM, Vladimir Kozlov >>> <vladimir.kozlov@oracle.com> wrote: >>>> Dawid, >>>> >>>> Let it crush with AbortVMOnException (without ShowMessageBoxOnError) and >>>> look on call stack in hs_err_pidxxx.log file. Jited code usually have >>>> uncommon trap for exceptions so the method most likely was deoptimized and >>>> you will see interpeter frames on call stack instead of compiled/jited. >>>> There should be deoptimization event listed down in hs_err_pidxxx.log which >>>> may say in which address in jited code it hit uncommon trap. You can look on >>>> it in optoassembly output. Then during method compilation >>>> (-XX:+PrintCompilation -Xbatch -XX:CICompilerCount=1 should give you info >>>> when the method is compiled, you can verify it in Compile() by >>>> this->method()->print()) in debugger stop in ciEnv::register_method() after >>>> call to nmethod::new_nmethod(). nm->insts_begin() shows where instructions >>>> start. Add instruction offset from optoassembly and set breakepoint at that >>>> address. This way you can stop in jited code. >>>> >>>> G1 affects inlining since its write barriers code size is larger then other >>>> collectors (it has pre and post barriers for oop stores). >>>> >>>> I still did not have time to look on your logs. >>>> >>>> Vladimir >>>> >>>> >>>> On 8/26/13 12:56 PM, Dawid Weiss wrote: >>>>>> >>>>>> An ideal scenario and what I want to achieve is I'd like to stop on an >>>>>> assertion somewhere in the jitted code and go back to it to follow the >>>>>> execution (so that I can see what code is responsible for the issing >>>>> >>>>> increment). I'll need to sync this up (manually or otherwise) with the >>>>> optoassembly too [...] >>>>> >>>>> Would somebody be so kind and provide some light on whether the above >>>>> is feasible? I'm getting up to speed with visual studio -- haven't >>>>> used assembly-level debugging in a longer while -- but I'm having >>>>> troubles connecting the dots with verbose optoassembly output (the >>>>> addresses don't seem to match). Thanks and again, apologies if it's >>>>> something trivial but I couldn't find any helpful resources on the >>>>> net. 
>>>>> >>>>> Dawid >>>>> >>>>> On Sun, Aug 25, 2013 at 9:43 PM, Dawid Weiss <dawid.weiss@gmail.com> >>>>> wrote: >>>>>> >>>>>> I know why the assertion is hit. I still don't know (and probably >>>>>> won't figure it out) why it is so difficult to reproduce or which >>>>>> optimization leads to it. >>>>>> >>>>>> In short, the following code in ByteSliceReader#readByte: >>>>>> >>>>>> public byte readByte() { >>>>>> assert !eof(); >>>>>> assert upto <= limit; >>>>>> if (upto == limit) { >>>>>> nextSlice(); >>>>>> } >>>>>> return buffer[upto++]; >>>>>> } >>>>>> >>>>>> is inlined in a number of places. If I include: >>>>>> >>>>>> public byte readByte() { >>>>>> assert !eof(); >>>>>> assert upto <= limit; >>>>>> if (upto == limit) { >>>>>> nextSlice(); >>>>>> baz = upto; >>>>>> } >>>>>> return buffer[upto++]; >>>>>> } >>>>>> static int baz; >>>>>> >>>>>> I see a number of assignments to 'baz'. This is because methods like >>>>>> readVInt fork out many conditional branches, as in: >>>>>> >>>>>> public int readVInt() throws IOException { >>>>>> byte b = readByte(); >>>>>> if (b >= 0) return b; >>>>>> int i = b & 0x7F; >>>>>> b = readByte(); >>>>>> i |= (b & 0x7F) << 7; >>>>>> if (b >= 0) return i; >>>>>> b = readByte(); >>>>>> i |= (b & 0x7F) << 14; >>>>>> if (b >= 0) return i; >>>>>> b = readByte(); >>>>>> i |= (b & 0x7F) << 21; >>>>>> if (b >= 0) return i; >>>>>> b = readByte(); >>>>>> // Warning: the next ands use 0x0F / 0xF0 - beware copy/paste >>>>>> errors: >>>>>> i |= (b & 0x0F) << 28; >>>>>> if ((b & 0xF0) == 0) return i; >>>>>> throw new IOException("Invalid vInt detected (too many bits)"); >>>>>> } >>>>>> >>>>>> One of these executions (when upto == limit, and possibly with some >>>>>> other upper condition) leads to missing increment on buffer[upto++] -- >>>>>> upto remains unmodified (but the value at buffer[upto] is returned >>>>>> properly). Because upto is moved to a register in many of those >>>>>> inlined methods I suspect either the register may be incremented and >>>>>> not saved or the increment may be removed due to some local escape >>>>>> analysis magic (that has to be incorrect, obviously). How G1GC >>>>>> interacts with all this (remember, it only happens with G1GC) I have >>>>>> no clue. >>>>>> >>>>>> Anyway, I've tried looking at -XX:+PrintEscapeAnalysis but I couldn't >>>>>> make much sense of it. I also dug into some of the opto assembly >>>>>> branches trying to map them back into the original bytecode, but it's >>>>>> a very long process -- had to map PcDesc manually to the assembly >>>>>> offset, then go to bytecode again. Then I fired visual studio and >>>>>> halted it like Vladimir mentioned: >>>>>> -XX:AbortVMOnException=java.lang.AssertionError >>>>>> -XX:+ShowMessageBoxOnError >>>>>> >>>>>> Once I attach to the halted JVM I "break all" and I see the reported >>>>>> thread's stack trace halted inside ntdll (with the closest JVM frame >>>>>> stopped at get_deopt_original_pc). The frames below it don't seem to >>>>>> make much sense to me -- they don't seem to point to any addresses >>>>>> reported in the compilation log (PcDesc clauses). I'm a complete >>>>>> newbie to this, so I apologize if it's a lame question, but how can I >>>>>> find out which part of the original jitted code this assertion was >>>>>> invoked from? 
>>>>>> >>>>>> An ideal scenario and what I want to achieve is I'd like to stop on an >>>>>> assertion somewhere in the jitted code and go back to it to follow the >>>>>> execution (so that I can see what code is responsible for the missing >>>>>> increment). I'll need to sync this up (manually or otherwise) with the >>>>>> optoassembly too, otherwise I'll probably lose any remaining hair >>>>>> trying to figure out what's what >>>>>> >>>>>> A hint (or any other pointer; RTFM) would be very welcome. >>>>>> >>>>>> Dawid >>>>>> >>>>>> >>>>>> On Tue, Aug 20, 2013 at 11:32 AM, Dawid Weiss <dawid.weiss@gmail.com> >>>>>> wrote: >>>>>>> >>>>>>> One follow-up with respect to those encoded integers -- I swapped the >>>>>>> correct and incorrect sequence in my last e-mail, sorry. I then also >>>>>>> added a sysout here: >>>>>>> >>>>>>> @@ -451,6 +453,7 @@ >>>>>>> termFreq = 1; >>>>>>> } else { >>>>>>> termFreq = freq.readVInt(); >>>>>>> + System.err.print("# (tf): " + termFreq + " "); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> With this the assertion is no longer hit but it shows where the >>>>>>> difference in numbers comes from: >>>>>>> >>>>>>> Correct sequence is: >>>>>>> >>>>>>> # (eof) >>>>>>> # (code): 0 >>>>>>> # (tf): 3 >>>>>>> # (code): 4 >>>>>>> # (tf): 3 >>>>>>> # (code): 2 >>>>>>> # (tf): 5 >>>>>>> # (code): 2 >>>>>>> # (tf): 2 >>>>>>> # (code): 3 >>>>>>> # (code): 5 >>>>>>> # (code): 2 >>>>>>> >>>>>>> And the incorrect, assertion-hit run shows: >>>>>>> >>>>>>> # (eof) >>>>>>> # (code): 0 >>>>>>> # (code): 3 >>>>>>> # (code): 4 >>>>>>> # (code): 3 >>>>>>> # (code): 2 >>>>>>> # (code): 5 >>>>>>> # (code): 2 >>>>>>> # (code): 2 >>>>>>> # (code): 3 >>>>>>> >>>>>>> So there's something odd happening after the first readvint which >>>>>>> returns 0 -- the next number that should be read by: >>>>>>> >>>>>>> termFreq = freq.readVInt(); >>>>>>> >>>>>>> is re-read as the next code. >>>>>>> >>>>>>> Dawid >>>>>>> >>>>>>> >>>>>>> On Tue, Aug 20, 2013 at 11:11 AM, Dawid Weiss <dawid.weiss@gmail.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> Thanks for comments guys. >>>>>>>> >>>>>>>>> It is great that you can build fastdebug VM. I assume that if I give >>>>>>>>> you a >>>>>>>>> patch to try you can also build with it. >>>>>>>> >>>>>>>> >>>>>>>> Yes, absolutely. >>>>>>>> >>>>>>>>> First, you can run with next options and then send zipped output >>>>>>>>> (related >>>>>>>>> part before the method compilation and optoassembler output, don't use >>>>>>>>> hsdis >>>>>>>>> for that) and hs_err file when test fails: >>>>>>>> >>>>>>>> >>>>>>>> I did that. Used the options you requested -- a full log of all I did >>>>>>>> is included in the ZIP file here: >>>>>>>> http://www.cs.put.poznan.pl/dweiss/tmp/debug-files.zip >>>>>>>> >>>>>>>> Cat debug-files\debug-log.txt. There are several runs included in that >>>>>>>> ZIP file, in short: >>>>>>>> >>>>>>>> - jdk18-fastdebug, b102, full run (no abort on assertion, explained in >>>>>>>> debug-log.txt): >>>>>>>> 001-no-abort-on-assertion >>>>>>>> 002-no-abort-on-assertion >>>>>>>> >>>>>>>> - jdk18-fastdebug, b102, abort on assertion (includes mem dumps) >>>>>>>> 003-with-abort-on-assertion >>>>>>>> 004-with-abort-on-assertion >>>>>>>> >>>>>>>> - jdk18-fastdebug, hsx tip, abort on assertion (includes mem dumps) >>>>>>>> 005-hsx-tip >>>>>>>> 006-hsx-tip >>>>>>>> >>>>>>>> - jdk18-fastdebug, hsx tip, excluded readvint method inlining (passed >>>>>>>> build, so only compilation logs available). 
>>>>>>>> 007-excluded-readvint >>>>>>>> >>>>>>>>> Looking on java code in FreqProxTermsWriterPerField::flush() and based >>>>>>>>> on >>>>>>>>> LUCENE-5168 report generated code somehow missed EOF check. Am I >>>>>>>>> right? This >>>>>>>>> is why the Assert is thrown? >>>>>>>> >>>>>>>> >>>>>>>> It's not the only method it trips on. Depending on the seed the >>>>>>>> problem shows up in different places. For the dumps I included the >>>>>>>> issue seems to be here: >>>>>>>> >>>>>>>>> } else { >>>>>>>>> final int code = freq.readVInt(); >>>>>>>>> if (!readTermFreq) { >>>>>>>>> docID += code; >>>>>>>>> termFreq = -1; >>>>>>>> >>>>>>>> >>>>>>>> If I add sysouts as in: >>>>>>>> >>>>>>>> Index: >>>>>>>> core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java >>>>>>>> =================================================================== >>>>>>>> --- >>>>>>>> core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java >>>>>>>> (revision 1512807) >>>>>>>> +++ >>>>>>>> core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java >>>>>>>> (working copy) >>>>>>>> @@ -427,6 +427,7 @@ >>>>>>>> //System.out.println(" cycle"); >>>>>>>> final int termFreq; >>>>>>>> if (freq.eof()) { >>>>>>>> + System.err.print("# (eof)"); >>>>>>>> if (postings.lastDocCodes[termID] != -1) { >>>>>>>> // Return last doc >>>>>>>> docID = postings.lastDocIDs[termID]; >>>>>>>> @@ -442,6 +443,7 @@ >>>>>>>> } >>>>>>>> } else { >>>>>>>> final int code = freq.readVInt(); >>>>>>>> + System.err.print("# (code): " + code + " "); >>>>>>>> if (!readTermFreq) { >>>>>>>> docID += code; >>>>>>>> termFreq = -1; >>>>>>>> >>>>>>>> then for a constant seed you'll get an identical sequence of values. >>>>>>>> Once the assertion hits though, the sequence deviates. Comparing a >>>>>>>> passing run (excluding readvint) and a failing run I get a lot of >>>>>>>> initial aligned outputs and then a deviation: >>>>>>>> >>>>>>>> (correct run): >>>>>>>> # (eof) >>>>>>>> # (eof) >>>>>>>> # (eof) >>>>>>>> # (code): 0 >>>>>>>> # (code): 3 >>>>>>>> # (code): 4 >>>>>>>> # (code): 3 >>>>>>>> >>>>>>>> (incorrect run): >>>>>>>> # (eof) >>>>>>>> # (eof) >>>>>>>> # (eof) >>>>>>>> # (code): 0 >>>>>>>> # (code): 4 <- wtf? >>>>>>>> # (code): 2 >>>>>>>> # (code): 2 >>>>>>>> >>>>>>>> How can a variable encoded integer be misinterpreted from 3 to 4 is >>>>>>>> beyond me, sorry. But it's not an EOF condition I think, at least the >>>>>>>> deviation happens where the original run had was entering (code) >>>>>>>> branch. >>>>>>>> >>>>>>>>> Could you also send latest version of FreqProxTermsWriterPerField.java >>>>>>>>> you >>>>>>>>> are testing? >>>>>>>> >>>>>>>> >>>>>>>> I'm testing against a fixed branch (the info is included in >>>>>>>> debug-log.txt). >>>>>>>> >>>>>>>> If there's anything else I can do, let me know. I'm also curious how >>>>>>>> you approach debugging this thing. It may be time-based so I don't >>>>>>>> think it'll reproduce on your machine out of the box, but who knows. >>>>>>>> All I know is that, for example, if you add one more sysout before the >>>>>>>> while loop it suddenly stops reproducing so probably something gets >>>>>>>> compiled differently... Wild >>>>>>>> >>>>>>>> Dawid
25-03-2014

On 8/19/13 4:40 PM, Christian Thalinger wrote: > > On Aug 19, 2013, at 4:12 PM, Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote: > >> Christian, >> >> 'exclude' or 'compileonly' will not help because the failure reproduction depends on inlining. > > Right. This brings me back to something we discussed a couple times already: compileonly should not disable inlining of callees. Someone should fix this. It may still not help because of InlineSmallCode - if other methods are not compiled they will be inlined. So it is different behavior. What we need is ReplayCompile which should not depend on state of java classes and be like PrintOptoAssembly when we dump initial MDOs of main and inlined methods, inlining decisions and results of all calls into VM during compilation to reproduce compilation exactly. Vladimir > > -- Chris > >> >> Dawid, >> >> It is great that you can build fastdebug VM. I assume that if I give you a patch to try you can also build with it. >> >> First, you can run with next options and then send zipped output (related part before the method compilation and optoassembler output, don't use hsdis for that) and hs_err file when test fails: >> >> -XX:CICompilerCount=1 -XX:+PrintInlining -XX:+TraceLoopOpts -XX:-BlockLayoutByFrequency -XX:-BlockLayoutRotateLoops -XX:CompileCommand=print,org.apache.lucene.index.FreqProxTermsWriterPerField::flush -XX:AbortVMOnException=java.lang.AssertionError >> >> Also, please, pull latest C2 Compiler changes from http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot >> >> Looking on java code in FreqProxTermsWriterPerField::flush() and based on LUCENE-5168 report generated code somehow missed EOF check. Am I right? This is why the Assert is thrown? >> >> And the failing code is next: >> >> } else { >> final int code = freq.readVInt(); >> if (!readTermFreq) { >> docID += code; >> termFreq = -1; >> } else { >> docID += code >>> 1; >> if ((code & 1) != 0) { >> termFreq = 1; >> } else { >> termFreq = freq.readVInt(); >> } >> } >> >> assert docID != postings.lastDocIDs[termID]; >> } >> >> Could you also send latest version of FreqProxTermsWriterPerField.java you are testing? >> >> Thanks, >> Vladimir >> >> On 8/19/13 2:45 PM, Christian Thalinger wrote: >>> >>> On Aug 19, 2013, at 1:29 PM, Dawid Weiss <dawid.weiss@gmail.com> wrote: >>> >>>> Thanks Vladimir, the tiered compilation hint was really very useful. I >>>> managed to reproduce this error on a 1.8 fastbuild and I can dump >>>> pretty much anything I want. But I still cannot figure out what's >>>> wrong -- it's beyond me. Here are the things I've tried per your >>>> suggestions >>>> >>>> - turning off -XX:-OptimizePtrCompare or -XX:-EliminateAutoBox >>>> doesn't help (error still present), >>>> - I do have a -Xbatch with a single compiler thread -- it was more >>>> difficult to hit an error seed (the tests are randomized) but it's >>>> still possible. >>>> >>>> I know it would be best to provide a stand-along reproduce package but >>>> it's not trivial given how complex Lucene tests and testing framework >>>> is. Can I provide you with anything that would be helpful except the >>>> above? Specifically I can: >>>> >>>> 1) dump an opto assembly from a failed and non-failed run >>>> 2) provide an opto assembly from a g1gc run vs. any other gc (which >>>> doesn't seem to exhibit the problem), >>>> 3) provide a -XX:+PrintCompilation -XX:+PrintCompilation2 or a verbose >>>> hotspot.log. >>>> >>>> Let me know if any of these (or anything else) would be useful. 
If >>>> not, I'll try to extract a stand-alone package that would reproduce >>>> the issue, although this is a real killer to pull off. >>> >>> How many total compiles are there? Maybe you can try to limit the methods being compiled with: >>> >>> -XX:CompileCommand=compileonly,foo::bar >>> >>> or >>> >>> -XX:CompileCommand=exclude,foo::bar >>> >>> One problem I see though is that using compileonly also prevents inlining which might hide the issue. >>> >>> -- Chris >>> >>>> >>>> Dawid >>>> >>>> >>>> On Fri, Aug 16, 2013 at 9:50 AM, Vladimir Kozlov >>>> <vladimir.kozlov@oracle.com> wrote: >>>>> On 8/15/13 11:45 PM, Dawid Weiss wrote: >>>>>>> >>>>>>> It is with high probability Compiler problem. >>>>>> >>>>>> >>>>>> I believe so. I've re-run the tests with 1.8b102 and the problem is >>>>>> still there, although it's more difficult to show -- I ran a 100 full >>>>>> builds yesterday, five of them tripped on assertions that should be >>>>>> unreachable. >>>>> >>>>> >>>>> We switched on -XX:+TieredCompilation by default in b102. Switch it off to >>>>> use only C2 compiler which has the problem. >>>>> >>>>> >>>>>> >>>>>>> G1 has larger write-barrier code then other GCs. It can affect inlining >>>>>>> decisions. You can try to change -XX:InlineSmallCode=1000 value. It >>>>>>> controls >>>>>>> inlining of methods which were already compiled. >>>>>>> >>>>>>> You can also try -Xbatch -XX:CICompilerCount=1 to get serial >>>>>>> compilations. >>>>>> >>>>>> >>>>>> Thanks for these tips, Vladimir -- very helpful. I hope you don't mind >>>>>> me asking one more question - we had a discussion with another Lucene >>>>>> developer yesterday -- is -Xbatch deterministic in the sense that if >>>>>> you run a single thread/ deterministic piece of code it will always >>>>>> trigger compiles at the same time? What happens if there are two >>>>>> uncoordinated threads that hit a set of the same methods (and thus >>>>>> when the compiler kicks in the statistics will probably be different >>>>>> for each independent run)? >>>>> >>>>> >>>>> -Xbatch (equivalent to -XX:-BackgroubdCompilation) will block only thread >>>>> which first put compilation task on compile queue. Other threads check that >>>>> the task in the queue and resume execution without waiting. >>>>> You still can't get full determinism with several java threads, as you >>>>> notice. But it can reduce some variations in inlining decision because >>>>> compilation will be executed by one Compiler thread (instead of 2 by >>>>> default). So if compilation tasks are put on queue at the same order in >>>>> different runs you most likely will get the same code generation. Of cause >>>>> usually the order is slightly different (especially during startup when >>>>> there are a lot of compilation requests) so you can still get different >>>>> results. >>>>> >>>>> >>>>>> >>>>>> This question originated from a broader discussion where we were >>>>>> wondering how you, the compiler-guru guys approach the debugging in >>>>>> case something like this pops up -- a bug that is very hard to >>>>>> reproduce, that manifests itself rarely and for which pretty much any >>>>>> change at the Java level changes the compilation and thus generates >>>>>> completely different code. This seems to be a tough nut to crack. >>>>> >>>>> >>>>> We usually try to reproduce the problem with debug version of VM which have >>>>> a lot asserts and we may hit one which helps identify the problem. You are >>>>> lucky if you can reproduce a problem in debug VM in debugger. 
>>>>> We try to get assembler output of compiled method during run when it >>>>> crushes. hs_err file has address and offset in compiled code and small code >>>>> snippet which helps to find the code. After that we "look hard" on assembler >>>>> code and try to figure out what is wrong with it and which compiler part can >>>>> generate such code pattern. >>>>> There is debug flag -XX:AbortVMOnException==java.lang.NullPointerException >>>>> which allow to abort VM on exceptions. And with -XX:+ShowMessageBoxOnError >>>>> flag we allow to attach debugger to VM when it happened. >>>>> When we get only core file it is tough. We try to use Serviceability Agent >>>>> to extract information and compiled code from it and other data. >>>>> >>>>> An other suggestion for you. Since you can avoid problem with switched off >>>>> EA you can try to switch off only >>>>> >>>>> -XX:-OptimizePtrCompare "Use escape analysis to optimize pointers >>>>> compare" >>>>> -XX:-EliminateAutoBox "Control optimizations for autobox elimination" >>>>> >>>>> Vladimir >>>>> >>>>>> >>>>>> Dawid >>>>>> >>>>> >>> >
25-03-2014

On Wed, Aug 14, 2013 at 8:58 AM, Mikael Gerdin wrote:
> Hi Dawid,
>
>
> On 2013-08-14 08:27, Dawid Weiss wrote:
>>
>> Hi everyone,
>>
>> I am a committer to the Lucene/Solr project. We've recently hit what
>> we believe is a JIT/GC bug -- it manifests itself only when G1GC is
>> used, on a 32-bit VM:
>>
>> Using Java: 32bit/jdk1.8.0-ea-b102 -server -XX:+UseG1GC
>> Java: 32bit/jdk1.7.0_25 -server -XX:+UseG1GC
>>
>> Here is the Lucene issue where more information is available:
>> https://issues.apache.org/jira/browse/LUCENE-5168
>>
>> In the essence, the problem is that the code hits an assertion (in
>> Java) which it should never reach. There used to be a problem with our
>> implementation of readByte which tripped C2, but this was patched by
>> an alternate implementation a while back -- see here, line 97 (inside
>> readVInt):
>>
>>
>> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/core/src/java/org/apache/lucene/store/DataInput.java
>>
>> This time it seems to be something else and is *not* easily
>> reproducible on a smaller example (it's not even reproducible on that
>> particular test all the time).
>>
>> Is there anything you can think of that we can do and which would help
>> you in narrowing down what the problem might be? I initially thought
>> to pass -XX:+PrintCompilation -XX:+PrintAssembly but this will result
>> in a huge log as this happens some time in the middle of a test run
>> (and not always). If there's a shorter route I'd be happy to use it.
>
>
> If you have a guess about which method is mis-compiled you can try with
> -XX:CompileCommand="print org/apache/foo::method"
> This enables +PrintNMethods on a per-method basis.
>
> If you suspect several methods you can use CompileCommandFile and create a
> text file with several "print" commands.
>
> You also need to compile the hsdis disassembler library and place it in the
> jre/lib/i386 directory to get the actual output from
> +Print{NMethods,Assembly}.
>
> /Mikael
>
>>
>> Dawid
>>
25-03-2014