JDK-8369506 : Bytecode rewriting causes Java heap corruption on AArch64
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21, 22, 23, 24, 25
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • CPU: aarch64
  • Submitted: 2025-10-09
  • Updated: 2025-10-17
  • Resolved: 2025-10-16
Fix Versions
  • JDK 25: 25.0.2 (Fixed)
  • JDK 26: 26 master (Fixed)
Description
`RewriteBytecodes` is on by default, and on AArch64 we have confirmed that it causes Java heap corruption. The root cause is a missing barrier on the path which updates the bytecode. `ResolvedFieldEntry::fill_in` uses `STLR` to resolve the `ResolvedFieldEntry`, and the interpreter uses a plain `STR` to update the bytecode shortly after. Unfortunately, AArch64 allows a subsequent plain `STR` to be reordered ahead of an earlier `STLR` (the release only orders accesses that precede it), so other threads can observe the patched bytecode before the `ResolvedFieldEntry` is resolved. Since the fast bytecode handlers expect the `ResolvedFieldEntry` to be resolved, this leads to corruption.
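
As a rough illustration of the ordering problem (not code from HotSpot itself), here is a minimal standalone C++ sketch that models the race with std::atomic: the release store plays the role of the `STLR` that publishes the resolved entry, and the relaxed store that follows plays the role of the plain `STR` that patches the bytecode. All names are invented for the example.
```
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for the real data structures, used only to model the ordering:
// 'field_offset' represents the ResolvedFieldEntry payload, 'get_code' its resolved marker,
// and 'bytecode' the byte at bcp(0) that is rewritten to the fast bytecode.
std::atomic<int> field_offset{0};
std::atomic<int> get_code{0};
std::atomic<int> bytecode{0};

void resolver() {
  field_offset.store(42, std::memory_order_relaxed); // plain stores filling in the entry
  get_code.store(1, std::memory_order_release);      // models the STLR: "entry is resolved"
  bytecode.store(1, std::memory_order_relaxed);      // models the later plain STR that patches
  // The release only orders the *earlier* stores before get_code; nothing keeps the
  // bytecode store behind it, so another core may see bytecode == 1 while
  // field_offset still reads 0.
}

void interpreter() {
  if (bytecode.load(std::memory_order_relaxed) == 1) { // dispatched to a fast_* handler
    int off = field_offset.load(std::memory_order_relaxed);
    assert(off != 0); // can fire on a weakly ordered machine
  }
}

int main() {
  std::thread t1(resolver), t2(interpreter);
  t1.join();
  t2.join();
  return 0;
}
```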

This was confirmed by inserting logic near the beginning of the fast bytecode handlers (`TemplateTable::fast_*`) which checks whether the field offset from `ResolvedFieldEntry` is 0 and calls `stop` if it is. In rare circumstances the stop was triggered, confirming that the fast handlers can observe an unresolved entry; a zero field offset means the field access lands at offset 0 of the object, clobbering the markWord.
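
For illustration only, a check of this kind might look roughly like the following in an AArch64 `fast_*` handler; the register choices and the message are assumptions, not the exact debugging code that was used:
```
// Hypothetical debugging check near the top of a fast_* handler, after the
// ResolvedFieldEntry has been loaded into r2 by load_field_entry().
Label L_offset_ok;
__ ldrw(rscratch1, Address(r2, in_bytes(ResolvedFieldEntry::field_offset_offset())));
__ cbnzw(rscratch1, L_offset_ok);
__ stop("fast bytecode observed an unresolved ResolvedFieldEntry (field offset is 0)");
__ bind(L_offset_ok);
```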

This manifests itself as impossible branches being taken (`return foo == null || foo.isEmpty()` throwing a NullPointerException for `foo`, or `if (foo != null)` not being taken when `foo` is guaranteed not to be null), interpreter crashes related to corrupted receiver oops on the stack, or GC crashes related to corrupted oops. Typically, if the JVM is going to crash (it does not always), it does so within the first minute or two.

We do not have a reproducing test case to share as it is extremely rare, but it is observable at scale. We have an internal test which reproduces it, but it is large, proprietary, and impossible to isolate.

The only workaround is disabling bytecode rewriting by passing `-XX:-RewriteBytecodes`.

NOTE: This affects OpenJDK going all the way back through at least 21.

We performed benchmarks with bytecode rewriting on and off, and the difference was negligible. This raises the question: on modern hardware, with C1 and C2 available, is bytecode rewriting still necessary?

There are three options to resolve this:

1. Disable bytecode rewriting on AArch64 and likely other weak-memory-model architectures.
2. Ditch bytecode rewriting altogether in HotSpot.
3. Add the necessary barriers. I have a patch which adds the minimum barriers necessary to make it safe according to the AArch64 specification.
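
For context, a minimal sketch of what option 3 could look like, based on the code quoted in the comments below; this is an assumption about the shape of the fix, not necessarily the patch that was integrated:
```
// Hypothetical change in TemplateTable::patch_bytecode (aarch64): make the bytecode
// store a release store so it cannot become visible before the stores that resolved
// the ResolvedFieldEntry.
//
//   was: __ strb(bc_reg, at_bcp(0));
__ lea(rscratch1, at_bcp(0));   // STLRB only takes a base register, so form the address first
__ stlrb(bc_reg, rscratch1);
// On the reader side, the existing membar(MacroAssembler::LoadLoad) at the end of
// InterpreterMacroAssembler::load_field_entry() orders the bytecode load before the
// ResolvedFieldEntry loads in the fast_* handlers (see the discussion below).
```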
Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk21u-dev/pull/2355 Date: 2025-10-17 14:50:15 +0000
17-10-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk25u/pull/322 Date: 2025-10-17 00:11:19 +0000
17-10-2025

[jdk25u-fix-request] Approval Request from Justin King
17-10-2025

Changeset: 18fd0477 Branch: master Author: Justin King <jcking@openjdk.org> Date: 2025-10-16 19:59:13 +0000 URL: https://git.openjdk.org/jdk/commit/18fd04770294e27011bd576b5ea5fe43fa03e5e3
16-10-2025

> With the new STLR when writing patched_bc, I wonder if it is correct and better to move the membar(MacroAssembler::LoadLoad) earlier, immediately after loading patched_bc. E.g.:

It wouldn't hurt, but this patch is mostly about GBOoO machines, which will prefetch everything anyway.
16-10-2025

> I was also initially toying with the idea of deprecating `RewriteBytecodes`. It looks like it still provides some value today. It is worth keeping given that we can fix this bug with a small code change.

Yes, absolutely. Every efficient system is composed of thousands of small optimizations, each one too small to make a significant difference on its own.
16-10-2025

> [~shade]: I tried a few workloads on x86_64 server. The most impact with -XX:-RewriteBytecodes I could measure is ~0.5% in pure -Xint on a large workload. When compilers are enabled, the difference is a wash. Startup does not seem to be affected. The only configuration that seems to care is Zero, where non-quickened bytecodes hurt for ~5%.

I also ran 2009 DaCapo benchmarks with `-XX:-RewriteBytecodes` on both x86_64 and aarch64. I set a large heap size to minimize the impact from GC. The difference is negligible when JIT compilers are enabled. However, with `-Xint`, I see a noticeable regression with `-XX:-RewriteBytecodes`:
- x86_64: average of ~8% regression for throughput and CPU
- aarch64: average of ~15% regression for throughput and CPU

On both architectures, DaCapo's sunflow shows the largest regression.

I was also initially toying with the idea of deprecating `RewriteBytecodes`. It looks like it still provides some value today. It is worth keeping given that we can fix this bug with a small code change.
15-10-2025

Relying on the membar(MacroAssembler::LoadLoad) from InterpreterMacroAssembler::load_field_entry() looks right to ensure load order of patched_bc; rfe_contents. With the new STLR when writing patched_bc, I wonder if it is correct and better to move the membar(MacroAssembler::LoadLoad) earlier, immediately after loading patched_bc. E.g.:
```
void TemplateTable::fast_storefield(TosState state) {
  membar(MacroAssembler::LoadLoad);
  transition(state, vtos);

  ByteSize base = ConstantPoolCache::base_offset();

  jvmti_post_fast_field_mod();

  __ load_field_entry(r2, r1);
  load_resolved_field_entry(r2, r2, noreg, r1, r5);
  ...
}
```
15-10-2025

> Given that there is only one load memory barrier, why do we need two release stores?
> Sure, but what is the first one for?

Because AFAICS there are two types of consumers. We are fixing one of them, but the other is still there. The second one is whoever treats RFE::(get|set)_code as witness that RFE is resolved:
```
inline void set_bytecode(u1* code, u1 new_code) {
  ...
  AtomicAccess::release_store(code, new_code);
}

// Populate the strucutre with resolution information
void fill_in(InstanceKlass* klass, int offset, u2 index, u1 tos_state, u1 b1, u1 b2) {
  ...
  // These must be set after the other fields
  set_bytecode(&_get_code, b1);
  set_bytecode(&_put_code, b2);
}

u1 get_code() const { return AtomicAccess::load_acquire(&_get_code); }
u1 put_code() const { return AtomicAccess::load_acquire(&_put_code); }

bool is_resolved(Bytecodes::Code code) const {
  switch(code) {
  case Bytecodes::_getstatic:
  case Bytecodes::_getfield:
    return (get_code() == code);
  case Bytecodes::_putstatic:
  case Bytecodes::_putfield:
    return (put_code() == code);
  default:
    ShouldNotReachHere();
    return false;
  }
}
```
15-10-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/27748 Date: 2025-10-10 16:21:17 +0000
15-10-2025

> I tried it with herd7, switching the second STLR to STR fails and allows the impossible condition to be observed.

Sure, but what is the first one for?
15-10-2025

> Ha! YES! We have fixed JDK-8248219 and JDK-8327647 before. So it looks like we only need STLR on the bytecode patching side. Assuming DMB ISHLD from membar(LoadLoad) composes well with STLR?

Ah, good find. With the load barrier the two STLRs are enough.

> Given that there is only one load memory barrier, why do we need two release stores?

I tried it with herd7, switching the second STLR to STR fails and allows the impossible condition to be observed.
15-10-2025

Ha! YES! We have fixed JDK-8248219 and JDK-8327647 before. So it looks like we only need STLR on the bytecode patching side. Assuming DMB ISHLD from membar(LoadLoad) composes well with STLR? Looks like it does:
```
AArch64 STLR+STLR+DMBLD
{ 0:X1=x; 0:X3=y; 1:X5=y; 1:X7=x; }
P0           | P1          ;
MOV X0,#1    | LDR X4,[X5] ;
STLR X0,[X1] | CMP X4,#0   ;
MOV X2,#1    | B.EQ under  ;
STLR X2,[X3] | DMB ISHLD   ;
             | LDR X6,[X7] ;
             | under:      ;
exists(1:X4=1 /\ 1:X6=0)
```
15-10-2025

> The only way to address this, without LDAR on all bytecodes, is to replicate the field resolution logic which basically makes the slow and fast paths almost the same.

Maybe we all missed this at the end of InterpreterMacroAssembler::load_field_entry:
```
lea(cache, Address(cache, index));
// Prevents stale data from being read after the bytecode is patched to the fast bytecode
membar(MacroAssembler::LoadLoad);
```
15-10-2025

> The only way to address this, without LDAR on all bytecodes, is to replicate the field resolution logic which basically makes the slow and fast paths almost the same

AFAICS, you can also put ISB somewhere strategically, so it is between entering the fast bytecode codelet and touching the ResolvedFieldEntry. I'd say somewhere in the TemplateTable::fast_* handlers that load RFE, i.e. those touching the fields. But it is unclear to me how much it will slow these fast bytecodes down. Maybe not a lot? Maybe quite a bit? And if it does slow things down, is it still better than not rewriting the bytecodes to begin with?.. Plus, you'll need that for every other weakly-ordered architecture, as it does not look like a uniquely AArch64 problem.

> So perhaps we just retire the flag entirely, make it a noop, remove it from template interpreters, and let Zero continue rewriting?

RewriteBytecodes is already a product_pd flag, which requires an explicit decision per architecture. So you can just go around globals_${arch}.hpp and flip them to `false`, leaving globals_zero.hpp flipping it to `true`. I propose to float that idea in a separate RFE.
15-10-2025

Given that there is only one load memory barrier, why do we need two release stores?
15-10-2025

> The test is not entirely correct: the P1 thread basically tests: `if ([x5] == 0) x6 = [x7]`. I think you wanted to test for `!=`, so replace `B.EQ` -> `B.NE`. A clue that something is off is replacing `STLR` -> `STR`, without the test catching anything interesting :)

Whoops. You are correct. In that case, the `0` field offset is observed, and the only way to not observe it is putting LDAR on the bytecode load. Which would have to be done for all bytecodes. The only way to address this, without LDAR on all bytecodes, is to replicate the field resolution logic, which basically makes the slow and fast paths almost the same. In that case, I think I have to concur that RewriteBytecodes is likely not useful outside of Zero and it should probably go. But IIRC Zero requires it. So perhaps we just retire the flag entirely, make it a noop, remove it from template interpreters, and let Zero continue rewriting?
14-10-2025

RT Triage: ILW - HLH = P2
14-10-2025

> A herd7 test would add a little more evidence.

It is literally MP+dmb+ctrl vs MP+dmb+ctrlisb in Maranget/Sarkar/Sewell. herd7 has MP+dmb.sy+po bundled, which you can quickly modify to show both examples:
```
AArch64 MP+dmb.sy+ctrl(isb)
{ 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; }
P0          | P1          ;
MOV W0,#1   | LDR W0,[X1] ;
STR W0,[X1] | CBZ W0,LC00 ;
DMB SY      | LC00:       ;
MOV W2,#1   |             ; (*** insert ISB here ***)
STR W2,[X3] | LDR W2,[X3] ;
exists (1:X0=1 /\ 1:X2=0)
```
...or if you want to have a second read exactly under the branch, although M/S/S and the AArch64 MM tell me there is no difference:
```
AArch64 MP+dmb.sy+ctrl(isb)+8369506
{ 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; }
P0          | P1          ;
MOV W0,#1   | LDR W0,[X1] ;
STR W0,[X1] | CBZ W0,LC00 ;
DMB SY      |             ; (*** insert ISB here ***)
MOV W2,#1   | LDR W2,[X3] ;
STR W2,[X3] | LC00:       ;
exists (1:X0=1 /\ 1:X2=0)
```
14-10-2025

> That seems to show that two STLRs are enough? But I am not super confident I wrote it correctly.

The test is not entirely correct: the P1 thread basically tests: `if ([x5] == 0) x6 = [x7]`. I think you wanted to test for `!=`, so replace `B.EQ` -> `B.NE`. A clue that something is off is replacing `STLR` -> `STR`, without the test catching anything interesting :) Or simplify it like this:
```
AArch64 STLR+STLR
{ 0:X1=x; 0:X3=y; 1:X5=y; 1:X7=x; }
P0           | P1          ;
MOV X0,#1    | LDR X4,[X5] ;
STLR X0,[X1] | CMP X4,#0   ;
MOV X2,#1    | B.EQ under  ;
STLR X2,[X3] |             ; (*** insert ISB here ***)
             | LDR X6,[X7] ;
             | under:      ;
exists(1:X4=1 /\ 1:X6=0)
```
14-10-2025

I tried my hand at writing a test for herd7:
```
AArch64 STLR+STLR
{ 0:X1=x; 0:X3=y; 1:X5=y; 1:X7=x; }
P0           | P1          ;
MOV X0,#1    | LDR X4,[X5] ;
STLR X0,[X1] | CMP X4,#0   ;
MOV X2,#1    | B.EQ under  ;
STLR X2,[X3] | MOV X6,#2   ;
             | B over      ;
             | under:      ;
             | LDR X6,[X7] ;
             | over:       ;
exists(1:X4=1 /\ 1:X6=0)
```
That seems to show that two STLRs are enough? But I am not super confident I wrote it correctly.
14-10-2025

What a bug, very nicely done finding it! I read the report over and over again, getting confused and unconfused several times. So let me see if I got it right.

Thread 1: (resolving and patching the bytecode)
```
; now in ResolvedFieldEntry::fill_in
entry->_field_holder = ...
entry->_field_offset = ...
entry->_field_index = ...
entry->_tos_state = ...
// === (1) ANNOUNCE RFE IS RESOLVED ===
release_store(&entry->_get_code, b1)
release_store(&entry->_put_code, b2)

; gets back to interpreter
; eventually comes up to code generated by TemplateTable::patch_bytecode:
__ load_field_entry(temp_reg, bc_reg);
...
__ lea(temp_reg, Address(temp_reg, in_bytes(ResolvedFieldEntry::get_code_offset())));
...
__ ldarb(temp_reg, temp_reg); // acquire
__ movw(bc_reg, bc);
__ cbzw(temp_reg, L_patch_done); // don't patch
...
// === (2) PATCHING BYTECODE ===
__ strb(bc_reg, at_bcp(0));
__ bind(L_patch_done);
```

Thread 2: (observes patched/fast bytecode)
```
; in interpreter, in code generated by TemplateTable::fast_accessfield
__ load_field_entry(r2, r1);
; === (3) READING RFE FIELDS (sometimes broken) ===
__ load_sized_value(r1, Address(r2, in_bytes(ResolvedFieldEntry::field_offset_offset())), sizeof(int), true /*is_signed*/);
__ load_unsigned_byte(r3, Address(r2, in_bytes(ResolvedFieldEntry::flags_offset())));
```

Rephrasing the situation in a simpler pseudo-code test:
```
// Initially all 0
int rfe_contents;
int rfe_code;
int patched_bc;

Thread 1:
  rfe_contents = 1;
  release_store(&rfe_code, 1)
  int t = acquire(&rfe_code);
  patched_bc = t; // can assume storing 1 here for simplicity: rfe_code loads are red herring

Thread 2:
  if (patched_bc == 1) {
    // we are here because we see "fast" patched bc
    int r1 = rfe_contents;
    assert(r1 != 0); // fails
  }
```

So the actual data race is on `patched_bc` itself: we effectively use it as "witness" that `RFE` is resolved. But it is a racy witness, so it is broken. So...

> Our internally tested patch updates the bytecode patching to use `STLR` and adds `LDAR` for the first load from `ResolvedFieldEntry` in the `TemplateTable::fast_*`.

See that putting acquire over `rfe_contents` does not actually resolve the race, because it is not a witness! In other interpretation, you really want:
```
Thread 1:
  rfe_contents = 1;
  ...
  release_store(&rfe_code, 1)
  release_store(&patched_bc, 1)

Thread 2:
  if (load_acquire(&patched_bc) == 1) {
    // rfe_contents load cannot float up
    int r1 = rfe_contents;
    assert(r1 != 0); // should pass
  }
```
...but with acquire over `rfe_contents` you get:
```
Thread 1:
  rfe_contents = 1;
  ...
  release_store(&patched_bc, 1)

Thread 2:
  if (patched_bc == 1) {
    // patched_bc load can float down
    int r1 = load_acquire(&rfe_contents); // acquire is TOO LATE
    assert(r1 != 0); // fails
  }
```
Putting acquire over `rfe_contents` may help to hide the problem, depending on what the hardware implementation actually does. Also, in our case, this would require heavy ahead speculation: we need to assume the bytecode is patched, speculatively enter the codelet for it, and start loading `rfe_contents` before our initial speculation on the bytecode value is substantiated. Still, I think this is plausible, and so doing LDAR on the first RFE load is not a memory-model-tight solution for a generic architecture.

If all of the above is true, then the solution to the problem is a bit worse. Either we fully sync up bytecode patches (with releases) and bytecode loads (with acquires):
```
Thread 1:
  rfe_contents = 1;
  release_store(&rfe_code, 1)
  ...
  { // logically in bytecode patching
    release_store(&patched_bc, 1)
  }

Thread 2:
  // logically in bytecode loads
  int pbc = load_acquire(&patched_bc);
  if (pbc == 1) {
    int r1 = rfe_contents;
    assert(r1 != 0); // should pass
  }
```
This is bad, because it effectively says to acquire every bytecode load, on the off-chance it is a fast bytecode. (Aside: It does not have to be load-acquiring `LDAR`, it can be dependency-based; but it would probably be a control dependency, which would require ISB.)

Alternatively, and more crudely, we can fence up `RFE` to have barriers on its side:
```
Thread 1:
  { // Logically in RFE::fill_in
    rfe_contents = 1;
    release_store_fence(&rfe_code, 1) // <--- explicit fence at the end
  }
  patched_bc = 1;

Thread 2:
  int pbc = patched_bc;
  if (pbc == 1) {
    { // Logically in fast bytecodes access to RFE
      fence(); // <--- explicit load fence
      int r1 = rfe_contents;
      assert(r1 != 0); // should pass
    }
  }
```
I suspect both would give us a similar interpreter performance hit as just disabling RewriteBytecodes. I tried a few workloads on x86_64 server. The most impact with -XX:-RewriteBytecodes I could measure is ~0.5% in pure -Xint on a large workload. When compilers are enabled, the difference is a wash. Startup does not seem to be affected. The only configuration that seems to care is Zero, where non-quickened bytecodes hurt for ~5%.

This is why I think the sane way out is to turn off `RewriteBytecodes` for all template interpreters :)
14-10-2025

> No doubt, but the memory model has changed substantially since then, as has the herd7 simulator.

That's true. I was merely pointing out that this part seems to be rock-solid over the years: the M/S/S paper explicitly calls this out, the old 2017-ish AArch64 spec has a human-readable paragraph on what powers reads and writes get under a control dependency, the new 2025-ish AArch64 spec has a confusing stack of definitions which amounts to the same, and herd7 directly shows the (counter-)example. So all the sources agree that you need ISB for a control dependency to enforce ordering for read->read chains.
14-10-2025

> It is literally MP+dmb+ctrl vs MP+dmb+ctrlisb in Maranget/Sarkar/Sewell.

No doubt, but the memory model has changed substantially since then, as has the herd7 simulator.
14-10-2025

> To be specific, control dependency hooks reads->*writes* in dependency-ordered-before. For reads->reads, control dependencies only act in Instruction-fetch-barrier-ordered-before, which needs a context synchronization event (e.g. ISB).

[Edited] I think I found the section to which you refer. It's:

  All of the following apply:
  — There is a Control dependency from E1 to E2.
  — One of the following applies:
    — E2 is an Explicit Memory Write Effect.
  ...

A herd7 test would add a little more evidence.
14-10-2025

Oh, there is a draft PR, nice. It seems to implement a mix of `release_store(&_patched_bc, 1)` and a control dependency between the patched_bc load and the rfe_contents load. Still, the ARMv8 MM gives power to this control dependency only in limited cases. The new formulation of the ARMv8 MM spec (https://developer.arm.com/documentation/ddi0487/latest/) is really confusing to read, but the gist of it looks the same as the older spec formulation. It defines the control dependency in B2.3.6 (we have one here), but gives that control dependency power only in specific cases, B2.3.7. To be specific, control dependency hooks reads->*writes* in dependency-ordered-before. For reads->reads, control dependencies only act in Instruction-fetch-barrier-ordered-before, which needs a context synchronization event (e.g. ISB). It is explained in a much more down-to-Earth manner in "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models", 4.3 "Control-isb/isync Dependencies": https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
13-10-2025

I updated the description to remove the mention of the missing barrier in the `TemplateTable::fast_*` handlers. I was mistaken about its necessity. I re-tested without it, and just the STLR when patching the bytecode seems to be enough. We are still discussing, internally and in https://github.com/openjdk/jdk/pull/27748, whether anything is required in the `TemplateTable::fast_*` handlers.
13-10-2025

Yes. This would require ISB for the control dependency to have power for this read->read, IIRC. I checked, and it says so as well in Maranget/Sarkar/Sewell's tutorial.
13-10-2025

```
Thread 2:
  int pbc = patched_bc;
  if (pbc == 1) {
    { // Logically in fast bytecodes
      // There is a control dependency between reading the patched bytecode and reading RFE
      // This load is control dependent on pbc:
      int r1 = rfe_contents;
```
13-10-2025

This may well be right, but I believe a reproducer may be possible. I'd try altering the interpreter to rewrite bytecodes repeatedly, then run jshell with -Xint.
11-10-2025

Moved to hotspot/runtime as bytecode rewriting is an interpreter (not a JIT) optimization.
10-10-2025

Thank you Justin for adding the details on how you tracked this down. I can imagine how difficult and time consuming the investigative work would be. Impressive work.
10-10-2025

It's both the store reordering and the missing ldar in the fast bytecode path. Just adding the ldar in the fast bytecode path was not enough.
10-10-2025

> The problem here is that the STLR happens on `ResolvedFieldEntry::_get_code` and `ResolvedFieldEntry::_put_code`, but the fast bytecode path has no corresponding LDAR or any load on either of these two fields.

The description indicates the problem was store reordering. That said, anywhere you require release semantics on a store, there must be corresponding acquire semantics on a load (directly or indirectly) for it to have the desired effect.
10-10-2025

> I think people misunderstand where the "release" part of STLR actually happens. It really pays to think of this as "release(); store()".

That is a very effective way of viewing it, and I wish the AArch64 docs were that clear. That makes it even more clear what is going on. That is effectively a `STR` followed by a `STR`, of which there is no ordering requirement as there is no dependency between them. The core is free to make either observable first. Our internally tested patch updates the bytecode patching to use `STLR` and adds `LDAR` for the first load from `ResolvedFieldEntry` in the `TemplateTable::fast_*` handlers. I am not confident the `LDAR` is required, but based on your explanation of `STLR` the new `STLR` when setting the bytecode is definitely needed.
09-10-2025

> [~jcking] what kind of hardware have you observed this on?

It is on our proprietary Axion ARM CPU: https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu

> I think people misunderstand where the "release" part of STLR actually happens. It really pays to think of this as "release(); store()".

The problem here is that the STLR happens on `ResolvedFieldEntry::_get_code` and `ResolvedFieldEntry::_put_code`, but the fast bytecode path has no corresponding LDAR or any load on either of these two fields.
09-10-2025

[~jcking] what kind of hardware have you observed this on?

> `ResolvedFieldEntry::fill_in` uses `STLR` to resolve the `ResolvedFieldEntry` and the interpreter uses `STR` to update the bytecode shortly after.

I think people misunderstand where the "release" part of STLR actually happens. It really pays to think of this as "release(); store()".
09-10-2025

[~jpai] Lots of trial, error, and pain. With our internal repro we systematically ruled out various components of the JVM, such as C1, C2, and G1, by disabling them during runs. That left the interpreter. We had a suspicion it was related to atomics and ended up auditing parts of the AArch64-specific interpreter bits. We had not observed the same issues on x86. Errors disappeared when we passed `-XX:-RewriteBytecodes` (note that errors still occur if you only pass `-XX:-RewriteFrequentPairs` instead of `-XX:-RewriteBytecodes`): no more impossible branches or JVM crashes. At that point, we added custom assertions in TemplateTable based on our suspicions to narrow it down. It was computationally expensive, as we had to run the test thousands of times to reproduce the failure once or twice; the test itself has subtests which are sharded, and each one can take minutes to run.
09-10-2025

Hello Justin [~jcking], out of curiosity, how did you narrow down the JVM crashes and the impossible branches issues back to this very specific part of the JVM code?
09-10-2025

[~aph] Please let me know what you think.
09-10-2025