JDK-8311906 : Race condition in String constructor
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 8-pool-perf,11,17,20,21
  • Priority: P4
  • Status: In Progress
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2023-07-11
  • Updated: 2023-09-25
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
A race condition in the String constructor taking a char[] (and probably other constructors too) allows creating a String with an incorrect coder: A String only containing latin-1 characters, but still encoded using UTF-16.
This is because in between the constructor checking if the content can be encoded using latin-1 and it being encoded as UTF-16, the content of the passed in array may have changed

See https://wouter.coekaerts.be/2023/breaking-string

Concurrently modify the char[] passed into the String constructor. See example code.

A String where .equals and other methods behave correctly
A String where .equals and other methods are inconsistent with its contents

---------- BEGIN SOURCE ----------
 * Given a latin-1 String, creates a copy that is
 * incorrectly encoded as UTF-16.
static String breakIt(String original) {
  if (original.chars().max().orElseThrow() > 256) {
    throw new IllegalArgumentException(
        "Can only break latin-1 Strings");

  char[] chars = original.toCharArray();

  // in another thread, flip the first character back
  // and forth between being encodable as latin-1 or not
  Thread thread = new Thread(() -> {
    while (!Thread.interrupted()) {
      chars[0] ^= 256;

  // at the same time call the String constructor,
  // until we hit the race condition
  while (true) {
    String s = new String(chars);
    if (s.charAt(0) < 256 && !original.equals(s)) {
      return s;

String a = "foo";
String b = breakIt(a);

// they are not equal to each other
// => false

// they do contain the same series of characters
// => true
---------- END SOURCE ----------

FREQUENCY : always

[~rriggs] Didn't know you are working on this bug, thought it was just another assignment that's forgotten over time. So here's a quick summary of my approach: Problem: racy input may lead to coder == UTF16 but value is totally LATIN1 compatible, and is the only case we need to defend against. Solution: when we promote Strings to UTF16, we must perform a final check that the value is not LATIN1 compatible when we obtain a trusted UTF16 value array. Implementation: I moved the compress calls in such untrusted contexts into a guarded method, and decode UTF-8 (which requires double reads) from a trusted array. The final checks are added, but there's still a pass before copying to create a UTF16 array (much like double null check + synchronization) Shortcoming: This will lead to increased CPU cost (compress array once more after putting into trusted UTF16 array) for correctness. The memory consumption should be the same. Testing: I ran test/jdk/java/lang/String tests. Might need to pass my current patch to more powerful devices to test the exact performance impacts and whether it affects hotspot. I also ran the reproduction case given here, but we might need JCStress tests to ensure correctness across platforms.

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/15902 Date: 2023-09-25 12:28:40 +0000

Requiring the input array to be unchanged during the call is most likely not sufficient to ensure correctness per Java memory model. A first plain read may see a write before that subsequent reads never see, so the writes to the array must be consistently visible to the plain array index reads.

BigInteger constructor has the same "problem". https://gist.github.com/turbanoff/bcd234bd3469e177f5da0e9f3cf85c2a But its javadoc has note that "val" array is assumed to be unchanged for the duration of the constructor call. https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/math/BigInteger.html#%3Cinit%3E(byte%5B%5D)

In general, we should read user arrays like how we read from benign races, where we only read once. The methods in StringLatin1 etc. should be updated to take the user array and convert it to byte array + coder at once, instead of reading the user array twice (once in trying to encode in latin1 and the other in encoding to utf16) It's also possible for the reverse situation, where an array is incorrectly interpreted as able to be encoded into latin1, as countNegatives + actual encoding performs more than one reads as well.

The observations on Windows 10: JDK 8: The reproducer is not applicable. JDK 11: Failed, false value returned. JDK 17: Failed. JDK 20: Failed. JDK 21ea+20: Failed.