JDK-8325590 : Regression in round-tripping UTF-16 strings after JDK-8311906
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 22
  • Priority: P1
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2024-02-10
  • Updated: 2024-07-03
  • Resolved: 2024-02-13
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 22 JDK 23
22.0.1Fixed 23 b10Fixed
Related Reports
Duplicate :  
Relates :  
Description
I'm seeing a regression when writing a string to a file as UTF-16 and reading it back in using `Files.readString(String, Charset)` after JDK-8311906 (Improve robustness of String constructors with mutable array inputs).

Repro:

```
import static java.nio.charset.StandardCharsets.UTF_16;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Y {
  public static void main(String[] args) throws IOException {
    String original = "🤸";
    Path tmp = Files.createTempFile("tmp", ".txt");
    Files.writeString(tmp, original, UTF_16);
    String actual = Files.readString(tmp, UTF_16);
    System.err.printf("expected (%s), was (%s)\n", original, actual);
    System.err.printf("expected UTF_16 bytes: %s\n", Arrays.toString(original.getBytes(UTF_16)));
    System.err.printf("actual UTF_16 bytes: %s\n", Arrays.toString(actual.getBytes(UTF_16)));
  }
}
```

$ java -fullversion
openjdk full version "21.0.2+13-58"
$ java Y
expected (🤸), was (🤸)
expected UTF_16 bytes: [-2, -1, -40, 62, -35, 56]
actual UTF_16 bytes: [-2, -1, -40, 62, -35, 56]

$ java -fullversion
openjdk full version "22+35-2369"
$ java Y
expected (🤸), was (>Ø8Ý)
expected UTF_16 bytes: [-2, -1, -40, 62, -35, 56]
actual UTF_16 bytes: [-2, -1, 0, 62, 0, -40, 0, 56, 0, -35]
Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk22u/pull/55 Date: 2024-02-16 15:58:35 +0000
16-02-2024

Fix Request The fix for the regression of `java.nio.file.Files.readString(path, charset)` should be backported to jdk22u. The fix is in the main line, is well tested, and approved by the group lead.
13-02-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk22u/pull/52 Date: 2024-02-13 16:40:18 +0000
13-02-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk22/pull/110 Date: 2024-02-13 15:57:31 +0000
13-02-2024

Changeset: 13d9e8ff Author: Roger Riggs <rriggs@openjdk.org> Date: 2024-02-13 15:16:50 +0000 URL: https://git.openjdk.org/jdk/commit/13d9e8ff38536287b82c54bb63bd2d20f65615dc
13-02-2024

Deferral Request This issue with `java.nio.file.Files.readString(path, charset)` can be deferred to an update release. The scope of bug is narrow, affecting a single method and with less frequently used charsets. The most commonly used charsets US_ASCII, ISO_8859_1, and UTF-8 are not affected; other multi-byte charsets such as UTF-16 may produce incorrect results. The release note for the issue identifies a workaround to disable compact strings using the command line flag `-XX:-CompactStrings`. The fix is straightforward and is available for the next update release.
13-02-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/17817 Date: 2024-02-12 21:29:02 +0000
12-02-2024

The issue affects Files.readString with any character set with a multi-byte encoding. Strings resulting from the incorrect encoding may be difficult to diagnose.
12-02-2024

The usages in HexFormat and UUID are 8859_1 so it looks like the impact is limited to Files.readString. This is a new method since Java 11.A search of 11 million classes in 484k artifacts from Maven Central found 1174 usages but this doesn't tell us which charsets are used.
12-02-2024

[~cushon] Thanks for the bug report and the analysis, it does appear it should be calling coderFromArrayLen with the char array length, not the byte array length. I've proactively created a backport issue for 22. Need to decide quickly if this is serious enough to build another release candidate.
12-02-2024

Also, the expected behaviour is observed with -XX:-CompactStrings $ java -XX:-CompactStrings Y expected (🤸), was (🤸) expected UTF_16 bytes: [-2, -1, -40, 62, -35, 56] actual UTF_16 bytes: [-2, -1, -40, 62, -35, 56] $ java -XX:+CompactStrings Y expected (🤸), was (>Ø8Ý) expected UTF_16 bytes: [-2, -1, -40, 62, -35, 56] actual UTF_16 bytes: [-2, -1, 0, 62, 0, -40, 0, 56, 0, -35]
11-02-2024

I'm going to tentatively increase the priority for consideration as a JDK 22 release blocker. Sorry if I'm missing anything and this is a false alarm.
11-02-2024

I think the issue might be in this logic on L844 of String (https://github.com/openjdk/jdk/blob/af7eeffddb40a4786e672e1a4b5bd9426578cd87/src/java.base/share/classes/java/lang/String.java#L844) if (COMPACT_STRINGS) { byte[] val = StringUTF16.compress(ca, 0, caLen); int coder = StringUTF16.coderFromArrayLen(val, len); return new String(val, coder); } When we get to 'coderFromArrayLen', val is `[62, -40, 56, -35]` and len is 6. The docs for `coderFromArrayLen` say it should be equivalent to `(len == val.length) ? LATIN1 : UTF16`, which is UTF16 (1). The actual value is LATIN1 (0). The bit shift in the implementation effectively returns `(len < val.length) ? UTF16 : LATIN1`. However I think the culprit is probably that the argument for coderFromArrayLen should be caLen instead of len, to match the call to 'compress'. The value of caLen is 2, which is less than val.length, and then coderFromArrayLen would return UTF16. Additionally, the result of coderFromArrayLen is then widened from a byte to an int when it's returned, and then `new String(val, coder)` is calling `String(byte[] ascii, int hibyte)` instead of `String(byte[] value, byte coder)`. I think the fix may be: diff --git a/src/java.base/share/classes/java/lang/String.java b/src/java.base/share/classes/java/lang/String.java index b6e3805653a..fb6f738f855 100644 --- a/src/java.base/share/classes/java/lang/String.java +++ b/src/java.base/share/classes/java/lang/String.java @@ -841,7 +841,7 @@ private static String newStringNoRepl1(byte[] src, Charset cs) { } if (COMPACT_STRINGS) { byte[] val = StringUTF16.compress(ca, 0, caLen); - int coder = StringUTF16.coderFromArrayLen(val, len); + byte coder = StringUTF16.coderFromArrayLen(val, caLen); return new String(val, coder); } return new String(StringUTF16.toBytes(ca, 0, caLen), UTF16);
11-02-2024