JDK-8356165 : System.in in jshell replace supplementary characters with ??
  • Type: Bug
  • Component: tools
  • Sub-Component: jshell
  • Affected Version: 21,25
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2025-05-03
  • Updated: 2025-05-29
  • Resolved: 2025-05-20
Fixed in:
JDK 25: 25 b24 (Fixed)
Description
ADDITIONAL SYSTEM INFORMATION :
openjdk 25-ea 2025-09-16
OpenJDK Runtime Environment (build 25-ea+21-2530)
OpenJDK 64-Bit Server VM (build 25-ea+21-2530, mixed mode, sharing)

Windows 11 24H2

-----

openjdk 24 2025-03-18
OpenJDK Runtime Environment Temurin-24+36 (build 24+36)
OpenJDK 64-Bit Server VM Temurin-24+36 (build 24+36, mixed mode, sharing)

Ubuntu 24.04 WSL

A DESCRIPTION OF THE PROBLEM :
In JShell, System.in replaces each supplementary character (a Unicode character outside the BMP) read from a UTF-8 terminal with two `?` (ASCII question mark) bytes.
In the following result, 63 is the ASCII code of `?`.
All BMP characters appear to be preserved as is, for example "¥" in the following result, or "あ1" (あ = 3 bytes in UTF-8).
Possibly each code unit of the supplementary character's surrogate pair is being converted to UTF-8 separately.
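
The hypothesis above can be illustrated with a minimal sketch. It is not JShell's actual code; the class and method names are hypothetical. It only shows that encoding each UTF-16 code unit on its own produces the same bytes as the ACTUAL result below, because `String.getBytes` substitutes `?` for a lone surrogate.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not taken from the JShell sources.
class SurrogateEncodingSketch {
    // Encodes the whole string at once: a surrogate pair becomes one 4-byte UTF-8 sequence.
    static byte[] encodeWhole(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Assumed faulty path: encodes each UTF-16 code unit separately.
    // A lone surrogate cannot be encoded, so getBytes substitutes '?' (63).
    static byte[] encodePerCodeUnit(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char c : s.toCharArray()) {
            byte[] b = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String input = "\uD83D\uDC4D11"; // U+1F44D (thumbs up) followed by "11"
        // First 4 bytes of the correct encoding: [-16, -97, -111, -115]
        System.out.println(Arrays.toString(Arrays.copyOf(encodeWhole(input), 4)));
        // First 4 bytes of the per-code-unit encoding: [63, 63, 49, 49]
        System.out.println(Arrays.toString(Arrays.copyOf(encodePerCodeUnit(input), 4)));
    }
}
```

The same per-code-unit path leaves BMP characters such as "¥" intact, which is consistent with the observation that only supplementary characters are affected.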

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. On Windows, first run `chcp 65001` (CMD) or `[Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::UTF8` (PowerShell)
2. Launch `jshell`
3. Type `new String(System.in.readNBytes(4))` and press Enter to run
4. Input `👍11` and press Enter

Note: 👍 can be replaced with another supplementary character. 11 can be replaced with 2 other ASCII characters, or any one character encoded to 2 bytes in UTF-8 (e.g. ¥ or π).

5. Repeat steps 3 and 4 with `System.in.readNBytes(4)`
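
For comparison outside JShell, the same 4-byte read can be sketched as a standalone program. The class name `ReadFourBytes` is hypothetical, and the report does not state how a plain `java` launch behaves on the affected terminals, so this is only a baseline to try, not a confirmed contrast.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical comparison program; not part of the original report.
class ReadFourBytes {
    // Reads up to 4 bytes from the stream, mirroring System.in.readNBytes(4) in the steps.
    static byte[] readFour(InputStream in) throws IOException {
        return in.readNBytes(4);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = readFour(System.in);
        // Raw bytes as received from the terminal.
        System.out.println(Arrays.toString(bytes));
        // The same bytes decoded explicitly as UTF-8.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

It can be run with `java ReadFourBytes.java` and fed the same input as step 4.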

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
jshell> new String(System.in.readNBytes(4))
👍11
$1 ==> "👍"

jshell> System.in.readNBytes(4)
👍11
$2 ==> byte[4] { -16, -97, -111, -115 }

jshell> new String(System.in.readNBytes(4))
👍¥
$3 ==> "👍"

jshell> System.in.readNBytes(4)
👍¥
$4 ==> byte[4] { -16, -97, -111, -115 }
ACTUAL -
jshell> new String(System.in.readNBytes(4))
👍11
$1 ==> "??11"

jshell> System.in.readNBytes(4)
👍11
$2 ==> byte[4] { 63, 63, 49, 49 }

jshell> new String(System.in.readNBytes(4))
👍¥
$3 ==> "??¥"

jshell> System.in.readNBytes(4)
👍¥
$4 ==> byte[4] { 63, 63, -62, -91 }

---------- BEGIN SOURCE ----------
// JShell only.
---------- END SOURCE ----------


Comments
Changeset: e961b13c
Branch: master
Author: Jan Lahoda <jlahoda@openjdk.org>
Date: 2025-05-20 06:04:33 +0000
URL: https://git.openjdk.org/jdk/commit/e961b13cd68bc352b86af17c7e53df8537519beb
20-05-2025

A pull request was submitted for review.
Branch: master
URL: https://git.openjdk.org/jdk/pull/25079
Date: 2025-05-07 06:44:54 +0000
07-05-2025

Observations on Windows 11:
JDK 21ea+16: Passed. No '??' observed.
JDK 21ea+17: Failed, '??' returned.
JDK 25ea+6: Failed.
05-05-2025

Impact -> H (Regression)
Likelihood -> L (Uncommon uses)
Workaround -> M (Somewhere in-between the extremes)
Priority -> P3
05-05-2025