Bug ID: JDK-8272352 Java launcher can not parse Chinese character when system locale is set to UTF-8

Type: Bug
Component: core-libs
Sub-Component: java.util:i18n
Affected Version: 17,18

Priority: P3
Status: Resolved
Resolution: Fixed
OS: windows
CPU: generic

Submitted: 2021-08-12
Updated: 2022-09-09
Resolved: 2022-05-05

JDK 11	JDK 17	JDK 19
11.0.17 Fixed	17.0.5 Fixed	19 b22 Fixed

Created on behalf of Glavo <zjx001202@gmail.com>
----

After turning on the "Use Unicode UTF-8 for worldwide language support" option (see https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do), the default java launcher cannot parse arguments that contain Chinese characters:

java Foo 你好世界

The String[] args passed to main contain garbled characters instead of the Chinese text.
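
The report does not include the source of Foo. A minimal sketch of such a class, assuming it only needs to show what the launcher handed to main, could print each argument together with its code points, so that garbling stays visible even when the console output itself is unreliable. The class body and the helper name describe are illustrative and not taken from the report.

```java
import java.util.stream.Collectors;

// Hypothetical reproduction class; the report only shows the command "java Foo ...".
class Foo {
    // Render a string as its Unicode code points, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            // Length and code points do not depend on how the console renders text.
            System.out.println("args[" + i + "] length=" + args[i].length()
                    + " codepoints=" + describe(args[i]));
        }
    }
}
```

If the argument is decoded correctly, the four characters of the example should be reported as a single argument of length 4 with code points U+4F60 U+597D U+4E16 U+754C. Any other result suggests the argument was decoded with the wrong charset.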

Backport of JDK-8272352: Java launcher can not parse Chinese character when system locale is set to UTF-8. The change does not apply cleanly to 11u because of a minor change for UTF-8 support that was otherwise included in JDK-8264208: Console charset API, which has not been backported to 11u. (I added case 65001: for UTF-8 on line 83.) Tier1 tests pass (GitHub Actions). The fix was confirmed locally (as it was in 17u), since there isn't a specific test.
18-07-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1234 Date: 2022-07-16 00:38:20 +0000
16-07-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1228 Date: 2022-07-14 19:27:47 +0000
14-07-2022
Fix request [posted on behalf of Stephanie Crater] Backport to allow java to correctly parse Chinese characters in file paths and string arguments passed to java.exe. The Java runtime has been detecting the Windows system locale encoding using GetLocaleInfo(GetSystemDefaultLCID(), LOCALE_IDEFAULTANSICODEPAGE, ...), but that call returns the legacy ANSI code page value, e.g., 1252 for US-English. To detect whether the user has selected UTF-8 as the default, the code page has to be queried with GetACP(). In addition, if the call to GetLocaleInfo fails, the code now falls back to UTF-8 instead of Cp1252. Clean backport, low risk; the fix was confirmed to work locally. (Note that there is no jtreg test, as in the original commit, because testing requires Windows configuration changes and a reboot.)
06-07-2022
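
Since the fix request above relies on local confirmation rather than a jtreg test, one possible way to inspect what the runtime detected is to print the encoding-related system properties. The sketch below is an assumption about how such a check might look, not the procedure the backporters used. The class name EncodingReport and the "<unset>" placeholder are illustrative; sun.jnu.encoding and sun.stdout.encoding are internal properties named in this issue, and native.encoding may be absent on older releases.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical diagnostic helper; not part of the JDK or of the fix.
class EncodingReport {
    // Properties of interest; some are internal and may be unset on a given release.
    static final String[] KEYS = {
        "sun.jnu.encoding", "sun.stdout.encoding", "native.encoding", "file.encoding"
    };

    // Collect the values in a stable order, marking missing ones explicitly.
    static Map<String, String> collect(Properties props) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : KEYS) {
            out.put(key, props.getProperty(key, "<unset>"));
        }
        return out;
    }

    public static void main(String[] args) {
        collect(System.getProperties())
                .forEach((k, v) -> System.out.println(k + " = " + v));
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());
    }
}
```

On a Windows system with the UTF-8 option enabled, a build containing the fix would be expected to report UTF-8 for sun.jnu.encoding, while an unfixed build would be expected to report the legacy code page.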
A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/530 Date: 2022-07-05 17:44:08 +0000
05-07-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/522 Date: 2022-07-01 19:11:42 +0000
01-07-2022
Changeset: 22934485 Author: Naoto Sato <naoto@openjdk.org> Date: 2022-05-05 19:59:58 +0000 URL: https://git.openjdk.java.net/jdk/commit/229344853126692d38ff7cb164dd5d17c5bf7fbb
05-05-2022
A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/8434 Date: 2022-04-27 20:23:32 +0000
27-04-2022
Relevant information on MS site (https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page)
27-04-2022
To be precise, GetACP() returns 65001 if the checkbox is checked, but GetLocaleInfo(GetSystemDefaultLCID(), LOCALE_IDEFAULTANSICODEPAGE, ...) returns 1252. The call should be replaced with GetACP().
27-04-2022
The issue here is that there is no public documentation from MS on the "Beta: Use Unicode UTF-8 for worldwide language support" check box. The JDK uses the ANSI code page (CP_ACP) to translate the path/argument strings into Java Strings, but GetACP() returns 1252 even if the check box is checked. It seems that the check box only affects the OEM code page (setting it to 65001), hence this discrepancy. I would wait for a clearer definition of the functionality from MS (maybe Windows 11 will provide one?).
25-08-2021
The following piece will fix it, but it would also mean that all JNI-related platform strings would be affected:
---
$ git diff
diff --git a/src/java.base/share/classes/sun/launcher/LauncherHelper.java b/src/java.base/share/classes/sun/launcher/LauncherHelper.java
index 82b73d01c6b..985f33ce3c7 100644
--- a/src/java.base/share/classes/sun/launcher/LauncherHelper.java
+++ b/src/java.base/share/classes/sun/launcher/LauncherHelper.java
@@ -877,6 +877,7 @@ public final class LauncherHelper {
     }

     private static final String encprop = "sun.jnu.encoding";
+    private static final String stdoutprop = "sun.stdout.encoding";
     private static String encoding = null;
     private static boolean isCharsetSupported = false;

@@ -887,7 +888,7 @@ public final class LauncherHelper {
     static String makePlatformString(boolean printToStderr, byte[] inArray) {
         initOutput(printToStderr);
         if (encoding == null) {
-            encoding = System.getProperty(encprop);
+            encoding = System.getProperty(stdoutprop, System.getProperty(encprop));
             isCharsetSupported = Charset.isSupported(encoding);
         }
         try {
---
24-08-2021
Looking into this issue, it is not as simple as reverting the change above. The issue here is that the launcher is using the encoding from `sun.jnu.encoding` (= windows-1252) while the code tries to read it as `UTF-8`. Before the above fix, it happened to work because System.out's encoding was set to `windows-1252` (note that setting `cp65001` throws an exception in `setOut0()`, falling back to `windows-1252`), which simply passes the UTF-8 bytes for Ni-Hao through.
19-08-2021
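
The pass-through effect described in the comment above can be illustrated with a simplified model: UTF-8 bytes are decoded as windows-1252 (producing garbled text in memory), and encoding that text again as windows-1252 yields the original bytes, which a UTF-8 console then displays correctly. The class and method names below are illustrative. The sketch assumes the windows-1252 charset is available in the running JDK and only models the byte handling, not the launcher itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified, hypothetical model of the decoding mismatch; not launcher code.
class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode the UTF-8 bytes of a string with the wrong (single-byte) charset.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // True if re-encoding the misdecoded text as windows-1252 restores the UTF-8 bytes.
    static boolean passesThrough(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] roundTripped = misdecode(original).getBytes(CP1252);
        return Arrays.equals(utf8, roundTripped);
    }

    public static void main(String[] args) {
        String s = "\u4F60\u597D\u4E16\u754C"; // the four characters from the report
        System.out.println("original length   = " + s.length());
        System.out.println("misdecoded length = " + misdecode(s).length());
        System.out.println("bytes pass through windows-1252 = " + passesThrough(s));
    }
}
```

Once the output encoding is UTF-8 instead, the garbled in-memory text is encoded as UTF-8 and the mismatch becomes visible. The pass-through may also not hold for strings whose UTF-8 bytes include values that windows-1252 leaves undefined.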
Looks like this is a regression caused by the fix to JDK-8266774. The following backout reverts the regression:
---
diff --git a/src/java.base/windows/native/libjava/java_props_md.c b/src/java.base/windows/native/libjava/java_props_md.c
index b3c16a453d7..754725264eb 100644
--- a/src/java.base/windows/native/libjava/java_props_md.c
+++ b/src/java.base/windows/native/libjava/java_props_md.c
@@ -147,8 +147,8 @@ static char* getConsoleEncoding()
     cp = GetConsoleCP();
     if (cp >= 874 && cp <= 950)
         sprintf(buf, "ms%d", cp);
-    else if (cp == 65001)
-        sprintf(buf, "UTF-8");
+//    else if (cp == 65001)
+//        sprintf(buf, "UTF-8");
     else
         sprintf(buf, "cp%d", cp);
     return buf;
---
16-08-2021
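
To make the effect of the backout in the getConsoleEncoding() diff above easier to see, the name selection can be transcribed into Java. This is only a sketch of the branch logic shown in the diff. The class and method names are illustrative, and the real code is native C that obtains cp from GetConsoleCP().

```java
import java.nio.charset.Charset;

// Hypothetical Java transcription of the C branch logic; not JDK code.
class ConsoleEncodingName {
    // Name selection with the JDK-8266774 branch for code page 65001 present.
    static String nameFor(int cp) {
        if (cp >= 874 && cp <= 950) {
            return "ms" + cp;
        } else if (cp == 65001) {
            return "UTF-8";
        } else {
            return "cp" + cp;
        }
    }

    // Name selection with the 65001 branch commented out, as in the backout.
    static String nameForBackout(int cp) {
        if (cp >= 874 && cp <= 950) {
            return "ms" + cp;
        } else {
            return "cp" + cp;
        }
    }

    public static void main(String[] args) {
        for (int cp : new int[] {437, 936, 65001}) {
            String fixed = nameFor(cp);
            String backout = nameForBackout(cp);
            // Charset.isSupported shows whether the running JDK knows each name.
            System.out.println(cp + ": " + fixed + " (supported=" + Charset.isSupported(fixed)
                    + "), backout: " + backout + " (supported=" + Charset.isSupported(backout) + ")");
        }
    }
}
```

The two variants differ only for code page 65001, where the backout yields the name cp65001 instead of UTF-8.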
I tried it again with my latest build, and it looks garbled. Will look into it.
16-08-2021
Can the submitter provide more information? I tried to reproduce the issue, but I got exactly the opposite result (which is what I expected): the Chinese string cannot be displayed in a non-UTF-8 command prompt (cp437, pic1), where it is displayed as four '?'s, while in a UTF-8 command prompt it is displayed correctly (pic2).
15-08-2021
Past report of the same problem: JDK-6584897 - Will Not Fix
Open issues related to this: JDK-8124977 cmdline encoding challenges on Windows
13-08-2021
Moved to tools -> launcher.
13-08-2021

Relates :	JDK-8124977 - cmdline encoding challenges on Windows
Relates :	JDK-8260265 - UTF-8 by Default
Relates :	JDK-8266774 - System property values for stdout/err on Windows UTF-8
Relates :	JDK-6584897 - Cannot invoke class from command line with args containing non-ASCII characters