JDK-8124977 : cmdline encoding challenges on Windows
  • Type: Bug
  • Component: core-libs
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS:
    windows,windows_nt,windows_2008,windows_vista,windows_7,windows_2012,windows_8
  • Submitted: 2015-06-17
  • Updated: 2017-02-16
  • Version: tbd_major (Other), Unresolved
Description
Motivated by the discussion here:
  http://stackoverflow.com/questions/11927518/java-unicode-utf-8-and-windows-command-prompt

As well as this code:

=====
class Main {
  public static void main(String[] args) throws Exception {
    for (int i = 0; i < args.length; ++i) {
      if (i > 0) {
        System.out.print(' ');
      }
      System.out.print(args[i]);
    }
    System.out.println();
  }
}
=====
 
Create a batch file with the following text, saved as UTF-8 without a BOM, and run it from the command line. The resulting "f.txt" does not contain the same characters as the input.
 
=========
chcp 65001
java Main ������ ������ > f.txt
==========
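
When reproducing this, it can help to take console rendering out of the picture. The following is a minimal diagnostic sketch, not part of the JDK: ArgDump and toCodePoints are hypothetical names. It prints each argument as Unicode code points using ASCII-only output, so whatever ends up in f.txt shows what the JVM actually received.

```java
// Diagnostic sketch: ArgDump is a hypothetical helper, not part of the JDK.
class ArgDump {
  // Render a string as space-separated Unicode code points, e.g. "U+0041 U+00E9".
  static String toCodePoints(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("U+%04X", cp));
    });
    return sb.toString();
  }

  public static void main(String[] args) {
    for (int i = 0; i < args.length; ++i) {
      // The output is pure ASCII, so the console code page cannot distort it.
      System.out.println("args[" + i + "] = " + toCodePoints(args[i]));
    }
  }
}
```

If an argument shows up as a run of U+003F ('?') or U+FFFD, the characters were most likely lost before main() was called rather than when they were printed.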

A good starting point on language issues in the Windows console is this post, along with the rest of the same blog: http://www.siao2.com/2010/10/07/10072032.aspx

There are multiple areas involved in this problem.

First is how the command arguments are passed to an app. PowerShell appears to pass them differently than cmd.exe. With cmd.exe, after calling chcp 65001, I see that the args are kept in wchar_t as UCS-2. With PowerShell, with [Console]::OutputEncoding set to 437, 1252 and UTF-8, they appeared to be in char with UTF-8 encoding.
NOTE: chcp is a command-line tool that calls SetConsoleOutputCP(). As far as I can see, a process should not call SetConsoleOutputCP() itself.

The second is how the command arguments are retrieved by an app:

int main(int argc, char**argv)
vs
int wmain(int argc, wchar_t**argv)
vs
char* GetCommandLineA()
vs
wchar_t* GetCommandLineW()

The JDK uses GetCommandLineA and should use GetCommandLineW to support Unicode args. This change should be controlled by a java command-line option to ensure compatibility.
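
To illustrate why a narrow (char-based) path cannot carry arbitrary Unicode, here is a small Java sketch that round-trips a string through a single-byte charset. It is only an analogy: NarrowRoundTrip is a hypothetical class, ISO-8859-1 is used as a stand-in for whatever ANSI code page is active, and it does not call GetCommandLineA or reproduce the launcher's actual conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Analogy only: NarrowRoundTrip is a hypothetical class, not JDK launcher code.
class NarrowRoundTrip {
  // Encode to the given charset and decode again, roughly what happens when
  // text is squeezed through a char* API in some code page.
  static String roundTrip(String s, Charset narrow) {
    return new String(s.getBytes(narrow), narrow);
  }

  public static void main(String[] args) {
    String sample = "\u65e5\u672c\u8a9e"; // three CJK characters
    // ISO-8859-1 cannot represent them; String.getBytes substitutes '?'.
    System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1).equals(sample)); // false
    // UTF-8 can represent every code point, so the text survives.
    System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample)); // true
  }
}
```

Once the substitution has happened the original characters are gone, which is why the retrieval has to use the wide API (or a code page that can represent the input) from the start.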

Another area is the output streams (stdout, stderr). These are involved when using > or | to send the results to a file or another process, and when writing to the console. This turns out to involve complex logic: WriteConsoleW for console output, WriteFile for > and |, and a final fallback to writing ASCII in the code page returned by GetConsoleOutputCP().
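
For the redirected case (> and |) only, an application can sidestep the default encoding by wrapping stdout in a PrintStream with an explicit charset. The sketch below assumes UTF-8 is the desired file encoding. Utf8Out is a hypothetical class, and this does not address console output, where the WriteConsoleW logic described above would have to live in native code.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Sketch for redirected output only: Utf8Out is a hypothetical helper.
class Utf8Out {
  // Wrap a byte stream so that printed text is always encoded as UTF-8.
  static PrintStream utf8(OutputStream out) {
    try {
      return new PrintStream(out, true, StandardCharsets.UTF_8.name());
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is a required charset on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // Helper to inspect the exact bytes that would reach a file or pipe.
  static byte[] encode(String s) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    PrintStream ps = utf8(buf);
    ps.print(s);
    ps.flush();
    return buf.toByteArray();
  }

  public static void main(String[] args) {
    // Same behavior as the Main class in this report, but with UTF-8 output.
    PrintStream out = utf8(System.out);
    out.println(String.join(" ", args));
  }
}
```

This only helps if the arguments arrived intact in the first place, so it is independent of the command-line retrieval issue.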

The last area is getting the consoles to display glyphs for the Unicode characters being tested. The font selected in the cmd and PowerShell windows must be Lucida Console or Consolas. Also, additional language packs must be installed to get fallback fonts for the characters needed. Finally, using a console app (ConEmu, Console+, ...) should enable the proper display of Unicode glyphs for the cmd and PowerShell windows that they start. PowerShell ISE worked when [Console]::OutputEncoding was set to UTF-8.