JDK-8177951 : Charset problem when the name of the sound device contains Chinese character.
  • Type: Bug
  • Component: client-libs
  • Sub-Component: javax.sound
  • Affected Version: 8u121
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: other
  • CPU: x86
  • Submitted: 2017-04-02
  • Updated: 2017-09-08
  • Resolved: 2017-08-31
JDK 10: 10 b23 (Fixed)
Description
FULL PRODUCT VERSION :
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Windows 10 64-bit build 15061
Simplified Chinese version. System default charset is GBK.

EXTRA RELEVANT SYSTEM CONFIGURATION :
Realtek built-in sound card.

A DESCRIPTION OF THE PROBLEM :
I'm working on a program that uses the Java Sound API. When I enumerate the names of the sound devices, the Chinese characters in the names (encoded in the system default charset, GBK) come back garbled.
For example:
The name in Control Panel: "扬声器 (Realtek High Definition Audio)"
The name returned by Mixer.getMixerInfo().getName(): "ÑïÉùÆ÷ (Realtek High Definition Audio)"

I don't have a Linux platform so I can't test under that. :(

The problem is caused by a mistake on Java's side: Java re-encodes the raw GBK bytes as UTF-8, treating each byte as if it were a character. For example, the GBK code of the character '扬' is 0xD1EF (2 bytes) = 0b11010001 0b11101111.
So what happens when Java reads it? Java encodes each byte as UTF-8 (see https://en.m.wikipedia.org/wiki/UTF-8#Description): it splits the first byte into 0b11 and 0b010001, adds the prefix 0b110 (padded with zeros) to the first part and the prefix 0b10 to the second, and produces 0b11000011 0b10010001 = 0xC391. Doing the same with the second byte gives 0b11000011 0b10101111 = 0xC3AF.
So here is the problem: the value Java encodes is not a Unicode code point but a GBK byte, which makes the original text hard to recover. This should be fixed in a future release.
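The bit arithmetic above can be checked with a few lines of Java. This is only a sketch of the effect described, not the JDK's native code path: it assumes the mis-encoding is equivalent to reading each native byte as a code point in the range U+0000 to U+00FF and then encoding the result as UTF-8. The class and method names (MojibakeDemo, misencode, hex) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: treats every native byte as a code point
    // U+0000..U+00FF (ISO-8859-1) and then encodes the result as UTF-8,
    // which is the effect described in the report.
    static byte[] misencode(byte[] nativeBytes) {
        String asLatin1 = new String(nativeBytes, StandardCharsets.ISO_8859_1);
        return asLatin1.getBytes(StandardCharsets.UTF_8);
    }

    // Formats a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two GBK bytes from the example above (0xD1EF).
        byte[] gbk = {(byte) 0xD1, (byte) 0xEF};
        System.out.println(hex(misencode(gbk))); // prints C391C3AF
    }
}
```

Each input byte at or above 0x80 becomes a two-byte UTF-8 sequence, which is why a two-byte GBK character turns into four bytes.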

P.S. Thanks to dram, who worked out how Java mis-encodes the name and how to work around it.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Use the code below:
Mixer.Info[] mi = AudioSystem.getMixerInfo();
for (Mixer.Info info : mi) {
    System.out.println("info: " + info);
    Mixer m = AudioSystem.getMixer(info);
    System.out.println("mixer " + m);
    Line.Info[] sl = m.getSourceLineInfo();
    for (Line.Info info2 : sl) {
        System.out.println("    info: " + info2);
        Line line = AudioSystem.getLine(info2);
        if (line instanceof SourceDataLine) {
            SourceDataLine source = (SourceDataLine) line;

            DataLine.Info i = (DataLine.Info) source.getLineInfo();
            for (AudioFormat format : i.getFormats()) {
                System.out.println("    format: " + format);
            }
        }
    }
    System.out.println("");
}
(from http://stackoverflow.com/questions/12863081/how-do-i-get-mixer-channels-layout-in-java)

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
on my computer:
���������������?
扬声器 (Realtek High Definition Audio)
Line 1 (Virtual Audio Cable)
Line 2 (Virtual Audio Cable)

P.S. The name on the first line was already broken at the source, or was broken when it was read.
ACTUAL -
����������������������
ÑïÉùÆ÷ (Realtek High Definition Audio)
Line 1 (Virtual Audio Cable)
Line 2 (Virtual Audio Cable)

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import javax.sound.sampled.*;
import java.util.Arrays;

import static javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED;

public class SoundEnumerator {
    public static void main(String[] args) throws LineUnavailableException {
        Mixer.Info[] mis = AudioSystem.getMixerInfo();
        for (Mixer.Info mi : mis) {
            Mixer m = AudioSystem.getMixer(mi);
            if (isMixerUsable(m)) {
                System.out.println(mi.getName());
            }
        }
    }

    private static boolean isMixerUsable(Mixer m) throws LineUnavailableException {
        final int[] count = {0};
        Arrays.stream(m.getSourceLineInfo())
                .filter((it) -> it instanceof SourceDataLine.Info)
                .filter((it) -> {
                    try {
                        return m.getLine(it) instanceof SourceDataLine;
                    } catch (LineUnavailableException e) {
                        return false;
                    }
                })
                .forEach((it) -> Arrays.stream(((DataLine.Info) it).getFormats())
                        .filter((af) -> !af.isBigEndian())
                        .filter((af) -> af.getEncoding() == PCM_SIGNED)
                        .filter((af) -> af.getSampleSizeInBits() != 24)
                        .forEach((af) -> count[0]++));
        return count[0] != 0;
    }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use the method below to decode the broken UTF-8 bytes and reinterpret them as a string in the system charset.

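import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

    private static String deMessyCode(String messyCode) {
        // java.io.ByteArrayOutputStream stands in for the JDK-internal ByteOutputStream.
        ByteArrayOutputStream buf = new ByteArrayOutputStream(messyCode.length());
        byte[] originalBytes = messyCode.getBytes(StandardCharsets.UTF_8);

        for (int i = 0; i < originalBytes.length; i++) {
            int b = originalBytes[i] & 0xFF;
            // Only a 2-byte lead (110xxxxx) is collapsed; other bytes pass through.
            if ((b & 0xE0) == 0xC0 && i + 1 < originalBytes.length) {
                // Drop the 110 prefix of the first byte and the 10 prefix of the
                // second, then join the payload bits into the original single byte.
                buf.write(((b & 0x1F) << 6) | (originalBytes[i + 1] & 0x3F));
                i++;
            } else {
                buf.write(b);
            }
        }

        // Reinterpret the recovered bytes in the system default charset (GBK here).
        return new String(buf.toByteArray(), Charset.forName(System.getProperty("file.encoding")));
    }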

The drawback of this solution is that you must know the system default encoding, and a genuine Unicode character that encodes to 2 bytes in UTF-8 will also be "decoded" by this method.
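A shorter variant of the same idea is sketched below. It assumes that every character of the broken name lies in U+0000 to U+00FF, so the original native bytes can be recovered by encoding the string as ISO-8859-1 and then decoding them in the native charset. MixerNameRepair and repair are hypothetical names and are not part of the Java Sound API. The guard only skips names that already contain characters above U+00FF, so a name that legitimately contains characters between U+0080 and U+00FF would still be mangled, which is the same limitation noted above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixerNameRepair {
    // Hypothetical helper: recovers the native bytes from a name whose bytes
    // were each turned into one char, then decodes them in nativeCharset.
    static String repair(String name, Charset nativeCharset) {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 0xFF) {
                // Real Unicode text is present, so the name is assumed to be intact.
                return name;
            }
        }
        byte[] raw = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, nativeCharset);
    }

    public static void main(String[] args) {
        // The broken name from the report, written with escapes to stay ASCII-safe.
        String broken = "\u00D1\u00EF\u00C9\u00F9\u00C6\u00F7 (Realtek High Definition Audio)";
        // GBK is an assumption taken from the report; it may be missing on a reduced runtime.
        if (Charset.isSupported("GBK")) {
            System.out.println(repair(broken, Charset.forName("GBK")));
        }
    }
}
```

Passing the charset as a parameter keeps the guess about the native encoding explicit at the call site instead of hiding it behind file.encoding.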


Comments
http://mail.openjdk.java.net/pipermail/sound-dev/2017-June/000565.html
29-06-2017

Adding update from the submitter:

"Unfortunately, the problem only appears with the non-English version of the Realtek sound card driver. My guess is that it affects all sound card drivers, not only Realtek, so you can reproduce the issue with any non-English driver. As long as the name of the sound device contains non-English, non-UTF-8 characters, the issue will appear. I'm working on modifying a sound driver so that you can reproduce the issue on an English OS. Please be patient."

"I regret to inform you that after 3 days of research I have found no way to reproduce this problem on an English version of Windows (modifying the driver failed). The issue itself is a localization and charset issue, so there is no problem at all on the English version. Maybe you can try to reinstall the driver on a Chinese/Japanese/Korean Windows and then switch back to English."

In JDK 9, in jdk/src/java.desktop/windows/native/libjsound/PLATFORM_API_WinOS_DirectSound.cpp, function DS_GetDesc_Enum, line 236, the name of the device is obtained from the OS in the ANSI charset, as an LPCSTR, and the ANSI-encoded string is simply copied into the DirectAudioDeviceDescription struct. Then, in jdk/src/java.desktop/share/native/libjsound/DirectAudioDeviceProvider.c, functions getDirectAudioDeviceDescription and Java_com_sun_media_sound_DirectAudioDeviceProvider_nNewDirectAudioDeviceInfo, lines 48 and 98, NewStringUTF is called with that ANSI-encoded string. So we get a UTF-8-encoded ANSI string, when what we need is a UTF-8-encoded Unicode string.

I wrote some code to fix this issue. It just converts the ANSI string to a UTF-8 string. Because I have no build environment, I can't compile and test the patch, but I have tested the ANSIToUTF8 function in VS2015. I'll upload the patch so that you can try applying it to the JDK and see if it works. I'll keep trying to compile and debug it myself, but without any warranty.

A picture of the first line of the correct output of SoundEnumerator will also be attached.
27-06-2017
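The conversion the submitter describes (ANSI string to UTF-8 before the string is handed to the JVM) can be illustrated in Java. This is not the submitter's native ANSIToUTF8 patch, which is not shown in this report. It is a sketch of the same idea that assumes the ANSI code page is known and available as a Java Charset; AnsiToUtf8Sketch and ansiToUtf8 are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AnsiToUtf8Sketch {
    // Hypothetical helper: decodes the native bytes in the ANSI code page first,
    // so that the UTF-8 output encodes real Unicode code points.
    static byte[] ansiToUtf8(byte[] ansi, Charset codePage) {
        String decoded = new String(ansi, codePage);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xD1EF is the GBK example character from the description.
        byte[] ansi = {(byte) 0xD1, (byte) 0xEF};
        // GBK is an assumption taken from the report; it may be missing on a reduced runtime.
        if (Charset.isSupported("GBK")) {
            byte[] utf8 = ansiToUtf8(ansi, Charset.forName("GBK"));
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            System.out.println(hex);
        }
    }
}
```

With the decode step in place the output is the UTF-8 encoding of the character itself, not of its two GBK bytes taken one at a time.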

Checked this on Windows 10 Enterprise (b1607) with JDK 8u121 and 9 ea and couldn't completely confirm the issue. Switched language and keyboard input to Simplified Chinese.
=================================
8u121:
>java SoundEnumerator
???????
Speakers (High Definition Audio Device)
=================================
Wrote back to the submitter with the result and a request to share the additional information needed to reproduce this issue.
03-04-2017