Bug ID: JDK-4296969 Incorrect behaviours of several character converters

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 1.2.2,1.3.0

Priority: P4
Status: Closed
Resolution: Won't Fix
OS: generic,windows_nt,windows_2000
CPU: generic,x86

Submitted: 1999-12-05
Updated: 2001-03-23
Resolved: 2001-03-23

Related Reports

Relates :	JDK-4140796 - ISO2022CN_CNS & ISO2022CN_GB charset not supported in BufferedReader.
Relates :	JDK-4333733 - unix: method String.getBytes(String enc) throws java.lang.InternalError
Relates :	JDK-4361835 - Mapping mistakes in JIS0201, JIS0208, JIS0212, and SHIFTJIS.
Relates :	JDK-4429358 - Need to remove illegal SI/SO char to byte mappings for Cp93(0|3|5|7|9) encoders
Relates :	JDK-4429369 - ISO2022CN and ISO2022KR converters throw exception in response to illegal escape
Relates :	JDK-4429377 - IBM character converters: Need to remove some apparently obsolete mappings.

Description

'UN2' indicates a mapping to the '\u001A' character.
'UN3' indicates a mapping to no character ('').
'MIS' indicates a mapping from one character to an entirely different character
(other than UN1 or UN2).

(For the MacDingbat encoding, every mismatch mapping was to '\u271F'.)
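
The categories in this legend can be expressed as a small classification routine. The following is a minimal sketch, not the submitter's actual test harness: the class RoundTripClassifier, its Outcome enum, and the extra OK label are hypothetical names introduced here, and it relies on the java.nio.charset.Charset overloads of String, which postdate the releases covered by this report. EX1 and EX2 (exceptions) are not modeled, because these overloads substitute rather than throw, so counts on a current JDK will not match the figures listed below.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: classifies one Unicode -> bytes -> Unicode round trip. */
final class RoundTripClassifier {

    /** UN1, UN2, UN3 and MIS follow the report's legend; OK is added for a clean round trip. */
    enum Outcome { OK, UN1, UN2, UN3, MIS }

    /** Compares the original char with the string obtained after decoding. */
    static Outcome classify(char original, String decoded) {
        if (decoded.isEmpty()) {
            return Outcome.UN3;                 // mapped to no character at all
        }
        if (decoded.length() == 1) {
            char back = decoded.charAt(0);
            if (back == original) {
                return Outcome.OK;              // round trip preserved the character
            }
            if (back == '?') {
                return Outcome.UN1;             // replaced by U+003F
            }
            if (back == 0x1A) {
                return Outcome.UN2;             // replaced by U+001A (SUB)
            }
        }
        return Outcome.MIS;                     // came back as something else entirely
    }

    /** Encodes and decodes a single char with the given charset, then classifies it. */
    static Outcome roundTrip(char c, Charset cs) {
        byte[] bytes = String.valueOf(c).getBytes(cs);
        return classify(c, new String(bytes, cs));
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(roundTrip('A', latin1));
        System.out.println(roundTrip((char) 0x4E00, latin1));
    }
}
```

Keeping classify separate from the encode/decode step means the same labels can be applied to output captured from any converter, including the legacy sun.io ones.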

For 8859_1,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_2,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_3,	  EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_4,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_5,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_6,	  EX1 = 0, EX2 = 0, UN1 = 64300, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_7,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_8,	  EX1 = 0, EX2 = 0, UN1 = 64293, UN2 = 0, UN3 = 1024, MIS = 0.
For 8859_9,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Big5,	  EX1 = 1024, EX2 = 0, UN1 = 50680, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteBig5]

For CNS11643,	  EX1 = 0, EX2 = 0, UN1 = 47696, UN2 = 0, UN3 = 1, MIS = 0.
For Cp037,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1006,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1025,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1026,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1046,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1097,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1098,	  EX1 = 0, EX2 = 0, UN1 = 64258, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1112,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1122,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1123,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp1124,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1250,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1251,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1252,	  EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1253,	  EX1 = 0, EX2 = 0, UN1 = 64272, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1254,	  EX1 = 0, EX2 = 0, UN1 = 64262, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1255,	  EX1 = 0, EX2 = 0, UN1 = 64284, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1256,	  EX1 = 0, EX2 = 0, UN1 = 64263, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1257,	  EX1 = 0, EX2 = 0, UN1 = 64267, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1258,	  EX1 = 0, EX2 = 0, UN1 = 64264, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1381,	  EX1 = 0, EX2 = 0, UN1 = 55022, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp1383,	  EX1 = 0, EX2 = 0, UN1 = 55517, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp273,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp277,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp278,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp280,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp284,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp285,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp297,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp33722,	  EX1 = 0, EX2 = 0, UN1 = 55140, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp420,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64263, UN3 = 1024, MIS = 0.
For Cp424,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64293, UN3 = 1024, MIS = 0.
For Cp437,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp500,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp737,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp775,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp838,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64260, UN3 = 1024, MIS = 0.
For Cp850,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp852,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp855,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp857,	  EX1 = 0, EX2 = 0, UN1 = 64258, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp860,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp861,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp862,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp863,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp864,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp865,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp866,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp868,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp869,	  EX1 = 0, EX2 = 0, UN1 = 64264, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp870,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp871,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp874,	  EX1 = 0, EX2 = 0, UN1 = 64291, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp875,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64261, UN3 = 1024, MIS = 0.
For Cp918,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 64255, UN3 = 1024, MIS = 0.
For Cp921,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp922,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp930,	  EX1 = 11635, EX2 = 0, UN1 = 52648, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp930]

For Cp933,	  EX1 = 10888, EX2 = 0, UN1 = 53406, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp933]

For Cp935,	  EX1 = 9356, EX2 = 0, UN1 = 54990, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp935]

For Cp937,	  EX1 = 20075, EX2 = 0, UN1 = 44273, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp937]

For Cp939,	  EX1 = 11635, EX2 = 0, UN1 = 52648, UN2 = 0, UN3 = 1026, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteCp939]

For Cp942,	  EX1 = 0, EX2 = 0, UN1 = 55170, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp948,	  EX1 = 0, EX2 = 0, UN1 = 44305, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp949,	  EX1 = 0, EX2 = 0, UN1 = 54144, UN2 = 0, UN3 = 1024, MIS = 130.
For Cp950,	  EX1 = 0, EX2 = 0, UN1 = 44308, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp964,	  EX1 = 0, EX2 = 0, UN1 = 44278, UN2 = 0, UN3 = 1024, MIS = 0.
For Cp970,	  EX1 = 0, EX2 = 0, UN1 = 55819, UN2 = 0, UN3 = 1024, MIS = 122.
For EUCJIS,	  EX1 = 1024, EX2 = 0, UN1 = 51372, UN2 = 0, UN3 = 0, MIS = 2.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteEUC_JP]

For GB2312,	  EX1 = 1024, EX2 = 0, UN1 = 56938, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteEUC_CN]

For GBK,	  EX1 = 1024, EX2 = 0, UN1 = 40443, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteGBK]

For ISO2022CN_CNS,	  EX1 = 7650, EX2 = 57885, UN1 = 0, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.ArrayIndexOutOfBoundsException]
	Exc2: [java.io.UnsupportedEncodingException: ISO2022CN_CNS]

For ISO2022CN_GB,	  EX1 = 0, EX2 = 65535, UN1 = 0, UN2 = 0, UN3 = 0, MIS = 0.
	Exc2: [java.io.UnsupportedEncodingException: ISO2022CN_GB]

For ISO2022KR,	  EX1 = 0, EX2 = 8224, UN1 = 0, UN2 = 0, UN3 = 57186, MIS = 0.
	Exc2: [java.lang.NullPointerException]

For JIS,	  EX1 = 1024, EX2 = 0, UN1 = 57439, UN2 = 0, UN3 = 3, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteISO2022JP]

For JIS0208,	  EX1 = 1024, EX2 = 0, UN1 = 0, UN2 = 0, UN3 = 57632, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteJIS0208]

For KOI8_R,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For KSC5601,	  EX1 = 1024, EX2 = 0, UN1 = 56159, UN2 = 0, UN3 = 0, MIS = 0.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteEUC_KR]

For MS874,	  EX1 = 0, EX2 = 0, UN1 = 64287, UN2 = 0, UN3 = 1024, MIS = 0.
For MacArabic,	  EX1 = 0, EX2 = 0, UN1 = 64281, UN2 = 0, UN3 = 1024, MIS = 0.
For MacCentralEurope,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacCroatian,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacCyrillic,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacDingbat,	  EX1 = 0, EX2 = 0, UN1 = 0, UN2 = 0, UN3 = 1024, MIS = 64290.
For MacGreek,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For MacHebrew,	  EX1 = 0, EX2 = 0, UN1 = 64297, UN2 = 0, UN3 = 1024, MIS = 0.
For MacIceland,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacRoman,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacRomania,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For MacSymbol,	  EX1 = 0, EX2 = 0, UN1 = 64311, UN2 = 0, UN3 = 1024, MIS = 0.
For MacThai,	  EX1 = 0, EX2 = 0, UN1 = 64261, UN2 = 0, UN3 = 1024, MIS = 0.
For MacTurkish,	  EX1 = 0, EX2 = 0, UN1 = 64256, UN2 = 0, UN3 = 1024, MIS = 0.
For MacUkraine,	  EX1 = 0, EX2 = 0, UN1 = 64255, UN2 = 0, UN3 = 1024, MIS = 0.
For SJIS,	  EX1 = 1024, EX2 = 0, UN1 = 57439, UN2 = 0, UN3 = 0, MIS = 2.
	Exc1: [java.lang.InternalError: Converter malfunction: sun.io.CharToByteSJIS]

For UTF8,	  EX1 = 0, EX2 = 0, UN1 = 1024, UN2 = 0, UN3 = 0, MIS = 0.

I came across this bug while trying to convert between different encodings.  I
was trying to get some idea of the data loss, but because so many different
methods are used to indicate 'no mapping', this was made very difficult.  Much
of this would be addressed by bug 4241124.  I also read several bugs indicating
that not all encodings are 'reversible', which accounts for many of the 'EX2'
errors.  However, what I cannot understand is how I can map a character from
Unicode to byte[] and back to Unicode, and get an entirely different character!
This must be an error in the underlying conversion tables.

I think that at the very least these inconsistencies between encodings should
be documented somewhere.  I had been under the impression that 'no mapping'
would be indicated by '?' in the native form, and with the SUBSTITUTE character
in Unicode.  I was not aware that some characters would be omitted in the
conversion, that different methods would be used to indicate 'no mapping'
within the same encoding, that all sorts of errors could be generated, or that
conversions were not reversible.
(Review ID: 100000)
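
One way to obtain a uniform 'no mapping' signal, whatever substitution a given converter happens to use, is to ask the encoder directly instead of inspecting its output. The sketch below assumes the java.nio.charset API, which was added in J2SE 1.4 and so was not available to the submitter; StrictEncode, encodeOrNull and countUnmappable are illustrative names only. Note that Charset.newEncoder() may throw UnsupportedOperationException for charsets that only support decoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reports unmappable input instead of silently substituting. */
final class StrictEncode {

    /** Returns the encoded bytes, or null if the text is malformed or has no mapping. */
    static byte[] encodeOrNull(String text, Charset cs) {
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer out = enc.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            return null; // malformed or unmappable input was reported
        }
    }

    /** Counts the non-surrogate chars in [from, to] that the charset cannot encode. */
    static int countUnmappable(Charset cs, char from, char to) {
        CharsetEncoder enc = cs.newEncoder();
        int count = 0;
        for (int c = from; c <= to; c++) {
            if (Character.isSurrogate((char) c)) {
                continue; // a lone surrogate is malformed, not unmappable
            }
            if (!enc.canEncode((char) c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Charset ascii = StandardCharsets.US_ASCII;
        System.out.println(encodeOrNull("abc", ascii) != null);
        System.out.println(countUnmappable(ascii, (char) 0x0000, (char) 0x00FF));
    }
}
```

With this approach the caller decides how to represent data loss, rather than having to guess whether a '?' in the output was present in the input or was substituted.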
======================================================================

Name: skT45625			Date: 05/09/2000


java version "1.3.0rc1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0rc1-T)
Java HotSpot(TM) Client VM (build 1.3.0rc1-S, mixed mode)

1. Open the command prompt in Korean Windows 2000
(the default code page is 949).
2. Run the code below as follows:
C:> java -Duser.language=en -Duser.region=US -classpath . ShowLocale
import java.util.Locale;
public class ShowLocale {
        public static void main(String[] args) {
                System.out.println("default locale is " + Locale.getDefault());
        }
}
3. The result is
default locale is ko_KR
I would expect en_US.
4. However, if I change the code page at the console prompt as follows,
C:> chcp 1252
then everything works as expected:
C:> java -Duser.language=blah -Duser.region=YADDA -classpath . ShowLocale
The result is
default locale is blah_YADDA

This problem also occurs in JDK 1.2.2.
(Review ID: 102774)
======================================================================


Name: krT82822			Date: 12/05/99


12/5/99 eval1127@eng -- kestrel RA produces errors for several of the codepages.  Submitting this to supplement existing encoding bugs open for kestrel.

/*
J:\borsotti\jtest>java -version
java version "1.2"
Classic VM (build JDK-1.2-V, native threads)

There are several problems with the character converters.
They can be summarized as follows:

  - converters which are listed in the jdk documentation,
    but do not exist,
  - converters which do not map all Unicode characters, or
    do not decode back (to) what they encoded,
  - converters which crash

This java program tests each converter in turn and reports
the errors found:
*/

import java.io.*;
import java.util.*;
public class EncErr {

    /**
     * This is the list of encodings reported in
     *
	http://java.sun.com/products/jdk/1.2/docs/guide/internat/encoding.doc.html
     */

    private static String[] encodings = new String[] {
         "ASCII",            // ASCII
         "ISO8859_1",        // ISO 8859-1
         "ISO8859_2",        // ISO 8859-2
         "ISO8859_3",        // ISO 8859-3
         "ISO8859_4",        // ISO 8859-4
         "ISO8859_5",        // ISO 8859-5
         "ISO8859_6",        // ISO 8859-6
         "ISO8859_7",        // ISO 8859-7
         "ISO8859_8",        // ISO 8859-8
         "ISO8859_9",        // ISO 8859-9
         "Big5",             // Big5, Traditional Chinese
         "Cp037",            // USA, Canada(Bilingual, French), Netherlands, Portugal, Brazil, Australia
         "Cp1006",           // IBM AIX Pakistan (Urdu)
         "Cp1025",           // IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia(FYR)
         "Cp1026",           // IBM Latin-5, Turkey
         "Cp1046",           // IBM Open Edition US EBCDIC
         "Cp1097",           // IBM Iran(Farsi)/Persian
         "Cp1098",           // IBM Iran(Farsi)/Persian (PC)
         "Cp1112",           // IBM Latvia, Lithuania
         "Cp1122",           // IBM Estonia
         "Cp1123",           // IBM Ukraine
         "Cp1124",           // IBM AIX Ukraine
         "Cp1250",           // Windows Eastern European
         "Cp1251",           // Windows Cyrillic
         "Cp1252",           // Windows Latin-1
         "Cp1253",           // Windows Greek
         "Cp1254",           // Windows Turkish
         "Cp1255",           // Windows Hebrew
         "Cp1256",           // Windows Arabic
         "Cp1257",           // Windows Baltic
         "Cp1258",           // Windows Vietnamese
         "Cp1381",           // IBM OS/2, DOS People's Republic of China (PRC)
         "Cp1383",           // IBM AIX People's Republic of China (PRC)
         "Cp273",            // IBM Austria, Germany
         "Cp277",            // IBM Denmark, Norway
         "Cp278",            // IBM Finland, Sweden
         "Cp280",            // IBM Italy
         "Cp284",            // IBM Catalan/Spain, Spanish Latin America
         "Cp285",            // IBM United Kingdom, Ireland
         "Cp297",            // IBM France
         "Cp33722",          // IBM-eucJP - Japanese (superset of 5050)
         "Cp420",            // IBM Arabic
         "Cp424",            // IBM Hebrew
         "Cp437",            // MS-DOS United States, Australia, New Zealand, South Africa
         "Cp500",            // EBCDIC 500V1
         "Cp737",            // PC Greek
         "Cp775",            // PC Baltic
         "Cp838",            // IBM Thailand extended SBCS
         "Cp850",            // MS-DOS Latin-1
         "Cp852",            // MS-DOS Latin-2
         "Cp855",            // IBM Cyrillic
         "Cp857",            // IBM Turkish
         "Cp860",            // MS-DOS Portuguese
         "Cp861",            // MS-DOS Icelandic
         "Cp862",            // PC Hebrew
         "Cp863",            // MS-DOS Canadian French
         "Cp864",            // PC Arabic
         "Cp865",            // MS-DOS Nordic
         "Cp866",            // MS-DOS Russian
         "Cp868",            // MS-DOS Pakistan
         "Cp869",            // IBM Modern Greek
         "Cp870",            // IBM Multilingual Latin-2
         "Cp871",            // IBM Iceland
         "Cp874",            // IBM Thai
         "Cp875",            // IBM Greek
         "Cp918",            // IBM Pakistan(Urdu)
         "Cp921",            // IBM Latvia, Lithuania (AIX, DOS)
         "Cp922",            // IBM Estonia (AIX, DOS)
         "Cp930",            // Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
         "Cp933",            // Korean Mixed with 1880 UDC, superset of 5029
         "Cp935",            // Simplified Chinese Host mixed with 1880 UDC, superset of 5031
         "Cp937",            // Traditional Chinese Host mixed with 6204 UDC, superset of 5033
         "Cp939",            // Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
         "Cp942",            // Japanese (OS/2) superset of 932
         "Cp948",            // OS/2 Chinese (Taiwan) superset of 938
         "Cp949",            // PC Korean
         "Cp950",            // PC Chinese (Hong Kong, Taiwan)
         "Cp964",            // AIX Chinese (Taiwan)
         "Cp970",            // AIX Korean
         "EUC_CN",           // GB2312, EUC encoding, Simplified Chinese
         "EUC_JP",           // JIS0201, 0208, 0212, EUC Encoding, Japanese
         "EUC_KR",           // KS C 5601, EUC Encoding, Korean
         "EUC_TW",           // CNS11643 (Plane 1-3), T. Chinese, EUC encoding
         "GBK",              // GBK, Simplified Chinese
         "ISO2022CN",        // ISO 2022 CN, Chinese
         "ISO2022CN_CNS",    // CNS 11643 in ISO-2022-CN form, T. Chinese
         "ISO2022CN_GB",     // GB 2312 in ISO-2022-CN form, S. Chinese
         "ISO2022JP",        // JIS0201, 0208, 0212, ISO2022 Encoding, Japanese
         "ISO2022KR",        // ISO 2022 KR, Korean
         "JIS0201",          // JIS 0201, Japanese
         "JIS0208",          // JIS 0208, Japanese
         "JIS0212",          // JIS 0212, Japanese
         "KOI8_R",           // KOI8-R, Russian
         "MS874",            // Windows Thai
         "MacArabic",        // Macintosh Arabic
         "MacCentralEurope", // Macintosh Latin-2
         "MacCroatian",      // Macintosh Croatian
         "MacCyrillic",      // Macintosh Cyrillic
         "MacDingbat",       // Macintosh Dingbat
         "MacGreek",         // Macintosh Greek
         "MacHebrew",        // Macintosh Hebrew
         "MacIceland",       // Macintosh Iceland
         "MacRoman",         // Macintosh Roman
         "MacRomania",       // Macintosh Romania
         "MacSymbol",        // Macintosh Symbol
         "MacThai",          // Macintosh Thai
         "MacTurkish",       // Macintosh Turkish
         "MacUkraine",       // Macintosh Ukraine
         "SJIS",             // Shift-JIS, Japanese
         "UTF8",             // UTF-8
         };

    /**
     * Test an encoding. The following tests are done:
     * <ol>
     * <li>the existence of the encoder
     * <li>the existence of the decoder
     * <li>each character which is defined in Unicode is encoded,
     *   and then the result is decoded. The number of characters which
     *   are not encoded, or are encoded into an empty sequence of octets,
     *   or are encoded into a sequence which, once decoded, produces
     *   a character different from both the original one and '?'
     *   is counted.
     * <li>several long strings are encoded and then decoded, and checked
     *   to be equal (apart from characters mapped into '?') to the original.
     * </ol>
     * The third and fourth steps are done only if the previous are successful.
     * In the last step, only characters which are encoded correctly are
     * used.
     *
     * @param      enc name of the encoding
     */

    private static void test(String enc){
        System.err.println("------ test ------- " + enc);

        // test existence of encoder

        boolean both = true;
        try {
            byte[] bb = new byte[] {0};
            String str = new String(bb,enc);
        } catch (UnsupportedEncodingException th){
            System.err.println("encoder " + enc + " not available");
            both = false;
        }

        // test existence of decoder

        try {
            byte[] bb = "abc".getBytes(enc);
        } catch (UnsupportedEncodingException th){
            System.err.println("decoder " + enc + " not available");
            both = false;
        }
        if (!both) return;

        // test mapping

        // remember which character is valid for the round-trip test

        boolean[] valid = new boolean[Character.MAX_VALUE+1];
        try {
            int nrEmpty = 0;
            int nrUnmapped = 0;
            int nrNoBack = 0;
            int nrDiffBack = 0;
            for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++){
                if (!Character.isDefined((char)c)) continue;
                valid[c] = true;
                String s = String.valueOf((char)c);
                byte[] bb = null;
                try {
                    bb = s.getBytes(enc);
                    if (bb.length == 0){
                        nrEmpty++;
                        valid[c] = false;
                        continue;
                    }
                } catch (InternalError tr){
                    nrUnmapped++;
                    valid[c] = false;
                    continue;
                }
                try {
                    String str = new String(bb,enc);
                    if (str.length() != 1){
                        nrNoBack++;
                        valid[c] = false;
                        continue;
                    }
                    if ((str.charAt(0) != (char)c) &&
                       (str.charAt(0) != '?')){
                        nrDiffBack++;
                        valid[c] = false;
                        continue;
                    }
                } catch (InternalError tr){
                    // a failure while decoding also disqualifies the character
                    nrNoBack++;
                    valid[c] = false;
                }
            }
            if (nrUnmapped > 0){
                System.err.println(enc + " has " + nrUnmapped + " unmapped characters");
            }
            if (nrEmpty > 0){
                System.err.println(enc + " has " + nrEmpty + " empty mapped characters");
            }
            if (nrNoBack > 0){
                System.err.println(enc + " does not convert back " + nrNoBack + " characters");
            }
            if (nrDiffBack > 0){
                System.err.println(enc + " converts back " + nrDiffBack + " characters into a different one");
            }
            if (nrDiffBack > Character.MAX_VALUE / 2) return;
        } catch (Throwable th){
            System.err.println("encoding " + enc + " mapping error " + th);
            th.printStackTrace(System.err);
        }

        // test round-trip

        trip: for (int k = 0; k < 100; k++){
            byte[] bb = null;
            char[] ca = new char[10000];
            Random r = new Random();
            for (int i = 0; i < ca.length; i++){
                do {
                    ca[i] = (char)r.nextInt(Character.MAX_VALUE);
                } while (!valid[ca[i]]);
            }
            String old = String.valueOf(ca);
            try {
                bb = old.getBytes(enc);
                if (bb == null){
                    System.err.println(enc + " empty encoding");
                    return;
                }
            } catch (InternalError th){
                System.err.println(enc + " round-trip decoding error");
                break trip;
            } catch (UnsupportedEncodingException th){
            }
            try {
                String str = new String(bb,enc);
                if (!old.equals(str)){
                    if (old.length() != str.length()){
                        System.err.println("encoding " + enc +
                            " round-trip " + old.length() +
                            " back to " + str.length());
                        break trip;
                    }
                    for (int i = 0; i < ca.length && i < str.length(); i++){
                        if ((old.charAt(i) != str.charAt(i)) &&
                            (str.charAt(i) != '?')){
                            System.err.println(enc + " round-trip compare error");
                            break trip;
                        }
                    }
                }
            } catch (InternalError th){
                System.err.println(enc + " round-trip encoding error ");
                break trip;
            } catch (UnsupportedEncodingException th){
            }
        }
    }

    /**
     * Tests all encodings. On all encodings the tests defined above
     * are performed. Moreover, some specific tests are done on ISO2022CN
     * and ISO2022KR.
     */

    public static void main(String[] args){

        for (int i = 0; i < encodings.length; i++){
            test(encodings[i]);
        }

        try {
            byte[] bb = new byte[] {(byte)0x1b, (byte)')',  (byte)'x'};
            String str = new String(bb,"ISO2022CN");
        } catch (Throwable th){
            System.err.println("ISO2022CN error " + th);
        }

        try {
            byte[] bb = new byte[] {(byte)0x1b, (byte)')',  (byte)'x'};
            String str = new String(bb,"ISO2022KR");
        } catch (Throwable th){
            System.err.println("ISO2022KR error " + th);
        }

    }
}

/*
When run, it reports a considerable number of errors.

Feel free to use it, and include it in your test suite if you
like.
*/
(Review ID: 98558) 
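
The round-trip step of the program above draws random chars from the whole 16-bit range, which includes surrogate code units (0xD800 to 0xDFFF). An unpaired surrogate cannot be encoded on its own, so such strings may fail to round-trip even with a lossless encoding. The following is a minimal sketch of a generator that skips that range; the class and method names are hypothetical, and the use of UTF-8 with the Charset overloads of String reflects current JDKs rather than the releases discussed in this report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Hypothetical helper: random test strings without surrogate code units. */
final class SurrogateFreeStrings {

    /** Builds a random string of the given length, skipping the range D800..DFFF. */
    static String random(int length, Random r) {
        char[] ca = new char[length];
        for (int i = 0; i < ca.length; i++) {
            do {
                ca[i] = (char) r.nextInt(Character.MAX_VALUE);
            } while (ca[i] >= 0xD800 && ca[i] < 0xE000);
        }
        return String.valueOf(ca);
    }

    /** Returns true if encoding and then decoding yields the original string. */
    static boolean roundTrips(String s, Charset cs) {
        return s.equals(new String(s.getBytes(cs), cs));
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        String clean = random(10000, new Random());
        String broken = "a" + (char) 0xD800 + "b"; // contains a lone high surrogate
        System.out.println(roundTrips(clean, utf8));
        System.out.println(roundTrips(broken, utf8));
    }
}
```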
======================================================================

Name: krT82822			Date: 02/08/2000


java version "1.2.2"
HotSpot VM (1.0.1, mixed mode, build g)

When using String to convert from native encodings to Unicode and back
again, different encodings behave erratically when dealing with characters for
which there is no direct match.  Specifically, some encodings indicate a
mismatch by mapping the character to '\u003F', '\u001A', or even no
character ''.  What is worse is that within a single encoding, multiple methods
are used.  In some cases, conversions throw undocumented exceptions.  The worst
behavior is when a conversion from Unicode to byte and back again does not
generate an 'unknown' mapping or an exception, but maps to an entirely different
character.

My general technique for identifying these bugs was to step through all the
Unicode characters for every encoding, and document the results.  For each
character, I'd convert from Unicode to byte, and then from byte back to Unicode.

'EX1' indicates an error converting from unicode to byte[].
'EX2' indicates an error converting from byte[] to unicode.
'UN1' indicates a mapping to the '\u003F' character.
'UN2' indicates a mapping to the '\u001A' character.

Comments

EVALUATION

The test case in the bug description tests primarily round-trip conversion from Unicode to another encoding and back to Unicode. While it is desirable that such a round-trip conversion results in the original character(s), this cannot generally be guaranteed. The anomalies reported by the test in some cases indicate real bugs, but in some other cases just reflect the idiosyncrasies of the various encodings we support. This evaluation looks at the reported anomalies case by case. The basis is the J2SDK 1.3.0 FCS-P build.

The converters that are not available are ISO2022CN (decoder), ISO2022CN_CNS (encoder), ISO2022CN_GB (encoder). This has already been reported as bug 4140796.

For all encodings, the test produces a complaint like "encoding ASCII round-trip 10000 back to 9997", indicating that round-trip conversion of a 10000-character Unicode string results in a slightly shorter string. This occurs primarily due to surrogate characters. The char-to-byte conversion in String uses the method CharToByteConverter.convertAny, which skips over any malformed input such as unpaired surrogate characters. If the test is modified to eliminate surrogate characters, replacing the String generation code

    Random r = new Random();
    for (int i = 0; i < ca.length; i++){
        do {
            ca[i] = (char)r.nextInt(Character.MAX_VALUE);
        } while (!valid[ca[i]]);
    }

with

    char[] ca = new char[10000];
    Random r = new Random();
    for (int i = 0; i < ca.length; i++){
        do {
            ca[i] = (char)r.nextInt(Character.MAX_VALUE);
        } while (!valid[ca[i]] || (ca[i] >= 0xD800 && ca[i] < 0xE000));
    }

then this error is no longer reported for most converters, the exceptions being Cp933, Cp949, and Cp970.

TO DO: The documentation of String.getBytes should be updated to document the handling of malformed input.

The Cp933, Cp949, and Cp970 char-to-byte converters have code to combine sequences of Jamo in the Hangul Jamo block.
In all cases I saw where the test reported "round-trip 10000 back to 9997" or similar, I found sequences of such characters in the input strings. A shortened result of the round trip in these cases is to be expected.

Other complaints:

    Cp037 converts back 47145 characters into a different one
    Cp1025 converts back 47145 characters into a different one
    Cp1026 converts back 47145 characters into a different one
    Cp1097 converts back 47144 characters into a different one
    Cp1112 converts back 47145 characters into a different one
    Cp1122 converts back 47145 characters into a different one
    Cp1123 converts back 47145 characters into a different one
    Cp273 converts back 47145 characters into a different one
    Cp277 converts back 47145 characters into a different one
    Cp278 converts back 47145 characters into a different one
    Cp280 converts back 47145 characters into a different one
    Cp284 converts back 47145 characters into a different one
    Cp285 converts back 47145 characters into a different one
    Cp297 converts back 47145 characters into a different one
    Cp420 converts back 47153 characters into a different one
    Cp424 converts back 47183 characters into a different one
    Cp500 converts back 47145 characters into a different one
    Cp838 converts back 47150 characters into a different one
    Cp870 converts back 47145 characters into a different one
    Cp871 converts back 47145 characters into a different one
    Cp875 converts back 47151 characters into a different one
    Cp918 converts back 47145 characters into a different one

For some of these EBCDIC converters, \u0085 converts back to \u000A; in all other cases unmapped characters convert back to \u001A. The char-to-byte converters for these EBCDIC encodings do not set the subBytes to the EBCDIC question mark (0x6F); they retain the default value 0x3F. This byte value in EBCDIC represents the SUB control character, so the byte-to-char converter maps it back to \u001A.
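
The byte values involved in the EBCDIC substitution issue can be checked directly. The sketch below is only an illustration under stated assumptions: the class EbcdicSubDemo and its decodeByte method are hypothetical names, the charset name "IBM037" is used for Cp037 and may be absent from a minimal runtime (hence the isSupported guard), and it exercises the java.nio.charset decoders of a current JDK, not the sun.io converters evaluated here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: what the bytes 0x3F and 0x6F mean in ASCII versus EBCDIC. */
final class EbcdicSubDemo {

    /** Decodes a single byte with the given charset and returns the resulting char. */
    static char decodeByte(Charset cs, int b) {
        String s = new String(new byte[] {(byte) b}, cs);
        return s.charAt(0);
    }

    public static void main(String[] args) {
        // In ASCII, 0x3F is the question mark.
        System.out.printf("US-ASCII 0x3F -> U+%04X%n",
                (int) decodeByte(StandardCharsets.US_ASCII, 0x3F));

        String name = "IBM037"; // assumption: extended charsets are installed
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        Charset ebcdic = Charset.forName(name);
        // In EBCDIC, 0x3F is the SUB control character and 0x6F is the question mark.
        System.out.printf("%s 0x3F -> U+%04X%n", name, (int) decodeByte(ebcdic, 0x3F));
        System.out.printf("%s 0x6F -> U+%04X%n", name, (int) decodeByte(ebcdic, 0x6F));
    }
}
```

An encoder that emits 0x3F as its substitution byte for an EBCDIC code page therefore produces SUB rather than a question mark, which is consistent with the \u001A results described above.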
In the case of \u0085, some char-to-byte converters use this to express the EBCDIC NL control character, while the corresponding byte-to-char converters treat NL as a synonym of LF.

Cp1381 converts back 3 characters into a different one

    \u00B7 converts back to \u30FB
    \u2014 converts back to \u2015
    \u7AC2 converts back to \u30FB

Cp1383 converts back 7 characters into a different one

    \u001A converts back to \u00A3
    \u00B7 converts back to \u30FB
    \u2014 converts back to \u2015
    \u50FF converts back to \u00A3
    \u8EA2 converts back to \u30FB
    \uF83D converts back to \uFFE5
    \uF83E converts back to \u4EDD

Cp33722 converts back 52 characters into a different one

    \u2015 converts back to \u2014
    \u2225 converts back to \u2016
    \u4FE0 converts back to \u4FA0
    \u525D converts back to \u5265
    \u551E converts back to \u8749
    \u555E converts back to \u5516
    \u5699 converts back to \u565B
    \u56CA converts back to \u56A2
    \u5861 converts back to \u586B
    \u5C5B converts back to \u5C4F
    \u5C62 converts back to \u5C61
    \u6414 converts back to \u63BB
    \u6451 converts back to \u63B4
    \u6522 converts back to \u6505
    \u6805 converts back to \u67F5
    \u688E converts back to \u688D
    \u6D00 converts back to \u6D9C
    \u6F1E converts back to \u9A28
    \u6F51 converts back to \u6E8C
    \u7006 converts back to \u6D9C
    \u70FF converts back to \u4FA0
    \u7130 converts back to \u7114
    \u7626 converts back to \u75E9
    \u79B1 converts back to \u7977
    \u7C1E converts back to \u7BAA
    \u7E48 converts back to \u7E66
    \u7E61 converts back to \u7E4D
    \u7E6B converts back to \u7E4B
    \u8141 converts back to \u80FC
    \u8346 converts back to \u834A
    \u840A converts back to \u83B1
    \u8523 converts back to \u848B
    \u8741 converts back to \u5516
    \u87EC converts back to \u8749
    \u881F converts back to \u874B
    \u8EC0 converts back to \u8EAF
    \u8F91 converts back to \u2116
    \u91AC converts back to \u91A4
    \u91B1 converts back to \u9197
    \u92CA converts back to \u565B
    \u9830 converts back to \u982C
    \u9839 converts back to \u983D
    \u985A converts back to \u985B
    \u9A52 converts back to \u9A28
    \u9DD7 converts back to \u9D0E
    \u9E7C converts back to \u9E78
    \u9EB4 converts back to \u9EB9
    \u9EB5 converts back to \u9EBA
    \uF86F converts back to \u2116
    \uFF0D converts back to \u2212
    \uFF5E converts back to \u301C
    \uFFE4 converts back to \uFFFD

Cp930 does not convert back 2 characters

    \u000E maps back to empty string
    \u000F maps back to empty string

Cp930 converts back 52 characters into a different one

    \u0085 converts back to \u000A
    \u00A6 converts back to \uFFE4
    \u2014 converts back to \u2015
    \u2016 converts back to \u2225
    \u2212 converts back to \uFF0D
    \u301C converts back to \uFF5E
    \u4FE0 converts back to \u4FA0
    \u525D converts back to \u5265
    \u555E converts back to \u5516
    \u5699 converts back to \u565B
    \u56CA converts back to \u56A2
    \u5861 converts back to \u586B
    \u5C5B converts back to \u5C4F
    \u5C62 converts back to \u5C61
    \u6414 converts back to \u63BB
    \u6451 converts back to \u63B4
    \u6522 converts back to \u6505
    \u688E converts back to \u688D
    \u6BE1 converts back to \u5516
    \u6D00 converts back to \u6D9C
    \u6F51 converts back to \u6E8C
    \u7006 converts back to \u6D9C
    \u70FF converts back to \u4FA0
    \u7130 converts back to \u7114
    \u7626 converts back to \u75E9
    \u79B1 converts back to \u7977
    \u7C1E converts back to \u7BAA
    \u7E48 converts back to \u7E66
    \u7E61 converts back to \u7E4D
    \u7E6B converts back to \u7E4B
    \u8141 converts back to \u80FC
    \u840A converts back to \u83B1
    \u841D converts back to \u8749
    \u841F converts back to \u874B
    \u8523 converts back to \u848B
    \u87EC converts back to \u8749
    \u881F converts back to \u874B
    \u8EC0 converts back to \u8EAF
    \u8F91 converts back to \u2116
    \u91AC converts back to \u91A4
    \u91B1 converts back to \u9197
    \u92CA converts back to \u565B
    \u9830 converts back to \u982C
    \u9839 converts back to \u983D
    \u985A converts back to \u985B
    \u9A52 converts back to \u9A28
    \u9B7E converts back to \u9A28
    \u9DD7 converts back to \u9D0E
    \u9E7C converts back to \u9E78
    \u9EB4 converts back to \u9EB9
    \u9EB5 converts
back to \u9EBA \uF86F converts back to \u2116 Except for \u0085, these are all cases where two Unicode characters map to the same Cp939 byte sequence, as specified in either the official IBM mapping table or in the request for additional characters for Microsoft compatibility in RFE 4199599. Cp933 has 10887 unmapped characters InternalError thrown in char-to-byte conversion Cp933 does not convert back 2 characters \u000E maps back to empty string \u000F maps back to empty string Cp935 does not convert back 2 characters \u000E maps back to empty string \u000F maps back to empty string Cp935 converts back 1 characters into a different one \u0085 converts back to \u000A Cp937 does not convert back 2 characters \u000E maps back to empty string \u000F maps back to empty string Cp937 converts back 1 characters into a different one \u0085 converts back to \u000A Cp939 does not convert back 2 characters \u000E maps back to empty string \u000F maps back to empty string Cp939 converts back 52 characters into a different one \u0085 converts back to \u000A \u00A6 converts back to \uFFE4 \u2014 converts back to \u2015 \u2016 converts back to \u2225 \u2212 converts back to \uFF0D \u301C converts back to \uFF5E \u4FE0 converts back to \u4FA0 \u525D converts back to \u5265 \u555E converts back to \u5516 \u5699 converts back to \u565B \u56CA converts back to \u56A2 \u5861 converts back to \u586B \u5C5B converts back to \u5C4F \u5C62 converts back to \u5C61 \u6414 converts back to \u63BB \u6451 converts back to \u63B4 \u6522 converts back to \u6505 \u688E converts back to \u688D \u6BE1 converts back to \u5516 \u6D00 converts back to \u6D9C \u6F51 converts back to \u6E8C \u7006 converts back to \u6D9C \u70FF converts back to \u4FA0 \u7130 converts back to \u7114 \u7626 converts back to \u75E9 \u79B1 converts back to \u7977 \u7C1E converts back to \u7BAA \u7E48 converts back to \u7E66 \u7E61 converts back to \u7E4D \u7E6B converts back to \u7E4B \u8141 converts back to \u80FC \u840A 
converts back to \u83B1 \u841D converts back to \u8749 \u841F converts back to \u874B \u8523 converts back to \u848B \u87EC converts back to \u8749 \u881F converts back to \u874B \u8EC0 converts back to \u8EAF \u8F91 converts back to \u2116 \u91AC converts back to \u91A4 \u91B1 converts back to \u9197 \u92CA converts back to \u565B \u9830 converts back to \u982C \u9839 converts back to \u983D \u985A converts back to \u985B \u9A52 converts back to \u9A28 \u9B7E converts back to \u9A28 \u9DD7 converts back to \u9D0E \u9E7C converts back to \u9E78 \u9EB4 converts back to \u9EB9 \u9EB5 converts back to \u9EBA \uF86F converts back to \u2116 Except for \u0085, these are all cases where two Unicode characters map to the same Cp939 byte sequence, as specified in either the official IBM mapping table or in the request for additional characters for Microsoft compatibility in RFE 4199599. Cp942 converts back 45 characters into a different one \u4FE0 converts back to \u4FA0 \u525D converts back to \u5265 \u551E converts back to \u8749 \u555E converts back to \u5516 \u5699 converts back to \u565B \u56CA converts back to \u56A2 \u5861 converts back to \u586B \u5C5B converts back to \u5C4F \u5C62 converts back to \u5C61 \u6414 converts back to \u63BB \u6451 converts back to \u63B4 \u6522 converts back to \u6505 \u688E converts back to \u688D \u6D00 converts back to \u6D9C \u6F1E converts back to \u9A28 \u6F51 converts back to \u6E8C \u7006 converts back to \u6D9C \u70FF converts back to \u4FA0 \u7130 converts back to \u7114 \u7626 converts back to \u75E9 \u79B1 converts back to \u7977 \u7C1E converts back to \u7BAA \u7E48 converts back to \u7E66 \u7E61 converts back to \u7E4D \u7E6B converts back to \u7E4B \u8141 converts back to \u80FC \u840A converts back to \u83B1 \u8523 converts back to \u848B \u8741 converts back to \u5516 \u87EC converts back to \u8749 \u881F converts back to \u874B \u8EC0 converts back to \u8EAF \u8F91 converts back to \u2116 \u91AC converts back to \u91A4 
\u91B1 converts back to \u9197 \u92CA converts back to \u565B \u9830 converts back to \u982C \u9839 converts back to \u983D \u985A converts back to \u985B \u9A52 converts back to \u9A28 \u9DD7 converts back to \u9D0E \u9E7C converts back to \u9E78 \u9EB4 converts back to \u9EB9 \u9EB5 converts back to \u9EBA \uF86F converts back to \u2116 Cp949 converts back 129 characters into a different one \u1100 converts back to \uAC00 \u1101 converts back to \uAE4C \u1102 converts back to \uB098 \u1103 converts back to \uB2E4 \u1104 converts back to \uB530 \u1105 converts back to \uB77C \u1106 converts back to \uB9C8 \u1107 converts back to \uBC14 \u1108 converts back to \uBE60 \u1109 converts back to \uC0AC \u110A converts back to \uC2F8 \u110B converts back to \uC544 \u110C converts back to \uC790 \u110D converts back to \uC9DC \u110E converts back to \uCC28 \u110F converts back to \uCE74 \u1110 converts back to \uD0C0 \u1111 converts back to \uD30C \u1112 converts back to \uD558 \u1117 converts back to \uE0D4 \u1118 converts back to \uE320 \u1135 converts back to \u25BC \u113A converts back to \u3138 \u113B converts back to \u3384 \u1150 converts back to \u63C0 \u1151 converts back to \u660C \u1154 converts back to \u6CF0 \u1158 converts back to \u7620 \u1159 converts back to \u786C \u1161 converts back to \uAC00 \u1162 converts back to \uAC1C \u1163 converts back to \uAC38 \u1164 converts back to \uAC54 \u1165 converts back to \uAC70 \u1166 converts back to \uAC8C \u1167 converts back to \uACA8 \u1168 converts back to \uACC4 \u1169 converts back to \uACE0 \u116A converts back to \uACFC \u116B converts back to \uAD18 \u116C converts back to \uAD34 \u116D converts back to \uAD50 \u116E converts back to \uAD6C \u116F converts back to \uAD88 \u1170 converts back to \uADA4 \u1171 converts back to \uADC0 \u1172 converts back to \uADDC \u1173 converts back to \uADF8 \u1174 converts back to \uAE14 \u1175 converts back to \uAE30 \u1176 converts back to \uAE4C \u1177 converts back 
to \uAE68 \u1178 converts back to \uAE84 \u1179 converts back to \uAEA0 \u117A converts back to \uAEBC \u117B converts back to \uAED8 \u117C converts back to \uAEF4 \u117D converts back to \uAF10 \u117E converts back to \uAF2C \u117F converts back to \uAF48 \u1180 converts back to \uAF64 \u1181 converts back to \uAF80 \u1182 converts back to \uAF9C \u1183 converts back to \uAFB8 \u1184 converts back to \uAFD4 \u1185 converts back to \uAFF0 \u1186 converts back to \uB00C \u1187 converts back to \uB028 \u1188 converts back to \uB044 \u1189 converts back to \uB060 \u118A converts back to \uB07C \u118B converts back to \uB098 \u118C converts back to \uB0B4 \u118D converts back to \uB0D0 \u118E converts back to \uB0EC \u118F converts back to \uB108 \u1190 converts back to \uB124 \u1191 converts back to \uB140 \u1192 converts back to \uB15C \u1193 converts back to \uB178 \u1194 converts back to \uB194 \u1195 converts back to \uB1B0 \u1196 converts back to \uB1CC \u1197 converts bac

11-06-2004

WORK AROUND

Name: krT82822
Date: 12/05/99

None. When a converter does not work, there is no way to make it work.
======================================================================

Name: krT82822
Date: 02/08/2000

I will be converting Strings on a character-by-character basis so that I can identify which characters are 'troublesome', since it is not possible to isolate 'mapping' problems when converting Strings with multiple characters.
(Review ID: 100000)
======================================================================

11-06-2004
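The per-character approach described in the workaround can be sketched roughly as follows. This is only an illustration, not the submitter's test case: the class RoundTripCheck and its methods are hypothetical names, and it assumes a modern JDK with java.nio.charset.Charset (the 1.2/1.3 releases in this report would have used String.getBytes(String enc) instead). Each char is encoded and decoded on its own, so a mismatch can be attributed to exactly one character.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the JDK or of the original test.
final class RoundTripCheck {

    /** Encodes a single char and decodes the resulting bytes again. */
    static String roundTrip(char c, Charset cs) {
        byte[] bytes = String.valueOf(c).getBytes(cs);
        return new String(bytes, cs);
    }

    /** Lists every char in [from, to] that does not convert back to itself. */
    static List<String> findMismatches(String charsetName, char from, char to) {
        Charset cs = Charset.forName(charsetName);
        List<String> report = new ArrayList<>();
        // Loop over int so that an upper bound of 0xFFFF cannot wrap around.
        for (int i = from; i <= to; i++) {
            char c = (char) i;
            if (Character.isSurrogate(c)) {
                continue; // unpaired surrogates are not meaningful input
            }
            String back = roundTrip(c, cs);
            if (back.isEmpty()) {
                report.add(String.format("\\u%04X maps back to empty string", i));
            } else if (back.length() != 1 || back.charAt(0) != c) {
                report.add(String.format("\\u%04X converts back to \\u%04X",
                        i, (int) back.charAt(0)));
            }
        }
        return report;
    }
}
```

On a JDK that ships the extended charsets, a call such as RoundTripCheck.findMismatches("Cp1381", '\u0000', '\uFFFF') should produce lines in the same shape as the report above; the actual contents depend on the JDK release. Note that characters with no mapping are also listed, because the encoder substitutes its replacement byte (typically '?') for them.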

EVALUATION

k to \uB1E8
\u1198 converts back to \uB204
\u1199 converts back to \uB220
\u119A converts back to \uB23C
\u119B converts back to \uB258
\u119C converts back to \uB274
\u119D converts back to \uB290
\u119E converts back to \uB2AC
\u119F converts back to \uB2C8
\u11A0 converts back to \uB2E4
\u11A1 converts back to \uB300
\u11A2 converts back to \uB31C
\u11A8 converts back to \uAC01
\u11A9 converts back to \uAC02
\u11AB converts back to \uAC04
\u11AE converts back to \uAC07
\u11AF converts back to \uAC08
\u11B0 converts back to \uAC09
\u11B1 converts back to \uAC0A
\u11B2 converts back to \uAC0B
\u11B7 converts back to \uAC10
\u11B8 converts back to \uAC11
\u11B9 converts back to \uAC12
\u11BA converts back to \uAC13
\u11BB converts back to \uAC14
\u11BC converts back to \uAC15
\u11BD converts back to \uAC16
\u11BE converts back to \uAC17
\u11C0 converts back to \uAC19
\u11C1 converts back to \uAC1A
\u11C2 converts back to \uAC1B
\u11C3 converts back to \uAC1C
\u11C4 converts back to \uAC1D
\u11C7 converts back to \uAC20
\u11CB converts back to \uAC24
\u11D3 converts back to \uAC2C
\u11D4 converts back to \uAC2D
\u11D6 converts back to \uAC2F
\u11D7 converts back to \uAC30
\u11D8 converts back to \uAC31
\u11DF converts back to \uAC38
\u11E0 converts back to \uAC39
\u11E3 converts back to \uAC3C
\u11E7 converts back to \uAC40
\u11F2 converts back to \uAC4B
\u11F4 converts back to \uAC4D

Cp970 converts back 2480 characters into a different one
Large number maps back to \u25C9, small number to other characters.

EUC_JP converts back 2 characters into a different one
\u00A5 converts back to \u005C, \u203E converts back to \u007E. See JIS0201.

EUC_TW does not convert back 1 characters
\u0000 maps back to empty string.

ISO2022JP does not convert back 3 characters
\u000E, \u000F, \u001B map back to empty strings.

ISO2022KR has 37512 empty mapped characters
Includes half of high surrogate characters.

ISO2022KR does not convert back 1539 characters
\u000E, \u000F, \u001B map back to empty strings. Plus half of high and all of low surrogate characters.

JIS0201 converts back 2 characters into a different one
\u00A5 converts back to \u005C, \u203E converts back to \u007E.

JIS0208 does not convert back 40521 characters
They map back to empty string.

JIS0212 does not convert back 41333 characters
They map back to empty string.

MacDingbat converts back 47179 characters into a different one
MacDingbat doesn't have a question mark, so unmapped characters are arbitrarily mapped to 0x3F, which translates back to \u271F.

SJIS converts back 2 characters into a different one
\u00A5 converts back to \u005C, \u203E converts back to \u007E. See JIS0201.

ISO2022CN error java.lang.ArrayIndexOutOfBoundsException
Is thrown for an invalid escape sequence.

ISO2022KR error java.lang.ArrayIndexOutOfBoundsException
Is thrown for an invalid escape sequence.

norbert.lindenberg@Eng 1999-12-08

JIS0201 and JIS0208 are standards that define fewer than 7000 characters. The test is probably in error when it reports failures to convert back over 40000 characters. Using an independent test of JIS0201, JIS0212, and SJIS, 8 errors were detected:

CHECKING BYTE ARRAY TO STRING
PASS?  CODE     IN    CHECK  OUT   COMMENT
FAIL   JIS0201  5C    00A5   005C  # YEN SIGN
FAIL   JIS0201  7E    203E   007E  # OVERLINE
FAIL   JIS0212  2237  007E   FF5E  # TILDE
FAIL   SJIS     5C    00A5   005C  # YEN SIGN
FAIL   SJIS     7E    203E   007E  # OVERLINE
FAIL   SJIS     815F  005C   FF3C  # REVERSE SOLIDUS

CHECKING STRING TO BYTE ARRAY
PASS?  CODE     IN    CHECK  OUT   COMMENT
FAIL   JIS0212  007E  2237   7E    # TILDE
FAIL   SJIS     005C  815F   5C    # REVERSE SOLIDUS

Hex under "IN" is the input to a conversion. Hex in the "CHECK" column is the expected output. Hex under the "OUT" column is the actual output using a 1.3 or 1.4 JDK. The test used a pair of code pages from ftp://www.unicode.org/ . The page for JIS0201 is for "JIS X 0201 (1976) to Unicode 1.1".
The code page for SHIFTJIS says

# This table contains the data the Unicode Consortium has on how
# Shift-JIS (a combination of JIS 0201 and JIS 0208) maps into Unicode

and is dated 8 March 1994.

allan.jacobs@Eng 2000-08-07

JIS0208 required a separate test. The test uses a download from ftp://www.unicode.org/ and is for "JIS X 0208 (1990)". This code page lists conversions for 6879 characters. Only one character is incorrectly converted.

CHECKING BYTE ARRAY TO STRING
PASS?  CODE     IN    CHECK  OUT   COMMENT
FAIL   JIS0208  2140  005C   FF3C  # REVERSE SOLIDUS

CHECKING STRING TO BYTE ARRAY
PASS?  CODE     IN    CHECK  OUT   COMMENT
FAIL   JIS0208  005C  2140   5C    # REVERSE SOLIDUS

allan.jacobs@Eng 2000-08-11

This bug is extremely broad in its analysis of the various character converters. Also, as has already been commented, a number of the reported issues are not bugs. They are due either to fallback and compatibility mappings which have been requested within some of the converters, or to the original test case having provided unpaired surrogate/unmappable characters. However, in the interests of more efficient bug management and tracking, I have isolated the real issues which need to be addressed in various converters and created bugs to track those issues/errors. Some bugs already exist for these issues. Here is a concise summary of the issues pertinent to this broad bug report, with references to the new bugIDs.

1. Cp930, Cp933, Cp935, Cp937, Cp939. These EBCDIC based encodings provide mappings for U+000E and U+000F back to native 0x0E or 0x0F respectively. These mappings should be removed and these should be unmappable chars. See BugID: 4429358
2. ISO2022CN and ISO2022KR converters throw a runtime exception (ArrayIndexOutOfBoundsException on jdk1.3). See BugID: 4429369
3. Some of the IBM converters contain char->byte mappings which now appear to be obsolete. These should be removed. See BugID: 4429377
4. Cp933 throws InternalError if it attempts to encode a solitary character. This is due to inadequate (4 byte) reservation of space for SI/SO leading/trailing bytes in the encoded output. See BugID: 4333733
5. SJIS/EUC-JP/JIS0201/JIS0208 roundtrip issues. These have been previously addressed within bug 4361835.
6. Handling of 'NEL' (U+0085) character conversion within EBCDIC encodings. Since 4159519 was fixed to handle new line/line feed conversion on EBCDIC platforms, there is a potential roundtrip mapping conflict for the control character U+0085. This needs to be resolved. BugID: TBD.
7. ISO2022CN (decoder), ISO2022CN_CNS (encoder), ISO2022CN_GB (encoder) not supported. Already captured in 4140796.

Please use the bugIDs above for tracking the remaining identified constituent issues. This bug is being closed out in favor of the newly created and existing bugs.

Ian.Little@Ireland 3/23/2001.

23-03-2001
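Item 1 of the summary turns on whether a character has a char-to-byte mapping at all or is unmappable. One way to observe this from application code, sketched here under the assumption of a modern JDK with java.nio.charset (not available in the 1.2/1.3 releases discussed above), is to run an encoder in strict mode so that it reports unmappable input instead of substituting a replacement byte. The class StrictEncodeCheck and its method isMappable are hypothetical names used only for this illustration.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; not part of the JDK.
final class StrictEncodeCheck {

    /** Returns true if the single char c can be encoded in the named charset. */
    static boolean isMappable(char c, String charsetName) {
        // REPORT makes the encoder throw instead of emitting a replacement byte.
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(new char[] { c }));
            return true;
        } catch (CharacterCodingException e) {
            // Covers both unmappable characters and malformed input
            // such as an unpaired surrogate.
            return false;
        }
    }
}
```

Where the IBM EBCDIC charsets are installed, StrictEncodeCheck.isMappable('\u000E', "Cp930") would indicate whether the SO mapping described in item 1 is still present in that release; no particular result is claimed here.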