Bug ID: JDK-4964355 Clarify (lack of) specification for optional charsets

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 5.0
Priority: P2
Status: Resolved
Resolution: Fixed
OS: generic,solaris_9
CPU: generic
Submitted: 2003-12-05
Updated: 2017-05-16
Resolved: 2004-09-17
Other: 5.0 b59 (Fixed)
The JCK API test ./api/java_io/mbCharEncoding/TestCNS11643.java will need to be updated.
As a result of 4847097, the EUC-TW charset coder provided within J2SE 1.5.0 will be
updated to support planes 4, 5, 6, 7 and 15, whereas the current test is limited to
planes 1, 2 and 3. As part of revving the charset coder to support these additional
planes, the implementation was also brought into sync with the more up-to-date
assignments of characters within plane 3. Because some characters were reassigned
relative to the much older snapshot used to date, this test will start to fail from
1.5.0 b32 onwards, once the fix for bug ID 4847097 is integrated.
###@###.### 2003-12-05
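
For readers unfamiliar with how planes appear on the wire, the following is a
minimal sketch, not taken from the J2SE sources. It assumes the commonly
described EUC-TW layout, in which the byte 0x8E introduces a four-byte sequence
whose second byte selects the CNS 11643 plane (0xA1 for plane 1) and whose last
two bytes are the row and column with the high bit set. The class and method
names are hypothetical. Since "x-EUC-TW" is an optional charset, the decode step
is guarded and may not run at all on a given runtime.

```java
import java.nio.charset.Charset;

class EucTwPlaneSketch {
    /** Builds the assumed four-byte form: 0x8E, plane selector, row byte, column byte. */
    static byte[] fourByteForm(int plane, int row, int col) {
        if (plane < 1 || plane > 16 || row < 0x21 || row > 0x7E || col < 0x21 || col > 0x7E) {
            throw new IllegalArgumentException("plane/row/col out of range");
        }
        return new byte[] {
            (byte) 0x8E,            // single-shift byte introducing the four-byte form
            (byte) (0xA0 + plane),  // plane selector: 0xA1 is plane 1
            (byte) (row | 0x80),    // row with the high bit set
            (byte) (col | 0x80)     // column with the high bit set
        };
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // First cell of plane 4, one of the planes the existing test does not cover.
        byte[] bytes = fourByteForm(4, 0x21, 0x21);
        System.out.println("EUC-TW bytes: " + hex(bytes));

        // Optional charset: check for it rather than assuming it exists.
        if (Charset.isSupported("x-EUC-TW")) {
            String decoded = new String(bytes, Charset.forName("x-EUC-TW"));
            if (!decoded.isEmpty()) {
                System.out.printf("decoded to U+%04X%n", decoded.codePointAt(0));
            }
        } else {
            System.out.println("x-EUC-TW is not available in this runtime");
        }
    }
}
```

Which code point (if any) a given cell decodes to depends on the mapping table
the runtime ships, which is exactly the point under discussion in this report.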

Name: ooR10006			Date: 05/06/2004



Below is the log of the email discussion of the issue:

------- From Ian Little, 28 Apr 2004:

I am keenly aware that tomorrow is your beta2 deadline for JCK 1.5.
I have gleaned some useful info regarding the status of some of
the mappings whose status was previously unknown in the last day
or so. I will send you details in a followup e-mail later today.
It is now very likely that I have the majority of the info I need in 
order to file a retrospective CCC request for the changes introduced
in 1.5.0 beta as part of 4847097.

------- From Ian Little, 22 Apr 2004:

Norbert usefully summarized the specifics of the situation for CNS-11643
and more specifically with regards to the J2SE "x-EUC-TW" charset
implementation in a recent e-mail discussion which I have quoted below.

Norbert Lindenberg wrote:

 >CNS 11643 is a special case. The Unihan database has normative
 >mappings for it, but undermines its normative status by providing
 >"some additional characters" for CNS 11643 plane 3. Also, I don't see
 >any specification linking the name "x-EUC-TW" and the variant of CNS
 >11643 used in the Unihan database. In general, since the x-/X-
 >namespace is unregulated, no assumptions should be made about
 >compatibility of these encodings across implementations. If people
 >interested in such character encodings care about compatibility, they
 >should as a first step register their encodings with IANA.

My original take on Norbert's observations above was that the J2SE
"x-EUC-TW" implementation doesn't need to reference any normative
mapping data, on account of the naming scheme we chose to use for it,
which indicates that it is a specific implementation of an unregulated,
non-IANA-registered charset.

However, what I had overlooked when making this initial analysis was
the fact that J2SE historically provided (and continues to provide) an
alias "cns11643" which maps to "x-EUC-TW". This puts more weight behind
an argument that J2SE ought to provide clear references (insofar as
possible) to normative mapping documents, and specifics of how we may
deviate from those specifications with regard to CNS-11643 <-> Unicode
mappings.
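
As an illustration of the alias relationship, the sketch below resolves both
names through the standard java.nio.charset.Charset API and prints the
canonical name and alias set. It is a sketch only: the class and helper names
are hypothetical, and because both names refer to an optional charset the
lookup is guarded with Charset.isSupported, so the output depends on the
runtime.

```java
import java.nio.charset.Charset;
import java.util.Optional;

class CharsetAliasSketch {
    /** Returns the charset for a name or alias, or empty if this runtime lacks it. */
    static Optional<Charset> lookup(String nameOrAlias) {
        return Charset.isSupported(nameOrAlias)
                ? Optional.of(Charset.forName(nameOrAlias))
                : Optional.empty();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"x-EUC-TW", "cns11643"}) {
            Optional<Charset> cs = lookup(name);
            if (cs.isPresent()) {
                // name() is the canonical name; aliases() lists the other accepted names.
                System.out.println(name + " -> " + cs.get().name()
                        + " aliases=" + cs.get().aliases());
            } else {
                System.out.println(name + " is not supported by this runtime");
            }
        }
    }
}
```

Code that must run across implementations should treat the empty case as a
normal outcome rather than an error, since nothing requires an implementation
to ship this charset or this alias.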

In addition, I determined that the J2SE 1.5.0 "x-EUC-TW" implementation
includes some 2178 additional characters which were provided to the
Solaris G11n engineering team when they updated the Solaris zh_TW.EUC
locale and iconv converter for EUC-TW for Solaris 10. These 2178
additional mappings (which I assume are part of CNS-11643 (1992) or
some draft extension being worked on by the Taiwanese standards agency
www.cmex.tw) are excluded from the list of CNS11643 <-> Unicode mappings
detailed in the Unihan 3.2.0 database. Once I understand the status of
these additional mappings I will be in a better position to make a
statement about how our current implementation conforms to published
standards. These mappings came to be added to our implementation as part
of a third-hand exchange between CMEX -> Sun Solaris -> Java Software.
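
The comparison described above could be approached roughly as follows. This is
a sketch under stated assumptions, not the procedure actually used: it assumes
the reference data is in the tab-separated Unihan form "U+XXXX<TAB>field<TAB>value"
with the CNS position given as "plane-hex" (for example the kCNS1992 field),
and that the implementation's table has been dumped to the same key format.
All class and method names are hypothetical, and the second sample entry is a
placeholder, not a real mapping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CnsMappingDiffSketch {
    /** Parses Unihan-style lines into a map from CNS position (e.g. "1-4421") to code point. */
    static Map<String, Integer> parseUnihan(Iterable<String> lines, String field) {
        Map<String, Integer> out = new TreeMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blanks and comments
            }
            String[] parts = line.split("\t");
            if (parts.length != 3 || !parts[1].equals(field)) {
                continue; // not the field we are interested in
            }
            out.put(parts[2], Integer.parseInt(parts[0].substring(2), 16));
        }
        return out;
    }

    /** Returns the entries of impl that are absent from, or differ from, reference. */
    static Map<String, Integer> notInReference(Map<String, Integer> impl,
                                               Map<String, Integer> reference) {
        Map<String, Integer> diff = new TreeMap<>();
        for (Map.Entry<String, Integer> e : impl.entrySet()) {
            if (!e.getValue().equals(reference.get(e.getKey()))) {
                diff.put(e.getKey(), e.getValue());
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        List<String> unihanLines = Arrays.asList(
                "# illustrative sample, not an extract of the real file",
                "U+4E00\tkCNS1992\t1-4421");
        Map<String, Integer> reference = parseUnihan(unihanLines, "kCNS1992");

        Map<String, Integer> impl = new TreeMap<>();
        impl.put("1-4421", 0x4E00);
        impl.put("4-2121", 0xE000); // placeholder value, not a real mapping

        System.out.println("Only in implementation: " + notInReference(impl, reference).keySet());
    }
}
```

Run against full dumps, the size of the resulting map would give a count
comparable to the "additional characters" figure mentioned above, and the keys
would identify which planes they fall in.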

>On 29 Mar 2004, at 13:25, Oleg V. Oleinik wrote:
>
>> Hello Ian,
>>
>>> Do you need to be a reviewer of the actual CCC request also?
>>
>> Yes, I do. Actually, the JCK team will review/approve this CCC request;
>> most probably, I will be the responsible engineer. I would like to ask
>> you to provide not a short description but a complete one, with:
>>    - change description and justification,
>>    - references to the standard's normative documents, including
>>      mapping files,
>>    - possibly, a specification change,
>>    - incompatibility risk evaluation.
>> I need all of these to give an adequate JCK evaluation of the
>> compatibility impact. Since the existing JCK CNS11643 test fails, there
>> is a backward incompatibility between the 1.4 and 1.5 implementations of
>> the CNS encoding, so there should be a valid reason for this change.
>>
>
>Yes, I agree these should all be part of the CCC request.
>
>> Ian, in general, it is correct that for all non-required encodings
>> implemented in J2SE RI, J2SE specification provides references
>> to normative documents describing encoding standards, mapping
>> tables and all J2SE 1.5 deviations from the standards.
>> Otherwise, how could we require that if implemented, licensee's
>> encoders/decoders pass JCK tests which are based on documents provided
>> by Unicode organization, which are not mentioned in J2SE specifications
>> and which are "not real standards" according to Masayoshi's evaluation of
>>
>> 4486307: (spec) Need to document deviation from standards in Japanese charsets:
>>
>> "The mapping tables from Unicode.org are not real standards.
>> For Japanese encodings, follow the JIS standards. See 4251698
>> for the JIS X 0208 specific issue. (The Yen Sign problem is still
>> applicable to the JIS standards. In JIS X 0201, 0x5c is Yen Sign.)
>> masayoshi.okutsu@Eng 2001-08-01"
>>
>> I would like to discuss this issue with J2SE i18n team (and you,
>> Masayoshi, Mark). Probably, we should wait for the results of the
>> discussion for you to file correct CCC request - what do you think?
>>
>
>I will try to organize such a meeting. I think it makes very good sense.
>The T&L dev team (and probably the Java i18n team also) are working
>to the important b46 promotion deadline in order to get the final pieces
>in for beta2. I'd imagine such a meeting would be unlikely to happen
>until after b46 integration, and probably promotion, had completed
>(i.e. into next week). I will try to coordinate times and give a heads-up
>to the relevant folks in order to set up such a meeting.
>
>regards,
>
>--Ian.
>
>>> Date: Mon, 29 Mar 2004 11:29:42 +0100
>>> From: Ian Little <###@###.###>
>>> Subject: Re: CNS 11643 mapping file and its origins
>>>
>>> Oleg,
>>>
>>> I will prepare a retrospective CCC to account for the CNS 11643
>>> changes introduced as part of addressing 4847097 in 1.5.0 very
>>> shortly. I'll notify you once the CCC request has been filed.
>>> I think Gauri Sharma is on the CCC committee that reviews these
>>> sorts of requests from a JCK perspective. Do you need to be
>>> a reviewer of the actual CCC request also?
>>>
>>> Best regards,
>>>
>>> --Ian.
>>>
>>> On 24 Mar 2004, at 13:26, Oleg V. Oleinik wrote:
>>>
>>>> Hello Ian,
>>>>
>>>>
>>>>> I agree that if we are not in conformance then a ccc request will
>>>>> have to be filed.
>>>>
>>>> As far as I understand the process, a CCC request should be filed
>>>> regardless of whether the implementation of CNS is Unicode 4
>>>> compatible or not, since CNS encoding functionality changes in 1.5
>>>> (it has already changed) and all such changes should be approved by
>>>> the CCC (and also by others, including the JCK team).
>>>>
>>>> I think CCC request still should be filed.
>>>>
>>>> Date: Tue, 23 Mar 2004 15:59:31 +0000
>>>> From: Ian Little <###@###.###>
>>>> Subject: Re: CNS 11643 mapping file and its origins
>>>>
>>>> Oleg,
>>>>
>>>> Based on the e-mail from Federic Zhang which says that the Solaris
>>>> native zh_TW converter is based upon Unihan-3.2.0.txt (which is
>>>> the version which is mentioned in the Unicode 4.0 specification)
>>>> it may be that the CNS 11643 Java implementation in 1.5.0
>>>> (which took mappings second hand from the Solaris zh_TW,
>>>> EUC_TW converter) may be in conformance with Unicode 4.0.
>>>>
>>>> We need to determine a couple of things.
>>>>
>>>> 1. How does the J2SE 1.5.0 EUC-TW charset implementation (updated
>>>>     as part of 4847097) compare with the mappings for CNS provided
>>>>     in Unihan-3.2.0.txt?
>>>>
>>>> and
>>>>
>>>> 2. Does CMEX provide the source mappings for the Unicode
>>>>     Unihan-x.y.z.txt mappings for CJK locales?
>>>>
>>>> I will check this with the Java i18n engineering team and the
>>>> appropriate folks within Solaris g11n who may have the answers to
>>>> these items. I agree that if we are not in conformance then a ccc
>>>> request will have to be filed.
>>>>
>>>> --Ian.
>>>>
>>>> On 23 Mar 2004, at 15:37, Oleg V. Oleinik wrote:
>>>>
>>>>> Hello Ian,
>>>>>
>>>>> Thank you for your help. I have some concerns regarding CNS11643
>>>>> implementation in J2SE 1.5, specifically:
>>>>>
>>>>> 1. Is CMEX an official source for Unicode/CNS11643 mapping tables?
>>>>> I thought that the tables you implemented were provided by the Unicode
>>>>> organization. Since this seems not to be so, a reference to the
>>>>> normative mapping table document which all J2SE 1.5 implementations
>>>>> should use for implementing the Unicode/CNS11643 encoding should be
>>>>> specified in the J2SE 1.5 specifications.
>>>>>
>>>>> Otherwise, we can not require all J2SE 1.5 implementations
>>>>> supporting CNS11643 to pass JCK test that is based on CMEX
>>>>> mapping table.
>>>>>
>>>>> J2SE 1.5 specifications say that J2SE 1.5 is based on Unicode 4,
>>>>> therefore mapping tables provided by Unicode.org are correct to use
>>>>> in J2SE implementations. However, will all J2SE 1.5 implementors use
>>>>> the same CMEX CNS11643 mapping table as our implementation does?
>>>>>
>>>>> In general, I think it would be correct to clearly specify the
>>>>> sources of the Unicode/some_encoding mappings for each encoding
>>>>> implemented in J2SE, even though the encoding is optional.
>>>>> This will allow all J2SE implementations to be compatible in
>>>>> encodings.
>>>>>
>>>>>
>>>>> 2. Could you please send a CCC request regarding CNS11643-related
>>>>> changes in J2SE 1.5? All changes in J2SE 1.5 should be approved
>>>>> by CCC, but we did not receive a CNS11643-related CCC request.
>>>>>
>>>>> Ian, what do you think?
>>>>>
>>>>>> Date: Mon, 22 Mar 2004 18:09:55 +0000
>>>>>> From: Ian Little <###@###.###>
>>>>>> Subject: Fwd: CNS 11643 mapping file and its origins
>>>>>>
>>>>>> Oleg,
>>>>>>
>>>>>> Brian has replied with a pointer to the CMEX Taiwanese site which
>>>>>> contains the latest mappings.
>>>>>>
>>>>>> I hope this is of assistance. Let me know if I can assist you
>>>>>> further in getting a conformance test prepared for validating the
>>>>>> J2SE bundled EUC-TW charset implementation.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> --Ian.
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: "Qingjiang(Brian) Yuan" <###@###.###>
>>>>>> Date: 22 March 2004 16:48:29 GMT
>>>>>> To: Ian Little <###@###.###>
>>>>>> Cc: federic zhang <###@###.###>
>>>>>> Subject: Re: CNS 11643 mapping file and its origins
>>>>>>
>>>>>> Ian,
>>>>>> We got those mapping tables from the CNS11643 committee
>>>>>> (###@###.###) last year, but are not sure whether they have
>>>>>> updated the tables or not. Their website is
>>>>>> http://www.cns11643.gov.tw/eng/index.jsp. You can go ahead and ask
>>>>>> ###@###.### for the latest mapping tables; I think both
>>>>>> Solaris and J2SE should follow the latest version if there is one.
>>>>>>
>>>>>> Thanks.
>>>>>> Brian.
>>>>>>
>>>>>> Ian Little wrote:
>>>>>>
>>>>>>> Federic (or Brian) :
>>>>>>>
>>>>>>> In J2SE 1.5.0 (tiger) quite a while back I added support for
>>>>>>> additional CNS planes within our EUC-TW charset implementation
>>>>>>> (4847097). This work, as you will recall, was triggered by the
>>>>>>> Solaris enhancements tracked within bugID 4721967. At the time
>>>>>>> Federic (or perhaps Brian) forwarded me a header file cns_utf.h
>>>>>>> which contained the mappings for the additional planes. I used
>>>>>>> this to generate the mapping lookup indices within our revised
>>>>>>> EUC-TW implementation.
>>>>>>>
>>>>>>> At this point in time the JCK engineering team, mostly based out of
>>>>>>> Novosibirsk/Siberia, is requesting to update an existing test which
>>>>>>> fails because some characters once assigned within plane 3 of
>>>>>>> CNS-11643 were reassigned to plane 4, etc. They have requested
>>>>>>> that I give them a pointer to the standards which contain the
>>>>>>> official mappings so that they can revise the JCK test to conform
>>>>>>> with the published standard CNS-11643 (1992).
>>>>>>>
>>>>>>> My query to you is: can you send me details of how cns_utf.h was
>>>>>>> devised and which source standards the mappings contained within
>>>>>>> that header file came from? Getting this information will be very
>>>>>>> useful for the JCK engineers, who need to see the standards and how
>>>>>>> our mappings conform (or not) with them. I know that our J2SE
>>>>>>> EUC-TW charset implementation is now in sync with the native iconv
>>>>>>> implementation on Solaris. The question is what is the origin
>>>>>>> of the mappings used to guide the Solaris implementation.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> --Ian


======================================================================
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX: tiger-rc
FIXED IN: tiger-rc
INTEGRATED IN: tiger-b59 tiger-rc
VERIFIED IN: 1.5.0_01

18-09-2004

EVALUATION

Name: ooR10006 Date: 12/11/2003
The tests will need to be fixed.
======================================================================
Development engineering has decided to make the unspecified behavior of optional charsets even more explicit.
###@###.### 2004-06-26

26-06-2004

PUBLIC COMMENTS
.
###@###.### 2003-12-05

05-12-2003