Bug ID: JDK-4688797 [Col] Collator has problems with Turkish Locale and SECONDARY or PRIMARY strength

Type: Bug
Component: globalization
Sub-Component: translation
Affected Version: 1.4.0,6

Priority: P3
Status: Resolved
Resolution: Fixed
OS: windows_2000,windows_xp
CPU: x86

Submitted: 2002-05-21
Updated: 2006-08-18
Resolved: 2006-08-18

JDK 6
6 b96 Fixed

Name: nt126004			Date: 05/21/2002


FULL PRODUCT VERSION :
java version "1.4.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-b92)
Java HotSpot(TM) Client VM (build 1.4.0-b92, mixed mode)

FULL OPERATING SYSTEM VERSION :
Microsoft Windows 2000 [Version 5.00.2195]

A DESCRIPTION OF THE PROBLEM :
Turkish has 2 unique letter pairs:
'\u0130' & 'i' ('İ' & 'i'), which correspond to
English 'I' & 'i',
and
'I' & '\u0131' ('I' & 'ı'), which don't exist as
letters in English and represent the back-vowel pair of
English 'I' & 'i'.

If you didn't get them above, you can check them out at:
http://www.prustinteractive.com/toolbox/font/

In other words, English I i are both with a dot in Turkish,
and the back-vowel versions of them are both dotless.

  From the API it appears that either:
langCollator.setStrength(Collator.PRIMARY);
or
langCollator.setStrength(Collator.SECONDARY|Collator.CANONICAL_DECOMPOSITION);
or
langCollator.setStrength(Collator.SECONDARY);

should be capturing the difference between the 2 pairs, but
none does.

All combinations involving PRIMARY and SECONDARY fail to
distinguish between the dotted and the dotless letters. The
only thing that gets both of them to compare != 0 is TERTIARY
or (a logical | with) Collator.FULL_DECOMPOSITION. But the
moment I do that, I am no longer able to ignore case.
Besides, the Collator still treats the 2 pairs as the same
letter and, when sorting, mingles the words starting with
any of them.
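
One detail that may bear on the observations above: in java.text.Collator, strength and decomposition mode are separate properties, set through setStrength and setDecomposition, and their constants occupy the same small integer range. The sketch below shows what the OR-ed expressions evaluate to and how the two properties can be set independently. The class and method names are illustrative and not from the report, and whether this fully accounts for the reported behaviour is an assumption.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative sketch; the class and method names are hypothetical.
class StrengthVsDecomposition {

    // Value of the expression passed to setStrength in the report.
    // SECONDARY is 1 and CANONICAL_DECOMPOSITION is 1, so the result is 1.
    static int secondaryOrCanonical() {
        return Collator.SECONDARY | Collator.CANONICAL_DECOMPOSITION;
    }

    // SECONDARY is 1 and FULL_DECOMPOSITION is 2, so the result is 3.
    static int secondaryOrFull() {
        return Collator.SECONDARY | Collator.FULL_DECOMPOSITION;
    }

    // Sets strength and decomposition through their own setters.
    static Collator configure(Locale locale) {
        Collator coll = Collator.getInstance(locale);
        coll.setStrength(Collator.SECONDARY);
        coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return coll;
    }

    public static void main(String[] args) {
        // Same value as plain SECONDARY.
        System.out.println(secondaryOrCanonical() == Collator.SECONDARY);
        // Same value as IDENTICAL, which is stricter than TERTIARY.
        System.out.println(secondaryOrFull() == Collator.IDENTICAL);

        Collator coll = configure(Locale.forLanguageTag("tr-TR"));
        System.out.println(coll.getStrength() + " " + coll.getDecomposition());
    }
}
```

If this reading is right, OR-ing in FULL_DECOMPOSITION raises the strength to IDENTICAL, which would be consistent with case no longer being ignored.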

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Use the source code below
2. Compare results

EXPECTED VERSUS ACTUAL BEHAVIOR :
In the source code below:
1) should be != 0
2) should be == 0

Actual:
1) == 0
It can be made != 0 with TERTIARY or FULL_DECOMPOSITION,
but then 2) becomes != 0.
Also, the 2 letter pairs are treated as 1 pair in sorting.
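
The sorting symptom can be observed with a small check like the one below. This is a minimal sketch: the class name, the helper method, and the sample words are all made up for illustration, and the resulting order depends on the JDK release, so no expected output is given.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch; the class name, method name, and word list are hypothetical.
class TurkishSortCheck {

    // Returns a sorted copy of the words, using the locale's collator
    // at the given strength.
    static List<String> sortWith(Locale locale, int strength, List<String> words) {
        Collator coll = Collator.getInstance(locale);
        coll.setStrength(strength);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(coll);
        return sorted;
    }

    public static void main(String[] args) {
        // Sample words starting with h, dotless i, dotted i, and j.
        List<String> words = Arrays.asList(
                "\u0131l\u0131k", "ilk", "Irmak", "\u0130zmir", "hal", "jet");
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // If the two pairs are separate letters, the dotless words should
        // all come before the dotted ones rather than being interleaved.
        System.out.println(sortWith(turkish, Collator.SECONDARY, words));
    }
}
```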

getRules() returns a string identical to that for the US
Locale, which might be the root of the problem.
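
A quick way to repeat the getRules() comparison is sketched below. It assumes that Collator.getInstance returns a RuleBasedCollator for both locales, which is not guaranteed by the API; the class and method names are illustrative only.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Illustrative sketch; the class and method names are hypothetical.
class RulesCompare {

    // Returns the rule string for a locale, assuming a rule-based collator.
    static String rulesFor(Locale locale) {
        Collator coll = Collator.getInstance(locale);
        if (!(coll instanceof RuleBasedCollator)) {
            throw new IllegalStateException("Not a RuleBasedCollator: " + locale);
        }
        return ((RuleBasedCollator) coll).getRules();
    }

    static boolean sameRules(Locale a, Locale b) {
        return rulesFor(a).equals(rulesFor(b));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // true would mean the Turkish collator carries no tailoring of its own.
        System.out.println(sameRules(turkish, Locale.US));
    }
}
```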

This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.text.*;
import java.util.*;

public class collate {
  public static void main(String args[])
  {
    Collator coll = Collator.getInstance(new Locale("tr", "TR"));
    //workaround place

    // compare() returns an int; 0 means the strings are considered equal.
    coll.setStrength(Collator.TERTIARY);
    System.out.println(coll.compare("a","A"));// != 0 (case is significant)
    coll.setStrength(Collator.SECONDARY);
    System.out.println(coll.compare("a","A"));// == 0 (case is ignored)

    coll.setStrength(Collator.SECONDARY);
    System.out.println(coll.compare("\u0131","i"));//1) should be != 0
    System.out.println(coll.compare("\u0130","i"));//2) should be == 0

    coll.setStrength(Collator.PRIMARY);
    System.out.println(coll.compare("a","\u00e0"));// == 0 (accent is ignored)

    coll.setStrength(Collator.IDENTICAL);
    System.out.println(coll.compare("a","b"));// != 0

    CollationKey key1 = coll.getCollationKey("abc");
    CollationKey key2 = coll.getCollationKey("def");
    System.out.println(key1.compareTo(key2));// != 0
  }
}

---------- END SOURCE ----------

CUSTOMER WORKAROUND :
The line indicated above as workaround place should be
replaced with:

RuleBasedCollator tr_Collator = null;
try {
  tr_Collator = new RuleBasedCollator(
      "<a,A<b,B<c,C<?,?<d,D<e,E<f,F<g,G<\u011f,\u011e<?,?<h,H<?;\u0131,I<i,\u0130;?<j,J" +
      "<k,K<l,L<m,M<n,N<o,O<?,?<p,P<r,R<s,S<\u015f,\u015e<?,?<t,T<u,U<?,?<v,V<y,Y<z,Z<'-'<' '<q,Q<w,W<x,X");
} catch (ParseException ex) {
  ex.printStackTrace();
}
// this line is optional, as the rule ensures a letter-level difference
tr_Collator.setStrength(Collator.SECONDARY|Collator.CANONICAL_DECOMPOSITION);
/*
letters ?,?, ?, ?, ?,? are not part of the Turkish alphabet,
but are ASCII correspondences, and are included in an
attempt to provide for their ordering as well under
CANONICAL_DECOMPOSITION. Letters q,Q, w,W, x,X are not part
of the Turkish alphabet, so they follow Z.
Note: while the spec says "All non-mentioned Unicode characters
are at the end of the collation order. ", my '?' characters
(included only for testing) got ranked at the end of the
a-words, not after 'Z' or 'X'. That might be another bug,
but one that won't concern most users of the Turkish version
of Collator.
*/
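
A narrower variant of this workaround is sketched below: instead of listing the whole alphabet, it appends a single tailoring for the two I pairs to the rules of an existing collator, following the pattern shown in the RuleBasedCollator documentation. The rule string, the choice of the US rules as a base, and the class and method names are assumptions made for illustration; this has not been checked against every JDK release.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Illustrative sketch; the class name, method name, and rule string are hypothetical.
class TurkishITailoring {

    // Assumed tailoring: the dotless pair sorts right after h as its own
    // letter, followed by the dotted pair as a separate letter. Within each
    // pair the two forms differ only at the tertiary (case) level.
    static final String I_RULES = "& h < \u0131 , I < i , \u0130";

    static RuleBasedCollator build() {
        // Assumes the US collator is rule-based, as in the API documentation example.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.US);
        try {
            RuleBasedCollator tailored = new RuleBasedCollator(base.getRules() + I_RULES);
            tailored.setStrength(Collator.SECONDARY);
            return tailored;
        } catch (ParseException ex) {
            throw new IllegalStateException("Tailoring rules could not be parsed", ex);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator coll = build();
        // Case 1) from the report: dotless i versus dotted i, intended to be != 0.
        System.out.println(coll.compare("\u0131", "i"));
        // Case 2) from the report: capital dotted I versus i, intended to be == 0.
        System.out.println(coll.compare("\u0130", "i"));
    }
}
```

Appending to an existing rule set keeps the ordering of all other characters, so letters such as q, w, and x stay where the base collator puts them rather than moving after z.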
(Review ID: 146774) 
======================================================================
###@###.### 10/14/04 00:39 GMT

EVALUATION Contribution forum : https://jdk-collaboration.dev.java.net/servlets/ProjectForumMessageView?forumID=1463&messageID=14642

06-08-2006

EVALUATION I have printed out the collation rules for java version "1.4.2_09" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_09-b05) Java HotSpot(TM) Client VM (build 1.4.2_09-b05, mixed mode) and found that the requested "I" set rules are there: sorting after H we have i-dotless, I-dotless, i-dotted, I-dotted. The same collator rules are used for all versions up to Mustang. The bug was logged against 1.4.0-b92; is that release still under development/maintenance? Or should we suggest upgrading to 1.4.2_09-b05? I'm going to log a separate SR to handle the "Q,W,X" collation and close this one as not reproducible in current versions of J2SE.

26-09-2005

EVALUATION There is a problem preventing further evaluation: several special characters, probably including the "I"-related ones, have been converted into question marks ("?"). However, according to the CLDR http://unicode.org/cldr/repository/common/collation/tr.xml?rev=1.18&content-type=text/vnd.viewcvs-markup the collator definition for tr needs to be modified at least for the "I" set and the letters that do not exist in Turkish (Q,W,X). I'd like to ask the submitter to supply the official collation specification (one from the language institute would be nice) in order to fully fix the collation problems, or at least the ASCII-encoded string used in the workaround example. *** (#1 of 1): [ UNSAVED ] ###@###.###

15-09-2005