JDK-8041791 : String.toLowerCase regression - violates Unicode standard
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 8
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2014-04-25
  • Updated: 2022-08-08
  • Resolved: 2014-05-14
Fixed in:
  • JDK 7: 7u76
  • JDK 8: 8u20
  • JDK 9: 9 b14
Description
The change JDK-8020037 "String.toLowerCase incorrectly increases length, if string contains \u0130 char" seems to be wrong, according to my reading of the Unicode standard.

The title "String.toLowerCase incorrectly increases length" assumes that the length increase is a problem, but it is not. The documentation specifically says "Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String."

I look at http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt and see:

# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
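
For readers unfamiliar with the file layout: each data line in SpecialCasing.txt lists the code point, then its lowercase, titlecase, and uppercase mappings, separated by semicolons, with an optional condition list and a trailing comment. The following is only a minimal sketch of reading the entry quoted above; the class and method names (SpecialCasingEntry, decode, parse) are made up for illustration and are not part of the JDK.

```java
public class SpecialCasingEntry {
    // Converts a space-separated list of hex code points (e.g. "0069 0307") to a String.
    static String decode(String hexCodePoints) {
        StringBuilder sb = new StringBuilder();
        for (String hex : hexCodePoints.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(hex, 16));
        }
        return sb.toString();
    }

    // Splits one unconditional data line into {code, lower, title, upper},
    // dropping the trailing "#" comment. Hypothetical helper, not a JDK API.
    static String[] parse(String line) {
        String data = line.split("#", 2)[0];
        String[] fields = data.split(";");
        String[] out = new String[4];
        for (int i = 0; i < 4; i++) {
            out[i] = decode(fields[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE";
        String[] m = parse(line);
        System.out.println("code  length: " + m[0].length());
        System.out.println("lower length: " + m[1].length());
        System.out.println("title length: " + m[2].length());
        System.out.println("upper length: " + m[3].length());
    }
}
```

The lowercase field decodes to two chars (U+0069 followed by U+0307), while the titlecase and uppercase fields map U+0130 to itself.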

My understanding of the SpecialCasing.txt entry is that in all locales *except* the ones handled specially (which are 'az', 'lt', and 'tr') we should bi-directionally convert "\u0130" <-> "\u0069\u0307".
That is, lowercasing "\u0130" should result in "\u0069\u0307";
converting "\u0069\u0307" to uppercase or titlecase should yield "\u0130".

Note this allows round-trip conversions, which is why it is specified this way.

Java 7 correctly does the former conversion, but not the latter.
Java 8 does neither.
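
A small program along the following lines can be used to check the behavior on a given JDK. It is only a sketch: the class name DottedCapitalI and the hex helper are made up for illustration, and results are printed as code points so they do not depend on the console encoding. Locale.ROOT stands in for "a locale that is not az, lt, or tr".

```java
import java.util.Locale;

public class DottedCapitalI {
    // Formats each char of s as U+XXXX so the output is readable in any console.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dottedI = "\u0130";  // LATIN CAPITAL LETTER I WITH DOT ABOVE
        Locale turkish = Locale.forLanguageTag("tr");

        // Lowercasing outside the specially handled locales.
        System.out.println("ROOT lower: " + hex(dottedI.toLowerCase(Locale.ROOT)));
        // Lowercasing in a Turkic locale, which SpecialCasing.txt handles separately.
        System.out.println("tr lower:   " + hex(dottedI.toLowerCase(turkish)));
        // The reverse direction discussed above; printed without asserting a result.
        System.out.println("ROOT upper of i + U+0307: " + hex("i\u0307".toUpperCase(Locale.ROOT)));
    }
}
```

On a release that contains this fix, the first line should show U+0069 U+0307 and the Turkish line should show U+0069. On an affected Java 8 build the first line is expected to differ, per the description above.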

Comments
UR SQE tested the fix in 8u20. No objections to taking the fix into PSU15_01.
17-11-2014

Critical request:
  • Justification: this problem significantly affects the user experience in the Turkish locale.
  • Risk Analysis: Low, the fix is pretty simple.
  • Webrev: http://cr.openjdk.java.net/~naoto/8041791/jdk9/webrev.00/
  • Testing (done/to-be-done): Automatic regression test is included.
  • Back ports (done/to-be-done): Done
  • FX Impact: No
17-11-2014

URL: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/6d8b6c20a32b User: lana Date: 2014-05-21 18:41:42 +0000
21-05-2014

URL: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/6d8b6c20a32b User: naoto Date: 2014-05-14 17:53:19 +0000
14-05-2014

The Description is correct. Refer to the Unicode Standard 6.2 Core Specification, Section 5.18 "Case Mappings", pp. 173-174.
09-05-2014