Bug ID: JDK-6265809 Huge String.toLowerCase() performance regression

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.


Other	JDK 6
5.0u7 (Fixed)	6 b52 (Fixed)

See the email log for more information from the customer.

A DESCRIPTION OF THE REQUEST :
String.toLowerCase() has a huge performance regression in JDK 1.5.0_x.

toLowerCase() internally calls String.intern() on the Locale's language, presumably to avoid needing String.equals() when comparing it against a few strings of length no more than 2. This use of intern() is a performance disaster. [If you are curious, check out the native JDK source code for String.intern() and see how it oscillates multiple times between native code and Java code, only to end up using a plain vanilla HashMap on the Java heap for its interned Strings.]
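
The reasoning behind the intern() call can be illustrated with a minimal sketch. It is not taken from the JDK sources; the class name InternIdentityDemo and the language code "tr" are chosen here purely for illustration. It shows that two equal strings need not be the same object, so an identity comparison (==) against a literal is only reliable once the string has been interned, whereas equals() compares contents directly.

```java
// Illustrative sketch only; not taken from the JDK sources.
final class InternIdentityDemo {

    /** Identity comparison: true only if s is the very same object as the literal "tr". */
    static boolean sameObjectAsLiteral(String s) {
        return s == "tr";
    }

    public static void main(String[] args) {
        // A freshly constructed String is equal to, but distinct from, the literal.
        String fresh = new String("tr");

        System.out.println(sameObjectAsLiteral(fresh));          // identity on a fresh copy
        System.out.println(sameObjectAsLiteral(fresh.intern())); // identity after intern()
        System.out.println("tr".equals(fresh));                  // content comparison
    }
}
```

Run as written, this prints false, true, true: the fresh copy fails the identity test until it is interned, while equals() succeeds without any interning.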


JUSTIFICATION :
Try running some heavy string tokenization code that uses String.toLowerCase() for case insensitivity, or simply a loop that makes many millions of toLowerCase() calls. The JDK 1.5 profiler shows that 80% of the time is spent in String.intern(), which is called by String.toLowerCase().
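
A loop of the kind described might look like the following sketch. The class name LowerCaseLoop, the default iteration count, and the warm-up pass are assumptions made for illustration; this is not the test case referred to elsewhere in this report. Passing a mixed-case input such as "Hello" matters, because an input that is already lower-case may be returned without reaching the locale-dependent code.

```java
// Hypothetical micro-benchmark; class and method names are illustrative only.
final class LowerCaseLoop {

    /**
     * Lower-cases input the given number of times and returns the summed
     * result lengths, so the calls cannot be optimized away as dead code.
     */
    static long run(String input, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += input.toLowerCase().length();
        }
        return total;
    }

    public static void main(String[] args) {
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 5_000_000;
        String input = args.length > 1 ? args[1] : "Hello";

        run(input, iterations / 10); // rough warm-up pass before timing

        long start = System.nanoTime();
        long total = run(input, iterations);
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.println("secs=" + secs + " (checksum " + total + ")");
    }
}
```

A hand-rolled timing loop like this gives only a rough indication; results vary with the JVM, its flags, and the warm-up.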

This is unacceptable, in particular since the JDK 1.4.x line never had this problem.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The fix is to revert to using equals() to compare the Locale's language or, even better, to have the Locale or the default Locale intern the language (the assumption being that Locales are infrequently constructed).
ACTUAL -
see problem description
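
The equals()-based comparison proposed under EXPECTED could be sketched as follows. The helper name hasSpecialCasing is hypothetical, and the language list (tr, az, lt) is an assumption based on the languages that have language-specific rules in the Unicode special-casing data; whether the JDK checks exactly these is not stated in this report.

```java
import java.util.Locale;

// Hypothetical helper; the name and the language list are assumptions for illustration.
final class LocaleLanguageCheck {

    /** Returns true if the locale's language is in the assumed special-casing list. */
    static boolean hasSpecialCasing(Locale locale) {
        String lang = locale.getLanguage();
        // Content comparison: correct whether or not lang happens to be interned.
        return "tr".equals(lang) || "az".equals(lang) || "lt".equals(lang);
    }

    public static void main(String[] args) {
        System.out.println(hasSpecialCasing(Locale.forLanguageTag("tr")));
        System.out.println(hasSpecialCasing(Locale.ENGLISH));
    }
}
```

Comparing two-character strings with equals() needs no lookup in the intern table, which is the cost the report objects to.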

---------- BEGIN SOURCE ----------
class test
{
    public static void main(String[] args)
    {
        String s = "";
        long stime = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            s += "hello + i".toLowerCase();
        }
        System.out.println("time = " + (System.currentTimeMillis() - stime));
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
See expected behaviour
###@###.### 2005-05-05 10:30:14 GMT

EVALUATION

The test case in the description has nothing to do with intern(). It lower-cases the string "hello + i", which is already lower-case, and toLowerCase() checks for that before it even looks at the locale. What the test primarily measures is the performance of string concatenation.

The test case provided by hoschekw in the JDC comments is more useful. Using that test case and the string "Hello", I see a performance gain of about 45%/40% (server/client) by removing the intern() calls, or of about 38%/25% by using equals() instead of intern(). However, it's also clear that the slowdown for the string "hello" (which is already lower-case) cannot be attributed to the intern() calls. This needs to be investigated separately.

I confirmed that performance improved after deleting the unnecessary intern() calls. Will be fixed in Mustang.

Performance improved after deleting intern(), but Mustang's modified version is still slower than 1.4.2_X:

    1.4.2_09/solaris-i586/bin/java TestDateTime1 30000000 hello          secs=3.519
    1.4.2_09/solaris-i586/bin/java TestDateTime1 30000000 Hello          secs=8.288
    solaris-i586.withIntern/bin/java TestDateTime1 30000000 hello        secs=8.149
    solaris-i586.withIntern/bin/java TestDateTime1 30000000 Hello        secs=24.672
    solaris-i586.withoutIntern/bin/java TestDateTime1 30000000 hello     secs=7.162
    solaris-i586.withoutIntern/bin/java TestDateTime1 30000000 Hello     secs=11.667

I am treating this bug report as covering the intern() fix only, and will file another bug for further performance improvement.

27-07-2005
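
To put the timings quoted in the evaluation above into relative terms, the arithmetic can be sketched as below. The six values are copied from the evaluation; the class and method names are illustrative, and nothing here is measured afresh.

```java
import java.util.Locale;

// Arithmetic on the timings quoted in the evaluation; names are illustrative.
final class TimingComparison {

    /** Percentage of run time saved when going from beforeSecs to afterSecs. */
    static double percentSaved(double beforeSecs, double afterSecs) {
        return (beforeSecs - afterSecs) / beforeSecs * 100.0;
    }

    /** How many times slower secs is than baselineSecs. */
    static double slowdownFactor(double secs, double baselineSecs) {
        return secs / baselineSecs;
    }

    public static void main(String[] args) {
        // Seconds for 30000000 iterations, copied from the evaluation.
        double baseLower = 3.519, baseMixed = 8.288;        // 1.4.2_09
        double internLower = 8.149, internMixed = 24.672;   // with intern()
        double fixedLower = 7.162, fixedMixed = 11.667;     // without intern()

        System.out.println(String.format(Locale.ROOT,
                "Hello: %.1f%% saved by removing intern(), %.2fx slower than 1.4.2_09",
                percentSaved(internMixed, fixedMixed), slowdownFactor(fixedMixed, baseMixed)));
        System.out.println(String.format(Locale.ROOT,
                "hello: %.1f%% saved by removing intern(), %.2fx slower than 1.4.2_09",
                percentSaved(internLower, fixedLower), slowdownFactor(fixedLower, baseLower)));
    }
}
```

On these figures, removing intern() saves roughly 53% of the run time for "Hello" and roughly 12% for "hello", while the modified build remains about 1.4 and 2.0 times slower than 1.4.2_09 respectively.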

EVALUATION

The implementation of Locale in the JRE does in fact intern the language string, so the call in String.toLowerCase is not necessary. If we don't want to rely on this implementation detail, we can also check the string contents directly.
###@###.### 2005-05-05 17:14:08 GMT

Changed to "bug" as suggested by an SDN comment - it's not clear why this was filed as an RFE.
###@###.### 2005-05-28 00:34:55 GMT

05-05-2005