JDK-6790402 : Speed-up FastCharsetProvider
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 6u10
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2009-01-06
  • Updated: 2017-04-17
Description
FULL PRODUCT VERSION :
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) Client VM (build 11.0-b16, mixed mode, sharing)

ADDITIONAL OS VERSION INFORMATION :
Windows XP SR-2

A DESCRIPTION OF THE PROBLEM :
This is a MIXED list of BUGs and RFEs regarding performance of sun.nio.cs.FastCharsetProvider !

  Bug 1:
There is a shortcut to enhance VM startup in line 94:
	if (cln.equals("US_ASCII")) {
	    cs = new US_ASCII();
            ...
As it seems that current charset for startup is UTF-8, the shortcut should be updated to UTF-8.
Also check this in sun.io.Converters, where ISO-8859-1 is yet defined for defaultEncoding.

  Bug 2:
Method canonicalize() is invoked 2 times in sequence after invoking charsetForName() (lines 119 -> 82).
This unnecessarily consumes performance.

  Bug 3:
Method lookup() should be synchronized to insure lock also for invokation from charsets() iteration; synchronization could be saved for method charsetForName().

RFE 4:
After unsuccessful lookup in cache, there is a 2nd lookup in classMap to check support for given charset name:
	// Check cache first
	Charset cs = cache.get(csn);
	if (cs != null)
	    return cs;
	// Do we even support this charset?
	String cln = classMap.get(csn);
	if (cln == null)
	    return null;
This could be enhanced by initializing the cache with a NULL Charset object. So there is only need for 1 single map lookup:
        Charset cs = cache.get(csn);
        if (cs == null) // Don't we even support this charset?
            return null;
        if (cs != NULL)
            return cs;
so ...

RFE 5:
Class name lookup can be done after startup charset check, if latter is done by canonical name comparison:
        if (csn.equals("us-ascii"))

RFE 6:
In method toLower() s shouldn't be looped twice after 'toLower' becomes false, and copying of lower chars could be saved:
(copying String's internal char[] shouldn't be slower than new char[], as it also needs to fill initial content with 0's)
    private static String toLower(final String s) {
        boolean allLower = true;
        char[] ca = null;
        for (int i=0; i<s.length(); i++) {
            int c = s.charAt(i);
            if (((c - 'A') | ('Z' - c)) >= 0) {
                if (allLower) {
                    ca = s.toCharArray();
                    allLower = false;
                }
                ca[i] += '\u0020';
            }
        }
        return allLower ? s : new String(ca);
    }


RFE 7:
charsets() iterator must not canonicalize each charset name, so canonicalizing can be moved from lookup() to charsetForName().

See my correction here:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/FastCharsetProvider.java?diff_format=s&rev=552&view=markup


RFE 8:
classMap could be saved. Instead pre-load the cache initially with the class names. This would save 1 HashMap lookup to get the class name of a charset to be instantiated:
    In constructor do ...
        cache = classNames; // type: Map<String,Object>

RFE 9:
Pre-load the VM's startup charset. This would save the check for each lookup:
        private final static Charset STARTUP_CHARSET = new UTF_8();
        cache.put(toLower(STARTUP_CHARSET.name()), STARTUP_CHARSET);

All this would result in very simple lookup:

    private synchronized Charset lookup(String lowCanonical) {
        // Check cache first
        Object o = cache.get(lowCanonical); // TODO: maybe also cache by arbitrary alias
        if (o instanceof String) {
            // Instantiate new charset
            Charset cs = newCharset((String)o);
            // Cache it
            if (cs != null)
                cache.put(lowCanonical, cs);
            return cs;
        }
        return (Charset)o;
    }
    Charset newCharset(String className) {
        try {
            Class c = Class.forName(packagePrefix+"."+className,
                                    true, getClass().getClassLoader());
            return (Charset)c.newInstance();
        }
        catch (ClassNotFoundException x) {}
        catch (IllegalAccessException x) {}
        catch (InstantiationException x) {}
        return null;
    }

See my correction here:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/FastCharsetProvider.java?diff_format=s&rev=555&view=markup


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
For complete source code see:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/FastCharsetProvider.java?diff_format=s&rev=551&view=markup
---------- END SOURCE ----------
Comment copied from bugzilla report http://bugs.openjdk.java.net/show_bug.cgi?id=100092#c0
Description From UlfZibis 2009-07-19 12:12:28 PDT

Created an attachment (id=114) [details]
webrev including patch

Maybe following would be suitable for the mercurial changeset summary:

6790402 Speed-up FastCharsetProvider
Summary: Charset for startup should be UTF-8;
         Invoke canonicalize() only once after invoking charsetForName();
         Method lookup() should be synchronized;
         Enhance lookup by initializing cache with a NULL Charset object;
         Class name lookup can be done after startup charset check,
             if is done by canonical name comparison;
         Method toUpper() would be potentially faster than toLower();
         charsets() iterator must not canonicalize each charset name;
         classMap could be saved;
         Pre-load the VM's startup charset;
         Moving file standard-charsets to more appropriate location;
         Prepare AbstractCharsetProvider, ExtendedCharsets for better
             performance and flexibility;
Contributed-by: Ulf.Zibis at CoSoCo.de

Comments
An older webrev to fix the issue.
17-04-2017

SUGGESTED FIX See attached webrev from OpenJDK bugzilla.
23-07-2012

EVALUATION combine the "cm" and "c" maps is an interesting idea.
06-01-2009