JDK-4740135 : LTP: Severe performance bottleneck in XMLEncoder: String.getBytes(encoding)
  • Type: Bug
  • Component: client-libs
  • Sub-Component: java.beans
  • Affected Version: 1.4.1
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2002-08-30
  • Updated: 2006-12-12
  • Resolved: 2006-12-12
Description
Name: nt126004			Date: 08/30/2002


FULL PRODUCT VERSION :
java version "1.4.1-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-rc-b19)
Java HotSpot(TM) Client VM (build 1.4.1-rc-b19, mixed mode)

FULL OPERATING SYSTEM VERSION :
Microsoft Windows XP [Version 5.1.2600]

ADDITIONAL OPERATING SYSTEMS :
This problem is platform-neutral.


A DESCRIPTION OF THE PROBLEM :
As the size of the object graph gets large, the performance of
XMLEncoder degrades significantly. I use XMLEncoder to
save data in my application, and archiving an object graph
took about 18 seconds, which was unacceptable.

Profiling revealed that the main cause is in a very odd
place in the XMLEncoder. Please see the following code
snippet from java.beans.XMLEncoder:

    private void writeln(String exp) {
        try {
            for(int i = 0; i < indentation; i++) {
                out.write(' ');
            }
            out.write(exp.getBytes(encoding));
            out.write(" \n".getBytes());
        }
        catch (IOException e) {
            getExceptionListener().exceptionThrown(e);
        }
    }

The "exp.getBytes(encoding)" was proved to be occupying
most of execution time. We created a modified version of
XMLEncoder that uses a custom version of UTF-8 encoder. By
doing so, we could reduce the archiving time to 1 sec.

It's critical problem. We use XMLEncoder to save design of
visual content composed of java.awt.Components and other
Java objects. Even with a small content, the performance
is unacceptable.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Create a test program that encodes a relatively large
graph. It is enough if it takes over 15 seconds on your machine.
(A rough sketch follows this list.)
2. Profile it.
3. See what is consuming the execution time.
4. Replace getBytes with a custom, simple UTF-8 encoder.
5. Profile it again and compare with the previous result.
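For illustration only (the submitter's actual test program is in the attached jar), here is a rough
sketch of the kind of harness step 1 describes; the class names EncodeTimingTest and Node, the
element count, and the output file name are invented for this example:

    import java.beans.XMLEncoder;
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.util.ArrayList;
    import java.util.List;

    public class EncodeTimingTest {

        // Simple JavaBean so XMLEncoder can archive it through its public properties.
        public static class Node {
            private String name;
            private List children = new ArrayList();

            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
            public List getChildren() { return children; }
            public void setChildren(List children) { this.children = children; }
        }

        public static void main(String[] args) throws Exception {
            // Build a relatively large graph; raise the count until archiving
            // takes long enough on the test machine to profile meaningfully.
            Node root = new Node();
            root.setName("root");
            for (int i = 0; i < 50000; i++) {
                Node child = new Node();
                child.setName("child-" + i);
                root.getChildren().add(child);
            }

            long start = System.currentTimeMillis();
            XMLEncoder encoder = new XMLEncoder(
                    new BufferedOutputStream(new FileOutputStream("graph.xml")));
            encoder.writeObject(root);
            encoder.close();
            System.out.println("Archiving took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

Profiling such a run (steps 2 and 3) should show the time spent under String.getBytes(encoding)
inside XMLEncoder.writeln.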

EXPECTED VERSUS ACTUAL BEHAVIOR :
The two results should be significantly different. In my
case, I measured archiving times of 18 seconds and 1 second for
the same graph.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
The following code demonstrates the performance difference between
the original and modified versions of XMLEncoder. Statement2
and NameGenerator2 are identical to java.beans.Statement and NameGenerator.

Compile with javac -source 1.4 Test.java.
Run java Test.


**** see attached jar file

---------- END SOURCE ----------

CUSTOMER WORKAROUND :
Create an enhanced version of XMLEncoder, extending java.beans.Encoder,
based on the original source code.
(Review ID: 163565) 
======================================================================

Comments
EVALUATION We should not use a custom UTF-8 encoder, because we need to support other encodings (see 4625418). Possibly we should use a Writer instead of an OutputStream.
26-10-2006
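
As a hedged illustration of the Writer-based idea in the evaluation above, the following sketch wraps
the output stream once in an OutputStreamWriter so the charset encoder is reused across lines instead
of calling String.getBytes(encoding) for every line; the class name LineOutput and its fields are
hypothetical and not part of XMLEncoder:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;

    // Hypothetical helper, not the actual XMLEncoder code.
    class LineOutput {
        private final BufferedWriter writer;
        private int indentation = 0;

        LineOutput(OutputStream out, String encoding) throws IOException {
            // The Writer (and the charset encoder inside it) is created once and
            // reused for every line, instead of paying for a fresh encoding pass
            // and byte[] allocation in String.getBytes(encoding) on each call.
            this.writer = new BufferedWriter(new OutputStreamWriter(out, encoding));
        }

        void writeln(String exp) {
            try {
                for (int i = 0; i < indentation; i++) {
                    writer.write(' ');
                }
                writer.write(exp);
                writer.write(" \n");
            } catch (IOException e) {
                e.printStackTrace(); // the real class would notify its ExceptionListener
            }
        }

        void flush() throws IOException {
            writer.flush(); // push any buffered characters through to the stream
        }
    }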

CONVERTED DATA BugTraq+ Release Management Values COMMIT TO FIX: dragon
28-08-2004

SUGGESTED FIX
*** /tmp/geta26116	2003-08-07 17:04:35.000000000 -0700
--- XMLEncoder.java	2003-08-07 16:23:03.000000000 -0700
***************
*** 481,497 ****
      private void writeln(String exp) {
          try {
              for(int i = 0; i < indentation; i++) {
!                 out.write(' ');
              }
!             out.write(exp.getBytes(encoding));
!             out.write(" \n".getBytes());
          }
          catch (IOException e) {
              getExceptionListener().exceptionThrown(e);
          }
      }
  
      private void outputValue(Object value, Object outer, boolean isArgument) {
          // System.out.println("outputValue: " + instanceName(value));
          if (value == null) {
--- 481,539 ----
      private void writeln(String exp) {
          try {
+             StringBuffer buf = new StringBuffer();
              for(int i = 0; i < indentation; i++) {
!                 buf.append(' ');
              }
!             buf.append(exp);
!             buf.append(" \n");
!             // Should support other encodings if the spec requires.
!             writeUTF(buf.toString());
          }
          catch (IOException e) {
              getExceptionListener().exceptionThrown(e);
          }
      }
  
+     /**
+      * A custom UTF-8 encoder.
+      */
+     private void writeUTF(String str) throws IOException {
+         int strlen = str.length();
+         int utflen = 0;
+         char[] charr = new char[strlen];
+         int c, count = 0;
+ 
+         str.getChars(0, strlen, charr, 0);
+ 
+         for (int i = 0; i < strlen; i++) {
+             c = charr[i];
+             if ((c >= 0x0001) && (c <= 0x007F)) {
+                 utflen++;
+             } else if (c > 0x07FF) {
+                 utflen += 3;
+             } else {
+                 utflen += 2;
+             }
+         }
+ 
+         byte[] bytearr = new byte[utflen];
+         for (int i = 0; i < strlen; i++) {
+             c = charr[i];
+             if ((c >= 0x0001) && (c <= 0x007F)) {
+                 bytearr[count++] = (byte) c;
+             } else if (c > 0x07FF) {
+                 bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
+                 bytearr[count++] = (byte) (0x80 | ((c >> 6) & 0x3F));
+                 bytearr[count++] = (byte) (0x80 | ((c >> 0) & 0x3F));
+             } else {
+                 bytearr[count++] = (byte) (0xC0 | ((c >> 6) & 0x1F));
+                 bytearr[count++] = (byte) (0x80 | ((c >> 0) & 0x3F));
+             }
+         }
+         out.write(bytearr);
+     }
+ 
      private void outputValue(Object value, Object outer, boolean isArgument) {
          // System.out.println("outputValue: " + instanceName(value));
          if (value == null) {
28-08-2004

EVALUATION This is the kind of bug report that I like to see. Not only did the submitter find the issue, but they did an analysis and presented a great solution. All that's left for me to do is to review and test. Unfortunately, we missed the deadline for getting it into 1.4.1, but it will make it into the next minor release. ###@###.### 2002-08-30

This will not make it into 1.4.2 due to time constraints. Committing to 1.5. ###@###.### 2002-11-21

The implementation of the suggested fix to use a custom UTF-8 encoder (in light of other fixes to XMLEncoder and Statement) yields a performance increase of about 25%. With a performance increase like this, perhaps String.getBytes/StringCoding.encode() should be examined to see if we can get a better performance yield across the board. ###@###.### 2003-08-07

I passed this fix to our String encoding experts, and this is what they have to say:

The issue is that the submitter's optimization won't work where there are surrogate pairs within the input. Surrogate pairs, as you may already know, are a representation of characters which lie outside the Basic Multilingual Plane of Unicode (i.e. above the first 64K of defined chars). They are contiguous characters which are constrained to values in a certain character value range. Real examples include some Chinese hanzi characters and Japanese kanji characters which were added late in the pipeline of character assignment within the evolution of Unicode. Other characters which might be represented as surrogate pairs include musical notation symbols and some characters from lesser-used and ancient scripts, etc. I have a niggling suspicion that part of the reason there is more overhead without the inlined conversion the submitter provides is that the general NIO UTF-8 encoder needs to reserve 2 chars for every input char because of the potential that it may need to process a surrogate pair within the input.
....
He continues the next day: A fairly cursory review has revealed that the submitter's suggested fix is a homespun UTF-8 encoder implementation placed inline in the XMLEncoder class rather than using the regular java.lang.String API. One glaring issue I see is that this implementation will not handle surrogate pairs if they occur within the input stream or within each line written by the XML encoder. In fact, without having looked closely at the sun.nio.cs.UTF-8 CharsetEncoder implementation in a while, I think that by shortcutting any code which inspects the input for surrogates the submitter may have naively bought some performance gains. Given that the suggested fix as presented ignores surrogate pairs completely, I am not sure that it is appropriate to adopt this fix or anything similar for XMLEncoder.
....
Also, another String/IO expert chimes in on how we can realize another performance gain: Yes, I agree. In light of JSR 204 the i18n team would rightly object to having a UTF-8 encoder that doesn't handle surrogates properly. We should certainly look at speeding up the NIO UTF-8 charset coder if possible, but I also wonder if XMLEncoder is doing UTF-8 encoding in the most efficient way possible. A nontrivial performance gain might be had simply by using the UTF-8 java.nio.charset.CharsetEncoder directly rather than the String class, which has somewhat more overhead.
....
The conclusion is that I need to investigate the NIO UTF-8 encoding.
This will be decommitted from 1.5 unless the opportunity presents itself. Otherwise, I'm going to commit it to 1.5.1. ###@###.### 2003-09-19
19-09-2003
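
For reference, a minimal sketch of the other suggestion in the evaluation above, using the UTF-8
java.nio.charset.CharsetEncoder directly rather than String.getBytes; unlike the hand-rolled encoder
in the suggested fix, the JDK's own encoder handles surrogate pairs. The class and method names here
are invented for illustration and are not part of XMLEncoder:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    // Hypothetical helper, not part of the JDK sources.
    class Utf8LineWriter {
        // One encoder instance is created up front and reused for every line.
        private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
        private final OutputStream out;

        Utf8LineWriter(OutputStream out) {
            this.out = out;
        }

        void writeln(String exp) throws IOException {
            // encode() resets the encoder and returns a freshly allocated,
            // array-backed ByteBuffer; surrogate pairs are encoded correctly.
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(exp + " \n"));
            out.write(bytes.array(), bytes.arrayOffset() + bytes.position(),
                      bytes.remaining());
        }
    }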