JDK-4131655 : java.io.InputStreamReader performance: Factor of five speed penalty
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.io
  • Affected Version: 1.1.5,1.2.0
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: generic
  • CPU: generic
  • Submitted: 1998-04-22
  • Updated: 1998-06-30
  • Resolved: 1998-06-30
Related Reports
Duplicate :  
Relates :  
Description
See the attachment for the code that generated these measurements; you may
need to comment out the test for "UTF16Reader" to compile and run it.  The
test data is generated by the "gen.java" file attached to bugid 4131647,
which turned up its own set of bugs ... Note that the custom readers took
about 1/2 hour to write and debug.  Admittedly, things like UTF-8 will be
slower than UTF-16, but that does not justify a FACTOR OF FIVE (or more)
difference in speed.

---------------------------
From xxxx Sat Apr 18 15:56:36 1998
To: xxxx
Subject: Reader performance
Cc: xxxx

You'd asked for numbers when I asked you about performance problems in
the Reader/Writer framework, and here are some ugly ones.

Each of these (single) runs read 1M chars of XML data (basically, this
was randomly generated UNICODE, with some XML framing) from files cached
in memory.  The "read" loop was "read a 1K block, then read 512 characters
one at a time" until the end of the data was reached.
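The read loop described above can be sketched roughly as follows. This is not the attached benchmark code, just a minimal illustration of the access pattern (the class and method names are invented for this sketch); the mix of bulk and single-character reads is what exposes the per-character overhead.

```java
import java.io.*;

public class ReaderLoopSketch {
    // Sketch of the measured loop: read a 1K block, then 512 characters
    // one at a time, repeating until end of data. Returns total chars read.
    static long readAll(Reader in) throws IOException {
        char[] block = new char[1024];
        long total = 0;
        while (true) {
            int n = in.read(block, 0, block.length);   // bulk read
            if (n < 0) break;
            total += n;
            for (int i = 0; i < 512; i++) {            // char-at-a-time reads
                if (in.read() < 0) return total;
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64 * 1024];             // dummy ASCII payload
        java.util.Arrays.fill(data, (byte) 'x');
        Reader r = new InputStreamReader(new ByteArrayInputStream(data), "UTF8");
        long start = System.currentTimeMillis();
        long chars = readAll(r);
        System.out.println(chars + " chars in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
```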

    InputStreamReader, "UnicodeLittle"  16.34 ms (JDK 1.1.5)
    InputStreamReader, "UnicodeLittle"  17.94 ms (JDK 1.2 beta4)

    Custom "UnicodeLittleReader"         3.86 ms (JDK 1.1.5)
    Custom "UnicodeLittleReader"         3.77 ms (JDK 1.2 beta4)

    InputStreamReader, "UTF8"           24.82 ms (JDK 1.1.5)
    InputStreamReader, "UTF8"           25.63 ms (JDK 1.2 beta4)

The custom reader does the obvious stuff -- notably not allocating a
garbage character array on each character-at-a-time read, and adding
no superfluous method calling overhead for block reads.  Stuff that
the character converter object framework seemingly precludes.
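The key optimization the paragraph above describes can be sketched as follows. This is not the submitter's attached reader, only an illustration (the class name is invented): a persistent one-char buffer is reused across single-character reads instead of allocating a fresh array each time, and the bulk read path decodes directly into the caller's buffer with no extra method-call layers.

```java
import java.io.*;

// Sketch of a custom little-endian UTF-16 reader that avoids per-read
// garbage: the one-char buffer below is allocated once and reused.
public class LittleUtf16ReaderSketch extends Reader {
    private final InputStream in;
    private final char[] one = new char[1];   // reused, never garbage

    public LittleUtf16ReaderSketch(InputStream in) { this.in = in; }

    public int read() throws IOException {
        return read(one, 0, 1) < 0 ? -1 : one[0];
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        // UTF-16LE: each char is two bytes, low byte first.
        int count = 0;
        for (int i = 0; i < len; i++) {
            int lo = in.read();
            if (lo < 0) return count == 0 ? -1 : count;
            int hi = in.read();
            if (hi < 0) throw new EOFException("odd byte count in UTF-16 data");
            cbuf[off + count++] = (char) ((hi << 8) | lo);
        }
        return count;
    }

    public void close() throws IOException { in.close(); }
}
```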

If the character-at-a-time reads were removed, the times were roughly five
seconds to read the Unicode via InputStreamReader, eleven for UTF-8, and
about 10% faster for the custom reader.  That is, the custom reader is
still on the order of 25% faster.

For comparison, one XML parser, which doesn't use Readers because
of their performance, read ** AND PARSED ** the two files in only
two seconds more than the JDK's bulk read cases took ...
 
It's no wonder the people designing these APIs are steering away from
using the java.io.Reader classes.  Which is worrisome, since all XML
data is UNICODE.
 
- xxxx

<UPDATE>
<AUTHOR> david.brownell@Eng 1998-06-29 </AUTHOR>

Software REWRITTEN to use the bulk reads can get acceptable
performance even with this speed penalty.  In fact, I've
now done so and outperform the fastest of the third party
XML processing engines.

However, for other applications I still think this is a
pretty severe problem.  Not everyone has complete control
over all of their input data sources.

</UPDATE>

Comments
SUGGESTED FIX New API and implementation strategy ... subclass readers. Alternatively, since one issue is that the sun.io.ByteToCharConverter.convert() method doesn't support any kind of single-character access to conversion functionality, provide such functionality and use it!
11-06-2004

EVALUATION Performance is comparable for bulk conversions, and the logic of byte-to-character converters is inherently quite complex for single-character conversions. The presented comparison case bypasses the byte-to-character conversion mechanism altogether. benedict.gomes@Eng 1998-06-29

(To clarify: the implementation I wrote -- less than 30 minutes -- mostly benefits from internal APIs that don't force conversion into a buffer. The java.io.InputStreamReader code allocates a one-byte buffer, converts into it, and then returns the content of that buffer ... the buffer is garbage immediately. It CANNOT bypass byte-to-character conversion; that is part of the problem definition ... I'm assuming Benedict made a typo above; Readers are byte-to-char, not char-to-byte!! :-) david.brownell@Eng 1998-06-29

Yes, that was my error. To further clarify, I meant that it bypasses the ByteToCharConverter API mechanism, not the conversion per se. benedict.gomes@Eng 1998-06-29

The InputStreamReader class was not designed to support efficient single-character reads. Due to the inherent complexities of character encodings, it is impossible to support efficient single-character reads without an additional level of post-conversion buffering. This is why the InputStreamReader specification explicitly suggests that instances should be wrapped in a BufferedReader. In applications that must pass the same reader to different subsystems, a single BufferedReader instance should be passed around. We are well aware of the need for a more general, and more efficient, character-conversion API. The fact that the current internal Byte/CharConverter API throws exceptions so often is one reason why we did not make that API public. I'm closing this as a duplicate of 4093056. -- mr@eng 6/30/1998
30-06-1998

WORK AROUND All software that needs to do character-at-a-time reads needs to arrange to buffer the data, perhaps with a BufferedReader or in application-specific buffers.

I don't call this a "convenient" workaround, since it's not possible in cases where the Reader is handed to a subsystem that may not have exclusive use thereof: the buffer would need to be used by the next subsystem. Also, InputStreamReader already _has_ a buffer. david.brownell@Eng 1998-06-29

This is not a workaround -- this is how the InputStreamReader class was designed to be used! An application should not pass an instance of InputStreamReader around to different subsystems; it should pass an instance of BufferedReader that buffers the InputStreamReader. -- mr@eng 6/30/1998
30-06-1998
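The pattern both the workaround and the evaluation converge on can be shown in a minimal sketch (file contents and names here are illustrative): wrap the InputStreamReader in a BufferedReader once, and hand the BufferedReader, not the raw InputStreamReader, to any subsystems that need it.

```java
import java.io.*;

public class BufferedWrapSketch {
    public static void main(String[] args) throws IOException {
        byte[] data = "line one\nline two\n".getBytes("UTF8");
        // Wrap once; subsystems all share the same post-conversion buffer,
        // so character-at-a-time reads stay cheap.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), "UTF8"));
        int c;
        while ((c = in.read()) != -1) {   // buffered single-char reads
            System.out.print((char) c);
        }
        in.close();
    }
}
```

The design point in the evaluation is that the buffering must live in one shared object; creating a fresh BufferedReader per subsystem would strand buffered characters when the reader changes hands.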