JDK-4344267 : Broken UTF-8 conversion of split surrogate-pair
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 1.3.0
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2000-06-08
  • Updated: 2000-12-20
  • Resolved: 2000-12-20
Other
1.4.0 beta: Fixed
Description

Name: rlT66838			Date: 06/08/2000


SCSL JDK 1.3 Beta source code (Sep 1999)


This is from the September 1999 JDK 1.3 source release;  there is a (faint)
chance that this may have been found and fixed already...

Surrogate pairs are handled correctly if both the high half and the low half
are in the same input[] buffer.  However, if a surrogate pair straddles two
input buffers, then it hits two bugs:

First, there is code that does

            inputChar = highHalfZoneCode;
            highHalfZoneCode = 0;
            if (input[inOff] >= 0xdc00 && input[inOff] <= 0xdfff) {
                // This is legal UTF16 sequence.
                int ucs4 = (highHalfZoneCode - 0xd800) * 0x400
                    + (input[inOff] - 0xdc00) + 0x10000;

The ucs4 calculation assumes that highHalfZoneCode still contains the first
half of the surrogate pair, but highHalfZoneCode has been zapped to 0.

  Fix:  the ucs4 calculation should use inputChar instead of highHalfZoneCode.
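
The effect of this first bug can be reproduced in isolation. The sketch below is not the converter source; the class and method names (SurrogateMath, buggy, fixed) are hypothetical, and only the arithmetic is taken from the excerpt above. It compares the calculation using the zeroed highHalfZoneCode with the same calculation using inputChar.

```java
// Hypothetical standalone sketch; names are illustrative, not from the JDK source.
final class SurrogateMath {

    // Mirrors the reported code: the saved high half is copied to inputChar,
    // zeroed, and then (incorrectly) used in the ucs4 calculation.
    static int buggy(int highHalfZoneCode, char low) {
        int inputChar = highHalfZoneCode;
        highHalfZoneCode = 0;
        return (highHalfZoneCode - 0xd800) * 0x400
            + (low - 0xdc00) + 0x10000;
    }

    // Same calculation with the suggested fix: use inputChar instead.
    static int fixed(int highHalfZoneCode, char low) {
        int inputChar = highHalfZoneCode;
        highHalfZoneCode = 0;
        return (inputChar - 0xd800) * 0x400
            + (low - 0xdc00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = (char) 0xd83d;  // high half of U+1F600
        char low = (char) 0xde00;   // low half of U+1F600
        System.out.println("buggy: " + buggy(high, low));
        System.out.println("fixed: 0x" + Integer.toHexString(fixed(high, low)));
        System.out.println("library: 0x"
            + Integer.toHexString(Character.toCodePoint(high, low)));
    }
}
```

With the zeroed variable the result is a negative number rather than a valid code point; with inputChar it agrees with Character.toCodePoint.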

Next, it tries to output the ucs4 value:

                output[0] = (byte)(0xf0 | ((ucs4 >> 18)) & 0x07);
                output[1] = (byte)(0x80 | ((ucs4 >> 12) & 0x3f));
                output[2] = (byte)(0x80 | ((ucs4 >> 6) & 0x3f));
                output[3] = (byte)(0x80 | (ucs4 & 0x3f));
                charOff++;

This should *not* write directly to output[] (output[0..3] ignores byteOff and
bypasses the buffer-full check); it should fill outputByte[], then set
outputSize = 4, then execute the logic that occurs further down:

            if (byteOff + outputSize > outEnd) {
                throw new ConversionBufferFullException();
            }
            for (int i = 0; i < outputSize; i++) {
                output[byteOff++] = outputByte[i];
            }

It might also be good for consistency if it set inputSize = 1 and then
did "charOff += inputSize", rather than the current "charOff++", but
that's probably a judgment call.
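
A minimal sketch of the staging approach described above, assuming a 4-byte scratch array and an unchecked stand-in for ConversionBufferFullException. The class name Utf8StagingSketch, the method encodeSupplementary, and the exception type are hypothetical; the real converter keeps byteOff, outputByte and outputSize as part of a larger convert() loop.

```java
// Hypothetical sketch of the suggested fix; not the actual converter code.
final class Utf8StagingSketch {

    // Stand-in for ConversionBufferFullException (unchecked here for brevity).
    static final class BufferFullException extends RuntimeException {}

    // Encodes one supplementary code point (0x10000..0x10FFFF) into output,
    // starting at byteOff, and returns the new byteOff.
    static int encodeSupplementary(int ucs4, byte[] output, int byteOff, int outEnd) {
        // Stage the bytes in a scratch array instead of writing to output[0..3].
        byte[] outputByte = new byte[4];
        outputByte[0] = (byte) (0xf0 | ((ucs4 >> 18) & 0x07));
        outputByte[1] = (byte) (0x80 | ((ucs4 >> 12) & 0x3f));
        outputByte[2] = (byte) (0x80 | ((ucs4 >> 6) & 0x3f));
        outputByte[3] = (byte) (0x80 | (ucs4 & 0x3f));
        int outputSize = 4;

        // Shared tail logic: check for room, then copy at the current offset.
        if (byteOff + outputSize > outEnd) {
            throw new BufferFullException();
        }
        for (int i = 0; i < outputSize; i++) {
            output[byteOff++] = outputByte[i];
        }
        return byteOff;
    }

    public static void main(String[] args) {
        byte[] out = new byte[8];
        int next = encodeSupplementary(0x1F600, out, 2, out.length);
        System.out.println("next byteOff = " + next);
    }
}
```

Staging first means nothing is written to output[] when there is not enough room, and the bytes land at byteOff rather than at the start of the buffer.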

Also, highHalfZoneCode is redundantly set to 0 again.  Not bad, but looks funny.
(Review ID: 105886) 
======================================================================

Comments
CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: merlin-beta
FIXED IN: merlin-beta
INTEGRATED IN: merlin-beta
14-06-2004

EVALUATION A new UTF-8 charset encoder/decoder is being provided for Merlin. Since the UTF-8 converter is being rewritten as a fresh implementation, it is not appropriate to apply the suggested fix to Merlin. However, the reported problem will certainly not recur in the new encoder. The suggested fix given here is valid and could be considered as a candidate fix for JDK <= 1.3. Ian.Little@Ireland
11-06-2004

WORK AROUND Name: rlT66838 Date: 06/08/2000
Don't split surrogate pairs across calls to convert()
======================================================================
11-06-2004
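
One way a caller might apply the workaround is to choose chunk boundaries that never fall between the two halves of a surrogate pair. The helper below is a sketch under that assumption; ChunkBoundary and safeEnd are hypothetical names, and the caller is assumed to pass maxLen >= 1 and start < text.length.

```java
// Hypothetical caller-side helper for the workaround; not part of the JDK.
final class ChunkBoundary {

    // Returns an end index (exclusive) for the chunk starting at 'start' such
    // that the chunk does not end on a high surrogate whose low half follows.
    // Assumes maxLen >= 1 and start < text.length.
    static int safeEnd(char[] text, int start, int maxLen) {
        int end = Math.min(start + maxLen, text.length);
        if (end < text.length && Character.isHighSurrogate(text[end - 1])) {
            // Prefer ending one char earlier; if the chunk would become empty,
            // extend by one char so the pair stays together.
            end = (end - 1 > start) ? end - 1 : end + 1;
        }
        return end;
    }

    public static void main(String[] args) {
        // 'a', 'b', then the surrogate pair for U+1F600, then 'c'.
        char[] text = {'a', 'b', (char) 0xd83d, (char) 0xde00, 'c'};
        for (int start = 0; start < text.length; ) {
            int end = safeEnd(text, start, 3);
            System.out.println("chunk [" + start + ", " + end + ")");
            start = end;
        }
    }
}
```

Each chunk returned this way can be passed to a single convert() call without the pair straddling two input buffers.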