Bug ID: JDK-6442955 UTF-8 encoder returns a byte array with a null byte

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 6

Priority: P2
Status: Closed
Resolution: Not an Issue
OS: windows_xp
CPU: x86

Submitted: 2006-06-23
Updated: 2010-04-02
Resolved: 2006-06-24

FULL PRODUCT VERSION :
java version "1.6.0-beta2"
Java(TM) SE Runtime Environment (build 1.6.0-beta2-b86)
Java HotSpot(TM) Client VM (build 1.6.0-beta2-b86, mixed mode, sharing)

ADDITIONAL OS VERSION INFORMATION :
Windows XP Professional SP 2

A DESCRIPTION OF THE PROBLEM :
This bug is responsible for the following behavior:
Some UTF-16 characters cannot be added to a JDOM document after they have been encoded with a CharsetEncoder. The returned ByteBuffer contains a null byte at the end. This zero byte seems to be responsible for the error while building the DOM.

The behavior also differs between version 1.5.0_07 and version 1.6.0 (b86): a different character triggers it in each version.

"\u0237" - version 1.5.0_07 OK,  version 1.6.0 NOK
"\u304E" - version 1.5.0_07 NOK, version 1.6.0 OK




STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the class CharsetEncoderTest twice, once with Java 1.5.0_07 and once with Java 1.6.0 b86.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
CharsetEncoder should encode the two Unicode (UTF-16) characters into UTF-8 bytes, which could then be used as the text of an XML DOM entry. The XML DOM should accept the encoded String generated from the ByteBuffer that was returned by the CharsetEncoder.
ACTUAL -
The ByteBuffer contained an additional "empty" byte with the value 0.

(This behavior occurs in both Java versions mentioned, but with different characters.)

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Exception in thread "main" org.jdom.IllegalDataException: The data "AA " is not legal for a JDOM attribute: 0x0 is not a legal XML character.
	at org.jdom.Attribute.setValue(Attribute.java:486)
	at org.jdom.Attribute.<init>(Attribute.java:229)
	at org.jdom.Attribute.<init>(Attribute.java:252)
	at org.jdom.Element.setAttribute(Element.java:1109)
	at test.CharsetEncoderTest.testEncodeSaveXML(CharsetEncoderTest.java:39)
	at test.CharsetEncoderTest.main(CharsetEncoderTest.java:20)


!!! NOTE !!!: The space in the String "AA " was not a space in the original Error Message. It was an undisplayable Character.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

import org.jdom.Document;
import org.jdom.Element;

public class CharsetEncoderTest {

    private static int encodee160 = 0x304E;   // Works only with version 1.6.0
    private static int encodee150_07 = 0x237; // Works only with version 1.5.0_07
    private static String encoded;

    public static void main(String[] args) {
        testEncodeSaveXML(encodee150_07);
        testEncodeSaveXML(encodee160);
    }
    
    public static void testEncodeSaveXML(int character) {
        Charset set = Charset.forName("UTF-8");
        CharsetEncoder encoder = set.newEncoder();
        CharBuffer chb = CharBuffer.allocate(1);
        chb.put((char) character);
        chb.rewind();
        encoder.reset();
        try {
            ByteBuffer bb;
            bb = encoder.encode(chb);
            byte[] ba = bb.array();
            encoded = new String(ba, "ISO-8859-1");
            Document doc = new Document();
            Element e = new Element("XMLChar");
            e.setAttribute("value", encoded);
            doc.setRootElement(e);
        } catch (CharacterCodingException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Remove the last (wrong) character from the encoded String before processing it, if encoding resulted in a null byte.

EVALUATION

Here's a shorter program USING ONLY CORE LIBRARY CLASSES, demonstrating the issue:
----------------------------------------
import java.nio.*;
import java.nio.charset.*;

public class Bug {
    public static void main(String[] args) throws Throwable {
        for (char c : new char[] { 0x304e, 0x0237 }) {
            CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
            CharBuffer chb = CharBuffer.allocate(1);
            chb.put(c);
            chb.rewind();
            ByteBuffer bb = encoder.encode(chb);
            System.out.printf("limit=%d", bb.limit());
            byte[] ba = bb.array();
            for (int i = 0; i < ba.length; i++)
                System.out.printf(" %02x", ba[i] & 0xff);
            System.out.println();
        }
    }
}
----------------------------------------
(mb29450@suttles) ~/src/toy/6442955 $ for v in 5.0u7 6; do jver $v jr Bug;done
==> javac -source 1.5 -Xlint:all Bug.java
==> java -esa -ea Bug
limit=3 e3 81 8e 00
limit=2 c8 b7
==> javac -source 1.6 -Xlint:all Bug.java
==> java -esa -ea Bug
limit=3 e3 81 8e
limit=2 c8 b7 00

So yes, the behavior in mustang is different, but this is not a bug.
Only the bytes in the returned byte buffer before the limit are significant.
See:
http://download.java.net/jdk6/docs/api/java/nio/charset/CharsetEncoder.html#encode(java.nio.CharBuffer)

The Charset Buffer API is optimized for efficiency, not convenience.
Buffer copies are avoided, but this means some buffers have extra
elements between their limit and their capacity.
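For callers that need an exactly sized byte[], one possible approach is to copy only the bytes between the buffer's position and its limit instead of taking the whole backing array. The following is a minimal sketch, not part of the original report; the class name EncodeToExactBytes and the helper significantBytes are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class EncodeToExactBytes {

    // Hypothetical helper: returns a copy of the bytes between position and
    // limit, ignoring any spare capacity in the backing array.
    static byte[] significantBytes(ByteBuffer bb) {
        byte[] out = new byte[bb.remaining()];
        // Read through a duplicate so the caller's position is left unchanged.
        bb.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) throws CharacterCodingException {
        for (char c : new char[] { 0x304e, 0x0237 }) {
            ByteBuffer bb = Charset.forName("UTF-8").newEncoder()
                    .encode(CharBuffer.wrap(new char[] { c }));
            byte[] ba = significantBytes(bb);
            StringBuilder sb = new StringBuilder("length=" + ba.length);
            for (byte b : ba) {
                sb.append(String.format(" %02x", b & 0xff));
            }
            System.out.println(sb);
        }
    }
}
```

With this helper the array length follows the buffer's limit rather than its capacity, so the result should not depend on how much spare room a particular JDK version allocates. If the input is already a String, String.getBytes("UTF-8") also returns an exactly sized array.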

24-06-2006