Bug ID: JDK-4324508 UTF8 output is sometimes non-canonical

Type: Bug
Component: tools
Sub-Component: javac
Affected Version: 1.3.0,1.4.0,1.4.2

Priority: P4
Status: Resolved
Resolution: Fixed
OS: generic,linux_redhat_7.2,solaris_7
CPU: generic,x86,sparc

Submitted: 2000-03-23
Updated: 2000-11-16
Resolved: 2000-06-02

Other
1.4.0 beta: Fixed

About half the time that javac should generate a 2-byte UTF-8 encoding,
it instead generates a 3-byte encoding.  This happens when the most significant
content bit of the 2-byte encoding would be one, i.e. for chars in the range
\u0400..\u07FF.

CONVERTED DATA BugTraq+ Release Management Values
COMMIT TO FIX: merlin-beta
FIXED IN: merlin-beta
INTEGRATED IN: merlin-beta
14-06-2004
SUGGESTED FIX The problem is in Convert.chars2utf. The predicate (ch <= 0x3FF) should in fact be (ch <= 0x7FF). See doc string for java.io.DataOutput.writeUTF.
11-06-2004
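
The following is a minimal sketch of the predicate change, not the actual javac source. The class name Utf8TwoByteLimit, the methods encode and hex, and the twoByteLimit parameter are all hypothetical, introduced only so that the old bound (0x3FF) and the corrected bound (0x7FF) can be compared side by side on one char.

```java
import java.io.ByteArrayOutputStream;

public class Utf8TwoByteLimit {
    // Encodes a single char in the class-file style of UTF-8.
    // twoByteLimit is the largest char value that gets the 2-byte form;
    // it is a parameter here (an assumption of this sketch) so that the
    // buggy and the fixed predicate can be compared.
    static byte[] encode(char ch, int twoByteLimit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (ch >= 0x0001 && ch <= 0x007F) {
            // 1-byte form: 0xxxxxxx
            out.write(ch);
        } else if (ch <= twoByteLimit) {
            // 2-byte form: 110xxxxx 10xxxxxx (also used for U+0000)
            out.write(0xC0 | (ch >> 6));
            out.write(0x80 | (ch & 0x3F));
        } else {
            // 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (ch >> 12));
            out.write(0x80 | ((ch >> 6) & 0x3F));
            out.write(0x80 | (ch & 0x3F));
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex, e.g. "D0 81".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char ch = '\u0401';
        System.out.println(hex(encode(ch, 0x3FF))); // E0 90 81 (old bound, non-canonical)
        System.out.println(hex(encode(ch, 0x7FF))); // D0 81    (fixed bound, shortest form)
    }
}
```

With the old bound, every char from 0x400 through 0x7FF falls through to the 3-byte branch even though its 11 significant bits fit in the 2-byte form.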
PUBLIC COMMENTS Some earlier compilers, such as the javac of jdk 1.3, violated the VM spec for the .class file format when compiling Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana identifiers and strings.
Specifically, these compilers produced three bytes of UTF-8 in place of two bytes of UTF-8 for the chars \u0400..\u07FF. For example, the isJavaIdentifierPart char \u0401 appeared as the bytes E0 90 81 rather than as D0 81.
Corrupt identifiers were never fully acceptable to java.lang.reflect. Corrupt identifiers and strings may have been unacceptable to other, less incorrect, compilers and interpreters of .class files.
The javac of jdk1.4 differs from jdk1.3 by producing VM spec UTF-8: always the shortest form, except C0 80 for \u0000.
Consequently, when jdk1.4 javac compiles Cyrillic, Armenian, Hebrew, Arabic, Syriac, or Thaana source identifiers together with jdk1.4 binaries, the resulting binary, when distributed, will not link with separately distributed jdk1.3 binaries.
Sun does not provide a tool to correct this binary incompatibility between old and new .class files. In particular, javac -target 1.1, 1.2, and 1.3 will not reproduce corrupt legacy binary identifiers. Sun has thus begun to expand how different .class files can be even when their version numbers match.
Sun does not provide a tool to correct this binary incompatibility between old .class files and new .java files. Stay with jdk1.3 until you can redistribute all your binaries together. The javac of jdk1.4 often reports this kind of trouble as "cannot resolve symbol", for a symbol that includes chars from \u0400..\u07FF.
10-06-2004
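
As a rough way to see the encoding that the comments above call VM spec UTF-8, the sketch below asks java.io.DataOutputStream.writeUTF to encode a few boundary chars and prints the bytes. The class name WriteUtfCheck and the helper modifiedUtf8Hex are hypothetical. writeUTF emits a 2-byte length prefix before the encoded chars, which the helper skips.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfCheck {
    // Returns the bytes writeUTF produces for s as space-separated hex,
    // omitting the 2-byte length prefix that writeUTF writes first.
    static String modifiedUtf8Hex(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream and short strings.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        StringBuilder sb = new StringBuilder();
        for (int i = 2; i < all.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", all[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(modifiedUtf8Hex("\u0401")); // D0 81
        System.out.println(modifiedUtf8Hex("\u07FF")); // DF BF
        System.out.println(modifiedUtf8Hex("\u0800")); // E0 A0 80
        System.out.println(modifiedUtf8Hex("\0"));     // C0 80
    }
}
```

A string constant extracted from a .class file compiled by an affected javac could be compared against this output to spot the 3-byte form, assuming the constant's raw bytes are available by some other means.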
EVALUATION The description is correct. The suggested fix is the correct one and has been implemented. iris.garcia@eng 2000-05-31
31-05-2000

Duplicate :	JDK-4842396 - (reflect) InvocationTargetException thrown when run apps compiled by 1.3
Duplicate :	JDK-4793774 - VM regression in 1.4.2-b10: "illegal UTF8 string in constant pool" fro
Relates :	JDK-4800106 - VM regression in 1.4.2-b10: "illegal UTF8 string in constant pool"