JDK-4324508 : UTF8 output is sometimes non-canonical
  • Type: Bug
  • Component: tools
  • Sub-Component: javac
  • Affected Version: 1.3.0,1.4.0,1.4.2
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic,linux_redhat_7.2,solaris_7
  • CPU: generic,x86,sparc
  • Submitted: 2000-03-23
  • Updated: 2000-11-16
  • Resolved: 2000-06-02
Other
1.4.0 beta: Fixed
Related Reports
Duplicate :  
Duplicate :  
Relates :  
Description
About half the time when javac is supposed to generate a 2-byte UTF-8 encoding,
it instead generates a 3-byte encoding.  This happens when the most significant
content bit of the 2-byte encoding would be one, that is, for chars in the
range \u0400..\u07FF.
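
As a rough check of the "about half" figure, the sketch below counts the chars in the 2-byte range (0x80..0x7FF) whose top content bit is set. It assumes the standard 2-byte layout 110xxxxx 10xxxxxx, which carries 11 content bits, so the most significant one is 0x400. The class and method names are hypothetical and chosen only for illustration.

```java
class AffectedRange {
    // Counts chars in the 2-byte UTF-8 range whose most significant
    // content bit (bit 10, mask 0x400) is one.
    static int countAffected() {
        int affected = 0;
        for (int ch = 0x80; ch <= 0x7FF; ch++) {
            if ((ch & 0x400) != 0) {
                affected++;
            }
        }
        return affected;
    }

    // Total number of chars that need exactly two bytes (excluding \u0000).
    static int totalTwoByte() {
        return 0x7FF - 0x80 + 1;
    }

    public static void main(String[] args) {
        // Prints "1024 of 1920"
        System.out.println(countAffected() + " of " + totalTwoByte());
    }
}
```

So 1024 of the 1920 chars that need two bytes fall in the affected range, a little over half.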

Comments
CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: merlin-beta
FIXED IN: merlin-beta
INTEGRATED IN: merlin-beta
14-06-2004

SUGGESTED FIX The problem is in Convert.chars2utf. The predicate (ch <= 0x3FF) should in fact be (ch <= 0x7FF). See doc string for java.io.DataOutput.writeUTF.
11-06-2004
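
A minimal sketch of the predicate change, assuming the encoding rules given in the java.io.DataOutput.writeUTF documentation. This is not the javac source for Convert.chars2utf: the class ModifiedUtf8Sketch, its methods, and the twoByteMax parameter are hypothetical and exist only so that both bounds can be compared side by side.

```java
import java.io.ByteArrayOutputStream;

class ModifiedUtf8Sketch {
    // Encodes s as modified UTF-8 (the form documented for DataOutput.writeUTF).
    // twoByteMax is the largest char given a 2-byte encoding:
    // 0x7FF is the corrected bound, 0x3FF reproduces the reported behavior.
    static byte[] encode(String s, int twoByteMax) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int ch = s.charAt(i);
            if (ch >= 0x0001 && ch <= 0x007F) {
                out.write(ch);                          // 0xxxxxxx
            } else if (ch <= twoByteMax) {
                out.write(0xC0 | (ch >> 6));            // 110xxxxx (also covers \u0000)
                out.write(0x80 | (ch & 0x3F));          // 10xxxxxx
            } else {
                out.write(0xE0 | (ch >> 12));           // 1110xxxx
                out.write(0x80 | ((ch >> 6) & 0x3F));   // 10xxxxxx
                out.write(0x80 | (ch & 0x3F));          // 10xxxxxx
            }
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("\u0401", 0x7FF))); // D0 81
        System.out.println(hex(encode("\u0401", 0x3FF))); // E0 90 81
    }
}
```

With the bound at 0x3FF, any char in 0x400..0x7FF falls through to the 3-byte branch, which matches the behavior in the description.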

PUBLIC COMMENTS Some earlier compilers, such as the javac of jdk1.3, violated the vmspec .class file format when compiling Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana identifiers and strings.
Specifically, these compilers produced three bytes of UTF-8 in place of two bytes for the chars \u0400..\u07FF. For example, the isJavaIdentifierPart char \u0401 appeared as x E0 90 81 rather than as x D0 81.
Corrupt identifiers were never fully acceptable to java.lang.reflect. Corrupt identifiers and strings may have been unacceptable to other, less incorrect, compilers and interpreters of .class files.
The javac of jdk1.4 differs from jdk1.3 by producing vmspec UTF-8: always the shortest form, except x C0 80 for \u0000.
Consequently, when the jdk1.4 javac compiles Cyrillic, Armenian, Hebrew, Arabic, Syriac, or Thaana source identifiers together with jdk1.4 binaries, the resulting binary, when distributed, will not link with separately distributed jdk1.3 binaries.
Sun does not provide a tool to correct this binary incompatibility between old and new .class files. In particular, javac -target 1.1, 1.2, and 1.3 will not reproduce corrupt legacy binary identifiers. Sun has thus begun to expand how different .class files can be even when their version numbers match.
Sun does not provide a tool to correct this binary incompatibility between old .class files and new .java files. Stay with jdk1.3 until you can redistribute all your binaries together. The javac of jdk1.4 often reports this kind of trouble as "cannot resolve symbol", for a symbol that includes chars from \u0400..\u07FF.
10-06-2004
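
One way to spot the legacy encoding is to look for a 3-byte sequence whose value would have fit in two bytes: a lead byte of E0 followed by a continuation byte below A0. The sketch below is a hypothetical helper, not a Sun tool. It assumes its input is the byte content of a single UTF-8 constant, and it does not parse a .class file.

```java
class OverlongCheck {
    // Returns true if the given modified UTF-8 bytes contain a 3-byte
    // sequence that encodes a value below 0x800 (a non-shortest form).
    static boolean hasOverlongThreeByte(byte[] utf) {
        int i = 0;
        while (i < utf.length) {
            int b0 = utf[i] & 0xFF;
            if (b0 < 0x80) {
                i += 1;                       // 1-byte form
            } else if ((b0 & 0xE0) == 0xC0) {
                i += 2;                       // 2-byte form
            } else if ((b0 & 0xF0) == 0xE0) {
                // Lead byte E0 means the top four content bits are zero;
                // a second byte below A0 means the value is below 0x800.
                if (b0 == 0xE0 && i + 1 < utf.length
                        && (utf[i + 1] & 0xFF) < 0xA0) {
                    return true;
                }
                i += 3;                       // 3-byte form
            } else {
                i += 1;                       // unexpected byte; skip it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] legacy = { (byte) 0xE0, (byte) 0x90, (byte) 0x81 };
        byte[] shortest = { (byte) 0xD0, (byte) 0x81 };
        System.out.println(hasOverlongThreeByte(legacy));   // true
        System.out.println(hasOverlongThreeByte(shortest)); // false
    }
}
```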

EVALUATION The description is correct. The suggested fix is the correct one and has been implemented. iris.garcia@eng 2000-05-31
31-05-2000