JDK-4614120 : UTF-8 vmspec not verified by java -Xfuture
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 1.4.0,1.4.2
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2001-12-14
  • Updated: 2012-10-08
  • Resolved: 2002-10-26
Other: 1.4.2 (mantis), Fixed
Related Reports
Duplicate :  
Description
Name: gm110360			Date: 12/14/2001


java version "1.4.0-beta3"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta3-b84)
Java HotSpot(TM) Client VM (build 1.4.0-beta3-b84, mixed mode)

> http://java.sun.com/products/jdk/1.2/compatibility.html

> Runtime Incompatibilities in Version 1.2

> In JDK 1.2 software
> the -Xfuture option enables the strictest possible
> class-file format checks ...

If only this were true.  Please try the demo below, see that something's
broken, tell me what it is, and fix it.

> reject... illegal UTF-8 strings

I hope I'm right to think you agree that vmspec UTF-8, as defined by "4.4.7 The
CONSTANT_Utf8_info Structure", is shortest-form UTF-8 only, except that u0000,
if present, always appears as x C0 80?

By that definition, the `java -Xfuture` verification of .class file format
rejects a lot less than all forms of "illegal UTF-8 strings".

1) The verification never complains of not-shortest-form UTF-8.  (Though it
does complain of the too-short-form x 00.)

2) The verification accepts truncated and ill-formed UTF-8 in string values,
attribute names, and unused entries.
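
For reference, the strict check being asked for can be sketched in a few lines. This is only an illustration, assuming the byte-length rules of section 4.4.7 (one byte for u0001..u007F, two bytes for u0000 and u0080..u07FF, three bytes for u0800..uFFFF). The class name `StrictVmUtf8` and the method `isStrict` are hypothetical and are not the HotSpot format checker.

```java
public final class StrictVmUtf8 {
    private StrictVmUtf8() {}

    /**
     * Hypothetical helper: returns true only if the bytes are well-formed,
     * shortest-form vmspec UTF-8 as read from section 4.4.7.
     */
    public static boolean isStrict(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int b0 = b[i] & 0xFF;
            if (b0 == 0x00) {
                return false;                       // u0000 must appear as C0 80
            } else if (b0 < 0x80) {
                i += 1;                             // u0001..u007F
            } else if ((b0 & 0xE0) == 0xC0) {
                if (i + 1 >= b.length) return false;        // truncated
                int b1 = b[i + 1] & 0xFF;
                if ((b1 & 0xC0) != 0x80) return false;      // ill-formed
                int ch = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
                if (ch != 0 && ch < 0x80) return false;     // not shortest form
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {
                if (i + 2 >= b.length) return false;        // truncated
                int b1 = b[i + 1] & 0xFF;
                int b2 = b[i + 2] & 0xFF;
                if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
                    return false;                           // ill-formed
                }
                int ch = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
                if (ch < 0x800) return false;               // not shortest form
                i += 3;
            } else {
                return false;   // stray continuation byte, or a byte in F0..FF
            }
        }
        return true;
    }
}
```

Under these assumptions the sketch rejects a lone x 00, a trailing x E0, x D0 01, and x E0 90 81, while still accepting x C0 80.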

We care because by design, vmspec UTF-8 defines precisely zero or one ways to
represent any sequence of chars.  By defining more than one sequence of bytes
as equal to a given sequence of chars, we raise unanswerable questions.  Does
one method override another?  Is a field present?  Is a constant initialiser
present?

Now for the promised quick, rough demo of some of this.  Try editing the binary
A.class after compiling this source:

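        class A {
            // A compile-time constant, so javac emits a ConstantValue attribute for it.
            final static int theInt = 0x9ABCDEF0;
            // The same text as that attribute's name, used here as a string value.
            String theString = "ConstantValue";
        }

        class B {
            public static void main(String[] strings) {
                System.out.println(Character.isJavaIdentifierPart('\u00E0'));

                // Print each char of the string loaded from A.class, in hex.
                A a = new A();
                String st = a.theString;
                for (int index = 0; index < st.length(); ++index) {
                    char ch = st.charAt(index);
                    System.out.println("x" + Integer.toHexString(ch).toUpperCase());
                }
            }
        }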

In the binary A.class, confirm you see only one CONSTANT_Utf8_info entry that
equals "theInt":

        01 00:06 74 68 65 49 6E 74 // theInt

See also that `java -Xfuture B` accepts the A.class binary.

Now change the A.class binary.  Change the trailing x74 to an xE0.  See that
`java -Xfuture B` explodes, complaining of an "Illegal Field name".  So far so
good.

Now restore the original A.class binary (most simply, recompile it).  Go find
the one entry of:

        01 00:0D 43 6F 6E 73 74 61 6E 74 56 61 6C 75 65 // ConstantValue

Change the trailing x65 to an xE0.  See that `java -Xfuture B` is happy.
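
The hand edits above can also be scripted instead of done in a hex editor. The following is a minimal sketch, not part of the original report: the class `PatchUtf8Entry` and its methods are hypothetical, it uses `java.nio.file` from later JDKs rather than the 1.4 JDK under test, and it assumes the entry text is plain ASCII. It searches the raw bytes for tag 01, the two-byte length, and the text, so it relies on the pattern occurring only once, as the report asks you to confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class PatchUtf8Entry {
    private PatchUtf8Entry() {}

    /** Builds the raw entry: tag 01, two-byte length, then the ASCII text. */
    public static byte[] entryFor(String asciiText) {
        byte[] text = asciiText.getBytes(StandardCharsets.US_ASCII);
        byte[] entry = new byte[3 + text.length];
        entry[0] = 0x01;
        entry[1] = (byte) (text.length >>> 8);
        entry[2] = (byte) text.length;
        System.arraycopy(text, 0, entry, 3, text.length);
        return entry;
    }

    /** Returns a copy in which the last byte of the first matching entry is replaced. */
    public static byte[] patchTrailingByte(byte[] classBytes, String asciiText, int newByte) {
        byte[] entry = entryFor(asciiText);
        for (int start = 0; start + entry.length <= classBytes.length; start++) {
            int k = 0;
            while (k < entry.length && classBytes[start + k] == entry[k]) {
                k++;
            }
            if (k == entry.length) {
                byte[] patched = classBytes.clone();
                patched[start + entry.length - 1] = (byte) newByte;
                return patched;
            }
        }
        throw new IllegalArgumentException("no such entry: " + asciiText);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java PatchUtf8Entry <class file> <entry text>");
            return;
        }
        // Overwrites the file in place; recompile to restore the original.
        Path classFile = Paths.get(args[0]);
        byte[] patched = patchTrailingByte(Files.readAllBytes(classFile), args[1], 0xE0);
        Files.write(classFile, patched);
    }
}
```

Run as `java PatchUtf8Entry A.class ConstantValue` (or with `theInt` for the first experiment), then rerun `java -Xfuture B`.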

Conclude that string values and attribute names may contain truncated UTF-8.

Repeat, if you like, changing two trailing bytes, to see that constant pool
entries may contain ill-formed UTF-8, such as x D0 01 (b10xx:xxxx does not
follow b110x:xxxx).

Repeat, if you like, changing three trailing bytes, to see that constant pool
entries may contain not-shortest-form UTF-8, such as x E0 90 81.  So may field
names, etc.
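
To see why x E0 90 81 counts as not-shortest-form, it helps to work the bits. The sketch below is illustrative only (the names `Utf8FormDemo`, `decode3`, and `shortestLength` are hypothetical): it decodes the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx and compares the length used against the length section 4.4.7 assigns to the resulting char.

```java
public final class Utf8FormDemo {
    private Utf8FormDemo() {}

    /** Decodes 1110xxxx 10xxxxxx 10xxxxxx without any range check. */
    public static int decode3(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    /** Byte count section 4.4.7 assigns to a char; u0000 takes two bytes. */
    public static int shortestLength(int ch) {
        if (ch >= 0x0001 && ch <= 0x007F) return 1;
        if (ch <= 0x07FF) return 2;     // u0000 and u0080..u07FF
        return 3;                       // u0800..uFFFF
    }

    public static void main(String[] args) {
        int ch = decode3(0xE0, 0x90, 0x81);
        String hex = Integer.toHexString(0x10000 | ch).substring(1).toUpperCase();
        System.out.println("decoded: u" + hex);
        System.out.println("bytes used: 3, shortest form: " + shortestLength(ch));
    }
}
```

The three bytes decode to u0401, which lies in u0080..u07FF and so should have been written in two bytes, as x D0 81.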

Please tell me what's broken and fix it - or unconfuse me!

Thanks in advance.    Pat LaVarre

> http://developer.java.sun.com/developer/bugParade/
> +-Xfuture +utf
> 4 Results Found, Sorted by [lack of] Relevance
(Review ID: 136117) 
======================================================================

Comments
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX: mantis
FIXED IN: mantis
INTEGRATED IN: mantis mantis-b05
14-06-2004

WORK AROUND

Name: gm110360			Date: 12/14/2001

Workaround?  Ouch.  No easy answer here.

I guess people can run a separate verifier to reject stuff we think we'll
almost never find, like:

        truncated or ill-formed UTF-8
        unused entries containing illegal UTF-8 of any kind

But as for not-shortest-form UTF-8, neglecting to fix that in jdk1.2 has left
us facing a slowly growing horror.

Per review ID 136105, we know javac no more antique than jdk1.3 commonly
produces not-shortest-form UTF-8, for chars u0400..u07FF i.e. Cyrillic,
Armenian, Hebrew, Arabic, Syriac, and Thaana.  Not just strings: identifiers
too.

We know only now, with the jdk1.4 of late 2001, has a javac decided by default
to break compatibility with the jdk1.0.2 jvm's of Win95 IE 3.

We can conclude we're going to be living with not-shortest-form UTF-8 for a
long, long time.

Somebody Who Matters has gotta decide.  Is it better to have the vmspec be
stable, short, and unreal ... or do we answer the unanswerable?

When is it ok for two .class file readers - a jvm, a javac, java.lang.reflect,
whatever - to disagree about what a .class file means?  When must each see
bytes as bytes?  When must each see bytes as chars?  If we say a jvm has to see
bytes as bytes, don't we have to let java.lang.reflect see bytes as bytes?  If
we say a jvm has to see bytes as chars, we implicitly require any jvm to
convert every UTF-8 string it sees?  All wrong answers.

I can't tell you how pleased I'd be to be told I'm all mixed up here.
======================================================================
11-06-2004

EVALUATION

By fixing bug 4169783, this bug is partially fixed. The format checker will
verify shortest form of utf8 strings in future releases.

###@###.### 2002-03-19
19-03-2002