JDK-4945544 : Mapping from class name to file name for unicode characters
  • Type: Bug
  • Component: specification
  • Sub-Component: language
  • Affected Version: 5.0
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • OS: solaris_8
  • CPU: generic
  • Submitted: 2003-10-29
  • Updated: 2006-11-16
  • Resolved: 2006-11-16
Related Reports
Duplicate :  
Description
Currently, class files whose names contain non-ASCII characters cause no end of problems for VM implementations. File system support for Unicode characters in file names differs among platforms, and is therefore not reliable. In general, many programs written using non-US locales are forced to use ASCII characters for their class names for portability, which undermines the internationalization support of the Java language and platform. Even the jar file implementation, based as it is upon an underlying C implementation, has trouble with such file names stored within it.

We propose that the mapping from class name to file name should be specified to translate any character in the class or package name that is not printable ASCII into the characters "+Unnnn", where "+" is the ASCII plus character, "U" is the ASCII "U" character, and nnnn are the hex digits of the Unicode representation of the character. Surrogates (i.e., a pair of such encodings) would be used for characters in the higher Unicode planes. No changes in the language, VM, or API specifications are needed; this mapping was never specified in any existing document. However, we recommend this mapping be described in either or both of the new JLS and JVMS documents beginning in Tiger.
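The proposed mapping can be sketched in Java as follows. This is only an illustration, not code from javac or any class loader: the class name `ClassFileNames` and the method `toFileName` are hypothetical, and the sketch assumes that "printable ASCII" means 0x21 through 0x7E, that the hex digits are uppercase, and that package separators become "/". The report specifies none of these details.

```java
import java.util.Locale;

/** Hypothetical helper illustrating the proposed "+Unnnn" mapping. */
public final class ClassFileNames {
    private ClassFileNames() {}

    /** Maps a binary class name such as "pkg.Name" to a relative file name. */
    public static String toFileName(String binaryName) {
        StringBuilder sb = new StringBuilder();
        // Iterate over UTF-16 code units, so a supplementary character
        // naturally becomes a pair of "+Unnnn" encodings (one per surrogate).
        for (int i = 0; i < binaryName.length(); i++) {
            char c = binaryName.charAt(i);
            if (c == '.') {
                sb.append('/');                 // package separator (assumption)
            } else if (c >= 0x21 && c <= 0x7E) {
                sb.append(c);                   // printable ASCII passes through (assumed range)
            } else {
                sb.append(String.format(Locale.ROOT, "+U%04X", (int) c));
            }
        }
        return sb.append(".class").toString();
    }
}
```

Under these assumptions, a class named `pkg.Café` would be stored as `pkg/Caf+U00E9.class`, and a name containing a character outside the Basic Multilingual Plane would produce two consecutive escapes, one for each surrogate.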

On loading a given class whose name contains characters that are not printable ASCII, the VM could try both locations (that is, the rewritten file name AND the non-rewritten file name). That allows full backward compatibility with existing class files that may use non-ASCII characters.
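A minimal sketch of such a two-location lookup is shown below. The class `DualLookup` and its methods are hypothetical and do not correspond to any existing class loader code. The sketch assumes the rewritten name is tried first; the report does not specify an order. It also repeats the escaping assumptions that the report leaves open (printable ASCII taken as 0x21 through 0x7E, uppercase hex digits).

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical lookup that tries the rewritten name, then the raw name. */
public final class DualLookup {
    private DualLookup() {}

    /** Escapes every character outside the assumed printable ASCII range as "+Unnnn". */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x21 && c <= 0x7E) {
                sb.append(c);
            } else {
                sb.append(String.format(Locale.ROOT, "+U%04X", (int) c));
            }
        }
        return sb.toString();
    }

    /** Returns the first existing class file for binaryName under root, if any. */
    public static Optional<Path> locate(Path root, String binaryName) {
        String relative = binaryName.replace('.', '/') + ".class";
        // Order is an assumption: rewritten name first, raw name as fallback.
        String[] candidates = { escape(relative), relative };
        for (String candidate : candidates) {
            try {
                Path p = root.resolve(candidate);
                if (Files.isRegularFile(p)) {
                    return Optional.of(p);
                }
            } catch (InvalidPathException e) {
                // The file system cannot represent this name; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

For a name made up only of printable ASCII characters both candidates are identical, so the second probe is redundant but harmless.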

If we decide to do this, changes are required in javac and the standard class loaders.

Comments
EVALUATION The proposed mapping from Unicode characters to the ASCII string +Unnnn is essentially what JLS 7.2.1 recommends as a convention, i.e. @nnnn. We have always been wary about specifying a Unicode->ASCII mapping in the JLS. I don't believe it's worthwhile to comment on Unicode-capable file systems, since we cannot mandate such a file system. And the JLS never mentions tools like jar. Otherwise, this request is essentially a duplicate of 4421728.
16-11-2006

EVALUATION The statement in the description "this mapping was never specified in any existing document" is incorrect. A (different) mapping from class names to file names is described in section 7.2.1 of the Java Language Specification, both first and second edition. However, this recommendation was never implemented in Sun's JDK. At this point, I'm not sure this feature is worth implementing anymore. All operating systems that Sun's JDK runs on now provide Unicode-based file systems, either as the standard (Windows) or when logging in using UTF-8 locales (Unix). This provides developers with a reasonable way to create class files that include non-ASCII characters. These class files can then be packaged into jar files, which always encode file names in UTF-8. There used to be various bugs that prevented this from working end-to-end, but I believe 5030265 was the last one of those bugs, and the regression test written for it (test/tools/launcher/UnicodeTest.sh) ensures that bugs won't resurface. Instead of implementing this feature, I'd suggest updating JLS 7.2.1 to describe the current reality around file systems and jar files. ###@###.### 2004-11-06 03:19:14 GMT
06-11-2004