Bug ID: JDK-6746458 writing libraries in Java for non-Java languages requires support for exotic identifiers


Public discussion: http://blogs.sun.com/jrose/entry/symbolic_freedom_in_the_vm

Each language has its own rules for forming identifiers of functions, variables, and types.  The JVM allows almost total freedom, at the bytecode level, for forming names of methods, fields, and classes.  So far so good.  If a language (say, Lisp, Smalltalk, Ruby, or Scala) wants to encode its names directly in the JVM, it runs into two problems.  The first is the small number of remaining restrictions on JVM bytecode names; this can easily be overcome by lightweight mangling as outlined in the above blog post.

The second problem is the fact that the language runtime is probably written largely in Java, which means that the "exotic" names in the language cannot be directly implemented by Java classes, methods, and fields.  In practice, this difficulty forces language implementors to mangle their names not for the JVM (which is permissive) but for Java (which is restrictive).
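
To make the restriction concrete, here is a minimal sketch that asks the standard `javax.lang.model.SourceVersion` utility whether a spelling can appear unmodified as a Java name. The class `JavaNameLimits` and the method `usableInJavaSource` are hypothetical names introduced only for this illustration.

```java
import javax.lang.model.SourceVersion;

// Hypothetical helper for illustration; not part of any proposal in this report.
final class JavaNameLimits {
    /** True if the spelling can be used as-is for a Java method, field, or class name. */
    static boolean usableInJavaSource(String name) {
        // isIdentifier also accepts keywords, so they are excluded explicitly.
        return SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name);
    }

    public static void main(String[] args) {
        System.out.println(usableInJavaSource("plus"));   // true
        System.out.println(usableInJavaSource("+"));      // false: not a Java letter
        System.out.println(usableInJavaSource("class"));  // false: keyword
        System.out.println(usableInJavaSource("empty?")); // false: '?' is not allowed
    }
}
```

Names such as "+" or "empty?" are ordinary in Ruby or Lisp, which is why a runtime written in Java ends up mangling them.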

The solution to the second problem is to provide an "escape" syntax (similar to that of Lisp and other languages) which allows an arbitrary string to pass through the language scanner as a simple identifier token instead of some other token (or an unscannable mess).

We propose the syntax #"foo".  See example implementation here:

The javac frontend should accept any quoted string immediately following '#' (with no intervening space), interpreting normal string escape sequences, and taking the resulting string exactly as the spelling of a normal identifier.  Keywords like 'int' should not be recognized.  Strings which risk being illegal at the JVM level must be rejected immediately; this simply means rejecting the empty string and strings which contain any of the characters "/.;<>[".
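
As a rough illustration of the rejection rule above, the check could be sketched as follows. The class `ExoticNameCheck` and the method `isAcceptableBody` are hypothetical names chosen for this sketch and are not part of javac; the method assumes the string escape sequences have already been interpreted.

```java
// Hypothetical helper, not part of javac: applies the two rejection rules above.
final class ExoticNameCheck {
    // Characters the proposal says must be rejected in an exotic identifier.
    private static final String FORBIDDEN = "/.;<>[";

    /** Returns true if the already-unescaped body would be accepted. */
    static boolean isAcceptableBody(String body) {
        if (body.isEmpty()) {
            return false; // the empty string is rejected
        }
        for (int i = 0; i < body.length(); i++) {
            if (FORBIDDEN.indexOf(body.charAt(i)) >= 0) {
                return false; // risks being illegal at the JVM level
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableBody("+"));   // true
        System.out.println(isAcceptableBody("int")); // true: keywords are not recognized
        System.out.println(isAcceptableBody(""));    // false
        System.out.println(isAcceptableBody("a.b")); // false
    }
}
```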

This design is neutral toward mangling schemes but supports the one described in the blog entry mentioned above.



This is a small but important feature. There are two kinds of use for exotic identifiers:
- Writing Java code to reference artifacts in other languages, e.g. call a Ruby "+" function
- Generating Java code to represent artifacts of other languages, e.g. generating a Java class whose methods are named for XML tags and attributes, of which "class" is a popular example. 

Some spec points:

3.8 Identifiers

- An identifier is a SIMPLE identifier or an EXTENDED identifier:
  Identifier:
    SimpleIdentifier
    ExtendedIdentifier
  SimpleIdentifier:
    SimpleIdentifierCharacters but not a Keyword or BooleanLiteral or NullLiteral
  SimpleIdentifierCharacters:
    JavaLetter
    SimpleIdentifierCharacters JavaLetterOrDigit
    // JavaLetter and JavaLetterOrDigit are unchanged from JLS3
  ExtendedIdentifier:
    # " ExtendedIdentifierCharacters "
    // Extended identifiers must not be empty, and should permit standalone \ and novel escape sequences
    // like \| and \? in support of John's mangling scheme.
    // Thus ExtendedIdentifier cannot be #"StringLiteral" because it would allow empty strings,
    // disallow a standalone \ (through StringLiteral's use of StringCharacter), and be too restrictive
    // about legal escape sequences (only \b, \t, \n et al, as per 3.10.6).
  ExtendedIdentifierCharacters:
    ExtendedIdentifierCharacter
    ExtendedIdentifierCharacters ExtendedIdentifierCharacter
  ExtendedIdentifierCharacter:
    InputCharacter but not / or . or ; or < or > or [ or "

- A simple identifier is an unlimited-length sequence of Java letters and digits...

- An extended identifier is a # ASCII character, then a " ASCII character, then an unlimited-length sequence of Unicode characters (excluding / (\u002F) and . (\u002E) and ; (\u003B) and < (\u003C) and > (\u003E) and [ (\u005B) and " (\u0022)), then a " ASCII character.

- The body of a simple identifier is its sequence of Java letters and digits. The body of an extended identifier is the sequence of Unicode characters between the " tokens. Two identifiers are the same only if their bodies have the same Unicode character at corresponding positions. // See also 6.5.

- The body of an extended identifier can use the character and string escape sequences (3.10.6) to represent certain special characters. Outside those escape sequences, the backslash Unicode character (\u005C) is not treated specially in an extended identifier.

- Unlike a simple identifier, an extended identifier may have the same spelling as a keyword or any literal (with the pedantic exception of the empty string literal).
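
The notion of an identifier's body, and identity by body, can be sketched as below. This is only an illustration under simplifying assumptions: `IdentifierBodies`, `bodyOf`, and `sameIdentifier` are hypothetical names, escape sequences are not interpreted, and the excluded characters are not checked.

```java
// Hypothetical sketch of "body" extraction; not taken from javac.
final class IdentifierBodies {
    /** Returns the body of a simple or extended identifier spelling. */
    static String bodyOf(String spelling) {
        if (spelling.startsWith("#\"")) {
            // Extended form: # then " then a non-empty body then ".
            if (spelling.length() < 4 || !spelling.endsWith("\"")) {
                throw new IllegalArgumentException("malformed extended identifier: " + spelling);
            }
            return spelling.substring(2, spelling.length() - 1);
        }
        // Simple form: the body is the spelling itself.
        return spelling;
    }

    /** Two identifiers are the same only if their bodies match character by character. */
    static boolean sameIdentifier(String a, String b) {
        return bodyOf(a).equals(bodyOf(b));
    }

    public static void main(String[] args) {
        System.out.println(bodyOf("#\"+\""));                  // +
        System.out.println(sameIdentifier("#\"foo\"", "foo")); // true
        System.out.println(sameIdentifier("#\"foo\"", "bar")); // false
    }
}
```

Under this reading, #"foo" and foo name the same entity, while #"class" has the body class without being treated as the keyword.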

3.9 Keywords

The following character sequences...are reserved for use as keywords and cannot be used as ***simple identifiers***:

3.10.6 Escape Sequences for Character and String Literals

It is a compile-time error if the character following a backslash in an escape ***sequence for a character literal or string literal*** is not an ASCII...

6.2 Names and Identifiers

A simple name is ***the body of*** a single identifier. A qualified name consists of a name, a "." token, and a ***simple name***.

6.5 Determining the Meaning of a Name

// Need to ensure that any reference to an identifier in the rest of the JLS is interpreted to mean "the body of the identifier", e.g. 8.4.1 Formal Parameters "If two formal parameters of the same method or constructor are declared to have the same name (that is, their declarations mention the same Identifier),", e.g. 8.9 Enums "* The string must match exactly an identifier used to declare an enum constant in this type", e.g. 9.7 Annotations "The Identifier in an ElementValuePair must be the simple name of...", e.g. much of section 14.

7.7 Unique Package Names

// This section suggests a convention for mangling domain name characters not permitted in simple identifiers. Suggest this problem can be avoided by composing a package name from extended, not simple, identifiers.

This and other feature-specific CRs have been superseded by the JSR 292 spec.
