Bug ID: JDK-6746458 writing libraries in Java for non-Java languages requires support for exotic identifiers

Type: Enhancement
Component: specification
Sub-Component: language
Affected Version: 7

Priority: P3
Status: Closed
Resolution: Duplicate
OS: generic
CPU: generic

Submitted: 2008-09-09
Updated: 2011-01-27
Resolved: 2010-08-18

Related Reports

Duplicate :	JDK-6955343 - specification of JSR 292 language changes
Relates :	JDK-6863344 - javac: Pretty.java should handle exotic identifiers
Relates :	JDK-6868490 - exotic identifier syntax should not be enabled by default in -source 6 or earlier
Relates :	JDK-6754038 - writing libraries in Java for non-Java languages requires support for invokedynamic sites
Relates :	JDK-4266845 - 3.10.5: Explicitly mention that string literals can't contain \u0022 (") characters
Relates :	JDK-6868524 - javadoc should handle exotic identifiers

Description

Public discussion: http://blogs.sun.com/jrose/entry/symbolic_freedom_in_the_vm

Each language has its own rules for forming identifiers of functions, variables, and types. The JVM allows almost total freedom, at the bytecode level, for forming names of methods, fields, and classes. So far so good. If a language (say, Lisp, Smalltalk, Ruby, or Scala) wants to encode its names directly in the JVM, it runs into two problems. The first is the small number of remaining restrictions on JVM bytecode names; this can easily be overcome by lightweight mangling as outlined in the above blog post.

The second problem is the fact that the language runtime is probably written largely in Java, which means that the "exotic" names in the language cannot be directly implemented by Java classes, methods, and fields. In practice, this difficulty forces language implementors to mangle their names not for the JVM (which is permissive) but for Java (which is restrictive).

The solution to the second problem is to provide an "escape" syntax (similar to that of Lisp and other languages) which allows an arbitrary string to pass through the language scanner as a simple identifier token instead of some other token (or an unscannable mess).

We propose the syntax #"foo". See example implementation here:
http://hg.openjdk.java.net/mlvm/mlvm/langtools/file/tip/quid.patch

The javac frontend should accept any quoted string immediately following '#' (with no intervening space), interpreting normal string escape sequences, and taking the resulting string exactly as the spelling of a normal identifier. Keywords like 'int' should not be recognized as keywords inside the quotes. Strings which risk being illegal at the JVM level must be rejected immediately; this simply means rejecting the empty string and strings which contain any of the characters "/.;<>[".
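The scanning rule above can be sketched as a small stand-alone check. This is only an illustration, not the javac change or the linked quid.patch: the class name ExoticIdentifierScanner and its methods are hypothetical, and the sketch assumes the token text is already isolated and handles only the single-character escapes (no octal or \uXXXX forms).

```java
// Illustrative sketch only; names are hypothetical and not part of javac.
final class ExoticIdentifierScanner {
    // Characters the description says must be rejected up front.
    private static final String FORBIDDEN = "/.;<>[";

    /** True if the decoded spelling is non-empty and has no forbidden character. */
    static boolean isAcceptable(String spelling) {
        if (spelling.isEmpty()) {
            return false;
        }
        for (int i = 0; i < spelling.length(); i++) {
            if (FORBIDDEN.indexOf(spelling.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Decodes a token of the form #"..." into the identifier spelling it denotes. */
    static String spellingOf(String token) {
        if (token.length() < 3 || !token.startsWith("#\"") || !token.endsWith("\"")) {
            throw new IllegalArgumentException("not of the form #\"...\": " + token);
        }
        String body = token.substring(2, token.length() - 1);
        StringBuilder spelling = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '"') {
                throw new IllegalArgumentException("unescaped quote inside identifier");
            }
            if (c == '\\') {
                if (++i == body.length()) {
                    throw new IllegalArgumentException("dangling backslash");
                }
                c = unescape(body.charAt(i));
            }
            spelling.append(c);
        }
        String result = spelling.toString();
        if (!isAcceptable(result)) {
            throw new IllegalArgumentException("spelling rejected: " + token);
        }
        return result;
    }

    // Assumption: only the single-character string escapes are supported here.
    private static char unescape(char c) {
        switch (c) {
            case 'b': return '\b';
            case 't': return '\t';
            case 'n': return '\n';
            case 'f': return '\f';
            case 'r': return '\r';
            case '"': return '"';
            case '\'': return '\'';
            case '\\': return '\\';
            default:
                throw new IllegalArgumentException("unsupported escape: \\" + c);
        }
    }
}
```

Under these assumptions, spellingOf applied to the token #"int" yields the plain spelling int with no keyword check, while #"" and #"a.b" are rejected.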

This design is neutral toward mangling schemes but supports the one described in the blog entry mentioned above.

Comments

EVALUATION This and other feature-specific CRs have been superseded by the JSR 292 spec.

18-08-2010

EVALUATION This is a small but important feature. There are two kinds of use for exotic identifiers:
- Writing Java code to reference artifacts in other languages, e.g. call a Ruby "+" function
- Generating Java code to represent artifacts of other languages, e.g. generating a Java class whose methods are named for XML tags and attributes, of which "class" is a popular example.

Some spec points:

3.8 Identifiers

- An identifier is a SIMPLE identifier or an EXTENDED identifier:

    Identifier:
        SimpleIdentifier
        ExtendedIdentifier

    SimpleIdentifier:
        SimpleIdentifierCharacters but not a Keyword or BooleanLiteral or NullLiteral

    SimpleIdentifierCharacters:
        JavaLetter
        SimpleIdentifierCharacters JavaLetterOrDigit

    // JavaLetter and JavaLetterOrDigit are unchanged from JLS3

    ExtendedIdentifier:
        #"ExtendedIdentifierCharacters"

    // Extended identifiers must not be empty, and should permit standalone \ and novel escape sequences
    // like \| and \? in support of John's mangling scheme.
    // Thus ExtendedIdentifier cannot be #"StringLiteral" because it would allow empty strings, and
    // disallow a standalone \ (through StringLiteral's use of StringCharacter), and is too restrictive
    // about legal escape sequences (only \b, \t, \n et al, as per 3.10.6).

    ExtendedIdentifierCharacters:
        ExtendedIdentifierCharacter
        ExtendedIdentifierCharacters ExtendedIdentifierCharacter

    ExtendedIdentifierCharacter:
        InputCharacter but not / or . or ; or < or > or [ or "

- A simple identifier is an unlimited-length sequence of Java letters and digits...
- An extended identifier is a # ASCII character, then a " ASCII character, then an unlimited-length sequence of Unicode characters (excluding / (\u002F) and . (\u002E) and ; (\u003B) and < (\u003C) and > (\u003E) and [ (\u005B) and " (\u0022)), then a " ASCII character.
- The body of a simple identifier is its sequence of Java letters and digits. The body of an extended identifier is the sequence of Unicode characters between the " tokens. Two identifiers are the same only if their bodies have the same Unicode character at corresponding positions. // See also 6.5.
- The body of an extended identifier can use the character and string escape sequences (3.10.6) to represent certain special characters. Outside those escape sequences, the backslash Unicode character (\u005C) is not treated specially in an extended identifier.
- Unlike a simple identifier, an extended identifier may have the same spelling as a keyword or any literal (with the pedantic exception of the empty string literal).

3.9 Keywords

The following character sequences...are reserved for use as keywords and cannot be used as ***simple identifiers***:

3.10.6 Escape Sequences for Character and String Literals

It is a compile-time error if the character following a backslash in an escape ***sequence for a character literal or string literal*** is not an ASCII...

6.2 Names and Identifiers

A simple name is ***the body of*** a single identifier. A qualified name consists of a name, a "." token, and a ***simple name***.

6.5 Determining the Meaning of a Name

// Need to ensure that any reference to an identifier in the rest of the JLS is interpreted to mean "the body of the identifier", e.g.
- 8.4.1 Formal Parameters: "If two formal parameters of the same method or constructor are declared to have the same name (that is, their declarations mention the same Identifier),"
- 8.9 Enums: "* The string must match exactly an identifier used to declare an enum constant in this type"
- 9.7 Annotations: "The Identifier in an ElementValuePair must be the simple name of..."
- much of section 14.

7.7 Unique Package Names

// This section suggests a convention for mangling domain name characters not permitted in simple identifiers. Suggest this problem can be avoided by composing a package name from extended, not simple, identifiers.
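The "body" rule from the 3.8 notes above, under which a simple identifier and an extended identifier with the same body denote the same name, can be sketched as follows. The class IdentifierBodies and its methods are hypothetical names for illustration. The sketch follows the character-level ExtendedIdentifier grammar literally, does not decode escape sequences, and does not validate simple identifiers.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical and not taken from the JLS or javac.
final class IdentifierBodies {
    // #" then one or more characters other than / . ; < > [ " (and line terminators), then ".
    private static final Pattern EXTENDED =
            Pattern.compile("#\"[^/.;<>\\[\"\\r\\n]+\"");

    /** True if the text has the shape of an ExtendedIdentifier. */
    static boolean isExtended(String identifier) {
        return EXTENDED.matcher(identifier).matches();
    }

    /** Returns the body: the text between the quotes, or the whole simple identifier. */
    static String bodyOf(String identifier) {
        if (isExtended(identifier)) {
            return identifier.substring(2, identifier.length() - 1);
        }
        if (identifier.startsWith("#")) {
            throw new IllegalArgumentException("malformed extended identifier: " + identifier);
        }
        // Assumption: anything else is taken to be a well-formed simple identifier.
        return identifier;
    }

    /** Two identifiers are the same only if their bodies match character for character. */
    static boolean sameIdentifier(String a, String b) {
        return bodyOf(a).equals(bodyOf(b));
    }
}
```

With this reading, foo and #"foo" are the same identifier, while #"class" can name something that the simple identifier class cannot, because the keyword restriction applies only to simple identifiers.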

09-09-2008