Bug ID: JDK-8196004 JEP 326: Raw String Literals (Preview)

Summary
-------

Add _raw string literals_ to the Java programming language. A raw string
literal can span multiple lines of source code and does not interpret
escape sequences, such as `\n`, or Unicode escapes, of the form `\uXXXX`.

_Please note: This was intended to be a [preview language feature](http://openjdk.java.net/jeps/12) in JDK 12, but it was [withdrawn](https://mail.openjdk.java.net/pipermail/jdk-dev/2018-December/002402.html) and did not appear in JDK 12. It was superseded by Text Blocks ([JEP 355](https://openjdk.java.net/jeps/355)) in JDK 13._

Goals
-----

- Make it easier for developers to
    - express sequences of characters in a readable form, free of Java
      indicators,
    - supply strings targeted for grammars other than Java, and
    - supply strings that span several lines of source without supplying
      special indicators for new lines.
- Raw string literals should be able to express the same strings as
  traditional string literals, except for platform-specific line
  terminators.
- Include library support to replicate the current `javac` string-literal
  interpretation of escapes and manage left-margin trimming.

Non-Goals
---------

- Do not introduce any new String operators.
- Raw string literals do not directly support string interpolation.
  Interpolation may be considered in a future JEP.
- No change in the interpretation of traditional string literals in any
  way, including:
  - multi-line capability,
  - customization of delimiters with repeating open and close
    double-quotes, and
  - handling of escape sequences.

Motivation
----------

Escape sequences have been defined in many programming languages, including
Java, to represent characters that cannot be easily represented
directly. As an example, the escape sequence `\n` represents the ASCII
newline control character. To print "hello" and "world" on separate
lines, the string `"hello\nworld\n"` can be used:

    System.out.print("hello\nworld\n");

Output:

    hello
    world

Besides suffering from readability issues, this example hard-codes the
Unix line terminator, whereas other operating systems use different
newline representations, such as `\r\n` on Windows. In Java, we use a
higher-level method such as `println` to provide the platform-appropriate
newline sequence:

    System.out.println("hello");
    System.out.println("world");

If "hello" and "world" are being displayed using a GUI library, control
characters may not have any significance at all.

The escape sequence indicator, backslash, is represented in Java string
literals as `\\`. This doubling up of backslashes leads to the
[Leaning Toothpick Syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome),
where strings become difficult to interpret because of excessive
backslashes. Java developers are familiar with examples such as:

    Path path = Paths.get("C:\\Program Files\\foo");

Escape sequences, such as `\"` to represent the double-quote
character, also lead to interpretation issues when used in non-Java
grammars. For example, searching for a double-quote within a string
requires:

    Pattern pattern = Pattern.compile("\\\"");

The reality is that escape sequences are the exception rather than the
rule in everyday Java development. We rarely need control characters,
and the presence of escapes adversely affects the readability and
maintainability of our code. Once we recognize this, a non-interpreted
string literal is a natural conclusion.

Real-world Java code, which frequently embeds fragments of other
programs (SQL, JSON, XML, regex, etc.), needs a
mechanism for capturing literal strings as-is, without special handling
of Unicode escaping, backslash, or new lines.

This JEP proposes a new kind of literal, a _raw string literal_, which
sets aside both Java escapes and Java line terminator specifications, to
provide character sequences that under many circumstances are more
readable and maintainable than the existing traditional string literal.

### File Paths Example

<table>
<tr>
<th>Traditional String Literals</th>
<th>Raw String Literals</th>
</tr>
<tr>
<td>
<pre style="text-align:left;">
Runtime.getRuntime().exec("\"C:\\Program Files\\foo\" bar");
</pre>
</td>
<td>
<pre style="text-align:left;">
Runtime.getRuntime().exec(`"C:\Program Files\foo" bar`);
</pre>
</td>
</tr>
</table>

### Multi-line Example

<table>
<tr>
<th>Traditional String Literals</th>
<th>Raw String Literals</th>
</tr>
<tr>
<td>
<pre style="text-align:left;">
String html = "&lt;html&gt;\n" +
              "    &lt;body&gt;\n" +
              "        &lt;p&gt;Hello World.&lt;/p&gt;\n" +
              "    &lt;/body&gt;\n" +
              "&lt;/html&gt;\n";

</pre>
</td>
<td>
<pre style="text-align:left;">
String html = `&lt;html&gt;<br/>                   &lt;body&gt;<br/>                       &lt;p&gt;Hello World.&lt;/p&gt;<br/>                   &lt;/body&gt;<br/>               &lt;/html&gt;<br/>              `;<br/>
</pre>
</td>
</tr>
</table>

### Regular Expression Example

<table>
<tr>
<th>Traditional String Literals</th>
<th>Raw String Literals</th>
</tr>
<tr>
<td>
<pre style="text-align:left;">
System.out.println("this".matches("\\w\\w\\w\\w"));
</pre>
</td>
<td>
<pre style="text-align:left;">
System.out.println("this".matches(`\w\w\w\w`));
</pre>
</td>
</tr>
</table>

Output:

    true

### Polyglot Example

<table>
<tr>
<th>Traditional String Literals</th>
<th>Raw String Literals</th>
</tr>
<tr>
<td>
<pre style="text-align:left;">
String script = "function hello() {\n" +
                "   print(\'\"Hello World\"\');\n" +
                "}\n" +
                "\n" +
                "hello();\n";
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);

</pre>
</td>
<td>
<pre style="text-align:left;">
String script = `function hello() {<br/>                    print('"Hello World"');<br/>                 }<br/>				<br/>                 hello();<br/>                `;<br/>
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);
</pre>
</td>
</tr>
</table>

Output:

    "Hello World"

### Database Example

<table>
<tr>
<th>Traditional String Literals</th>
<th>Raw String Literals</th>
</tr>
<tr>
<td>
<pre style="text-align:left;">
String query = "SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`\n" +
               "WHERE `CITY` = 'INDIANAPOLIS'\n" +
               "ORDER BY `EMP_ID`, `LAST_NAME`;\n";


</pre>
</td>
<td>
<pre style="text-align:left;">
String query = ``SELECT&nbsp;`EMP_ID`,&nbsp;`LAST_NAME`&nbsp;FROM&nbsp;`EMPLOYEE_TB`<br/>                 WHERE&nbsp;`CITY`&nbsp;=&nbsp;'INDIANAPOLIS'<br/>                 ORDER&nbsp;BY&nbsp;`EMP_ID`,&nbsp;`LAST_NAME`;<br/>               ``;<br/>
</pre>
</td>
</tr>
</table>

Description
-----------

A raw string literal is a new form of literal.

    Literal:
      IntegerLiteral
      FloatingPointLiteral
      BooleanLiteral
      CharacterLiteral
      StringLiteral
      RawStringLiteral
      NullLiteral

    RawStringLiteral:
      RawStringDelimiter RawInputCharacter {RawInputCharacter} RawStringDelimiter

    RawStringDelimiter:
        ` {`}

A raw string literal consists of one or more characters enclosed in
sequences of backticks `` ` `` (`\u0060`) (backquote, accent grave).
A raw string literal opens with a sequence of one or more
backticks and closes when a backtick sequence of the same length
is encountered.
Any other sequence of backticks is treated as part of the string body.

Embedding backticks in a raw string literal can be accomplished by
increasing or decreasing the number of backticks in the open/close
sequences to mismatch any embedded sequences.
However, this does not help when a backtick is desired as the first
or last character in the raw string literal, since that character will be
treated as [part of the open/close sequence](http://mail.openjdk.java.net/pipermail/compiler-dev/2018-November/012665.html).
In this case, it is necessary to [use a workaround](http://mail.openjdk.java.net/pipermail/compiler-dev/2018-November/012657.html),
such as padding the body of the raw string literal and then stripping the padding.
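
The delimiter rule above can be illustrated with a minimal sketch. It is not the `javac` lexer; the class `RawDelimiterSketch` and its methods are hypothetical names introduced here, and the sketch assumes the literal starts at index 0 of the input.

```java
final class RawDelimiterSketch {
    /** Length of the run of backticks in s that starts at index from. */
    static int runLength(String s, int from) {
        int i = from;
        while (i < s.length() && s.charAt(i) == '`') {
            i++;
        }
        return i - from;
    }

    /**
     * Returns the body of the raw string literal that starts at index 0 of src,
     * or null if src does not open with a backtick or is never closed.
     */
    static String body(String src) {
        int open = runLength(src, 0);
        if (open == 0) {
            return null; // not a raw string literal
        }
        int i = open;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            int run = runLength(src, i);
            if (run == open) {
                return src.substring(open, i); // matching close sequence found
            }
            i += run; // a run of any other length is part of the body
        }
        return null; // no matching close sequence before the end of input
    }
}
```

Under this sketch, the input ``` ``can`t`` ``` yields the body ``can`t``, while an input that tries to start the body with a backtick is absorbed into the opening sequence and never closes. Padding the body with spaces and trimming the result afterwards is one form of the workaround mentioned above.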

Characters in a raw string literal are never interpreted, with the
exception of CR and CRLF, which are platform-specific line terminators.
CR (`\u000D`) and CRLF (`\u000D\u000A`) sequences are always translated
to LF (`\u000A`). This translation provides the least surprising behavior across platforms.
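
As a rough illustration of this translation, the following sketch applies the same normalization to an ordinary `String`. The class and method names are hypothetical; in the proposal the translation is performed by the compiler while scanning the literal, not by a library call.

```java
final class LineTerminatorSketch {
    /** Translates CRLF and lone CR to LF, leaving existing LF untouched. */
    static String normalize(String body) {
        // Replace CRLF first so that each pair becomes a single LF,
        // then translate any remaining lone CR.
        return body.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```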

Traditional string literals support two kinds of escapes: 
[Unicode escapes](https://docs.oracle.com/javase/specs/jls/se11/html/jls-3.html#jls-3.3) of the form `\uxxxx`, 
and [escape sequences](https://docs.oracle.com/javase/specs/jls/se11/html/jls-3.html#jls-3.10.6) such as `\n`. 
Neither kind of escape is processed in raw string literals; 
the individual characters that make up the escape are used as-is.
This implies that processing of Unicode escapes is disabled
when the lexer encounters an opening backtick and reenabled when
encountering a closing backtick. For consistency, the Unicode escape
`\u0060` may not be used as a substitute for the opening backtick.

The following are examples of raw string literals:

    `"`                // a string containing " alone
    ``can`t``          // a string containing 'c', 'a', 'n', '`' and 't'
    `This is a string` // a string containing 16 characters
    `\n`               // a string containing '\' and 'n'
    `\u2022`           // a string containing '\', 'u', '2', '0', '2' and '2'
    `This is a
    two-line string`   // a single string constant
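
For comparison, the same strings can be written as traditional string literals. The sketch below uses hypothetical constant names and assumes that the second line of the two-line example starts at the left margin of the source file (the indentation in the listing above is taken to be that of the code listing itself).

```java
final class RawEquivalentsSketch {
    // Each constant is a traditional literal for one of the raw examples above.
    static final String QUOTE = "\"";                  // the double-quote must be escaped
    static final String CANT = "can`t";                // a backtick needs no escape here
    static final String PLAIN = "This is a string";    // 16 characters either way
    static final String BACKSLASH_N = "\\n";           // backslash followed by 'n'
    static final String BACKSLASH_U = "\\" + "u2022";  // backslash, 'u', '2', '0', '2', '2'
    static final String TWO_LINES = "This is a\ntwo-line string"; // LF between the lines
}
```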

It is a compile-time error to have an open backtick sequence and no
corresponding close backtick sequence before the end of the compilation unit.

In a `class` file, a [string constant](https://docs.oracle.com/javase/specs/jvms/se11/html/jvms-4.html#jvms-4.4.3) 
does not record whether it was derived from a raw string literal or a traditional string literal.

At run time, a raw string literal is evaluated to an instance of `String`, 
like a traditional string literal. Instances of `String` that are derived
from raw string literals are treated in the same manner as those 
derived from traditional string literals.

### Escapes

A developer may well want a multi-line string whose escape sequences
are nevertheless interpreted. To meet this
requirement, instance methods will be added to the `String` class to
support the run-time interpretation of escape sequences. Primarily,

    public String unescape()

will translate each character sequence beginning with `\` that has
the same *spelling* as an escape defined in the JLS 
(either a [Unicode escape](https://docs.oracle.com/javase/specs/jls/se11/html/jls-3.html#jls-3.3) 
or an [escape sequence](https://docs.oracle.com/javase/specs/jls/se11/html/jls-3.html#jls-3.10.6)) 
to the character represented by that sequence.

Examples (b0 through b3 are true):

    boolean b0 = `\n`.equals("\\n");
    boolean b1 = `\n`.unescape().equals("\n");
    boolean b2 = `\n`.length() == 2;
    boolean b3 = `\n`.unescape().length() == 1;

Other methods will provide finer control over which escapes are
translated.
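
To make the intended behavior concrete, here is a minimal sketch of such a translation written as a static helper. It is an approximation under stated assumptions, not the proposed implementation: the class name `UnescapeSketch` is hypothetical, octal escapes are omitted for brevity, and malformed or unknown sequences are simply left unchanged.

```java
final class UnescapeSketch {
    /** True if every character of s in [from, to) is a hexadecimal digit. */
    static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Translates escape sequences and Unicode escapes spelled out in s. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) {
                out.append(c); // ordinary character, or a trailing backslash
                continue;
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                case 'u': {
                    // Unicode escape: one or more 'u', then exactly four hex digits.
                    int j = i;
                    while (j < s.length() && s.charAt(j) == 'u') {
                        j++;
                    }
                    if (j + 4 <= s.length() && isHex(s, j, j + 4)) {
                        out.append((char) Integer.parseInt(s.substring(j, j + 4), 16));
                        i = j + 4;
                    } else {
                        out.append('\\').append('u'); // malformed: keep as-is
                    }
                    break;
                }
                default:
                    out.append('\\').append(e); // unknown escape: keep as-is
            }
        }
        return out.toString();
    }
}
```

With this sketch, a two-character input consisting of a backslash and `n` becomes a single LF, mirroring examples b1 and b3 above.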

There will also be a provision for tools to invert escapes. The following
method will also be added to the `String` class:

    public String escape()

which will convert all characters less than `' '` into Unicode or
character escape sequences, characters above `'~'` to Unicode escape
sequences, and the characters `"`,  `'`,  `\` to escape sequences.

Examples (b0 through b3 are true):

    boolean b0 = "\n".escape().equals(`\n`);
    boolean b1 = `•`.escape().equals(`\u2022`);
    boolean b2 = "•".escape().equals(`\u2022`);
    boolean b3 = !"•".escape().equals("\u2022");
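
The inverse direction can be sketched in the same way. The helper below follows the rules stated above (character escapes where one exists, Unicode escapes for other characters below `' '` or above `'~'`); the class name `EscapeSketch` is hypothetical and the choice of lowercase hexadecimal digits is an assumption.

```java
final class EscapeSketch {
    /** Replaces control, non-ASCII, quote and backslash characters with escapes. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\b': out.append("\\b"); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\f': out.append("\\f"); break;
                case '\r': out.append("\\r"); break;
                case '"': out.append("\\\""); break;
                case '\'': out.append("\\'"); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < ' ' || c > '~') {
                        // No character escape exists: emit a Unicode escape,
                        // built piecewise to keep the source easy to read.
                        out.append('\\').append('u')
                           .append(String.format("%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```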

### Source Encoding

If a source file contains non-ASCII characters, ensure use of the
correct encoding on the `javac` command line (see `javac -encoding`).
Alternatively, supply the appropriate Unicode escapes in the raw string
and then use one of the provided library routines described above to
translate Unicode escapes to the desired non-ASCII characters.

### Margin Management

One of the issues with multi-line strings is whether to format the
string against the left margin (as in heredoc) or, ideally, blend with
the indentation used by surrounding code.  The question then becomes,
how to manage this _incidental indentation_.

For example, some developers may choose to code as

```
        String s = `
this is my
    embedded string
`;
```

while other developers may not like the outdenting style and choose to
embed relative to the indentation of the code

```
        String html = `
                       this is my
                           embedded string
                      `;
```

In the latter case, the developer probably intends that `this` should be
left-justified while `embedded` should be relatively indented by four
spaces, and we surely want to support this, but we are reluctant to try
and read the developer's mind and assume that this white space is
incidental.

To allow for contrasting coding styles, while providing a flexible and
enduring solution, raw string literals are scanned with the incidental
indentation intact; i.e., raw. The consequence of this design is that if
the developer chooses the former style above, no further
processing is needed. Otherwise, the developer will have access to easy-to-use
library support for a variety of alternate coding styles. This will
permit coding styles to change without affecting the JLS.

We believe the most common case will be the latter case above. For that
reason, we will provide the following `String` instance method:

        public String align()

which, after removing all leading and trailing blank lines, left
justifies each line without loss of relative indentation, thus
stripping away all incidental indentation and line spacing.

Example:

        String html = `
                           <html>
                               <body>
                                   <p>Hello World.</p>
                               </body>
                           </html>
                      `.align();
        System.out.print(html);

Output:
```
<html>
    <body>
        <p>Hello World.</p>
    </body>
</html>
```
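
The described behavior of `align()` can be approximated with existing APIs. The following is a sketch under assumptions, not the proposed implementation: the class `AlignSketch` and its helpers are hypothetical, interior blank lines are emitted as empty lines, trailing white space on non-blank lines is kept, and every output line ends with LF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlignSketch {
    /** Index of the first non-whitespace character, or the length if blank. */
    static int indentOf(String line) {
        int i = 0;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        return i;
    }

    static boolean isBlank(String line) {
        return indentOf(line) == line.length();
    }

    static String align(String s) {
        List<String> lines = new ArrayList<>(Arrays.asList(s.split("\\R", -1)));
        // Remove leading and trailing blank lines.
        while (!lines.isEmpty() && isBlank(lines.get(0))) {
            lines.remove(0);
        }
        while (!lines.isEmpty() && isBlank(lines.get(lines.size() - 1))) {
            lines.remove(lines.size() - 1);
        }
        // The smallest indentation among non-blank lines is treated as incidental.
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!isBlank(line)) {
                margin = Math.min(margin, indentOf(line));
            }
        }
        // Cut the incidental margin from each line, preserving relative indentation.
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(isBlank(line) ? "" : line.substring(margin)).append('\n');
        }
        return out.toString();
    }
}
```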

Further, generalized control of indentation will be provided with the
following `String` instance method:

        public String indent(int n)

where `n` specifies the number of white space characters to add to or
remove from each line of the string; a positive `n` adds `n` spaces
(U+0020) and a negative `n` removes `|n|` white space characters.

Example:

        String html = `
                           <html>
                               <body>
                                   <p>Hello World.</p>
                               </body>
                           </html>
                      `.align().indent(4);
        System.out.print(html);

Output:
```
    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>
```
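
A `String.indent(int)` method with this general behavior is available in JDK 12 and later, so the effect can be tried with a traditional string literal. The sketch below assumes such a JDK; the class and constant names are hypothetical.

```java
final class IndentSketch {
    // An already left-justified fragment, written as a traditional literal.
    static final String HTML = "<html>\n    <body>\n    </body>\n</html>\n";

    /** Adds four spaces to the start of every line. */
    static String shiftedRight() {
        return HTML.indent(4);
    }

    /** Removes the four spaces again, restoring the original text. */
    static String shiftedBack() {
        return shiftedRight().indent(-4);
    }
}
```

In the shipped method, a negative argument removes at most that many white space characters from each line, so a line with less indentation is simply left-justified.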

In the cases where `align()` is not what the developer wants, we expect the
preponderance of cases to be `align().indent(n)`. Therefore, an additional
variation of `align` will be provided:

        public String align(int n)

where `n` is the indentation applied to the string after _alignment_.

Example:

        String html = `
                           <html>
                               <body>
                                   <p>Hello World.</p>
                               </body>
                           </html>
                      `.align(4);
        System.out.print(html);

Output:
```
    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>
```

Customizable margin management will be provided by the `String` instance method:

        <R> R transform(Function<String, R> f)

where the supplied function f is called with `this` string as the argument.

Example:
```
import java.util.stream.Collectors;

public class MyClass {
    private static final String MARKER = "| ";

    // Strips surrounding white space and a leading MARKER from each line.
    public static String stripMargin(String string) {
        return string.lines()
                     .map(String::strip)
                     .map(s -> s.startsWith(MARKER) ? s.substring(MARKER.length()) : s)
                     .collect(Collectors.joining("\n", "", "\n"));
    }

    public static void main(String[] args) {
        String stripped = `
                              | The content of
                              | the string
                          `.transform(MyClass::stripMargin);
        System.out.print(stripped);
    }
}
```

Output:
```
The content of
the string
```
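
The same idea can be tried on JDK 12 or later, where `String.transform` is available, by substituting a traditional string literal for the raw one. This is a sketch with hypothetical names; it also adds a filter that drops empty lines, which the example above does not have, so that the blank first and last lines of the literal do not appear in the result.

```java
import java.util.stream.Collectors;

final class StripMarginSketch {
    private static final String MARKER = "| ";

    /** Strips white space and a leading MARKER from each non-empty line. */
    static String stripMargin(String string) {
        return string.lines()
                     .map(String::strip)
                     .filter(s -> !s.isEmpty()) // assumption: blank lines are not wanted
                     .map(s -> s.startsWith(MARKER) ? s.substring(MARKER.length()) : s)
                     .collect(Collectors.joining("\n", "", "\n"));
    }

    /** Applies stripMargin through transform, as in the example above. */
    static String demo() {
        String text = "\n    | The content of\n    | the string\n    ";
        return text.transform(StripMarginSketch::stripMargin);
    }
}
```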

It should be noted that concern for class file size and runtime impact
by this design is addressed by the _constant folding_ features of
[JEP 303](http://openjdk.java.net/jeps/303).

Alternatives
------------

### Choice of Delimiters

A traditional string literal and a raw string literal both enclose their
character sequence with *delimiters*. A traditional string literal uses
the double-quote character as both the opening and closing delimiter.
This symmetry makes the literal easy to read and parse. A raw string
literal will also adopt symmetric delimiters, but it must use a
different delimiter because the double-quote character may appear
unescaped in the character sequence. The choice of delimiters for a raw
string literal is informed by the following considerations:

- Delimiters should have a low profile for small character sequences,
  margin management, and general readability.

- The opening delimiter should be a clear indication that what follows
  is the body of a raw string literal.

- The closing delimiter should have a low probability of occurring in the
  string body. If the closing delimiter needs to occur in the body of the
  string then the rules for embedding the closing delimiter should be
  clean and simple. Embedding must be accomplished without the use of
  escapes.

We assume that the string-literal delimiter choice includes only the
three Latin1 quote characters: single-quote, double-quote, and backtick.
Any other choice would affect clarity and be inconsistent with
traditional string literals.

Still, it is necessary to differentiate a raw string literal from a
traditional string literal. For example, double-quote could be combined
with other characters or custom phrases to form a kind of compound
delimiter for raw string literals, such as `$"xyz"$` or
`abcd"xyz"abcd`. These compound delimiters meet the basic requirements,
but lack a clean and simple embedding of the closing delimiter. Also,
there is a temptation in the custom phrases case to assign semantic
meaning to the phrase, heralding another industry similar to Java
annotations.

There is the possibility to use quote repetition: `"""xyz"""`. Here we have
to be cautious to avoid ambiguity. For example, `"" + x + ""` can be
parsed as the concatenation of a traditional string literal with a
variable and another traditional string literal, or as a raw string
literal, delimited by `""`, whose body is the seven characters between
the quotes in `" + x + "`.

The advantage of the backtick is that it does not require repurposing.
We can also avoid the ambiguity created by quote repetition and the empty
string. It is a *new* delimiter in terms of the Java Language
Specification. It meets all the delimiter requirements, including a
simple embedding rule.

Another consideration for choice of delimiters is the potential for
future technologies. With raw and traditional string literals both using
simple delimiters, any future technology could be applied symmetrically.

This JEP proposes to use the backtick character. It is distinct from existing
quotes in the language but conveys a similar purpose.

### Multi-line Traditional String Literals

Even though this option has been set aside as a raw string literal
solution, it may still be reasonable to allow multi-line traditional
string literals in addition to raw string literals. Enabling such a
feature would affect tools and tests that treat multi-line traditional
string literals as an error.

### Other Languages

Java remains one of a small group of contemporary programming languages
that do not provide language-level support for raw strings.

The following programming languages support raw string literals and were
surveyed for their delimiters and use of raw and multi-line strings: C,
C++, C\#, Dart, Go, Groovy, Haskell, JavaScript, Kotlin, Perl, PHP,
Python, R, Ruby, Scala and Swift. The Unix tools bash, grep and sed were
also examined for string representations.

A multi-line literal solution could have been simply achieved by
changing the Java specification to allow CR and LF in the body of a
double-quote traditional string literal. However, the use of double
quote implies that escapes must be interpreted.

A different delimiter was required to signify different interpretation
behavior. Other languages chose a variety of delimiters:

<table>
<tr>
<th><p><strong>Delimiters</strong></p></th>
<th><p><strong>Language/Tool</strong></p></th>
</tr>
<tr>
<td><p><code>&quot;&quot;&quot;...&quot;&quot;&quot;</code></p></td>
<td><p>Groovy, Kotlin, Python, Scala, Swift</p></td>
</tr>
<tr>
<td><p><code>`...`</code></p></td>
<td><p>Go, JavaScript</p></td>
</tr>
<tr>
<td><p><code>@&quot;...&quot;</code></p></td>
<td><p>C#</p></td>
</tr>
<tr>
<td><p><code>R&quot;...&quot;</code></p></td>
<td><p>Groovy (old style)</p></td>
</tr>
<tr>
<td><p><code>R&quot;xxx(...)xxx&quot;</code></p></td>
<td><p>C/C++</p></td>
</tr>
<tr>
<td><p><code>%(...)</code></p></td>
<td><p>Ruby</p></td>
</tr>
<tr>
<td><p><code>qq{...}</code></p></td>
<td><p>Perl</p></td>
</tr>
</table>

Python, Kotlin, Groovy and Swift have opted to use triple double quotes
to indicate raw strings. This choice reflects the connection with
existing string literals.

Go and JavaScript use the backtick. This choice uses a character that
is not commonly used in strings. This is not ideal for use in Markdown
documents, but addresses a majority of cases.

A unique meta-tag such as `@"..."` used in C# provides similar
functionality to the backticks proposed here. However, `@` suggests
annotations in Java. The use of another meta-tag limits the use of that
meta-tag for future purposes.

### Heredoc

An alternative to quoting for raw strings is using "here" documents or
heredocs. Heredocs were first used in Unix shells and have found their
way into programming languages such as Perl. A heredoc has a placeholder
and an end marker. The placeholder indicates where the string is to be
inserted in the code and also specifies the end marker.
The end marker comes after the body of the string. For example,

        System.out.println(<<HTML);
    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>
    HTML

Heredocs provide a solution for raw strings, but are thought by many to
be an anachronism. They are also obtrusive and complicate margin
management.

Testing
-------

String test suites should be extended to duplicate existing tests replacing
traditional string literals with raw string literals.

Negative tests should be added to test corner cases for line terminators and
end of compilation unit.

Tests should be added to test escape and margin management methods.

Tests should be added to ensure we can embed Java-in-Java and
Markdown-in-Java.
Blocks :	JDK-8199065 - Test Plan for JEP 326 Raw String Literals
Blocks :	JDK-8201461 - JShell: Raw string literals doesn't work with java comments
Relates :	JDK-8215489 - Remove String::align
Relates :	JDK-8215681 - Remove compiler support for Raw String Literals from JDK 12
Relates :	JDK-8196005 - Library support for Raw String Literals
Relates :	JDK-8222530 - JEP 355: Text Blocks (Preview)
Relates :	JDK-8206981 - Compiler support for Raw String Literals
Relates :	JDK-8215490 - Remove String::align
Relates :	JDK-8198986 - 3.10.7: Raw string literals