Bug ID: JDK-8223776 String::stripIndent (Preview)

Type: CSR
Component: core-libs
Sub-Component: java.lang

Priority: P3
Status: Closed
Resolution: Approved
Fix Versions: 13

Submitted: 2019-05-13
Updated: 2019-06-05
Resolved: 2019-06-04

Related Reports

CSR :	JDK-8223775 - String::stripIndent (Preview)
Relates :	JDK-8222530 - JEP 355: Text Blocks (Preview)

Description

Summary
-------

This feature introduces a new String instance method, String::stripIndent, which removes the incidental white space that indentation introduces into the content of a text block.

This method is part of a [preview language feature](http://openjdk.java.net/jeps/12): [Text Blocks](https://bugs.openjdk.java.net/browse/JDK-8222530).

Problem
-------

Text blocks are easier to read than their concatenated string literal counterparts, but the "obvious" interpretation of a text block would include the spaces added to indent the embedded string so that it lines up neatly with the opening delimiter and/or enclosing code. Consequently, each text block would not represent the same string as the concatenated string literals, hurting migration, and if the developer were to re-indent the code using the IDE, it would change the contents of the text block. The following HTML example uses dots to visualize the spaces that the developer added for indentation, but did _not_ intend to be part of the content:

        String html = """
        ..............<html>
        ..............    <body>
        ..............        <p>Hello World.</p>
        ..............    </body>
        ..............</html>
        ..............""";

Accordingly, a better interpretation of a text block is to differentiate _incidental white space_ from _essential white space_. The proposed string method would "re-indent" the content by removing the incidental white space (the dots above), to yield what the developer intended (using `|` to visualize the left margin):

        |<html>
        |    <body>
        |        <p>Hello World.</p>
        |    </body>
        |</html>

Solution
--------

The _re-indentation algorithm_ takes a text block and removes the same amount of white space from each line of content until at least one of the lines has a non-white space character in the leftmost position. The algorithm is as follows:

1. Split the content of the multi-line string at every line terminator (LF, CR and CRLF), producing a list of _individual lines_. Note that any line in the content which was just a line terminator will become an empty line in the list of individual lines.

2. Add all _non-blank_ lines from the list of individual lines into a set of _determining lines_. (Blank lines -- lines that are empty or are composed wholly of white space -- have no visible influence on the indentation. Excluding blank lines from the set of determining lines avoids throwing off step 4 of the algorithm.)

3. If the last line in the list of individual lines (i.e., the line with the text block closing delimiter) is _blank_, then add it to the set of determining lines. (The indentation of the closing delimiter should influence the indentation of the content as a whole -- a "significant trailing line" policy.)

4. Compute the _common white space prefix_ of the set of determining lines, by counting the number of leading white space characters on each line and taking the minimum count.

5. Remove the common white space prefix from each _non-blank_ line in the list of individual lines.

6. Remove all trailing white space from all lines in the modified list of individual lines from step 5. ("Hidden" white space at the end of lines is unintentional, so it is overwhelmingly likely that the developer does _not_ want it in the string.) Note that this step collapses wholly-white space lines in the modified list so that they are empty, but does not discard them.

7. Construct the result string by joining all the lines in the modified list of individual lines from step 6, using LF as the separator between lines. If the final line in the list from step 6 is empty, then the joining LF from the previous line will be the last character in the result string.
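
The seven steps above can be sketched in Java as follows. This is a minimal illustration, not the JDK implementation: the class `ReindentSketch` and its helper names are hypothetical, and the input is assumed to be the raw content of a text block, including the final line that carries the indentation of the closing delimiter.

```java
class ReindentSketch {

    // Number of leading characters for which Character.isWhitespace is true.
    static int leadingWhitespace(String line) {
        int i = 0;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        return i;
    }

    // A line is blank when it is empty or consists wholly of white space.
    static boolean isBlank(String line) {
        return leadingWhitespace(line) == line.length();
    }

    static String stripTrailing(String line) {
        int end = line.length();
        while (end > 0 && Character.isWhitespace(line.charAt(end - 1))) {
            end--;
        }
        return line.substring(0, end);
    }

    static String reindent(String content) {
        // Step 1: split at CRLF, CR or LF; the limit -1 keeps trailing empty lines.
        String[] lines = content.split("\r\n|\r|\n", -1);

        // Steps 2-4: the determining lines are the non-blank lines plus the
        // last line (even if blank); take the minimum leading white space count.
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < lines.length; i++) {
            boolean last = (i == lines.length - 1);
            if (last || !isBlank(lines[i])) {
                min = Math.min(min, leadingWhitespace(lines[i]));
            }
        }

        // Step 5: remove the common prefix from non-blank lines.
        // Step 6: strip trailing white space (blank lines become empty).
        String[] result = new String[lines.length];
        for (int i = 0; i < lines.length; i++) {
            String shifted = isBlank(lines[i]) ? lines[i] : lines[i].substring(min);
            result[i] = stripTrailing(shifted);
        }

        // Step 7: join with LF.
        return String.join("\n", result);
    }

    public static void main(String[] args) {
        // Two content lines, then the blank line that precedes the closing delimiter.
        String content = "      outer\n        inner\n    ";
        // Prints "  outer" and "    inner", each followed by LF.
        System.out.print(reindent(content));
    }
}
```

In this example the last line is blank and indented by four spaces, so it joins the determining lines and limits the common prefix to four; the content therefore keeps two spaces of essential indentation.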

This re-indentation algorithm will be referenced in normative text by the new JLS section for text blocks (see http://cr.openjdk.java.net/~abuckley/jep355/text-blocks-jls.html). In other words, the JLS will logically incorporate the API spec of String::stripIndent, but will not physically incorporate it. 

Specification
-------------

```
    /**
     * Returns a string whose value is this string, with incidental
     * {@linkplain Character#isWhitespace(int) white space} removed from
     * the beginning and end of every line.
     * <p>
     * Incidental {@linkplain Character#isWhitespace(int) white space}
     * is often present in a text block to align the content with the opening
     * delimiter. For example, in the following code, dots represent incidental
     * {@linkplain Character#isWhitespace(int) white space}:
     * <blockquote><pre>
     * String html = """
     * ..............&lt;html&gt;
     * ..............    &lt;body&gt;
     * ..............        &lt;p&gt;Hello, world&lt;/p&gt;
     * ..............    &lt;/body&gt;
     * ..............&lt;/html&gt;
     * ..............""";
     * </pre></blockquote>
     * This method treats the incidental
     * {@linkplain Character#isWhitespace(int) white space} as indentation to be
     * stripped, producing a string that preserves the relative indentation of
     * the content. Using | to visualize the start of each line of the string:
     * <blockquote><pre>
     * |&lt;html&gt;
     * |    &lt;body&gt;
     * |        &lt;p&gt;Hello, world&lt;/p&gt;
     * |    &lt;/body&gt;
     * |&lt;/html&gt;
     * </pre></blockquote>
     * First, the individual lines of this string are extracted as if by using
     * {@link String#lines()}.
     * <p>
     * Then, the <i>minimum indentation</i> (min) is determined as follows.
     * For each non-blank line (as defined by {@link String#isBlank()}), the
     * leading {@linkplain Character#isWhitespace(int) white space} characters are
     * counted. The leading {@linkplain Character#isWhitespace(int) white space}
     * characters on the last line are also counted even if
     * {@linkplain String#isBlank() blank}. The <i>min</i> value is the smallest
     * of these counts.
     * <p>
     * For each {@linkplain String#isBlank() non-blank} line, <i>min</i> leading
     * {@linkplain Character#isWhitespace(int) white space} characters are removed,
     * and any trailing {@linkplain Character#isWhitespace(int) white space}
     * characters are removed. {@linkplain String#isBlank() Blank} lines are
     * replaced with the empty string.
     *
     * <p>
     * Finally, the lines are joined into a new string, using the LF character
     * {@code "\n"} (U+000A) to separate lines.
     *
     * @apiNote
     * This method's primary purpose is to shift a block of lines as far as
     * possible to the left, while preserving relative indentation. Lines
     * that were indented the least will thus have no leading
     * {@linkplain Character#isWhitespace(int) white space}.
     * The line count of the result will be the same as line count of this
     * string.
     * If this string ends with a line terminator then the result will end
     * with a line terminator.
     *
     * @implNote
     * This method treats all {@linkplain Character#isWhitespace(int) white space}
     * characters as having equal width. As long as the indentation on every
     * line is consistently composed of the same character sequences, then the
     * result will be as described above.
     *
     * @return string with incidental indentation removed and line
     *         terminators normalized
     *
     * @see String#lines()
     * @see String#isBlank()
     * @see String#indent(int)
     * @see Character#isWhitespace(int)
     *
     * @since 13
     *
     * @deprecated  This method is associated with text blocks, a preview language feature.
     *              Text blocks and/or this method may be changed or removed in a future release.
     */
    @Deprecated(forRemoval=true, since="13")
    public String stripIndent() {
```
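
A short usage sketch of the behaviour described in the specification above. It assumes a JDK in which `String::stripIndent` can be called without preview flags (the method became a standard API in Java 15); the class name and sample strings are illustrative only.

```java
class StripIndentUsage {

    // Four lines with CRLF terminators, trailing spaces after "alpha", and one empty line.
    static final String INPUT = "    alpha  \r\n        beta\r\n\r\n    gamma";

    public static void main(String[] args) {
        String output = INPUT.stripIndent();

        // Relative indentation is preserved, trailing white space is removed,
        // and line terminators are normalized to LF.
        System.out.println(output.equals("alpha\n    beta\n\ngamma"));        // true

        // The line count of the result matches the line count of the input.
        System.out.println(INPUT.lines().count() == output.lines().count());  // true

        // An input that ends with a line terminator yields a result that does too.
        System.out.println("x  \r\ny\r\n".stripIndent().equals("x\ny\n"));    // true
    }
}
```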

Comments

As a code review comment, I suggest putting a statement to the effect of "The output string has the same number of lines as the input" into the API note and also explicitly addressing that the result string will end with a line terminator if and only if the input string ends in a line terminator (that is my reading of the current spec). After discussions with the text blocks team, moving to Approved; white space handling will be an area of interest during the preview period.
04-06-2019
There is a reason for stripping trailing white space that may not be evident.

Lesson learned from JEP 326 _Raw String Literals_: delivering an all-in-one solution is a poor approach when introducing a new Java feature. New features need to be discussed and deployed in small steps. Each step must stand on its own; however, each step may be somewhat reliant on the previous steps. Thus, some forecasting is required.

Two spaces at the end of a line might have been a questionable design choice for markdown, but I'm not going to judge. I still use markdown nonetheless. All the editors I use have strip trailing spaces turned on (mostly because of OpenJDK check-in policy), so I cannot easily use the markdown _two spaces for `<br />`_ feature. Relying on spaces at the end of a line is generally problematic in the OpenJDK. Some of the tests that contain text blocks require the use of `\u0020` at the end of line to prevent removal of spaces before check-in.

The decision to strip trailing white space is about the reliability of visible characters versus the uncertainty of non-evident characters; i.e., white space. Developers would prefer to have something dependable over "I'm not sure how many spaces are there. Fingers crossed. Maybe I better check again." What if your editor strips white space? What if you edit a file that a coworker created expecting the white space to stay? This situation is similar to what led us to normalizing the line terminator.

How do you keep trailing white space? Poor as it is, there is the `\040` solution. Can we do better? How do you know a traditional string literal has spaces at the end? You visualize and count the spaces in the gap leading up to the _visible_ closing delimiter. Java also relies on the delimiter to know where the spaces stop. Text blocks will likely need a visible end of line delimiter to capture trailing white space.
The amber experts have been discussing the representation of very long strings using text blocks and a line continuation sequence. One of the suggestions is to use `\<line-terminator>` to indicate line continuation. Example:

        String text = """
            This is a very long line of text to \
            demonstrate how a long str\
            ing can be broken up over several lines.
            """;

is equivalent to

        String text = "This is a very long line of text to demonstrate how a long string can be broken up over several lines.\n";

With regards to stripping white space, the example manifests the following:

- Each line of the text block content has a beginning and end; the first visible character and the last visible character.
- There is discernible intent to keep the space prior to the continuation sequence in the first line. (If more space was required, shift the continuation sequence further to the right with more spaces.)
- Things will go awry if there was an unintended space before the line terminator; you end up with `\<space>` instead of `\<line-terminator>`.

What does this have to do with stripping trailing white space?

- It is better to rely on what can be seen.
- There are other ways to represent trailing white space. The amber experts just haven't come to a conclusion yet.
- Unintended trailing white space will bite many more developers than those requiring trailing spaces. Best to guarantee that there is no white space there than to leave the developer hunting for bugs at runtime.
- A solution for keeping trailing white space will likely end up in post processing, as in a new escape sequence processed by `String::translateEscapes`.

`String::stripIndent`'s primary purpose is to normalize the content of the text block for further processing. We believe that developers can reasonably use this method as an indentation normalizer.

        String s = " Line 1\n" +
                   " Line 2\n" +
                   " Line 3";
        System.out.println(s.stripIndent());

        |Line 1
        | Line 2
        |Line 3

Yes, `String::stripIndent` with trailing white space stripping does slightly more than strip incidental indentation, but the indentation removal is predominantly what you visualize in the result. The method could be called `String::removeUnintendedWhitespace` or `String::extractVisibleContent`, but would those names confound developers?
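
As a hedged illustration of the `\040` workaround and the post-processing idea raised in this comment, the sketch below strips indentation first and interprets escapes afterwards. It assumes a JDK where both `String::stripIndent` and `String::translateEscapes` are standard (Java 15 or later); the class and method names are hypothetical.

```java
class TrailingSpaceSketch {

    // Strip incidental white space first, then interpret escape sequences,
    // so that an escaped space survives the trailing white space removal.
    static String process(String rawContent) {
        return rawContent.stripIndent().translateEscapes();
    }

    public static void main(String[] args) {
        // First line ends with the six characters "p\040" (a literal backslash
        // followed by 040); the second line ends with three plain spaces.
        String raw = "    keep\\040\n    drop   ";

        // Brackets make the end of each line visible.
        System.out.println("[" + process(raw).replace("\n", "]\n[") + "]");
        // prints:
        // [keep ]
        // [drop]
    }
}
```

The plain trailing spaces on the second line are removed, while the escaped space on the first line is still present in the result because it only becomes white space after stripping.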
04-06-2019
I believe the tension here is that the `stripIndent()` method is caught in the following vise: the natural semantics for the language include stripping at both sides, and we want to have a library method that does what the language does (so users don't have to recreate that functionality), but the behavior of `stripIndent()` is potentially surprising if you don't make the connection with the language behavior. And the obvious solution is to split the entry points; allow something like `indent()` to cover the "reindent from the left only" use case that Joe believes is the natural interpretation for library users, while leaving `stripIndent()` as is (possibly with some renames to connect it better with the language feature.)
04-06-2019
A few better-informed comments on my apprehension about stripping trailing space. I think the motivation to strip leading spaces is clear; the leading spaces are most likely inserted only to appease the formatting and editing conventions of Java programs. The semantic content of the text block starts to the right of the lined-up text. Many of the examples for text blocks use excerpts of other programming or markup languages to motivate the feature. As such, how those languages do or do not treat trailing space should be of concern since the trailing spaces in the lines of a text block are of no particular concern to Java per se. I would argue a more conservative approach is to leave trailing spaces as-is; if the editor strips them away, that is the editor's business or configuration and not the Java platform's. While a language like Whitespace (https://hackage.haskell.org/package/whitespace-0.4/src/docs/tutorial.html) might be dismissed as a niche use-case, in Markdown trailing whitespace can be significant: "When you do want to insert a <br /> break tag using Markdown, you end a line with two or more spaces, then type return." (https://daringfireball.net/projects/markdown/syntax#p). Therefore, I suggest the text blocks effort reconsider the policy of stripping trailing whitespace, to preserve the semantics of Markdown and other languages that may happen to give meaning to trailing spaces.
31-05-2019
The algorithm is effectively the same as String::align with the leading and trailing blank line removal dropped. The number of lines returned is equal to the number of lines supplied. Line terminators are as defined in String::lines; any changes should propagate from there. The use of String::stripTrailing is consistent with removal of incidentals (https://mail.openjdk.java.net/pipermail/amber-spec-experts/2019-May/001376.html). I will attempt to simplify the language.
29-05-2019
The prior work related to raw string literals (JEP 326) included methods for stripping and processing indentation. Are there salient differences between the earlier proposals and this one? If there are notable differences, what was the motivation? In particular, is there a concise operational definition of this method that could be expressed in terms of, say, String.indent? At least an initial reading of the textual form of the re-indentation algorithm leads to some unclear conclusions: "Add all non-blank lines from the list of individual lines into a set of determining lines. (Blank lines -- lines that are empty or are composed wholly of white space -- should exert no influence on the result. Excluding them from the set of determining lines avoids throwing off step 4 of the algorithm.)" Blank lines exerting no influence on the result could be interpreted as meaning blank lines are removed from the output. However, the rest of the text leads me to believe that is not the case; blank lines are instead reduced to a single line terminator. Please clarify. If an invariant of the proposed method is that the number of lines of the output is the same as the number of lines in the original string, that would be useful to note IMO. Possibly worth noting, though it might be too obscure, is that Vertical Tab (U+000B) is treated as a horizontal space character rather than as a line breaking character. Presumably U+0085, Next Line, is treated the same way, as the lines() method does not recognize it as a line terminator. My initial impression is that it is rather aggressive to proactively strip trailing whitespace and all the whitespace from blank lines. Moving to Provisional.
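
The two characters mentioned above can be checked directly. The sketch assumes, as the specification text states, that the method classifies white space with `Character.isWhitespace` and splits lines as `String::lines` does; the class and helper names are hypothetical, and `String::lines` requires Java 11 or later.

```java
class WhitespaceClassification {

    static final int VT = 0x000B;   // Vertical Tab
    static final int NEL = 0x0085;  // Next Line

    // Number of lines that String::lines reports for "a", the given character, then "b".
    static long lineCount(int codePoint) {
        String s = "a" + (char) codePoint + "b";
        return s.lines().count();
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(VT));   // true: counted as white space
        System.out.println(Character.isWhitespace(NEL));  // false: ordinary content
        System.out.println(lineCount(VT));                // 1: not a line terminator
        System.out.println(lineCount(NEL));               // 1: not a line terminator
        System.out.println(lineCount('\n'));              // 2: LF splits the string
    }
}
```

Under these assumptions, neither character splits a line. Vertical Tab counts toward leading and trailing white space, whereas Next Line is not white space according to `Character.isWhitespace` and so would be left in place as content.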
25-05-2019
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060410.html
24-05-2019