JDK-8260266 : UTF-8 by Default
  • Type: CSR
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Priority: P3
  • Status: Closed
  • Resolution: Approved
  • Fix Versions: 18
  • Submitted: 2021-01-21
  • Updated: 2021-08-30
  • Resolved: 2021-08-30
Description
Summary
-------

Use UTF-8 as the default charset for the Java SE APIs, so that APIs which depend on the default charset behave consistently across all JDK implementations and independently of the user’s operating system, locale, and configuration.

Problem
-------

APIs that use the default charset are a hazard for developers who are new to the Java platform. They are also a bugbear for experienced developers. Consider an application that creates a `java.io.FileWriter` with its 1-arg constructor and uses it to write some text to a file. Writing the text encodes it into a sequence of bytes using the default charset. Another application, run on a different machine or by a different user on the same machine, creates a `java.io.FileReader` with its 1-arg constructor and uses it to read the text from the file. Reading the file decodes the bytes into a sequence of characters using the default charset. If the default charset differs between writing and reading, the resulting text may be silently corrupted or incomplete, because these APIs replace erroneous input rather than fail.
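
The hazard can be reproduced with a minimal sketch. It assumes JDK 11 or later, where `FileWriter` and `FileReader` have constructors that take a `Charset`, and it uses two explicit charsets to stand in for the differing defaults of two machines. The class name `CharsetMismatchDemo` and the helper `writeThenRead` are hypothetical and chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetMismatchDemo {

    // Writes text to a temporary file with one charset and reads the first
    // line back with another, standing in for a writer and a reader whose
    // default charsets differ. (Hypothetical helper, not a JDK API.)
    public static String writeThenRead(String text, Charset writerCharset, Charset readerCharset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                try (Writer writer = new FileWriter(file.toFile(), writerCharset)) {
                    writer.write(text);
                }
                try (BufferedReader reader =
                         new BufferedReader(new FileReader(file.toFile(), readerCharset))) {
                    String line = reader.readLine();
                    return line == null ? "" : line;
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café", written with a Unicode escape so the source file encoding does not matter.
        String original = "caf\u00e9";

        // Same charset on both sides: the text survives.
        System.out.println(original.equals(
            writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true

        // Different charsets: no exception is thrown, but the text is corrupted.
        System.out.println(original.equals(
            writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

The second call completes without an error, which is why mismatches of this kind tend to go unnoticed until the corrupted text is displayed or processed.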

Developers who are familiar with the hazard may choose to use methods that specify the charset (either by charset name or `Charset`), but the resulting code is more verbose. Furthermore, using APIs that specify the charset may inhibit the use of some Java language features (method references in particular). Sometimes developers attempt to set the default charset by means of the system property `file.encoding`, but this has never been a supported mechanism (and may not actually be effective, especially when changed after the Java virtual machine has been initialized).
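
As a small illustration of the method-reference point, the sketch below contrasts two ways of turning strings into bytes. The class name `EncoderStyles` and its two fields are hypothetical; only `String.getBytes()` and `String.getBytes(Charset)` are JDK APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class EncoderStyles {

    // Concise method reference, but it binds to the no-arg getBytes(),
    // which encodes with the default charset.
    public static final Function<String, byte[]> DEFAULT_ENCODER = String::getBytes;

    // Naming the charset requires a lambda, because a method reference
    // cannot supply the extra Charset argument.
    public static final Function<String, byte[]> UTF8_ENCODER =
        s -> s.getBytes(StandardCharsets.UTF_8);

    public static void main(String[] args) {
        String text = "\u00e9"; // 'é'
        // Always 2: U+00E9 is encoded as two bytes in UTF-8.
        System.out.println(UTF8_ENCODER.apply(text).length);
        // Depends on the default charset of the running JVM.
        System.out.println(DEFAULT_ENCODER.apply(text).length);
    }
}
```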

In JDK 17 and earlier, the name `default` is recognized as an alias for the `US-ASCII` charset. That is, `Charset.forName("default")` produces the same result as `Charset.forName("US-ASCII")`. The default alias was introduced in JDK 5 to ensure that legacy code which used `sun.io` converters could migrate to the `java.nio.charset` framework introduced in JDK 1.4.

It would be extremely confusing for JDK 18 to preserve `default` as an alias for `US-ASCII` when the default charset is specified to be `UTF-8`. It would also be confusing for `default` to mean `US-ASCII` if the user configures the default charset to its pre-JDK 18 value by setting `-Dfile.encoding=COMPAT` on the command line. Redefining `default` in JDK 18 to be an alias for the default charset (whether `UTF-8` or user-configured) would cause subtle behavioral changes in the (few) programs that call `Charset.forName("default")`.

Continuing to recognize `default` in JDK 18 would be prolonging a poor choice. It is not defined by the Java SE Platform, nor is it recognized by IANA as the name or alias of any character set. In fact, for ASCII-based network protocols, IANA encourages use of the canonical name `US-ASCII` rather than just `ASCII` or obscure aliases such as `ANSI_X3.4-1968` -- plainly, use of the JDK-specific alias `default` goes counter to that advice. Java programs can use the constant `StandardCharsets.US_ASCII` to make their intent clear, rather than passing a string to `Charset.forName(...)`.

Solution
--------

The specification of the `Charset.defaultCharset()` API will be changed to specify that the default charset is UTF-8 unless configured otherwise by implementation-specific means. All APIs that use the default charset will link to `Charset.defaultCharset()` if they don't already do so. `System.out` and `System.err` are the exceptions: they continue to use the charset returned by `Console.charset()`.

To mitigate the compatibility impact, the `file.encoding` property will be documented (in an implementation note) so that it can be set on the command line to the value "COMPAT" (i.e. `-Dfile.encoding=COMPAT`). When the JDK is started with this value, the default charset is determined from the locale and default encoding, as in the long-standing behavior; it is the same encoding as the value of the `native.encoding` system property.

In addition, the `file.encoding` property will also be documented to allow it to be set on the command line with the value "UTF-8", essentially a no-op.
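
To see which charset is actually in effect, an application can print the default charset alongside the two properties discussed above. This is a minimal sketch; the class name `CharsetReport` and the method `describe` are hypothetical, and `native.encoding` is assumed to be absent (reported as `null`) on JDK releases that predate that property.

```java
import java.nio.charset.Charset;

public class CharsetReport {

    // Builds a one-line summary of the charset-related settings of this JVM.
    public static String describe() {
        return "default charset = " + Charset.defaultCharset()
            + ", file.encoding = " + System.getProperty("file.encoding")
            + ", native.encoding = " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running the class once with no options and once with `-Dfile.encoding=COMPAT` shows how the reported default charset changes between the two modes.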

With regard to the charset name `default`, `Charset.forName("default")` will throw an `UnsupportedCharsetException` in JDK 18. This will give developers a chance to detect use of the idiom and migrate to either `US-ASCII` or to the result of `Charset.defaultCharset()`.
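
One possible way to detect and migrate the idiom is sketched below. The class `LegacyDefaultAlias` and its methods are hypothetical helpers, and mapping the legacy name to `US-ASCII` is an assumption that preserves the JDK 17 meaning; code that really intended the default charset should call `Charset.defaultCharset()` instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class LegacyDefaultAlias {

    // Reports whether the running JDK still recognizes the "default" alias.
    public static boolean isDefaultAliasRecognized() {
        try {
            Charset.forName("default");
            return true;
        } catch (UnsupportedCharsetException e) {
            return false;
        }
    }

    // Resolves a charset name from legacy configuration, mapping the old
    // "default" alias to US-ASCII explicitly so that the result does not
    // depend on the JDK version. (Assumed migration choice.)
    public static Charset resolve(String name) {
        if ("default".equalsIgnoreCase(name)) {
            return StandardCharsets.US_ASCII;
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        System.out.println("alias recognized: " + isDefaultAliasRecognized());
        System.out.println("resolved: " + resolve("default"));
    }
}
```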

Specification
-------------

Add the following row to the table in the implementation note of the `java.lang.System#getProperties()` method.

     * <tr><th scope="row">{@systemProperty file.encoding}</th>
     *     <td>The name of the default charset, defaults to {@code UTF-8}.
     *     The property may be set on the command line to the value
     *     {@code UTF-8} or {@code COMPAT}. If set on the command line to
     *     the value {@code COMPAT} then the value is replaced with the
     *     value of the {@code native.encoding} property during startup.
     *     Setting the property to a value other than {@code UTF-8} or
     *     {@code COMPAT} leads to unspecified behavior.
     *     </td></tr>

Modify the following paragraph in the class description of `java.nio.charset.Charset` class from:

     * <p> Every instance of the Java virtual machine has a default charset, which
     * may or may not be one of the standard charsets.  The default charset is
     * determined during virtual-machine startup and typically depends upon the
     * locale and charset being used by the underlying operating system. </p>

to:

     * <p> Every instance of the Java virtual machine has a default charset, which
     * is {@code UTF-8} unless changed in an implementation specific manner. Refer to
     * {@link #defaultCharset()} for more detail.

Modify the method description of `java.nio.charset.Charset#defaultCharset()` from:

      /**
       * Returns the default charset of this Java virtual machine.
       *
       * <p> The default charset is determined during virtual-machine startup and
       * typically depends upon the locale and charset of the underlying
       * operating system.
       *
       * @return  A charset object for the default charset
       *
       * @since 1.5
       */

to:

    /**
     * Returns the default charset of this Java virtual machine.
     *
     * <p> The default charset is {@code UTF-8}, unless changed in an
     * implementation specific manner.
     *
     * @implNote An implementation may override the default charset with
     * the system property {@code file.encoding} on the command line. If the
     * value is {@code COMPAT}, the default charset is derived from
     * the {@code native.encoding} system property, which typically depends
     * upon the locale and charset of the underlying operating system.
     *
     * @return  A charset object for the default charset
     * @see <a href="../../lang/System.html#file.encoding">file.encoding</a>
     * @see <a href="../../lang/System.html#native.encoding">native.encoding</a>
     *
     * @since 1.5
     */

Remove the word `platform` from the default charset wording in the method descriptions of the following classes, e.g., change "the platform's default charset" to "the default charset":

 - `java/io/ByteArrayOutputStream`
 - `java/io/FileReader`
 - `java/io/FileWriter`
 - `java/io/InputStreamReader`
 - `java/io/OutputStreamWriter`
 - `java/io/PrintStream`
 - `java/io/PrintWriter`
 - `java/net/URLDecoder`
 - `java/net/URLEncoder`
 - `java/util/Scanner`

In addition, change the code example in `java/io/OutputStreamWriter` class description from:

     * <pre>
     * Writer out
     *   = new BufferedWriter(new OutputStreamWriter(System.out));
     * </pre>

to:

     * <pre>
     * Writer out
     *   = new BufferedWriter(new OutputStreamWriter(anOutputStream));
     * </pre>

This is a leftover from the related [CSR][1].


  [1]: https://bugs.openjdk.java.net/browse/JDK-8264209
Comments
Moving amended request to Approved.
30-08-2021

[~darcy], we had to reopen this CSR to include the removal of the "default" charset name. Changes since the approved version are:

 - Paragraphs from "The legacy default charset" section of the JEP are copied/pasted into the Problem/Solution sections
 - The Compatibility Risk Description section now has a paragraph describing that Charset.forName("default") will throw a UCE.
27-08-2021

This change is a JEP to make sure that it gets visibility. Documentation, a release note, and outreach are planned.
26-07-2021

Just to add to Naoto's comment, we have this text in the JEP "Developers can check for issues with an existing JDK release by running with -Dfile.encoding=UTF-8 in advance of any early-access or JDK release with this change."
23-07-2021

I don't think any warning or alert in 17 or earlier releases is possible. What we could do is publicize this JEP and urge users to test their applications with `-Dfile.encoding=UTF-8` in their existing environments, which would demonstrate the same effect.
22-07-2021

Moving to Approved. What, if anything, can or should be done in 17 and earlier release trains to help prepare or provide notice of this?
22-07-2021