JDK-8297814 : (fs) Re-visit Path.getExtension return value
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.nio
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2022-11-30
  • Updated: 2022-12-03
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 20
20Unresolved
Description
A new API Path.getExtension() was added by JDK-8057113. This works, but the return value is nullable, which can lead to errors. This can be avoided by making an API change. Since this API was just added in JDK 20, we still have the opportunity to fix it before JDK 20 ships. Therefore, I'm filing this as a P3 bug.

The getExtension() return value comprises three cases:

1) the extension is present and non-empty, e.g. "a/b/c/foo.jpg"
2) the extension is present but empty, e.g. "a/b/c/foo."
3) the extension is absent, e.g. "a/b/c/foo"

As currently defined, the method returns the following for the different cases:

1) "jpg"
2) ""  (empty string)
3) null

The proposal is to change the return values for these cases as follows:

1) ".jpg"
2) "."  (string consisting of a single period)
3) ""  (empty string)

Essentially, the period character "." will now be included in the return value when an extension is present.

There is ample precedent for this inclusion. A survey of a broad variety of platforms reveals quite a split between period-included and period-excluded arrangements; neither dominates. Python is an example of another platform that includes the period in its notion of a file extension. For example:

==========
>>> from os.path import splitext
>>> splitext('a/b/c/foo')[1]
''
>>> splitext('a/b/c/foo.')[1]
'.'
>>> splitext('a/b/c/foo.jpg')[1]
'.jpg'
==========

The advantage of making this change is that the return value of getExtension() is never null. This helps avoid NullPointerExceptions. (Using Optional was previously considered and rejected, as it proved to be quite clumsy in actual usage. Also, nothing else in nio uses Optional, so it would be an outlier.) In addition, adjusting the API to include the period in the return value makes certain use cases work more smoothly. For example, consider removing the extension from a path string:

==========
    // CURRENT
    String removeExtension1(Path p) {
        String s = p.toString();
        String x = p.getExtension();
        return s.substring(0, s.length() - (x == null ? 0 : 1 + x.length()));
    }

    // MODIFIED
    String removeExtension2(Path p) {
        String s = p.toString();
        return s.substring(0, s.length() - p.getExtension().length());
    }
==========

In the current API, the extension must be null-checked, and if it's non-null, the length of the substring to be removed must be incremented by one in order to include the period. The modified API avoids both of these problems.

Another example is obtaining a list of files that have a known extension. This can be done as follows:

==========
    // CURRENT
    List<Path> listJPGfiles1(Path dir) throws IOException {
        try (var s = Files.list(dir)) {
            return s.filter(p -> "jpg".equalsIgnoreCase(p.getExtension()))
                    .toList();
        }
    }

    // MODIFIED
    List<Path> listJPGfiles2(Path dir) throws IOException {
        try (var s = Files.list(dir)) {
            return s.filter(p -> p.getExtension().equalsIgnoreCase(".jpg"))
                    .toList();
        }
    }
==========

Using the current API, one can avoid NPEs by using "Yoda conditions", that is, making the string literal "jpg" be the receiver of an equals() or equalsIgnoreCase() call. This is effective but it's easy to forget to do this, which can lead to NPEs. Even when this technique is used, it's often disliked by developers, who find it reads unnaturally.

A CSR for this bug will be filed to cover the incremental specification changes compared to the previous CSR.

The release note JDK-8297160 will also need to be updated.
Comments
I've looked through a bunch of source code for occurrences of `lastIndexOf('.')`. My hunch was that it would turn up a bunch of places where code did string processing in order to separate the filename extension. Here's a SourceGraph search to do that: https://sourcegraph.com/search?q=context:global+lang:java+lastIndexOf%5C%28%27%5C.%27%5C%29&patternType=regexp&case=yes&sm=1 There are a bunch of hits that parse things like fully qualified package/class names and internet domain names, but there are also a fair number of cases that process file names. It seems like it's a fairly even split between code that includes and excludes the "." as part of the extension. There are also a number of cases where it would be beneficial to have something like a removeExtension() method that extracts the stem of the filename, excluding the extension and the dot. If that's added, then we want p.removeExtension() + p.getExtension() to reconstruct the filename, and we don't want to conflate an empty extension (filename has trailing dot) with absence of an extension (filename has no dot at all). We may consider something like removeExtension() in the future.
03-12-2022

[~mduigou] It's clearly preferable to avoid null. I think getExtension() does need to distinguish between the absent and empty-extension cases. Having the caller test for the filename for a trailing "." is somewhat risky because of other edge cases: getFileName() could return null, or empty, or the string "." which is special-cased because of the leading-period rule.
30-11-2022

[~mduigou] Do you have any use cases that you could share?
30-11-2022

Having used APIs that both include the extension separator and exclude it, the later form without included separator is much more useful. Almost invariably you end up stripping away the separator the vast majority of the time. The majority of the time it also seems reasonable to conflate the cases for no extension with empty extension to both returning the empty string for most usages. For the cases in which the difference is significant, testing the filename to see if the last character is the separator is sufficient to distinguish. Providing an API which provides regular output that doesn't require the separator to be removed seems safest. A more complicated API which has non-regular output, sometimes null, sometimes not containing any characters seems error prone. Imagine that if the API does include separator and potentially return null at least some percentage of lazy users are still going to use `String extension = file.getExtension().substring(1);` which will work for most files but blow-up for both the empty and null cases. BTW, using the constant as the receiver for equality or comparison is often recommended as a security best practice to avoid exploits where bad data might be injected.
30-11-2022