JDK-4899439 : File uses strings for names but file names are byte arrays on OS
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.io
  • Affected Version: 1.4.2,6,6u3
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: linux,solaris_9,solaris_nevada
  • CPU: generic,x86,sparc
  • Submitted: 2003-07-31
  • Updated: 2014-06-02
  • Resolved: 2009-02-16
Description
Name: rmT116609			Date: 07/31/2003


FULL PRODUCT VERSION :
java version "1.4.2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-b28)
Java HotSpot(TM) Client VM (build 1.4.2-b28, mixed mode)

FULL OS VERSION :
Any Solaris or Unix

EXTRA RELEVANT SYSTEM CONFIGURATION :
This can happen on a Japanese machine, where the locale is "ja" but can probably happen on any locale.

A DESCRIPTION OF THE PROBLEM :
You can create a file that Java's java.io.File class cannot read. This is because file names are actually byte arrays in the OS, but java.io.File takes a String for a file name (which is composed of Unicode characters).

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
create.c is a program that creates a file whose name is not a valid character in the 'ja' locale. Note that the OS has no problem with this.

Lister.java defines a class that lists files in the current directory.  For each file, it spits out the (a) 'toString()' version of the file, (b) the char array of the name as hex, and (c) the 'getBytes' byte array of the name.

So, what you can do is compile and run create.c, which will create a file whose name is a single byte whose hex value is 99.  Then compile and run Lister.java, which will give you the following output (shown for two different locales):

---------------------------------------------
$ export LANG=
$ java Lister
name:M-^O����; chars:99,; bytes:99,

$ export LANG=ja
$ java Lister
name:?; chars:fffd,; bytes:3f,
---------------------------------------------

Note that when running in the ja locale, there is no character corresponding to byte value 0x99! So Java substitutes the replacement character U+FFFD when decoding the name, and the '?' character (0x3F) when encoding that name back to bytes. Of course, you don't know which character set a file name was written in, so you can't just switch character sets arbitrarily when trying to load files using java.io.File.

The point is that there are files which Java cannot uniquely represent as a straight String.  I suppose we could get the filename via JNI, do the conversion ourselves, and then use the private-use area of Unicode to encode all our strings, but Ugh!
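The lossy round trip described above can be reproduced without touching the file system. The following is a minimal sketch, not the reporter's code: it uses US-ASCII as a stand-in for a locale encoding that has no character for byte 0x99 (an assumption; the report used the 'ja' locale), and ISO-8859-1 as an encoding that maps every byte value. The class name RoundTripDemo and the method roundTrip are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decode raw file-name bytes to a String and encode them back,
    // which is what a String-based file API has to do around each OS call.
    public static byte[] roundTrip(byte[] rawName, Charset cs) {
        String asString = new String(rawName, cs);
        return asString.getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] rawName = { (byte) 0x99 };

        // US-ASCII is an assumed stand-in for a locale without a mapping for 0x99.
        byte[] viaAscii = roundTrip(rawName, StandardCharsets.US_ASCII);
        // ISO-8859-1 maps all 256 byte values to distinct characters.
        byte[] viaLatin1 = roundTrip(rawName, StandardCharsets.ISO_8859_1);

        System.out.println("US-ASCII preserved:   " + Arrays.equals(rawName, viaAscii));
        System.out.println("ISO-8859-1 preserved: " + Arrays.equals(rawName, viaLatin1));
    }
}
```

Under these assumptions the US-ASCII round trip turns 0x99 into 0x3F ('?'), so the resulting name no longer identifies the file, while the ISO-8859-1 round trip returns the original byte.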

//--------------------------------------------------------
// create.c
//--------------------------------------------------------

#include <stdio.h>

int main()
{
        const char* name = "\x99";
        FILE* file = fopen( name, "w" );
        if( file == NULL )
        {
                printf( "could not open file %s\n", name );
                return 1;
        }

        fclose( file );
        return 0;
}

//--------------------------------------------------------
// Lister.java
//--------------------------------------------------------

import java.io.*;

public class Lister
{
    public static void main( String[] args )
    {
        new Lister().run();
    }

    public void run()
    {
        try
        {
            doRun();
        }
        catch( Exception e )
        {
            System.out.println( "Encountered exception: " + e );
        }
    }

    private void doRun() throws Exception
    {
        File cwd = new File( "." );
        String[] children = cwd.list();
        for( int i = 0; i < children.length; ++i )
        {
            printName( children[ i ] );
        }
    }
    
    private void printName( String s )
    {
        System.out.print( "name:" );
        System.out.print( s );
    
        System.out.print( "; chars:" );
        printCharsAsHex( s );
    
        System.out.print( "; bytes:" );
        printBytesAsHex( s );
    
        System.out.println();
    }

    private void printCharsAsHex( String s )
    {
        for( int i = 0; i < s.length(); ++i )
        {
            char ch = s.charAt( i );
    
            System.out.print( Integer.toHexString( ch ) + "," );
        }
    }

    private void printBytesAsHex( String s )
    {
        byte[] bytes = s.getBytes();
        for( int i = 0; i < bytes.length; ++i )
        {
            byte b = bytes[ i ];
            
            System.out.print( Integer.toHexString( unsignedExtension( b ) ) + "," );
        }
    }

    private int unsignedExtension( byte b )
    {
        return (int)b & 0xFF;
    }
}


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Being able to read a file and then use the name associated with the file to reopen the file.
ACTUAL -
---------------------------------------------
$ export LANG=
$ java Lister
name:M-^O����; chars:99,; bytes:99,

$ export LANG=ja
$ java Lister
name:?; chars:fffd,; bytes:3f,
---------------------------------------------


REPRODUCIBILITY :
This bug can be reproduced always.

CUSTOMER SUBMITTED WORKAROUND :
None that I am aware of. It seems that java.io.File should accept a special object (or byte array) so that, if you can read a file name from a directory listing, you can use that same name to open the file.
(Incident Review ID: 189047) 
======================================================================

Comments
EVALUATION This feature has been addressed by the new file system API defined by JSR-203. In particular, the platform representation is preserved and used in subsequent access to the file.
16-02-2009
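As a rough illustration of the JSR-203 approach mentioned in this evaluation, the sketch below lists a directory with java.nio.file and accesses each entry through the Path object the listing returned, never converting it to a String. It assumes a JDK 7 or later runtime; the class name NioLister and the method countReopenable are hypothetical, and the sketch does not by itself prove that a given platform preserves the native name.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

class NioLister {
    // Count the entries of dir that can be found again through the very
    // Path object the directory listing produced.
    public static int countReopenable(Path dir) throws IOException {
        int reopenable = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // NOFOLLOW_LINKS so that dangling symlinks still count as present.
                if (Files.exists(entry, LinkOption.NOFOLLOW_LINKS)) {
                    reopenable++;
                }
            }
        }
        return reopenable;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("reopenable entries: " + countReopenable(Paths.get(".")));
    }
}
```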

EVALUATION The previous evaluation makes a valid point. When I implemented environment variables, I tried to avoid this sort of bug. When examined by Java code, an environment variable has only String names and values, approximations of the underlying real names and values, but the environment variables themselves will not be corrupted by being passed through the ProcessBuilder abstraction. I think we could fix this in File as well, but with significant effort. Also, the Java platform has firmly decided not to expose the underlying native representation, e.g. byte arrays, directly to the user, making such an effort less complete than one would like. Also, we would see anomalies like two different File objects having exactly the same name, as seen by Java code, since the byte array to String mapping is not one-to-one.
27-02-2008
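The anomaly mentioned at the end of this evaluation, two different files having exactly the same name as seen by Java code, follows directly from the decoding step. Here is a small sketch, again using US-ASCII as an assumed stand-in for any encoding that cannot represent every byte; NameCollisionDemo and decode are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class NameCollisionDemo {
    // Decode raw name bytes the way a String-based API would in an
    // ASCII-only locale (assumed here for illustration).
    public static String decode(byte[] rawName) {
        return new String(rawName, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] first = { (byte) 0x98 };
        byte[] second = { (byte) 0x99 };

        // Two distinct on-disk names collapse into one Java-visible name,
        // because each unmappable byte decodes to U+FFFD.
        System.out.println(decode(first).equals(decode(second)));
    }
}
```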

EVALUATION [dep, 26Feb2008] The problem here is not just the ability to accurately communicate the name of the file to the user, but a fundamental breakdown of the abstractions provided by the JDK. Files simply *don't work*. If I call File.listFiles(), I get a list of File objects. i.e. a list of opaque abstractions of files. If I pass any of those to another standard Java API, e.g. the constructor of FileReader, it should Just Work. Because of this bug, it won't always. At no point do I care about -- or even look at -- the name. The name could be written in a mystical ancient tongue whose representation in any encoding would cause instant death if seen, and it shouldn't matter. To phrase this differently, there are two constraints that must be met: 1) All Files produced by standard APIs such as listFiles() should be valid. 2) All standard APIs that accept Files should correctly consume valid Files. Currently, one (or both) of these constraints isn't being met.
27-02-2008
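One way to observe whether the two constraints above hold for a given directory is to feed every File returned by listFiles() straight back into a standard API. The following is only a diagnostic sketch; ListedFilesCheck and unopenable are hypothetical names, and a file can also fail to open for unrelated reasons such as missing read permission.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ListedFilesCheck {
    // Return the listed entries that cannot be opened again through
    // the File objects that listFiles() itself produced.
    public static List<File> unopenable(File dir) {
        List<File> failures = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return failures; // not a directory, or an I/O error occurred
        }
        for (File child : children) {
            if (child.isDirectory()) {
                continue; // directories cannot be opened as streams
            }
            try (FileInputStream in = new FileInputStream(child)) {
                // Opened successfully: the File object was usable.
            } catch (IOException e) {
                failures.add(child);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        for (File f : unopenable(new File("."))) {
            System.out.println("could not reopen: " + f);
        }
    }
}
```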

EVALUATION It looks like CR 5098433, "REG: DnD of File-List between JVM is broken for non ASCII file names - Win32", has the same cause.
06-03-2007

WORK AROUND If you run Java in a Latin-1 locale on Unix, you will likely get one-to-one conversion between bytes and the first 256 Unicode characters. Here is how you can check whether you have this one-to-one property.

public class BinaryConversion {
    public static void main (String[] args) {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; ++i)
            bytes[i] = (byte) i;
        String alphabet = new String(bytes);
        assert alphabet.length() == 256;
        System.out.println (alphabet);
        byte[] newbytes = alphabet.getBytes();
        assert newbytes.length == bytes.length;
        for (int i = 0; i < bytes.length; ++i)
            assert newbytes[i] == bytes[i];
    }
}

If your current Java process doesn't have this property, you could start a new JVM using Runtime.exec() with environment variables of

LC_ALL=en_US.ISO8859-1 LANG=en_US.ISO8859-1

Not totally portable to all Unix machines, but more so than JNI.
###@###.### 2003-08-18
18-08-2003
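The relaunch step of this workaround could look roughly like the sketch below. It uses ProcessBuilder rather than Runtime.exec() (a substitution, chosen because its environment map is easy to edit), and it assumes the en_US.ISO8859-1 locale is installed on the machine. Latin1Relauncher and latin1Jvm are hypothetical names, and the main class to launch is supplied by the caller.

```java
import java.io.File;
import java.util.Map;

class Latin1Relauncher {
    // Prepare a child JVM that runs mainClass with a Latin-1 locale.
    // The process is only configured here; call start() to launch it.
    public static ProcessBuilder latin1Jvm(String mainClass) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        ProcessBuilder pb = new ProcessBuilder(
                javaBin, "-cp", System.getProperty("java.class.path"), mainClass);

        // Locale values taken from the workaround text; they must exist on the host.
        Map<String, String> env = pb.environment();
        env.put("LC_ALL", "en_US.ISO8859-1");
        env.put("LANG", "en_US.ISO8859-1");

        pb.inheritIO(); // let the child share this process's stdin/stdout/stderr
        return pb;
    }

    public static void main(String[] args) throws Exception {
        // "Lister" is the class from this report; it must be on the class path.
        int exit = latin1Jvm("Lister").start().waitFor();
        System.out.println("child exited with " + exit);
    }
}
```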

EVALUATION How much access to the underlying OS should be provided by Java is highly controversial. I agree that it would have been better to provide a Unicode-string view and a byte array view of all underlying OS text objects in 1995. Today the question is much more difficult.

- The Java API assumes in many places that a String can represent a filename. For example, system properties like user.dir are strings, not more abstract hypothetical FileName objects.
- Operating systems are rapidly Unicode-izing, so this problem is well on its way out. Windows and MacOSX both use Unicode for filenames, so on those systems there is no issue, assuming a proper Java implementation.
- Other operating system objects suffer from the same representation issues. For example, command line arguments and environment variables are presented to the user as if they were strings, but the typical content is filenames. Half-hearted attempts to fix this bug would leave us with the ability to access the underlying byte array only in some contexts.
- Even if we were designing Java from scratch, it is not clear that providing the lower level access would be worth the cost of the increased complexity of the API. Perhaps if class String were not final, we could have a hypothetical ExternalData class inherit from String, with extra methods to provide byte-level access.

###@###.### 2003-08-18

In any case, this issue is not for Tiger.
###@###.### 2003-10-30
18-08-2003