JDK-4100320 : URLEncoder.encode() incorrect on non-ASCII platforms
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.net
  • Affected Version: 1.1.4
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris_2.5.1
  • CPU: sparc
  • Submitted: 1997-12-18
  • Updated: 1999-01-15
  • Resolved: 1999-01-15
Fixed in: 1.2.0 (1.2beta4)
Description


Name: mf23781			Date: 12/18/97


The URLEncoder.encode() method converts a String to its URL-encoded
form:
   o regular alphanumeric characters are not changed.
   o non-alphanumeric characters are converted to %xx, where xx represents
     the ASCII hexadecimal value of the character.
   o spaces are converted to '+'.

Hence, "abc+  def" becomes "abc%2b++def"

   URL-encoded characters are those from a portable subset of ASCII,
   destined to be used in URLs so that they can be correctly handled
   by computers around the world. (Ref: Java 1.1 Developer's Handbook)
 
   Problem 1:
   As each character is handled (whether converted to hex or not),
   it is written to the ByteArrayOutputStream (out) using the out.write
   method. Once the whole string has been written to the stream,
   the stream is converted to a String, using the
   ByteArrayOutputStream.toString method:

        return out.toString();

   This method converts a stream according to the default local encoding.
   This is incorrect as the stream does not contain data in the default
   local encoding, but contains data in ASCII and/or hex, created by out.write.
   The result is that the encoded string returned from this method is garbage.
  
   Solution:
   Explicitly specify the encoding of the stream, rather than letting it
   default to the platform's local encoding:
        try {
            return out.toString("8859_1");
        } catch(Exception e) {
            System.out.println("Exception: " + e);
            return null;
        }
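
   A minimal sketch of the difference (here "Cp1047" merely stands in for
   an EBCDIC default encoding; it may not be installed in every JDK):

    import java.io.ByteArrayOutputStream;
    import java.io.UnsupportedEncodingException;

    public class Problem1Demo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // URLEncoder has written plain ASCII bytes: '%', '2', 'b'
            out.write('%');
            out.write('2');
            out.write('b');

            // Correct: interpret the bytes as ISO 8859-1 (a superset of ASCII).
            System.out.println(out.toString("8859_1"));   // %2b

            // What the no-argument toString() would do on an EBCDIC platform:
            // the same bytes are decoded with the local encoding and come out
            // as control characters / garbage instead of "%2b".
            System.out.println(out.toString("Cp1047"));
        }
    }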


   Once Problem 1 is corrected, the output is:
   OS/390: abc%4e++def   where %4e is EBCDIC '+'
   AIX:    abc%2b++def   where %2b is ASCII  '+'

   Problem 2:
   The above output is incorrect on any non-ASCII platform. The purpose of the
   URLEncoder.encode() method is to convert non-alphanumeric characters
   to their ASCII hexadecimal value, but we have EBCDIC hex values for OS/390.
   The URLEncoder.encode() method states that it "converts to the
   external encoding before hex conversion", hence we end up with
   the local encoding's hex value of the character.

   Solution:
   Explicitly specify the encoding of the writer, rather than letting it
   default to the platform's local encoding:
        OutputStreamWriter writer = null;
        try {
            writer = new OutputStreamWriter(buf,"8859_1");
        } catch(Exception e) {
            System.out.println("Exception: " + e);
        }
   This will ensure that an ASCII character is written to the buffer and
   subsequently converted into hex.
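
   A sketch of the effect (again, "Cp1047" only stands in for an EBCDIC
   default encoding and may not be present in every JDK):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    public class Problem2Demo {
        public static void main(String[] args) throws IOException {
            // With an explicit ISO 8859-1 writer, '+' always becomes the ASCII
            // octet 0x2B, so the hex escape is %2b on every platform.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(buf, "8859_1");
            writer.write('+');
            writer.flush();
            System.out.println(Integer.toHexString(buf.toByteArray()[0] & 0xff)); // 2b

            // With the platform default on OS/390 (EBCDIC, shown here via an
            // explicit Cp1047 writer), the same character becomes 0x4E and the
            // encoder would emit %4e instead.
            ByteArrayOutputStream ebcdicBuf = new ByteArrayOutputStream();
            OutputStreamWriter ebcdicWriter = new OutputStreamWriter(ebcdicBuf, "Cp1047");
            ebcdicWriter.write('+');
            ebcdicWriter.flush();
            System.out.println(Integer.toHexString(ebcdicBuf.toByteArray()[0] & 0xff)); // 4e
        }
    }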

Circumvention:

 Temporarily change the default local encoding
 to ASCII around the call to URLEncoder.encode(),
 e.g.:
 
	 Properties prop = System.getProperties();
	 prop.put("file.encoding", "8859_1");	// switch to an ASCII-compatible encoding
	 result = URLEncoder.encode(s);
	 prop.put("file.encoding", "Cp1047");	// reset to the local encoding
	
======================================================================

A licensee has this to add :

I have had a look at various RFCs concerning
HTTP (RFC 2068),
URIs (RFC 2396, http://www.ics.uci.edu/pub/ietf/uri/)
and HTML (http://www.w3.org/MarkUp/). While all of these were
first designed to use only the US-ASCII character encoding, there
are moves afoot to extend the standards to encompass different
encodings. Specifically, an HTTP server can specify the
character encoding of the message content, and an HTTP client
can request content in a particular character encoding from
the server.

However, URL encoding is more complicated, since there are the
issues of both the external encoding, which is the same as the
content encoding of an HTTP message, for example, and the
internal encoding, which is the translation of characters to
octets (bytes) PRIOR to URL-encoding of the octets.

Section 2.1 of RFC 2396 says:

" 2.1 URI and non-ASCII characters

   The relationship between URI and characters has been a source of
   confusion for characters that are not part of US-ASCII. To describe
   the relationship, it is useful to distinguish between a "character"
   (as a distinguishable semantic entity) and an "octet" (an 8-bit
   byte). There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

   A URI is represented as a sequence of characters, not as a sequence
   of octets. That is because URI might be "transported" by means that
   are not through a computer network, e.g., printed on paper, read over
   the radio, etc.

   A URI scheme may define a mapping from URI characters to octets;
   whether this is done depends on the scheme. Commonly, within a
   delimited component of a URI, a sequence of characters may be used to
   represent a sequence of octets. For example, the character "a"
   represents the octet 97 (decimal), while the character sequence "%",
   "0", "a" represents the octet 10 (decimal).

   There is a second translation for some resources: the sequence of
   octets defined by a component of the URI is subsequently used to
   represent a sequence of characters. A 'charset' defines this mapping.

   There are many charsets in use in Internet protocols. For example,
   UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
   of characters in the repertoire of ISO 10646.

   In the simplest case, the original character sequence contains only
   characters that are defined in US-ASCII, and the two levels of
   mapping are simple and easily invertible: each 'original character'
   is represented as the octet for the US-ASCII code for it, which is,
   in turn, represented as either the US-ASCII character, or else the
   "%" escape sequence for that octet.

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used.

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.

"

Of special significance to our issue is:

 "Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification"

As I read it, this states that there is no mechanism for
specifying the internal character encoding within a URL. This
is obvious from the syntax definition for URLs.

Consequently, a consumer of a URL has to somehow know what the
internal encoding of the URL producer was. There is no protocol
for discovering it.

Moreover, the internal encoding is not necessarily the same as
the document encoding (external encoding). Consider the
following scenario:

An OS/390 machine serves an HTML document for a client
request. The document is in EBCDIC, and the URLs within the
document have been internally encoded using the EBCDIC
character encoding. However, the client requests the document
in ASCII. Consequently, the server auto-translates the
document from EBCDIC to ASCII, and the document is transmitted
back to the client in US-ASCII encoding. However, the URLs
within the document were internally encoded in EBCDIC, and to
decode them, the client needs to URL-decode the (ASCII-encoded)
octets making up the URL, then run these decoded octets
backwards through EBCDIC to get the original characters of the
URL. There is no way for the client to tell that the URLs
were internally encoded in EBCDIC, however, so it is going to
use ASCII internally (its default encoding) and get some very
funny looking URLs.

Note that the above is not a hypothetical situation. It is exactly
how HTTP servers and proxies are supposed to behave in the
presence of requests and documents in different encodings.
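
To make the mismatch concrete, a small sketch ("Cp1047" again standing
in for the EBCDIC producer; availability of that charset depends on the
JDK installation):

    import java.io.UnsupportedEncodingException;

    public class MismatchDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // The producer on OS/390 maps the character '+' to an octet with EBCDIC...
            byte[] producedOctets = "+".getBytes("Cp1047");          // { 0x4E }

            // ...but the consumer, unable to know that, interprets the
            // URL-decoded octets with its own ASCII-compatible default:
            String consumed = new String(producedOctets, "8859_1");
            System.out.println(consumed);                            // "N", not "+"
        }
    }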

Now, as far as I am concerned, there are two sensible approaches
to the issue of internal URL encoding/decoding:

1) Always use the default platform encoding/decoding.
2) Always use some standard character encoding/decoding.

Let's examine these in turn:

1) URLs generated and consumed on the same machine, or on
machines with the same default encoding, will work OK. URLs
generated and consumed on machines with disparate default
encodings will not work. Older internet utility programs (mail,
HTTP clients, etc.) won't work unless the default encoding is
US-ASCII, because they won't know to think about anything else.

2) The obvious choice for a uniform encoding is US-ASCII, since
it is the historical choice. Now, if all machines use the
same encoding, then URLs will work across all machines,
regardless of their default encoding. If we choose US-ASCII,
then older internet programs will also work fine. However,
non-ASCII characters cannot be handled.

However, there is a better choice, as far as I know. UTF-8 can
be used as a transparent extension to US-ASCII. Since all
ASCII characters are identically mapped in UTF-8, it is
backwards compatible with old programs, and it can also
represent arbitrary Unicode characters, allowing URLs to
contain arbitrary Unicode characters.

Hence, I would recommend, in the interests of maximum
interoperability and minimum interference with existing
systems, that URLEncoder.encode() use UTF-8 as its internal
encoding on all platforms.



mick.fleming@Ireland 1998-12-10
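
For what it is worth, later JDK releases (1.4 and up) added encode/decode
overloads that take a charset name explicitly, which makes exactly this
recommendation straightforward to follow; a small sketch using that later
API:

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.net.URLEncoder;

    public class Utf8EncodeDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String original = "caf\u00e9 au lait";              // contains a non-ASCII character

            // Use UTF-8 as the internal (character-to-octet) encoding.
            String encoded = URLEncoder.encode(original, "UTF-8");
            System.out.println(encoded);                         // caf%C3%A9+au+lait

            // The consumer must decode with the same internal encoding.
            String decoded = URLDecoder.decode(encoded, "UTF-8");
            System.out.println(decoded.equals(original));        // true
        }
    }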

Comments
CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: generic
FIXED IN: 1.2beta4
INTEGRATED IN: 1.2beta4
14-06-2004

WORK AROUND
Name: mf23781    Date: 12/18/97
======================================================================
11-06-2004

SUGGESTED FIX
A licensee has implemented the following fix on their non-ascii platform:

103,111c103,120
<             // convert to external encoding before hex conversion
<             try {
<                 writer.write(c);
<                 writer.flush();
<             } catch(IOException e) {
<                 buf.reset();
<                 continue;
<             }
<             byte[] ba = buf.toByteArray();
---
>             // xxxx.7522 only convert if the character is outside of the ASCII range
>             byte[] ba;
>             if( c < 128 ) {
>                 // don't convert to the local encoding
>                 ba = new byte[] { (byte)c };
>             }
>             else {
>                 // convert to external encoding before hex conversion
>                 try {
>                     writer.write(c);
>                     writer.flush();
>                 } catch(IOException e) {
>                     buf.reset();
>                     continue;
>                 }
>                 ba = buf.toByteArray();
>             }
>             // ibm.7522 ends

131c140,151
<         return out.toString();
---
>         /*
>          * xxxx.7522
>          *
>          * Don't use the default platform encoding - the bytes are already
>          * in ascii/unicode and we don't want them converted.
>          */
>         try {
>             return out.toString("8859_1");
>         } catch (Exception e) {
>             return null;
>         }

mick.fleming@Ireland 1999-01-15
15-01-1999

EVALUATION
The outer encoding in URLEncoder has been removed, but the inner
"conversion to external encoding" remains because there are servers that
depend on it to be able to reference files in other languages.
michael.mccloskey@eng 1998-04-24
24-04-1998