JDK-4383964 : localization/UNICODE issues
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.util
  • Affected Version: 1.3.0
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: generic
  • CPU: generic
  • Submitted: 2000-10-27
  • Updated: 2000-10-28
  • Resolved: 2000-10-28
Related Reports
Duplicate :  
Description

Name: skT45625			Date: 10/27/2000


java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Server VM (build 2.0fcs-E, mixed mode)

The java.util.ResourceBundle class and its subclasses, as well as
the java.util.Properties class, are supposed to work with ISO-8859-1
encoded data (including Unicode encoded using the \uxxxx format).

In practice, Properties encodes everything that is not printable
US-ASCII, or in a small set of control characters, into the Unicode
\uxxxx format.

The \uxxxx format is painful to work with, and next to impossible
to customize or localize.  You'll notice this flies in the face
of the very purpose of these classes.

Now, if the java.util.Properties class could be passed an
encoding to load the data with (as described in the java.lang
package.html), it would make it much easier for people to work
directly in Unicode or other character encodings (especially
Unicode).  Similarly, the java.util.ResourceBundle class, or
rather the java.util.PropertyResourceBundle class, should accept
an encoding for .properties files.

This would mean customization could be done in the native language,
by native speakers, in Unicode, and the results further edited and
saved as Unicode.  All the actual functionality needed for this
already exists in java.io.InputStreamReader and
java.io.OutputStreamWriter.  If java.util.Properties accepted a
java.io.Reader or java.io.Writer, you would be most of the way to
internationalization.
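
What the report asks for can be sketched with the Reader-based
overload that was eventually added in Java SE 6,
Properties.load(Reader).  A minimal illustration follows; the file
name, key, and helper method are hypothetical, and the round trip
through a temp file is only there to make the sketch self-contained:

```java
import java.io.*;
import java.util.Properties;

public class LoadUtf16Props {
    // Load a .properties file in an arbitrary encoding by decoding it
    // with InputStreamReader first, exactly as the report proposes.
    static Properties loadWithEncoding(File f, String enc) throws IOException {
        Properties p = new Properties();
        try (Reader r = new InputStreamReader(new FileInputStream(f), enc)) {
            p.load(r);  // load(Reader) overload, available since Java SE 6
        }
        return p;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".properties");
        f.deleteOnExit();
        // Write the file in UTF-16 with the non-ASCII text stored
        // directly -- no \uxxxx escapes needed in the file itself.
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-16")) {
            w.write("greeting=gr\u00fc\u00df Gott\n");
        }
        Properties p = loadWithEncoding(f, "UTF-16");
        System.out.println(p.getProperty("greeting")); // prints "grüß Gott"
    }
}
```

The InputStreamReader does the decoding (including consuming the
UTF-16 byte-order mark), so Properties never has to know about the
file's encoding at all.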
(Review ID: 109597) 
======================================================================

Comments
WORK AROUND

Name: skT45625   Date: 10/27/2000

For java.util.Properties, the following class works around the
problem.  For java.util.ResourceBundle there is no easy workaround,
except to skip the automatic localization and work directly with
java.util.Properties.

public class Translate extends Object {
    private static boolean bDebug = Boolean.getBoolean("com.ndc.verbose");
    private java.io.ByteArrayOutputStream outStep;
    private java.io.ByteArrayInputStream inStep;

    /** Creates new Translate */
    public Translate() {
        inStep = null;
        outStep = null;
    }

    /**
     * This is an input stream with the data from a file on it.  The
     * InputStream will only exist if {@link #translateIn} has been called
     * before.  Use this function in conjunction with
     * {@link java.util.Properties#load}.
     *
     * @return An InputStream containing the data.
     */
    public java.io.InputStream getInputStream() {
        return inStep;
    }

    /**
     * This is an output stream for the Properties class to write to.  Use
     * this in conjunction with {@link java.util.Properties#store}, and use
     * {@link #translateOut} afterwards to actually write the data out.
     *
     * @return An OutputStream to write data to.
     */
    public java.io.OutputStream getOutputStream() {
        outStep = new java.io.ByteArrayOutputStream();
        return outStep;
    }

    /**
     * Read in a Unicode text file (with encoding UTF-16) and convert it to
     * a format that {@link java.util.Properties#load} will be able to
     * handle.  The generated InputStream with the data on it can be
     * accessed through {@link #getInputStream}.
     * <p>
     * These methods and fields are not synchronized or exclusive, so care
     * must be taken in multithreaded use.  This can be addressed by using
     * multiple instances of <code>Translate</code>.
     * <p>
     * The code is designed to match what happens in
     * {@link java.util.Properties#store}.  All non-ASCII and non-ISO
     * control characters are converted to the standard Unicode escape
     * encoding (<code>\u0000</code>).
     *
     * @param sFileName The file name to read data from.
     * @throws java.io.FileNotFoundException Thrown if the file is not found.
     * @throws java.io.IOException If a read fails.
     * @throws java.io.UnsupportedEncodingException If the encoding isn't
     *         supported (however, the Java 2 platform says these are
     *         standard encodings to be supported on all platforms).
     */
    public void translateIn(String sFileName)
            throws java.io.FileNotFoundException, java.io.IOException,
                   java.io.UnsupportedEncodingException {
        java.io.RandomAccessFile raf = new java.io.RandomAccessFile(sFileName, "r");

        // Read fully and put all the data into a string buffer.  Assume a
        // UTF-16 file: if all the data is UTF-16, the number of bytes in
        // the file divided by two is enough.
        int len = (int) raf.length() / 2;
        char cTemp = 0;
        StringBuffer sbTemp = new StringBuffer((int) raf.length());

        for (int i = 0; i < len; i++) {
            cTemp = raf.readChar();
            if (!Character.isISOControl(cTemp) && cTemp > 0x7e) {
                // It turns out that Properties.store only stores US-ASCII,
                // which means all other characters must be converted.
                // Control characters from ISO-8859-1 and US-ASCII must
                // still be let through, of course.
                sbTemp.append('\\');
                sbTemp.append('u');
                sbTemp.append(toHex((cTemp >> 12) & 0xf));
                sbTemp.append(toHex((cTemp >>  8) & 0xf));
                sbTemp.append(toHex((cTemp >>  4) & 0xf));
                sbTemp.append(toHex( cTemp        & 0xf));
            } else {
                // Regular US-ASCII character or control character.
                sbTemp.append(cTemp);
            }
        }
        raf.close();

        // Create the InputStream to feed to Properties.load.
        inStep = new java.io.ByteArrayInputStream(
            sbTemp.toString().getBytes("iso-8859-1"));
    }

    /**
     * This takes something in ISO-8859-1, with appropriately encoded
     * control and Unicode characters (\u0000 in both cases), and converts
     * it to regular Unicode except for the control characters as
     * determined by {@link java.lang.Character#isISOControl}.  This is
     * then written to a file in UTF-16 format (i.e. regular double-byte
     * Unicode).
     *
     * @param sFileName The name of the file to write.
     */
    public void translateOut(String sFileName)
            throws java.io.UnsupportedEncodingException,
                   java.io.FileNotFoundException, java.io.IOException {
        String sTemp = outStep.toString("ISO-8859-1");
        int len = sTemp.length();
        StringBuffer sbOut = new StringBuffer();

        // Go through, and for each escape sequence starting with 'u',
        // convert it to an integer; if it's not a control character,
        // convert it to the corresponding Unicode character.
        for (int i = 0; i < len; i++) {
            char cTemp = sTemp.charAt(i), cNext, cUnicode;
            char[] cUnicodeEncode = new char[4];

            if (bDebug)
                // If an escape sequence occurs, the numbers should be
                // incremented by more than 1.
                System.out.println(i);

            if (cTemp == '\\') {
                // Start of an escape sequence.
                if (bDebug)
                    // Tell the user what's going on.
                    System.out.print("escape seq ");
                cNext = sTemp.charAt(++i);  // increment i
                switch (cNext) {
                    case '\\': case 't': case 'n': case 'r': case 'f':
                        // These are escaped special control characters.
                        if (bDebug)
                            System.out.println("control character");
                        sbOut.append(cTemp);
                        sbOut.append(cNext);
                        break;
                    case '=': case ':': case ' ': case '#': case '!':
                        // These are special characters escaped to ensure
                        // proper loading.
                        if (bDebug)
                            System.out.println("special character");
                        sbOut.append(cTemp);
                        sbOut.append(cNext);
                        break;
                    case 'u':
                        // Pick up the next four characters and check what
                        // to do with them.
                        sTemp.getChars(i + 1, i + 5, cUnicodeEncode, 0);
                        if (bDebug) {
                            System.out.print("unicode ");
                            System.out.print(cTemp);
                            System.out.print(cNext);
                            System.out.println(cUnicodeEncode);
                        }
                        cUnicode = decodeUnicodeEncoding(cUnicodeEncode);
                        if (Character.isISOControl(cUnicode)) {
                            // It is a control character; leave it in
                            // escaped format.  The loop will continue and
                            // pick up the digits as regular characters.
                            sbOut.append(cTemp);
                            sbOut.append(cNext);
                        } else {
                            sbOut.append(cUnicode);
                            i += 4;  // skip the four digits; they've been processed
                        }
                        break;
                    default:
                        // Some not-understood escape sequence; let it through.
                        sbOut.append(cTemp);
                        sbOut.append(cNext);
                }
            } else {
                // Regular character.
                sbOut.append(cTemp);
            }
        }

        // Open the file and prepare to write in straight Unicode.
        java.io.FileOutputStream fos = new java.io.FileOutputStream(sFileName);
        java.io.OutputStreamWriter osw = new java.io.OutputStreamWriter(fos, "UTF-16");

        if (bDebug) {
            String sTemp2 = sbOut.toString();
            System.out.println(sTemp2);
            System.out.print(len);
            System.out.print(" vs. ");
            System.out.println(sTemp2.length());
            java.io.FileOutputStream fosTemp =
                new java.io.FileOutputStream(sFileName + ".old");
            fosTemp.write(outStep.toByteArray());
            fosTemp.close();
            osw.write(sTemp2);
        } else {
            osw.write(sbOut.toString());
        }

        // Clean up.
        osw.flush();
        fos.flush();
        osw.close();
        fos.close();
    }

    /**
     * Converts four hexadecimal digits to a double-byte character.  It may
     * not properly handle negative char values, but it handles them as
     * well as the original conversion functions in java.util.Properties.
     *
     * @see #fromHex
     * @see java.util.Properties#loadConvert
     * @param cEncoded The four hexadecimal digits to decode.
     */
    public char decodeUnicodeEncoding(char[] cEncoded)
            throws IllegalArgumentException {
        int value = 0;
        for (int i = 0; i < cEncoded.length; i++)
            value = (value << 4) + fromHex(cEncoded[i]);
        return (char) value;
    }

    /**
     * Converts a character from a hexadecimal digit to an integer.  It is
     * case insensitive.
     *
     * @param cHex The hexadecimal character to convert.
     * @return The integer value of cHex after conversion.
     * @throws IllegalArgumentException Thrown if the character is not a
     *         valid digit.
     */
    public int fromHex(char cHex) throws IllegalArgumentException {
        int value = 0;
        switch (cHex) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                value = cHex - '0';
                break;
            case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
                value = 10 + cHex - 'a';
                break;
            case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                value = 10 + cHex - 'A';
                break;
            default:
                throw new IllegalArgumentException("Malformed \\uxxxx encoding.");
        }
        return value;
    }

    /**
     * An array of hex digits used for lookup by {@link #toHex}.
     */
    private static final char[] cHexDigit = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };

    /**
     * Converts an integer to a hexadecimal digit.  The integer passed in
     * should be in the range <code>0 &lt;= iIn &lt;= 15</code>.  The code
     * will mask off any bits higher than the fourth, so caveat emptor.
     * The digits are lower case.
     *
     * @param iIn The integer to convert to a hexadecimal digit.
     * @return The hexadecimal digit that corresponds to the integer.
     */
    private char toHex(int iIn) {
        return Translate.cHexDigit[(iIn & 0xf)];
    }
}
======================================================================
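The core trick in the workaround above -- escaping every character
beyond printable US-ASCII into \uxxxx form so that the standard
ISO-8859-1 Properties.load can consume it -- can be sketched more
compactly.  This is a minimal illustration of that one step, not the
full file round trip; the class and method names are hypothetical, and
String.format is used in place of the hand-rolled toHex lookup:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class EscapeSketch {
    // Escape every non-control character above 0x7E as \uxxxx,
    // mirroring the filter that translateIn applies to UTF-16 input.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7e && !Character.isISOControl(c)) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Text that would normally be mangled by a byte-oriented load.
        String line = "greeting=gr\u00fc\u00df Gott\n";
        InputStream in = new ByteArrayInputStream(
                escapeNonAscii(line).getBytes("ISO-8859-1"));
        Properties p = new Properties();
        p.load(in);  // the byte-stream load decodes the \uxxxx escapes
        System.out.println(p.getProperty("greeting")); // prints "grüß Gott"
    }
}
```

Properties.load itself decodes the \uxxxx sequences, which is why
pre-escaping is all the workaround needs on the input side.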
11-06-2004