Name: dk106046 Date: 04/13/2004
JNI does not truly support UTF-8 strings.
Background
----------
A Java method is called from C++ code. Two jstring objects are created with the NewStringUTF() function, and the Java method is then invoked with CallObjectMethod(). One of the jstring objects contains Simplified Chinese non-surrogate and surrogate characters. Here is the data before NewStringUTF() is called:
Data before NewStringUTF() (in UTF-8 & represented in hex notation): E9 93 BE E6 8E A5 2D F0 A0 80 80 F0 A0 80 81 F0 A0 80 82
Note: The Chinese characters before the hyphen (hex 2D) are non-surrogate characters and the characters after the hyphen are surrogate characters.
Here is the data dumped from the Java side, obtained by calling the getBytes() method with encoding "UTF-8" to get the byte array of the data:
Data from Java code (in UTF-8 & represented in hex notation): E9 93 BE E6 8E A5 2D C2 80 C2 80 C2 80 C2 80 C2 80 C2 80 C2 80 C2 80 C2 80 C2 80 C2 80 C2 80
The Chinese characters before the hyphen are fine but the surrogate characters after the hyphen are now corrupted.
To verify that the JRE we are using supports the GB18030 surrogate characters, I created a Java program that reads in a UTF-8 character string and outputs it using the GB18030 encoding. The program converted the data from UTF-8 to GB18030 successfully, so I assume the problem resides in the NewStringUTF() function.
Testcase
--------
import java.io.*;

public class TestSurrogate {
    static {
        System.loadLibrary("SurrogateJNI");
    }

    public native String saveText(String text, String inFile);

    public static void main(String[] argv) {
        TestSurrogate prog = new TestSurrogate();
        try {
            if (argv.length != 4) {
                System.out.println("Usage: TestSurrogate InputFileName " +
                    "InputEncoding OutputFileName OutputEncoding");
                return;
            }
            // read in the input text using the encoding given in argv[1]
            FileInputStream fis = new FileInputStream(argv[0]);
            byte[] data = new byte[1048576];
            int inLen = fis.read(data, 0, 1048576);
            if (inLen < 1) {
                System.out.println("Could not read input from file " + argv[0]);
                fis.close();
                return;
            }
            String origText = new String(data, 0, inLen, argv[1]);
            fis.close();
            System.out.println("Data read successfully from file " + argv[0] +
                " using encoding " + argv[1]);
            // save text using the GB18030 encoding
            System.out.println("Outputting original text to file JavaOutputGB18030.txt" +
                " using encoding GB18030");
            FileOutputStream xfos = new FileOutputStream("JavaOutputGB18030.txt");
            xfos.write(origText.getBytes("GB18030"));
            xfos.close();
            // send text to the JNI method and get the same text back
            System.out.println("Calling JNI method saveText()");
            String retText = prog.saveText(origText, argv[0]);
            System.out.println("Successfully returned from JNI method saveText()");
            // output the returned text
            System.out.println("Outputting returned text to file " + argv[2] +
                " using encoding " + argv[3]);
            FileOutputStream fos = new FileOutputStream(argv[2]);
            fos.write(retText.getBytes(argv[3]));
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    } // end main
} // end class TestSurrogate
Findings
--------
After simulating the testcase, we found that GetStringUTFChars() works fine, but NewStringUTF() returns a value in which the characters are corrupted. NewStringUTF() and GetStringUTFChars() do not use the converters and do not know about surrogate pairs; they simply emit a UTF value of at most 3 bytes per char. The surrogate pairs all map to code points above U+FFFF, and most start in the U+20000 range, which cannot be encoded in a 3-byte UTF-8 sequence. WE CANNOT USE SURROGATE CHARACTERS in these functions.
Opinions
--------
The problem is that Java's "UTF" functions don't actually use UTF-8. They use an encoding that is similar to UTF-8, but not the same. The two differences are:
1. the encoding of NUL=U+0000 as C0 80, which is illegal UTF-8 but avoids a NUL byte 0x00 in the byte stream during string serialization, and
2. the encoding of supplementary code points using 3-byte sequences for each surrogate, like in CESU-8 but unlike in UTF-8 (where this is illegal).
In summary, use the "UTF" JNI functions only if you a) know exactly what you are doing and b) want to use Java's string serialization for serialization purposes, not for processing. Never treat "UTF" function values as if they were UTF-8 strings.
Instead, the JNI API provides perfectly good functions for handling 16-bit Unicode strings (NewString(), GetStringChars()), which match the form of Unicode used just about everywhere else for processing - be it in Java, Windows, ICU(!), MacOS X, or almost any other seriously Unicode-supporting software. If you need UTF-8 string support, you can use conversion functions such as those in ICU or Windows to convert between the 16-bit Unicode processing form and the external UTF-8 charset.
This bug is really about the statement "The JNI uses UTF-8 strings ..." at http://java.sun.com/j2se/1.4.2/docs/guide/jni/spec/types.html, which is not correct.
======================================================================