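java/util/zip/ZipOutputStream.java has its own implementation of UTF-8
encoding that does not take surrogates into account. Each char of a
surrogate pair is encoded separately as a three-byte sequence, where
standard UTF-8 requires a single four-byte sequence for the
supplementary character: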
    private static byte[] getUTF8Bytes(String s) {
        char[] c = s.toCharArray();
        int len = c.length;
        // Count the number of encoded bytes...
        int count = 0;
        for (int i = 0; i < len; i++) {
            int ch = c[i];
            if (ch <= 0x7f) {
                count++;
            } else if (ch <= 0x7ff) {
                count += 2;
            } else {
                count += 3;
            }
        }
        // Now return the encoded bytes...
        byte[] b = new byte[count];
        int off = 0;
        for (int i = 0; i < len; i++) {
            int ch = c[i];
            if (ch <= 0x7f) {
                b[off++] = (byte)ch;
            } else if (ch <= 0x7ff) {
                b[off++] = (byte)((ch >> 6) | 0xc0);
                b[off++] = (byte)((ch & 0x3f) | 0x80);
            } else {
                b[off++] = (byte)((ch >> 12) | 0xe0);
                b[off++] = (byte)(((ch >> 6) & 0x3f) | 0x80);
                b[off++] = (byte)((ch & 0x3f) | 0x80);
            }
        }
        return b;
    }
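
For comparison, here is a minimal sketch of a surrogate-aware variant of the method above. It is only an illustration, not the actual or proposed JDK fix. The class name Utf8Sketch and the choice to replace an unpaired surrogate with '?' are assumptions of this sketch. It walks the string by code point, so a valid surrogate pair produces one four-byte sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Hypothetical surrogate-aware variant of getUTF8Bytes (illustration only).
    static byte[] getUTF8Bytes(String s) {
        int len = s.length();
        // Upper bound: at most 3 bytes per char (a pair is 2 chars -> 4 bytes).
        byte[] b = new byte[len * 3];
        int off = 0;
        for (int i = 0; i < len; ) {
            // codePointAt combines a valid surrogate pair into one code point.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp <= 0x7f) {
                b[off++] = (byte) cp;
            } else if (cp <= 0x7ff) {
                b[off++] = (byte) ((cp >> 6) | 0xc0);
                b[off++] = (byte) ((cp & 0x3f) | 0x80);
            } else if (cp >= 0xd800 && cp <= 0xdfff) {
                // Assumption of this sketch: an unpaired surrogate becomes '?'.
                b[off++] = (byte) '?';
            } else if (cp <= 0xffff) {
                b[off++] = (byte) ((cp >> 12) | 0xe0);
                b[off++] = (byte) (((cp >> 6) & 0x3f) | 0x80);
                b[off++] = (byte) ((cp & 0x3f) | 0x80);
            } else {
                // Supplementary character: four-byte sequence.
                b[off++] = (byte) ((cp >> 18) | 0xf0);
                b[off++] = (byte) (((cp >> 12) & 0x3f) | 0x80);
                b[off++] = (byte) (((cp >> 6) & 0x3f) | 0x80);
                b[off++] = (byte) ((cp & 0x3f) | 0x80);
            }
        }
        return Arrays.copyOf(b, off);
    }

    public static void main(String[] args) {
        // 'a', U+00E9, U+20AC, U+1F600 (written as a surrogate pair)
        String s = "a\u00e9\u20ac\uD83D\uDE00";
        byte[] expected = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(getUTF8Bytes(s), expected));
    }
}
```

Where it is available, s.getBytes(StandardCharsets.UTF_8) yields standard UTF-8 directly and would avoid a hand-written encoder altogether.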
-----------------------------------------------------------
Also, Norbert Lindenberg noted:
I did notice another thing that looks fishy:
src/share/native/java/util/zip/ZipFile.c has calls to the JNI routines
GetStringUTFLength and GetStringUTFRegion, apparently also to handle
file names. These are probably wrong, because JNI uses modified UTF-8
and zip/jar files should use standard UTF-8.
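
To illustrate the difference Norbert describes, here is a small sketch comparing the two encodings from Java. It uses DataOutputStream.writeUTF, which is documented to write modified UTF-8, as a stand-in for the JNI routines. The class and method names are made up for this example. Modified UTF-8 encodes a supplementary character as two three-byte surrogate sequences and U+0000 as the two bytes C0 80, so the byte lengths differ from standard UTF-8 for such names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Modified UTF-8 bytes of s as produced by DataOutputStream.writeUTF,
    // with the 2-byte length prefix that writeUTF emits stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bos.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8 bytes of s.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical entry name: U+1F600 followed by ".txt"
        String name = "\uD83D\uDE00.txt";
        System.out.println(modifiedUtf8(name).length); // 3 + 3 + 4 = 10
        System.out.println(standardUtf8(name).length); // 4 + 4 = 8
    }
}
```

For names made only of characters in the range U+0001 to U+FFFF the two encodings produce identical bytes, which is presumably why the mismatch is easy to miss.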