JDK-4980042 : Cannot use Surrogates in zip file metadata like filenames
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.jar
  • Affected Version: 5.0
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2004-01-19
  • Updated: 2009-04-25
  • Resolved: 2009-04-25
Fix Version: JDK 7 (b57)
Description
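java/util/zip/ZipOutputStream.java has a hand-written UTF-8 encoder that
does not take surrogates into account. Each char is encoded on its own, so
a surrogate pair is written as two 3-byte sequences rather than one
4-byte sequence: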

private static byte[] getUTF8Bytes(String s) {
    char[] c = s.toCharArray();
    int len = c.length;
    // Count the number of encoded bytes...
    int count = 0;
    for (int i = 0; i < len; i++) {
        int ch = c[i];
        if (ch <= 0x7f) {
            count++;
        } else if (ch <= 0x7ff) {
            count += 2;
        } else {
            count += 3;
        }
    }
    // Now return the encoded bytes...
    byte[] b = new byte[count];
    int off = 0;
    for (int i = 0; i < len; i++) {
        int ch = c[i];
        if (ch <= 0x7f) {
            b[off++] = (byte)ch;
        } else if (ch <= 0x7ff) {
            b[off++] = (byte)((ch >> 6) | 0xc0);
            b[off++] = (byte)((ch & 0x3f) | 0x80);
        } else {
            b[off++] = (byte)((ch >> 12) | 0xe0);
            b[off++] = (byte)(((ch >> 6) & 0x3f) | 0x80);
            b[off++] = (byte)((ch & 0x3f) | 0x80);
        }
    }
    return b;
}
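
To make the difference concrete, the sketch below compares a simplified copy of the per-char logic quoted above with the JDK's standard UTF-8 charset for a name containing U+1D50A. The class name SurrogateUtf8Demo and the helper perCharEncode are made up for this illustration and are not part of the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not JDK code.
final class SurrogateUtf8Demo {

    // Simplified copy of the per-char logic: every UTF-16 code unit is
    // encoded separately, so a surrogate pair is never combined.
    static byte[] perCharEncode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int ch = s.charAt(i);
            if (ch <= 0x7f) {
                out.write(ch);
            } else if (ch <= 0x7ff) {
                out.write((ch >> 6) | 0xc0);
                out.write((ch & 0x3f) | 0x80);
            } else {
                out.write((ch >> 12) | 0xe0);
                out.write(((ch >> 6) & 0x3f) | 0x80);
                out.write((ch & 0x3f) | 0x80);
            }
        }
        return out.toByteArray();
    }

    // Renders bytes as space-separated hex for printing.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D50A written as its surrogate pair D835 DD0A.
        String name = "\uD835\uDD0A";
        // Two 3-byte sequences, one per surrogate: ED A0 B5 ED B4 8A
        System.out.println("per-char: " + hex(perCharEncode(name)));
        // One 4-byte sequence for the code point: F0 9D 94 8A
        System.out.println("standard: " + hex(name.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For names made only of BMP characters the two encoders agree, which is why the problem shows up only with supplementary characters.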
-----------------------------------------------------------
Also, Norbert Lindenberg noted:

I did notice another thing that looks fishy: 
src/share/native/java/util/zip/ZipFile.c has calls to the JNI routines 
GetStringUTFLength and GetStringUTFRegion, apparently also to handle 
file names. These are probably wrong, because JNI uses modified UTF-8 
and zip/jar files should use standard UTF-8.
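
The gap between modified and standard UTF-8 can be seen from pure Java as well. The sketch below uses DataOutputStream.writeUTF, which is documented to write modified UTF-8, as a stand-in for the JNI routines; it does not call GetStringUTFLength or GetStringUTFRegion itself. The class and method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not JDK code.
final class ModifiedUtf8Demo {

    // Encodes s as modified UTF-8 via DataOutputStream.writeUTF and
    // strips the 2-byte length prefix that writeUTF puts in front.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encodes s with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String supplementary = "\uD835\uDD0A"; // U+1D50A
        String nul = "\u0000";

        // Supplementary character: 6 bytes in modified UTF-8, 4 in standard.
        System.out.println(modifiedUtf8(supplementary).length
                + " vs " + standardUtf8(supplementary).length);
        // U+0000: 2 bytes (C0 80) in modified UTF-8, 1 byte in standard.
        System.out.println(modifiedUtf8(nul).length
                + " vs " + standardUtf8(nul).length);
    }
}
```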

Comments
EVALUATION We go with the standard UTF-8 charset starting with JDK 7.
16-04-2009
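
As an illustration of what this enables, the sketch below round-trips an entry name containing a supplementary character through the ZipOutputStream and ZipInputStream constructors that take a Charset (both added in JDK 7). The class name ZipNameRoundTrip and the method roundTrip are made up for this example, and the code assumes JDK 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not JDK code.
final class ZipNameRoundTrip {

    // Writes a one-entry zip archive in memory and returns the entry
    // name as read back from that archive.
    static String roundTrip(String entryName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ZipOutputStream zos =
                     new ZipOutputStream(buf, StandardCharsets.UTF_8)) {
                zos.putNextEntry(new ZipEntry(entryName));
                zos.write(1); // a single byte of payload
                zos.closeEntry();
            }
            try (ZipInputStream zis = new ZipInputStream(
                     new ByteArrayInputStream(buf.toByteArray()),
                     StandardCharsets.UTF_8)) {
                ZipEntry entry = zis.getNextEntry();
                return entry.getName();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String name = "\uD835\uDD0A.txt"; // U+1D50A followed by ".txt"
        System.out.println(name.equals(roundTrip(name)));
    }
}
```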

EVALUATION Changing the current encoding to support surrogates could result in creating JAR files that are incompatible, i.e. cannot be read, by previous Java releases. This incompatibility is not acceptable. In fixing 4244499 though, there is a reasonable chance that support can be provided for the current implementation as well as standard UTF-8.
09-04-2008

EVALUATION The jar/zip code should probably use the standard mechanisms for encoding/decoding UTF-8 instead of doing it by hand.
18-01-2004