JDK-8323583 : Allow ZipInputStream.readEnd to parse small Zip64 ZIP files
  • Type: CSR
  • Component: core-libs
  • Sub-Component: java.util.jar
  • Priority: P4
  • Status: Proposed
  • Resolution: Unresolved
  • Fix Versions: 23
  • Submitted: 2024-01-11
  • Updated: 2024-01-15
Description
Summary
-------

Allow `java.util.zip.ZipInputStream` to parse entries using the Zip64 format where neither the compressed nor uncompressed file size exceeds the 4GB limit.

Problem
-------

The compressed and uncompressed size of a ZIP entry are often not known until all entry data has been written by the client.

If the producer cannot seek back in the ZIP stream to update the size fields in the local file (LOC) header, those fields are left as zero and the actual compressed and uncompressed file sizes are instead put in a 'Data Descriptor' record immediately following the file data.

If the entry uses the Zip64 format, then the 'compressed size' and 'uncompressed size' fields are instead set to the magic marker value 0xFFFFFFFF and a Zip64 extra field is added with the 'Original Size' and 'Compressed Size' both set to zero.
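
As an illustration of the layout just described, the following sketch builds the LOC header a streaming producer might emit for a Zip64 entry of unknown size. The class and method names are hypothetical, the entry name is a placeholder, and the version, flag and method values are assumptions taken from the ZIP format specification rather than from this CSR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK: builds a LOC header for a
// streamed Zip64 entry whose sizes are not yet known.
final class StreamingZip64LocSketch {

    static byte[] buildLoc(String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        int extraLen = 2 + 2 + 8 + 8;       // header ID, data size, two 8-byte sizes
        ByteBuffer buf = ByteBuffer.allocate(30 + nameBytes.length + extraLen)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0x04034b50);             // LOC signature
        buf.putShort((short) 45);           // version needed to extract (4.5, Zip64)
        buf.putShort((short) 0x0008);       // flag bit 3: a Data Descriptor follows the data
        buf.putShort((short) 8);            // compression method: DEFLATE
        buf.putShort((short) 0);            // last modified time (placeholder)
        buf.putShort((short) 0);            // last modified date (placeholder)
        buf.putInt(0);                      // CRC-32: unknown while streaming
        buf.putInt(0xFFFFFFFF);             // compressed size: Zip64 magic marker
        buf.putInt(0xFFFFFFFF);             // uncompressed size: Zip64 magic marker
        buf.putShort((short) nameBytes.length);
        buf.putShort((short) extraLen);
        buf.put(nameBytes);
        buf.putShort((short) 0x0001);       // Zip64 extended information header ID
        buf.putShort((short) 16);           // size of the extra field data
        buf.putLong(0L);                    // 'Original Size': zero, not yet known
        buf.putLong(0L);                    // 'Compressed Size': zero, not yet known
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] loc = buildLoc("-");
        System.out.println("LOC header length: " + loc.length);
    }
}
```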

The 'Data Descriptor' record normally encodes size fields using 4-byte numbers. However, 8-byte numbers should be used instead when either the compressed or uncompressed size exceeds 4GB, or if the entry uses the Zip64 format:

```
4.3.9.2 When compressing files, compressed and uncompressed sizes 
      SHOULD be stored in ZIP64 format (as 8 byte values) when a 
      file's size exceeds 0xFFFFFFFF.   However ZIP64 format MAY be 
      used regardless of the size of a file.  When extracting, if 
      the zip64 extended information extra field is present for 
      the file the compressed and uncompressed sizes will be 8
      byte values.  
```

ZipInputStream currently relies solely on the size information acquired from the Inflater when deciding how to parse the Data Descriptor record. The LOC is not consulted to see if the entry uses the Zip64 format.

If an entry does use the Zip64 format, but neither the compressed nor the uncompressed size exceeds 4GB, then ZipInputStream currently fails to parse the Data Descriptor correctly and a ZipException is thrown instead:

```
java.util.zip.ZipException: invalid entry size (expected 0 but got 6 bytes)
	at java.base/java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:616)
```
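
The mismatch can be illustrated outside the JDK with a minimal sketch. It parses the same Data Descriptor bytes once with 8-byte and once with 4-byte size fields. The class, record and method names are hypothetical, and the CRC and compressed size are placeholder values; only the uncompressed size of 6 comes from the message above. The sketch also assumes the optional descriptor signature never collides with a CRC value, which a real parser cannot assume.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration, not the ZipInputStream implementation.
final class DataDescriptorSketch {
    static final int EXTSIG = 0x08074b50;   // optional Data Descriptor signature

    record DataDescriptor(long crc, long compressedSize, long uncompressedSize) {}

    // Parses a Data Descriptor using 8-byte sizes when zip64 is true, else 4-byte sizes.
    static DataDescriptor parse(byte[] bytes, boolean zip64) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int first = buf.getInt();
        // The signature is optional; if it is absent, the first field is the CRC.
        long crc = (first == EXTSIG ? buf.getInt() : first) & 0xFFFFFFFFL;
        long csize = zip64 ? buf.getLong() : (buf.getInt() & 0xFFFFFFFFL);
        long size = zip64 ? buf.getLong() : (buf.getInt() & 0xFFFFFFFFL);
        return new DataDescriptor(crc, csize, size);
    }

    // A Zip64 descriptor for a 6-byte entry; CRC and compressed size are placeholders.
    static byte[] sampleZip64Descriptor() {
        return ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(EXTSIG)
                .putInt(0x12345678)         // placeholder CRC-32
                .putLong(8L)                // compressed size (assumed)
                .putLong(6L)                // uncompressed size
                .array();
    }

    public static void main(String[] args) {
        byte[] dd = sampleZip64Descriptor();
        System.out.println("8-byte parse: " + parse(dd, true));
        // With 4-byte fields, the upper half of the 8-byte compressed size is
        // misread as the uncompressed size, which therefore comes out as 0.
        System.out.println("4-byte parse: " + parse(dd, false));
    }
}
```

Read with 4-byte fields, the descriptor reports an uncompressed size of 0 while 6 bytes were actually inflated, which is consistent with the "expected 0 but got 6 bytes" message.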

While ZipOutputStream does not use the Zip64 format when writing entries of an unknown size, other tools do produce such files, including Info-ZIP used in streaming mode:

```
echo hello | zip -fd > hello.zip
```

It would be useful to update ZipInputStream to allow parsing such valid ZIP files. Supporting these files could benefit OpenJDK testing as well, which currently relies on producing very large files to test Zip64.

Solution
--------

The solution is to update ZipInputStream such that it not only consults the number of compressed and uncompressed bytes read by the Inflater, but also inspects the LOC header to determine if it uses the Zip64 format. When an entry uses Zip64, `ZipInputStream.readEnd` should parse the Data Descriptor using 8-byte numbers instead of the usual 4-byte numbers.

`ZipInputStream.readLOC` is a good decision point for determining whether to expect 4- or 8-byte numbers. This method has full access to the LOC header fields including the extra field where any Zip64 field is located.

ZipInputStream is updated as follows:

- A new boolean internal flag `ZipInputStream.expect64BitDataDescriptor` is added. The purpose of this field is to communicate the number format determined by `readLOC` to the `readEnd` method which is responsible for the actual parsing of the Data Descriptor record.
- `readLOC` is updated to inspect the LOC and set `expect64BitDataDescriptor` to true if the LOC uses the Zip64 format; that is, if the compressed and uncompressed size fields are both 0xFFFFFFFF and the extra field contains a valid Zip64 extra field. To reduce changes in `readLOC`, this logic is mostly implemented in the new support methods `expect64BitDataDescriptor` and `isZip64DataDescriptorField`.
- `readEnd` is updated to read 8-byte fields when the `expect64BitDataDescriptor` flag is true.
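
The decision described in the list above might look roughly like the following sketch. It is not the proposed JDK code: the class name, the helper `hasZip64ExtraField`, and the method signatures are hypothetical, and treating a Zip64 extra field as valid when it carries at least 16 bytes of data (both 8-byte sizes) is an assumption made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the Zip64 detection; not the ZipInputStream implementation.
final class Zip64LocSketch {
    static final long ZIP64_MAGICVAL = 0xFFFFFFFFL; // size marker in the LOC
    static final int ZIP64_EXTID = 0x0001;          // Zip64 extra field header ID

    // Scans the LOC extra data for a Zip64 field holding both 8-byte sizes.
    static boolean hasZip64ExtraField(byte[] extra) {
        if (extra == null) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xFFFF;
            int len = buf.getShort() & 0xFFFF;
            if (len > buf.remaining()) {
                return false;                       // truncated field: treat as invalid
            }
            if (id == ZIP64_EXTID) {
                return len >= 16;                   // assumption: both sizes must be present
            }
            buf.position(buf.position() + len);     // skip an unrelated extra field
        }
        return false;
    }

    // True when the Data Descriptor should be read using 8-byte size fields.
    static boolean expect64BitDataDescriptor(long csize, long size, byte[] extra) {
        return csize == ZIP64_MAGICVAL
                && size == ZIP64_MAGICVAL
                && hasZip64ExtraField(extra);
    }

    public static void main(String[] args) {
        byte[] zip64Extra = ByteBuffer.allocate(20).order(ByteOrder.LITTLE_ENDIAN)
                .putShort((short) ZIP64_EXTID).putShort((short) 16)
                .putLong(0L).putLong(0L).array();
        System.out.println(expect64BitDataDescriptor(ZIP64_MAGICVAL, ZIP64_MAGICVAL, zip64Extra));
        System.out.println(expect64BitDataDescriptor(0L, 0L, zip64Extra));
    }
}
```

`readEnd` would then only need to consult the stored boolean to choose between the two descriptor layouts.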
 

Specification
-------------

The specification is not changed; this is purely an implementation and behavioral change.

Comments
Moving to Proposed to indicate CSR review is requested.
12-01-2024