Bug ID: JDK-7073588 ZipInput/OutputStream handling of the data descriptor is wrong for big entries

Type: Bug
Component: core-libs
Sub-Component: java.util.jar
Affected Version: 7

Priority: P3
Status: Closed
Resolution: Not an Issue
OS: windows_7
CPU: x86

Submitted: 2011-08-01
Updated: 2023-10-28
Resolved: 2011-08-10

Related Reports

Relates :

JDK-8303866 - Allow ZipInputStream.readEnd to parse small Zip64 ZIP files

Description

FULL PRODUCT VERSION :
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
I was forced to choose an OS but the bug is platform independent.

A DESCRIPTION OF THE PROBLEM :
The "spec" for ZIP files http://www.pkware.com/documents/casestudies/APPNOTE.TXT says in the section about data descriptors:

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

This means the sizes are eight bytes if there is a ZIP64 extended information extra field and four bytes if there is none.  This is not what java.util.zip implements: ZipOutputStream#writeEXT in http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/share/classes/java/util/zip/ZipOutputStream.java writes eight-byte data if one of the sizes exceeds 0xFFFFFFFF, but it never writes any ZIP64 extended information extra field.  This means conforming implementations will "think" the sizes are four bytes while in fact they are eight bytes.

Likewise, ZipInputStream#readEnd always assumes the sizes are eight bytes if the Inflater has seen more than 0xFFFFFFFF bytes and four bytes otherwise.  This will lead to reading too few bytes if the ZIP64 extended information extra field is present but the sizes are smaller than 2^32.
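
To make the two rules concrete, here is a minimal sketch that contrasts them. It is not the JDK's code: the class and method names (Zip64ExtraCheck, hasZip64Extra, sizeWidthPerAppnote, sizeWidthPerByteCount) are hypothetical and exist only for illustration. The sketch assumes the extra field is the usual sequence of (2-byte header ID, 2-byte data size, data) blocks in little-endian order, with header ID 0x0001 marking the ZIP64 extended information extra field.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Zip64ExtraCheck {
    // Header ID of the ZIP64 extended information extra field.
    static final int ZIP64_HEADER_ID = 0x0001;

    // Walks the extra field blocks and reports whether a ZIP64 block is present.
    static boolean hasZip64Extra(byte[] extra) {
        if (extra == null) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xFFFF;
            int len = buf.getShort() & 0xFFFF;
            if (id == ZIP64_HEADER_ID) {
                return true;
            }
            if (len > buf.remaining()) {
                return false; // truncated block, stop scanning
            }
            buf.position(buf.position() + len);
        }
        return false;
    }

    // Rule as this report reads the APPNOTE: width follows the extra field.
    static int sizeWidthPerAppnote(byte[] extra) {
        return hasZip64Extra(extra) ? 8 : 4;
    }

    // Rule as this report describes java.util.zip: width follows the byte counts.
    static int sizeWidthPerByteCount(long bytesIn, long bytesOut) {
        return (bytesIn > 0xFFFFFFFFL || bytesOut > 0xFFFFFFFFL) ? 8 : 4;
    }
}
```

For a small entry that carries a ZIP64 extra field, the first rule yields 8 and the second yields 4, which is the mismatch described above.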

I stumbled over this while implementing ZIP64 support for Apache Commons Compress, using Java 7's jar as one of my interop partners.

I realize there is a difficult choice to be taken when writing to a stream - which as of this report hasn't been implemented for entries of unknown size in Apache Commons Compress - as you either have to always add the ZIP64 field or never if you don't know how much you are going to write.  At least for entries of known size - like the files the jar tool adds to the archive - you should be able to not use the data descriptor at all, though.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Pick a file bigger than 4GB and create a jar file from it.  There won't be any ZIP64 extended information extra field, but the data descriptor uses eight-byte sizes.  One example is the file 5GB_of_Zeros_jar.zip attached to https://issues.apache.org/jira/browse/COMPRESS-36



EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
There must be a ZIP64 extended information extra field whenever you write a data descriptor with eight byte sizes.

When reading, always assume the sizes are eight bytes if there is a ZIP64 extended information extra field, and never assume they are eight bytes if there is none.
ACTUAL -
Sizes depend only on the number of bytes the deflater/inflater has processed/written.

REPRODUCIBILITY :
This bug can be reproduced always.

CUSTOMER SUBMITTED WORKAROUND :
Tools processing the archive may be able to work around the problem if they don't need the size information at all.  I'm also thinking of adding some heuristics along the lines of "if the Inflater has seen more than 0xFFFFFFFF bytes, then the sizes are probably eight bytes even if there is no ZIP64 extra field - let's look whether there is a usable signature after eight/sixteen bytes", but this is clumsy.
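
A rough sketch of such a signature probe follows. The class and method names (DescriptorProbe, guessSizeWidth) are hypothetical, and the sketch assumes the bytes following the compressed data are already in memory. It skips the optional data descriptor signature (0x08074b50) and the CRC, then looks for a local file header (0x04034b50) or central directory header (0x02014b50) signature after eight or sixteen bytes of size fields. It cannot rule out a size value that happens to equal a signature, so treat the result as a guess.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DescriptorProbe {
    static final int LOCSIG = 0x04034b50; // local file header
    static final int CENSIG = 0x02014b50; // central directory file header
    static final int EXTSIG = 0x08074b50; // optional data descriptor signature

    // Returns 4 or 8 as the guessed width of each size field, or -1 if no
    // usable signature follows either layout.
    static int guessSizeWidth(byte[] data, int descriptorStart) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int pos = descriptorStart;
        if (pos + 4 <= data.length && buf.getInt(pos) == EXTSIG) {
            pos += 4; // skip the optional descriptor signature
        }
        pos += 4; // skip the CRC-32
        for (int width : new int[] {4, 8}) {
            int next = pos + 2 * width; // compressed size + uncompressed size
            if (next + 4 <= data.length) {
                int sig = buf.getInt(next);
                if (sig == LOCSIG || sig == CENSIG) {
                    return width;
                }
            }
        }
        return -1;
    }
}
```

The four-byte layout is tried first here; a caller that knows the Inflater has processed more than 0xFFFFFFFF bytes may prefer to try the eight-byte layout first.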

I don't see any workaround when ZipInputStream tries to read a perfectly valid ZIP file that sets a ZIP64 extra field with a size smaller than 0xFFFFFFFF - when reading the data descriptor it will read eight bytes too few and not be positioned at the next LFH or the central directory.

Comments

EVALUATION

The "clarification" from PKWare:

--------------------------------------------------------
Thank you for your interest in the ZIP format. I reviewed the APPNOTE and I believe the documentation should be updated for more clarity on this. I will log a ticket to get further clarification on this record into a future version of the APPNOTE.

To address your question, you would not use the Data Descriptor (presence is signaled using bit 3) at the same time as the ZIP64 Extended Information Extra Field (which uses the 0xFFFFFFFF value and "Extra Field" 0x0001). When using the Data Descriptor, the values would be written as ZERO. When alternatively, the ZIP64 extended information extra field is used, the values should be 0xFFFFFFFF.

I hope this helps with your understanding. Please let me know if there is any additional information I can provide to you on this topic.
---------------------------------------------------------

It appears the suggestion is to not have both Data Descriptor and ZIP64 extended Information Extra Field at the "same time". And our implementation is doing exactly that. Closed this one as "not a defect" for now.

10-08-2011
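
As an illustration of the PKWare reply above, the following sketch classifies where a reader would look for the sizes of an entry, given only the local header's general purpose flags and its two size fields. The names (LocSizeSource, Source, sizeSource) are hypothetical and this is not how ZipInputStream is implemented; it simply encodes the stated rule that bit 3 and the ZIP64 extra field are not used at the same time.

```java
public class LocSizeSource {
    enum Source { LOCAL_HEADER, DATA_DESCRIPTOR, ZIP64_EXTRA }

    static final int FLAG_DATA_DESCRIPTOR = 1 << 3; // general purpose bit 3
    static final long ZIP64_MAGIC = 0xFFFFFFFFL;    // "see the ZIP64 extra field"

    // flags: general purpose bit flag from the local header.
    // locCsize, locSize: the 4-byte size fields, read as unsigned values.
    static Source sizeSource(int flags, long locCsize, long locSize) {
        if ((flags & FLAG_DATA_DESCRIPTOR) != 0) {
            // Sizes in the local header are written as zero; the real
            // values follow the entry data in the data descriptor.
            return Source.DATA_DESCRIPTOR;
        }
        if (locCsize == ZIP64_MAGIC || locSize == ZIP64_MAGIC) {
            // Real values live in the ZIP64 extended information extra field.
            return Source.ZIP64_EXTRA;
        }
        return Source.LOCAL_HEADER;
    }
}
```

Under this reading, a local header with bit 3 set never sends the reader to the ZIP64 extra field, which is why the width of the data descriptor's size fields has to be inferred some other way.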

EVALUATION

The "compressed/uncompressed size" part of the loc spec states:

      If bit 3 of the general purpose bit flag is set, these fields
      are set to zero in the local header and the correct values are
      put in the data descriptor and in the central directory. If an
      archive is in ZIP64 format and the value in this field is
      0xFFFFFFFF, the size will be in the corresponding 8 byte ZIP64
      extended information extra field.

and the ZIP64 Extended Information Extra Field (0x0001) spec says:

      The following is the layout of the zip64 extended information
      "extra" block. If one of the size or offset fields in the Local
      or Central directory record is too small to hold the required
      data, a Zip64 extended information record is created. The order
      of the fields in the zip64 extended information record is fixed,
      but the fields will only appear if the corresponding Local or
      Central directory record field is set to 0xFFFF or 0xFFFFFFFF.
      ...
      This entry in the Local header must include BOTH original and
      compressed file size fields.

The above spec appears to say three things here:

(1) If the loc size and csize are to be stored in the data descriptor (when the general purpose flag bit 3 is set), these fields are set to ZERO.

(2) If this archive is in ZIP64 format (what does this really mean? one possible interpretation is that a ZIP64 extension appears in the "extra field" of this loc) AND these 2 fields are 0xFFFFFFFF, then the corresponding size/csize can be found in the ZIP64 extension in the extra field.

(3) In order to have size and csize appear in the ZIP64 extended info extra field, their corresponding fields in loc MUST be 0xffffffff. Since the csize/size MUST be present in loc's ZIP64 extra field, the size/csize fields in this loc MUST be 0xffffffff.

Here is the problem: if bit 3 of the general purpose flag is set, and therefore the size and csize fields in loc MUST be ZERO, then (3) can NOT be true.

And from an implementation point of view, the reason why we have the "data descriptor" is mostly because you don't know the value of size and csize yet when writing the loc (such as in the streaming case). It really does not make sense to have a zip64 extended info extra field as well, which is part of the loc, when you still don't know the size/csize values when writing it.

That said, this obviously contradicts what is specified in the extracting part of the "data descriptor" spec, as quoted:

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

which says you CAN have a ZIP64 extended info extra field in a loc (size & csize are 0xffffffff), even if bit 3 of the general flag is set (size & csize are 0).

Based on the above, the only thing the implementation can do is to be liberal when reading the "data descriptor" if there is a "zip64 extended extra field" present, even when bit 3 of the general flag is set. The implementation can do either of the following:

(1) If the actual size/csize of the inflated data is > 32-bit, assume the "data descriptor" data are 64-bit (what the ZipInputStream is doing now), or

(2) If a zip64 extended extra field is present in loc (regardless of what is actually stored in the loc's size/csize, be it ZERO or 0xffffffff, and regardless of what is stored in the Zip64 extended extra field), assume the "data descriptor" data are 64-bit and use it to validate the resulting inflated data.

01-08-2011