JDK-6551597 : FtpURLConnection fails on large files
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.net
  • Affected Version: 6
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: linux
  • CPU: x86
  • Submitted: 2007-04-30
  • Updated: 2011-02-16
  • Resolved: 2007-07-31
Related Reports
Duplicate :  
Description
FULL PRODUCT VERSION :
java version "1.5.0_11"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_11-b03, mixed mode)
and
java version "1.6.0"
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-b105, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
This behavior has been observed under:
 Suse Linux 10.1 x86_64
 Windows XP Business 32 bit
 Ubuntu 7.04 x86_64

EXTRA RELEVANT SYSTEM CONFIGURATION :
No relevant hardware pattern

A DESCRIPTION OF THE PROBLEM :
When attempting to read a large file, GZIPInputStream behaves improperly.  This behavior is observed when ingesting very large files, with compressed sizes of 10-40 GB and uncompressed sizes of 200-400 GB.  The second symptom (partial read) occurs on all tested JVMs back to 1.4.  The untrappable exception occurs only in versions 1.5 and above.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1) Open an InputStream to a large gzipped file.
2) Create a new GZIPInputStream on the above stream.
3) Attempt to read the full contents of the stream.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Ideally, GZIPInputStream would allow the entire file to be read without exception.
ACTUAL -
Two symptoms of this condition consistently occur:
1) An untrappable NumberFormatException is reported directly to STDERR.
2) Stream traversal completes before the real EOS is reached.

The exception is untrappable and does not interrupt execution.  The specific string referenced in the exception varies from file to file but is always consistent for a specific file.

The GZIPInputStream can still be read as normal and appears to read to the end of the stream without further exception.  However, the traversal finishes as if it had reached the end of the stream when in fact only a small portion of the stream has been read.  The number of bytes read before the stream reports completion varies from file to file, but is always the same for a given file.

ERROR MESSAGES/STACK TRACES THAT OCCUR :
java.lang.NumberFormatException: For input string: "15983638838"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
	at java.lang.Integer.parseInt(Integer.java:459)
	at java.lang.Integer.parseInt(Integer.java:497)
	at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:398)
	at com.weather.logs.Parser.main(Parser.java:25)
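
The parse failure in the trace can be reproduced in isolation. The following is a minimal sketch, not the JDK code: the class and method names are hypothetical, and only the input string is taken from the trace above. It shows that Integer.parseInt rejects any decimal string above Integer.MAX_VALUE (2147483647), while Long.parseLong accepts the reported value.

```java
final class ParseOverflowDemo {
    /** Returns true if Integer.parseInt rejects the given decimal string. */
    static boolean overflowsInt(String digits) {
        try {
            Integer.parseInt(digits);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String size = "15983638838"; // input string from the stack trace
        // Integer.parseInt throws NumberFormatException for this value.
        System.out.println("rejected by Integer.parseInt: " + overflowsInt(size));
        // Long.parseLong handles the same string without error.
        System.out.println("parsed as long: " + Long.parseLong(size));
    }
}
```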

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
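import java.io.BufferedInputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// ftp:// scheme, matching the FtpURLConnection frame in the stack trace
URL url = new URL("ftp://some.ftp.site/some/big/file.gz");
URLConnection urlc = url.openConnection();
BufferedInputStream bis = new BufferedInputStream(urlc.getInputStream());
GZIPInputStream gzipis = new GZIPInputStream(bis);

int len = 0;
long total = 0; // long: the uncompressed sizes here exceed Integer.MAX_VALUE
byte[] inBuff = new byte[256];
while ((len = gzipis.read(inBuff)) != -1) {
	total += len;
}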
---------- END SOURCE ----------

Comments
EVALUATION On further examination of the stack trace, I note that the exception is in the "net" code. It seems that the FTP client implementation does not support input streams larger than 2**31 - 1 bytes, the largest value Integer.parseInt accepts. An obvious fix is to use Long.parseLong instead of Integer.parseInt, but I don't know whether that will actually work.

It also appears from a superficial examination that the human-readable part of the FTP response is being parsed, contrary to RFC 959:

  "An FTP reply consists of a three digit number (transmitted as three alphanumeric characters) followed by some text. The number is intended for use by automata to determine what state to enter next; the text is intended for the human user. It is intended that the three digits contain enough encoded information that the user-process (the User-PI) will not need to examine the text and may either discard it or pass it on to the user, as appropriate. In particular, the text may be server-dependent, so there are likely to be varying texts for each reply code."

The suspicious line of code is here:

  if ((offset = response.indexOf(" bytes)")) != -1) {

Redispatching to classes_net, and changing description to "FtpURLConnection fails on large files".
08-06-2007
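
To illustrate the Long.parseLong suggestion above, here is a hedged sketch of extracting a size from the reply text. It is not the FtpURLConnection source: the class name, the method name, and the sample reply string are assumptions made for illustration. Only the " bytes)" marker comes from the line quoted above. Because RFC 959 allows the reply text to vary by server, the sketch treats the size as optional and returns -1 whenever it cannot be parsed.

```java
final class FtpReplySizeSketch {
    /**
     * Extracts N from reply text of the form "... (N bytes)".
     * Returns -1 if the marker is absent or N is not a valid long.
     */
    static long transferSize(String reply) {
        int end = reply.indexOf(" bytes)");
        if (end == -1) {
            return -1L; // server did not include a size hint
        }
        int start = reply.lastIndexOf('(', end);
        if (start == -1) {
            return -1L; // no opening parenthesis before the marker
        }
        try {
            // long rather than int, so sizes above 2**31 - 1 are accepted
            return Long.parseLong(reply.substring(start + 1, end));
        } catch (NumberFormatException e) {
            return -1L; // text between the parentheses was not a number
        }
    }

    public static void main(String[] args) {
        // Hypothetical reply text; real servers may word this differently.
        String reply = "150 Opening BINARY mode data connection for file.gz (15983638838 bytes)";
        System.out.println(transferSize(reply));
    }
}
```

Returning -1 instead of throwing keeps a malformed or missing size hint from affecting the data transfer itself, since the hint is advisory text rather than part of the reply code.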

EVALUATION It seems that large file support has only recently been added to (or only recently started working on) GNU gzip, starting with gzip 1.3. Here are some extracts from gzip's NEWS file:

  92:* Add support for large files, e.g. files larger than 2 GB on Solaris 2.6.
  93:* Adjust file size listing format for files larger than 10 GB.
  104: - files larger than 4 GB
30-04-2007