JDK-8252739 : Deflater.setDictionary(byte[], int off, int len) ignores the starting offset for the dictionary
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.jar
  • Affected Version: 11.0.5,13,14,15
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: generic
  • Submitted: 2020-09-03
  • Updated: 2022-06-24
  • Resolved: 2020-09-23
JDK 16: 16 b18, Fixed
Description
Apache Lucene recently changed its master branch to use Inflater/Deflater's ability to accept a custom dictionary. The code worked, but nightly testing has shown that under certain circumstances the compression does not work.

Background information: Lucene's index file formats are in most cases handled through the Lucene class BytesRef, which acts like a pointer into a byte array that contains much more data than actually needed, using an offset and a length (a buffer is loaded from disk and then a BytesRef is used to point to a slice). Depending on the indexing process, the dictionary is sometimes not at the beginning of the underlying byte[], so Lucene uses Deflater#setDictionary(byte[], int ofs, int len), passing the slice of the much bigger byte array.

The bug happens if: ofs > 0

Checking the source code of Deflater and the JNI C code shows that the offset is passed down to the JNI code, but the implementation completely ignores the ofs parameter: [https://github.com/openjdk/jdk/blob/1643bc3defa241aef2cad53d0f11076366c3620d/src/java.base/share/native/libzip/Deflater.c#L100-L111]

We have a simple test case that shows the bug, see attached files.

WORKAROUND: Create a copy of the byte array slice.

Code that does not work:
deflater.setDictionary(data, DICT_OFFSET, DICT_LENGTH);

Code that works:
deflater.setDictionary(Arrays.copyOfRange(data, DICT_OFFSET, DICT_OFFSET + DICT_LENGTH));
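
The difference between the two calls above can be made visible with a small self-contained sketch. This is not the attached test case: the class name, the helper names, the sample strings and the DICT_OFFSET value are all hypothetical. It compresses the same input once with the dictionary passed as a slice and once with a copy of that slice; if the offset is honored, both outputs should be identical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

final class DictionaryOffsetCheck {
    // Hypothetical layout: 16 filler bytes, then the real dictionary.
    static final int DICT_OFFSET = 16;

    /** Deflates input using buf[off, off + len) as the preset dictionary. */
    static byte[] deflate(byte[] input, byte[] buf, int off, int len) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setDictionary(buf, off, len);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            while (!deflater.finished()) {
                out.write(chunk, 0, deflater.deflate(chunk));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Returns true if the slice variant and the copied-slice variant disagree. */
    static boolean offsetIsIgnored() {
        byte[] dict = "dictionary words shared with the payload"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[DICT_OFFSET + dict.length];
        Arrays.fill(data, 0, DICT_OFFSET, (byte) '#');
        System.arraycopy(dict, 0, data, DICT_OFFSET, dict.length);

        // Variant reported as broken: dictionary passed as a slice with ofs > 0.
        byte[] viaOffset = deflate(dict, data, DICT_OFFSET, dict.length);
        // Workaround variant: the slice is copied so that the offset is 0.
        byte[] copy = Arrays.copyOfRange(data, DICT_OFFSET, DICT_OFFSET + dict.length);
        byte[] viaCopy = deflate(dict, copy, 0, dict.length);
        return !Arrays.equals(viaOffset, viaCopy);
    }

    public static void main(String[] args) {
        System.out.println("setDictionary ignores offset: " + offsetIsIgnored());
    }
}
```

Comparing the two compressed outputs is only one possible way to detect the problem; decompressing with the correct dictionary and checking for a failure would be another.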

At Lucene we will use the workaround for the time being, but the code should really be fixed, as it may cause index corruption. Luckily we have not deployed the code to our users yet.

We also checked whether the Deflater#setDictionary(ByteBuffer) method could serve as a workaround (by passing a ByteBuffer wrapping the byte array slice). After reading the source code, however, we found that the Java part checks for direct buffers and only then passes to the non-buggy version, which takes a native address/ByteBuffer. If the ByteBuffer is a heap ByteBuffer, it calls the buggy method, which ignores the offset, too.

So the bug affects the following methods:
- Deflater.setDictionary(byte[], int ofs, int len) (if ofs > 0)
- Deflater.setDictionary(ByteBuffer) (if ByteBuffer is a Heap-ByteBuffer and arrayOffset()/position()!=0)

The methods in Inflater seem correct.

Other investigations:
JDK 8 seems correct: [http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/45506343cb65/src/share/native/java/util/zip/Deflater.c#l116]

Thanks to Robert Muir for investigating the issue and finding the ignored offset parameter, and to Adrien Grand (both Lucene) for finding the issue.
Comments
Hi,
I can confirm the bug is fixed in JDK 16 build 18:
$ gradlew :lucene:core:test -Ptests.verbose=true --tests "TestLucene87StoredFieldsFormatHighCompression" | grep buggy
  2> JDK is buggy (JDK-8252739): false
This prints true with previous versions, including JDK 11.0.8.
Is there any way to backport the fix for this regression, which was introduced in a minor JDK 11 update, to JDK 11 as well?
Uwe
03-10-2020

May I request a backport to the LTS release 11? The bug was introduced there, so it is a regression and leads to strange effects for people who previously used 11 and update to 11.0.5 or later.
23-09-2020

Changeset: 812b39f5
Author: Lance Andersen <lancea@openjdk.org>
Date: 2020-09-23 14:21:45 +0000
URL: https://git.openjdk.java.net/jdk/commit/812b39f5
23-09-2020

Thank you for the reduced test case, as that will be much better than using test_data.txt given its size.
03-09-2020

Hi, we implemented a dynamic workaround in Apache Lucene:
- First we detect if the bug is there, using a much simpler reproducer, see here: [https://github.com/apache/lucene-solr/blob/99df3814abff7c40e80530edc90a2f008a3b92b5/lucene/core/src/java/org/apache/lucene/codecs/lucene87/BugfixDeflater_JDK8252739.java#L67-L110]. If you want to have a test, maybe that's better than the attached Java file with the huge data file!
- If the bug is there, we use a subclass of Deflater that uses a scratch byte[] to copy the data into if offset > 0. The whole workaround class is here: [https://github.com/apache/lucene-solr/blob/99df3814abff7c40e80530edc90a2f008a3b92b5/lucene/core/src/java/org/apache/lucene/codecs/lucene87/BugfixDeflater_JDK8252739.java]
03-09-2020
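
The subclass approach described in the comment above could look roughly like the following. This is a minimal sketch and not Lucene's actual class: the name OffsetCopyingDeflater and the roundTrip helper are hypothetical, and the sketch assumes Java 9 or later for Objects.checkFromIndexSize. It only covers the byte[] overload; a heap ByteBuffer passed to setDictionary(ByteBuffer) would not go through this override.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Objects;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical workaround class: the native code only ever sees offset 0. */
final class OffsetCopyingDeflater extends Deflater {
    private byte[] scratch = new byte[0];

    OffsetCopyingDeflater(int level) {
        super(level);
    }

    @Override
    public void setDictionary(byte[] dictionary, int off, int len) {
        if (off > 0) {
            Objects.checkFromIndexSize(off, len, dictionary.length);
            if (scratch.length < len) {
                scratch = new byte[len]; // grow the reusable scratch buffer
            }
            System.arraycopy(dictionary, off, scratch, 0, len);
            super.setDictionary(scratch, 0, len);
        } else {
            super.setDictionary(dictionary, off, len);
        }
    }

    /** Compresses with the slice buf[off, off + len) as dictionary, then inflates again. */
    static byte[] roundTrip(byte[] input, byte[] buf, int off, int len) {
        Deflater deflater = new OffsetCopyingDeflater(Deflater.DEFAULT_COMPRESSION);
        Inflater inflater = new Inflater();
        try {
            deflater.setDictionary(buf, off, len);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            while (!deflater.finished()) {
                compressed.write(chunk, 0, deflater.deflate(chunk));
            }

            inflater.setInput(compressed.toByteArray());
            byte[] restored = new byte[input.length];
            int n = 0;
            while (!inflater.finished()) {
                int r = inflater.inflate(restored, n, restored.length - n);
                if (r > 0) {
                    n += r;
                } else if (inflater.needsDictionary()) {
                    inflater.setDictionary(Arrays.copyOfRange(buf, off, off + len));
                } else {
                    break; // no progress possible (output buffer is full)
                }
            }
            return Arrays.copyOf(restored, n);
        } catch (DataFormatException e) {
            throw new IllegalStateException("round trip failed", e);
        } finally {
            deflater.end();
            inflater.end();
        }
    }
}
```

Reusing the scratch array is safe here on the assumption that the dictionary bytes are consumed during the setDictionary call, which is how zlib's preset dictionary handling is documented.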

There should be a test added when fixing this; there are no tests with dictionaries and offset != 0. The test should cover all variants:
- byte[]
- ByteBuffer.wrap(byte[])
- ByteBuffer.allocateDirect(...)
03-09-2020
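
A test covering the three variants listed in the comment above might be structured along these lines. This is only a sketch, not the test that was added with the fix: the class name, the helper names and the sample data are hypothetical, and it needs Java 11 or later for Deflater.setDictionary(ByteBuffer). Each variant passes the dictionary with a non-zero offset or position, and the result is compared against a reference compressed with the dictionary at offset 0.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

final class DictionaryVariants {
    /** Compresses input with a deflater whose dictionary is already set. */
    private static byte[] compress(Deflater deflater, byte[] input) {
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            while (!deflater.finished()) {
                out.write(chunk, 0, deflater.deflate(chunk));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Variant 1: byte[] with offset and length. */
    static byte[] viaArray(byte[] input, byte[] buf, int off, int len) {
        Deflater deflater = new Deflater();
        deflater.setDictionary(buf, off, len);
        return compress(deflater, input);
    }

    /** Variant 2: heap ByteBuffer positioned at the offset. */
    static byte[] viaHeapBuffer(byte[] input, byte[] buf, int off, int len) {
        Deflater deflater = new Deflater();
        deflater.setDictionary(ByteBuffer.wrap(buf, off, len));
        return compress(deflater, input);
    }

    /** Variant 3: direct ByteBuffer positioned at the offset. */
    static byte[] viaDirectBuffer(byte[] input, byte[] buf, int off, int len) {
        ByteBuffer direct = ByteBuffer.allocateDirect(off + len);
        direct.put(buf, 0, off + len);
        direct.position(off); // limit stays at off + len, so len bytes remain
        Deflater deflater = new Deflater();
        deflater.setDictionary(direct);
        return compress(deflater, input);
    }

    public static void main(String[] args) {
        byte[] dict = "preset dictionary used by every variant"
                .getBytes(StandardCharsets.US_ASCII);
        int off = 32; // hypothetical offset; any value > 0 exercises the bug
        byte[] buf = new byte[off + dict.length];
        Arrays.fill(buf, 0, off, (byte) '#');
        System.arraycopy(dict, 0, buf, off, dict.length);

        byte[] expected = viaArray(dict, dict, 0, dict.length);
        System.out.println("byte[] matches:            "
                + Arrays.equals(expected, viaArray(dict, buf, off, dict.length)));
        System.out.println("ByteBuffer.wrap matches:   "
                + Arrays.equals(expected, viaHeapBuffer(dict, buf, off, dict.length)));
        System.out.println("allocateDirect matches:    "
                + Arrays.equals(expected, viaDirectBuffer(dict, buf, off, dict.length)));
    }
}
```

Based on the description in this report, the first two lines would be expected to report a mismatch on an affected JDK, while the direct buffer variant should match everywhere.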

This was broken by JDK-8225189
03-09-2020

I did not test JDK-12 releases, but some later ones may be affected, too.
03-09-2020

For a fix see the JNI function in Inflater: https://github.com/openjdk/jdk/commit/9115f920d2405d889f5a1176d3488aa77a0322d4#diff-1de445010a0326f965b7c125bd8f9b76R113 This clearly shows the difference and where the bug is.
03-09-2020

JDK 11.0.5 is the first one in the JDK 11 series that breaks. I tested AdoptOpenJDK, as the older binaries by Oracle are not downloadable.
03-09-2020

It looks like the bug was introduced here: https://github.com/openjdk/jdk/commit/9115f920d2405d889f5a1176d3488aa77a0322d4#diff-681a63e0502608f0fae34c9859d6fec7R101 Interestingly we have seen this also with a build of Java 11 (AdoptOpenJDK).
03-09-2020