JDK-7104012 : AIOOBE from RuleBasedBreakIterator.lookupState for some suppl. chars
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.text
  • Affected Version: 7
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86
  • Submitted: 2011-10-23
  • Updated: 2014-02-05
  • Resolved: 2012-10-03
JDK 8: 8 b61 (Fixed)
Description
FULL PRODUCT VERSION :
java version "1.7.0_01"
Java(TM) SE Runtime Environment (build 1.7.0_01-b08)
Java HotSpot(TM) 64-Bit Server VM (build 21.1-b02, mixed mode)

(This problem also exists in 1.5, 1.6, etc.)

ADDITIONAL OS VERSION INFORMATION :
Linux beast 3.0.0-12-generic #20-Ubuntu SMP Fri Oct 7 14:56:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux


A DESCRIPTION OF THE PROBLEM :
BreakIterator fails on some supplementary characters. When iterating over text that contains them, next() throws an internal ArrayIndexOutOfBoundsException from RuleBasedBreakIterator.lookupState.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the included test program

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
It should not throw an exception; instead, next() should return the next text boundary or BreakIterator.DONE.
ACTUAL -
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 268
	at java.text.RuleBasedBreakIterator.lookupState(RuleBasedBreakIterator.java:1036)
	at java.text.RuleBasedBreakIterator.handleNext(RuleBasedBreakIterator.java:931)
	at java.text.RuleBasedBreakIterator.next(RuleBasedBreakIterator.java:621)
	at test.main(test.java:8)


REPRODUCIBILITY :
This bug can always be reproduced.

---------- BEGIN SOURCE ----------
import java.text.BreakIterator;
import java.util.Locale;

public class test {
  public static void main(String args[]) {
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText("\udb40\udc53"); // U+E0053, TAG LATIN CAPITAL LETTER S
    bi.next();
  }
}

---------- END SOURCE ----------
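
A minimal sketch of how the expected behavior could be checked on a JDK that contains the fix: walk every boundary until BreakIterator.DONE and confirm that no exception escapes. The class name BoundaryWalk and the helper boundaries() are hypothetical names used only for illustration; on an affected JDK this loop is expected to fail with the same AIOOBE as the test program above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryWalk {
    // Hypothetical helper: collects every sentence boundary reported for the text.
    static List<Integer> boundaries(String text) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() yields the start boundary; next() advances until DONE.
        for (int b = bi.first(); b != BreakIterator.DONE; b = bi.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same input as the reproducer: U+E0053, TAG LATIN CAPITAL LETTER S
        String text = "\udb40\udc53";
        System.out.println(boundaries(text));
    }
}
```

The loop starts from first() rather than next() so that the start boundary is included and the final boundary can be compared against the text length.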

Comments
SupplementaryCharacterData.getValue(int codepoint) returns (int)0xFF (255) for code points in a specific category. That is not the same as (byte)0xFF (-1), which is the value defined as RuleBasedBreakIterator.IGNORE. Changed getValue() to return IGNORE when the looked-up value is 0xFF.
03-10-2012
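
The mismatch described in the comment above can be illustrated in isolation. This is only a sketch, not the JDK source: the class IgnoreSentinelSketch and the method normalize() are hypothetical, and IGNORE is redefined locally with the value the comment gives for RuleBasedBreakIterator.IGNORE.

```java
public class IgnoreSentinelSketch {
    // Local stand-in for the sentinel described in the comment: (byte)0xFF, i.e. -1.
    static final byte IGNORE = (byte) 0xFF;

    // Hypothetical normalization in the spirit of the fix:
    // map the unsigned value 0xFF (255) onto the signed sentinel (-1).
    static int normalize(int raw) {
        return (raw == 0xFF) ? IGNORE : raw;
    }

    public static void main(String[] args) {
        int raw = 0xFF; // what an int-returning lookup would hand back

        // Compared as ints, 255 and -1 differ, so the sentinel check is missed.
        System.out.println(raw == IGNORE);            // false
        // After normalization the sentinel check succeeds.
        System.out.println(normalize(raw) == IGNORE); // true
        // Other category values pass through unchanged.
        System.out.println(normalize(5));             // 5
    }
}
```

Because a byte is sign-extended when compared with an int, a caller testing the returned category against IGNORE never matches the raw 0xFF and goes on to treat 255 as an ordinary category value.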

EVALUATION: An ArrayIndexOutOfBoundsException (AIOOBE) should not be thrown. This bug needs to be fixed in JDK 8.
08-12-2011