JDK-6609664 : [BI] BreakIterator.getLineInstance() breaking at char \u2019 when used as apostrophe
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.text
  • Affected Version: 5.0
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: windows_2000
  • CPU: x86
  • Submitted: 2007-09-26
  • Updated: 2019-04-11
Description
FULL PRODUCT VERSION :
java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64)
Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
ver Windows 2000

A DESCRIPTION OF THE PROBLEM :
BreakIterator.getLineInstance() treats the character '\u2019' (right single quotation mark) as punctuation and breaks the line directly after it, even when it is used as an apostrophe and has text after it. The Unicode Standard says this is the preferred character to use for an apostrophe, instead of the vertical apostrophe 0x27.

The break iterator should be aware of the 'state' of the character: whether it is being used as punctuation (ending a quote) or as a modifier (apostrophe).

For example:

String text = "The 1940\u2019s and 1960\u2019s. ";

/** make sure the apostrophe is character \u2019, since character 0x27 wraps correctly */


BreakIterator.getLineInstance() breaks after the apostrophe. With the following substitution, the BreakIterator wraps correctly:

WORD_BREAKER.setText(text.replace('\u2019', (char) 0x27));

The break iterator should know something about the state of the character.  If there is a letter after the apostrophe it's used as a modifier and it should not break.  If there is white space, it should break, since it is being used as punctuation.
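The rule described above can be approximated outside the iterator by post-filtering its boundaries. The following is a minimal sketch, not a fix to BreakIterator itself: the class name ApostropheAwareLineBreaks and the method lineBreaks are hypothetical, and the test "a letter follows \u2019" is an assumption taken directly from the paragraph above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of the JDK.
final class ApostropheAwareLineBreaks {

    private static final char RIGHT_SINGLE_QUOTE = '\u2019';

    /**
     * Returns the line-break offsets reported by BreakIterator.getLineInstance(),
     * dropping any offset that falls directly after \u2019 when a letter follows.
     */
    static List<Integer> lineBreaks(String text) {
        BreakIterator it = BreakIterator.getLineInstance();
        it.setText(text);
        List<Integer> breaks = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            // \u2019 followed by a letter is treated as an apostrophe (modifier),
            // so a boundary at this offset is suppressed.
            boolean afterApostrophe = pos > 0 && pos < text.length()
                    && text.charAt(pos - 1) == RIGHT_SINGLE_QUOTE
                    && Character.isLetter(text.charAt(pos));
            if (!afterApostrophe) {
                breaks.add(pos);
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        String text = "The 1940\u2019s and 1960\u2019s. ";
        System.out.println(lineBreaks(text));
    }
}
```

When \u2019 is followed by white space, the boundary reported by the iterator is left untouched, so the character still behaves as closing punctuation in that position.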


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
String text = "Some have argued for a connection between the protest movements of the 1940\u2019s and those of the 1960\u2019s. ";

/** make sure the apostrophe is character \u2019, since character 0x27 wraps correctly */


BreakIterator.getLineInstance() breaks after the apostrophe. With the following substitution, the BreakIterator wraps correctly:

WORD_BREAKER.setText(text.replace('\u2019', (char) 0x27));


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The apostrophes (character \u2019) should not cause a break inside the words 1940's and 1960's.
ACTUAL -
The breaks are occurring as follows:

1940'
s
1960'
s.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
String text = "The 1940\u2019s and 1960\u2019s. ";

BreakIterator WORD_BREAKER = BreakIterator.getLineInstance();
WORD_BREAKER.setText(text);
int prev = 0;

for (int pos = 0;
     pos != BreakIterator.DONE;
     pos = WORD_BREAKER.next())
{
    if (pos > 0)
    {
        System.out.println("group from character[" +
                           prev + "] = " + text.charAt(prev) +
                           " to character[" + (pos - 1)
                           + "] = " + text.charAt(pos - 1));
    }
    prev = pos;
}


**********
OUTPUT:
***********
group from character[0] = T to character[3] =
group from character[4] = 1 to character[8] = ’
group from character[9] = s to character[10] =
group from character[11] = a to character[14] =
group from character[15] = 1 to character[19] = ’
group from character[20] = s to character[22] =

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Replace character \u2019 with 0x27 in the text passed to the break iterator.

String text = "The 1940\u2019s and 1960\u2019s. ";
BreakIterator WORD_BREAKER = BreakIterator.getLineInstance();
WORD_BREAKER.setText(text.replace('\u2019', (char) 0x27));
int prev = 0;

for (int pos = 0;
     pos != BreakIterator.DONE;
     pos = WORD_BREAKER.next())
{
    if (pos > 0)
    {
        System.out.println("group from character[" +
                           prev + "] = " + text.charAt(prev) +
                           " to character[" + (pos - 1)
                           + "] = " + text.charAt(pos - 1));
    }
    prev = pos;
}


**********
OUTPUT:
***********
group from character[0] = T to character[3] =
group from character[4] = 1 to character[10] =
group from character[11] = a to character[14] =
group from character[15] = 1 to character[22] =
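
Because the workaround swaps one char for one char, the offsets computed on the substituted copy also index the original string, so the displayed text can keep \u2019. A minimal sketch of that idea follows; the class name SubstitutedLineSegments and the method segments are hypothetical names introduced here for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of the JDK.
final class SubstitutedLineSegments {

    /**
     * Splits text at the line-break boundaries found on a copy in which
     * \u2019 has been replaced by 0x27, but cuts the ORIGINAL text.
     */
    static List<String> segments(String text) {
        // One char replaces one char, so offsets on the copy are valid for text.
        String substituted = text.replace('\u2019', (char) 0x27);
        BreakIterator it = BreakIterator.getLineInstance();
        it.setText(substituted);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The 1940\u2019s and 1960\u2019s. ";
        for (String segment : segments(text)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Note that the substitution is applied to every \u2019 in the text, including one that genuinely closes a quotation, so the iterator sees all of them as 0x27.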