JDK-8176371 : (scanner) Scanner fails when string length equals buffer size and latest characters are the delimiter
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util
  • Affected Version: 8,9
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2017-02-26
  • Updated: 2018-09-11
Other: tbd, Unresolved
Related Reports
Relates :  
Description
FULL PRODUCT VERSION :


ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows 8.x

A DESCRIPTION OF THE PROBLEM :
I've found a strange behaviour in the java.util.Scanner class. I tried to split a String into a set of tokens separated by the delimiter ";" using a Scanner.

If I consider a string of the form "<any_char>[*1022]" + ";[*n]" (1022 arbitrary characters followed by n delimiters), I expect Scanner to return n tokens. However, when n=3, the Scanner class fails: it sees just 2 tokens instead of 3. I think it's related to the internal char buffer size of the Scanner class (1024 characters), and I've found this issue only when the last characters are exactly the delimiter set for the Scanner.
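
To check the buffer-size hypothesis, a small sweep over prefix lengths near 1024 and over n = 1..4 can help. The following is only a sketch: the class name ScannerBoundarySweep and the helpers buildInput and countTokens are hypothetical names introduced for illustration, and the range 1020..1024 is an arbitrary choice around the suspected buffer size. It assumes, as in the report, that a prefix followed by n delimiters should yield n tokens; which rows (if any) are flagged depends on the JDK build.

```java
import java.util.Scanner;

class ScannerBoundarySweep {

    // Hypothetical helper: prefixLength copies of 'a' followed by n copies of ';'.
    static String buildInput(int prefixLength, int n) {
        StringBuilder sb = new StringBuilder(prefixLength + n);
        for (int i = 0; i < prefixLength; i++) {
            sb.append('a');
        }
        for (int i = 0; i < n; i++) {
            sb.append(';');
        }
        return sb.toString();
    }

    // Hypothetical helper: counts the tokens Scanner reports with ";" as delimiter.
    static int countTokens(String input) {
        int count = 0;
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(";");
            while (scanner.hasNext()) {
                scanner.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sweep prefix lengths around the suspected 1024-character buffer size.
        for (int prefixLength = 1020; prefixLength <= 1024; prefixLength++) {
            for (int n = 1; n <= 4; n++) {
                int actual = countTokens(buildInput(prefixLength, n));
                // The report expects n tokens for n trailing delimiters.
                String verdict = (actual == n) ? "ok" : "MISMATCH";
                System.out.println("prefix=" + prefixLength + " n=" + n
                        + " expected=" + n + " actual=" + actual + " " + verdict);
            }
        }
    }
}
```

A MISMATCH confined to the rows where the prefix plus delimiters crosses 1024 characters would support the buffer-boundary explanation.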

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Generate a string composed of 2 parts:
1- 1022 random characters (which may include the delimiter)
2- an ending run of 3 characters that are exactly the delimiter (in my case ";;;")

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
For a string of the form "a[*1022]" + ";[*n]" I expect n tokens. However, if n=3 the Scanner class fails: it sees just 2 tokens instead of 3. I think it's related to the internal char buffer size of the Scanner class.

a[x1022];      -> 1 token

a[x1022];;     -> 2 tokens

a[x1022];;;    -> 3 tokens

a[x1022];;;;   -> 4 tokens
ACTUAL -
a[x1022];      -> 1 token: correct

a[x1022];;     -> 2 tokens: correct

a[x1022];;;    -> 2 tokens: wrong  (I expect 3 tokens)

a[x1022];;;;   -> 4 tokens: correct

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
// A simple example that reproduces the problem.

import java.util.Scanner;

public class ScannerDelimiterTest {

    public static void main(String[] args) {

        // generate test string: (1022x "a") + (3x ";")
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1022; i++) {
            builder.append('a');
        }
        builder.append(";;;");
        String testLine = builder.toString();

        // set up the Scanner with ";" as the delimiter
        String delimiter = ";";
        Scanner lineScanner = new Scanner(testLine);
        lineScanner.useDelimiter(delimiter);
        int p = 0;

        // tokenization: print each token with its 1-based index
        while (lineScanner.hasNext()) {
            p++;
            String currentToken = lineScanner.next();
            System.out.println("token" + p + ": '" + currentToken + "'");
        }
        lineScanner.close();
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use the String.split method instead of Scanner.
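
A minimal sketch of that workaround follows. The class name SplitWorkaround and the helper tokenize are hypothetical, not part of the report. Note that String.split drops trailing empty strings unless a negative limit is passed, and that with a negative limit a line ending in the delimiter produces one extra empty string at the end; the sketch drops that last element so the count matches the n tokens the report expects. It does not try to mirror how Scanner treats a delimiter at the very start of the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitWorkaround {

    // Hypothetical helper: splits on a literal delimiter and keeps empty tokens.
    static List<String> tokenize(String line, String delimiter) {
        // Limit -1 keeps trailing empty strings, which split() drops by default.
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        List<String> tokens = new ArrayList<>(Arrays.asList(parts));
        // Input ending in the delimiter leaves one extra empty string; drop it.
        if (!tokens.isEmpty() && tokens.get(tokens.size() - 1).isEmpty()) {
            tokens.remove(tokens.size() - 1);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same test string as in the report: (1022x "a") + (3x ";")
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1022; i++) {
            sb.append('a');
        }
        sb.append(";;;");
        List<String> tokens = tokenize(sb.toString(), ";");
        System.out.println("token count: " + tokens.size()); // prints "token count: 3"
    }
}
```

Because split works on the whole String at once, it involves no read buffer, so the boundary condition described above does not arise.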


Comments
Similar to JDK-8176407, which also exhibits a bug when the delimiter spans a buffer boundary.
09-03-2017

Verified this issue for 8 GA, 8u121, and 9ea on Windows and Linux, and could confirm the issue as reported by the submitter.

Steps to reproduce:
- Run the attached test program (JI9047876.java) with the JDK.

Result:
OS: Windows 7, 10 64 bit; Ubuntu Linux 14.04 LTS
JDK:
8 b132: Fail
8u121 b13: Fail
9ea+152: Fail
08-03-2017