JDK-8151481 : j.u.regex.Pattern cleanup
  • Type: Enhancement
  • Component: core-libs
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2016-03-09
  • Updated: 2022-11-02
  • Resolved: 2016-05-11
JDK 9: 9 b119 (Fixed)
Description
(1) Pull the "broken" printNodeTree (a debugging aid) out of Pattern; it has not worked as expected for a while. Replace it with a working implementation in a separate class, j.u.regex.PrintPattern, which can now print the clean and complete node tree of a pattern. For example,

   Pattern: [a-z0-9]+|ABCDEFG
     0:  <Start>
     1:  <Branch>
     2:    <CharPropertyGreedy +>
     3:      <Union>
     4:        <Range[a-z]>
     5:        <Range[0-9]>
         <-branch.separator->
     6:    <Slice  "ABCDEFG">
     7:  </Branch>
     8:  <END>
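
For readers who want to relate the tree above to observable behavior, here is a minimal sketch that exercises the same pattern through the public java.util.regex API only. It does not use PrintPattern (whose API is not described in this report); the class name BranchDemo is made up for illustration.

```java
import java.util.regex.Pattern;

public class BranchDemo {
    // Same pattern as in the node tree: a greedy char-class run, or a literal slice.
    static final Pattern P = Pattern.compile("[a-z0-9]+|ABCDEFG");

    // True if the whole input is matched by either branch.
    static boolean matches(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("abc123"));   // taken by the first branch
        System.out.println(matches("ABCDEFG"));  // taken by the second branch
        System.out.println(matches("ABC"));      // rejected by both branches
    }
}
```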

(2) Optimize the greedy repetition of a "CharProperty": a greedy repetition of a single "CharProperty", such as \p{IsGreek}+ or the most commonly used .*, is now parsed into a single/smooth loop node.

from

    Pattern: \p{IsGreek}+
     0:  <Start>
     1:  <Curly GREEDY  + >
     2:    <Script GREEK>
         </Curly>
     3:  <END>

to

     Pattern: \p{IsGreek}+
     0:  <Start>
     1:  <CharPropertyGreedy Script GREEK+>
     2:  <END>

   The simple jmh benchmark [2] indicates an improvement of about 50% or more, especially in the no-match case.
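
The match and no-match cases can be tried with a rough timing loop like the one below. This is only a sketch: it is not the jmh benchmark from [2], it uses System.nanoTime without warmup control, and the class name, input sizes and iteration counts are arbitrary assumptions. Treat any numbers it prints as indicative at best.

```java
import java.util.regex.Pattern;

public class GreedyPropertyTiming {
    // Greedy repetition of a single character property.
    static final Pattern GREEK_RUN = Pattern.compile("\\p{IsGreek}+");

    // True if the whole input is a non-empty run of Greek-script characters.
    static boolean isGreekRun(String s) {
        return GREEK_RUN.matcher(s).matches();
    }

    // Builds n Greek letters (alpha, beta, gamma cycling) followed by a tail.
    static String input(int n, String tail) {
        StringBuilder sb = new StringBuilder(n + tail.length());
        for (int i = 0; i < n; i++) {
            sb.append((char) ('\u03b1' + i % 3));
        }
        return sb.append(tail).toString();
    }

    public static void main(String[] args) {
        String hit = input(1_000, "");    // matches
        String miss = input(1_000, "x");  // no match: trailing Latin letter
        for (String s : new String[] { hit, miss }) {
            boolean result = false;
            long start = System.nanoTime();
            for (int i = 0; i < 10_000; i++) {
                result = isGreekRun(s);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("matched=" + result + " time(us)=" + micros);
        }
    }
}
```

Running the same class on a JDK with and without this change gives a crude comparison; a proper measurement should use jmh as in [2].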

(3) Optimize the "union" of individual chars inside a character class [...], such as [ABCDEF]. For a regex like [a-zABCDEF], the engine currently generates nodes like

   Pattern: [a-zABCDEF]
     0:  <Start>
     1:  <Union>
     2:    <Union>
     3:      <Union>
     4:        <Union>
     5:          <Union>
     6:            <Union>
     7:              <Range[a-z]>
     8:              <Bits [ A B C D E F]>
     8:            <Bits [ A B C D E F]>
     8:          <Bits [ A B C D E F]>
     8:        <Bits [ A B C D E F]>
     8:      <Bits [ A B C D E F]>
     8:    <Bits [ A B C D E F]>
     9:  <END>

With the optimization it generates (as it should)

   Pattern: [a-zABCDEF]
     0:  <Start>
     1:  <Union>
     2:    <Range[a-z]>
     3:    <Bits [ A B C D E F]>
     4:  <END>

   The jmh benchmark [2] also indicates it is much faster, especially in the no-match case.
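
The difference between the two trees can be illustrated with a small model. The sketch below is not the Pattern source: Bits, range, flat and nested are hypothetical names, and java.util.function.IntPredicate stands in for the engine's node types. It only shows why re-wrapping the union for every added char is redundant: both constructions accept the same characters, but the nested one evaluates a chain of unions per test.

```java
import java.util.function.IntPredicate;

public class UnionSketch {
    // Hypothetical stand-in for a "Bits" node: one lookup table for chars below 256.
    static final class Bits implements IntPredicate {
        private final boolean[] table = new boolean[256];

        // Assumes 0 <= ch < 256.
        Bits add(int ch) {
            table[ch] = true;
            return this;
        }

        @Override
        public boolean test(int ch) {
            return ch >= 0 && ch < 256 && table[ch];
        }
    }

    static IntPredicate range(int lo, int hi) {
        return ch -> ch >= lo && ch <= hi;
    }

    // [a-zABCDEF] as one union of one range and one Bits table.
    static IntPredicate flat() {
        Bits bits = new Bits();
        for (char c : "ABCDEF".toCharArray()) {
            bits.add(c);
        }
        return range('a', 'z').or(bits);
    }

    // The redundant shape: every added char wraps the previous union again.
    static IntPredicate nested() {
        Bits bits = new Bits();
        IntPredicate p = range('a', 'z');
        for (char c : "ABCDEF".toCharArray()) {
            bits.add(c);
            p = p.or(bits);
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(flat().test('C') + " " + nested().test('C'));
        System.out.println(flat().test('G') + " " + nested().test('G'));
    }
}
```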

(4) Replace the "constant" CharProperty nodes with a simple functional interface/lambda. The change reduces the total number of classes in the package (anonymous classes) from 130+ to < 70.
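
As a rough illustration of the idea, the sketch below expresses a few constant character properties as lambdas or method references on a single functional interface instead of one anonymous class each. The interface name CharPredicate, its methods, and the constants are assumptions chosen for this example; the actual code is in the changeset linked under Comments.

```java
public class CharPredicateSketch {
    // Hypothetical single-method interface for "does this code point qualify?".
    @FunctionalInterface
    interface CharPredicate {
        boolean is(int ch);

        // Accepts a code point if either predicate accepts it.
        default CharPredicate union(CharPredicate other) {
            return ch -> is(ch) || other.is(ch);
        }

        // Accepts exactly the code points this predicate rejects.
        default CharPredicate negate() {
            return ch -> !is(ch);
        }
    }

    // Constant properties written as lambdas or method references,
    // so no dedicated anonymous class is declared for each of them.
    static final CharPredicate DIGIT = ch -> ch >= '0' && ch <= '9';
    static final CharPredicate LOWER = Character::isLowerCase;
    static final CharPredicate ALL = ch -> true;

    public static void main(String[] args) {
        CharPredicate digitOrLower = DIGIT.union(LOWER);
        System.out.println(digitOrLower.is('7'));
        System.out.println(digitOrLower.is('Q'));
        System.out.println(DIGIT.negate().is('x'));
    }
}
```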


oh, there is another one
(5) fix the change for the "j.u.regex: Negated Character Classes" [3]

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html 
Comments
URL: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/d0c319c32334 User: lana Date: 2016-05-18 20:42:24 +0000
18-05-2016

URL: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d0c319c32334 User: sherman Date: 2016-05-11 04:19:37 +0000
11-05-2016