JDK-8046101 : JEP 111: Additional Unicode Constructs for Regular Expressions
  • Type: JEP
  • Component: core-libs
  • Priority: P4
  • Status: Candidate
  • Resolution: Unresolved
  • Submitted: 2011-07-26
  • Updated: 2016-01-18
Description
Summary
-------

Adopt further regular-expression constructs from [Unicode TR#18][tr18].


Motivation
----------

The primary motivation is to raise the level of Unicode support so that
developers can write sophisticated Unicode-aware regular expressions on the
Java platform.  This is important to keep the Java platform competitive with
other languages that already offer more complete support for Unicode regular
expressions.


Description
-----------

Java regular expressions are derived from Perl regular expressions and are
meant to give Java developers most of the Perl-style regular-expression
features.  Perl regular expressions have evolved rapidly over the past few
years to follow [Unicode Standard TR#18 Unicode Regular Expressions][tr18].
Java regular expressions claim conformance with Level 1 of TR#18, which is the
lowest level of conformance, plus RL2.1 Canonical Equivalents.  Given that the
Unicode Standard is widely accepted as the de facto standard for development
platforms, and that Java uses Unicode as its internal encoding scheme,
higher-level Unicode support is desirable for developers working on
Unicode-aware applications.  The following new constructs and features are
proposed to provide better Unicode support in Java regular expressions:

  - \N{...} -- Unicode name properties
  - \X -- Extended grapheme clusters
  - Fix the broken canonical-equivalence support
  - \R -- Unicode line-break sequence, as suggested in TR#18 Line Boundaries
  - \g{...} -- Perl-style back-reference construct for named and numbered capturing groups
  - More complete Unicode properties, as in \p{IsXXXX}
  - \h \H \v \V -- Horizontal/vertical whitespace
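
As an illustration of how some of the constructs listed above might be used,
here is a minimal sketch.  It assumes a JDK whose `java.util.regex.Pattern`
accepts `\R`, `\X`, `\N{...}` and `\h` (believed to be JDK 9 or later); the
class and method names are hypothetical and chosen only for this example.
`\g{...}` is deliberately left out, since this sketch does not assume it is
available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the JDK or of this JEP.
class UnicodeRegexSketch {

    // \R matches any Unicode line-break sequence and treats "\r\n" as one unit.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    // \X matches one extended grapheme cluster (a user-perceived character),
    // so a base letter and its combining marks stay together.
    static List<String> graphemes(String text) {
        List<String> clusters = new ArrayList<>();
        Matcher m = Pattern.compile("\\X").matcher(text);
        while (m.find()) {
            clusters.add(m.group());
        }
        return clusters;
    }

    // \N{...} matches a single character identified by its Unicode name.
    static boolean containsLatinSmallA(String text) {
        return Pattern.compile("\\N{LATIN SMALL LETTER A}").matcher(text).find();
    }

    // \h matches horizontal whitespace, including NO-BREAK SPACE (U+00A0).
    static String collapseHorizontalSpace(String text) {
        return text.replaceAll("\\h+", " ");
    }

    public static void main(String[] args) {
        System.out.println(splitLines("one\r\ntwo\u2028three").length);
        System.out.println(graphemes("e\u0301x").size());
        System.out.println(containsLatinSmallA("banana"));
        System.out.println(collapseHorizontalSpace("a\t \u00A0b"));
    }
}
```

Each helper isolates one construct so that its behaviour can be checked
independently of the others.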


Testing
-------

All of the features (new regex constructs) listed here will be covered by new
unit tests that run under the existing test framework.


[tr18]: http://unicode.org/reports/tr18