Adopt a more space-efficient internal representation for strings.
Improve the space efficiency of the `String` class and related classes
while maintaining performance in most scenarios and preserving full
compatibility for all related Java and native interfaces.
It is not a goal to use alternate encodings such as UTF-8 in the internal
representation of strings. A subsequent JEP may explore that approach.
The current implementation of the `String` class stores characters in a
`char` array, using two bytes (sixteen bits) for each character. Data
gathered from many different applications indicates that strings are a
major component of heap usage and, moreover, that most `String` objects
contain only Latin-1 characters. Such characters require only one byte
of storage, hence half of the space in the internal `char` arrays of such
`String` objects is going unused.
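
The arithmetic behind this observation can be sketched as follows. This is an illustrative estimate only, not part of the proposal: the class name `Latin1Footprint` and its methods are hypothetical, and the figures cover just the character payload, ignoring object and array headers.

```java
// Hypothetical helper, for illustration only: estimates the character payload
// of a string under the current char[] layout and under a one-byte-per-character
// layout for strings that contain only Latin-1 characters.
final class Latin1Footprint {

    /** True if every UTF-16 code unit of s lies in U+0000..U+00FF. */
    static boolean isLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Payload bytes when each character occupies a two-byte char. */
    static long charArrayPayload(String s) {
        return 2L * s.length();
    }

    /** Payload bytes when Latin-1-only strings use one byte per character. */
    static long compactPayload(String s) {
        return isLatin1(s) ? s.length() : 2L * s.length();
    }

    public static void main(String[] args) {
        String[] samples = {"hello", "caf\u00e9", "\u03c0 = 3.14"};
        for (String s : samples) {
            System.out.println(s.length() + " chars: "
                    + charArrayPayload(s) + " bytes as char[], "
                    + compactPayload(s) + " bytes compact");
        }
    }
}
```

Under these assumptions a Latin-1-only string needs half the payload, while a string with even one character outside Latin-1 needs the same payload as before.
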
We propose to change the internal representation of the `String` class
from a UTF-16 `char` array to a `byte` array plus an encoding-flag field.
The new `String` class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per
character), based upon the contents of the string. The encoding flag
will indicate which encoding is used.
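
As a rough illustration of a `byte` array plus an encoding flag, consider the minimal sketch below. It is not the proposed `String` implementation: the class `CompactText`, the constants `LATIN1` and `UTF16`, their numeric values, and the big-endian byte order used for two-byte characters are all assumptions made for the example.

```java
// Illustrative sketch only; every name and layout choice here is hypothetical.
final class CompactText {
    static final byte LATIN1 = 0; // assumed flag value: one byte per character
    static final byte UTF16 = 1;  // assumed flag value: two bytes per character

    private final byte[] value;
    private final byte coder;

    CompactText(String s) {
        boolean latin1 = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                latin1 = false;
                break;
            }
        }
        if (latin1) {
            value = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                value[i] = (byte) s.charAt(i);
            }
            coder = LATIN1;
        } else {
            value = new byte[2 * s.length()];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8); // high byte first (arbitrary choice)
                value[2 * i + 1] = (byte) c;    // low byte second
            }
            coder = UTF16;
        }
    }

    byte coder() {
        return coder;
    }

    int storedBytes() {
        return value.length;
    }

    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF); // mask to avoid sign extension
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```

Every accessor branches on the flag, so callers see the same `char`-based view regardless of which encoding backs the string.
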
String-related classes such as `AbstractStringBuilder`, `StringBuilder`,
and `StringBuffer` will be updated to use the same representation, as
will the HotSpot VM's intrinsic string operations.
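
One consequence for mutable builders is that a buffer which starts out with one byte per character must be widened once a character outside Latin-1 is appended. The sketch below shows one way that could work. It is an assumption about a plausible design, not a description of the actual `AbstractStringBuilder` changes, and the class `CompactBuilder` and all of its members are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only; not the JDK's builder implementation.
final class CompactBuilder {
    private byte[] value = new byte[16];
    private boolean utf16 = false; // false: one byte per char, true: two bytes
    private int count = 0;         // number of characters stored

    CompactBuilder append(char c) {
        if (!utf16 && c > 0xFF) {
            inflate(); // first non-Latin-1 character forces the wide layout
        }
        ensureCapacity(count + 1);
        if (utf16) {
            value[2 * count] = (byte) (c >> 8);
            value[2 * count + 1] = (byte) c;
        } else {
            value[count] = (byte) c;
        }
        count++;
        return this;
    }

    CompactBuilder append(String s) {
        for (int i = 0; i < s.length(); i++) {
            append(s.charAt(i));
        }
        return this;
    }

    boolean isUtf16() {
        return utf16;
    }

    int length() {
        return count;
    }

    /** Re-encodes the existing one-byte characters as two-byte characters. */
    private void inflate() {
        byte[] wide = new byte[value.length * 2];
        for (int i = 0; i < count; i++) {
            wide[2 * i + 1] = value[i]; // high byte stays zero
        }
        value = wide;
        utf16 = true;
    }

    private void ensureCapacity(int chars) {
        int needed = utf16 ? 2 * chars : chars;
        if (needed > value.length) {
            value = Arrays.copyOf(value, Math.max(needed, value.length * 2));
        }
    }

    @Override
    public String toString() {
        char[] out = new char[count];
        for (int i = 0; i < count; i++) {
            out[i] = utf16
                    ? (char) (((value[2 * i] & 0xFF) << 8) | (value[2 * i + 1] & 0xFF))
                    : (char) (value[i] & 0xFF);
        }
        return new String(out);
    }
}
```

In this sketch the widening happens at most once per builder, so its cost is paid only by content that actually contains non-Latin-1 characters.
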
This is purely an implementation change, with no changes to existing
public interfaces. There are no plans to add any new public APIs or
other interfaces.
The prototyping work done to date confirms the expected reduction in
memory footprint, substantial reductions of GC activity, and minor
performance regressions in some corner cases.
For further detail, see:
- [State of String Density Performance]
- [String Density Impact on SPECjbb2005 on SPARC]
We tried a "compressed strings" feature in JDK 6 update releases, enabled
by an `-XX` flag. When enabled, `String.value` was changed to an
`Object` reference and would point either to a `byte` array, for strings
containing only 7-bit US-ASCII characters, or else a `char` array. This
implementation was not open-sourced, so it was difficult to maintain and
keep in sync with the mainline JDK source. It has since been removed.
Thorough compatibility and regression testing will be essential for a
change to such a fundamental part of the platform.
We will also need to confirm that we have fulfilled the performance goals
of this project, and to analyze the memory savings actually achieved.
Performance testing should be done using a broad range of workloads,
ranging from focused microbenchmarks to large-scale server workloads.
We will encourage the entire Java community to perform early testing with
this change in order to identify any remaining issues.
Risks and Assumptions
Optimizing character storage for memory may well come with a trade-off in
terms of run-time performance. We expect that this will be offset by
reduced GC activity and that we will be able to maintain the throughput
of typical server benchmarks. If not, we will investigate optimizations
that can strike an acceptable balance between memory saving and run-time
performance.
Other recent projects have already reduced the heap space used by
strings, in particular [JEP 192: String Deduplication in G1][jep192].
Even with duplicates eliminated, the remaining string data can be made to
consume less space if encoded more efficiently. We are assuming that
this project will still provide a benefit commensurate with the effort
required.