JDK-8054307 : JEP 254: Compact Strings
  • Type: JEP
  • Component: core-libs
  • Sub-Component: java.lang
  • Priority: P2
  • Status: Closed
  • Resolution: Delivered
  • Fix Versions: 9
  • Submitted: 2014-08-04
  • Updated: 2019-05-18
  • Resolved: 2016-04-07
Related Reports
Blocks :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Sub Tasks
JDK-8081277 :  
JDK-8081279 :  
JDK-8081284 :  
JDK-8081285 :  
JDK-8081287 :  
JDK-8081728 :  
JDK-8081736 :  
JDK-8085843 :  
JDK-8132375 :  
JDK-8136757 :  
JDK-8138690 :  
JDK-8138702 :  
JDK-8139070 :  
JDK-8139268 :  
JDK-8139316 :  
JDK-8140388 :  
JDK-8140389 :  
JDK-8141132 :  
JDK-8176344 :  
Description
Summary
-------

Adopt a more space-efficient internal representation for strings.


Goals
-----

Improve the space efficiency of the `String` class and related classes
while maintaining performance in most scenarios and preserving full
compatibility for all related Java and native interfaces.


Non-Goals
---------

It is not a goal to use alternate encodings such as UTF-8 in the internal
representation of strings.  A subsequent JEP may explore that approach.


Motivation
----------

The current implementation of the `String` class stores characters in a
`char` array, using two bytes (sixteen bits) for each character.  Data
gathered from many different applications indicates that strings are a
major component of heap usage and, moreover, that most `String` objects
contain only Latin-1 characters.  Such characters require only one byte
of storage, hence half of the space in the internal `char` arrays of such
`String` objects is going unused.


Description
-----------

We propose to change the internal representation of the `String` class
from a UTF-16 `char` array to a `byte` array plus an encoding-flag field.
The new `String` class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per
character), based upon the contents of the string.  The encoding flag
will indicate which encoding is used.

String-related classes such as `AbstractStringBuilder`, `StringBuilder`,
and `StringBuffer` will be updated to use the same representation, as
will the HotSpot VM's intrinsic string operations.

This is purely an implementation change, with no changes to existing
public interfaces.  There are no plans to add any new public APIs or
other interfaces.

The prototyping work done to date confirms the expected reduction in
memory footprint, substantial reductions of GC activity, and minor
performance regressions in some corner cases.

For further detail, see:

  - [State of String Density Performance][1]
  - [String Density Impact on SPECjbb2005 on SPARC][2]


Alternatives
------------

We tried a "compressed strings" feature in JDK 6 update releases, enabled
by an `-XX` flag.  When enabled, `String.value` was changed to an
`Object` reference and would point either to a `byte` array, for strings
containing only 7-bit US-ASCII characters, or else a `char` array.  This
implementation was not open-sourced, so it was difficult to maintain and
keep in sync with the mainline JDK source.  It has since been removed.


Testing
-------

Thorough compatibility and regression testing will be essential for a
change to such a fundamental part of the platform.

We will also need to confirm that we have fulfilled the performance goals
of this project.  Analysis of memory savings will need to be done.
Performance testing should be done using a broad range of workloads,
ranging from focused microbenchmarks to large-scale server workloads.

We will encourage the entire Java community to perform early testing with
this change in order to identify any remaining issues.


Risks and Assumptions
---------------------

Optimizing character storage for memory may well come with a trade-off in
terms of run-time performance.  We expect that this will be offset by
reduced GC activity and that we will be able to maintain the throughput
of typical server benchmarks.  If not, we will investigate optimizations
that can strike an acceptable balance between memory saving and run-time
performance.

Other recent projects have already reduced the heap space used by
strings, in particular [JEP 192: String Deduplication in G1][jep192].
Even with duplicates eliminated, the remaining string data can be made to
consume less space if encoded more efficiently.  We are assuming that
this project will still provide a benefit commensurate with the effort
required.


[1]: http://cr.openjdk.java.net/~shade/density/state-of-string-density-v1.txt
[2]: http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-jbb2005-Report.pdf
[jep192]: http://openjdk.java.net/jeps/192

Comments
I updated the JEP integration/due date according to the following schedule: Putback to hs-comp: Nov 03 Integrate into hs-main: Nov 12 Integrate into jdk9-dev: Nov 18 Integrate into Master: Nov 25 == JEP int date
05-11-2015

There are new intrinsics for StringUTF16.get/putChar in the prototype implementation.
16-05-2015

There's a substantial improvement in speed if StringUTF16.[get|put]Char are implemented using Unsafe instead of in plain Java: http://cr.openjdk.java.net/~plevart/misc/JEP254/CharAtBench.java
16-05-2015

There is a prototype implementation in JDK Sandbox repository (mainly tested on x86), under JDK-8054307-branch here is the brief build instructions: $ hg clone http://hg.openjdk.java.net/jdk9/sandbox/ $ cd sandbox $ sh ./get_source.sh $ sh ./common/bin/hgforest.sh up -r JDK-8054307-branch $ make configure $ make images
16-05-2015

Just a note on code at: http://cr.openjdk.java.net/~sherman/8054307/jdk/ ... A reference to String should be safe to pass to threads via data-race. New "byte coder" field is also part of String's final state but is not marked final. Java code is currently written so that it can't be made final (assigned in initBytes methods). So is safety relying on the implementation details and piggy-backing on "value" field final marker to affect "coder" field store order? Should there be an explicit Unsafe.storeFence() at the end of each constructor?
15-05-2015

How the hashcode of a String object is computed is part of the spec. So yes, it is supposed to be the same.
08-05-2015

Is hashcode value for a String going to be the same when moving from old Strings to compressed ones? Especially for the Strings that are compressable?
08-05-2015

The link from the description doesn't work anymore: http://cr.openjdk.java.net/~huntch/string-density/String-Density-SPARC-jbb2005-Report.pdf Is the same material available elsewhere?
06-05-2015

Ok, after the detailed explanations by Sherman and Charlie, the proposal looks sane. Performance team will help to rigorously estimate the performance impact of this feature.
02-12-2014

Yes, I was talking about the 64-bit vm. This alignment issue was briefly discussed.
05-11-2014

Replying briefly. New boolean field is a 8-byte waste at least on 32-bit VMs, because objects are aligned by 8 bytes, and there is no alignment shadow after "hash" field in 32-bit mode. In 64-bit modes, we can squeeze the boolean flag into the existing alignment shadow without increasing the String size. Using http://openjdk.java.net/projects/code-tools/jol/ on 8u40: $ java -jar jol-cli/target/jol-estimates.jar java.lang.String ***** 32-bit VM: ********************************************************** java.lang.String object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 8 (object header) N/A 8 4 char[] String.value N/A 12 4 int String.hash N/A Instance size: 16 bytes (estimated, the sample instance is not available) Space losses: 0 bytes internal + 0 bytes external = 0 bytes total ***** 64-bit VM: ********************************************************** java.lang.String object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 16 (object header) N/A 16 8 char[] String.value N/A 24 4 int String.hash N/A 28 4 (loss due to the next object alignment) Instance size: 32 bytes (estimated, the sample instance is not available) Space losses: 0 bytes internal + 4 bytes external = 4 bytes total ***** 64-bit VM, compressed references enabled: *************************** java.lang.String object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 12 (object header) N/A 12 4 char[] String.value N/A 16 4 int String.hash N/A 20 4 (loss due to the next object alignment) Instance size: 24 bytes (estimated, the sample instance is not available) Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
05-11-2014

We already have a "int hash" field, so should be 3 bytes of "waste". It is a "regression" considered we just removed the offset and count fields. The idea of squeezing the "flag" into somewhere, such as the the byte[], has been evaluated (in my first round of prototype it is only a "boolean isLatin1" field, so in theory we only need to store a "bit" somewhere), but it appears not worth the trouble for just 3 bytes (not only the libraries code, there are lots of hotspot code for the String class as well...). It's always a trade-off of memory and cpu speed. Now I'm considering these 3 bytes as a "reserved space" for future expansion, should we want to add more "supported" encoding:-)
05-11-2014

I don't specially like the new 'flag' field. In your webrev it's a byte field but nevertheless it will occupy 8 bytes because of alignment reasons and it will waste 7 bytes. Maybe we could store the flag as first entry of the byte array? Of course that will require to adjust the index for every access, but maybe that's still cheaper than wasting 7 bytes (especially for small strings)?
05-11-2014

One of the motivations of the suggested latin1+utf16 approach is exactly to avoid the compression/decompression overheads that the previous CompressedStrings project had experienced. The nature of this approach guarantees the O(1) performance of the index based charAt() operation while still have the benefit of memory footprint saving. In fact this is something coming up after couple experimenting tryings of the utf8 based internal representation. As the proposal suggested, the LATIN + UTF16 is our first choice to explore for now. In this approach, (1) the char[] is replaced with a byte[] as the internal storage of the String object (2) a "flag" field is used to indicate which encoding is being used to encode the internal byte[] (3) when the String object is first being initialized, the chars are scanned once to decide if it's latin1 only String or utf16 string, if a latin1 only string, only the low-8-bit is stored, otherwise the original char[] is stored as byte[] in utf16 encoding, with the native byte endian (4) all operations after that will be go either latin1 or utf16 branches This approach keeps the O(1) performance for the index based operation. With the cost of the initial scanning during the initialization of the String the compression/decompression overheads are avoided. Sure the implementation now looks a little "complicated" and there are potential slowdown for certain access, especially for the utf16 based String objects (most operations for the latin1 based seems get a little faster). With help of vm Intrinsic for key methods such as the put/getChar for utf16, we might be able to control the regression in a limited range. The byte[] for the utf16 based String objects uses the native byte endian, so from vm point of view, they are the same "16-bit-unsigned-integer block" as the char[] are. The webrev of the libraries side change of the experimental implementation is at http://cr.openjdk.java.net/~sherman/8054307/jdk/ Some preliminary results from running the experimental binary against the benchmark like specjbb2005 look encouraging. As Martin and Alekey have pointed out, the nature of the non-fixed-width utf8 encoding brings in the biggest disadvantage of the O(N) performance regression on the index based access operation, which definitely is unacceptable, if implemented directly. The reason we are still keeping this as one of the directions we are willing to continue exploring is that it appears there might be some use scenario/needs that an utf8 internal String representation is desirable, for example the jvm server works in a pure utf8 environment and most of the String objects are not being accessed via index based methods... with some possible/acceptable "workarounds" such as to have a "iterator-like" interface to replace the CharSequence (so we can cache the index/offset without counting from the beginning every time), if possible, to have a ASCII + UTF8 combination (similar as to our first "latin1 + utf16), so if most of the String objects inside a vm is ascii based (this is so far the fact that the data suggests what is happening inside a running jvm), these objects will not have the regression... But, big vm side change will be needed, if we really go support the utf16 based String.
04-11-2014

+1 This is one scary improvement we are looking at. Vaguely remembering the CompressedStrings experience to approach the same problem, I would like to see why the suggested approach is better, and how it avoids the pitfalls encountered there. If I remember correctly, CompressedStrings failed with paying for the compression/decompression overheads associated with using the UTF-16-expecting code. JEP says nothing about that experience, and I think CompressedStrings post-mortem is a deciding factor for moving on with this JEP. Degrading String.charAt() to O(N) is completely unacceptable, and will lead to major performance problems across the entire Java ecosystem. Although I am not sure how would one degrade charAt with compressed Strings, since they are also fixed-record-sized by 1 byte in current approach.
04-11-2014

Changing the String representation is difficult. Good luck! Lots of existing code assumes that UTF-16 code units can be random-accessed in an O(1) fashion. Changing the internal representation to e.g. UTF-8 will make String.charAt O(N), which seems unacceptable. Optimizing for US-ASCII content means giving (even more) performance preference to US users, which is politically unpleasant. If you choose an 8-bit character set, you will also need to make a political decision, e.g. Latin-1 vs. including the Euro. How will we avoid repeating the poor experience of Compressed Strings? I have a dream. String could have a compressed and uncompressed rep. GC could auto-compress non-recently-used Strings. At runtime, if the String is compressed, uncompress it first before continuing. Heavy cooperation with hotspot. Very difficult to implement well.
04-11-2014