JDK-8059092 : JEP 250: Store Interned Strings in CDS Archives
  • Type: JEP
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P2
  • Status: Closed
  • Resolution: Delivered
  • Fix Versions: 9
  • Submitted: 2014-09-24
  • Updated: 2018-01-08
  • Resolved: 2016-02-27
Description
Summary
-------

Store interned strings in class-data sharing (CDS) archives.


Goals
-----

  - Reduce memory consumption by sharing the `String` objects and
    underlying `char` array objects amongst different JVM processes.

  - Only support shared strings for the G1 GC.  Shared strings require a
    pinned region, and G1 is the only HotSpot GC that supports pinning.

  - Only support 64-bit platforms with compressed object and class
    pointers.

  - No significant degradation (< 2-3%) on startup time, string-lookup
    time, GC pause time, or runtime performance using the usual
    benchmarks.


Non-Goals
---------

  - Reducing startup time is not a goal.

  - Other types of GCs (besides G1) will not be supported.

  - 32-bit platforms will not be supported.


Motivation
----------

Currently, when CDS stores classes into the archive, the
`CONSTANT_String` items in the constant pools are represented by UTF-8
strings.  When the class is loaded, the UTF-8 strings are converted into
`java.lang.String` objects on demand.  This potentially wastes memory,
since each character in each interned string takes up three bytes or more
(two bytes in the `String`, 1-3 bytes in the UTF-8).
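The arithmetic above can be checked with a small sketch. It assumes a JDK 8-style `String` backed by a `char` array (two bytes per character) and uses standard UTF-8 as a stand-in for the modified UTF-8 of the class-file format, which differs only for U+0000 and supplementary characters. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class StringFootprint {
    /** Bytes used by the UTF-8 form (approximation of the constant-pool entry). */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes used by the char[] payload of a char-backed String: 2 per char. */
    static int charArrayBytes(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        String[] samples = {"hello", "h\u00e9llo", "\u65e5\u672c\u8a9e"};
        for (String s : samples) {
            int utf8 = utf8Bytes(s);
            int chars = charArrayBytes(s);
            // Both copies exist once the constant has been resolved.
            System.out.println(s.length() + " chars: utf8=" + utf8
                    + " char[]=" + chars + " combined=" + (utf8 + chars));
        }
    }
}
```

Object headers and array headers are left out of the count, so the real per-string overhead is larger than these payload figures.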

Also, because the strings are created dynamically, they cannot easily be
shared across JVM processes.


Description
-----------

At dump time, a designated string space is allocated within the Java heap
during heap initialization.  Pointers to the interned `String` objects
and their underlying `char`-array objects are modified, as if those objects
are from the designated space, when writing out the interned string table
and the `String` objects.

The string table is compressed and then stored in the archive at dump
time.  The compression technique for the string table is the same as for
the shared symbol table (see [JDK-8059510][8059510]). The regular narrow
oop encoding and decoding is used to access the shared `String` objects
from the compressed-string table.

On 64-bit platforms with compressed oop pointers, the narrow oops are
encoded using offsets (with or without scaling) from the narrow oop base.
Currently there are four different encoding modes: 32-bit unscaled, zero
based, disjoint heap based, and heap based.  Depending on the heap size
and the heap minimum base, an appropriate encoding mode is selected.  The
narrow-oop encoding mode (including the encoding shift) must be the same
at both dump time and run time, so that the oop pointers within the
shared string space remain valid at run time.  The shared-string space
can be considered relocatable, with restrictions, at run time.  It is not
required to be mapped at the same address as at dump time, but it should
be at the same offset from the narrow oop base at dump time and run time.
The heap size is not required to be the same at dump time and run time,
as long as the same encoding mode is used.  The offset of the string
space and the oop-encoding mode (and shift) should be stored in the
archive for run-time validation.  If the encoding mode changes, it will
invalidate the encoding of the oop pointer to the `char` array from each
shared `String`.  In such cases the shared-string data is ignored while
the rest of the shared data can still be used by the VM.  A warning
indicating that shared strings are not used due to incompatible GC
configuration will be reported by the VM.
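As a rough illustration of why the offset from the narrow oop base matters more than the absolute address, the sketch below models narrow-oop encoding as `address = base + (narrow << shift)`. The enum, the class, and the `archiveUsable` check are hypothetical names chosen for this example; they are not HotSpot identifiers, and the check is only one plausible shape for the run-time validation described above.

```java
class NarrowOopSketch {
    /** The four encoding modes named in the text (illustrative names). */
    enum Mode { UNSCALED, ZERO_BASED, DISJOINT_BASE, HEAP_BASED }

    /** Decode a narrow oop, treating the 32-bit value as unsigned. */
    static long decode(int narrow, long base, int shift) {
        return base + ((narrow & 0xFFFFFFFFL) << shift);
    }

    /** Encode an address as a scaled offset from the narrow oop base. */
    static int encode(long address, long base, int shift) {
        return (int) ((address - base) >>> shift);
    }

    /**
     * Hypothetical run-time check: the archived strings are usable only if
     * the mode, the shift, and the string-space offset all match dump time.
     */
    static boolean archiveUsable(Mode dumpMode, int dumpShift, long dumpOffset,
                                 Mode runMode, int runShift, long runOffset) {
        return dumpMode == runMode && dumpShift == runShift && dumpOffset == runOffset;
    }
}
```

With the same shift and the same offset from the base, a narrow oop written at dump time decodes to the corresponding location at run time even if the base itself has moved. Changing the shift breaks that property, which is why the shared-string data is then ignored.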

At run time, the string space is mapped as part of the Java heap at the
same offset from the oop encoding base as at dump time.  The mapping
starts at the lowest page-aligned address of the string space saved in
the archive.  The mapped string space contains the shared `String` and
`char`-array objects.  All G1 regions which overlap this mapped space
will be marked as pinned; these G1 regions are unavailable for run-time
allocation.  There may be unused space wasted in a region that partially
overlaps, but there should be at most one such region, at the end of the
mapping.  No patching is required for the oop pointers within the string
space since the same narrow oop encoding is used.  The shared-string
space is writable, but the GC should not write to the oops in the space
in order to preserve shareability across different processes.  An
application that attempts to lock one of these shared strings, and thus
writes to the shared space, will get a private copy of the page, and
therefore lose the benefit of sharing that particular page.  Such cases
are rare.
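The region accounting described above can be sketched with plain integer arithmetic. The sketch assumes that the mapped string space starts on a region boundary, so that any waste is confined to the final region; the class and method names are hypothetical.

```java
class PinnedRegionSketch {
    /** Index of the first region overlapping [start, start + size). */
    static long firstRegion(long heapBase, long regionSize, long start) {
        return (start - heapBase) / regionSize;
    }

    /** Index of the last region overlapping [start, start + size). */
    static long lastRegion(long heapBase, long regionSize, long start, long size) {
        return (start + size - 1 - heapBase) / regionSize;
    }

    /** Unused bytes in the final, partially covered region (0 if none). */
    static long wastedBytes(long heapBase, long regionSize, long start, long size) {
        long remainder = (start + size - heapBase) % regionSize;
        return remainder == 0 ? 0 : regionSize - remainder;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long start = 4 * mb;
        long size = 2 * mb + mb / 2;   // a 2.5 MB string space
        System.out.println("pinned regions " + firstRegion(0, mb, start)
                + ".." + lastRegion(0, mb, start, size)
                + ", wasted bytes " + wastedBytes(0, mb, start, size));
    }
}
```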

The shared-string table is distinct from the regular string table at
run time.  Both tables are searched when looking up interned strings.  The
shared-string table is a read-only table at run time; no entries can be
added or removed from it.
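A minimal sketch of the two-table lookup follows, using ordinary Java maps in place of the VM's native tables. The search order (shared table first) is an assumption made for the example, since the text only says that both tables are searched; all names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class TwoTableIntern {
    /** Stand-in for the archived table: fixed contents, never modified. */
    private final Map<String, String> shared;
    /** Stand-in for the regular string table: grows at run time. */
    private final Map<String, String> runtime = new HashMap<>();

    TwoTableIntern(Map<String, String> archived) {
        this.shared = Collections.unmodifiableMap(new HashMap<>(archived));
    }

    /** Returns the canonical instance for s, adding it to the runtime table if new. */
    String intern(String s) {
        String hit = shared.get(s);
        if (hit != null) {
            return hit;                      // found in the read-only table
        }
        return runtime.computeIfAbsent(s, k -> k);
    }

    int runtimeSize() {
        return runtime.size();
    }
}
```

A hit in the shared table never adds an entry to the runtime table, so strings that are already archived cost no additional table memory per process.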

The G1 string-deduplication table is a separate hash table containing the
`char` arrays for deduplication at run time.  When a string is interned
and added to the `StringTable`, the string is deduplicated and the
underlying `char` array is added to the deduplication table if it is not
there already.  The deduplication table is not stored into the archive.
The deduplication table is populated during VM startup using the
shared-string data.  As an optimization, the work is done in the
`G1StringDedupThread` (in `G1StringDedupThread::run()`, after
`initialize_in_thread()`) to reduce startup time.  The shared strings'
hash values are precomputed and stored in the strings at dump time to
avoid the deduplication code writing the hash values at run time.
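To illustrate why precomputing the hash matters, the sketch below mimics the lazy caching done by a JDK 8-style `String.hashCode()`, where a cached value of 0 means "not yet computed". The class, its `writes` counter, and the `archived` factory are hypothetical and exist only to make the store to the `hash` field visible.

```java
class CachedHashSketch {
    final char[] value;
    int hash;     // 0 means "not computed yet"
    int writes;   // counts stores to `hash`; for illustration only

    CachedHashSketch(char[] value) {
        this.value = value;
    }

    /** Same polynomial as String.hashCode(): h = 31 * h + c. */
    static int compute(char[] value) {
        int h = 0;
        for (char c : value) {
            h = 31 * h + c;
        }
        return h;
    }

    /** Lazily caches the hash; the store would dirty a shared page. */
    int cachedHashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            h = compute(value);
            hash = h;
            writes++;
        }
        return h;
    }

    /** Dump-time step: fill in the hash so run-time calls only read it. */
    static CachedHashSketch archived(char[] value) {
        CachedHashSketch s = new CachedHashSketch(value);
        s.hash = compute(value);
        return s;
    }
}
```

An object created through `archived` answers `cachedHashCode()` without any store, which is the property the shared-string space relies on to stay shareable.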


Testing
-------

Testing for this feature will cover the following areas:

 - Basic operation of this feature;

 - Modes that are incompatible with this feature, such as non-G1 GC and
   uncompressed object/class pointers;

 - Variation of ordinary-object-pointer encoding between dump time and
   run time;

 - Invalid string-file format;

 - Selected string operations when using this feature, such as interning
   and string comparison; and

 - Ensure that this feature does not cause heap corruption using GC
   diagnostic modes.


Dependences
-----------

The serviceability agent needs to be updated to add support for the
shared-string table (see [JDK-8079830][8079830]).

With the change proposed by [JDK-8054307][8054307], the underlying `char`
array will be changed to be a `byte` array.  The code that copies
interned strings to the string space and performs deduplication will need
to reflect that if and when JDK-8054307 is integrated.  The impact should
be minimal.



[8059510]: https://bugs.openjdk.java.net/browse/JDK-8059510
[8054307]: https://bugs.openjdk.java.net/browse/JDK-8054307
[8079830]: https://bugs.openjdk.java.net/browse/JDK-8079830

Comments
A very limited TOI will be enough here; it could be merged with other features as well.
16-12-2015

Windows (64-bit) is not currently supported for shared strings. On Windows, mapping a memory region (using MapViewOfFileEx()) within already reserved memory fails with error code 487 (ERROR_INVALID_ADDRESS). Freeing part of a reserved memory region using VirtualFree() with MEM_RELEASE is also not allowed (it fails with the same 487 error code). With the current design, the entire Java heap is reserved up front, early during VM initialization. Because the shared-string space is part of the already reserved Java heap, the VM cannot map the shared-string region at run time.
23-07-2015

GC marking issue for shared objects
===================================

During full G1 GC, the mark word in the object's header is used for GC marking. So when a full GC happens, all objects in the pinned string region are 'touched' by the GC, which makes all shared pages private.

Possible solutions:

1) Pre-mark shared objects in the CDS archive at dump time

Pros: The GC sees that the shared object is already marked and does not need to re-mark it. That avoids marking the shared object at run time.

Cons: The mark word in the object header is also used by the locking code. Many places in the locking code assume that objects are in the 'neutral' state. Pre-marking the shared objects breaks those assumptions. Changing the locking code to recognize the 'pre-marked' state is risky, as it might affect compiled code and generated code.

In JDK 7, the String objects in the CDS archive are RW and the underlying 'value' arrays are RO. The String objects in the shared archive would become 'dirty' and unsharable if a full GC happens. The 'value' arrays are pre-marked, which keeps them sharable even when a full GC happens. However, locking on the array crashes (any Java code can use reflection to get hold of the String's 'value' array).

2) Use a bitmap for object marking in full GC

Might have a large performance hit in full GC.

3) Do not follow references to pinned objects

This is the solution used.

Notes from Tom: To avoid needing to duplicate the marking code, the added overhead needs to be as small as possible. A local bitmap has been added to markSweep, as well as a local copy of the heap base address for computing indices into the map. When regions are pinned, G1 calls markSweep routines to allocate the map and set the base address, as well as to mark the map bits corresponding to pinned regions. The bitmap is at the fixed granularity of min_region_size, rather than the run-time region size, to streamline the test.

Only the check for whether the heap_base address has been set is inlined in the markSweep routines, using the is_object_pinned function. The actual check of the bitmap is out of line. This is a measured improvement when pinned regions are not in use, at the expense of a slightly larger hit when pinned regions are actually in use. The adjust_pointer code must also check for attempts to update pointers into pinned regions, and it uses the same is_object_pinned check.
06-05-2015
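The pinned-region test from the notes above could be sketched as follows. This is a Java model rather than the markSweep code itself: the 1 MB granularity, the use of `-1` for "base not set", and all names are assumptions made for the example.

```java
import java.util.BitSet;

class PinnedBitmapSketch {
    /** Assumed fixed granularity, standing in for min_region_size. */
    static final long MIN_REGION_SIZE = 1L << 20;

    private long heapBase = -1;   // -1: no pinned regions have been registered
    private BitSet pinned;

    /** Called when a range is pinned: allocate the map lazily and set its bits. */
    void pinRange(long base, long start, long size) {
        if (heapBase < 0) {
            heapBase = base;
            pinned = new BitSet();
        }
        int first = (int) ((start - heapBase) / MIN_REGION_SIZE);
        int last = (int) ((start + size - 1 - heapBase) / MIN_REGION_SIZE);
        pinned.set(first, last + 1);
    }

    /** Cheap test first: if nothing was ever pinned, skip the bitmap lookup. */
    boolean isObjectPinned(long address) {
        if (heapBase < 0) {
            return false;
        }
        return bitmapCheck(address);
    }

    /** The slower path, only reached when pinned regions are in use. */
    private boolean bitmapCheck(long address) {
        if (address < heapBase) {
            return false;
        }
        long index = (address - heapBase) / MIN_REGION_SIZE;
        return index <= Integer.MAX_VALUE && pinned.get((int) index);
    }
}
```

Splitting the test this way keeps the common case (no pinned regions) down to a single comparison, at the cost of an extra call when pinned regions exist.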

JDK-8054307 may change the char[] array to a byte[] array.
19-02-2015

Hi Peter, thank you so much for the ideas and suggestions. For the 0 hash strings, not putting them in the shared string space is a good idea. Yes, setting the String.hash field needs to be done at dump time. I'll make that clear in the JEP.
13-02-2015

Re: "Another optimization is to pre-hash the shared strings during dump." If by "hash" you mean to fill in the value of the String.hash field, then you *have* to do that during dumping, or a runtime call to String.hashCode() will write the String.hash field, which will destroy the sharing of your String space across multiple processes. (Or segfault if you manage to get the String space in a read-only page of memory. :-) Special attention needs to be paid to Strings whose hash is 0, since the code in String.hashCode() will try to (over)write the 0 in the String.hash field at each call to String.hashCode(). For example, "\000" creates a String whose hashCode is 0. An option is not to put those Strings in the interned String space, but instead to create one interned instance of each one in the table of Strings that are interned at runtime.
12-02-2015
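The zero-hash case from the comment above is easy to reproduce. In the sketch below, `eligibleForArchive` is a hypothetical dump-time filter that follows the suggestion to keep such strings out of the shared space; it is not taken from the implementation.

```java
class ZeroHashDemo {
    /** True if the string's hash is 0, so the cached field never looks "filled in". */
    static boolean hashIsZero(String s) {
        return s.hashCode() == 0;
    }

    /** Hypothetical dump-time filter: archive only strings with a non-zero hash. */
    static boolean eligibleForArchive(String s) {
        return !hashIsZero(s);
    }

    public static void main(String[] args) {
        System.out.println("\"\\000\" has zero hash: " + hashIsZero("\000"));
        System.out.println("\"abc\" eligible: " + eligibleForArchive("abc"));
    }
}
```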

Re: "The shar[e]d string table is separate from the regular string table at runtime. Both tables are searched when looking up interned strings." An alternative to storing the interned String *table* when you build the CDS archive is to store a *sorted list* of Strings. The advantage is that you have no overhead for table entries. The disadvantage is that you have to binary search the sorted list rather than being able to hash search the table. A compromise is to hash-search for the String in the runtime interned String table, and if you do not find it there, binary search for it in the CDS interned String list. If you find the String in the CDS interned String list, you create an entry for the String in the runtime interned String table so you find it more quickly the next time someone asks if there is an interned String with this value. Note that you do not copy the String object from the CDS String list, you just reference it from the runtime table.
12-02-2015
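The compromise described in the comment above (hash search first, binary search of the archived list as a fallback, then caching a reference) might look roughly like this. It is a sketch of the suggested alternative, not of the design the JEP adopted, and every name in it is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class SortedListIntern {
    /** Stand-in for the archived strings, sorted once at "dump time". */
    private final String[] archived;
    /** Stand-in for the runtime interned-string table. */
    private final Map<String, String> runtime = new HashMap<>();

    SortedListIntern(String[] strings) {
        archived = strings.clone();
        Arrays.sort(archived);
    }

    String intern(String s) {
        String hit = runtime.get(s);              // 1. hash search
        if (hit != null) {
            return hit;
        }
        int i = Arrays.binarySearch(archived, s); // 2. binary search
        // 3. Reference the archived instance (no copy), or adopt s itself.
        String canonical = (i >= 0) ? archived[i] : s;
        runtime.put(canonical, canonical);
        return canonical;
    }
}
```

After the first lookup of an archived string, later lookups are served by the hash table, so the binary search cost is paid at most once per distinct string.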