Summary
-------
Reduce the Java heap live-data set by enhancing the G1 garbage collector so that
duplicate instances of `String` are automatically and continuously deduplicated.
Non-Goals
---------
It not a goal to implement this feature for garbage collectors other than G1.
Motivation
----------
Many large-scale Java applications are currently bottlenecked on memory.
Measurements have shown that roughly 25% of the Java heap live data set in these
types of applications is consumed by `String` objects. Further, roughly half of
those `String` objects are duplicates, where duplicates means
`string1.equals(string2)` is true. Having duplicate `String` objects on the heap
is, essentially, just a waste of memory. This project will implement automatic
and continuous `String` deduplication in the G1 garbage collector to avoid
wasting memory and reduce the memory footprint.
Description
-----------
### String Deduplication
The `String` class has two fields:
    private final char[] value
    private int hash
The `value` field is implementation-specific and not observable from outside of
the `String` class itself. The `String` class does not modify the contents of
the `char[]` array, nor does it synchronize on the array object itself. This
means that it can safely and transparently be used by multiple instances of
`String` at the same time.
Deduplicating a `String` object is conceptually just an re-assignment of the
`value` field, _i.e._, `aString.value = anotherString.value`. The actual
re-assignment is however done by the VM, which in turn means that the `final`
property of the `value` field is not a problem in practice.
We are not actually deduplicating the `String` objects, only their backing
character arrays. Deduplicating the actual `String` object cannot be done
safely, since such a change would be observable from the Java application and
could cause problems if, for example, the application used that object for
synchronization or in some other way relied on the object's identity.
String deduplication will not require any changes to the JDK class library or to
any other existing Java code.
### Expected Benefit
Measurements done on a large number of Java applications (big and small) have
shown the following:
  - Average percent of live heap data set occupied by `String` objects = 25%
  - Average percent of live heap data set occupied by duplicate `String` objects
    = 13.5%
  - Average `String` length = 45 characters
Given that we are only deduplicating character arrays we will still carry the
overhead of the `String` objects (object header, fields, and padding). This
overhead is platform/configuration dependent and varies between 24 and 32 bytes.
However, given an average `String` length of 45 characters (90 bytes + array
header) there is still a significant win to be had.
Taking the above into account, the actual expected benefit ends up at around
*10%* heap reduction. Note that this number is a calculated average based on a
wide range of applications. The heap reduction for a specific application could
vary significantly both up and down.
## Implementation
### Overview
When garbage collection is performed, live objects on the heap are visited. For
each object we visit a check is applied to see if the object is a candidate for
string deduplication. If the check indicates that this is a candidate then a
reference to the object is inserted into a queue for later processing. A
deduplication thread runs in the background and processes the queue. Processing
a queue entry means removing it from the queue and attempting to deduplicate the
`String` object it references. A hashtable is used to keep track of all unique
character arrays used by `String` objects. When deduplicating, a lookup is made
in this table to see if there is already an identical character array somewhere
on the heap. If so, the `String` object is adjusted to point to that character
array, releasing the reference to the original array allowing it to eventually
be garbage collected. If the lookup fails the character array is instead
inserted into the hashtable so that this array can be shared at some point in
the future.
### Candidate Selection
Candidate selection is done during young/mixed and full collections. This is a
performance sensitive operation since it is applied to all visited objects.
An object is considered a deduplication candidate if all of the following
statements are true:
  - The object is an instance of `String`,
  - The object is being evacuated _from_ a young heap region, and
  - The object is being evacuated _to_ a young/survivor heap region and the
    object's age is _equal to_ the deduplication age threshold, **or** the
    object is being evacuated _to_ an old heap region and the object's age is
    _less than_ the deduplication age threshold.
Once a `String` object has been promoted to an old region, or its age is higher
than the deduplication age threshold, it will never become a candidate again.
This approach avoids making the same object a candidate more than once.
Interned strings are a bit special. These are explicitly deduplicated before
being inserted into the `StringTable` (see below for details on why). These can
later also become deduplication candidates if they reach the deduplication age
threshold or are evacuated to an old heap region. The second attempt to
deduplicate such strings will be in vain, but we have no fast way of filtering
them out. This has been shown to not be a problem, as the number of interned
strings is usually dwarfed by the number of normal (non-interned) strings.
### Deduplication Age Threshold
It is assumed that `String` objects either live for a very short time *or* live
for a long time. Deduplicating objects that will die soon is just a waste of CPU
and memory resources. To avoid deduplicating strings too early the deduplication
age theshold dictates how old a `String` object must be before it will be
considered a candidate for deduplication. This threshold will have a reasonable
default, but will also be configurable using a VM option.
### Deduplication Queue
The deduplication queue actually consists of several queues, one queue per GC
worker thread. This allows lock-free and cache-friendly enqueue operations by
the GC workers. This is important since these operations are done during a
stop-the-world phase.
### Deduplication Hashtable
The deduplication hashtable is used to keep track of all unique character arrays
(which are attached to `String` objects) found on the heap. When a deduplication
candidate is processed, a lookup is made in the hashtable to see if an identical
character array already exists. If the lookup is successful the `String`
object's `value` field is updated to point to the character array found in the
hashtable, allowing the garbage collector to eventually collect the original
array. If the lookup fails the character array is instead added to the hashtable
so that this array can be shared at some point in the future. A character array
is removed from the hashtable when it is garbage collected, _i.e._, when all
`String` objects referring to it have become unreachable.
The hashtable is dynamically resized to accommodate the current number of table
entries. The table has hash buckets with chains for hash collision. If the
average chain length goes above or below given thresholds the table grows or
shrinks accordingly.
The hashtable is also dynamically rehashed (using a new hash seed) if it becomes
severely unbalanced, _i.e._, a hash chain is significantly longer than average.
This is similar to how `StringTable` handles an unbalanced hashtable.
For workloads that produce a large number of unique strings, where there is
little opportunity for deduplication, the hashtable could consume more memory
than deduplication frees. In those cases string deduplication should not be
enabled. The deduplication statistics printed to the GC log will give guidance
in making such decisions.
### Deduplication Thread
The deduplication thread is a VM internal thread which runs concurrently with
the Java application. This thread is where the actual deduplication work is
done. It waits for `String` object references to appear on the deduplication
queue and starts to dequeue them one by one. For each `String` it dequeues, it
computes the string hash code (if needed), looks it up in the deduplication
hashtable and possibly deduplicates the string. The deduplication thread
maintains deduplication statistics (number of candidates inspected, number of
strings deduplicated, etc) which it can print to the GC log.
### Interned Strings
When a `String` is interned (`String.intern()` is invoked) it will be
deduplicated before it is inserted in the `StringTable`. This ensures that once
a `String` has been interned it will never be deduplicated again. Deduplicating
a `String` after it has been interned has shown to be a bad idea since it will
counteract compiler optimizations done for string literals. Some optimizations
assume (and rightly so) that the `String.value` field is never changed to point
to a different array. With this knowledge the compiler can emit code with the
address of the character array as an immediate value. This optimization allows,
for example, `String.equals()` to do a simple pointer comparison in a fast path.
If the array is moved by the GC the address will be adjusted accordingly in such
code blocks. However, if `String.value` is outside of the GC the optimization
will silently fail and fall back to the normal (slower) character by character
comparison.
### Impact on GC Pause Times
The following items can/will affect GC pause times:
  - Candidate selection is done in the hot path for marking (full collections)
    and evacuation (young/mixed collections).
  - Both the deduplication queue and hashtable stores `oop`s which are treated
    as weak references from a GC point of view. This means that the GC needs to
    traverse both structures to adjust or remove references to objects that were
    moved or garbage collected. Traversing the queue and the hashtable is the
    most performance-critical part of this feature. The traversal is done in
    parallel using all GC worker threads.
The assumption is that a high enough deduplication success rate will balance out
most or all of this impact, because deduplication can reduce the amount of work
needed in other phases of a GC pause (like reduced amount of objects to
evacuate) as well as reduce the GC frequency (due to reduced pressure on the
heap).
### Command-line Options
The following new command-line options will be made available:
  - `UseStringDeduplication` (`bool`) --- Enable string deduplication
  - `PrintStringDeduplicationStatistics` (`bool`) --- Print detailed
    deduplication statistics
  - `StringDeduplicationAgeThreshold` (`uintx`) --- `String` objects reaching
    this age will be considered candidates for deduplication
Alternatives
------------
There are numerous other ways of deduplicating `String` objects.
  - Deduplicate at `String` creation time
    The problem with this approach is that many or most `String` objects die
    young, and the overhead of computing the hash code and finding an existing
    equal character array is not insignificant.
  - Use `String.intern()` explicitly in the code
    In some cases this is indeed the best way to avoid duplicated `String`
    objects to start with, but there are a few problems with this approach.
    One problem is that `String.intern()` returns the exact same `String` object
    for all equal strings. Unless extreme care is taken there may be functional
    regressions, for example in cases where `String` objects are used for
    synchronization.
    Another problem is that in many cases developers do not know where in the
    code they should use `String.intern()`, or it may even be hard to find the
    code and/or people responsible for the code and have the code updated in the
    first place.
    Finally, the current `String.intern()` implementation does not scale that
    well, which means the operation can be very expensive.
  - Profile and inject the equivalent of `String.intern()` invocations
    One can also profile existing applications and find out where duplicated
    `String` objects are typically being stored, and use frameworks like
    `java.lang.instrument` to inject `String.intern()` calls in suitable places.
    This has the advantage of not having to update the source code itself, but
    rather change the byte code dynamically for the actual workload. One of the
    problems of doing this is that it is not as straightforward to understand
    how frequently fields are updated, so if the `intern()` invocations are
    injected in hot paths it may impact performance significantly. Also, if the
    source code is changed the profiling may need to be redone which could be a
    costly and, to some extent, manual effort.
  - Deduplicate in `String.equals()` and `String.compareTo()`
    When two `String` objects are compared and the result shows that they are
    equal, before returning the result these methods could adjust one of the
    strings to use the other string's backing character array. A prototype for
    this was implemented, which worked fairly well. The main advantage of this
    approach is that there is zero memory overhead because there is no need to
    keep a deduplication hashtable around.
    There are, however, a few obvious limitations with this approach. First, two
    `String` objects need to be compared for deduplication to happen. This means
    a large part of all deduplication candidates are missed, because not all
    strings are compared. Further, the VM has compiler intrinsics for these
    methods, which complicates the implementation since it is not just about
    making adjustments to the `String` class itself. There are a few other
    technical issues, which all in all makes this a less attractive approach.
  - One-off deduplication
    Instead of doing deduplication continuously it could also be done as a
    one-off operation. In short this could be implemented by locating all the
    `String` objects on the heap, building up a deduplication hashtable on the
    fly, deduplicate the `String` objects as needed and then release the
    hashtable and other temporary data structures. This is significantly easier
    to implement than continuous deduplication and also has the advantage of not
    adding to the memory footprint when deduplication is not being done. One
    could further imagine that, if there is a need, these one-off deduplication
    operations could be scheduled to execute occasionally, thereby making it
    semi-continuous.
    A prototype of this approach was developed which used JVMTI to scan the heap
    for `String` objects. There were a few problems. First, results from a
    large scale Java EE type of workload showed that it is beneficial to do
    deduplication continuously, not just now and then. If we end up executing
    this operation frequently then the overhead of scanning the whole heap and
    rebuilding the hashtable every time becomes significant. Further, doing
    this type of work with JVMTI is a bit too inflexible when it comes to
    selecting which `String` objects to deduplicate and when.
Testing
-------
`jtreg` tests will be added to make sure the deduplication works as expected.
System tests and performance tests are needed to assess the Java heap live
data set reduction and performance regressions/improvements.
Risks and Assumptions
---------------------
  - It is assumed that the introduction of the "is deduplication enabled"-check
    in the hot path of the garbage collector marking/copying logic does not add
    a significant overhead. This needs to be verified.
  - Normally a `String` object and its corresponding character array will be
    placed next to each other in memory, leading to good cache locality. After
    deduplicating a `String` object its character array will be further away,
    which might introduce a slight performance overhead. Initial measurements
    have, however, shown that this does not seem to be a problem. In fact, after
    deduplication the total amount of accessed memory is reduced which tends to
    result in improved cache hit rates.
Impact
------
  - Performance/scalability: Changes might affect GC pause times and cache hit
    rate when accessing the backing character array of `String` objects. We'll
    need to run tests to assess performance impact.