JDK-8303182 : compressed Symbol pointers
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • Submitted: 2023-02-24
  • Updated: 2024-01-17
  • Resolved: 2024-01-17
Related Reports
Relates :  
Relates :  
Description
We might store resolved Utf8 strings not as 8-byte Symbol* words but as 4-byte SymbolRef words.

This would require something like a globally reserved area for storing symbol data, of size up to 4 billion times some grain size (such as 16).  We do something like this already for compressed oops and/or compressed classes.

It seems likely that all symbols ever used by any one JVM instance will fit into 64 gigabytes, even with some fragmentation overhead.
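
The 64-gigabyte figure follows from the encoding: a 32-bit index scaled by a 16-byte grain can address 2^32 * 16 bytes. A minimal sketch of such an encoding, assuming one reserved area and a fixed shift of 4. The helper names (encode_offset, decode_offset, decode) and the constants are hypothetical, not existing HotSpot APIs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this is an illustration, not HotSpot code.
typedef uint32_t SymbolRef;                  // 4-byte compressed reference

const int      kGrainShift   = 4;            // assumed grain size: 16 bytes
const uint64_t kGrainSize    = uint64_t(1) << kGrainShift;
// Bytes addressable through a 32-bit index: 2^32 * 16 = 64 GB.
const uint64_t kMaxAreaBytes = (uint64_t(1) << 32) * kGrainSize;

// Compress a byte offset within the reserved symbol area.
inline SymbolRef encode_offset(uint64_t offset) {
  assert(offset % kGrainSize == 0 && "symbols must be grain-aligned");
  assert(offset < kMaxAreaBytes && "offset outside the reserved area");
  return static_cast<SymbolRef>(offset >> kGrainShift);
}

// Expand a compressed reference back to a byte offset.
inline uint64_t decode_offset(SymbolRef ref) {
  return static_cast<uint64_t>(ref) << kGrainShift;
}

// Decoding to an address adds the area base, in the style of compressed oops.
inline const char* decode(const char* area_base, SymbolRef ref) {
  return area_base + decode_offset(ref);
}
```

The cost of the smaller reference is the alignment requirement: every symbol must start on a grain boundary, which is the fragmentation overhead mentioned above.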

Moving from an 8-byte to a 4-byte representation for symbols, as stored in metadata, might also reduce footprint.

Background:

An unresolved `CONSTANT_Utf8` is represented very compactly as a two-byte `u2` index into a contextually defined constant pool.  HotSpot metadata is organized to try to keep this representation where possible.

When a symbol is resolved, it is stored as a C++ pointer to a compactly organized record: a header containing a length and a (saturatable) reference count, immediately followed by the Utf8 bytes.  This is reasonably compact.
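
As an illustration of the record shape described above, here is a simplified sketch: a small header holding a length and a saturating reference count, with the Utf8 bytes allocated directly behind it. The type SymbolRecord and its helpers are hypothetical, and the field widths are assumptions; this is not the actual HotSpot Symbol layout:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical, simplified record; not the real HotSpot Symbol class.
struct SymbolRecord {
  uint16_t length;     // number of Utf8 bytes that follow the header
  uint16_t refcount;   // saturating reference count
  // The Utf8 bytes live immediately after the header.
  char* body() { return reinterpret_cast<char*>(this + 1); }
};

// Once the count reaches this value it sticks, and the symbol is never freed.
const uint16_t kRefcountMax = 0xFFFF;

// Allocate header and bytes in one block, so one pointer reaches both.
inline SymbolRecord* make_symbol(const char* utf8, uint16_t len) {
  void* mem = std::malloc(sizeof(SymbolRecord) + len);
  assert(mem != nullptr);
  SymbolRecord* s = static_cast<SymbolRecord*>(mem);
  s->length = len;
  s->refcount = 1;
  std::memcpy(s->body(), utf8, len);
  return s;
}

inline void increment(SymbolRecord* s) {
  if (s->refcount != kRefcountMax) s->refcount++;   // saturate, never wrap
}

inline void decrement(SymbolRecord* s) {
  if (s->refcount == kRefcountMax) return;          // saturated: permanent
  assert(s->refcount > 0);
  s->refcount--;
}
```

With this layout the per-symbol overhead is the 4-byte header; the reference that points at the record (8 bytes today) is the part this RFE proposes to shrink.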

(The compactness could possibly be improved in the case of method signatures which repeat class names.  Such schemes have been evaluated in the past.  They have been difficult to implement.  Perhaps something can be done about this in the future.  For example, it would be simpler and almost as effective to store common prefixes of symbols, so each symbol would be broken into two physical parts, one of which was shareable.  That is an RFE for a different day.)

For places where we have to store resolved symbols, such as the constant pools themselves, it may be helpful to store them in 4 bytes instead of 8 bytes.

Even in places where, today, we store symbols in unresolved 2-byte indexes (e.g., methods), it may be profitable to expand them to 4-byte resolved references, simply to reduce the dynamic overhead of decoding.

There is probably no reason to use compressed symbol pointers during "live" processing (in a C++ stack frame).  The 8-byte type SymbolHandle is the right choice there.

This RFE is tentative, because we already have good coverage by SymbolHandle for "live" cases and contextually defined u2 indexes for "at rest" cases, with limited use of Symbol* "at rest" in constant pools to link everything together.

However, if we have tables in HotSpot that make heavy use of C++ Symbol* pointers to represent resolved symbols, it may be worth the effort of using compressed symbol references in those tables.  The dictionaries proposed in JDK-8301007 are an example of such tables.  Class loader constraints are another example.
Comments
Runtime Triage: This is not on our current list of priorities. We will consider this feature if we receive additional customer requirements.
17-01-2024

I have some more statistics: the default CDS archive is 14,483,456 bytes. In the metadata objects, we have a total of 158,587 Symbol* pointers. If we reduce these pointers from 8 to 4 bytes, we can save a total of 158,587 * 4 = 634,348 bytes, or about 4.4% of the archive. Most of the Symbol* pointers are in ConstantPools.
26-02-2023

Data from the field suggests that even large apps load fewer than 10M symbols, and that the average size of a symbol is no more than about 100 bytes (including overheads). This suggests any segment size of 1 GB or larger would serve normal applications, so that 4 GB to 64 GB (as suggested above) is a safe limit to work under.

Combining some ideas from various places, something like this might (or might not) be profitable:

    struct SymbolHeader {
      enum {
        _refcount_mask        = 0x0000FFFF,
        _flag_has_prefix      = 0x00010000, // if this->_prefix is present
        _flag_valid_class     = 0x00020000, // valid class or interface name
        _flag_valid_signature = 0x00040000, // valid field or method descriptor
        _flag_mask            = 0x00070000, // all three flag bits
        _hash_mask            = ~(_flag_mask | _refcount_mask), // selected bits of hash
      };
      u4 _hash_flags_and_refcount;
      u2 _length;
      u1 _prefix_copy[2];
      // probe with first 64 bits with refcount & flags masked off
      union {
        struct {
          Symbol* _ref;   // raw pointer to sharable prefix, if present
          u1 _suffix[4];
        } _prefix;
        u1 _body[8];      // if prefix is not present
      };
      u8 probe_key() {
        union { u8 key; SymbolHeader copy; } u;
        u.copy = *this;
        u.copy._hash_flags_and_refcount &= _hash_mask;
        return u.key;     // should compile to something like *(u8*)this & ~0x7FFFF
      }
    };

The refcount can co-exist with any desired flags as well as the hash (which is what it co-exists with today). This design trades hash bits for flags. But that is partially made up for by having the first two bytes unconditionally present, even if the prefix is used. That makes some queries faster, and the two bytes can be used (along with most of the other 64 initial bits) for making a fast hash probe. A symbol of fewer than 2 bytes would store zeroes there.
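
The masked probe described above can be tried in isolation. The following is a sketch under assumptions, not the proposed implementation: it models only the first 64 bits of the header in a hypothetical HeaderPrefix type, and it copies the bytes with memcpy instead of reading through a union, which sidesteps type-punning concerns in standard C++:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the first 8 bytes of the header sketched above.
struct HeaderPrefix {
  uint32_t hash_flags_and_refcount;
  uint16_t length;
  uint8_t  prefix_copy[2];   // first two Utf8 bytes, zero-filled if shorter
};
static_assert(sizeof(HeaderPrefix) == 8, "probe key must cover exactly 64 bits");

const uint32_t kRefcountMask = 0x0000FFFF;
const uint32_t kFlagMask     = 0x00070000;
const uint32_t kHashMask     = ~(kFlagMask | kRefcountMask);

// Build a 64-bit key from the header with refcount and flag bits cleared,
// so the key is stable while the refcount and flags change.
inline uint64_t probe_key(const HeaderPrefix& h) {
  HeaderPrefix copy = h;
  copy.hash_flags_and_refcount &= kHashMask;
  uint64_t key;
  std::memcpy(&key, &copy, sizeof key);
  return key;
}
```

Two headers that differ only in refcount or flags yield the same key, while a difference in the hash bits, the length, or the first two bytes changes it. The numeric value of the key depends on byte order, so it is only meaningful within one process.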
24-02-2023