JDK-8198332 : [Lilliput] Remove klass word from objects
  • Type: JEP
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P4
  • Status: Draft
  • Resolution: Unresolved
  • Submitted: 2018-02-17
  • Updated: 2022-04-14
Description
Investigate downsizing average object size by removing (or downsizing) the klass word from the `oopDesc` layout.

DRAFT DRAFT DRAFT

The `klass` word in the object header points to the internal type
metadata for the object.  This word can be 64 bits, or if compressed, 32 bits.

There are well-known techniques for reducing the average size of this
field, by assigning short encodings for a number of "favored" or "well known"
klass pointers.  A typical Java heap has a small number of types that
contribute the bulk of all objects.  This means that an 8-bit number is
often enough to address the klasses of 90% of the objects in the heap.
(Also, a 4-bit number will often cover over 50% of the heap, while a
10-bit number will often cover over 98%.)

Therefore, if we could find just a few extra bits to store a "common klass"
encoding in an object, most of the klass header could be omitted.
In doing this, the density of the heap would go up by a significant factor, basically `(1+3/S)`, where `S` is the average object size; since `S` is typically something like 30-40 bytes (except for arrays), the overhead of the klass word can often be a significant percentage.
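As a back-of-the-envelope sketch of that factor, assuming roughly 3 bytes of klass overhead are recovered per object (the function name is illustrative, not HotSpot code):

```cpp
#include <cassert>
#include <cmath>

// Shaving ~3 bytes of klass overhead from an object of average size S bytes
// improves heap density by about (1 + 3/S), per the formula above.
double density_gain(double avg_object_size_bytes) {
    return 1.0 + 3.0 / avg_object_size_bytes;
}
```

For `S` in the quoted 30-40 byte range this works out to roughly a 7-10% density improvement.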

Of course, after the first `2^N` code points are used to encode common
klasses (where `N` is some number like 8 or 10) then a larger klass field
will be required in objects of subsequently created klasses.  Since object
layout is performed when a class is loaded, it is reasonable to allocate
extended (32-bit) klass fields only in those later types which do not acquire
a short (8-bit or 10-bit) klass index.
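A minimal sketch of that two-phase numbering, assuming load order decides who gets a short index (the class name and the reservation of index 0 as the escape code are assumptions, not HotSpot code):

```cpp
#include <cstdint>
#include <optional>

// Two-phase klass numbering at class-load time: the first 2^bits - 1 loaded
// klasses receive a short header index; later klasses get none and fall back
// to a full 32-bit klass field when their object layout is computed.
class ShortKlassIndexAllocator {
public:
    explicit ShortKlassIndexAllocator(unsigned bits)
        : limit_((1u << bits) - 1) {}

    // Index 0 is reserved as the escape code ("consult the full klass field"),
    // so indices 1 .. 2^bits - 1 are handed out in load order.
    std::optional<uint32_t> allocate() {
        if (next_ > limit_) return std::nullopt;  // short codes exhausted
        return next_++;
    }

private:
    uint32_t next_ = 1;
    uint32_t limit_;
};
```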

If non-compressed oops are in use, and if a decision is made to color
managed pointers, and if `N` bits (N = 8 or 10 or even 16) are available
as ignored color bits in the 64-bit pointer, then the klass bits (for a favored
klass) can *also* be hoisted into every reference to that object.  This
would realize the very old idea of a _tagged pointer_ to an object.
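A sketch of such a tagged reference, assuming the top `N` bits of a 64-bit pointer are ignored color bits; the constant values and helper names here are illustrative assumptions, not an actual HotSpot encoding:

```cpp
#include <cstdint>

// If the top kTagBits of a 64-bit managed pointer are ignored color bits,
// a favored klass index can ride along in every reference to the object.
constexpr unsigned kTagBits  = 10;
constexpr unsigned kTagShift = 64 - kTagBits;
constexpr uint64_t kAddrMask = (uint64_t(1) << kTagShift) - 1;

constexpr uint64_t tag_ref(uint64_t addr, uint64_t klass_index) {
    return (klass_index << kTagShift) | (addr & kAddrMask);
}
constexpr uint64_t addr_of(uint64_t ref)  { return ref & kAddrMask; }
constexpr uint64_t klass_of(uint64_t ref) { return ref >> kTagShift; }
```

Address translation would mask (or simply ignore) the color bits, while klass checks on a favored class could avoid touching object memory at all.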

This scheme can be combined with JDK-8198331 (mark word removal).
The first 16 bits of an object could be allocated to a hyper-compressed
klass pointer plus a few mark-word control bits (to signal the presence
of a lock and/or hash code).  The 32-bit word *containing* those 16 bits
could be allocated to a full klass pointer (or index), but in most cases
the other 16 bits would be available for ordinary instance fields or
other metadata.
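One possible division of those first 16 bits, as a hedged sketch: the text only says "a hyper-compressed klass pointer plus a few mark-word control bits", so the 14/1/1 split below is purely illustrative.

```cpp
#include <cstdint>

// Hypothetical split of the first 16 header bits: a 14-bit hyper-compressed
// klass index plus two mark-word control bits (lock present, hash present).
constexpr uint16_t make_header16(uint16_t klass_idx, bool locked, bool hashed) {
    return uint16_t((klass_idx & 0x3FFF)
                    | (uint16_t(locked) << 14)
                    | (uint16_t(hashed) << 15));
}
constexpr uint16_t klass_index(uint16_t h) { return h & 0x3FFF; }
constexpr bool     is_locked(uint16_t h)   { return (h >> 14) & 1; }
constexpr bool     is_hashed(uint16_t h)   { return (h >> 15) & 1; }
```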

Comments
These notes were posted before Lilliput. They are of possible use in that project, if it embraces a two-phase class-index format, with fewer bits for common classes.
13-04-2022

The Klass "layout helper" code was written to reduce more expensive pointer chasing through v-tables, and I'd certainly like to see if it can be made even cheaper by artful summarization in the header. Here are a few more thoughts on that, which may or may not make it into this work.

* The layout helper currently gives object length and array-ness (and a few other edge cases). It could be extended to summarize the "Top Hundred" most important shapes too, so that oop maps (which require lots of indirections) don't need to be consulted for common pointer-finding operations.

* Any reasonable bitfield size (8-16 above) is going to be too small to encode all klasses and/or layout-helper shapes (think hundreds of shapes and millions of classes). So there has to be at least one "escape" code which means, "sorry, you have to look at my Klass for that".

* The size of the bitfield(s) should probably be tunable, at least at first, to find the sweet spot. The sweet spot is going to be hit when the cost of loading and decoding the header summary is significantly less than recognizing an escape code and doing the old thing with the Klass. That in turn depends on the average number of objects which require escapes, which in turn depends on how many bits W the header field has. We can encode 2^W-1 distinct cases in that field, and at best they are the most common such cases (layouts and/or classes) that will be found in some particular walk over the heap, so that we can fast-path as many objects as possible. Tuning W large enough to catch a good P = 50%-90% of cases on the fast path is the challenge. If there are N total cases, ideally we would have 2^W > N (maybe 2^W = N+1, counting the escape code), so that we encode everything in the header. But W might need to be significantly smaller than lg N.

* If the statistics are Zipf-like (as often seems to be the case with stuff like this), then the n-th most common case will have frequency proportional to 1/n^p (p a parameter at or just above 1). The relevant formula for fast-path success is P = H[2^W,p]/H[N,p]. Since H is log-like (for p near 1), sadly, P = W/lg N, so we need W ~ lg N / 2 to reach P ~ 50%; not spectacular, but maybe helpful. I think it's worth exploring, since (a) the details of H make P more like 55% if p=1, and (b) if the power law is more favorable (p=1.1, say), then P gets better still, and scales well at large values of N. Or so the graphs tell me. It makes me curious what "p" is (and whether we have a CDF of H(n,p)) for typical application heaps.

* When decoding the summary bits from the header, the worst case is a table lookup, for speed, but that probably gives the best case for compression of header bits and thus P (frequency of fast path). One might think that's no better than grabbing a decoded word directly from the Klass itself, and that might be right, but it's possible that a compact table lookup would allow more of the D$ to be used for the objects themselves, instead of metadata bits that aren't being used. Especially if the decode table were phase-specific, and made as small as possible. Especially with the next point...

* A compromise might be to make certain empirically common layouts (such as arrays) decodable on an extra-fast path; this is approximately what the layout helper does today, and the same thing might work if a shortened layout helper were moved into the object header. But you want it to hold a tiny oop map too, I think, in which case only the simplest of possible layouts are directly decodable.

* Direct decode fits nicely with phase-specific tables, if there's one particular phase or operation which needs direct decode, and the rest can be handled with table lookup. The whole summary can have a "decodable" bit (or bits) that either feeds into a table index or (in some fast-fast paths) is tested for a "fast fast" configuration that allows the table lookup to be elided.

Example 1: Suppose there are only 1000 object shapes, and we have 8 bits to summarize them. H(255)/H(1000) is 82%, so we'll get a Pareto-style fast path using a table lookup. Put general arrays on the slow path, but arrays whose size matches one of the common shapes get processed via the table (this means that arrays of the same type but different lengths get different summaries, in this example). (The numbers 8 and 1000 are pulled out of the air for this example; in reality they need to be determined empirically.)

Example 2: Suppose we find table lookup is not so good for size and oop computation, but we still want 8 bits of summary. Allocate one bit as a "fast fast" bit, and keep the whole table at 256 elements, but take half of the code points as a directly decodable 7-bit summary of object layouts. (Our 82% will decrease somewhat, since surely not everything that decodes from the summary will deserve its place in the table.) Decode object length from (say) 5 of the 7 bits; this probably matches the common object sizes nicely. Decode oop locations from the size and the remaining 2 bits; not all possible oop locations will be supported, but support the most common ones produced by the layout algorithm of the class-file parser (it's predictable). Under the assumption that we want to encode sizes 0..N and that the M <= N oops are always placed at the beginning (or end) of the object, there are (N+1)(N+2)/2 cases; the triangular matrix can also be densely coded in 7 bits at the expense of a few more instructions in the decode path. In fact, the condition M <= N could be the condition which gates the fast-fast path, with the other ill-conditioned pairs (N,M) being handled via table lookup.
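The Zipf estimate above can be sketched numerically; `harmonic` and `fast_path_hit_rate` are illustrative names for the quantities the comment calls H and P:

```cpp
#include <cmath>

// Generalized harmonic number H(n, p) = sum over k = 1..n of 1/k^p.
double harmonic(unsigned n, double p) {
    double h = 0.0;
    for (unsigned k = 1; k <= n; ++k) {
        h += 1.0 / std::pow(double(k), p);
    }
    return h;
}

// With case frequencies proportional to 1/n^p, the fast-path hit rate of a
// W-bit summary over N total cases is P = H(2^W - 1, p) / H(N, p)
// (one of the 2^W code points is reserved as the escape code).
double fast_path_hit_rate(unsigned width_bits, unsigned total_cases, double p) {
    return harmonic((1u << width_bits) - 1, p) / harmonic(total_cases, p);
}
```

With W = 8, N = 1000 and p = 1, this reproduces the H(255)/H(1000) ≈ 82% figure quoted above.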
09-12-2020

The summarization game can also be played with a full 64-bit bitfield (or 128 bits, etc.), in which case the sweet-spot design is likely to include features such as separate length and bit-mask subfields (the bit mask shows where the oops are, in great detail, up to a maximum size such as 32 oops). The size field need not be more than about 10 bits (assuming a slow path), and this also leaves space for another field (also 10 bits) which locates the first oop, thus avoiding a common FFS operation on the bit mask. In this way, a great number of object layouts can be on the "fast fast" path, which means they can be decoded just by simple ALU operations on the 64-bit summary.

What about table lookup and slow paths? That's where a 64-bit summary is also interesting, since it can *simultaneously* be a pointer to slow-path information, such as a record holding the size and one or more oop maps. This can be done with a simple low tag on the summary word: even means "it's a pointer", while odd means "it's an immediate". If the slow-path case (pointer) points directly to a Klass, then any amount of information can be fished out of that.

If this is the design, then the remaining 11 bits (we have already spoken for 32+10+10+1) can be used as a summary of the Klass itself, assuming we don't already have a fast way to get that value. This could be the case if we use *all* of the mark word for this summary, in which case the _klass field can be elided if the Klass is one of the "Top Thousand" most common classes. Since there are other likely subscribers for mark-word bits (age, val-vs-ref, hash code, other status), it's not likely that Klass + layout can hog 50 or more of the header bits, but it's probably worth keeping in mind.

If the last 11 bits are free, and we are *only* encoding shape (nothing else derived from the Klass), then they could also be used as the size of the first block of oops (remember that there's already a field which tells us where they are). The bit mask (32 bits) would often be zero, in the very common case where there's only one such block. Thus, the common case today of a single oop-map record would be encoded (along with the object size) in a single 64-bit summary word, directly decodable without additional memory references, for any object of reasonable size. A second block (somewhere after the first, after intervening primitives) would also easily fit into this scheme, leaving about 20 bits free for a bit mask (or a third block of oops).
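A sketch of one possible packing for that 64-bit summary word; the field order is an assumption, but the widths follow the 1+10+10+32+11 accounting above:

```cpp
#include <cstdint>

// Hypothetical 64-bit summary word:
//   bit  0      : low tag (1 = immediate summary, 0 = pointer to slow-path info)
//   bits 1..10  : object size (10 bits)
//   bits 11..20 : position of the first oop (10 bits; avoids an FFS on the mask)
//   bits 21..52 : oop bit mask (32 bits; zero in the common one-block case)
//   bits 53..63 : the remaining 11 bits (e.g. a Klass summary or a block size)
constexpr uint64_t make_immediate_summary(uint64_t size, uint64_t first_oop,
                                          uint64_t oop_mask, uint64_t extra) {
    return 1u
         | ((size      & 0x3FFu)      << 1)
         | ((first_oop & 0x3FFu)      << 11)
         | ((oop_mask  & 0xFFFFFFFFu) << 21)
         | ((extra     & 0x7FFu)      << 53);
}
constexpr bool     is_immediate(uint64_t s)    { return (s & 1) != 0; }
constexpr uint64_t size_field(uint64_t s)      { return (s >> 1)  & 0x3FFu; }
constexpr uint64_t first_oop_field(uint64_t s) { return (s >> 11) & 0x3FFu; }
constexpr uint64_t oop_mask_field(uint64_t s)  { return (s >> 21) & 0xFFFFFFFFu; }
```

Note that an aligned pointer naturally has its low bit clear, so the even/odd tag costs nothing on the pointer side of the encoding.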
09-12-2020