JDK-8352075 : Perf regression accessing fields
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21,25
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2025-03-14
  • Updated: 2025-04-24
Description
JDK 21 and later versions introduce a performance regression when accessing fields in a class with many fields in interpreted mode. I believe this is caused by the introduction of FieldStream in JDK-8292818; another performance regression caused by this change was already found for heap dumps (JDK-8317692).

Field access causes an iteration through all fields in InstanceKlass::find_local_field(...). After JDK-8292818 this results in decoding many variable-length integers (through FieldInfoReader::read_field_info, invoked by the next() method) rather than simple indexed access, which turns out to be costly.
Moreover, when we call fd->reinitialize(...) the field lookup by index results in another iteration through the fields rather than O(1) access.

I am attaching a reproducer; it creates a class with 21,000 fields, compiles it and executes it (all the class does is initialize all of its fields). On JDK 17 this reports 581 ms; on JDK 21 the same test takes 8017 ms on my laptop.
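
For illustration, a minimal sketch of this kind of reproducer (the attached version may differ; the class name, the split of the initialization into helper methods and the timing harness are assumptions here):

// ManyFieldsReproducer.java - sketch only. Generates a class with 21,000 int fields
// whose constructor initializes all of them (split across helper methods to stay
// under the 64 KB bytecode limit per method), compiles it with the in-process javac,
// then loads it and times the first instantiation, which runs in the interpreter
// and resolves every putfield.
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ManyFieldsReproducer {
    public static void main(String[] args) throws Exception {
        int fields = 21_000, perMethod = 1_000;
        StringBuilder src = new StringBuilder("public class ManyFields {\n");
        for (int i = 0; i < fields; i++) {
            src.append("  int f").append(i).append(";\n");
        }
        for (int m = 0; m * perMethod < fields; m++) {
            src.append("  void init").append(m).append("() {\n");
            for (int i = m * perMethod; i < Math.min(fields, (m + 1) * perMethod); i++) {
                src.append("    f").append(i).append(" = ").append(i).append(";\n");
            }
            src.append("  }\n");
        }
        src.append("  public ManyFields() {\n");
        for (int m = 0; m * perMethod < fields; m++) {
            src.append("    init").append(m).append("();\n");
        }
        src.append("  }\n}\n");

        Path dir = Files.createTempDirectory("manyfields");
        Files.writeString(dir.resolve("ManyFields.java"), src);
        if (ToolProvider.getSystemJavaCompiler()
                .run(null, null, null, dir.resolve("ManyFields.java").toString()) != 0) {
            throw new IllegalStateException("compilation failed");
        }

        try (URLClassLoader cl = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            long start = System.nanoTime();
            Class.forName("ManyFields", true, cl).getDeclaredConstructor().newInstance();
            System.out.printf("field initialization took %d ms%n",
                              (System.nanoTime() - start) / 1_000_000);
        }
    }
}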

I was able to avoid the second iteration by passing FieldInfo to reinitialize(), and the execution time went down to 5712 ms, but I don't see a simple solution that would make the first iteration more efficient.

hotspot-dev reference: https://mail.openjdk.org/pipermail/hotspot-dev/2025-March/102679.html
Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/24847 Date: 2025-04-24 10:37:56 +0000
24-04-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/24713 Date: 2025-04-17 07:07:55 +0000
17-04-2025

Beyond JDK-8353175 (PR sent), I've implemented some optimizations in https://github.com/rvansa/jdk/tree/JDK-8352075-dev

* CPU optimization (the most important): store an extra byte for each field in the stream that keeps the encoded length, allowing faster skipping through the stream. Combined with the JDK-8353175 fix, in the benchmark `CCC.java` this gets us within 10% of the JDK 17 baseline, though the customer reproducer is still showing a 50% regression. I've tried to measure memory usage on a Spring Boot quickstart with 6800 instance classes loaded (about 16k fields); according to NMT, metadata Used shows about 32 kB growth, a whopping 0.12%. IMO that would be acceptable, but the number itself is a bit disturbing: it has grown 2x more than expected. That would be subject to a separate evaluation.
* Two memory optimizations to give back the extra memory for control bytes:
** I was able to get back 1+ byte for 9k out of those 16k fields by not storing the signature index at all when it equals the name index + 1, using an extra bit in the control byte (introduced above) to mark this.
** The most common combinations of field and access flags can be packed into a table, storing both in 1 byte instead of 2+.

However, these memory optimizations seem to degrade performance in the reproducers, so ideally I would not apply them. I'll file this for review once https://github.com/openjdk/jdk/pull/24290 is integrated.
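
As a rough standalone model of the length-byte skipping described above (illustrative only, not the actual patch; the record layout and the LEB128 varint standing in for UNSIGNED5 are assumptions):

// Each record is prefixed by one control byte holding its encoded length, so
// skipping to record i only reads one byte per record instead of decoding every
// variable-length integer.
import java.io.ByteArrayOutputStream;

public class LengthPrefixedStream {
    // Minimal LEB128-style varint, standing in for the UNSIGNED5 encoding.
    static void writeVarint(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) { out.write((v & 0x7F) | 0x80); v >>>= 7; }
        out.write(v);
    }

    // Encode records of small integers, each prefixed with a control byte that
    // stores the record's encoded length (assumed to fit in one byte).
    static byte[] encode(int[][] records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] rec : records) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            for (int v : rec) writeVarint(body, v);
            out.write(body.size());            // control byte: encoded record length
            out.writeBytes(body.toByteArray());
        }
        return out.toByteArray();
    }

    // Skip to the start of record `index` by reading only the control bytes.
    static int skipTo(byte[] stream, int index) {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            pos += 1 + (stream[pos] & 0xFF);   // 1 control byte + body length
        }
        return pos;
    }

    public static void main(String[] args) {
        int[][] records = new int[21_000][];
        for (int i = 0; i < records.length; i++) {
            records[i] = new int[] { i, i + 1, i * 4, 0 };
        }
        byte[] stream = encode(records);
        // Finding record 20000 touches 20000 control bytes instead of decoding
        // every varint that precedes it.
        System.out.println("record 20000 starts at byte " + skipTo(stream, 20_000));
    }
}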
03-04-2025

Up to you, really. I have a minor dislike for subtasks, because they get nested a bit too deeply when backports are performed. So feel free to submit a separate RFE for the double stream scan fix, and link it here as a "Related" issue. Then we keep this open until we are reasonably sure we cannot do anything else about this performance regression.
25-03-2025

> I am thinking that since double stream scan fix is a bulk of the improvement

Is it the bulk in CCC.java (which ought to be more realistic)? Because in my reproducer, as well as in the code provided by the customer, it saves only about one third of the full regression, still keeping the regression at an order of magnitude. But if you still think it's better to co-opt it, I can do that, clarifying the scope in the description.
25-03-2025

> Shouldn't that rather be a subtask, though?

I am thinking that since double stream scan fix is a bulk of the improvement, we can co-opt this issue to do it. Just rename it to something more succinct. Then we file the related bugs for the leftovers.
25-03-2025

[~shade] Yes, I can prepare the fix. Shouldn't that rather be a subtask, though?
25-03-2025

Radim, are you doing the double stream scan fix? Tell me if you are not; I will find an assignee for this work then :) I think we can co-opt this issue for the double stream scan fix, and submit three other improvements (https://bugs.openjdk.org/browse/JDK-8352075?focusedId=14761782&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14761782) as separate RFEs.
24-03-2025

[~shade] Thank you for investigating this; it seems like there's not really a will for reverting the CPU/memory tradeoff completely. In that case I think the double stream scan fix is a low-hanging fruit worth picking.

> Now, a class with thousands of fields is clearly an outlier in the universe of Java classes, but there's no question it is going to suffer a lot from the FieldInfoStream encoding. So what are the options?
> Using two formats to encode field information, Unsigned5 for classes with a small number of fields, and an expanded, directly accessible, format for classes with a huge number of fields would also defeat the purpose of FieldInfoStream by not compressing the meta-data responsible for most of the waste.

I don't really agree with this conclusion. If a class has many fields, variable-sized encoding won't provide as high a gain as for small classes anyway. And even though we'd lose some gain, since a 'class with thousands of fields is clearly an outlier', the change wouldn't change memory consumption significantly while it would remove a pain point in some use cases.

I would not try to micro-optimize the Unsigned5 code if it keeps handling the current format. The main disadvantage is a strong data dependency, both within the record for a single field (5 mandatory integers + 3 optional) and, as a result, across consecutive fields. There's little chance for any vectorization and a lot of branching. Variable encodings for streams tend to solve that by introducing control byte(s), possibly in an independent control stream. Varint-GB and Varint-G8IU (see https://arxiv.org/pdf/1709.08990) have some redundancy between this control byte and the data bytes. While in theory it is possible to shuffle the MSB from data to the control byte, I don't think that we can strictly keep the memory footprint of the most favourable case for the current encoding - 5 bytes for 5 low integers. If we allow for one control byte, we could encode the MSBs of the name and signature and keep the other 6 bits for the total record length. Using a separate stream for control bytes would allow efficient parallelization of the loop.
24-03-2025

Aleksey Shipilev wrote:
> 4. Open research question: Explore other encodings besides UNSIGNED5, especially those that might help with quicker queries over the stream without much of the decoding.

While the position where a FieldInfo is stored in an Unsigned5 stream depends on the data previously encoded in the stream, the way it is encoded doesn't. This means it is possible to generate an index for faster accesses. When encoding a huge number of FieldInfo records, the VM could record the position in the stream every m entries, and store this information in an array next to the encoded stream. A search for the FieldInfo with a particular field index i would then start by reading the position stored in the array at index i/m and then decoding the stream from this position. The complexity of the search would then be O(m) instead of O(n). The determination of m would require some experiments to find the best trade-off between the time to decode m entries and the memory used to store those indexes.
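
A standalone sketch of this indexing idea (illustrative only, not HotSpot code; the record layout and the LEB128 varint standing in for UNSIGNED5 are assumptions):

// While encoding, remember the stream position of every M-th record; a lookup for
// record i then decodes at most M records starting from the saved position.
import java.util.ArrayList;
import java.util.List;

public class IndexedVarintStream {
    static final int M = 16;                 // records per index entry; needs tuning

    final byte[] stream;
    final int[] positions;                   // positions[k] = stream offset of record k*M
    final int fieldsPerRecord;

    IndexedVarintStream(int[][] records, int fieldsPerRecord) {
        this.fieldsPerRecord = fieldsPerRecord;
        List<Byte> out = new ArrayList<>();
        List<Integer> index = new ArrayList<>();
        for (int i = 0; i < records.length; i++) {
            if (i % M == 0) index.add(out.size());
            for (int v : records[i]) {       // LEB128 varint stands in for UNSIGNED5
                while ((v & ~0x7F) != 0) { out.add((byte) ((v & 0x7F) | 0x80)); v >>>= 7; }
                out.add((byte) v);
            }
        }
        stream = new byte[out.size()];
        for (int i = 0; i < stream.length; i++) stream[i] = out.get(i);
        positions = index.stream().mapToInt(Integer::intValue).toArray();
    }

    // Decode record i: at most M varint-decoded records instead of i of them.
    int[] read(int i) {
        int pos = positions[i / M];
        int[] rec = new int[fieldsPerRecord];
        for (int r = (i / M) * M; r <= i; r++) {
            for (int f = 0; f < fieldsPerRecord; f++) {
                int v = 0, shift = 0, b;
                do { b = stream[pos++] & 0xFF; v |= (b & 0x7F) << shift; shift += 7; }
                while ((b & 0x80) != 0);
                rec[f] = v;
            }
        }
        return rec;                          // the last decoded record is record i
    }

    public static void main(String[] args) {
        int[][] records = new int[21_000][];
        for (int i = 0; i < records.length; i++) records[i] = new int[] { i, i + 1, i * 4 };
        IndexedVarintStream s = new IndexedVarintStream(records, 3);
        System.out.println(java.util.Arrays.toString(s.read(20_000)));  // [20000, 20001, 80000]
    }
}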
19-03-2025

ILW = MMH = P3
Impact: Performance degradation, not a crash
Likelihood: Possible, existing generators for this many fields
Workaround: User doesn't usually have control over these generators
RT Triage: This has been changed to a bug to address the double stream scan. File separate CRs for related enhancements.
18-03-2025

Good writeup, thanks! I believe there is no significant payoff for a significant re-design of the field encoding. CDS would likely not work for dynamically generated classes, as you mentioned. So I think the pragmatic way forward is to perform incremental improvements, and to handle the remaining performance loss for larger field counts as the unfortunate but necessary tradeoff for simplicity.

Now that I look closer at Rewriter, it seems it does the field scans to support JDK-8157181, and it only cares about static final fields. Yet it has to scan the entire field stream. I don't think we should be splitting the field stream into static and instance fields. But maybe there is a way to perform more efficient filtering for statics (or generically for field flags). E.g. introduce a JavaStaticFieldStream that would read the stream frame up to access_flags, check them, and then "quickly" read the rest of the stream frame for a field without much of the decoding. It remains to be seen if this is useful. Or maybe Rewriter should not be doing any of this, and if CI needs to know this field property, it needs to compute it lazily itself.

How about this plan / sub-tasks:
1. Immediate high-level fix: Fix up the double stream scan by reusing FieldInfo where we can.
2. Medium-running mid-level fix: See if we can help Rewriter by either specializing FieldStreams, or moving the computation closer to CI (see above).
3. Longer-running low-level fix: Micro-optimize UNSIGNED5 decoding, if possible. I suspect we can squeeze some performance out of that.
4. Open research question: Explore other encodings besides UNSIGNED5, especially those that might help with quicker queries over the stream without much of the decoding.
18-03-2025

The double stream scan is clearly a coding error and should be fixed without question (thank you for having identified this issue).

Regarding the remaining FieldInfoStream decoding during field resolution, the work cannot be moved to the rewriter because at the time the method is rewritten, the field is not resolved yet and might not be resolvable (the class holding the field might not be loaded yet). Using the rewriting phase to copy the information needed at resolution time from the FieldInfoStream to the ResolvedFieldEntry would defeat the whole purpose of FieldInfoStream, as explained below.

The rationale behind FieldInfoStream is to trade some CPU time for better memory density. The first observation was that the previous FieldInfo data structure was space-inefficient. It had a fixed size, with 4-byte int fields storing small values (<256) most of the time, and optional information that only a few fields had (initial value). Project Valhalla is adding more of this optional information with value types (layout kind, null marker offset), and is more than likely to add even more when generic specialization is added with the parametric VM. The second observation was that field information is used during the startup phase (resolution) and warm-up phase (JIT compilation), but rarely after that (debugging being an exception). Having a fully expanded data structure in memory for the lifetime of the JVM when it is only used during the early phase of the application was considered a wasteful solution. It was understood that using a compressed format would make access to field information more expensive, especially with a format requiring sequential decoding, but because the vast majority of Java classes have a small number of fields, and because this overhead only impacts the early phase of the application, it was considered acceptable.

Now, a class with thousands of fields is clearly an outlier in the universe of Java classes, but there's no question it is going to suffer a lot from the FieldInfoStream encoding. So what are the options?

Moving more information from the FieldInfoStream to the ResolvedFieldEntry would defeat the purpose of FieldInfoStream by duplicating meta-data and keeping in an expanded form information that is not used after resolution. Using two formats to encode field information, Unsigned5 for classes with a small number of fields and an expanded, directly accessible format for classes with a huge number of fields, would also defeat the purpose of FieldInfoStream by not compressing the meta-data responsible for most of the waste. An option could be to split field information into two data structures: one directly accessible for the information that exists for all fields, and a compressed form for the optional information. This would add complexity and would reduce the gains provided by Unsigned5 on small indexes, but it could restore the performance for classes with a huge number of fields. A completely different approach would be to move resolution of those thousands of fields ahead of time, using an AppCDS archive for instance, but classes with thousands of fields are often generated dynamically, so a CDS expert would be needed to know what is possible.
18-03-2025

> Can we do something similar for InstanceKlass::field(int index) and other methods that use ...

My dirty patch did all the places I could easily reach. The remaining hotspots are described by my point (2): a walk in Rewriter and a walk in LinkResolver. Let's see if Radim wants to take the FieldInfo caching on, and if Frederic has more comments on this.
18-03-2025

> Easy one, which Radim has already identified:

[~shade] Can we do something similar for InstanceKlass::field(int index) and other methods that use the "for (JavaFieldStream fs(this); !fs.done(); fs.next())" clause? I was just wondering if we could get more drops from 106 ms.
18-03-2025

> Hey Xuelei, I think your JDK-8352169 is the duplicate of this one?

[~shade] Yes. JDK-8352169 was created by mistake when I tried to link this one to JDK-8292818. Thank you for closing the duplicate.
18-03-2025

It would be prudent to involve [~fparain] in the discussion here given he did the field streaming work.
18-03-2025

Basically, I am seeing two opportunities:

1. Easy one, which Radim has already identified: avoid the double stream scan when we already know the FieldInfo. This seems to get us halfway there; the time drops to 106 ms here. Radim, do you want to follow up on that? I have some spare cycles for it as well... I.e. do:

bool InstanceKlass::find_local_field_from_offset(int offset, bool is_static, fieldDescriptor* fd) const {
  for (JavaFieldStream fs(this); !fs.done(); fs.next()) {
    if (fs.offset() == offset) {
-     fd->reinitialize(const_cast<InstanceKlass*>(this), fs.index());
+     FieldInfo fi = fs.to_FieldInfo();
+     fd->reinitialize(const_cast<InstanceKlass*>(this), fs.index(), &fi);
      if (fd->is_static() == is_static) return true;
    }
  }

2. Figure out if Rewriter::scan_method can pass the resolved field info (via ResolvedFieldEntry?) to LinkResolver::resolve_field, so that we only do the (unavoidable) walk through a method once in Rewriter. Not sure how much hassle that would be.
17-03-2025

I was able to construct a bit more straight-forward example based on Radim's generator. The attached CCC.java has 30 classes, each bearing 200 instance and 200 static fields, initialized by 20 methods. I think we can reasonably argue this is not an unusual shape, especially with generated code. It shows impact that we very much want to mitigate, especially for startup time.

$ hyperfine -w 5 -r 10 "java CCC"

# Mainline
  Time (mean ± σ):     149.4 ms ±  11.5 ms    [User: 135.6 ms, System: 28.2 ms]
  Range (min … max):   139.8 ms … 172.6 ms    10 runs

# JDK 17
  Time (mean ± σ):      60.2 ms ±   2.2 ms    [User: 40.0 ms, System: 27.5 ms]
  Range (min … max):    57.6 ms …  64.5 ms    10 runs

Attached profile-forward.html and profile-reverse.html show where we spend time.
17-03-2025

Hey Xuelei, I think your JDK-8352169 is the duplicate of this one? Radim has posted to hotspot-dev, and this one is his bug, so I think this one takes precedence.
17-03-2025

Change the priority to P2 (ILW: MHH).
17-03-2025

[~dholmes] We've got a customer who faced this regression from 5 secs to 23 secs.
17-03-2025

I think it is (well) known that the fieldstream approach is slower when there are huge numbers of fields. The question, to me, is how realistic such a scenario is. Do real applications run into this issue?
17-03-2025