JDK-8328473 : StringTable and SymbolTable statistics delay time to safepoint
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17,21,23,25
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2024-03-19
  • Updated: 2025-03-07
  • Resolved: 2025-02-26
Fix Version: JDK 25 (25 b12, Fixed)
Description
There are periodic JFR events that calculate statistics for the string table and symbol table. These calculations are run in the in_vm thread state, and hence delay safepoints for the duration of the calculation. There is no incremental polling either, so the safepoint delay is proportional to the size of the tables. The events run by default every 10 seconds. I would class that as a performance bug. It is not expected and could have severe consequences.
Moreover, the string table can get very large and it isn't clear if doing this every 10 seconds is a good idea.
Comments
Changeset: 1e18fffe Branch: master Author: Coleen Phillimore <coleenp@openjdk.org> Date: 2025-02-26 11:49:09 +0000 URL: https://git.openjdk.org/jdk/commit/1e18fffee456382c4eeb017b3fad0dc99ccaad35
26-02-2025

I was trying to get my head around reading this VM data structure in native mode, and was frankly afraid it would work a little. This is especially worrying for StringTable, which has OopHandles. We also hold the resize_lock, which is a Mutex that can't be held in native code. So no. Interesting idea though.
24-02-2025

Impact: JDK-8185525 added these JFR events, and they are enabled by default. Marked related affected versions and linked up the related issues.
24-02-2025

> Maybe we should transition to native state, and thus avoid interfering with safepoints completely? Ah probably not, because that would expose us to shutdown races, since native-state printing would not block VM_Exit safepoint from destroying the tables.
24-02-2025

Do we care about VM state for gathering these events? Maybe we should transition to native state, and thus avoid interfering with safepoints completely? Not sure if CHM allows us to.
24-02-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/23750 Date: 2025-02-24 14:27:01 +0000
24-02-2025

I've implemented the statistics gathering in chunks that wait for safepoints at intervals, like the GrowTask and BulkDeleteTask. If the table is resized or rehashed in the middle, the statistics might not be accurate, but the next collection will be more accurate. It's not specified to stop the VM and be 100% accurate, but there is a dcmd that does that, if necessary. Now how do I test this?
24-02-2025
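
The chunked approach described in the comment above can be illustrated with a minimal, self-contained sketch. This is not the HotSpot code from the changeset: the table is modeled as a plain vector of buckets, and the names Bucket, TableStats, gather_stats_chunked and yield_point are hypothetical, chosen only for illustration. The callback stands in for the point where a real VM thread would let a pending safepoint proceed between chunks.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a hash table: a vector of buckets.
using Bucket = std::vector<std::string>;

struct TableStats {
    std::size_t buckets = 0;         // buckets visited
    std::size_t entries = 0;         // total entries seen
    std::size_t max_bucket_len = 0;  // longest bucket seen
    std::size_t pauses = 0;          // times we yielded between chunks
};

// Walk `table` in chunks of `chunk_size` buckets. After every chunk except
// the last, call `yield_point`, which models (as an assumption) the place
// where a VM thread would allow a pending safepoint to run.
TableStats gather_stats_chunked(const std::vector<Bucket>& table,
                                std::size_t chunk_size,
                                const std::function<void()>& yield_point) {
    TableStats stats;
    if (chunk_size == 0) {
        chunk_size = 1;  // avoid an infinite loop on a zero chunk size
    }
    for (std::size_t start = 0; start < table.size(); start += chunk_size) {
        const std::size_t end = std::min(start + chunk_size, table.size());
        for (std::size_t i = start; i < end; ++i) {
            stats.buckets++;
            stats.entries += table[i].size();
            stats.max_bucket_len = std::max(stats.max_bucket_len, table[i].size());
        }
        if (end < table.size()) {
            yield_point();  // bounded work has been done; let others in
            stats.pauses++;
        }
    }
    return stats;
}
```

With this shape, the longest stretch without a yield is bounded by the chunk size rather than by the table size, which is the property the original report asks for. The sketch does not model concurrent inserts or a rehash during a pause, so it says nothing about the accuracy trade-off discussed in this thread.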

Time measurements would depend arbitrarily on the size of the string tables. We have seen one particular application where string tables are very large, and this becomes a noticeable problem. Personally I think that it seems okay to perform incremental polling and accept that safepoints might rehash the table in the middle... once. We already accept that the entire table is concurrently mutated (w.r.t. inserts) while we are walking it. So it is already the case that whatever statistics we end up with might not correlate to any particular linearizable point in time. Safepoint rehashing seems to me like it would be the smaller consistency problem here.
20-03-2024

Do we actually have time measurements for collecting these stats? Are these stats something that can be gathered while safepoint-safe when a safepoint operation may concurrently update the tables?
20-03-2024

RT Triage: ILW = MML = P4
19-03-2024

From a latency perspective, we should not block out safepoints until a potentially huge data structure has been walked, and the interval seems questionable given JFR’s aim to not impact the overall CPU usage significantly. This is the problem domain. I’m not proposing a given solution.
19-03-2024

Is this as simple as you asking not to run at the current frequency? If so, do you have a suggestion on what it should be? Or is figuring out the reasonable frequency the crux of the issue? Do you also want to see whether this can be optimized in some way, in addition to making it happen less frequently? If so, did you have anything concrete in mind here that we can do?
19-03-2024