JDK-8267460 : runtime/os/TestTracePageSizes.java#with-Serial fails on linux-aarch64 since JDK-8267155
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 17,18
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: aarch64
  • Submitted: 2021-05-20
  • Updated: 2025-05-23
Sub Tasks
JDK-8267463 :  
JDK-8267480 :  
Description
runtime/os/TestTracePageSizes.java fails since the changes in JDK-8267155 on linux-aarch64 with the following error:

command: main -XX:+AlwaysPreTouch -Xmx128m -Xlog:pagesize:ps-%p.log -XX:+UseSerialGC -XX:+UseTransparentHugePages TestTracePageSizes
reason: User specified action: run main/othervm -XX:+AlwaysPreTouch -Xmx128m -Xlog:pagesize:ps-%p.log -XX:+UseSerialGC -XX:+UseTransparentHugePages TestTracePageSizes 
Mode: othervm [/othervm specified]
elapsed time (seconds): 1.299
----------configuration:(0/0)----------
----------System.out:(0/0)----------
----------System.err:(13/846)----------
java.lang.AssertionError: Page sizes mismatch: 64 != 524288
	at TestTracePageSizes.main(TestTracePageSizes.java:294)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:831)


This seems to be an issue with large pages (512M) somehow.

Also, only the "with-Serial" configuration fails, for some reason.
Comments
I don't have the bandwidth to do this, so I unassign myself from this issue.
23-05-2025

Okay, I think this is an old error in the test itself that was already fixed as a side effect of JDK-8319969. What I think happens:
- we use THP
- we are on arm64, and the kernel is compiled with CONFIG_ARM64_64K_PAGES=y, so we have 64K base pages and 512MB large (explicit and THP) pages
- the JVM reads the huge page size into os::_large_page_size
- the SerialGC passes this page size to the ReservedSpace object. It stores this page size internally, but since it's THP, all it does with it is make sure the base address is properly aligned.
- it traces "512M" as part of the pagesize trace. Log (ps-697260.log):
```
[0.006s][info][pagesize] Heap: min=512M max=512M base=0x00000000e0000000 page_size=512M size=512M
```
- but it takes a while to coalesce 512MB, so the underlying vma is still 64K-paged. smaps-copy-697260-0.txt:
```
e0000000-100010000 rw-p 00000000 00:00 0
Size:             524352 kB   <<< 512MB
KernelPageSize:       64 kB   <<< still small-paged
MMUPageSize:          64 kB
Rss:              524352 kB
Pss:              524352 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    524352 kB
Referenced:       524352 kB
Anonymous:        524352 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1      <<< eligible
VmFlags: rd wr mr mw me ac
```
Older versions of this test only checked for the "hg" VmFlag in smaps. Newer versions also test whether THPeligible == 1: https://github.com/openjdk/jdk/blob/cba0f786fc65a5bfbc6e921efd1f191b63b30ba5/test/hotspot/jtreg/runtime/os/TestTracePageSizes.java#L371 and I assume that fixes this bug. It would be nice for someone at Oracle to check whether this bug still occurs with a current JVM, since I cannot reproduce it. If it cannot be reproduced, I would close the bug as a duplicate of JDK-8319969.
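The acceptance logic described above can be sketched roughly as follows. This is a hypothetical helper, not the actual test code (the real check lives in TestTracePageSizes.java and parses full smaps files); class and method names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the newer test's check: a range counts as OK either when the
// kernel page size matches the size the JVM traced, or when the range is
// merely marked THPeligible (coalescing may simply not have happened yet).
public class SmapsCheck {
    // Parse "Key:  <value> kB"-style lines of one smaps entry into numbers.
    public static Map<String, Long> parseKb(String smapsEntry) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : smapsEntry.split("\n")) {
            String[] parts = line.trim().split(":\\s*");
            if (parts.length == 2) {
                String v = parts[1].replace(" kB", "").trim();
                try {
                    fields.put(parts[0], Long.parseLong(v));
                } catch (NumberFormatException ignored) {
                    // non-numeric fields like VmFlags are skipped
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Abbreviated version of the smaps entry quoted above.
        String entry = "KernelPageSize:       64 kB\nTHPeligible:           1";
        Map<String, Long> f = parseKb(entry);
        long tracedKb = 524288; // 512M in kB, as traced by the JVM
        boolean ok = f.get("KernelPageSize") == tracedKb
                  || f.getOrDefault("THPeligible", 0L) == 1;
        System.out.println(ok); // passes because THPeligible == 1
    }
}
```

With the old check (KernelPageSize only), this entry would have failed exactly as in the report: 64 != 524288.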
23-01-2025

Re-opening since the test is problem listed, and as David points out, should not be closed without first removing it from the problem list.
26-02-2024

[~aph] this bug should not be closed as it is used in the ProblemList to exclude these tests. If the bug is no longer considered an issue the ProblemList must still be updated. Thanks
23-01-2024

I fixed several errors in this area with JDK-8310233. I wonder whether this issue is still reproducible.
10-08-2023

I might be able to address this, given some information about how to configure the host.
10-08-2023

It has been brought to my attention that reproducing this bug requires specially configured host(s) with differing memory page sizes (huge pages of varying sizes). Hence reopening this bug.
23-02-2023

Ran on Oracle Linux 8 and 9 over 200 times, could not reproduce. Closing as CNR. Corresponding entry will be removed from problem list shortly, see: JDK-8303085: Runtime problem list cleanup
22-02-2023

So I just did this on a RHEL system after removing the entries from ProblemList.txt, and it appears to be non-reproducible:
```
Running test 'jtreg:test/hotspot/jtreg/runtime/os/TestTracePageSizes.java'
Passed: runtime/os/TestTracePageSizes.java#G1
Passed: runtime/os/TestTracePageSizes.java#Parallel
Passed: runtime/os/TestTracePageSizes.java#Serial
Passed: runtime/os/TestTracePageSizes.java#no-options
Passed: runtime/os/TestTracePageSizes.java#compiler-options
Test results: passed: 5
```
System configuration:
```
Hugepagesize:     524288 kB
KernelPageSize:       64 kB
```
26-05-2022

I'll remove myself from this bug as Assignee since I don't have the hardware to reproduce this; one needs an aarch64 box with 64k base page size (so, 512m would be the only available large page size). As a quick solution, it may be that just removing the -Xmx128m from the test runner arguments may be enough to make the error disappear; that had been a performance optimization.
10-06-2021

ILW = MLM = P4
25-05-2021

> I wonder if the real question is what we should do with LargePageSizeInBytes when using THP? It doesn't really have any meaning with THP since it is the kernel that decides what size of pages to back the reservation with.

I am not even sure why UseTHP has to be connected to UseLargePages at all. Isn't the point of THP that it works transparently? Could we not, if UseTHP is specified, do everything as we do with small pages, just madvise when committing, and let the kernel figure out how best to fold the memory area into large pages?

AFAIU the advantage of the current behavior is that reservations and commits are carefully groomed and aligned to the large page size. But I think this does not really matter for the standard "small" large pages of 2m: if we allocate a 512m heap and the first 1.8m are transparently small-paged, so what?

The only real benefit the current THP handling has is with really large pages, e.g. 512m as in this case; but I wonder whether this is not an artificial scenario: for the THP handler to consider a 512m range worthy of folding into one page, don't all these pages have to be paged in (which is why we start the test with AlwaysPreTouch)? How realistic is that, and how fragile is that page (how probable is it that it splinters if e.g. parts of the heap get uncommitted)?

One thing is sure: if we simplified UseTHP to just do-as-with-small-pages-just-madvise, we could get rid of quite a bit of complexity.
20-05-2021

Still, for the original problem we don't set LargePageSizeInBytes, but we end up with the same problem because the large page size is bigger than the young generation.
20-05-2021

The problem is that the young generation is not large enough. Here:
```
bool VirtualSpace::initialize(ReservedSpace rs, size_t committed_size) {
  const size_t max_commit_granularity = os::page_size_for_region_unaligned(rs.size(), 1);
  return initialize_with_granularity(rs, committed_size, max_commit_granularity);
}
```
The max_commit_granularity will be calculated as 4k instead of 1g, and this is what will be used as the alignment hint further down. The reason is that the size of the ReservedSpace here is 715784192 bytes. This will actually be "fixed" once we have multiple large page sizes, at least for x86, where in this case we would get 2m as the alignment hint.

I wonder if the real question is what we should do with LargePageSizeInBytes when using THP? It doesn't really have any meaning with THP since it is the kernel that decides what size of pages to back the reservation with.
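The granularity collapse described above can be illustrated with a minimal sketch. This is not the HotSpot implementation of os::page_size_for_region_unaligned, just an assumed simplification of its selection rule (pick the largest supported page size that still fits the region), using the sizes from the x86 repro in this thread (4K base pages plus a 1G large page):

```java
// Illustrative sketch: why a ~682M region gets a 4K commit granularity
// when the only supported page sizes are 4K and 1G.
public class CommitGranularity {
    // supportedDesc: supported page sizes in descending order.
    // Returns the largest page size that fits minPages times into the region.
    public static long pageSizeForRegion(long regionSize, long minPages,
                                         long[] supportedDesc) {
        for (long ps : supportedDesc) {
            if (regionSize / minPages >= ps) {
                return ps;
            }
        }
        throw new IllegalArgumentException("no supported page size fits");
    }

    public static void main(String[] args) {
        long[] sizes = { 1024L * 1024 * 1024, 4096L }; // 1G, 4K
        // Young gen ReservedSpace size from the comment: 715784192 bytes.
        // 715784192 < 1G, so the 1G page is rejected and 4K is chosen.
        System.out.println(pageSizeForRegion(715784192L, 1, sizes)); // 4096
    }
}
```

With an intermediate 2M page size in the supported set (as on x86 once multiple large page sizes are supported), the same region would get 2M instead, matching the "fixed" behavior mentioned above.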
20-05-2021

I am looking into Serial now. Instrumenting os::pd_realign_memory shows me this:
```
[0.028s][warning][pagesize] attempt madvise: 0x0000000080000000, 715784192 bytes, alignment 4096, vm page size 4096, large page size 1073741824
[0.223s][warning][pagesize] attempt madvise: 0x00000000aaaa0000, 357957632 bytes, alignment 4096, vm page size 4096, large page size 1073741824
[0.600s][warning][pagesize] Young RS: 0x0000000080000000, size 715784192, alignment 1073741824
[0.600s][warning][pagesize] Old RS: 0x00000000aaaa0000, size 1431699456, alignment 1073741824
```
What I think is problematic is that we call the young_rs and old_rs inits with an alignment hint of 4K. ReservedSpace.alignment() returns 1G for them, but the alignment hint gets reduced to 4K by the time we reach the madvise call.

I believe this is a VM bug rather than an issue with the test: the VM code seems over-optimistic that with a small heap and large pages we would always be able to map the THPs. Seems to be that way at least for Serial: when the generations are less than a page in size and/or split the pages, we run into problems, because madvise seems to be issued only for page-aligned subparts of them.
20-05-2021

Oh. I think I understand why this happens. So -Xmx2g ergonomically selects -Xms1g. GC then tries to commit that initial heap size, which is exactly one page in my artificial config. But MADV_HUGEPAGE is never delivered to that segment. This is why, and this is the fix that makes the test pass both on my x86_64 host (with modified test) and my aarch64 host (with original test from current master):
```
diff --git a/src/hotspot/os/linux/os_linux.cpp b/src/hotspot/os/linux/os_linux.cpp
index 1e0c375a025..4ea49e62b47 100644
--- a/src/hotspot/os/linux/os_linux.cpp
+++ b/src/hotspot/os/linux/os_linux.cpp
@@ -2788,7 +2788,7 @@ void os::pd_commit_memory_or_exit(char* addr, size_t size,
 }

 void os::pd_realign_memory(char *addr, size_t bytes, size_t alignment_hint) {
-  if (UseTransparentHugePages && alignment_hint > (size_t)vm_page_size()) {
+  if (UseTransparentHugePages) {
     // We don't check the return value: madvise(MADV_HUGEPAGE) may not
     // be supported or the memory may already be backed by huge pages.
     ::madvise(addr, bytes, MADV_HUGEPAGE);
```
It also explains why -Xmx3g works in my artificial config: it selects -Xms2g, which means the initial heap commit passes this check.

EDIT: After some thought, this is not actually a correct patch... Eh.
20-05-2021

I honestly like your patch. Why not, if UseTransparentHugePages, switch on THP for *all* regions? If it's good for the heap, it's good for other regions. No need to leave the decision to use THP on a particular memory range up to the caller. That would simplify this coding.

EDIT: I realize THPs are a bad idea for large regions with a short or highly volatile life span. But we have mainly static areas, I think. Aph may know more; IIRC he introduced THPs, including this commit logic.
20-05-2021

But os::vm_page_size() should be the base page size, 4K in your case. So with your patch, you now switch on THP for all commits indiscriminately if UseTransparentHugePages=on. I thought the error must be somewhere in the callers, where we pass in "alignment" as the page size from various commit-the-heap places; for SerialGC it should be in VirtualSpace::expand_by(). I don't see any obvious error there, but the alignment calculation may be off. I think passing the alignment as, basically, a boolean arg to indicate "use THP here" is not easy to understand.

(What I also don't understand: say we have reserved a memory segment and then madvise a part of it. Will the Linux kernel split this segment, make two or three out of it, and set the attribute only on the middle one, like it does with partial uncommits?)
20-05-2021

I don't get 1g pages to work on my system, Ubuntu won't come up :/
20-05-2021

I opened https://bugs.openjdk.java.net/browse/JDK-8267475 to follow up on the "excessive rounding up of heap size" issue.
20-05-2021

Hold on a sec. I revert JDK-8267155 -- https://github.com/openjdk/jdk/commit/726785b8d7c18569bddae6a08fa7f61d8d7bd2c4 -- and the test *still fails*!
20-05-2021

The weird thing is that the VM is started with -Xmx128m. Why on earth would the VM try to reserve a heap with 512m pages? I believe this error is triggered by the test now passing in -Xmx128m, and that it's a pre-existing bug.
20-05-2021

Enabling debug output says it fails here:
```
From logfile:
[0.024s][info][pagesize] Heap: min=512M max=512M base=0x00000000e0000000 page_size=512M size=512M
From smaps:
[e0000000, 100000000) pageSize=64KB isTHP=false isHUGETLB=false
Failure: 64 != 524288
```
20-05-2021

Thank you Thomas.
20-05-2021

I can reproduce this on one of aarch64 boxes here.
20-05-2021

Attached all ps-* and smaps-* files in the directory. I think 697260 is the problematic one as it's the only one with a forced 512M heap.
20-05-2021

I have no access to a 64k granule aarch64 machine. Could someone pls attach the smaps file of the failing test (should be part of the retained error files).
20-05-2021

Thanks for filing [~tschatzl] - you beat me to it. :)
20-05-2021

Ha! The config above, but with -Xmx3g, passes! Note how isTHP=true now. I wonder if either the HotSpot code or the OS code does not put the madvise out.
```
From logfile:
[0.006s][info][pagesize] Heap: min=1G max=3G base=0x0000000740000000 page_size=1G size=3G
From smaps:
[740000000, 800000000) pageSize=4KB isTHP=true isHUGETLB=false
Success: 1048576 > 4 and THP enabled
```
20-05-2021

Maybe they just recently changed their aarch64 system to use 64k page size (which includes 512m as the sole huge page size).
20-05-2021

I have rolled back as far as the original JDK-8262188 that added the test, and the test still fails on my AArch64 machine with the same config and the same message. So this looks like a pre-existing day-1 bug, not a recent regression. Pengfei also mentions a failure like this earlier: https://bugs.openjdk.java.net/browse/JDK-8263236?focusedCommentId=14408987#comment-14408987. I am confused why it was detected only recently... I do wonder if Oracle systems had a default heap size that worked, and then -Xmx128m put the test into a position where the VM rounded the heap size up to the large page size *and* recorded that as the page size. That is to say, somebody from Oracle needs to investigate this on the system where this failure was detected.

I think I can reproduce a similar (?) failure on my x86_64 desktop by adding this config:
```
+ * @run main/othervm -XX:+AlwaysPreTouch -Xmx2g -Xlog:pagesize:ps-%p.log -XX:+UseSerialGC -XX:+UseTransparentHugePages -XX:LargePageSizeInBytes=1G TestTracePageSizes
```
```
From logfile:
[0.005s][info][pagesize] Heap: min=1G max=2G base=0x0000000080000000 page_size=1G size=2G
From smaps:
[80000000, c0000000) pageSize=4KB isTHP=false isHUGETLB=false
Failure: 4 != 1048576
STDERR:
java.lang.AssertionError: Page sizes mismatch: 4 != 1048576
```
20-05-2021

Interestingly, this shows a different problem with very large page sizes: if we specify UseLargePages (or UseTransparentHugePages) on a machine with only very large page sizes - like here, where we only have 512m - we round the heap size up to the page size instead of doing the correct thing and falling back to small pages. In this example we start the VM with a 128m heap, which gets rounded up to 512m.
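The rounding described above is simple alignment arithmetic. A minimal sketch, assuming a power-of-two alignment helper in the spirit of HotSpot's align_up (this is illustrative, not the VM code):

```java
// Sketch: aligning a 128M heap request up to a 512M page size
// quadruples the committed heap.
public class HeapRounding {
    // Round size up to the next multiple of alignment.
    // alignment must be a power of two.
    public static long alignUp(long size, long alignment) {
        return (size + alignment - 1) & ~(alignment - 1);
    }

    public static void main(String[] args) {
        long m = 1024L * 1024;
        long heap = alignUp(128 * m, 512 * m);
        System.out.println(heap / m); // prints 512
    }
}
```

This matches the traced heap in the failure log: min=512M max=512M, even though the test asked for -Xmx128m.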
20-05-2021

Can you post the output of "pagesize" and of "java -Xlog:pagesize -XX:+UseTransparentHugePages -XX:+UseSerialGC -Xmx128m" ? To me it looks like the GC thinks erroneously that the heap should be 512m paged, but the heap - being 128m in size - can only use the base page size of 64k.
20-05-2021