JDK-8263242 : serviceability/sa/ClhsdbFindPC.java cannot find MaxJNILocalCapacity with ASLR
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc-agent
  • Affected Version: 17
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • Submitted: 2021-03-09
  • Updated: 2021-05-24
  • Resolved: 2021-05-18
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17
17 b23Fixed
Related Reports
Relates :  
Relates :  
Description
The test case was added recently with JDK-8261098. On some machines, it fails with:

$ CONF=linux-x86_64-server-fastdebug make run-test TEST=serviceability/sa/ClhsdbFindPC.java

Starting clhsdb against corefile core and exe /home/shade/jdk/build/linux-x86_64-server-fastdebug/images/jdk/bin/java
[2021-03-09T11:37:01.497454082Z] Gathering output for process 79218
[2021-03-09T11:37:02.344227826Z] Waiting for completion for process 79218
[2021-03-09T11:37:02.344444425Z] Waiting for completion finished for process 79218
Output: 
Opening core file, please wait...
hsdb> hsdb> + verbose true
hsdb> + findsym MaxJNILocalCapacity
0x00007fb919044688
hsdb> + quit

Starting clhsdb against corefile core and exe /home/shade/jdk/build/linux-x86_64-server-fastdebug/images/jdk/bin/java
[2021-03-09T11:37:02.347413091Z] Gathering output for process 79281
[2021-03-09T11:37:03.335273729Z] Waiting for completion for process 79281
[2021-03-09T11:37:03.335473657Z] Waiting for completion finished for process 79281
Output: 
Opening core file, please wait...
hsdb> hsdb> + verbose true
hsdb> + findpc 0x00007fb919044688
Address 0x00007fb919044688: In unknown location
hsdb> + quit

STDERR:
 stdout: [ + findpc 0x00007fb919044688
Address 0x00007fb919044688: In unknown location
];
 stderr: [ + findpc 0x00007fb919044688
Address 0x00007fb919044688: In unknown location
]
 exitValue = -1

java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'Address 0x00007fb919044688: .*MaxJNILocalCapacity' missing from stdout/stderr 

Note that the prior invocation of hsdb found the symbol at the same address 0x00007fb919044688, but the subsequent invocation missed it.

This issue goes away after disabling ASLR on target machine:
 $ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

...and remanifests when ASLR is enabled back:
 $ echo 2 | sudo tee /proc/sys/kernel/randomize_va_space
Comments
Changeset: 10236e7a Author: Chris Plummer <cjplummer@openjdk.org> Date: 2021-05-18 21:26:56 +0000 URL: https://git.openjdk.java.net/jdk/commit/10236e7a52a2dd117c3639d020cfe294797f57af
18-05-2021

It's out for review now.
18-05-2021

Thanks Chris! These two tests failure can reproduce constantly in our Linux X64 environement with 3.10.107 kernel.
18-05-2021

I wasn't giving this bug much attention since no one else was reporting the issue. I can make it a higher priority. I delayed sending out a PR for the above fix because there were a couple of other things I wanted to clarify first. I'll try to get back to it.
17-05-2021

Hi Chris, Recently added test "test/hotspot/jtreg/serviceability/sa/sadebugd/RunCommandOnServerTest.java" fail due to same reason. It is annoying to have these intermittent failures. Do you have plan to submit PR for this bug recently? Tested with following patch and no more reproduce for RunCommandOnServerTest and test/hotspot/jtreg/serviceability/sa/ClhsdbFindPC.java --- a/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c +++ b/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c @@ -187,8 +187,9 @@ static bool fill_addr_info(lib_info* lib) { lib->exec_end = (uintptr_t)-1L; for (ph = phbuf, cnt = 0; cnt < ehdr.e_phnum; cnt++, ph++) { if (ph->p_type == PT_LOAD) { - uintptr_t aligned_start = align_down(lib->base + ph->p_vaddr, ph->p_align); - uintptr_t aligned_end = align_up(aligned_start + ph->p_filesz, ph->p_align); + uintptr_t unaligned_start = lib->base + ph->p_vaddr; + uintptr_t aligned_start = align_down(unaligned_start, ph->p_align); + uintptr_t aligned_end = align_up(unaligned_start + ph->p_filesz, ph->p_align); if ((lib->end == (uintptr_t)-1L) || (lib->end < aligned_end)) { lib->end = aligned_end; }
17-05-2021

Also notice random failure when running test/hotspot/jtreg/serviceability/sa/ClhsdbFindPC.java serviceability/sa/ClhsdbFindPC.java#id0 Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'Address 0x00002b995a089408: .*MaxJNILocalCapacity' missing from stdout/stderr serviceability/sa/ClhsdbFindPC.java#id2 Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'Address 0x00002ab060e05408: .*MaxJNILocalCapacity' missing from stdout/stderr Checking above patch and fixed.
08-04-2021

The issue is that the size of the .so is not properly being calculated, and is too small. The problem can be more readily reproduced by changing the symbol that is searched for from MaxJNILocalCapacity to something near the end of the .bss section. I changed it to _ZL16ZCounterUncommit, and now it reproduces every time. [Just a warning if you try this change after applying the fix below. The test will still fail, but that's because the "findpc <addr>" output unmangles the symbol, so it displays ZCounterUncommit instead of _ZL16ZCounterUncommit, so the test complains.] The issue is that when calculating the size (range) of the .so, the p_filesz field is used instead of p_memsz. +++ b/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c @@ -188,11 +188,11 @@ static bool fill_addr_info(lib_info* lib) { for (ph = phbuf, cnt = 0; cnt < ehdr.e_phnum; cnt++, ph++) { if (ph->p_type == PT_LOAD) { uintptr_t aligned_start = align_down(lib->base + ph->p_vaddr, ph->p_align); - uintptr_t aligned_end = align_up(aligned_start + ph->p_filesz, ph->p_align); + uintptr_t aligned_end = align_up(aligned_start + ph->p_memsz, ph->p_align); p_memsz is larger because it includes .bss, which is not included in p_filesz. However, at the time this code was written, initially p_memsz was used, but it was thought that it produced an address that was beyond the end of the .so memory, so p_filesz was used instead. The confusion was due to not understanding how .bss is handled. If you look at the maps for libjvm.so, you will see something like: 7efee9393000-7efee93b5000 r-xp 00000000 08:02 23470983 /ws/jdk/build/linux-x86_64-server-fastdebug/images/jdk/lib/libjava.so 7efee93b5000-7efee95b5000 ---p 00022000 08:02 23470983 /ws/jdk/build/linux-x86_64-server-fastdebug/images/jdk/lib/libjava.so 7efee95b5000-7efee95b6000 r--p 00022000 08:02 23470983 /ws/jdk/build/linux-x86_64-server-fastdebug/images/jdk/lib/libjava.so 7efee95b6000-7efee95b7000 rw-p 00023000 08:02 23470983 /ws/jdk/build/linux-x86_64-server-fastdebug/images/jdk/lib/libjava.so 7efee95b7000-7efee95b8000 rw-p 00000000 00:00 0 It was thought, in the above example, that the .so ended at 7efee95b7000. However, you also need to include the anonymous map that follows, because it represents the .bss section. When you look at this address instead (7efee95b8000), then using p_memsz computes the correct end of the .so memory. However, after the above fix I found that the computed end was coming up a page short. Here's an example with output from various sources: #libproc_impl.c fill_addr_info() printf: libjvm.so [1] 0x15055854d000-0x15055a3b8000: base = 0x15055854d000, vaddr = 0x0, memsz = 0x1e6aa70 libjvm.so [2] 0x15055a3b8000-0x15055a565000: base = 0x15055854d000, vaddr = 0x1e6bde0, memsz = 0x1acbc0 # /proc/<pid>/maps 15055854d000-15055a3b8000 r-xp 00000000 08:10 4745273214 /libjvm.so 15055a3b8000-15055a492000 r--p 01e6a000 08:10 4745273214 /libjvm.so 15055a492000-15055a4c8000 rw-p 01f44000 08:10 4745273214 /libjvm.so 15055a4c8000-15055a566000 rw-p 00000000 00:00 0 # clhsdb pmap output: 0x000015055854d000 32864K /libjvm.so # computed end of .so based on clhsdb pmap output (gdb) p /x 0x000015055854d000 + 32864*1024 $11 = 0x15055a565000 So the end result here is that the calculated ending address, 0x15055a565000, is one page short of what the maps file shows is the ending address, which is 15055a566000. This issue is due to the align_down that is done on the starting address of the segment: uintptr_t aligned_start = align_down(lib->base + ph->p_vaddr, ph->p_align); uintptr_t aligned_end = align_up(aligned_start + ph->p_filesz, ph->p_align); This does seem to compute the correct start of the segment, which is 0x15055a3b8000, but also results in coming up a page short in computing the overall size. I think the solution is to use the unaligned start of the library when computing the ending address, which seems to be working - uintptr_t aligned_start = align_down(lib->base + ph->p_vaddr, ph->p_align); - uintptr_t aligned_end = align_up(aligned_start + ph->p_filesz, ph->p_align); + uintptr_t unaligned_start = lib->base + ph->p_vaddr; + uintptr_t aligned_start = align_down(unaligned_start, ph->p_align); + uintptr_t aligned_end = align_up(unaligned_start + ph->p_memsz, ph->p_align);
23-03-2021

In the review for JDK-8261710, Yasumasa mentioned that the size calculation may not be perfect and could use some future improvements. https://github.com/openjdk/jdk/pull/2563#issuecomment-780344032
12-03-2021

The test first successfully finds the address of the MaxJNILocalCapacity symbol: hsdb> + findsym MaxJNILocalCapacity 0x00007fb919044688 But it then fails to do the reverse and map the address to the symbol: hsdb> + findpc 0x00007fb919044688 Address 0x00007fb919044688: In unknown location I know in general this type of reverse lookup is working because otherwise SA would quickly fail due to the inability to map a vtable address to its symbol. The code that does the address -> symbol lookup is in PointerFinder.java JVMDebugger dbg = VM.getVM().getDebugger(); CDebugger cdbg = dbg.getCDebugger(); if (cdbg != null) { loc.loadObject = cdbg.loadObjectContainingPC(a); if (loc.loadObject != null) { loc.nativeSymbol = loc.loadObject.closestSymbolToPC(a); return loc; } } From the output I can tell that the loadObjectContainingPC(a) returned null. I changed the code to instead just always use the first LoadObject in the global list: - loc.loadObject = cdbg.loadObjectContainingPC(a); + java.util.List<LoadObject> objs = cdbg.getLoadObjectList(); + loc.loadObject = objs.get(0); Although this is not the LoadObject that the symbol is in, it turns out that doesn't matter since loadObject.closestSymbolToPC(a) ends up searching every dso no matter which LoadObject was used to call it. With this code in place, the lookup was successful and the test passed. So that is evidence that loadObjectContainingPC(a) is not working properly. All it does it check that the address is in range of the LoadObject by looking at the start address and the size. The failure to find the proper LoadObject is an indication that the size for the LoadObject is too small. I then modified ClhsdbFindPC.java to execute the clhsdb "pmap" command before the above failed "findpc" command. The following is part of the output: hsdb> + pmap ... 0x00007f6eff54f000 33476K /home/shade/jdk/build/linux-x86_64-server-fastdebug/images/jdk/lib/server/libjvm.so 0x00007f6f01736000 4904K /lib/x86_64-linux-gnu/libc-2.24.so ... hsdb> + findsym MaxJNILocalCapacity 0x00007f6f01665688 And some simple math show the MaxJNILocalCapacity address is a ways beyond what SA thinks is the end of libjvm.so: (gdb) p /x 0x00007f6eff54f000 + 33476*1024 $7 = 0x7f6f01600000 (gdb) p /x 0x00007f6f01665688 - $7 $8 = 0x65688 (gdb) p $8 $9 = 415368 So the libjvm.so size is too small by at least 415368 bytes.
11-03-2021