JDK-8314133 : sadebugd tests timeout on OSX unless run as root
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc-agent
  • Affected Version: 22
  • Priority: P4
  • Status: New
  • Resolution: Unresolved
  • OS: os_x
  • Submitted: 2023-08-10
  • Updated: 2023-08-11
Description
The sadebugd tests on OSX normally have to be run as root. The reason is described in the following comment:

    /**
     * This tests has issues if you try adding privileges on OSX. The debugd process cannot
     * be killed if you do this (because it is a root process and the test is not), so the destroy()
     * call fails to do anything, and then waitFor() will time out. If you try to manually kill it with
     * a "sudo kill" command, that seems to work, but then leaves the LingeredApp it was
     * attached to in a stuck state for some unknown reason, causing the stopApp() call
     * to timeout. For that reason we don't run this test when privileges are needed. Note
     * it does appear to run fine as root, so we still allow it to run on OSX when privileges
     * are not required.
     */
    public static void validateSADebugDPrivileges() {
        if (needsPrivileges()) {
            throw new SkippedException("Cannot run this test on OSX if adding privileges is required.");
        }
    }

"privileges" means running with sudo, and needsPrivileges() will return true if we are not currently running as root. So basically we skip these sadebugd tests if we are not root because running with sudo causes issues.

I recently discovered that instead of using sudo, you can put the OSX host in "developer mode". See JDK-8313357. So I added some logic to SATestUtils to allow tests to be run without root or sudo if the host is in developer mode. This seems to have worked well, and I was hoping it would also resolve this sadebugd test issue, but it has not. If you run the sadebugd tests while in developer mode (and not as root), the issue with LingeredApp described above still happens.
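The skip decision described so far can be summarized as a small decision function. This is only a sketch and not the actual SATestUtils code: the class name, method names, and boolean parameters are hypothetical, and it assumes that privileges are needed only on OSX when the user is not root and the host is not in developer mode.

```java
public class PrivilegeSketch {
    // Hypothetical stand-in for needsPrivileges(): assumes privileges (sudo) are
    // only needed on OSX when we are not root and not in developer mode.
    static boolean needsPrivileges(boolean isOsx, boolean isRoot, boolean developerMode) {
        return isOsx && !isRoot && !developerMode;
    }

    // Mirrors the check in validateSADebugDPrivileges(): the sadebugd tests are
    // skipped exactly when privileges would have to be added.
    static boolean sadebugdTestIsSkipped(boolean isOsx, boolean isRoot, boolean developerMode) {
        return needsPrivileges(isOsx, isRoot, developerMode);
    }

    public static void main(String[] args) {
        // Non-root with sudo required: the test is skipped.
        System.out.println("non-root, no developer mode: skipped="
                + sadebugdTestIsSkipped(true, false, false));
        // Non-root in developer mode: the test is not skipped, so it runs and
        // can hit the tear down hang described in this report.
        System.out.println("non-root, developer mode:    skipped="
                + sadebugdTestIsSkipped(true, false, true));
        // Root: the test is not skipped and appears to run fine.
        System.out.println("root:                        skipped="
                + sadebugdTestIsSkipped(true, true, false));
    }
}
```

Under these assumptions, developer mode is the one configuration in which the test is no longer skipped but can still hang.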

Here are more details, using an example failure with ClhsdbAttachToDebugServer. The test is pretty simple, but involves 4 processes, which can make it somewhat confusing to follow. The first process is the test process. It first creates a debuggee process (LingeredApp). It then spawns an sadebugd process that attaches to the debuggee process. This is done by using "jhsdb debugd --pid <LingeredApp_pid>". It then launches clhsdb and has it connect to the sadebugd process. This is done using "jhsdb clhsdb" and then issuing the "attach localhost" command to connect to the sadebugd server. Then a series of simple clhsdb commands are executed before issuing the "quit" command and tearing down the test. For the most part this all seems to work. However, there are issues with the test tear down. It gets stuck here:
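For reference, the order of the process launches described above can be sketched as command lines. This is an illustration only and not the test's actual code: the class and method names are hypothetical, the jhsdb path is a placeholder, and only the arguments quoted in this description are used.

```java
import java.util.List;

public class SaDebugdCommands {
    // Step 2: the debug server attaches to the already running debuggee
    // (LingeredApp) by pid: "jhsdb debugd --pid <LingeredApp_pid>".
    static List<String> debugdCommand(String jhsdb, long debuggeePid) {
        return List.of(jhsdb, "debugd", "--pid", Long.toString(debuggeePid));
    }

    // Step 3: the command line debugger is started with "jhsdb clhsdb".
    static List<String> clhsdbCommand(String jhsdb) {
        return List.of(jhsdb, "clhsdb");
    }

    // Commands typed into clhsdb: connect to the sadebugd server, then quit.
    // The real test runs a series of simple commands between these two.
    static List<String> clhsdbSession() {
        return List.of("attach localhost", "quit");
    }

    public static void main(String[] args) {
        String jhsdb = "jhsdb";   // placeholder: assumes jhsdb is on the PATH
        long debuggeePid = 12345; // placeholder pid of the LingeredApp process
        System.out.println(String.join(" ", debugdCommand(jhsdb, debuggeePid)));
        System.out.println(String.join(" ", clhsdbCommand(jhsdb)));
        clhsdbSession().forEach(cmd -> System.out.println("  clhsdb> " + cmd));
    }
}
```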

"AgentVMThread" #21 [40195] prio=5 os_prio=31 cpu=124.98ms elapsed=118.53s tid=0x000000012d0ff210 nid=40195 waiting on condition  [0x000000016dc3a000]
   java.lang.Thread.State: TIMED_WAITING (parking)
Thread: 0x000000012d0ff210  [0x9d03] State: _at_safepoint _at_poll_safepoint 0
   JavaThread state: _thread_blocked
	at jdk.internal.misc.Unsafe.park(java.base@22-internal/Native Method)
	- parking to wait for  <0x00000005df5dcb30> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@22-internal/LockSupport.java:269)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@22-internal/AbstractQueuedSynchronizer.java:1758)
	at java.lang.ProcessImpl.waitFor(java.base@22-internal/ProcessImpl.java:440)
	at jdk.test.lib.apps.LingeredApp.waitAppTerminate(LingeredApp.java:249)
	at jdk.test.lib.apps.LingeredApp.stopApp(LingeredApp.java:421)
	at jdk.test.lib.apps.LingeredApp.stopApp(LingeredApp.java:515)
	at ClhsdbAttachToDebugServer.main(ClhsdbAttachToDebugServer.java:99)
	at java.lang.invoke.LambdaForm$DMH/0x00000088000c0000.invokeStatic(java.base@22-internal/LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$MH/0x0000008800141800.invoke(java.base@22-internal/LambdaForm$MH)
	at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@22-internal/Invokers$Holder)
	at jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(java.base@22-internal/DirectMethodHandleAccessor.java:154)
	at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(java.base@22-internal/DirectMethodHandleAccessor.java:103)
	at java.lang.reflect.Method.invoke(java.base@22-internal/Method.java:580)
	at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:333)
	at java.lang.Thread.runWith(java.base@22-internal/Thread.java:1583)
	at java.lang.Thread.run(java.base@22-internal/Thread.java:1570)

So it is having trouble stopping the LingeredApp process. However, the LingeredApp process no longer exists. Neither does the sadebugd process or the clhsdb process (which is expected at this point in the test). They have all terminated already, so it's unclear why Process.waitFor() of the LingeredApp process is hanging.
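One way to make this mismatch visible is to compare what Process.waitFor() reports with what the OS reports for the same pid. The following is a minimal sketch, not part of the test: the class, the Report type, and the helper names are hypothetical, and it uses a short-lived "java -version" child in place of LingeredApp.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class WaitForDiagnostics {
    /** What the parent observed about one child process. */
    static final class Report {
        final long pid;
        final boolean waitForReturned;    // waitFor() saw the exit before the timeout
        final boolean aliveAccordingToOs; // a live process with this pid is still visible

        Report(long pid, boolean waitForReturned, boolean aliveAccordingToOs) {
            this.pid = pid;
            this.waitForReturned = waitForReturned;
            this.aliveAccordingToOs = aliveAccordingToOs;
        }

        @Override
        public String toString() {
            return "pid=" + pid + " waitForReturned=" + waitForReturned
                    + " aliveAccordingToOs=" + aliveAccordingToOs;
        }
    }

    // Starts a child that exits on its own; a stand-in for the debuggee.
    static Process startShortLivedChild() {
        String java = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        try {
            return new ProcessBuilder(java, "-version")
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Waits with a timeout, then asks the OS whether the pid is still alive.
    static Report observe(Process p, long timeoutSeconds) {
        long pid = p.pid();
        boolean exited;
        try {
            exited = p.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            exited = false;
        }
        boolean alive = ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
        return new Report(pid, exited, alive);
    }

    public static void main(String[] args) {
        System.out.println(observe(startShortLivedChild(), 30));
    }
}
```

For a child that exits normally, waitForReturned should be true and aliveAccordingToOs false. The hang described here would presumably show up as both values being false: the OS no longer sees the process, yet waitFor() never observes the exit.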

Another thing that is odd is that after a timeout period (I think 6 minutes), a prompt appears for the user's password, as if a sudo command were being executed or some other OS-related permission were being requested. After entering the password, the test exits and reports a timeout error, although the log shows that, other than the tear down issues, the test completed properly.

And one final oddity is that after the test exits, "jcmd <pid> Thread.print" of the LingeredApp process returns an error of "Connection refused". Normally if a process does not exist the error is "No such process", but for some reason in this case you get "Connection refused". I confirmed with ps and Activity Monitor that the process does not exist.

My conclusion here is that because sadebugd attached to the LingeredApp process, the LingeredApp process somehow becomes disassociated from the test process that created it. So when it exits, Process.waitFor() can no longer tell the state of the LingeredApp process. One reason I think this is that I know when SA attaches to a debuggee process, the SA process becomes the parent of the debuggee process. However, that alone does not fully explain the problem, since the same process ownership transfer happens with non-sadebugd tests, and those tests have no issues with calling LingeredApp.stopApp().