The JVM can stall, with high system time observed, if a Runtime.exec call suspends the VM thread while it is returning from the mprotect system call that sets the safepoint polling page to read-only. The Java threads that are writing to the polling page then loop in the SEGV signal handler until the VM thread is unsuspended and can set the polling page protection back to read/write. This condition can persist on systems with a small number of CPUs and a large number of Java threads touching the polling page. The signal handler calls __lwp_sigmask to block further signals, which serializes the threads on the process lock and produces the high system time observed. A gcore of the process will show the VM thread in the mprotect system call setting the page to PROT_READ, a Java thread in ___lwp_suspend from a fork1 system call, and numerous Java threads in the SEGV signal handler calling __lwp_sigmask. The JVM has been observed in this condition for periods ranging from several seconds to hours.
Note that the comments indicate that the polling page is involved, while the stack suggests it's the thread state serialization page, so +UseMembar might provide relief. Relatedly, bugs were recently discovered and fixed in the UseMembar code, involving the offset being accessed within the serialization page.
Interestingly, we encountered Solaris scheduler starvation before (6518490) and removed almost all spin/yield loops in the JVM. To that end, serialize_thread_states() now acquires a lock around the mprotect() calls that change protections on the serialization page. See os::serialize_thread_states(). Relatedly, the code in the signal handler should acquire and release that same lock if the faulting address proves to be within the serialization page. That should be sufficient to stop most of the "implicit" looping behavior. (It's an implicit loop in the sense that we unwind out of the signal handler, restart the offending instruction, and might trap again. We'd very much like to avoid such looping because of the starvation issues, and because such traps require the kernel to grab the address space lock -- usually incandescent -- to triage the fault.)
What's odd about this case is that we're seeing looping at all. Generally, I'd expect the lock mentioned above to preclude such behavior, suggesting some interesting interaction with fork().
As an aside, this strikes me as something we could model and reproduce rather easily with simple C/C++ code.
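A minimal sketch of such a model follows, in plain C with POSIX threads. It is not HotSpot code, and every name in it (run_model, on_fault, page_lock, and so on) is hypothetical. One thread stands in for the VM thread and makes a page read-only for a while, and a worker thread keeps storing to that page. The sketch does not model fork1 or thread suspension: a 20 ms sleep stands in for the window in which the VM thread is suspended. With with_lock set to 0 the handler simply returns and the store traps again, which is the implicit loop. With with_lock set to 1 the handler waits on the same lock that is held around the mprotect() calls. Note that pthread_mutex_lock() is not formally async-signal-safe. It is used here only because the fault is synchronous and the faulting thread never holds the lock outside the handler.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static char *page;                 /* stand-in for the serialization page */
static size_t page_size;
static int use_lock;               /* 1: handler waits on page_lock */
static atomic_long faults;         /* traps taken on the page */
static atomic_int worker_running, stop;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < page || addr >= page + page_size) {
        signal(sig, SIG_DFL);      /* unrelated fault: crash on the retry */
        return;
    }
    atomic_fetch_add(&faults, 1);
    if (use_lock) {
        /* Block until the protection change has completed. */
        pthread_mutex_lock(&page_lock);
        pthread_mutex_unlock(&page_lock);
    }
    /* Returning restarts the faulting store, which may trap again. */
}

static void *worker(void *arg) {
    (void)arg;
    volatile int *slot = (volatile int *)page;
    while (!atomic_load(&stop)) {
        *slot = 1;                 /* traps while the page is read-only */
        atomic_store(&worker_running, 1);
    }
    return NULL;
}

static void pause_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns the number of traps the worker took, or -1 on setup failure. */
long run_model(int with_lock) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return -1;
    use_lock = with_lock;
    atomic_store(&faults, 0);
    atomic_store(&worker_running, 0);
    atomic_store(&stop, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);  /* some systems report SIGBUS instead */

    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;
    while (!atomic_load(&worker_running)) pause_ms(1);

    pthread_mutex_lock(&page_lock);
    mprotect(page, page_size, PROT_READ);
    while (atomic_load(&faults) == 0) pause_ms(1);  /* wait for first trap */
    pause_ms(20);                  /* window in which the "VM thread" is stalled */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
    pthread_mutex_unlock(&page_lock);

    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    long n = atomic_load(&faults);
    signal(SIGSEGV, SIG_DFL);
    signal(SIGBUS, SIG_DFL);
    munmap(page, page_size);
    return n;
}

#ifdef MODEL_MAIN
int main(void) {
    printf("traps without lock: %ld\n", run_model(0));
    printf("traps with lock:    %ld\n", run_model(1));
    return 0;
}
#endif
```

Built with something like `cc -pthread -DMODEL_MAIN model.c`, the run without the lock should report many traps for the single worker, while the run with the lock should report one, since the handler sleeps on the mutex instead of re-executing the store. Reproducing the fork() interaction itself would need more than this, for example many workers on few CPUs and a real suspension of the protecting thread.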
Note that the comments suggest the s