JDK-8178536 : OOM ERRORS + SERVICE-THREAD TAKES A PROCESSOR TO 100%
Type:Bug
Component:hotspot
Sub-Component:svc
Affected Version:8u112
Priority:P3
Status:Resolved
Resolution:Fixed
Submitted:2017-04-12
Updated:2018-07-02
Resolved:2017-06-16
ServiceThread spinning in an infinite loop at 100% CPU
After the OutOfMemoryErrors are thrown, the worker threads wait for the
next requests, but the Service Thread drives one processor to 100%.
Comments
Approved for JDK 9.
17-06-2017
Fix Request
This is an escalated bug against 8u112 and the fix needs to be integrated into 10, 9 and 8uxx repositories. Here are the details on the problem and the solution:
Problem: If there are listeners installed for MemoryMXBean, then in the event of an OOME the JVM-internal 'Service Thread', which is responsible for delivering notifications to the Java threads, may itself encounter an OOM exception and get into a loop processing its pending requests. This happens when the Service Thread, while executing native code, calls its corresponding Java methods and hits an OOM exception: the pending exception makes the thread exit early from the SensorInfo::trigger() function before it can update its pending requests counter (_pending_trigger_count). The pending exception is never cleared, and that makes the thread loop in LowMemoryDetector::process_sensor_changes().
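The spin described above can be reproduced in a toy model. The sketch below is not HotSpot code: ToyThread, ToySensor, toy_java_upcall, toy_trigger_unfixed and toy_service_loop are hypothetical names invented for illustration, and the iteration cap exists only so the model terminates. It shows how an early return on a pending exception leaves the pending counter untouched, so the loop condition never becomes false.

```cpp
#include <cassert>

// Toy model only: none of these types are the real HotSpot classes.
struct ToyThread {
    bool pending_exception = false;   // models a pending exception on the thread
};

struct ToySensor {
    int pending_trigger_count = 0;    // models _pending_trigger_count
    int sensor_count = 0;             // models _sensor_count
};

// Models the upcall into Java, which fails with an OOM when the heap is exhausted.
inline void toy_java_upcall(ToyThread& thread, bool heap_exhausted) {
    if (heap_exhausted) {
        thread.pending_exception = true;
    }
}

// Models the pre-fix shape of the trigger path: a pending exception causes an
// early return before the counters are updated, and nothing clears it.
inline void toy_trigger_unfixed(ToySensor& sensor, int count,
                                ToyThread& thread, bool heap_exhausted) {
    assert(count >= 0);
    toy_java_upcall(thread, heap_exhausted);
    if (thread.pending_exception) {
        return;                       // early exit, counters left untouched
    }
    sensor.sensor_count += count;
    sensor.pending_trigger_count -= count;
}

// Runs the model service loop until no work is pending or the cap is reached.
// Returns the number of iterations executed.
inline int toy_service_loop(ToySensor& sensor, ToyThread& thread,
                            bool heap_exhausted, int max_iterations) {
    int iterations = 0;
    while (sensor.pending_trigger_count > 0 && iterations < max_iterations) {
        toy_trigger_unfixed(sensor, sensor.pending_trigger_count,
                            thread, heap_exhausted);
        ++iterations;
    }
    return iterations;
}
```

With heap_exhausted set to true the loop runs until the cap and pending_trigger_count never changes; with it set to false a single iteration drains the pending work.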
Hotspot changes:
http://cr.openjdk.java.net/~poonam/8178536/webrev.hotspot/
These changes check for the pending exception and clear it, and make sure that the pending requests counters are kept in sync on both the Java and the VM side.
JDK changes:
http://cr.openjdk.java.net/~poonam/8178536/webrev.jdk/
These changes make the no-argument triggerAction() a no-op: the VM needs to call this method when MemoryService::create_MemoryUsage_obj() encounters an OOM exception, and we want to avoid the further OOM exceptions that triggerAction(MemoryUsage) could raise.
Please approve this fix for inclusion into JDK 9. This is a no-risk change and adds proper handling of the pending OOM exception observed by the low memory detector. The changes have been reviewed by Mandy Chung and Daniel Daugherty.
17-06-2017
I thought this issue was targeted for JDK 10. You pushed it to JDK 9 without JDK 9 Fix Request Approval.
17-06-2017
Moved from hotspot/runtime -> hotspot/svc.
The ServiceThread in this situation is on the JVM side of
Monitoring and Management support.
07-06-2017
Instead of catching the exception on the Java side, I suggest making the following changes to handle the pending exception on the VM side in SensorInfo::trigger():
1. For the OOM exception that we could encounter in:
Handle usage_h = MemoryService::create_MemoryUsage_obj(_usage, CHECK);
Here, we can change CHECK to CHECK_CLEAR. With that, we would return from VM's SensorInfo::trigger() without calling the Java Sensor::trigger(). The Java and the VM state will stay in sync and we would have the exception cleared.
Handle usage_h = MemoryService::create_MemoryUsage_obj(_usage, CHECK_CLEAR);
2. For
    JavaCalls::call_virtual(&result,
                            sensorKlass,
                            vmSymbols::trigger_name(),
                            vmSymbols::trigger_method_signature(),
                            &args,
                            CHECK);
I suggest making the following change:
    JavaCalls::call_virtual(&result,
                            sensorKlass,
                            vmSymbols::trigger_name(),
                            vmSymbols::trigger_method_signature(),
                            &args,
+                           THREAD);
+   if (HAS_PENDING_EXCEPTION) {
+     // tty->print_cr("Pending exception after Java trigger() call...");
+     // we just clear the OOM pending exception that we might have encountered in Java's triggerAction(),
+     // and continue with updating the counters since the Java counters have been updated too.
+     assert((PENDING_EXCEPTION->is_a(SystemDictionary::OutOfMemoryError_klass())), "we expect only an OOM error here");
+     CLEAR_PENDING_EXCEPTION;
+   }
    {
      // Holds Service_lock and update the sensor state
      MutexLockerEx ml(Service_lock, Mutex::_no_safepoint_check_flag);
      _sensor_on = true;
      _sensor_count += count;
      _pending_trigger_count = _pending_trigger_count - count;
    }
With this change, we clear the exception (to avoid the ServiceThread's spinning) and continue with updating the counters to keep the Java and VM state in sync.
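The two suggestions in this comment can be sketched together in a toy model. This is an illustration, not the HotSpot change: SketchThread, SketchJavaSensor, SketchVmSensor, sketch_java_trigger and sketch_vm_trigger are hypothetical names, and the alloc_oom and upcall_oom flags are assumptions used to inject the two failure points. Step 1 mirrors the CHECK_CLEAR idea (clear and return before the Java side is called), and step 2 mirrors the clear-and-continue idea (clear after the Java call, then update the VM counters).

```cpp
#include <cassert>

// Toy stand-ins; none of these are the real HotSpot or JDK types.
struct SketchThread { bool pending_oom = false; };   // models the pending exception
struct SketchJavaSensor { int count = 0; };           // models the Java-side counter
struct SketchVmSensor {
    bool sensor_on = false;
    int sensor_count = 0;
    int pending_trigger_count = 0;
};

// Models the Java trigger(): the Java counter is updated first, then the
// notification work may fail with an OOM.
inline void sketch_java_trigger(SketchJavaSensor& java, int count,
                                SketchThread& thread, bool upcall_oom) {
    java.count += count;
    if (upcall_oom) {
        thread.pending_oom = true;
    }
}

// Models the suggested VM-side trigger().
// Returns true if the Java side was called.
inline bool sketch_vm_trigger(SketchVmSensor& vm, SketchJavaSensor& java, int count,
                              SketchThread& thread, bool alloc_oom, bool upcall_oom) {
    assert(count >= 0);

    // Step 1 (CHECK_CLEAR-style): creating the usage object may fail.
    if (alloc_oom) {
        thread.pending_oom = true;
    }
    if (thread.pending_oom) {
        thread.pending_oom = false;   // clear the exception
        return false;                 // neither side has changed its counters
    }

    // Step 2: call the Java side, then clear any exception and carry on.
    sketch_java_trigger(java, count, thread, upcall_oom);
    if (thread.pending_oom) {
        thread.pending_oom = false;   // models CLEAR_PENDING_EXCEPTION
    }

    // The VM counters are always updated once the Java side has run.
    vm.sensor_on = true;
    vm.sensor_count += count;
    vm.pending_trigger_count -= count;
    return true;
}
```

In both failure cases the model leaves no exception pending, and the Java-side and VM-side counts agree afterwards: both unchanged when the allocation fails, both advanced when only the upcall fails.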
28-05-2017
SensorInfo::trigger and SensorInfo::clear are called for low memory detection. This is the VM implementation behind java.lang.management.MemoryPoolMXBean for the memory usage and GC notifications. The threshold is set on the Java side and passed to the VM for monitoring. The VM notifies the Java side via sun.management.Sensor objects.
Objects are allocated during the notification in the sun.management.Sensor::trigger method. In the current implementation, no object allocation happens when clearing the sensor. The sensor state is also maintained on the VM side, and it has to be kept in sync with the changes on the Java side.
One possible fix is to have:
1. SensorInfo::trigger catches and clears any exception thrown in Sensor.trigger(int, MemoryUsage) method call.
If this happens, it means that the notification is not sent. The VM may want to log this cleared exception for troubleshooting. The sensor state on both the VM and Java sides is updated regardless of whether the request is processed successfully.
Although Sensor.clear() does not do any object allocation, there is no harm to catch and clear pending exception thrown by Sensor.clear().
2. A MemoryUsage object is created and passed to the Sensor::trigger method call. An OOME may be thrown.
Handle usage_h = MemoryService::create_MemoryUsage_obj(_usage, CHECK);
The VM should check for and clear the OOME (or maybe assert that it is an OOME). In this case, one approach is to continue updating the sensor counters but drop the notification (as in situation #1). The VM can call the Sensor::trigger(int) method if it fails to create the MemoryUsage object. The triggerAction() method in the PoolSensor and CollectionSensor classes [1] needs to be changed to a no-op. The VM can log this exception case, as in #1, to help troubleshooting.
[1] http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/2d94659f7ff3/src/java.management/share/classes/sun/management/MemoryPoolImpl.java#l292
3. Note that ServiceThread also calls other Java methods. They may all run into an OOME, and similar issues need to be resolved there as well.
    if (sensors_changed) {
      LowMemoryDetector::process_sensor_changes(jt);
    }

    if(has_gc_notification_event) {
      GCNotifier::sendNotification(CHECK);
    }

    if(has_dcmd_notification_event) {
      DCmdFactory::send_notification(CHECK);
    }

    if (acs_notify) {
      AllocationContextService::notify(CHECK);
    }
26-05-2017
Here is what is going wrong with the ServiceThread: it gets stuck in
LowMemoryDetector::process_sensor_changes(TRAPS), and this happens
because SensorInfo::trigger() and SensorInfo::clear() fail to update the
values of _pending_trigger_count and _pending_clear_count.
Debugging showed that since clear() calls sun_management_Sensor_klass() with
CHECK, the function returns without setting _pending_clear_count to 0
because there is a pending exception on the thread.
void SensorInfo::clear(int count, TRAPS) {
  if (_sensor_obj != NULL) {
    Klass* k = Management::sun_management_Sensor_klass(CHECK);
    instanceKlassHandle sensorKlass (THREAD, k);
    Handle sensor(THREAD, _sensor_obj);

    JavaValue result(T_VOID);
    JavaCallArguments args(sensor);
    args.push_int((int) count);
    JavaCalls::call_virtual(&result,
                            sensorKlass,
                            vmSymbols::clear_name(),
                            vmSymbols::int_void_signature(),
                            &args,
                            CHECK);
  }

  {
    // Holds Service_lock and update the sensor state
    MutexLockerEx ml(Service_lock, Mutex::_no_safepoint_check_flag);
    _sensor_on = false;
    _pending_clear_count = 0;
    _pending_trigger_count = _pending_trigger_count - count;
  }
}
(gdb) disassemble
Dump of assembler code for function SensorInfo::clear(int, Thread*):
0x00007f120c46f9f0 <+0>: push %rbp
0x00007f120c46f9f1 <+1>: mov %rsp,%rbp
0x00007f120c46f9f4 <+4>: mov %rbx,-0x20(%rbp)
0x00007f120c46f9f8 <+8>: mov %r12,-0x18(%rbp)
0x00007f120c46f9fc <+12>: mov %rdi,%rbx
0x00007f120c46f9ff <+15>: mov %r13,-0x10(%rbp)
0x00007f120c46fa03 <+19>: mov %r14,-0x8(%rbp)
0x00007f120c46fa07 <+23>: sub $0xc0,%rsp
0x00007f120c46fa0e <+30>: cmpq $0x0,(%rdi)
0x00007f120c46fa12 <+34>: mov %esi,%r13d
0x00007f120c46fa15 <+37>: mov %rdx,%r12
0x00007f120c46fa18 <+40>: je 0x7f120c46fad0 <SensorInfo::clear(int,Thread*)+224>
0x00007f120c46fa1e <+46>: mov %rdx,%rdi
0x00007f120c46fa21 <+49>: callq 0x7f120c4b0a10 <Management::sun_management_Sensor_klass(Thread*)>
0x00007f120c46fa26 <+54>: cmpq $0x0,0x8(%r12)
0x00007f120c46fa2c <+60>: mov %rax,%r14
0x00007f120c46fa2f <+63>: je 0x7f120c46fa48 <SensorInfo::clear(int,Thread*)+88>
0x00007f120c46fa31 <+65>: mov -0x20(%rbp),%rbx
0x00007f120c46fa35 <+69>: mov -0x18(%rbp),%r12
0x00007f120c46fa39 <+73>: mov -0x10(%rbp),%r13
0x00007f120c46fa3d <+77>: mov -0x8(%rbp),%r14
0x00007f120c46fa41 <+81>: leaveq
=> 0x00007f120c46fa42 <+82>: retq
0x00007f120c46fa43 <+83>: nopl 0x0(%rax,%rax,1)
0x00007f120c46fa48 <+88>: mov (%rbx),%rdx
0x00007f120c46fa4b <+91>: lea -0x30(%rbp),%rdi
0x00007f120c46fa4f <+95>: mov %r12,%rsi
0x00007f120c46fa52 <+98>: callq 0x7f120c02b720 <Handle::Handle(Thread*, oop)>
0x00007f120c46fa57 <+103>: mov -0x30(%rbp),%rax
0x00007f120c46fa5b <+107>: lea -0xc0(%rbp),%r8
0x00007f120c46fa62 <+114>: movl $0xe,-0x40(%rbp)
I think we should be updating the values of these flags before making the Java call on the sensor object.
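The ordering suggested here can be sketched as a toy model of clear(). It is an illustration only: DemoThread, DemoSensor and demo_clear_update_first are hypothetical names, and upcall_fails is an assumed flag that injects the failure. The sketch shows that, with the bookkeeping done first, an early return from the Java call can no longer skip it. It also shows that this ordering alone leaves the exception pending, which a caller would still have to deal with.

```cpp
#include <cassert>

// Toy stand-ins; not the real HotSpot types.
struct DemoThread { bool pending = false; };   // models the pending exception
struct DemoSensor {
    bool sensor_on = true;
    int pending_clear_count = 0;
    int pending_trigger_count = 0;
};

// Models clear() with the VM-side state updated before the Java call.
inline void demo_clear_update_first(DemoSensor& sensor, int count,
                                    DemoThread& thread, bool upcall_fails) {
    assert(count >= 0);

    // Bookkeeping first, so no later early return can skip it.
    sensor.sensor_on = false;
    sensor.pending_clear_count = 0;
    sensor.pending_trigger_count -= count;

    // Models the Java call made with CHECK: on failure the function returns
    // early and the exception stays pending on the thread.
    if (upcall_fails) {
        thread.pending = true;
        return;
    }
}
```

Even when the upcall fails, pending_clear_count ends at 0 in this model, so a loop that waits on that counter would not spin on it.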