JDK-8177091 : Improve performance of G1SATBCardTableModRefBS::write_ref_array_pre_work
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • Submitted: 2017-03-19
  • Updated: 2021-07-02
  • Resolved: 2017-03-20
Related Reports
Duplicate :  
Description
G1 is much slower than Parallel GC in various simple System.arraycopy microbenchmarks with Object[] targets[1]. Although some of this might be unavoidable because G1 has heavier write barriers, profiling indicates that the code in G1SATBCardTableModRefBS::write_ref_array_pre_work may be insufficiently optimized/inlined.

A quick experiment that hoists the calls to SATBMarkQueue::enqueue directly into write_ref_array_pre_work improves G1 performance on these microbenchmarks by 3x, which may be a straightforward way to narrow the gap in these corner cases:

diff -r b01c519b715e src/share/vm/gc/g1/g1SATBCardTableModRefBS.cpp
--- a/src/share/vm/gc/g1/g1SATBCardTableModRefBS.cpp	Thu Mar 16 12:09:14 2017 -0700
+++ b/src/share/vm/gc/g1/g1SATBCardTableModRefBS.cpp	Sun Mar 19 20:09:31 2017 +0100
@@ -60,10 +60,22 @@
 G1SATBCardTableModRefBS::write_ref_array_pre_work(T* dst, int count) {
   if (!JavaThread::satb_mark_queue_set().is_active()) return;
   T* elem_ptr = dst;
-  for (int i = 0; i < count; i++, elem_ptr++) {
-    T heap_oop = oopDesc::load_heap_oop(elem_ptr);
-    if (!oopDesc::is_null(heap_oop)) {
-      enqueue(oopDesc::decode_heap_oop_not_null(heap_oop));
+  Thread* thr = Thread::current();
+  if (thr->is_Java_thread()) {
+    JavaThread* jt = (JavaThread*)thr;
+    SATBMarkQueue& smq = jt->satb_mark_queue();
+    for (int i = 0; i < count; i++, elem_ptr++) {
+      T heap_oop = oopDesc::load_heap_oop(elem_ptr);
+      if (!oopDesc::is_null(heap_oop)) {
+        smq.enqueue(oopDesc::decode_heap_oop_not_null(heap_oop));
+      }
+    }
+  } else {
+    for (int i = 0; i < count; i++, elem_ptr++) {
+      T heap_oop = oopDesc::load_heap_oop(elem_ptr);
+      if (!oopDesc::is_null(heap_oop)) {
+        enqueue(oopDesc::decode_heap_oop_not_null(heap_oop));
+      }
     }
   }
 }


[1]
    private static final Object[] TEST_OBJECTS = new Object[200];
    public Object[] dummyObjectArray = new Object[TEST_OBJECTS.length];

    @Benchmark
    public void arrayCopyObject() {
        System.arraycopy(TEST_OBJECTS, 0, dummyObjectArray, 0, dummyObjectArray.length);
    }
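
For a quick check without a JMH harness, the benchmark above can be approximated by a standalone class along the following lines. This is only a rough sketch: the class name ArrayCopyObjectSketch, the timeCopies helper, and the iteration count are hypothetical, and filling the source array with non-null objects is an assumption, since the snippet above does not show how TEST_OBJECTS is populated. It has none of JMH's warmup or dead-code protection, so the numbers are indicative at best. Running it once with -XX:+UseG1GC and once with -XX:+UseParallelGC gives a coarse comparison.

```java
public class ArrayCopyObjectSketch {
    // Same size as in the benchmark above.
    private static final Object[] TEST_OBJECTS = new Object[200];

    static {
        // Assumption: the original snippet does not show how the array is
        // filled; non-null elements are used here.
        for (int i = 0; i < TEST_OBJECTS.length; i++) {
            TEST_OBJECTS[i] = new Object();
        }
    }

    // Hypothetical helper: copies src into dst 'iterations' times and
    // returns the elapsed wall-clock time in nanoseconds.
    static long timeCopies(Object[] src, Object[] dst, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            System.arraycopy(src, 0, dst, 0, dst.length);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Object[] dummyObjectArray = new Object[TEST_OBJECTS.length];
        int iterations = 5_000_000; // arbitrary; adjust as needed

        // Crude warmup so the copy loop is likely to be JIT-compiled
        // before the timed run.
        timeCopies(TEST_OBJECTS, dummyObjectArray, iterations);

        long nanos = timeCopies(TEST_OBJECTS, dummyObjectArray, iterations);
        System.out.println("ns per arraycopy: " + ((double) nanos / iterations));
    }
}
```

Any difference seen this way should be confirmed with a proper JMH run before drawing conclusions.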


Comments
The improvement I thought I got from simply restructuring the code appears to have been an experimental error. There is a real performance issue here when comparing G1 to Parallel GC, but it seems to be already captured by JDK-8028337.
20-03-2017