JDK-7196911 : command line length affects performance

Details
Type:
Enhancement
Submit Date:
2012-09-07
Status:
Open
Updated Date:
2016-06-03
Project Name:
JDK
Resolved Date:
Component:
hotspot
OS:
linux_oracle_6.0
Sub-Component:
gc
CPU:
x86
Priority:
P3
Resolution:
Unresolved
Affected Versions:
8
Targeted Versions:
10

Related Reports

Sub Tasks

Description
When testing nonpermgen performance, we noticed that the score (throughput) changes when the command line length changes, even though the added flags are set to their default values.

This happens only on Linux (not Solaris). We are using OEL 6.0 on Intel Sandy Bridge. The workload is SPECjbb2005.

The command line length affects performance differently on different builds. We tested the nonpermgen build, the reference build for nonpermgen (ref build with permgen), and jdk7u6.

cmd1:  -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:-UseCompressedOops

cmd2:  -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:-UseCompressedOops -XX:-UseSharedSpaces

cmd3: -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:-UseCompressedOops -Xmixed1234567801234

cmd2 sets -XX:-UseSharedSpaces, which is the same as the default value.
cmd3 does not set -XX:-UseSharedSpaces; it only changes the command line length.

runid  jvm version  cmd   score    change
527    nonpermgen   cmd1  338,718
528    nonpermgen   cmd2  322,747  -4.71%
526    nonpermgen   cmd3  320,297  -5.44%
532    ref          cmd1  322,959
533    ref          cmd2  340,853   5.54%
531    ref          cmd3  339,767   5.20%
587    7u6          cmd1  338,920
588    7u6          cmd2  321,424  -5.16%
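
For reference, the percentages in the table above are the relative change of each score against the cmd1 score of the same build. A minimal sketch of that arithmetic (the function name percent_change is only illustrative):

```cpp
#include <cassert>

// Relative change of `score` against `baseline`, in percent.
// Example: the cmd1 score of a build is the baseline for its cmd2 and cmd3 runs.
double percent_change(double baseline, double score) {
    assert(baseline != 0.0);  // a zero baseline has no meaningful relative change
    return (score - baseline) / baseline * 100.0;
}
```

For example, percent_change(338718, 322747) covers runs 527 and 528, and percent_change(322959, 340853) covers runs 532 and 533.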

Other observations:
When the throughput drops, the GC behavior changes as well.
For example, run 527 has 1417 PSYoungGen collections with an average GC time of 0.035, while run 528 has 1334 PSYoungGen collections with an average GC time of 0.053.

GC logs are available on request.

                                    

Comments
A fix for:
 JDK-8022880: False sharing between PSPromotionManager instances

has been pushed to the hotspot-gc repository:
 http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/9766f73e770d

                                     
2013-08-14
There are two areas for improvement to avoid this problem.

1) Fix the GC code that is susceptible to the false cache line sharing.

2) Fix the command line processing code to reduce the risk of disturbing cache lines.
I assume this means something like using a buffer (for holding the command line) that grows in
multiples of a cache line length to avoid changing alignments.
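
A rough sketch of what item 2 might look like: round the buffer capacity up to whole cache lines, so that a few extra characters on the command line do not change the size of the allocation. The 64-byte line size and the helper name are assumptions for illustration, not taken from the HotSpot sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical for x86 but is not queried here.
constexpr std::size_t kCacheLineSize = 64;

// Round `bytes` up to the next multiple of the cache line size.
std::size_t round_up_to_cache_line(std::size_t bytes) {
    assert(bytes <= SIZE_MAX - (kCacheLineSize - 1));  // guard against overflow
    return (bytes + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
}
```

With this rounding, command lines of 150 and 170 bytes would both get a 192-byte buffer, so allocations made after it would see the same offsets in both cases.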


                                     
2013-02-05
From an email from Vladimir:

There are two problems here. First, we can move the allocation of these structures from the C heap:

_manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);

to an Arena, to prevent earlier malloc allocations from affecting the alignment of these structures. We may also want to align the Arena's chunks (or the first allocation in a chunk) to the cache line size to avoid variations between runs.

The second problem is the size of the structures, which is n*(cache line size). We need padding (dummy fields) in such structures to avoid false cache line sharing.
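
To illustrate the padding idea, here is a minimal sketch using a hypothetical per-worker structure (PaddedWorkerState is not a HotSpot type, and the 64-byte line size is an assumption). alignas makes the compiler round the structure size up to a whole cache line, which has the same effect as adding dummy fields by hand.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache line size; not queried from the hardware.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical per-worker state. alignas pads the struct to a full cache line,
// so adjacent array elements never share a line.
struct alignas(kCacheLineSize) PaddedWorkerState {
    long promoted_objects = 0;
    long promoted_bytes = 0;
};

static_assert(sizeof(PaddedWorkerState) % kCacheLineSize == 0,
              "size must be a whole number of cache lines");

// Each worker updates only its own element, and therefore only its own cache line.
void record_promotion(PaddedWorkerState* workers, std::size_t count,
                      std::size_t worker_id, long bytes) {
    assert(worker_id < count);
    workers[worker_id].promoted_objects += 1;
    workers[worker_id].promoted_bytes += bytes;
}
```

Note that alignas only helps if the storage itself is aligned. Static and stack arrays are, but plain operator new is only guaranteed to honor over-alignment from C++17 onward.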
                                     
2013-02-05
Installed VTune and reproduced the issue with jdk8b66, with debug info not stripped from the build, so that we can see which line caused the issue.

Conclusion: It seems to affect only ParallelGC, since when an object goes through copy_to_survivor_space, the markOop of the new and old objects are swapped.

Analysis: A significant part of the false sharing comes from psPromotionManager.inline.hpp line 170

    // Now we have to CAS in the header.
    if (o->cas_forward_to(new_obj, test_mark)) {


This operation is defined in oop.inline.hpp
// Used by parallel scavengers
inline bool oopDesc::cas_forward_to(oop p, markOop compare) {
  assert(check_obj_alignment(p),
         "forwarding to something not aligned");
  assert(Universe::heap()->is_in_reserved(p),
         "forwarding to something not in heap");
  markOop m = markOopDesc::encode_pointer_as_mark(p);
  assert(m->decode_pointer() == p, "encoding must be reversable");
  return cas_set_mark(m, compare) == compare;
}

inline markOop oopDesc::cas_set_mark(markOop new_mark, markOop old_mark) {
  return (markOop) Atomic::cmpxchg_ptr(new_mark, &_mark, old_mark);
}

cmpxchg_ptr for Linux is defined in atomic_linux_x86.inline.hpp
inline void*    Atomic::cmpxchg_ptr(void*    exchange_value, volatile void*     dest, void*    compare_value) {
  return (void*)cmpxchg((jint)exchange_value, (volatile jint*)dest, (jint)compare_value);
}
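
For readers without the HotSpot sources at hand, the forwarding step above can be approximated in portable C++ as follows. This is only a sketch: Object, try_forward and forwardee are hypothetical names, and the 0x3 tag is taken from the `or $0x3` instruction in the disassembly rather than from the markOop definition.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an object with a header word; not HotSpot's oopDesc.
struct Object {
    std::atomic<std::uintptr_t> mark{0};
};

// Low bits that tag a header holding a forwarding pointer (assumed encoding).
constexpr std::uintptr_t kForwardedTag = 0x3;

// Try to install a forwarding pointer to `copy` in `obj`'s header.
// Returns true only for the thread whose compare-and-swap succeeds.
bool try_forward(Object& obj, Object* copy, std::uintptr_t expected_mark) {
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(copy);
    assert((bits & kForwardedTag) == 0);  // the tag bits must be free in the pointer
    return obj.mark.compare_exchange_strong(expected_mark, bits | kForwardedTag);
}

// Decode the forwarding pointer from a header that has been forwarded.
Object* forwardee(const Object& obj) {
    return reinterpret_cast<Object*>(obj.mark.load() & ~kForwardedTag);
}
```

The compare-and-swap needs exclusive ownership of the cache line that holds the header, which is why unrelated data on the same line becomes expensive when several GC threads execute this path.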


Supporting data from VTune:
from 170 with issues:		
Source Line	Source	CPU_CLK_UNHALTED.THREAD OFFCORE_RESPONSE.ALL_DEMAND_MLC_PREF_READS.LLC_MISS.REMOTE_HITM_HIT_FORWARD_1

172	    if (o->cas_forward_to(new_obj, test_mark)) {	53,601,574,999   61,000,540

assembly		
Code Location	Source Line	Assembly
0x82ee60	172	movq  0x58d5d1(%rip), %rax
0x82ee67	172	mov %r12, %rdx
0x82ee6a	172	or $0x3, %rdx
0x82ee6e	172	cmpl  $0x1, (%rax)
0x82ee71	172	mov %r13, %rax
0x82ee74	172	setnle %cl
0x82ee77	172	cmp $0x0, %cl
0x82ee7a	172	jz 0x82ee7d
		Block 45:
0x82ee7c	172	lock cmpxchgq  %rdx, (%rbx)
0x82ee81	172	cmp %rax, %r13
0x82ee84	172	jz 0x82ef98 <Block 64>

from 160 w/o issues		
Source Line	Source	CPU_CLK_UNHALTED.THREAD  OFFCORE_RESPONSE.ALL_DEMAND_MLC_PREF_READS.LLC_MISS.REMOTE_HITM_HIT_FORWARD_1
172	    if (o->cas_forward_to(new_obj, test_mark)) {	14,444,281,229  3,000,060
Code Location	Source Line	Assembly
0x82ee60	172	movq  0x58d5d1(%rip), %rax
0x82ee67	172	mov %r12, %rdx
0x82ee6a	172	or $0x3, %rdx
0x82ee6e	172	cmpl  $0x1, (%rax)
0x82ee71	172	mov %r13, %rax
0x82ee74	172	setnle %cl
0x82ee77	172	cmp $0x0, %cl
0x82ee7a	172	jz 0x82ee7d
		Block 45:
0x82ee7c	172	lock cmpxchgq  %rdx, (%rbx)
0x82ee81	172	cmp %rax, %r13
0x82ee84	172	jz 0x82ef98 <Block 64>


                                     
2013-02-04
Comments from vladimir.kozlov@oracle.com

There are two problems here. First, we can move the allocation of these structures from the C heap:

_manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);

to an Arena, to prevent earlier malloc allocations from affecting the alignment of these structures. We may also want to align the Arena's chunks (or the first allocation in a chunk) to the cache line size to avoid variations between runs.

The second problem is the size of the structures, which is n*(cache line size). We need padding (dummy fields) in such structures to avoid false cache line sharing.
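
As a rough illustration of the first point, the sketch below allocates storage that always starts on a cache line boundary, regardless of which malloc calls happened earlier. It uses POSIX posix_memalign as a stand-in and is not HotSpot's Arena; the helper names and the 64-byte line size are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Assumed cache line size; not queried from the hardware.
constexpr std::size_t kCacheLineSize = 64;

// Allocate at least `bytes`, rounded up to whole cache lines and starting on a
// cache line boundary. Returns nullptr on failure. Release with free().
void* allocate_cache_aligned(std::size_t bytes) {
    assert(bytes > 0);
    std::size_t rounded = (bytes + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
    void* p = nullptr;
    if (posix_memalign(&p, kCacheLineSize, rounded) != 0) {
        return nullptr;
    }
    return p;
}

// True if `p` sits exactly on a cache line boundary.
bool is_cache_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLineSize == 0;
}
```

Because the start address no longer depends on the sizes of earlier allocations, a longer or shorter command line should not shift the layout of structures allocated this way.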
                                     
2013-01-15
Worked with Intel on this.

MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM is much higher for the bad case.  This indicates the memory layout causes false sharing. 

We located PSPromotionManager as the source.

One possible cause:
void PSPromotionManager::initialize() {
  ParallelScavengeHeap* heap = (ParallelScavengeHeap*)Universe::heap();
  assert(heap->kind() == CollectedHeap::ParallelScavengeHeap, "Sanity");

  _old_gen = heap->old_gen();
  _young_space = heap->young_gen()->to_space();

  assert(_manager_array == NULL, "Attempt to initialize twice");
  _manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);
  guarantee(_manager_array != NULL, "Could not initialize promotion manager");

  _stack_array_depth = new OopStarTaskQueueSet(ParallelGCThreads);
  guarantee(_stack_array_depth != NULL, "Cound not initialize promotion manager");

  // Create and register the PSPromotionManager(s) for the worker threads.
  for(uint i=0; i<ParallelGCThreads; i++) {
    _manager_array[i] = new PSPromotionManager();
    guarantee(_manager_array[i] != NULL, "Could not create PSPromotionManager");
    stack_array_depth()->register_queue(i, _manager_array[i]->claimed_stack_depth());
  }

  // The VMThread gets its own PSPromotionManager, which is not available
  // for work stealing.
  _manager_array[ParallelGCThreads] = new PSPromotionManager();
  guarantee(_manager_array[ParallelGCThreads] != NULL, "Could not create PSPromotionManager");
}

_manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);
It is allocated from the C heap using the method AllocateHeap. In share/vm/runtime/arguments.cpp, the properties, system properties, JVM flags, etc. are saved using AllocateHeap. In jdk8b66 and later builds, where NMT is implemented, the memory reserved for the Internal category can change when the command line changes:
cmd: -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=detail -Xmixed123456789012345678
-                  Internal (reserved=25328KB, committed=25296KB)
                            (malloc=25296KB, #2957)
                            (mmap: reserved=32KB, committed=0KB)

 -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=detail
-                  Internal (reserved=25327KB, committed=25295KB)
                            (malloc=25295KB, #2949)
                            (mmap: reserved=32KB, committed=0KB)

But I feel there are more questions:
1. Since there are ParallelGCThreads+1 PSPromotionManager instances, is it _manager_array or the memory in each PSPromotionManager that causes the false sharing? How can we tell?
2. This is with ParallelGC; does it impact other collectors, such as G1?
3. Is there a way to prevent this?
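
One rough way to approach question 1 would be to compare addresses: log where _manager_array and each PSPromotionManager ended up, then check which of them land on the same cache line. The helpers below are a hypothetical sketch of that check (64-byte lines assumed), not existing HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; not queried from the hardware.
constexpr std::uintptr_t kCacheLineSize = 64;

// Offset of an address within its cache line (0 means line-aligned).
std::size_t cache_line_offset(const void* p) {
    assert(p != nullptr);
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % kCacheLineSize);
}

// True if both addresses fall on the same cache line.
bool same_cache_line(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
}
```

Comparing these values between a good run and a bad run would show whether the layout differs, though it would not by itself prove which line is contended.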
                                     
2012-12-19
Agree with David. For now, performance runs should take environment length into account. When we have cycles, it would be interesting to see if there are specific internal data cache-alignments that would be worth the per-platform tuning and potential footprint cost.
                                     
2012-11-06
Have you verified that the added command line options are not affecting other options? Use -XX:+PrintFlagsFinal to compare the flag settings for each run.
                                     
2012-11-05
I've witnessed something similar before in CVM. There's a microbenchmark in kBench that (for CVM) sat in a loop writing into 4 consecutive words on the stack. On a certain ARM device, as long as all writes were done in the same half of a 32-byte cache line, the data remained in the write buffer and was not even flushed to the cache line. As soon as a write occurred to a new half cache line, the data was flushed from the write buffer. Thus if the 4 words were 16-byte aligned, you got great performance. Otherwise the write buffer flushes to the data cache became a bottleneck and the benchmark slowed down considerably.
                                     
2012-11-05
Dave Dice pointed me to some fascinating research on measurement bias by Todd Mytkowicz at the University of Colorado. In particular the following paper

http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf

shows how Unix environment size can affect perceived performance results due to its influence on the program stack and thus the relative alignment of stack variables. As command-line args are stored in the program image, I believe that different length command lines would also impact data layout and/or alignment and so affect performance results.

However, before drawing any conclusions we would need to perform extensive analysis of our benchmarking methodology to ensure that we are measuring things appropriately, and to obtain sufficient samples to know the natural variance of these benchmarks on the systems on which we are running them.

The moral of this story may simply be that performance benchmarking must always ensure that the same length command-line (or environment strings, or .hotspotrc file) is always used when doing comparative runs.
                                     
2012-11-05
Note that the default values of flags are being explicitly added in some cases, while in others the extra length comes from an argument that is effectively ignored, e.g. -Xmixed1234567801234 matches -Xmixed and the numerals are ignored.
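
The prefix matching described here behaves roughly like the sketch below. match_prefix is a hypothetical helper written for illustration, not the actual argument parsing code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true if `option` begins with `name`. On success, `*tail` points just
// past the matched prefix, i.e. at the characters a prefix-only match ignores.
bool match_prefix(const char* option, const char* name, const char** tail) {
    assert(option != nullptr && name != nullptr && tail != nullptr);
    std::size_t len = std::strlen(name);
    if (std::strncmp(option, name, len) != 0) {
        return false;
    }
    *tail = option + len;
    return true;
}
```

With this kind of matching, "-Xmixed1234567801234" is accepted as "-Xmixed" and the trailing "1234567801234" is left in tail, so the flag changes the command line length without changing any setting.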
                                     
2012-11-05
Did 4 additional sets:
cmd1: average 463259, standard deviation: 5417
cmd2: average 476690, standard deviation: 11395
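
To put these two averages in context, one option is Welch's t statistic, which scales the difference of the means by the combined standard error. The sketch below assumes each average covers 4 runs; the comment suggests this but does not state it, so treat the run count as a guess.

```cpp
#include <cassert>
#include <cmath>

// Summary of one configuration: mean score, sample standard deviation, run count.
struct Sample {
    double mean;
    double stddev;
    int runs;  // assumed to be 4 for the numbers above
};

// Welch's t statistic for the difference b.mean - a.mean.
double welch_t(const Sample& a, const Sample& b) {
    assert(a.runs > 0 && b.runs > 0);
    double standard_error = std::sqrt(a.stddev * a.stddev / a.runs +
                                      b.stddev * b.stddev / b.runs);
    assert(standard_error > 0.0);
    return (b.mean - a.mean) / standard_error;
}
```

With cmd1 as {463259, 5417, 4} and cmd2 as {476690, 11395, 4}, the statistic comes out near 2.1. With so few runs that is suggestive at best, so more samples would be needed before drawing conclusions.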
                                     
2012-10-24
This certainly doesn't make much sense. The only thing the different command line length might affect is the overall memory layout, i.e. moving things by an extra page. That might lead to different caching behaviour. But this seems excessive.

Do we see a similar change in performance if the various options are passed in via the .hotspotrc file or via the _JAVA_OPTIONS environment variable?
                                     
2012-10-22
Assign to perf team 
                                     
2012-10-16


