JDK-7196911 : command line length affects performance
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: linux_oracle_6.0
  • CPU: x86
  • Submitted: 2012-09-07
  • Updated: 2017-11-17
Description
When testing nonpermgen performance, we noticed that the score (throughput) changes when command lines of different lengths are applied, even though the added flags are the same as the default values.

This only happens on Linux (not Solaris); we are using OEL 6.0 on Intel Sandy Bridge. The workload is SPECjbb2005.

The command line length affects performance differently on different builds. We tested the nonpermgen build, the reference build for nonpermgen (a ref build with permgen), and jdk7u6.

cmd1:  -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:-UseCompressedOops

cmd2:  -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:-UseCompressedOops -XX:-UseSharedSpaces

cmd3: -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:-UseCompressedOops -Xmixed1234567801234

cmd2 explicitly sets -XX:-UseSharedSpaces to the same value as the default.
cmd3 does not set UseSharedSpaces at all, but merely changes the command line length.
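Incidentally, the two suffixes that distinguish cmd2 and cmd3 from cmd1 are the same length, so cmd2 and cmd3 produce command lines of identical total length, consistent with their near-identical score deltas below. A quick check (illustrative C++ snippet, not part of any build):

```cpp
#include <cstring>

// Length of the flag cmd2 appends (its value matches the default).
inline std::size_t extra_len_cmd2() { return std::strlen("-XX:-UseSharedSpaces"); }
// Length of the padding argument cmd3 appends (numerals are ignored).
inline std::size_t extra_len_cmd3() { return std::strlen("-Xmixed1234567801234"); }
```

Both return 20, so the two command lines perturb the process memory layout identically.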

runid jvm version       cmd        score        changes
527    nonpermgen        cmd1       338,718        
528    nonpermgen        cmd2       322,747        -4.71%
526    nonpermgen        cmd3       320,297        -5.44%
532    ref               cmd1       322,959      
533    ref               cmd2       340,853        5.54%
531    ref               cmd3       339,767        5.20%
587    7u6               cmd1       338,920
588    7u6               cmd2       321,424        -5.16%

Other observations:
When the throughput drops, the gc behavior changes as well.
For example, run 527 had 1417 PSYoungGen collections with an average GC time of 0.035 s, while run 528 had 1334 collections averaging 0.053 s (roughly 49.6 s vs 70.7 s of total young GC time).

GC logs are available on request.

Comments
A fix for JDK-8022880 (False sharing between PSPromotionManager instances) has been pushed to the hotspot-gc repository: http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/9766f73e770d
2013-08-14

There are two areas for improvement to avoid this problem: 1) fix the GC code that is susceptible to false cache-line sharing; 2) fix the command line processing code to reduce the risk of disturbing cache lines. I assume the latter means something like using a buffer (for holding the command line) that grows in multiples of the cache line length, to avoid changing alignments.
2013-02-05
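The buffer idea suggested above might look like the following minimal sketch, assuming a 64-byte line; kCacheLine and round_to_line are illustrative names, not HotSpot code:

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size on this x86 hardware

// Round a requested buffer size up to a whole number of cache lines, so
// that a longer command line cannot shift the alignment of whatever the
// allocator hands out next.
constexpr std::size_t round_to_line(std::size_t n) {
  return ((n + kCacheLine - 1) / kCacheLine) * kCacheLine;
}
```

Growing the command-line buffer in these quanta keeps later C-heap allocations at a stable alignment regardless of argument length.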

From mail from Vladimir: There are two problems here. First, we can move these structure allocations from the C heap:

  _manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);

to an Arena, to prevent preceding malloc allocations from affecting the alignment of these structures. We may also want to align the Arena's chunks (or the first allocation in each chunk) to the cache line size, to avoid variation between runs. The second problem is structure size, which should be a multiple of the cache line size: we need padding (dummy fields) in such structures to avoid false cache sharing.
2013-02-05
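The padding idea above can be sketched with alignas. This is illustrative only: Manager is a stand-in type (the real PSPromotionManager has different fields), and the 64-byte line size is an assumption:

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed x86 line size

// Stand-in for a per-thread GC structure.
struct Manager {
  void* claimed_stack;
  std::size_t stats;
};

// alignas rounds both the alignment and sizeof up to a cache-line
// multiple, so adjacent elements of an array of PaddedManager can
// never share a line.
struct alignas(kCacheLine) PaddedManager : Manager {};

static_assert(sizeof(PaddedManager) % kCacheLine == 0,
              "each element must span whole cache lines");
```

Allocating the worker managers as an array of such padded elements removes the dependence on whatever alignment earlier mallocs happen to leave behind.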

Installed VTune and reproduced the issue with jdk8b66, with debug info not stripped from the build so that we can see which line caused the issue.

Conclusion: it seems to affect only ParallelGC, since when an object is copy_to_survivor_space, the markOops of the new and old objects are swapped.

Analysis: a significant part of the false sharing is from psPromotionManager.inline.hpp line 170:

  // Now we have to CAS in the header.
  if (o->cas_forward_to(new_obj, test_mark)) {

This operation is defined in oop.inline.hpp:

  // Used by parallel scavengers
  inline bool oopDesc::cas_forward_to(oop p, markOop compare) {
    assert(check_obj_alignment(p), "forwarding to something not aligned");
    assert(Universe::heap()->is_in_reserved(p), "forwarding to something not in heap");
    markOop m = markOopDesc::encode_pointer_as_mark(p);
    assert(m->decode_pointer() == p, "encoding must be reversable");
    return cas_set_mark(m, compare) == compare;
  }

  inline markOop oopDesc::cas_set_mark(markOop new_mark, markOop old_mark) {
    return (markOop) Atomic::cmpxchg_ptr(new_mark, &_mark, old_mark);
  }

cmpxchg_ptr for Linux is defined in atomic_linux_x86.inline.hpp:

  inline void* Atomic::cmpxchg_ptr(void* exchange_value, volatile void* dest, void* compare_value) {
    return (void*)cmpxchg((jint)exchange_value, (volatile jint*)dest, (jint)compare_value);
  }

Supporting data from VTune. From 170, with issues:

  Source Line  Source                                        CPU_CLK_UNHALTED.THREAD  OFFCORE_RESPONSE.ALL_DEMAND_MLC_PREF_READS.LLC_MISS.REMOTE_HITM_HIT_FORWARD_1
  172          if (o->cas_forward_to(new_obj, test_mark)) {  53,601,574,999           61,000,540

  Code Location  Source Line  Assembly
  0x82ee60       172          movq 0x58d5d1(%rip), %rax
  0x82ee67       172          mov %r12, %rdx
  0x82ee6a       172          or $0x3, %rdx
  0x82ee6e       172          cmpl $0x1, (%rax)
  0x82ee71       172          mov %r13, %rax
  0x82ee74       172          setnle %cl
  0x82ee77       172          cmp $0x0, %cl
  0x82ee7a       172          jz 0x82ee7d
  Block 45:
  0x82ee7c       172          lock cmpxchgq %rdx, (%rbx)
  0x82ee81       172          cmp %rax, %r13
  0x82ee84       172          jz 0x82ef98 <Block 64>

From 160, without issues:

  Source Line  Source                                        CPU_CLK_UNHALTED.THREAD  OFFCORE_RESPONSE.ALL_DEMAND_MLC_PREF_READS.LLC_MISS.REMOTE_HITM_HIT_FORWARD_1
  172          if (o->cas_forward_to(new_obj, test_mark)) {  14,444,281,229           3,000,060

  (assembly identical to the listing above)
2013-02-04


Worked with Intel on this. MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM is much higher for the bad case, which indicates that the memory layout causes false sharing. We located PSPromotionManager as the source. One possible cause:

  void PSPromotionManager::initialize() {
    ParallelScavengeHeap* heap = (ParallelScavengeHeap*)Universe::heap();
    assert(heap->kind() == CollectedHeap::ParallelScavengeHeap, "Sanity");
    _old_gen = heap->old_gen();
    _young_space = heap->young_gen()->to_space();
    assert(_manager_array == NULL, "Attempt to initialize twice");
    _manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);
    guarantee(_manager_array != NULL, "Could not initialize promotion manager");
    _stack_array_depth = new OopStarTaskQueueSet(ParallelGCThreads);
    guarantee(_stack_array_depth != NULL, "Cound not initialize promotion manager");
    // Create and register the PSPromotionManager(s) for the worker threads.
    for(uint i=0; i<ParallelGCThreads; i++) {
      _manager_array[i] = new PSPromotionManager();
      guarantee(_manager_array[i] != NULL, "Could not create PSPromotionManager");
      stack_array_depth()->register_queue(i, _manager_array[i]->claimed_stack_depth());
    }
    // The VMThread gets its own PSPromotionManager, which is not available
    // for work stealing.
    _manager_array[ParallelGCThreads] = new PSPromotionManager();
    guarantee(_manager_array[ParallelGCThreads] != NULL, "Could not create PSPromotionManager");
  }

The line

  _manager_array = NEW_C_HEAP_ARRAY(PSPromotionManager*, ParallelGCThreads+1, mtGC);

allocates from the C heap using AllocateHeap. In share/vm/runtime/arguments.cpp, the properties, system properties, JVM flags, etc. are also saved using AllocateHeap.

From jdk8b66 and later builds, where NMT is implemented, the memory reserved for Internal changes when the command line changes.

cmd: -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=detail -Xmixed123456789012345678

  - Internal (reserved=25328KB, committed=25296KB)
    (malloc=25296KB, #2957)
    (mmap: reserved=32KB, committed=0KB)

cmd: -server -Xms4g -Xmx4g -XX:+UseParallelGC -XX:-UseAdaptiveSizePolicy -XX:-UseLargePages -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=detail

  - Internal (reserved=25327KB, committed=25295KB)
    (malloc=25295KB, #2949)
    (mmap: reserved=32KB, committed=0KB)

But I feel there are more questions:
1. Since there are ParallelGCThreads+1 PSPromotionManagers, is it _manager_array or the memory in each PSPromotionManager that causes the false sharing? How can we tell?
2. This is with ParallelGC; does it impact others, like G1?
3. Is there a way to prevent this?
2012-12-19
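One hedged way to probe question 1 is to compare the cache-line index of the candidate addresses: two objects can only false-share if they fall in the same line. This helper is illustrative, not HotSpot code, and assumes a 64-byte line:

```cpp
#include <cstdint>

constexpr std::uintptr_t kLine = 64;  // assumed line size

// True if two addresses fall within the same 64-byte cache line.
inline bool same_cache_line(const void* a, const void* b) {
  return (reinterpret_cast<std::uintptr_t>(a) / kLine) ==
         (reinterpret_cast<std::uintptr_t>(b) / kLine);
}

// A 64-byte-aligned probe buffer to demonstrate the helper.
alignas(64) char probe[128];
```

Logging the addresses of _manager_array's slots and of each PSPromotionManager instance through such a check would show whether it is the pointer array or the managers themselves that straddle shared lines.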

Agree with David. For now, performance runs should take environment length into account. When we have cycles, it would be interesting to see if there are specific internal data cache-alignments that would be worth the per-platform tuning and potential footprint cost.
2012-11-06

I've witnessed something similar before in CVM. There's a microbenchmark in kBench that (for CVM) sat in a loop writing into 4 consecutive words on the stack. On a certain ARM device, as long as all writes were done in the same half of a 32-byte cache line, the data remained in the write buffer and was not even flushed to the cache line. As soon as a write occurred to a new half cache line, the data was flushed from the write buffer. Thus if the 4 words were 16-byte aligned you got great performance; otherwise the write-buffer flushes to the data cache became a bottleneck and the benchmark slowed down considerably.
2012-11-05

Dave Dice pointed me to some fascinating research on measurement bias by Todd Mytkowicz at the University of Colorado. In particular, the following paper shows how Unix environment size can affect perceived performance results through its influence on the program stack, and thus on the relative alignment of stack variables: http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf

As command-line args are stored in the program image, I believe that different-length command lines would also impact data layout and/or alignment and so affect performance results. However, before drawing any conclusions we would need to perform extensive analysis of our benchmarking methodology, to ensure that we are measuring things appropriately and to obtain enough samples to know the natural variance of these benchmarks on the systems on which we run them. The moral of this story may simply be that performance benchmarking must always use the same command-line length (and environment strings, and .hotspotrc file) when doing comparative runs.
2012-11-05

Note that in some cases the default value of a flag is being explicitly added, while in others the extra length comes from an argument that is effectively ignored, e.g. -Xmixed1234567801234 matches -Xmixed and the numerals are ignored.
2012-11-05
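The reason -Xmixed1234567801234 is accepted at all is that the option is matched by prefix and the trailing characters pass through unchecked, as noted above. A hypothetical reimplementation of that matching (not the actual code in arguments.cpp):

```cpp
#include <cstring>

// Accept any argument beginning with "-Xmixed"; the tail, here the
// padding digits, is ignored, which is what makes the option usable
// purely as command-line-length padding.
inline bool matches_xmixed(const char* arg) {
  return std::strncmp(arg, "-Xmixed", 7) == 0;
}
```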

Have you verified that the added command line options are not affecting other options? Use -XX:+PrintFlagsFinal to compare the flag settings for each run.
2012-11-05

Did 4 additional sets:
cmd1: average 463259, standard deviation 5417
cmd2: average 476690, standard deviation 11395
2012-10-24

This certainly doesn't make much sense. The only thing a different command line length might affect is the overall memory layout, i.e. moving things by an extra page, which might lead to different caching behaviour. But this seems excessive. Do we see a similar change in performance if the various options are passed in via the .hotspotrc file or via the _JAVA_OPTIONS environment variable?
2012-10-22

Assign to perf team
2012-10-16