JDK-8024838 : Significant slowdown due to transparent huge pages

Details
Type: Bug
Submit Date: 2013-09-15
Status: Resolved
Updated Date: 2013-10-21
Project Name: JDK
Resolved Date: 2013-10-05
Component: hotspot
OS:
Sub-Component: gc
CPU:
Priority: P2
Resolution: Fixed
Affected Versions: hs25,8
Fixed Versions: hs25 (b54)

Related Reports
Backport:
Backport:

Sub Tasks

Description
This bug is a follow-up to this thread:

http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-September/010803.html

The changes for JDK-8007074 are causing a significant slowdown in the execution of the jdk_core and jdk_svc tests on a machine running Ubuntu 12.04.1 LTS (64-bit) on Intel Xeon E5345 hardware. The attached image shows the test execution times at concurrency=4: the run time jumps from under 45 minutes to more than 70 minutes. The jump corresponds to when jdk8/tl was updated from jdk8-b104 to jdk8-b106 (synced up to hs25-b48).

From what I can tell, the changes in JDK-8007074 mean that large pages are being used when they weren't previously. Running with -XX:-UseLargePages restores the performance. 
                                    

Comments
One data point: running the java/io tests with jtreg normally takes about 35 seconds with -concurrency=8 (on an 8-core system). After switching to jdk8-b106, the tests take more than 2 minutes.
                                     
2013-09-15
I'll try to reproduce this.
                                     
2013-09-15
I can reproduce a regression, although not as large as the one reported in this bug report. I do get a big performance hit whenever the processes start swapping or have to evict cached files, but I've seen the same effect without large pages. Maybe it happens more often with transparent huge pages turned on.

To verify the regression I've run with and without large pages using the flags -XX:-UseLargePages and -XX:+UseLargePages. I've also verified that this is caused by transparent huge pages by turning off the madvise call to the OS:
$ hg diff
diff --git a/src/os/linux/vm/os_linux.cpp b/src/os/linux/vm/os_linux.cpp
--- a/src/os/linux/vm/os_linux.cpp
+++ b/src/os/linux/vm/os_linux.cpp
@@ -2748,7 +2748,7 @@
   if (UseTransparentHugePages && alignment_hint > (size_t)vm_page_size()) {
     // We don't check the return value: madvise(MADV_HUGEPAGE) may not
     // be supported or the memory may already be backed by huge pages.
-    ::madvise(addr, bytes, MADV_HUGEPAGE);
+    //::madvise(addr, bytes, MADV_HUGEPAGE);
   }
 } 

With this change I get the same performance as with -XX:-UseLargePages.
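
For readers unfamiliar with the call disabled in the diff above, here is a minimal standalone sketch of the same hint on Linux. It is not HotSpot code: the helper name reserve_with_thp_hint and its parameters are made up for illustration, and it assumes glibc exposes MADV_HUGEPAGE.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not part of HotSpot): map `bytes` of anonymous memory
// and, if requested, hint that the kernel may back it with transparent huge
// pages. Returns the mapping, or nullptr if mmap fails.
void* reserve_with_thp_hint(std::size_t bytes, bool advise_huge) {
  assert(bytes > 0);
  void* addr = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (addr == MAP_FAILED) {
    return nullptr;
  }
#ifdef MADV_HUGEPAGE
  if (advise_huge) {
    // The call is only a hint. A failure (for example on a kernel built
    // without THP support) leaves the mapping usable with small pages.
    if (::madvise(addr, bytes, MADV_HUGEPAGE) != 0) {
      std::perror("madvise(MADV_HUGEPAGE)");
    }
  }
#else
  (void)advise_huge;  // Headers without MADV_HUGEPAGE: nothing to advise.
#endif
  return addr;
}
```

Passing advise_huge = false corresponds to the commented-out madvise in the diff: the memory is still mapped, but the kernel receives no huge-page hint for that range.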
                                     
2013-09-15
I initially assumed there was swapping, but vmstat reports si/so as 0, so I assume not. This specific system has 8GB of RAM and the agent VMs (x4) are running with -Xmx256m. Some tests specify /othervm, so additional VMs may be running periodically (any additional VMs also inherit -Xmx256m). Clearly THP has an effect on this system; more data from other systems may be needed to help characterize this issue.
                                     
2013-09-16
I experienced a similar slowdown on my development system, a 24 core / 32GB Xeon i7 box running Ubuntu x64 13.04 with no swap configured. I have not (yet) tried disabling large pages.
                                     
2013-09-16
I tried reproducing this on my laptop (2 core / 8GB, SSD, Ubuntu 13.04, transparent huge pages enabled) but could only see as good or better run times with the default/-XX:+UseLargePages than with -XX:-UseLargePages. I'll try to get some profiling set up for this on a system with more cores.
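
When comparing machines, it may help to record the system-wide THP mode. On Linux it is exposed in /sys/kernel/mm/transparent_hugepage/enabled as a line such as "always [madvise] never", with the active choice in brackets. The following is only a sketch; the function names are invented for this example.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Extract the bracketed choice from a line such as "always [madvise] never".
// Returns an empty string if the line has no bracketed token.
std::string selected_thp_mode(const std::string& line) {
  const std::string::size_type open = line.find('[');
  if (open == std::string::npos) {
    return "";
  }
  const std::string::size_type close = line.find(']', open + 1);
  if (close == std::string::npos) {
    return "";
  }
  return line.substr(open + 1, close - open - 1);
}

// Read the system-wide THP mode. Returns an empty string if the file is
// missing or unreadable (for example on a kernel without THP support).
std::string read_thp_mode(
    const char* path = "/sys/kernel/mm/transparent_hugepage/enabled") {
  assert(path != nullptr);
  std::ifstream in(path);
  std::string line;
  if (!in || !std::getline(in, line)) {
    return "";
  }
  return selected_thp_mode(line);
}
```

Calling read_thp_mode() with no argument reports the mode of the current machine. With "madvise", only ranges that received a madvise(MADV_HUGEPAGE) hint are eligible for huge pages; with "always", every anonymous mapping is.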
                                     
2013-09-17
This could be a case of large page fragmentation. If there's insufficient contiguous memory to assemble the large pages, the OS may be trying to reorganize memory to coalesce small pages so that it can satisfy the large page request. A simple test for this would be to reboot the machine experiencing the issue, assuming that the downtime is acceptable on that machine.
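
A less disruptive check than a reboot might be to compare kernel counters before and after a test run. The sketch below parses the "name value" layout of /proc/vmstat and computes per-counter deltas. The helper names are invented for this example, and the counters it is meant for (compact_stall, thp_fault_alloc, thp_fault_fallback, pswpin, pswpout) are only present when the kernel is built with the corresponding features.

```cpp
#include <cassert>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

using Counters = std::map<std::string, unsigned long long>;

// Read a whole file into a string; returns "" if it cannot be opened.
std::string read_file(const char* path) {
  std::ifstream in(path);
  std::ostringstream out;
  out << in.rdbuf();
  return out.str();
}

// Parse "name value" pairs, one per line, as found in /proc/vmstat.
Counters parse_vmstat(const std::string& text) {
  Counters counters;
  std::istringstream in(text);
  std::string name;
  unsigned long long value = 0;
  while (in >> name >> value) {
    counters[name] = value;
  }
  return counters;
}

// Increase of one counter between two snapshots. Returns 0 if the counter
// is missing from either snapshot or did not grow.
unsigned long long counter_delta(const Counters& before, const Counters& after,
                                 const std::string& name) {
  assert(!name.empty());
  const Counters::const_iterator b = before.find(name);
  const Counters::const_iterator a = after.find(name);
  if (b == before.end() || a == after.end() || a->second < b->second) {
    return 0;
  }
  return a->second - b->second;
}
```

Taking parse_vmstat(read_file("/proc/vmstat")) before and after a run, a large delta for compact_stall or thp_fault_fallback would be consistent with the fragmentation theory, while zero deltas for pswpin and pswpout would match the vmstat si/so observation reported earlier.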
                                     
2013-10-01
The uptime on this system is long (158 days), so you may be right about page fragmentation. The system is busy at the moment, but when I get a chance I'll reboot it and see whether I can reproduce this issue again.
                                     
2013-10-03
URL:   http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/263f2c796d6c
User:  stefank
Date:  2013-10-05 18:57:25 +0000

                                     
2013-10-05
URL:   http://hg.openjdk.java.net/hsx/hsx25/hotspot/rev/263f2c796d6c
User:  jcoomes
Date:  2013-10-11 23:52:06 +0000

                                     
2013-10-11


