JDK-8081734 : ConcurrentHashMap/ConcurrentAssociateTest.java times out 90% of the time on SPARC with 256 CPUs
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2015-06-02
  • Updated: 2024-07-03
  • Resolved: 2015-07-20
JDK 9: Fixed in 9 b75
Related Reports: three related issues.
Description
The test

java/util/concurrent/ConcurrentHashMap/ConcurrentAssociateTest.java

has failed due to timeouts in roughly 90% of a couple hundred runs when run on solaris-sparc, in a zone on a host with 256 CPUs. Increasing the timeout value to 12 or so allows it to pass sometimes, but it then takes up to 14 or 15 minutes. While this may be due to the zone sharing resources, availableProcessors() still returns 256, and that is an unreasonable amount of time for a single test.
If this is not a bug, can the test cap this value at 24 or 32 so that it runs in a reasonable amount of time?

Comments
URL: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/f6699d8032ff User: lana Date: 2015-07-29 20:40:19 +0000
29-07-2015

URL: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/f6699d8032ff User: darcy Date: 2015-07-20 22:14:05 +0000
20-07-2015

If you mean setting 'ps' in the test to, say, 32, rather than to the return value of availableProcessors(), I have done that.
20-07-2015

@Martin, yes, good point: since the objects are not Comparable, it's (deliberately) the worst possible case. I was still a little surprised given the size of the data (and the spread of the hash codes, randomly generated between 1 and 8 for each object), but sequential execution of 256 tasks is astonishingly slow, and over-provisioned concurrent execution just exacerbates it. I can reproduce the effect on my MBP by over-provisioning the number of tasks. I don't think we can/should do much about this degenerate case in CHM. Let's cap the parallelism to 2^5, which (at 2^7 = 128 entries per task) limits the maximum number of elements to 2^5 * 2^7 = 2^12. @Steve, can you verify that works within the limits on the T5 machine before I fix the test?
20-07-2015
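
For reference, a minimal sketch of the cap being proposed here, with hypothetical names (the actual fix that landed in 9 b75 may differ in detail):

    // Bound the test's parallelism at 2^5 = 32 instead of using
    // availableProcessors() directly. With 2^7 = 128 puts per task this
    // limits the total number of entries to 2^5 * 2^7 = 2^12.
    public class ParallelismCap {
        static final int MAX_PARALLELISM = 1 << 5; // 32

        static int cappedParallelism() {
            return Math.min(Runtime.getRuntime().availableProcessors(),
                            MAX_PARALLELISM);
        }

        public static void main(String[] args) {
            System.out.println("parallelism = " + cappedParallelism());
        }
    }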

I've clarified my comment above to mean that availableProcessors() reports the number of "hardware threads" available on the system. This is basically what the OS (Solaris) reports as a "processor", even though we know that there isn't the hardware available to run them all fully in parallel. But they're reported as separate "processors" so that the OS will schedule kernel threads onto them (and by extension, Java's threads), so that their execution can overlap in case they're doing different work, or if one thread is stalled waiting for memory, etc.

I don't think anything will be changed in Solaris; this is what has to happen in order to take advantage of hardware threads. We might consider changing availableProcessors() -- perhaps just on SPARC -- but that would be pretty confusing, and overall I'm not sure it would help much.

I do think we need to change our expectations of availableProcessors(). The assumption seems to be that if it returns a value N, we can run N threads in parallel and get N times as much work done in the same amount of wall clock time as a single thread (plus some fudging for setup and scheduling overhead, etc.). That assumption isn't true on SPARC.

Bottom line is that I think the test needs to change the way it creates its workload. Currently it scales the workload linearly with availableProcessors(). It should do something different, but I'm not quite sure what. Martin's comment about linear chaining and O(N^2) growth is also a possibility. That could occur on any system that has a large number of "processors," but we might hit the timeouts first on SPARC because of the issues I've described with its hardware threads.
18-07-2015

One reason this test seems to fail only on SPARC is that architecture's relationship between the value returned by availableProcessors() and the amount of computation it can actually do in parallel. The availableProcessors() call generally returns the number of *threads*, not the number of cores. Intel processors generally have 2 threads per core (Hyper-Threading), but nowadays SPARC systems have 8 threads per core. (In this context I'm using "threads" to mean "hardware threads", not kernel threads or Java threads.)

On a 1-core SPARC system with 8 threads, if 8 Java threads are all trying to do the same work, they'll all get scheduled onto the processors but there won't be any parallelism. (They can overlap execution if they're doing different work, though, which is the point of having multiple threads per core.) This test tries to schedule a workload sufficient to keep "all" of the processors busy. But because availableProcessors() on SPARC reports so many more threads than there is actual hardware available, the test will run several times longer on SPARC than on, say, Intel.
18-07-2015

availableProcessors() reports the number of "processors" that the OS claims to have. If it shouldn't be reporting "threads" as distinct processors (and I tend to agree), then that is a Solaris issue - but it seems unlikely that would be changed now. The poor old VM really doesn't have an easy way to get socket vs. core vs. thread counts and figure out what that should mean for the actual number of "processors". And even if we reported cores, that doesn't mean the OS would schedule on distinct cores rather than on threads of the same core - so performance would still be affected. And of course the actual parallelism available from "hardware threads" depends on the architecture.
18-07-2015

The poor hashcode, combined with not implementing Comparable (which means we cannot fall back to a tree bin implementation), means that we'll see at least an n^2 slowdown as many threads add many elements. It sounds like this is a NUMA machine, so contention may end up being much more expensive than on a single-core machine, so it's at least plausible that we have a slowdown here even worse than n^2. It's reasonable to limit the number of threads, but there may be a real performance bug here as well, so y'all may want to keep poking at it.
17-07-2015
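
A self-contained sketch of the degenerate case described above, using a hypothetical key class whose hash codes fall in 1..8 and which does not implement Comparable (so, per the comment above, CHM cannot fall back to ordered tree-bin searching):

    // Every key is distinct (identity equals) but hashes into one of only
    // 8 values, so all entries collide into at most 8 bins. Without
    // Comparable keys the tree-bin fallback cannot order elements by key,
    // and inserting n elements does roughly n^2 work.
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ThreadLocalRandom;

    public class DegenerateBins {
        static final class PoorKey { // deliberately NOT Comparable
            final int hash = ThreadLocalRandom.current().nextInt(1, 9); // 1..8
            @Override public int hashCode() { return hash; }
        }

        public static void main(String[] args) {
            ConcurrentHashMap<PoorKey, Boolean> map = new ConcurrentHashMap<>();
            long t0 = System.nanoTime();
            for (int i = 0; i < (1 << 12); i++) // 2^12 entries, the proposed cap
                map.put(new PoorKey(), Boolean.TRUE);
            System.out.printf("%d entries in %d ms%n",
                    map.size(), (System.nanoTime() - t0) / 1_000_000);
        }
    }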

The underlying machine is a SPARC T5-2. It has 2 physical processors, each with 16 cores and 8 threads per core; thus availableProcessors() reports 256.
17-07-2015
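
As a quick check of that arithmetic (socket/core/thread counts hardcoded from this report, since the VM has no portable way to query them, as noted above):

    // 2 sockets x 16 cores x 8 hardware threads per core = 256, which is
    // exactly what availableProcessors() reports on the T5-2.
    public class T5Math {
        public static void main(String[] args) {
            int sockets = 2, coresPerSocket = 16, threadsPerCore = 8;
            System.out.println("expected: " + sockets * coresPerSocket * threadsPerCore);
            System.out.println("reported: " + Runtime.getRuntime().availableProcessors());
        }
    }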

Is the underlying machine really a SPARC T2?
17-07-2015

A surprisingly large slowdown, given that it takes about 3 seconds to run locally on my MBP. Given a parallelism of 256, one iteration of the test will create 256 tasks, each putting 128 entries, so globally a total of 2^15 entries. Keys deliberately have poor hashcodes, so collisions are very likely. But 2^15 entries does not seem particularly large, so I am suspicious that there might be something else going on. I think it will require some direct investigation, running such tests on the identified machine.
17-07-2015
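
A rough, hypothetical repro of the workload described above (not the actual ConcurrentAssociateTest code): 256 tasks each putting 2^7 = 128 distinct poorly-hashed entries, for 2^15 entries total, with far more tasks than most machines have real cores:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class OverProvisionRepro {
        static final class PoorKey { // deliberately NOT Comparable
            final int hash = ThreadLocalRandom.current().nextInt(1, 9); // 1..8
            @Override public int hashCode() { return hash; }
        }

        public static void main(String[] args) throws Exception {
            final int tasks = 256, putsPerTask = 128; // 256 * 128 = 2^15 entries
            ConcurrentHashMap<PoorKey, Boolean> map = new ConcurrentHashMap<>();
            ExecutorService pool = Executors.newFixedThreadPool(tasks);
            List<Future<?>> futures = new ArrayList<>();
            long t0 = System.nanoTime();
            for (int t = 0; t < tasks; t++)
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < putsPerTask; i++)
                        map.put(new PoorKey(), Boolean.TRUE);
                }));
            for (Future<?> f : futures) f.get(); // wait for all puts to finish
            pool.shutdown();
            System.out.printf("%d entries in %d ms%n",
                    map.size(), (System.nanoTime() - t0) / 1_000_000);
        }
    }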

As it turns out, this test needs a bigger timeout to complete. When the test runs on this particular SPARC machine (a zone on a host with 256 CPUs that shares resources among its zones) it takes about 14 minutes to complete. If this is not a bug of some sort making it unreasonably slow on this type of machine, can the test be limited to a max of 24 or 32 CPUs rather than just using the return value of availableProcessors()?
17-07-2015

Never say "every". :) The test passed the last 2 build/test cycles.
02-06-2015