JDK-2178143 : JVM crashes if the number of bound CPUs changed during runtime
  • Type: Backport
  • Backport of: JDK-6840239
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 6
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2009-05-21
  • Updated: 2014-02-28
  • Resolved: 2013-03-29
Comments
Changeset: 1b90c7607451
Author: minqi
Date: 2013-03-27 17:03 -0700
URL: http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot/rev/1b90c7607451

2178143: JVM crashes if the number of bound CPUs changed during runtime
Summary: Supply a new flag, -XX:+AssumeMP, to work around the problem. With the flag turned on, the VM assumes it is running on an MP platform, so is_MP() returns true and sync calls are not skipped.
Reviewed-by: dholmes, acorn, dcubed, jmasa
Contributed-by: yumin.qi@oracle.com

! src/share/vm/runtime/arguments.cpp
! src/share/vm/runtime/globals.hpp
! src/share/vm/runtime/os.hpp
29-04-2013

This is the bug referenced in the putback comment; the others are duplicates of this one.
29-03-2013

You may also be able to do it simply with processor sets on Solaris. (I can't test it because my Solaris system is already a zone and I don't seem to be able to create processor sets when already in a zone.)
21-03-2013

The issue is not the number of "online" CPUs changing but the number of configured CPUs. You need to do this experiment with Solaris Zones, where the number of configured CPUs can be changed dynamically.
21-03-2013

This does not look like an is_MP() escaping problem. In fact, is_MP() is:

    // Interface for detecting multiprocessor system
    static inline bool is_MP() {
      assert(_processor_count > 0, "invalid processor count");
      return _processor_count > 1;
    }

_processor_count is the real number of processors, not the number of online processors. On both Linux and Solaris we get this number from sysconf(_SC_NPROCESSORS_CONF), which returns the number of processors configured in the system. I have tested this, and on a multi-core platform it always returns the number of cores; it is calculated once and does not change thereafter. That is, is_MP() will not escape when the number of online CPUs is adjusted from 1 to more than 1.
21-03-2013
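The distinction drawn above, between the processor count the VM caches once at startup and the count of currently online CPUs, can be observed from the Java side. Runtime.availableProcessors() reflects what the VM currently reports and may change between calls; a minimal probe (illustrative only, not part of the fix):

```java
public class ProcessorCountProbe {
    public static void main(String[] args) throws InterruptedException {
        // availableProcessors() may change between calls if CPUs are taken
        // online/offline while the process runs; sample it a few times.
        for (int i = 0; i < 3; i++) {
            System.out.println("available processors: "
                    + Runtime.getRuntime().availableProcessors());
            Thread.sleep(1000);
        }
    }
}
```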

GC currently creates all its workers at initialization. In retrospect that was not the right thing to do, but that's the state of things now. Adding more GC workers later would be a significant change. Lazy creation of GC workers would be a good thing, but it is a change. There are arrays that are allocated based on the number of GC workers; those arrays would have to be changed to something that can expand if the number of GC workers is increased (beyond the size of the array) after initialization.
20-03-2013
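The array-sizing constraint described above can be illustrated with a hypothetical sketch (Java, not HotSpot code; the class and field names are invented): per-worker state sized once at initialization cannot accommodate workers added later without reallocating every such array.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical illustration (not HotSpot code): per-worker statistics are
// held in an array sized once at initialization, so the worker count cannot
// grow past the array's length without reallocating every such array.
class FixedWorkerStats {
    private final AtomicLongArray bytesCopied;   // one slot per GC worker

    FixedWorkerStats(int workersAtInit) {
        this.bytesCopied = new AtomicLongArray(workersAtInit);
    }

    void record(int workerId, long bytes) {
        // Throws IndexOutOfBoundsException if workerId >= workersAtInit --
        // exactly the problem with adding workers after initialization.
        bytesCopied.addAndGet(workerId, bytes);
    }

    int capacity() {
        return bytesCopied.length();
    }
}
```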

We will add a flag, -XX:NumberOfProcessors=<num>. This number should be greater than 1; if it is wrongly set to 1, the VM should not start. Also, ParallelGCThreads is calculated based on physical CPUs; I think we should base it on NumberOfProcessors instead, but cap it at 32 if NumberOfProcessors is configured higher than 32. Almost no single-CPU machines exist today, so the is_MP() call is unnecessary: it always returns true unless the VM's CPU set is manually configured down to a single CPU, as in this case. In 9 we should get rid of it. In 8, to keep things simple, I will use one flag.
15-03-2013

I think this kind of dynamic adaptation is potentially de-stabilising, so any VM mechanism for this should be turned on explicitly by the user (periodic tasks are undesirable in power-constrained devices). I don't think internal subsystems should be taking it upon themselves to try to do this - they don't have sufficient context to make sensible choices - so it would be up to the application to check for this and adjust subsystems as appropriate (and this may require additional management interfaces for the VM). I would not expect "serious" apps to be run in environments where we see extreme changes to the number of available processors (it is too unpredictable), so slight under- or over-provisioning would not be a significant problem. But this is all material for an RFE/JEP for Java 9. For Java 8 all we can/should attempt is the VM argument to specify the number of processors to assume. I think perhaps two flags: one that simply says "this is the minimum number of CPUs to assume" (that solves the immediate 1 to >1 problem) and a second that says "this is the number of CPUs that available_processors must return".
14-02-2013
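The two proposed flags - a floor on the assumed CPU count and a hard override of what available_processors returns - could combine along these lines. This is a hypothetical sketch; the parameter names are invented and no flag with these semantics existed under these names at the time of this discussion:

```java
// Hypothetical sketch of how the two proposed flags might combine.
// minProcessors:    "the minimum number of CPUs to assume" (a floor)
// forcedProcessors: "the number of CPUs available_processors must return"
//                   (0 means unset)
final class ProcessorPolicy {
    static int effectiveProcessorCount(int detected,
                                       int minProcessors,
                                       int forcedProcessors) {
        if (forcedProcessors > 0) {
            return forcedProcessors;              // hard override wins
        }
        return Math.max(detected, minProcessors); // otherwise apply the floor
    }
}
```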

There are many subsystems of Java code which need to deal with core-count changes. Having N threads each poll Runtime.availableProcessors() doesn't sound like a great idea. Instead, have an MBean (e.g. Runtime) offer an event so that subsystems can subscribe to it and adjust their logic. If GC checks available processors, then maybe after a GC it should fire the event. This will work for almost all workloads, since almost all allocate memory and incur GCs. The few workloads which never allocate and are CPU-bound would need to know about the change and wouldn't get it. So it seems the JVM should have a periodic task of checking the available processor count and firing the event. The GC subsystem might be able to subscribe to the event as well.
13-02-2013
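The subscribe-to-an-event mechanism proposed above can be sketched as a small listener registry. This is a hypothetical illustration (class and method names invented, no such MBean exists); check() would be driven by a periodic task, or fired after a GC cycle as suggested:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.IntConsumer;
import java.util.function.IntSupplier;

// Hypothetical sketch: subsystems subscribe to a "processor count changed"
// event instead of each polling Runtime.availableProcessors() themselves.
class CpuCountNotifier {
    private final IntSupplier source;   // e.g. Runtime.getRuntime()::availableProcessors
    private final List<IntConsumer> listeners = new CopyOnWriteArrayList<>();
    private int lastSeen;

    CpuCountNotifier(IntSupplier source) {
        this.source = source;
        this.lastSeen = source.getAsInt();
    }

    void subscribe(IntConsumer listener) {
        listeners.add(listener);
    }

    // Called periodically (or after a GC). Returns true if a change was
    // detected and listeners were notified with the new count.
    boolean check() {
        int now = source.getAsInt();
        if (now == lastSeen) {
            return false;
        }
        lastSeen = now;
        for (IntConsumer l : listeners) {
            l.accept(now);
        }
        return true;
    }
}
```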

For #2, GC could do worker-thread management at GC time, checking available processors and adapting the number of threads as needed. I have no idea what is involved in making the GC "thread pool" behave more like a thread pool. That said, the GC has no way to know that it should do this. Can't comment on the specifics of #3, but it sounds very complex. #4 is a non-issue: F/J pools created by applications should not try to dynamically adapt to a change in processor count, because the pool has no idea what the user intended or how it was set up in the first place. If the number of processors changes, it has no functional impact on the pool, only a potential performance impact. Dynamic resource management is a complex issue that involves the whole system stack. The basic "fix" for the VM when going from 1 to >1 CPUs is easy to do. Everything else is an RFE and needs to tie into the future platform plans for resource management. We have enough trouble doing performance work on the platform as it is - trying to determine which policies to use in the presence of dynamic resource changes seems an intractable problem without guidance from the end user.
13-02-2013
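The point above about F/J pools holds because a ForkJoinPool captures its parallelism at construction time: if the CPU count changes afterwards, the pool neither shrinks nor grows to match. A small demonstration:

```java
import java.util.concurrent.ForkJoinPool;

public class ForkJoinParallelismDemo {
    public static void main(String[] args) {
        // Parallelism is fixed when the pool is constructed; a later change
        // in the machine's CPU count does not alter it.
        ForkJoinPool explicit = new ForkJoinPool(2);
        System.out.println("explicit pool parallelism: "
                + explicit.getParallelism());

        // The common pool likewise sizes itself from availableProcessors()
        // once, when it is created.
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());

        explicit.shutdown();
    }
}
```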

Unfortunately, a notification API isn't available, so the JVM is going to have to check occasionally. Perhaps it should check just before or after GC, or simply poll every X minutes. As for the GC thread count (#2), it shouldn't be too involved to launch a few more GC threads or let a few GC threads exit... but maybe I am naive and don't understand all that is involved. Relaying the heap (#3) is involved. In the worst-case scenario, all memory accesses have to go hit RAM located on a processor two hops away. This should be easy to measure. Then multiply this by the likelihood that a user is going to enable/disable cores dynamically (which just went up significantly, since it was approved by Larry Ellison for licensing). The result is how much we can expect to gain from doing this optimization. Then weigh this gain against all of the other optimizations that could be done, and prioritize. Checking fork-join (#4) might be as simple as running some tests to ensure that the framework can recover from having the CPU configuration change. If any issues are identified, they will have to be prioritized.
13-02-2013
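The "check just after GC" idea above can be approximated from application code today: on HotSpot, the garbage collector MXBeans are NotificationEmitters, so a listener can re-read availableProcessors() after each collection instead of polling on a timer. A sketch (whether the listener fires depends on the JVM and on a GC actually occurring):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;

public class PostGcCpuCheck {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            // On HotSpot these beans also implement NotificationEmitter.
            if (gc instanceof NotificationEmitter) {
                ((NotificationEmitter) gc).addNotificationListener(
                        (notification, handback) -> {
                            int cpus = Runtime.getRuntime().availableProcessors();
                            System.out.println("after " + notification.getType()
                                    + ": " + cpus + " processors");
                        },
                        null, null);
            }
        }
        System.gc();  // request a collection so the listener may fire
    }
}
```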

Item #1 is fine - easy to do. Items 2, 3 and 4 require significant redesigns to try to adapt to dynamically changing resource availability. I'm not aware of any notification API that will inform the VM when the number of processors changes, or when the NUMA characteristics change.
13-02-2013

The NumberOfProcessors JVM argument is definitely needed to overcome issues with the following automated solution:

1. Please change the default to assume a multi-core machine, and create a flag to enable uniprocessor optimizations.
2. GC threads need to adjust dynamically to the changing core count.
3. The heap needs to be relaid as the NUMA characteristics change due to physical cores being enabled or disabled.
4. Check to make sure the fork-join framework can deal with the CPU configuration changing.

Enabling or disabling cores shouldn't happen very frequently, so doing the work to relay the heap is worth the CPU impact when considered over the long run. This bug is also applicable to OVM, where vCPUs could be added to or removed from guests. Enabling and disabling cores dynamically will start happening in production (ETA unknown). It would be wise to fix these issues to be ahead of the incoming bugs.
12-02-2013