JDK-8062591 : SPARC PICL causes significantly longer startup times
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris
  • CPU: sparc
  • Submitted: 2014-10-30
  • Updated: 2016-01-19
  • Resolved: 2014-11-04
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

7u80: Fixed, 8u51: Fixed, 9 b40: Fixed
Related Reports
Relates:
JDK-8056124 introduced the PICL interface to get the cache line information on SPARC.

Unfortunately, it regressed the startup time. It can be shown with a HelloWorld application:

$ for S in `seq 1 100`; do time jdk9-b37/bin/java Hello; done 2>&1 | grep real | sed -e "s/0m//g" -e "s/s//g" | awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }'

jdk9-b31: 0.25s
jdk9-before-8056124: 0.25s
jdk9-after-8056124: 0.70s
jdk9-b32: 0.70s 
jdk9-b35: 0.70s 
jdk9-b36: 0.61s <--- JDK-8058892, the fix for another regression, comes in
jdk9-b37: 0.61s (latest known)

Aleksey, thanks for verifying it!

before JDK-8056124: 0.25s
after JDK-8056124: 0.70s
current jdk9/hs-comp + picl-startup.patch: 0.17s

Hence, the patch mitigates the startup problem nicely. The current hs-comp + patch time is lower than the pre-JDK-8056124 time because a separate JDK startup regression was fixed in between. On this machine, that regression accounted for another 0.10s, which implies the PICL overhead is now 0.17 - (0.25 - 0.10) = 0.02s. That is bearable.

Aleksey, please find the preliminary patch attached.

It's pretty important for us to have the correct L2 line size. We use it for BIS (block-initializing store) instructions to zero out a cache line without fetching its previous contents from memory. A single instruction zeros the L2 cache line it touches, so we obviously have to know what that size is.
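The failure mode being discussed can be sketched in portable C. This is not HotSpot code: `bis_store` is a hypothetical stand-in that models what one SPARC BIS instruction does (zero exactly one hardware cache line), and the 64-byte `ACTUAL_LINE` is an assumed value. The point is the stride arithmetic: if the VM's assumed line size is larger than the hardware's, the loop skips bytes that are never zeroed.

```c
#include <stddef.h>
#include <string.h>

enum { ACTUAL_LINE = 64 };  /* bytes the hardware really clears per BIS store (assumed value) */

/* Model of a single block-initializing store: it zeros exactly one
 * hardware cache line, regardless of what the VM believes the size is. */
static void bis_store(char *line) {
    memset(line, 0, ACTUAL_LINE);
}

/* Zero [buf, buf+len) stepping by the line size the VM *believes* in.
 * len is assumed to be a multiple of the line size. If assumed_line is
 * larger than ACTUAL_LINE, the bytes between successive stores are never
 * touched -- exactly the "random values in object fields" failure mode. */
static void zero_region(char *buf, size_t len, size_t assumed_line) {
    for (size_t off = 0; off < len; off += assumed_line)
        bis_store(buf + off);
}
```

With a 256-byte region pre-filled with garbage, `zero_region(buf, 256, 64)` clears everything, while `zero_region(buf, 256, 128)` leaves half of every 128-byte stride stale.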

Wait, if we have a mistake in the cache line size, we get a functional failure somewhere? *shudders*. I can see how that might happen, but I thought cache line sizes were used only as performance guidelines in the VM...

We could have an option; however, if having different cache line sizes on the same machine is a possibility and we run with the default settings, then we will start crashing with random values in object fields, which is not going to be a nice thing to debug. We need to check with the hardware folks whether such a thing could be a reality. I imagine things like an optimized memset() would have to account for this as well.

I admit, it is scary that the startup time hit might be proportional to the size of the machine. We are talking about 0.5s on a decently sized T4; just imagine what it will be a few years down the road. Combining the walks seems a good idea regardless of whether we want to keep the full walk or not. Thinking aloud: can we do a short walk by default, and the long walk under a VM option? (As much as I hate introducing more VM options...)

P1 usually means that something is broken in a catastrophic fashion and the issue requires immediate attention within 24 hours. I don't think anything has stopped working in this case. I think the impact is therefore low (considering it's a minor startup problem on SPARC only, and considering that PICL is the only legitimate way to get cache line sizes anyway). I think LHH or LMH is appropriate, hence P4 or P5.

Technically, here is what I found (on our T4) with the latest hs-comp:

No PICL: 0m0.274s
With PICL: 0m0.352s

- It apparently takes a lot of time to walk the PICL tree with picl_walk_tree_by_class(); it must be a monster tree.
- We do it twice: once for L1 and a second time for L2. Combining the walks gets me: 0.303s
- If I assume that all CPUs have the same cache line sizes (not sure if we can) and terminate the walk after we've seen the first CPU, I get: 0.261s

We have to decide whether we want to keep the full walk (and whether different CPUs with different cache lines in a single box are a possibility).
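The combine-and-terminate-early idea above can be sketched with a mock of the PICL class walk. Everything here is a stand-in, not the actual libpicl API or HotSpot code: `walk_cpus` models picl_walk_tree_by_class() visiting the CPU-class nodes, and `cpu_node`/`cache_info` are hypothetical types. One callback records both the L1 and L2 line sizes, so a single traversal replaces the two walks, and returning the terminate code after the first CPU node skips the rest of the tree (valid only under the assumption that all CPUs share the same line sizes).

```c
#include <stddef.h>

/* Hypothetical stand-ins for the PICL tree: each CPU-class node carries
 * its cache geometry. Real code would read these as PICL properties. */
typedef struct { int l1_line; int l2_line; } cpu_node;
typedef struct { int l1; int l2; } cache_info;

enum { WALK_CONTINUE, WALK_TERMINATE };  /* mirrors the walk-control idea */

/* One callback fills in both sizes, so a single traversal replaces the
 * original two walks; terminating after the first CPU node skips the
 * rest of the (large) tree, assuming all CPUs agree on line sizes. */
static int first_cpu_cb(const cpu_node *n, void *arg) {
    cache_info *ci = arg;
    ci->l1 = n->l1_line;
    ci->l2 = n->l2_line;
    return WALK_TERMINATE;
}

/* Minimal model of a PICL-style walk over the CPU nodes: invoke the
 * callback on each node until it asks to terminate. */
static void walk_cpus(const cpu_node *nodes, size_t n,
                      int (*cb)(const cpu_node *, void *), void *arg) {
    for (size_t i = 0; i < n; i++)
        if (cb(&nodes[i], arg) == WALK_TERMINATE)
            return;
}
```

The measured 0.352s → 0.303s → 0.261s progression in the comment corresponds to: two walks, one combined walk, one combined walk with early termination.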

What would the proper ILW mapping be, then? I fail to see how it is not HHH => P1.

Okay, if you give me a patch, I can test on my machine (the one that has much larger degradation).

Found another thing to tweak. The CPU-related part of the tree is close to the side the traversal starts from, and the records for the CPUs are all siblings. So, since we know the number of CPUs (os::processor_count()), we can stop the traversal once we've seen them all, leaving a good part of the tree untouched. 0.278s, which is pretty close to good enough, I think.
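In contrast to terminating after the first CPU, the refinement above keeps visiting every CPU node (so per-CPU line sizes can still be validated) but abandons the walk once processor_count nodes have been seen. A hedged sketch of that callback, with the same hypothetical node type as before and a mismatch flag covering the "different cache lines in one box" worry:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU node; real code reads the line size as a PICL property. */
typedef struct { int l2_line; } cpu_node;

typedef struct {
    int expected;    /* CPUs still to be seen: os::processor_count() at start */
    int l2;          /* agreed-upon L2 line size, 0 = none recorded yet */
    bool mismatch;   /* set if any CPU disagrees with the first one */
} scan_state;

enum { WALK_CONTINUE, WALK_TERMINATE };

/* Visit every CPU sibling, validating that all report the same L2 line
 * size, and terminate the walk as soon as the expected count is reached
 * so the rest of the tree is never traversed. */
static int count_cpus_cb(const cpu_node *n, void *arg) {
    scan_state *s = arg;
    if (s->l2 == 0)
        s->l2 = n->l2_line;
    else if (s->l2 != n->l2_line)
        s->mismatch = true;
    return (--s->expected == 0) ? WALK_TERMINATE : WALK_CONTINUE;
}
```

Since the CPU records are siblings near the start of the traversal, the walk ends after the last CPU node instead of touching the whole tree, which is where the 0.352s → 0.278s improvement comes from.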

Not sure what I can do here. PICL itself probably needs to be looked at and optimized. Alternatively, we can make it lazy: move the cost of interacting with PICL to the first compile and not use BIS in the interpreter. Either way, this is definitely not a P1.

Assigning to Igor, as he is the author of the original change.

Initial ILW: H(Significant regression), H(Common use case, startup of any application), H(No viable workaround) => P1