Bug ID: JDK-8062591 SPARC PICL causes significantly longer startup times

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 9

Priority: P4
Status: Resolved
Resolution: Fixed
OS: solaris
CPU: sparc

Submitted: 2014-10-30
Updated: 2016-01-19
Resolved: 2014-11-04

JDK 7	JDK 8	JDK 9
7u80 Fixed	8u51 Fixed	9 b40 Fixed

JDK-8056124 introduced the PICL interface to get the cache line information on SPARC.

Unfortunately, it regressed the startup time. It can be shown with a HelloWorld application:

$ for S in `seq 1 100`; do time jdk9-b37/bin/java Hello; done 2>&1 | grep real | sed -e "s/0m//g" -e "s/s//g" | awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }'

jdk9-b31: 0.25s
jdk9-before-8056124: 0.25s
jdk9-after-8056124: 0.70s
jdk9-b32: 0.70s 
...
jdk9-b35: 0.70s 
jdk9-b36: 0.61s <--- JDK-8058892, fix for another regression comes in
jdk9-b37: 0.61s (latest known)

Aleksey, thanks for verifying it!
03-11-2014
before JDK-8056124: 0.25s
after JDK-8056124: 0.70s
current jdk9/hs-comp + picl-startup.patch: 0.17s
Hence, the patch mitigates the startup problem nicely. The current hs-comp + patch time is lower than the pre-JDK-8056124 time because a separate JDK startup regression was fixed in between. On this machine, that regression accounted for another 0.10s, which implies the PICL overhead is now 0.17 - (0.25 - 0.10) = 0.02s. That is bearable.
03-11-2014
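The overhead estimate in the comment above can be re-derived step by step. This is only a restatement of the quoted figures; the variable names are illustrative and not taken from any tool.

```python
# Figures quoted in the comment above (seconds, HelloWorld startup).
before_picl = 0.25        # before JDK-8056124
patched = 0.17            # current jdk9/hs-comp + picl-startup.patch
other_regression = 0.10   # separate startup regression fixed in between

# Fair baseline for the patched build: the pre-PICL time minus the
# unrelated regression that has since been fixed.
adjusted_baseline = before_picl - other_regression

# Whatever remains above that baseline is attributed to PICL.
picl_overhead = patched - adjusted_baseline

print(f"adjusted baseline: {adjusted_baseline:.2f}s")  # 0.15s
print(f"PICL overhead:     {picl_overhead:.2f}s")      # 0.02s
```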
Aleksey, please find the preliminary patch attached.
01-11-2014
It's pretty important for us to have the correct size of the L2 line. We use it for BIS (block initializing store) instructions to zero out a cache line without fetching its previous contents from memory. A single instruction zeros the L2 cache line it touches, and we obviously have to know what that size is.
31-10-2014
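A toy model may help show why the line size matters for correctness and not just for speed. It assumes a simplified store that zeros the whole hardware line containing the address (not the exact SPARC semantics). The 64 and 128 byte sizes and all function names are hypothetical.

```python
def bis_store(mem, addr, hw_line):
    """Simplified model of one block-initializing store: zero the whole
    hardware cache line that contains addr."""
    base = addr - (addr % hw_line)
    mem[base:base + hw_line] = bytes(hw_line)

def zero_region(mem, start, length, assumed_line, hw_line):
    """Zero [start, start + length) with one store per *assumed* line."""
    for addr in range(start, start + length, assumed_line):
        bis_store(mem, addr, hw_line)

def stale_bytes(assumed_line, hw_line, size=256):
    """Count bytes that still hold old contents after zeroing."""
    mem = bytearray(b"\xAA" * size)  # pretend this is leftover garbage
    zero_region(mem, 0, size, assumed_line, hw_line)
    return sum(1 for b in mem if b != 0)

print(stale_bytes(assumed_line=64, hw_line=64))    # 0
print(stale_bytes(assumed_line=128, hw_line=64))   # 128
```

In this model, overestimating the line size leaves every other line untouched, which is the "random values in object fields" scenario raised elsewhere in this thread.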
Wait, if we get the cache line size wrong, we have a functional failure somewhere? shudders. I can see how that might happen, but I thought cache line sizes were used only as performance guidelines in the VM...
31-10-2014
We could have an option; however, if different cache line sizes on the same machine are a possibility and we run with the default settings, then we will start crashing with random values in object fields, which is not going to be a nice thing to debug. We need to check with the hardware folks whether such a thing could be a reality. I imagine things like optimized memset() should account for this as well.
31-10-2014
I admit, it is scary that the startup time hit might be proportional to the size of the machine. We are talking about 0.5s on a decently sized T4; just imagine what it'll be a few years down the road. Combining the walks seems like a good idea regardless of whether we want to keep the full walk or not. Thinking aloud: can we do a short walk by default, and do a long walk under a VM option? (As much as I hate introducing more VM options...)
31-10-2014
P1 usually means that nothing is working in a catastrophic fashion and the issue requires immediate attention within 24 hours. I don't think that in this case anything has stopped working. I think the impact is therefore low (considering it's a minor startup problem on SPARC only, and considering that PICL is the only legit way to get cache line sizes anyway). I think LHH or LMH are appropriate, hence P4 or P5.
Technically, here is what I found (on our T4) with the latest hs-comp:
No PICL: 0m0.274s
With PICL: 0m0.352s
- It apparently takes a lot of time to walk the PICL tree with picl_walk_tree_by_class(); it must be a monster tree.
- We do it twice, once for L1 and a second time for L2. Combining the walks gets me: 0.303s
- If I assume that all CPUs have the same cache line sizes (not sure if we can) and terminate the walk after we've seen the first CPU, I get: 0.261s
We have to decide if we want to keep the full walk or not (and if different CPUs with different cache lines in a single box are a possibility).
31-10-2014
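The effect of combining the two walks described in the comment above can be sketched with a toy tree. This is not HotSpot or libpicl code: `Node`, `walk_by_class`, the tree shape, and the line sizes are all made up for illustration, the property names are used only as dictionary keys, and the node visit count is merely a stand-in for walk cost.

```python
from dataclasses import dataclass, field

# Modeled on the continue/terminate results a PICL walk callback returns.
WALK_CONTINUE, WALK_TERMINATE = 0, 1

@dataclass
class Node:
    classname: str
    props: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def walk_by_class(root, classname, callback):
    """Pre-order walk that calls callback on nodes of the given class.
    Returns the number of nodes visited."""
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        if node.classname == classname and callback(node) == WALK_TERMINATE:
            break
        stack.extend(reversed(node.children))
    return visited

def make_tree(cpus=4, other_devices=100):
    """Toy tree: a few CPU nodes plus many unrelated device nodes."""
    cpu_nodes = [Node("cpu", {"l1-dcache-line-size": 32, "l2-cache-line-size": 64})
                 for _ in range(cpus)]
    others = [Node("device") for _ in range(other_devices)]
    return Node("platform", children=cpu_nodes + others)

def two_walks(root):
    """One full walk per property."""
    l1, l2 = set(), set()
    def get_l1(node):
        l1.add(node.props["l1-dcache-line-size"])
        return WALK_CONTINUE
    def get_l2(node):
        l2.add(node.props["l2-cache-line-size"])
        return WALK_CONTINUE
    visited = walk_by_class(root, "cpu", get_l1) + walk_by_class(root, "cpu", get_l2)
    return l1, l2, visited

def combined_walk(root):
    """A single walk whose callback reads both properties."""
    l1, l2 = set(), set()
    def get_both(node):
        l1.add(node.props["l1-dcache-line-size"])
        l2.add(node.props["l2-cache-line-size"])
        return WALK_CONTINUE
    return l1, l2, walk_by_class(root, "cpu", get_both)

root = make_tree()
print(two_walks(root))      # ({32}, {64}, 210)
print(combined_walk(root))  # ({32}, {64}, 105)
```

Both variants see the same values; the combined walk simply touches each node once instead of twice.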
What would be the proper ILW mapping then? I fail to see how it is not HHH => P1?
31-10-2014
Okay, if you give me a patch, I can test on my machine (the one that has much larger degradation).
31-10-2014
Found another thing to tweak. The CPU-related part of the tree is closer to the side the traversal starts on, and the records about CPUs are all siblings. So, since we know the number of CPUs (os::processor_count()), we can stop the traversal once we've seen them all, leaving a good part of the tree untouched. That gives 0.278s, which is pretty close to being good enough, I think.
31-10-2014
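The early-termination idea in the comment above can be illustrated with a small model. It assumes, as the comment states, that the CPU nodes are siblings near the start of the traversal. The tuple-based tree, the counts, and the `cpu_limit` parameter are hypothetical, and `processor_count` merely stands in for os::processor_count().

```python
def preorder(node):
    """Yield nodes in pre-order; a node is (classname, props, children)."""
    yield node
    for child in node[2]:
        yield from preorder(child)

def l2_line_sizes(root, cpu_limit=None):
    """Collect L2 line sizes from CPU nodes, stopping after cpu_limit CPUs.
    Returns (set of sizes seen, number of nodes visited)."""
    sizes, seen_cpus, visited = set(), 0, 0
    for classname, props, _ in preorder(root):
        visited += 1
        if classname != "cpu":
            continue
        sizes.add(props["l2-cache-line-size"])
        seen_cpus += 1
        if cpu_limit is not None and seen_cpus >= cpu_limit:
            break  # every expected CPU has been seen; skip the rest of the tree
    return sizes, visited

processor_count = 8  # stand-in for os::processor_count()
cpus = [("cpu", {"l2-cache-line-size": 64}, []) for _ in range(processor_count)]
others = [("device", {}, []) for _ in range(500)]
root = ("platform", {}, cpus + others)

print(l2_line_sizes(root)[1])                             # 509 (full walk)
print(l2_line_sizes(root, cpu_limit=1)[1])                # 2   (first CPU only)
print(l2_line_sizes(root, cpu_limit=processor_count)[1])  # 9   (all CPUs, then stop)
```

Stopping after the known CPU count still inspects every CPU, so unlike the first-CPU shortcut it does not rely on all CPUs sharing one line size.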
Not sure what I can do here. PICL probably needs to be looked at and optimized. Alternatively we can make it lazy and move the cost of interacting with PICL to the first compile and not use BIS in the interpreter. Either way this is definitely not a P1.
31-10-2014
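The lazy alternative mentioned in the comment above could look roughly like this. It is a sketch under stated assumptions: `l2_line_size` is a hypothetical stand-in for the PICL query, 64 is a made-up value, and the two zeroing functions only mimic an interpreter path that avoids BIS and a compiled path that needs the line size.

```python
import functools

query_count = 0  # how many times the expensive lookup actually ran

@functools.lru_cache(maxsize=None)
def l2_line_size():
    """Hypothetical stand-in for the PICL query; the cache makes it run once."""
    global query_count
    query_count += 1
    return 64  # made-up value

def interpreter_zero(buf):
    """Startup path: plain zeroing that needs no cache line information."""
    buf[:] = bytes(len(buf))

def compiled_zero(buf):
    """Post-compile path: zero in line-sized steps, triggering the lazy query."""
    line = l2_line_size()
    for off in range(0, len(buf), line):
        chunk = min(line, len(buf) - off)
        buf[off:off + chunk] = bytes(chunk)

buf = bytearray(b"\xAA" * 256)
interpreter_zero(buf)
print(query_count)  # 0: startup never paid for the lookup
compiled_zero(buf)
compiled_zero(buf)
print(query_count)  # 1: paid once, on first use
```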
Assigning to Igor, as he is the author of the original change.
30-10-2014
Initial ILW: H(Significant regression), H(Common use case, startup of any application), H(No viable workaround) => P1
30-10-2014