Bug ID: JDK-8307084 C2: Vectorized drain loop is not executed for some small trip counts

Type: Enhancement
Component: hotspot
Sub-Component: compiler
Affected Version: 11,17,20,21

Priority: P4
Status: In Progress
Resolution: Unresolved
OS: generic
CPU: generic

Submitted: 2023-04-28
Updated: 2024-12-18

Versions (Unresolved/Resolved/Fixed)

Other: tbd (Unresolved)

Related Reports

Relates :	JDK-8344085 - C2 SuperWord: improve vectorization for small loop iteration count
Relates :	JDK-8342692 - C2: long counted loop/long range checks: don't create loop-nest for short running loops
Relates :	JDK-8151573 - Multiversioning for range check elimination
Relates :	JDK-8149421 - Vectorized Post Loops

Description

In C2's loop optimization, a counted loop can be split into pre-, main-, and post-loops. C2 also inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test the remaining trip count of the loop. Execution jumps over the loop code if the remaining trip count is less than the loop stride (after unrolling), to avoid over-running the loop. For example, if a main loop is unrolled 8x (and vectorized), the main loop guard tests whether the loop has fewer than 8 iterations left to run, as shown in figure (a).
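As a rough illustration of the paragraph above, the split can be written out by hand in Java. This is only a sketch of the control flow, not what C2 actually emits: the class name, the `preIters` parameter, and the inner `j` loop (standing in for one 8-lane vector operation) are all hypothetical.

```java
public class LoopSplitSketch {
    // Assumed main loop stride after 8x unrolling (and vectorization).
    static final int UNROLL = 8;

    // Adds 1 to every element, using a hand-written pre-/main-/post-loop split.
    // preIters is a hypothetical pre-loop iteration count chosen by the caller.
    static void addOne(int[] a, int preIters) {
        int n = a.length;
        int i = 0;

        // Pre-loop: runs a few scalar iterations first.
        int preLimit = Math.min(preIters, n);
        for (; i < preLimit; i++) {
            a[i] += 1;
        }

        // Main loop min-trip guard: skip the main loop if fewer than
        // UNROLL iterations remain, so it never over-runs the array.
        if (n - i >= UNROLL) {
            for (; n - i >= UNROLL; i += UNROLL) {
                // Stands in for a single vector operation over UNROLL lanes.
                for (int j = 0; j < UNROLL; j++) {
                    a[i + j] += 1;
                }
            }
        }

        // Post loop zero-trip guard: skip the post loop if nothing remains.
        if (i < n) {
            for (; i < n; i++) {
                a[i] += 1;
            }
        }
    }
}
```

With an array of length 27 and `preIters = 3`, the main loop in this sketch handles the remaining 24 elements in three trips and the post loop is skipped by its guard.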

Usually, the vectorized main loop is super-unrolled after vectorization. In such cases, the main loop's stride is multiplied further. To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vector drain loop (a.k.a. atomic post loop). The newly inserted post loop also has a min-trip guard, and the trip guards of both the main loop and the vector post loop jump to the scalar post loop, as shown in figure (b).

After the main loop is super-unrolled, the test in the main loop's trip guard is updated. Suppose the super-unrolling count is 4 in this example: the trip guard then tests whether the remaining trip count is less than 8 * 4 = 32, as shown in figure (c).

The problem is that if the iteration count of a loop is relatively small but larger than the vector length, the vector atomic post loop is never executed: the main loop's trip guard finds too few iterations left for the main loop and jumps over the atomic post loop as well. For example, in the case above, if a loop still has 25 iterations left after the pre-loop has executed, we could run 3 trips of the atomic post loop (3 * 8 = 24 iterations), but control flow never reaches it. It would be better if the main loop's trip guard did not jump over the atomic post loop.
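The arithmetic in this example can be checked with a small model of the guards. The sketch below is a simplification under stated assumptions, not C2 code: the class, method, and flag names are hypothetical, the strides are the 8 and 8 * 4 = 32 from the example, and each loop is assumed to keep running while at least one full stride remains.

```java
import java.util.Arrays;

public class DrainLoopGuardSketch {
    static final int VECTOR_STRIDE = 8;  // stride of the vector drain (atomic post) loop
    static final int SUPER_UNROLL = 4;   // super-unrolling count from the example
    static final int MAIN_STRIDE = VECTOR_STRIDE * SUPER_UNROLL; // 32

    // Returns {mainIters, drainIters, scalarIters} for a non-negative
    // remaining trip count after the pre-loop.
    // guardFallsThroughToDrain = false models the control flow described above,
    // where a skipped main loop also skips the drain loop; true models the
    // improvement the report asks for.
    static int[] split(int remaining, boolean guardFallsThroughToDrain) {
        int main = 0;
        int drain = 0;
        boolean reachesDrainGuard;

        if (remaining >= MAIN_STRIDE) {
            // Main loop trip guard passes: run as many full strides as fit.
            main = (remaining / MAIN_STRIDE) * MAIN_STRIDE;
            remaining -= main;
            reachesDrainGuard = true;
        } else {
            // Main loop is skipped; where control goes next depends on the flag.
            reachesDrainGuard = guardFallsThroughToDrain;
        }

        if (reachesDrainGuard && remaining >= VECTOR_STRIDE) {
            drain = (remaining / VECTOR_STRIDE) * VECTOR_STRIDE;
            remaining -= drain;
        }

        // Whatever is left runs in the scalar post loop.
        return new int[] { main, drain, remaining };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(25, false))); // [0, 0, 25]
        System.out.println(Arrays.toString(split(25, true)));  // [0, 24, 1]
    }
}
```

In this model, 25 remaining iterations all run in the scalar post loop under the described control flow, whereas letting the main loop's guard fall through to the drain loop's guard would leave only 1 scalar iteration.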

This issue does not cause any incorrect behavior, but fixing it can improve the performance of loops with small trip counts.

Comments

Hi [~epeter], I proposed a draft pull request in https://github.com/openjdk/jdk/pull/22629. There are several fuzzer failures in my local testing that I'm still working on, but I think it would be better to invite you to have a look first. I'd appreciate it if you could give some feedback :-). Thanks. As for the benchmark in https://github.com/openjdk/jdk/pull/22070 , I didn't observe much performance change, only an uplift at several data points. It's probably because policy_unroll() in C2 always helps generate optimal code for the current trip count based on profiling information. If we test it with benchmarks containing small-trip-count loops, C2 won’t unroll or auto-vectorize them. If we test it with large-trip-count loops, even with unrolling and auto-vectorization, only a certain range of trip counts could show the change. The cases that this change improves may be those where we first run some loops with large trip counts, C2 generates code based on profiling information from those loops, and then the program switches to loops with relatively small trip counts. Thanks :)
07-12-2024
[~fgao] This benchmark may be relevant / interesting for you: https://github.com/openjdk/jdk/pull/22070 JDK-8344118
19-11-2024
Hi [~epeter], yeah, I'm still working on it. I have an initial patch, which helps gain the expected performance with micro-benchmarks and has passed tier 1 - 3 on both x86 and aarch64. But I've been spending quite a lot of time fixing various corner cases found by fuzzer testing. Now there are only a few left, finally! Also, I noticed that several patches to refactor loop predicates have been merged recently, so I need to rewrite the part of the code change that involves them. Once I finish this rewriting, I'll share my draft PR with you. Hope you can help review it :)
13-11-2024
[~pli][~fgao] Is there any progress on this / are you still working on it?
13-11-2024
[~fgao] amazing! Looking forward to it :)
13-11-2024
[~chagedorn] I had the same curiosity while reading the code. I saw this name first appeared in JDK-8149421. In the developer's own words: "The addition of atomic unrolled drain loops which precede fix-up segments which are significantly faster than scalar code. The requirement is that the main loop is super unrolled after vectorization." My guess is that since the loop is not super-unrolled after being vectorized, it's like an "atom"?
03-05-2023
Just out of curiosity, why is this vector post loop called "atomic" post loop?
28-04-2023
(I updated the affects versions. It's sufficient to add the versions that are still supported/updated - for readability)
28-04-2023