2177 Commits

Author SHA1 Message Date
Florian Hahn
3fa1b254b7 [VPlan] Print blend recipe as operand directly, instead of IR PHI.
Update VPBlendRecipe::print() to print the result directly, instead of
relying on the stored Phi pointer. This brings the recipe in line with
how other recipes are printed.
2023-09-04 12:35:58 +01:00
Florian Hahn
cb54522853 [LV] Add test coverage for adding DebugLoc to vector select.
Add missing test coverage for selects with !dbg info.
2023-09-04 12:01:14 +01:00
Nuno Lopes
66a652ab08 recommit test for #65212
2023-09-04 09:17:18 +01:00
Muhammad Omair Javaid
42a46730bb Revert "fix test for #65212"
This reverts commit a0b0d7493d.

It has broken the following buildbots:

https://lab.llvm.org/buildbot/#/builders/188/builds/34873
https://lab.llvm.org/buildbot/#/builders/245/builds/13538
https://lab.llvm.org/buildbot/#/builders/65/builds/11074
2023-09-04 12:53:12 +05:00
Nuno Lopes
a0b0d7493d fix test for #65212
I committed the wrong test, sorry.
2023-09-03 17:01:36 +01:00
Nuno Lopes
5a3fd5f3f5 [LoopVectorizer] Fix PR #65212: vectorization of reduction loop wasn't respecting original store alignment
2023-09-03 16:35:05 +01:00
Nuno Lopes
335a9bc4d9 precommit test for #65212
2023-09-03 16:33:57 +01:00
Florian Hahn
fd66195777 [VPlan] Manage compare predicates in VPRecipeWithIRFlags.
Extend VPRecipeWithIRFlags to also manage predicates for compares. This
allows removing the custom ICmpULE opcode from VPInstruction which was a
workaround for missing proper predicate handling.

This simplifies the code a bit while also allowing compares with any
predicates. It also fixes a case where the compare predicate wasn't
printed properly for VPReplicateRecipes.

Discussed/split off from D150398.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D158992
2023-09-02 21:45:24 +01:00
Igor Kirillov
ac65fb8699 [LoopVectorize] Fix incorrect order of invariant stores when there are multiple reductions.
When a loop has multiple reductions, each with an intermediate invariant
store, the order in which those reductions are processed is not considered.
This can result in the invariant stores outside the loop not preserving the
original order.
This patch sorts VPReductionPHIRecipes by the order in which they have
stores in the original loop before running the
`InnerLoopVectorizer::fixReduction` function, which helps to maintain
the correct order of stores.
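
As an illustration of why the order matters, here is a minimal C sketch of a loop with two reductions that each have an intermediate invariant store. It assumes both stores target the same address, which is the case where the order is directly observable; the function name `two_reductions` is hypothetical and the loop is not taken from the patch or its tests.

```c
#include <stddef.h>

/* Two reductions (a sum and an xor), each followed by a store to the
 * loop-invariant address `out`. In the scalar loop the xor store is the
 * last one executed, so `*out` must hold the xor result afterwards.
 * If a transformation moves both stores out of the loop, it has to emit
 * them in this same order. */
static void two_reductions(const int *a, int n, int *out) {
    int sum = 0;
    int xr = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i];
        *out = sum; /* first invariant store */
        xr ^= a[i];
        *out = xr;  /* second invariant store; this value must survive */
    }
}
```

For the input {1, 2, 3} the sum is 6 and the xor is 0, so a correct compilation leaves 0 in `*out`; emitting the stores in the opposite order would leave 6.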

Fixes https://github.com/llvm/llvm-project/issues/64047

Differential Revision: https://reviews.llvm.org/D157631
2023-08-31 16:21:44 +00:00
Igor Kirillov
2df9ed11c5 [LoopVectorize] Pre-commit tests for D157631
Differential Revision: https://reviews.llvm.org/D157630
2023-08-31 09:50:53 +00:00
Dhruv Chawla
4ea8212775 [NFC][LoopVectorize] Regenerate test checks
2023-08-30 23:22:57 +05:30
Ramkumar Ramachandra
04b1276ad3 LoopVectorize/iv-select-cmp: add tests for truncated IV
The current tests in iv-select-cmp.ll are not representative of clang
output of common real-world C programs, which are often written with i32
induction vars, as opposed to i64 induction vars. Hence, add five tests
corresponding to the following programs:

  int test(int *a, int n) {
    int rdx = 331;
    for (int i = 0; i < n; i++) {
      if (a[i] > 3)
        rdx = i;
    }
    return rdx;
  }

  int test(int *a) {
    int rdx = 331;
    for (int i = 0; i < 20000; i++) {
      if (a[i] > 3)
        rdx = i;
    }
    return rdx;
  }

  int test(int *a, long n) {
    int rdx = 331;
    for (int i = 0; i < n; i++) {
      if (a[i] > 3)
        rdx = i;
    }
    return rdx;
  }

  int test(int *a, unsigned n) {
    int rdx = 331;
    for (int i = 0; i < n; i++) {
      if (a[i] > 3)
        rdx = i;
    }
    return rdx;
  }

  int test(int *a) {
    int rdx = 331;
    for (long i = INT_MIN - 1; i < UINT_MAX; i++) {
      if (a[i] > 3)
        rdx = i;
    }
    return rdx;
  }

The first two can theoretically be vectorized without a runtime-check,
while the third and fourth cannot. The fifth cannot be vectorized, even
with a runtime-check.
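
For readers unfamiliar with the pattern, the following C sketch shows one way the first program's reduction could be evaluated in chunks. It is only an assumption-laden illustration, not the scheme used by D150851 or by the vectorizer: `LANES`, `last_match_scalar` and `last_match_chunked` are hypothetical names, and `INT_MIN` is used as a "no match yet" sentinel, which relies on the indices being non-negative.

```c
#include <limits.h>

#define LANES 4 /* hypothetical vector width, chosen for illustration */

/* Scalar reference: the first program from the list above. */
static int last_match_scalar(const int *a, int n) {
    int rdx = 331;
    for (int i = 0; i < n; i++)
        if (a[i] > 3)
            rdx = i;
    return rdx;
}

/* Chunked variant: each lane remembers the latest matching index it saw.
 * Because the induction variable only increases, the latest match overall
 * is the maximum over all lanes. */
static int last_match_chunked(const int *a, int n) {
    int lane_max[LANES];
    for (int l = 0; l < LANES; l++)
        lane_max[l] = INT_MIN; /* sentinel: no match in this lane */

    int i = 0;
    for (; n - i >= LANES; i += LANES)
        for (int l = 0; l < LANES; l++)
            if (a[i + l] > 3)
                lane_max[l] = i + l;

    /* Horizontal max reduction across the lanes. */
    int best = INT_MIN;
    for (int l = 0; l < LANES; l++)
        if (lane_max[l] > best)
            best = lane_max[l];

    /* Fall back to the start value if nothing matched. */
    int rdx = (best == INT_MIN) ? 331 : best;

    /* Scalar remainder loop for the last n % LANES elements. */
    for (; i < n; i++)
        if (a[i] > 3)
            rdx = i;
    return rdx;
}
```

Both functions should agree on any array with a non-negative length, which makes the scalar version a convenient oracle when experimenting with the chunked one.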

This issue was found while reviewing D150851.

Differential Revision: https://reviews.llvm.org/D156124
2023-08-30 13:09:37 +01:00
Florian Hahn
96e83d3705 [LV] Use IRBuilder to create and optimize middle-block compare.
Split off from D150398 to avoid builder-related diff changes there.
Using IRBuilder to create ICmps simplifies the result if both operands
are constants.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D158332
2023-08-29 11:42:18 +01:00
David Sherwood
c02184f286 [LoopVectorize] Allow inner loop runtime checks to be hoisted above an outer loop
Suppose we have a nested loop like this:

  void foo(int32_t *dst, int32_t *src, int m, int n) {
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        dst[(i * n) + j] += src[(i * n) + j];
      }
    }
  }

We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. However, if the
runtime-determined trip count for the inner loop is quite small
then these checks become relatively expensive. This patch
attempts to mitigate these costs by adding a new option to
expand the memory ranges being checked to include the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:

1. &dst[(i * n)] -> &dst[(i * n) + n]
2. &src[(i * n)] -> &src[(i * n) + n]

we can instead look at the expanded ranges:

1. &dst[0] -> &dst[((m - 1) * n) + n]
2. &src[0] -> &src[((m - 1) * n) + n]

which are outer-loop-invariant. As with many optimisations there
is a trade-off here, because there is a danger that using the
expanded ranges we may never enter the vectorised inner loop,
whereas with the smaller ranges we might enter at least once.
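
To make the trade-off concrete, here is a minimal C sketch of the two kinds of check. The helper names (`ranges_overlap`, `inner_check_conflicts`, `hoisted_check_conflicts`) are hypothetical and do not come from the patch; the real checks are emitted as IR by the vectorizer, and this sketch only models the range arithmetic described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the half-open ranges [a, a + len) and [b, b + len) overlap. */
static bool ranges_overlap(const int32_t *a, const int32_t *b, size_t len) {
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;
    uintptr_t bytes = len * sizeof(int32_t);
    return a0 < b0 + bytes && b0 < a0 + bytes;
}

/* Per-outer-iteration check: row i of dst against row i of src,
 * i.e. &dst[i * n] -> &dst[i * n + n] and the same for src. */
static bool inner_check_conflicts(const int32_t *dst, const int32_t *src,
                                  size_t i, size_t n) {
    return ranges_overlap(dst + i * n, src + i * n, n);
}

/* Hoisted check: the whole m * n extent of both arrays, evaluated once
 * before the outer loop, i.e. &dst[0] -> &dst[(m - 1) * n + n]. */
static bool hoisted_check_conflicts(const int32_t *dst, const int32_t *src,
                                    size_t m, size_t n) {
    return ranges_overlap(dst, src, m * n);
}
```

With `dst = buf` and `src = buf + 4` for m = n = 4, no individual pair of rows overlaps, but the expanded ranges do, so the hoisted check would always take the scalar path while the per-row check would allow the vector loop on every outer iteration.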

I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where we know this is
guaranteed to be of real benefit. In future, we can also use
PGO to determine if this is worthwhile by using the inner loop
trip count information.

When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:

SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
...
GEOMEAN: +0.5%

I didn't investigate all the differences to see if they are
genuine or noise, but I know the x264 improvement is real because
it has some hot nested loops with low trip counts where I can
see this hoisting is beneficial.

Tests have been added here:

  Transforms/LoopVectorize/runtime-checks-hoist.ll

Differential Revision: https://reviews.llvm.org/D152366
2023-08-24 12:14:02 +00:00
David Sherwood
494d28ec07 [LoopVectorize] Add pre-commit tests for D152366
Differential Revision: https://reviews.llvm.org/D154075
2023-08-24 10:52:18 +00:00
Florian Hahn
c071dba1a3 [LV] update hexagon test to use load results.
The current version of the test doesn't use any of the loads, so they
can be removed together with the mask of the interleave group.

Use some loaded values and store them, to prevent the mask from being
optimized away.
2023-08-22 20:20:58 +01:00
Florian Hahn
34d25924c4 [VPlan] Mark some VPInstruction opcodes as not having side effects.
Mark some VPInstruction opcodes as not having side effects, preparation
for D157037.
2023-08-22 20:05:57 +01:00
Kolya Panchenko
acbe886880 [LV] Vectorization remark for outerloop
Reviewed By: fhahn, ABataev

Differential Revision: https://reviews.llvm.org/D150696
2023-08-21 13:05:06 -04:00
Florian Hahn
686aef8401 [LV] Remove compares and branches on undef from a few tests.
2023-08-18 16:28:42 +01:00
Roland Froese
4d425f8663 [PowerPC] vector cost model add cost to extract i1
Try to avoid some unprofitable predication on PPC. Recognize in the cost model that computing on i1 values will require an extra mask or compare operation.

Differential Revision: https://reviews.llvm.org/D155876
2023-08-14 17:04:11 -04:00
Kerry McLaughlin
5d814b3848 Revert "[AArch64][SVE2] Change the cost of extends with S/URHADD to 0"
This reverts commit dda2cd2505.
2023-08-14 10:44:13 +00:00
Kerry McLaughlin
dda2cd2505 [AArch64][SVE2] Change the cost of extends with S/URHADD to 0
When SVE2 is enabled, we can combine an add of 1, an add, and a shift right by 1
into a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.
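
As a rough illustration, the unsigned form of this pattern can be written in C as below. This is a sketch of the source-level idiom only; the function names are hypothetical, and whether a given compiler version actually selects urhadd for it is not something this example demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounding halving add on bytes: extend both operands, add them,
 * add 1, then shift right by 1. The extension to 16 bits keeps the
 * carry out of the 8-bit addition. */
static uint8_t urhadd_scalar(uint8_t a, uint8_t b) {
    uint16_t wide = (uint16_t)((uint16_t)a + (uint16_t)b + 1u);
    return (uint8_t)(wide >> 1);
}

/* The loop shape a vectorizer would see: the two extends feed the adds,
 * and the result is narrowed back to the element type. */
static void urhadd_loop(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                        size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = urhadd_scalar(a[i], b[i]);
}
```

The extends exist only to avoid overflow in the intermediate sum, which is why their cost can be treated as zero once the whole expression maps to a single instruction.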

Reviewed By: dtemirbulatov

Differential Revision: https://reviews.llvm.org/D157628
2023-08-14 10:32:06 +00:00
Anna Thomas
5dfdf34df0 [LV] Move interleaved test to X86 directory
Remove the x86-registered-target under REQUIRES.
2023-08-09 16:03:33 -04:00
David Spickett
c09bdfe6f7 [LV] Require x86 target for interleaved access test
This is failing on every Linaro bot that only builds
the Arm or AArch64 targets; with X86 added, it passes.
2023-08-09 09:02:02 +00:00
Anna Thomas
cb7d28ef52 Fix BB failure for check lines
Fix clang build bots which complain of missing check lines for Loop
access analysis by generating two run lines (original commit: 3cf24dbb).
2023-08-08 20:28:33 -04:00
Anna Thomas
3cf24dbbdd [LV] Complete load groups and release store groups. Try 2.
This is a complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups correctly.

This change was previously reverted (e85fd3cbdd) due to an ASan failure with a
use-after-free error. A testcase is added and the bug is fixed in this
version of the patch.

Differential Revision: https://reviews.llvm.org/D155520
2023-08-08 18:10:23 -04:00
Florian Hahn
af635a5547 [VPlan] Model wrap flags directly, remove *NUW opcodes (NFC)
Model wrap flags directly using VPRecipeWithIRFlags and clean up the
duplicated *NUW opcodes.

D157144 will build on this and also model FMFs for VPInstruction.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D157194
2023-08-08 12:12:30 +01:00
Florian Hahn
93c5bae00e [VPlan] Use printOperands for VPInstruction.
Use the printOperands for printing VPInstruction's operands to be more
in line with other recipes and ensure consistent printing after D15719.

Also removes some stray spaces in print output.
2023-08-08 11:31:21 +01:00
Florian Hahn
539acce167 [LV] Add variant of test without dead load.
The original test has an unused load, which is removed. Also add a
variant with a store that cannot be removed, forcing the mask for the
block to always be generated.
2023-08-05 14:15:18 +01:00
Jolanta Jensen
3feb63e112 [TLI][AArch64] Add SLEEF mappings to scalable vector functions for fmod and fmodf
This patch adds SLEEF mappings to scalable vector functions for fmod and fmodf.

Differential Revision: https://reviews.llvm.org/D156920
2023-08-03 14:33:33 +00:00
Mel Chen
425e9e81a0 [LV] Rename the Select[I|F]Cmp reduction pattern to [I|F]AnyOf. (NFC)
For background on this NFC change, see the discussion at https://reviews.llvm.org/D150851#4467261

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D155786
2023-08-03 00:37:19 -07:00
Florian Hahn
cdb7d5767c [LV] Add test for select truncation.
Add test coverage for truncating selects for D149903.
2023-08-01 18:53:36 +01:00
Florian Hahn
707359ecf5 Recommit "[LV] Re-use existing broadcast value for live-ins."
This reverts commit 245ec675a4.

Recommits eea9258648 with a fix to only erase the instruction from the
first part if it is defined outside the loop. This fixes a
use-after-free error reported.
2023-08-01 15:54:02 +01:00
Zhongyunde
497966f7f2 Reland [InstSimplify] Remove the remainder loop if we know the mask is always true
We check whether the loop trip count is known to be a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.
This depends on the power-of-two vscale assumption from D155350.

proofs: https://alive2.llvm.org/ce/z/bT62Wa

Fix https://github.com/llvm/llvm-project/issues/63616.

Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
2023-08-01 22:20:22 +08:00
Florian Hahn
4e130420e3 [LV] Add test for op truncation from 245ec675a4.
Add an extra test for the issue reported in 245ec675a4. Also generate full check
lines for tests, which should now be deterministic on all platforms.
2023-08-01 13:54:50 +01:00
Nikita Popov
d01aec4c76 [InstCombine] Set dead phi inputs to poison in more cases
Set phi inputs to poison whenever we find a dead edge (either
during initial worklist population or the main InstCombine run),
instead of only doing this for successors of dead blocks.

This means that the phi operand is set to poison even for
critical edges without an intermediate block.

There are quite a few test changes, because the pattern is fairly
common in vectorizer output, for cases where we know the vectorized
loop will be entered.
2023-08-01 11:53:47 +02:00
Nikita Popov
7c64449e44 [LoopVectorize] Regenerate test checks (NFC)
To reduce spurious diffs in future changes.
2023-08-01 11:30:55 +02:00
Nikita Popov
eb9fce092a Revert "[InstSimplify] Remove the remainder loop if we know the mask is always true"
This reverts commit 3e386b2278.

Next to the original fold, this also implements an unnecessary and
inappropriate simplifyICmpWithDominatingAssume() based fold.
2023-08-01 09:03:20 +02:00
Zhongyunde
3e386b2278 [InstSimplify] Remove the remainder loop if we know the mask is always true
We check whether the loop trip count is known to be a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.
This depends on the power-of-two vscale assumption from D155350.

proofs: https://alive2.llvm.org/ce/z/FkTMoy

Fix https://github.com/llvm/llvm-project/issues/63616.

Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
2023-08-01 11:20:20 +08:00
Florian Hahn
4162f36bcb [LV] Regenerate check lines for shrinking tests.
Make sure the full IR is checked for loop-vectorization-factors.ll so that
nothing gets missed, and add missing checks for type-shrinkage-insertelt.ll.

Also removes some undef ops from tests.
2023-07-30 16:38:28 +01:00
David Green
2a859b2014 [AArch64] Change the cost of vector insert/extract to 2
The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.

This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is
also overridden for integer vector at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).

The lower insert/extract cost will reduce the cost of inserts, extracts,
shuffling and scalarization. The adjustments of ScalarizationOverhead will
increase the cost on integer, especially for small vectors. The end result will
be lower cost for float and long-integer types, some higher cost for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizer.

We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done this is generally an improvement in the
overall performance, and I've attempted to address the places where it wasn't
with other costmodel adjustments.

Differential Revision: https://reviews.llvm.org/D155459
2023-07-28 21:26:50 +01:00
Florian Hahn
cc39866436 [LV] Reorganize and extend in-loop reduction tests.
Split off min-max in-loop reduction tests into separate file and extend
them by adding tests with
 * min & max intrinsics
 * fmuladd with permuted operands
 * min & max select tests with permuted operands.

Adds extra test coverage as suggested in D155845.
2023-07-26 23:23:14 +01:00
Anna Thomas
e85fd3cbdd Revert "[LV] Complete load groups and release store groups in presence of dependency"
This reverts commit eaf6117f33 (D155520).
There's an ASAN build failure that needs investigation.
2023-07-26 15:07:26 -04:00
Ramkumar Ramachandra
110ec1863a LoopVectorize/iv-select-cmp: add test for decreasing IV, const start
The most straightforward extension to D150851 would involve a loop with
decreasing induction variable, with a constant start value.
iv-select-cmp.ll only contains a negative test for the decreasing
induction variable case when the start value is variable, namely
not_vectorized_select_decreasing_induction_icmp. Hence, add a test for
the most straightforward extension to D150851, in preparation to
vectorize:

  long rdx = 331;
  for (long i = 19999; i >= 0; i--) {
    if (a[i] > 3)
      rdx = i;
  }
  return rdx;

Differential Revision: https://reviews.llvm.org/D156152
2023-07-26 14:15:26 +01:00
Anna Thomas
eaf6117f33 [LV] Complete load groups and release store groups in presence of dependency
This is a complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups correctly.

Differential Revision: https://reviews.llvm.org/D155520
2023-07-25 17:32:09 -04:00
Martin Storsjö
245ec675a4 Revert "[LV] Re-use existing broadcast value for live-ins."
This reverts commit eea9258648.

That commit triggered crashes in the following testcase:

$ cat reduced.c
typedef struct {
  int a[8]
} b;
typedef struct {
  b *c;
  short d
} e;
void f() {
  int g;
  char *h;
  e *i = f;
  short j = i->d;
  int a = i->c->a[0];
  for (;;)
    for (; g < a; g++) {
      *h = j * i->d >> 8;
      h++;
    }
}
$ clang -target aarch64-linux-gnu -w -c -O2 reduced.c
2023-07-25 10:35:41 +03:00
Florian Hahn
eea9258648 [LV] Re-use existing broadcast value for live-ins.
When requesting a vector value for a live-in, we can re-use the
broadcast of the live-in of part 0 for parts > 0.
2023-07-24 11:50:47 +01:00
Maciej Gabka
38cdb007a5 Add missing SLEEF mappings to scalable vector functions for log2 and log2f
In the original commit adding SLEEF mappings, https://reviews.llvm.org/D146839
mappings for log2/log2f were missing.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D155801
2023-07-21 13:59:13 +00:00
Maciej Gabka
b172fbff68 Revert "[TLI][AArch64] Add missing SLEEF mappings to scalable vector functions for log2 and log2f"
This reverts commit 791c89600a.
2023-07-21 13:50:10 +00:00
Maciej Gabka
791c89600a [TLI][AArch64] Add missing SLEEF mappings to scalable vector functions for log2 and log2f
In the original commit adding SLEEF mappings, https://reviews.llvm.org/D146839
mappings for log2/log2f were missing.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D155623
2023-07-21 13:46:03 +00:00