Update VPBlendRecipe::print() to print the result directly, instead of
relying on the stored Phi pointer. This brings the recipe in line with
how other recipes are printed.
Extend VPRecipeWithIRFlags to also manage predicates for compares. This
allows removing the custom ICmpULE opcode from VPInstruction which was a
workaround for missing proper predicate handling.
This simplifies the code a bit while also allowing compares with any
predicates. It also fixes a case where the compare predixcate wasn't
printed properly for VPReplicateRecipes.
Discussed/split off from D150398.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D158992
When a loop has multiple reductions, each with an intermediate invariant
store, the order in which those reductions are processed is not considered.
This can result in the invariant stores outside the loop not preserving the
original order.
This patch sorts VPReductionPHIRecipes by the order in which they have
stores in the original loop before running
`InnerLoopVectorizer::fixReduction` function, and it helps to maintain
the correct order of stores.
Fixes https://github.com/llvm/llvm-project/issues/64047
Differential Revision: https://reviews.llvm.org/D157631
The current tests in iv-select-cmp.ll are not representative of clang
output of common real-world C programs, which are often written with i32
induction vars, as opposed to i64 induction vars. Hence, add five tests
corresponding to the following programs:
int test(int *a, int n) {
int rdx = 331;
for (int i = 0; i < n; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a) {
int rdx = 331;
for (int i = 0; i < 20000; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a, long n) {
int rdx = 331;
for (int i = 0; i < n; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a, unsigned n) {
int rdx = 331;
for (int i = 0; i < n; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a) {
int rdx = 331;
for (long i = INT_MIN - 1; i < UINT_MAX; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
The first two can theoretically be vectorized without a runtime-check,
while the third and fourth cannot. The fifth cannot be vectorized, even
with a runtime-check.
This issue was found while reviewing D150851.
Differential Revision: https://reviews.llvm.org/D156124
Split off from D150398 to avoid builder-related diff changes there.
Using IRBuilder to create ICmps simplifies the result if both operands
are constants.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D158332
Suppose we have a nested loop like this:
void foo(int32_t *dst, int32_t *src, int m, int n) {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
dst[(i * n) + j] += src[(i * n) + j];
}
}
}
We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. However, if the
runtime-determined trip count for the inner loop is quite small
then the cost of these checks becomes quite expensive. This patch
attempts to mitigate these costs by adding a new option to
expand the memory ranges being checked to include the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:
1. &dst[(i * n)] -> &dst[(i * n) + n]
2. &src[(i * n)] -> &src[(i * n) + n]
we can instead look at the expanded ranges:
1. &dst[0] -> &dst[((m - 1) * n) + n]
2. &src[0] -> &src[((m - 1) * n) + n]
which are outer-loop-invariant. As with many optimisations there
is a trade-off here, because there is a danger that using the
expanded ranges we may never enter the vectorised inner loop,
whereas with the smaller ranges we might enter at least once.
I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where we know this is
guaranteed to be of real benefit. In future, we can also use
PGO to determine if this is worthwhile by using the inner loop
trip count information.
When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:
SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
...
GEOMEAN: +0.5%
I didn't investigate all the differences to see if they are
genuine or noise, but I know the x264 improvement is real because
it has some hot nested loops with low trip counts where I can
see this hoisting is beneficial.
Tests have been added here:
Transforms/LoopVectorize/runtime-checks-hoist.ll
Differential Revision: https://reviews.llvm.org/D152366
The current version of the test doesn't use any of the loads, so they
can be removed together with the mask of the interleave group.
Use some loaded values and store them, to prevent the mask from being
optimized away.
Try to avoid some unprofitable predication on PPC. Recognize in the cost model that computing on i1 values will require extra mask or compare operation.
Differential Revision: https://reviews.llvm.org/D155876
When SVE2 is enabled, we can combine an add of 1, add & shift right by 1
to a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.
Reviewed By: dtemirbulatov
Differential Revision: https://reviews.llvm.org/D157628
This is a complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups correctly.
This change was previously reverted (e85fd3cbdd) due to Asan failure with
use-after-free error. A testcase is added and the bug is fixed in this
version of the patch.
Differential Revision: https://reviews.llvm.org/D155520
Model wrap flags directly using VPRecipeWithIRFlags and clean up the
duplicated *NUW opcodes.
D157144 will build on this and also model FMFs for VPInstruction.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D157194
Use the printOperands for printing VPInstruction's operands to be more
in line with other recipes and ensure consistent printing after D15719.
Also removes some stray spaces in print output.
The original test has a unused load, which is removed. Also add a
variant with a store that cannot be removed, forcing the mask for the
block to always be generated.
This reverts commit 245ec675a4.
Recommits eea9258648 with a fix to only erase the instruction from the
first part if it is defined outside the loop. This fixes a
use-after-free error reported.
We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350
proofs: https://alive2.llvm.org/ce/z/bT62Wa
Fix https://github.com/llvm/llvm-project/issues/63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
Set phi inputs to poison whenever we find a dead edge (either
during initial worklist population or the main InstCombine run),
instead of only doing this for successors of dead blocks.
This means that the phi operand is set to poison even if for
critical edges without an intermediate block.
There are quite a few test changes, because the pattern is fairly
common in vectorizer output, for cases where we know the vectorized
loop will be entered.
This reverts commit 3e386b2278.
Next to the original fold, this also implements an unnecessary and
inappropriate simplifyICmpWithDominatingAssume() based fold.
We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350
proofs: https://alive2.llvm.org/ce/z/FkTMoy
Fix https://github.com/llvm/llvm-project/issues/63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
Make sure the full IR is checked for loop-vectorization-factors.ll and
to make sure nothing gets missed and add missing checks for type-shrinkage-insertelt.ll.
Also removes some undef ops from tests.
The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.
This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is
also overridden for integer vector at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).
The lower insert/extract cost will reduce the cost of insert, extracts,
shuffling and scalarization. The adjustments of ScalaizationOverhead will
increase the cost on integer, especially for small vectors. The end result will
be lower cost for float and long-integer types, some higher cost for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizer.
We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done this is generally an improvement in the
overall performance, and I've attempted to address the places where it wasn't
with other costmodel adjustments.
Differential Revision: https://reviews.llvm.org/D155459
Split off min-max in-loop reduction tests into separate file and extend
them by adding tests with
* min & max intrinsics
* fmuladd with permuted operands
* min & max select tests with permuted operands.
Adds extra test coverage as suggested in D155845.
The most straightforward extension to D150851 would involve a loop with
decreasing induction variable, with a constant start value.
iv-select-cmp.ll only contains a negative test for the decreasing
induction variable case when the start value is variable, namely
not_vectorized_select_decreasing_induction_icmp. Hence, add a test for
the most straightforward extension to D150851, in preparation to
vectorize:
long rdx = 331;
for (long i = 19999; i >= 0; i--) {
if (a[i] > 3)
rdx = i;
}
return rdx;
Differential Revision: https://reviews.llvm.org/D156152
This is a complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups correctly.
Differential Revision: https://reviews.llvm.org/D155520
This reverts commit eea9258648.
That commit triggered crashes in the following testcase:
$ cat reduced.c
typedef struct {
int a[8]
} b;
typedef struct {
b *c;
short d
} e;
void f() {
int g;
char *h;
e *i = f;
short j = i->d;
int a = i->c->a[0];
for (;;)
for (; g < a; g++) {
*h = j * i->d >> 8;
h++;
}
}
$ clang -target aarch64-linux-gnu -w -c -O2 reduced.c