Split off from D150398 to avoid builder-related diff changes there.
Using IRBuilder to create ICmps simplifies the result if both operands
are constants.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D158332
When SVE2 is enabled, we can combine an add of 1, add & shift right by 1
to a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.
Reviewed By: dtemirbulatov
Differential Revision: https://reviews.llvm.org/D157628
Model wrap flags directly using VPRecipeWithIRFlags and clean up the
duplicated *NUW opcodes.
D157144 will build on this and also model FMFs for VPInstruction.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D157194
Use the printOperands for printing VPInstruction's operands to be more
in line with other recipes and ensure consistent printing after D15719.
Also removes some stray spaces in print output.
This reverts commit 245ec675a4.
Recommits eea9258648 with a fix to only erase the instruction from the
first part if it is defined outside the loop. This fixes a
use-after-free error reported.
We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350
proofs: https://alive2.llvm.org/ce/z/bT62Wa
Fix https://github.com/llvm/llvm-project/issues/63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
Set phi inputs to poison whenever we find a dead edge (either
during initial worklist population or the main InstCombine run),
instead of only doing this for successors of dead blocks.
This means that the phi operand is set to poison even if for
critical edges without an intermediate block.
There are quite a few test changes, because the pattern is fairly
common in vectorizer output, for cases where we know the vectorized
loop will be entered.
This reverts commit 3e386b2278.
Next to the original fold, this also implements an unnecessary and
inappropriate simplifyICmpWithDominatingAssume() based fold.
We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350
proofs: https://alive2.llvm.org/ce/z/FkTMoy
Fix https://github.com/llvm/llvm-project/issues/63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
Make sure the full IR is checked for loop-vectorization-factors.ll and
to make sure nothing gets missed and add missing checks for type-shrinkage-insertelt.ll.
Also removes some undef ops from tests.
The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.
This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is
also overridden for integer vector at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).
The lower insert/extract cost will reduce the cost of insert, extracts,
shuffling and scalarization. The adjustments of ScalaizationOverhead will
increase the cost on integer, especially for small vectors. The end result will
be lower cost for float and long-integer types, some higher cost for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizer.
We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done this is generally an improvement in the
overall performance, and I've attempted to address the places where it wasn't
with other costmodel adjustments.
Differential Revision: https://reviews.llvm.org/D155459
This reverts commit eea9258648.
That commit triggered crashes in the following testcase:
$ cat reduced.c
typedef struct {
int a[8]
} b;
typedef struct {
b *c;
short d
} e;
void f() {
int g;
char *h;
e *i = f;
short j = i->d;
int a = i->c->a[0];
for (;;)
for (; g < a; g++) {
*h = j * i->d >> 8;
h++;
}
}
$ clang -target aarch64-linux-gnu -w -c -O2 reduced.c
These tests were originally added in 0aff1798b5, where they
were measuring the cost of fadd and fmuladd reductions, which should be fairly
high cost. For some reason, due to the forced vector factors, the debug costs
of each instruction are printed twice by the vectorizer. Once as if the
instruction is a simple fadd/fmuladd, and later with the correct reduction
cost.
In d827865e9f the costs were updated to match the first
print statements, where they would be better to match the second to test the
cost of the reduction.
This patch returns them to testing the original reduction costs.
Before this patch, the only way to generate streaming-compatible code
was to use the `-force-streaming-compatible-sve` flag, but the compiler
should also avoid the use of instructions invalid in streaming mode
when a function has the aarch64_pstate_sm_enabled/compatible attribute.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D155428
This patch is separated from D154953 to see what tests are affected by this
change alone according comment.
Depend on the related updating of LangRef on D155193.
Reviewed By: paulwalker-arm, nikic, david-arm
Differential Revision: https://reviews.llvm.org/D155350
Arm Performance Libraries contain math library which provides
vectorized versions of common math functions.
This patch allows to use it with clang and llvm via -fveclib=ArmPL or
-vector-library=ArmPL, so loops with such calls can be vectorized.
The executable needs to be linked with the amath library.
Arm Performance Libraries are available at:
https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries
Reviewed by: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154508
This patch extends LoopVectorize to handle the vectorization of interleaved
memory accesses with scalable vectors when mask is required or/and predicated
tail folding is enabled.
Differential Revision: https://reviews.llvm.org/D152258
Also replace aarch64_be-*-eabi with aarch64_be
Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver.
We want to avoid it elsewhere as well. Just use the common "aarch64" without
other triple components.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D153943
This patch uses the (de)interleaving intrinsics introduced in
D141924 to handle vectorization of interleaving groups with a
factor of 2 for scalable vectors.
Reviewed By: fhahn, reames
Differential Revision: https://reviews.llvm.org/D145163
For Neon, the default nonconst stride cost is conservative,
and it is a local variable, which is not convenience to
to tune the loop vectorize.
So I try to use a option, which is similar to SVEGatherOverhead brought in D115143.
Fix https://github.com/llvm/llvm-project/issues/63082.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D152253
Update collectLoopUniforms to identify uniform pointers using
Legal::isUniform. This is more powerful and brings pointer
classification here in sync with setCostBasedWideningDecision
which uses isUniformMemOp. The existing mis-match in reasoning
can causes crashes due to D134460, which is fixed by this patch.
Fixes https://github.com/llvm/llvm-project/issues/60831.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D150991
Implement precise nuw/nsw support in the KnownBits implementation,
replacing the rather crude handling in ValueTracking.
Differential Revision: https://reviews.llvm.org/D151208
Now that IR flags are modeled as part of VPRecipeWithIRFlags, include
the flags when printing recipes.
Depends on D150027.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D150029
We noticed some runtime performance improvements by disabling maximising
bandwidth for streaming compatible sve.
Differential Revision: https://reviews.llvm.org/D150336
This patch enables the tail-folding of simple loops by default
when targeting the neoverse-v1 CPU. Simple loops exclude those
with recurrences or reductions or loops that are reversed.
New tests have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
In terms of SPEC2017 only one benchmark is really affected when
building with "-Ofast -mcpu=neoverse-v1 -flto", which is
(+ faster, - slower):
525.x264: +7.0%
Differential Revision: https://reviews.llvm.org/D130618
This is a follow-up to b71edfaa4e
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
This patch does simple refactoring of the tail-folding
option in preparation for enabling tail-folding by
default for neoverse-v1. It adds a default tail-folding
option field to the AArch64Subtarget class that can
be set on a per-CPU.
Differential Revision: https://reviews.llvm.org/D149659
To generate cast instructions, the result type is needed. To allow
creating widened casts without underlying instruction, introduce a new
VPWidenCastRecipe that also holds the result type.
This functionality will be used in a follow-up patch to
implement truncateToMinimalBitwidths as VPlan-to-VPlan transform.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D149081
This patch adds a new preheader block the VPlan to place SCEV expansions
expansions like the trip count. This preheader block is disconnected
at the moment, as the bypass blocks of the skeleton are not yet modeled
in VPlan.
The preheader block is executed before skeleton creation, so the SCEV
expansion results can be used during skeleton creation. At the moment,
the trip count expression and induction steps are expanded in the new
preheader. The remainder of SCEV expansions will be moved gradually in
the future.
D147965 will update skeleton creation to use the steps expanded in the
pre-header to fix#58811.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147964
This patch reverts rGae739aefd7473517d3f08b5c8d08a66c7f469198 to address performance regressions reported by our [CI](https://github.com/dtcxzyw/llvm-ci/issues/137) after rG2ec1d0f427c7822540352c0c14d057e7bfe4f77b.
For example:
```
define ptr @const_gep_chain(ptr %p, i64 %a) {
%p1 = getelementptr inbounds i8, ptr %p, i64 %a
%p2 = getelementptr inbounds i8, ptr %p1, i64 1
%p3 = getelementptr inbounds i8, ptr %p2, i64 2
%p4 = getelementptr inbounds i8, ptr %p3, i64 3
ret ptr %p4
}
```
The last three GEPs will not be folded since rG2ec1d0f427c7822540352c0c14d057e7bfe4f77b.
I think it is appropriate to remove this code because there is no compile-time regression reported in our benchmarks.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D149240
With this patch an undefined mask in a shufflevector will be printed as poison.
This change is done to support the new shufflevector semantics
for undefined mask elements.
Differential Revision: https://reviews.llvm.org/D149210
Now that we store the ScalarCost in the VectorizationFactor it is possible to
use it to get a slightly more accurate cost in isMoreProfitable between two
vector factors. This extends the logic added in D101726 to non-tail-folded
cases, using the costs of `VecCost * (TripCount / VF) + ScalarCost * (TripCount % VF)`
to compare VFs where the TripCount is known and we are not folding the tail.
This shouldn't alter very much as small trip counts are usually not vectorized,
but does seem to help in the testcase where 4 * VF4 is chosen as profitable
compared to 2 * VF8 + 4 * scalar.
Differential Revision: https://reviews.llvm.org/D147720
In LoopVectorizationCostModel::isEpilogueVectorizationProfitable we
check to see if the chosen main vector loop VF >= 16. If so, we
decide to create a vector epilogue loop. However, this doesn't
take VScaleForTuning into account because we could be targeting a
CPU where vscale > 1, and hence the runtime VF would be a multiple
of the known minimum value.
This patch multiplies scalable VFs by VScaleForTuning and several
tests have been updated that now produce vector epilogues.
Differential Revision: https://reviews.llvm.org/D147522
Based off D148215, when expanding a min/max reduction we should be creating min/max intrinsics directly instead of relying on instcombine to fold them back together.
This patch handles integer min/max cases. Hopefully we can add floating point support soon (at least for fastmath/nnan cases) - but we're missing some of the plumbing to pass the correct FMF to the intrinsic at the moment.
Differential Revision: https://reviews.llvm.org/D148221