This patch prepares the ground for #76060.
* Unifies ArmPL and SLEEF tests for better coverage
* Replaces deprecated float* and double* types with ptr
* Adds noalias attribute to pointer arguments
* Adds some cmd-line options to the RUN lines to simplify output
* Removes datalayout since target triple is provided
* Removes checks for return statements
* Refactors the regex filter for autogenerated checks
* Removes redundant test file suffix (already under the AArch64 dir)
With commit https://reviews.llvm.org/D152366 I introduced functionality
that permitted the hoisting of runtime memory checks from a vectorised
inner loop to the preheader of the next outer-most loop. This is useful
for benchmarks like SPEC2017's x264 where the inner loop is vectorised
and only has a small trip count. In such cases the runtime memory checks
become expensive and since the checks never fail in the case of x264 it
makes sense to do this. However, this behaviour was controlled by the
flag -hoist-runtime-checks which was off by default.
This patch enables this flag by default for all targets, since I believe
this is a generally beneficial thing to do. I have tested this with
SPEC2017 and I see 2.3% and 2.6% improvements with x264 on neoverse-v1
and neoverse-n1, respectively. Similarly, I saw slight improvements in
the overall geomean on both machines. The only other notable changes
were a 1% drop in the roms benchmark, which was compensated for by a 1%
improvement in fotonik3d.
This reverts commit 19918ac34d.
Fixes#75298. There is still a case where we miss the correct users
outside the main vector loop for reductions, and that is tail-folded
loops with reductions where the final value is stored after the loop.
This should be handled explicitly in #70253
When attempting to hoist runtime checks out of a loop we currently avoid
creating pointer diff checks and prefer to do expanded range checks
instead. This gives us the opportunity to hoist runtime checks out of a
loop, since these checks are loop invariant. However, in some cases the
pointer diff checks would also be loop invariant and so will naturally
get hoisted. Therefore, since diff checks are cheaper so we should
prefer to use those instead.
Added more pre-commit tests for evaluating changes to loop interleaving count computation in (https://github.com/llvm/llvm-project/pull/73766). The new set of tests address the change in IC computation to minimize the remainder TC of the vectorized loop while maximizing the IC when the
remainder TC is the same.
This patch starts initial modeling of VF * UF in VPlan.
Initially, introduce a dedicated VFxUF VPValue, which is then
populated during VPlan::prepareToExecute. Initially, the VF * UF
applies only to the main vector loop region. Once we extend the
scope of VPlan in the future, we may want to associate different VFxUFs
with different vector loop regions (e.g. the epilogue vector loop)
This allows explicitly parameterizing recipes that rely on the
VF * UF, like the canonical induction increment. At the moment, this
mainly helps to avoid generating some duplicated calls to vscale with
scalable vectors. It should also allow using EVL as induction increments
explicitly in D99750. Referring to VF * UF is also needed in other
places that we plan to migrate to VPlan, like the minimum trip count
check during skeleton creation.
The first version creates the value for VF * UF directly in
prepareToExecute to limit the scope of the patch. A follow-on patch will
model VF * UF computation explicitly in VPlan using recipes.
Moved from Phabricator (https://reviews.llvm.org/D157322)
opt accepts the -march command-line argument, but this argument only
makes sense in conjunction with -mtriple. Fix a couple of tests under
LoopVectorize that invoke opt with -march but without -mtriple, to avoid
confusing users.
This adds support for using dominating conditions in computeKnownBits()
when called from InstCombine. The implementation uses a
DomConditionCache, which stores which branches may provide information
that is relevant for a given value.
DomConditionCache is similar to AssumptionCache, but does not try to do
any kind of automatic tracking. Relevant branches have to be explicitly
registered and invalidated values explicitly removed. The necessary
tracking is done inside InstCombine.
The reason why this doesn't just do exactly the same thing as
AssumptionCache is that a lot more transforms touch branches and branch
conditions than assumptions. AssumptionCache is an immutable analysis
and mostly gets away with this because only a handful of places have to
register additional assumptions (mostly as a result of cloning). This is
very much not the case for branches.
This change regresses compile-time by about ~0.2%. It also improves
stage2-O0-g builds by about ~0.2%, which indicates that this change results
in additional optimizations inside clang itself.
Fixes https://github.com/llvm/llvm-project/issues/74242.
A new disjoint flag was added for OR instructions in #72583.
Update VPRecipeWithIRFlags to also support the new flag. This
allows printing and preserving the disjoint flag in vectorized code.
Compiler crashes when the assertion triggered for zext nneg instruction,
that checks that the instruction cannot produce poison. Changed the base
class for widencast recipe to handle dropping nneg flag to avoid
compiler crash.
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
MinBWs contains entries that specify the minimum required bitwidth. In
some cases, the old and new bitwidths can be equal (see test case) and
in those cases no truncations are needed, so skip those cases.
Fixes#74307.
Clean up sve-tail-folding-option.ll by removing the unused
CHECK-TF-NEOVERSE-V1 prefix (note the use of non-opaque pointers) and
remove IR value references.
In some cases MinBWs may contain entries for live-ins that are not used
by VPWidenRecipe or VPWidenSelectRecipes. In those cases, the live-ins
won't get processed, so make sure we include them in the count when used
as operands in VPWidenCast and VPWidenSelectRecipe.
Fixes https://github.com/llvm/llvm-project/issues/74231
The disjoint flag was recently added to IR in #72583
We already set it when we turn an add into an or. This patch sets it on Ors that weren't converted from an Add.
This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 3 benefits:
1) the VPlan-based version is simpler; we don't need to implement
special codegen for each supported instruction type like the IR based
one.
2) Removes a dependency on the cost-model after VPlan execution and
3) Removes a use of getVPValue that uses underlying values after VPlan
execution (See removed FIXME).
Depends on D149081.
Depends on D149079.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D149903
After inserting a select for the final value, update the VPlan def-use
chains. At the moment, the incorrect live-out doesn't cause a
mis-compile, as computing the final reduction value is not yet modeled
in VPlan.
Debugify is extremely useful as a testing and debugging tool, and a good
number of LLVM-IR transform tests use it. We need it to support "new"
non-instruction debug-info to get test coverage, but it's not important
enough to completely convert right now (and it'd be a large
undertaking). Thus: convert to/from dbg.value/DPValue mode on entry and
exit of the pass, which gives us the functionality without any further
work. The cost is compile-time, but again this is only happening during
tests.
Tested by: the large set of debugify tests enabled here. Note the
InstCombine test (cast-mul-select.ll) that hasn't been fully enabled:
this is because there's a debug-info sinking piece of code there that
hasn't been instrumented.
Auto-generate test `armpl-intrinsics.ll` and simplify tests:
- Eliminate scalar tail with no tail-folding flag.
- Use active lane mask for shorter check lines (no long `shufflevectors`).
- Eliminate scalar loops by providing `noalias` to relevant arguments and
run `simplifycfg` to drop them.
- Update script now use `@llvm.compiler.used` instead of a longer regex.
Update regex to _explicitly_ show which exp versions are added.
The previous regex used `exp[^e]` to avoid matching calls like:
`@llvm.experimental.stepvector`.
Note: ArmPL Mappings for scalable types are not yet utilized
(eg, `llvm.exp10.nxv2f64`, `llvm.exp10.nxv4f32`), as `replace-with-veclib`
pass needs improvements.
We can determine the VF from a combination of the mangled name (which
indicates the arguments that take vectors) and the element sizes of
the arguments for the scalar function the mapping has been established
for.
The assert when demangling fails has been removed in favour of just
not adding the mapping, which prevents the crash seen in
https://github.com/llvm/llvm-project/issues/71892
This patch also stops using _LLVM_ as an ISA for scalable vector tests,
since there aren't defined rules for the way vector arguments should be
handled (e.g. packed vs. unpacked representation).
SCEV simplifying the subtraction may result in redundant compares that
are all OR'd together. Keep track of the generated operands in
SeenCompares, with the key being the pair of operands for the compare.
If we alrady generated the same compare previously, skip it.
Instead of expanding the src/sink SCEV expressions and emitting an IR
sub to compute the difference, the subtraction can be directly be
performed by ScalarEvolution. This allows the subtraction to be
simplified by SCEV, which in turn can reduced the number of redundant
runtime check instructions generated.
It also allows to generate checks that are invariant w.r.t. an outer
loop, if he inner loop AddRecs have the same outer loop AddRec as start.
Future patches will remove some redundant instructions for runtime
checks, which brings this test case slightly below the default limit of
128. Force a lower limit to preserve the original spirit of the test
(checking that no interleaving happens if the number of checks is
above he threshold)
Add a test case where the AddRec for the pointers in the inner loop
have the AddRec of the outer loop as start value.
It is sufficient to subtract the start values (%dst, %src) of the outer
AddRecs. This simplification will be done in a follow-up commit.
THe freezes are introduced to avoid branch on undef/poison, if any of
the pointers may be poison. The same can be achieved by just freezing
the compare, which reduces the number of freezes needed. See
https://alive2.llvm.org/ce/z/NHa_ud
Note that the individual compares need to be frozen and it is not
sufficient to only freeze the resulting OR:
Result OR frozen only (UNSOUND): https://alive2.llvm.org/ce/z/YzFHQY
Individual conds frozen (SOUND): https://alive2.llvm.org/ce/z/5L6Z3f
HCFG builder doesn't correctly handle cases when non-outermost loop is
requested to be vectorized
[Original] Differential Revision: https://reviews.llvm.org/D150700