This change is part 2 x86 Loop Vectorization of :
https://github.com/llvm/llvm-project/pull/96222
It also has veclib call loop vectorization hence the test cases in
`llvm/test/Transforms/LoopVectorize/X86/veclib-calls.ll`
finally the last pr missed tests for
`llvm/test/CodeGen/X86/fp-strict-libcalls-msvc32.ll` and
`llvm/test/CodeGen/X86/vec-libcalls.ll` so added those aswell.
No evidence was found for arc and hyperbolic trig glibc vector math
functions
https://github.com/lattera/glibc/blob/master/sysdeps/x86/fpu/bits/math-vector.h
so no new `_ZGVbN2v_*` and `_ZGVdN4v_*` .
So no new tests in
`llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll`
Also no new svml and no new tests to:
`llvm/test/Transforms/LoopVectorize/X86/svml-calls.ll`
There was not enough evidence that there were svml arc and hyperbolic
trig vector implementations, Documentation was scarces so looked at test
cases in
[numpy](32bf2a9842/linux/avx512/svml_z0_acos_d_la.s (L8)).
Someone with more experience with svml should investigate.
## Note
amd libm doesn't have a vector hyperbolic sine api hence why youi might
notice there are no tests for `sinh`.
## History
This change is part of
https://github.com/llvm/llvm-project/issues/87367's investigation on
supporting IEEE math operations as intrinsics.
Which was discussed in this RFC:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
This change adds loop vectorization for `acos`, `asin`, `atan`, `cosh`,
`sinh`, and `tanh`.
resolves#70079resolves#70080resolves#70081resolves#70083resolves#70084resolves#95966
This patch moves the logic to create remarks for instructions with
invalid costs to work on recipes and decoupling it from
selectVectorizationFactor. This is needed to replace the remaining uses
of selectVectorizationFactor with getBestPlan using the VPlan-based cost
model.
The current implementation iterates over all VPlans and their recipes
again, to find recipes with invalid costs, which is more work but will
only be done when remarks for LV are enabled. Once the remaining uses of
selectVectorizationFactor are retired, we can collect VPlans with
invalid costs as part of getBestPlan if we want to optimize the remarks
case a bit, at the cost of adding additional complexity.
PR: https://github.com/llvm/llvm-project/pull/99322
The motivation for this change is the costing of a LD or ST with nearly
power of 2 vectors (e.g. <3 x i32> or <7 x i32>) on V. There's an
experimental option in SLP to allow emitting these if the cost model
says they're profitable. This really helps with e.g. RGB vectors.
Our actual lowering for these depends on whether a wider container type
is known available. If so, we use a vle or vse on the wider type with a
restricted VL. If not, we split until a legal type is found, and then
apply the vle/vse on the sub-pieces.
This change is intentionally restricted to only the case where promotion
(widening w/VL predication) is involved. We appear to have at least one
bug in our splitting lowering (see discussion on review), and to avoid
exposing this more widely, I chose to not adjust costs for the splitting
case. The current splitting costing assumes scalarization (which is not
true of the actual lowering), but that has the effect of biasing
vectorization away from such cases strongly.
For the widening case, the true cost scales with the next largest legal
type. The default implementation assumes that such a type is scalarized.
Changing that brings our cost in line with our actual lowering decision.
Note that since scalarization is not possible for scalable types, the
prior costing falsely returned Invalid for that case.
Replace getBestPlan by getBestVF which simply finds the best
VF out of the VFs for the available VPlans.
Then use getBestPlan to retrieve the corresponding VPlan.
This allows using getBestVF & getBestPlan for epilogue vectorization
as well. As the same plan may be used to vectorize both the main
and epilogue loop, restricting the VF of the best plan would cause
issues.
PR: https://github.com/llvm/llvm-project/pull/98821
Follow-up to ba8126b6fe.
If a scalar epilogue is required, users outside the loop won't use
live-outs from the vector loop but from the scalar epilogue. Ignore them if
that is the case.
This fixes another case where the VPlan-based cost-model more accurately
computes cost.
Fixes https://github.com/llvm/llvm-project/issues/100464.
Update collectValuesToIgnore to also ignore dead instructions in the
loop. Such instructions will be removed by VPlan-based DCE and won't be
considered by the VPlan-based cost model.
This closes a gap between the legacy and VPlan-based cost model. In
practice with the default pipelines, there shouldn't be any dead
instructions in loops reaching LoopVectorize, but it is easy to generate
such cases by hand or automatically via fuzzers.
Fixes https://github.com/llvm/llvm-project/issues/99701.
To match the behavior of the legacy cost model, only apply
-force-target-instruction-cost to recipes with underlying instructions
for now, as only original IR instructions are considered by the legacy
cost model.
This fixes a difference between legacy and VPlan based cost model,
triggering the verification assertion, reported by @JonPsson1.
This patch now bails out explicitly for negative offsets so that it's
more consistent with the unsigned remainder and add calculations,
and it fixes a genuine bug as shown with the new test.
I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and
rva22u64_v, and noticed that in a few cases that rva22u64_v was
considerably slower.
One of them was 519.lbm_r, which has a large loop that was being
unprofitably vectorized. It has an if/else in the loop which requires
large amounts of predication when vectorized, but despite the loop
vectorizer taking this into account the vector cost came out as cheaper
than the scalar.
It looks like the reason for this is because we cost scalar floating
point ops as 2, but their vector equivalents as 1 (for LMUL 1). This
comes from how we use BasicTTIImpl for scalars which treats floats as
twice as expensive as integers.
This patch doubles the cost of vector floating point arithmetic ops so
that they're at least as expensive as their scalar counterparts, which
gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.
Fixes#62576 (the last point there about scalar fsub/fmul)
Update existing tests with dead interleave groups by adding users. This
ensures the tests keep testing what they were intended to test with a
planned change to skip unused instructions in cost computations.
The current assertion VPTransformState::get when retrieving a single
scalar only does not account for cases where a def has multiple users,
some demanding all scalar lanes, some demanding only a single scalar.
For an example, see the modified test case. Relax the assertion by also
allowing requesting scalar lanes only when the Def doesn't have only its
first lane used.
Fixes https://github.com/llvm/llvm-project/issues/88849.
Add a test case where scalar steps are used by both a VPReplicateRecipe
(demands all scalar lanes) and a VPInstruction that only demands the
first lane.
Test case for https://github.com/llvm/llvm-project/issues/88849.
Follow up to d216615518 to update dead interleave group pointer detection
to allow re-processing of operands of instructions determined to only feed
interleave groups.
This is needed because instructions feeding interleave group pointers
can become dead in any order, as per the newly added test case.
Process dead interleave pointer ops in reverse order. This also catches
cases where the same base pointer is used by multiple different
interleave groups.
This fixes another case where the legacy cost model inaccuarately
estimates cost, surfaced by b841e2eca3.
For the Neoverse V2 we would like to prefer fixed width over scalable
vectorisation if the cost-model assigns an equal cost to both for certain
loops. This improves 7 kernels from TSVC-2 and several production kernels by
about 2x, and does not affect SPEC21017 INT and FP. This also adds a new TTI
hook that can steer the loop vectorizater to preferring fixed width
vectorization, which can be set per CPU. For now, this is only enabled for the
Neoverse V2.
There are 3 reasons why preferring NEON might be better in the case the
cost-model is a tie and the SVE vector size is the same as NEON (128-bit):
architectural reasons, micro-architecture reasons, and SVE codegen reasons. The
latter will be improved over time, so the more important reasons are the former
two. I.e., (micro) architecture reason is the use of LPD/STP instructions which
are not available in SVE2 and it avoids predication.
For what it is worth: this codegen strategy to generate more NEON is inline
with GCC's codegen strategy, which is actually even more aggressive in
generating NEON when no predication is required. We could be smarter about the
decision making, but this seems to be a first good step in the right direction,
and we can always revise this later (for example make the target hook more
general).
This change allows to consider compare instructions in the loop with
multiple use inside the loop and outside.
This change allows to vectorise this loop:
int foo(float* a, int n) {
_Bool any = 0;
_Bool all = 1;
for (int i = 0; i < n; i++) {
if (a[i] < 0.0f) {
any = 1;
} else {
all = 0;
}
}
return all ? 1 : any ? 2 : 3;
}
Resume and exit values for inductions are currently still created
outside of VPlan and independent of the induction recipes. Don't add
live-outs for now, as the additional unneeded users can pessimize other
anlysis.
Fixes https://github.com/llvm/llvm-project/issues/98660.
When collecting candidates to pre-compute cost for operands of exit
conditions, skip users outside the loop when checking if they are in
ExistInstrs. The users outside the loop should be ignored, as they won't
make a value live in the VPlan.
This fixes a failure when building for X86 with sanitizers on macOS
after b841e2eca3
(https://green.lab.llvm.org/job/llvm.org/job/clang-stage2-cmake-RgSan/287/)
This patch introduces a new ResumePhi VPInstruction which creates a phi
in a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.
The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.
In follow-up changes, it will also be used to create phis for reduction
and induction resume values.
Depends on https://github.com/llvm/llvm-project/pull/92651
PR: https://github.com/llvm/llvm-project/pull/94760
This patch implements limited loop vectorization support for the 'all-in-one' histogram intrinsic. The feature is disabled by default, and when enabled will only vectorize if there are no other users of values in the gather-modify-scatter sequence.
If an extend is truncated, it will be removed if the result type is <=
the source type, as there is nothing to extend. Return a cost of 0.
This was caught by the first step to perform cost-modeling based on
VPlan (b841e2e), as the legacy cost model would query the cost of an
invalid extend, while the extend has been folded away by VPlan
transforms.
Fixes https://github.com/llvm/llvm-project/issues/98413.
Adjusting the name of the recurrence phi in the scalar loop is a bit
inconsistent, as we do not adjust any other names in the scalar loops
(including other phis).
Remove this adjustment in preparation for
https://github.com/llvm/llvm-project/pull/94760/ and as discussed there.
This reverts commit 6f538f6a2d.
A number of crashes have been fixed by separate fixes, including
ttps://github.com/llvm/llvm-project/pull/96622. This version of the
PR also pre-computes the costs for branches (except the latch) instead
of computing their costs as part of costing of replicate regions, as
there may not be a direct correspondence between original branches and
number of replicate regions.
Original message:
This adds a new interface to compute the cost of recipes, VPBasicBlocks,
VPRegionBlocks and VPlan, initially falling back to the legacy cost model
for all recipes. Follow-up patches will gradually migrate recipes to
compute their own costs step-by-step.
It also adds getBestPlan function to LVP which computes the cost of all
VPlans and picks the most profitable one together with the most
profitable VF.
The VPlan selected by the VPlan cost model is executed and there is an
assert to catch cases where the VPlan cost model and the legacy cost
model disagree. Even though I checked a number of different build
configurations on AArch64 and X86, there may be some differences
that have been missed.
Additional discussions and context can be found in @arcbbb's
https://github.com/llvm/llvm-project/pull/67647 and
https://github.com/llvm/llvm-project/pull/67934 which is an earlier
version of the current PR.
PR: https://github.com/llvm/llvm-project/pull/92555
Port collectEphemeralValues to VPlan as collectEphemeralRecipesForVPlan,
use it in willGenerateVectors. This fixes a regression caused by
29b8b72117 for loops where the only vector values are ephemeral.
Introduce new canFoldTail helper which only checks if tail-folding is
possible, but without modifying MaskedOps.
Just because tail-folding is possible doesn't mean the tail will be
folded; that's up to the cost-model to decide. Separating the check if
tail-folding is possible and preparing for tail-folding makes sure that
MaskedOps is only populated when tail-folding is actually selected.
PR: https://github.com/llvm/llvm-project/pull/77612
Add tests with loops with ephemeral values that are widened.
After 29b8b72117, @ephemeral_load_and_compare_another_load_used_outside
is vectorized even though the only vector values that are generated are
ephemeral.
This patch moves the check if any vector instructions will be generated
from getInstructionCost to be based on VPlan. This simplifies
getInstructionCost, is more accurate as we check the final result and
also allows us to exit early once we visit a recipe that generates
vector instructions.
The helper can then be re-used by the VPlan-based cost model to match
the legacy selectVectorizationFactor behavior, this fixing a crash and
paving the way to recommit
https://github.com/llvm/llvm-project/pull/92555.
PR: https://github.com/llvm/llvm-project/pull/96622
LoopVectorize already always preserves DT, LI and SCEV. If any changes
get made to the CFG, cached LAA info for loops are cleared.
LoopAccessAnalysis also implements ::invalidate to clear the analysis if
SE, DT or LI gets invalidated. Hence it should be safe to preserve LAA
and save a small amount of compile-time.
This patch moves branch condition creation to enter the scalar epilogue
loop to VPlan. Modeling the branch in the middle block also requires
modeling the successor blocks. This is done using the recently
introduced VPIRBasicBlock.
Note that the middle.block is still created as part of the skeleton and
then patched in during VPlan execution. Unfortunately the skeleton needs
to create the middle.block early on, as it is also used for induction
resume value creation and is also needed to properly update the
dominator tree during skeleton creation.
After this patch lands, I plan to move induction resume value and phi
node creation in the scalar preheader to VPlan. Once that is done, we
should be able to create the middle.block in VPlan directly.
This is a re-worked version based on the earlier
https://reviews.llvm.org/D150398 and the main change is the use of
VPIRBasicBlock.
Depends on https://github.com/llvm/llvm-project/pull/92525
PR: https://github.com/llvm/llvm-project/pull/92651