Commit Graph

1929 Commits

Author SHA1 Message Date
Philip Reames
82c1b136db [LV] Don't predicate uniform mem op stores unneccessarily
We already had the reasoning about uniform mem op loads; if the address is accessed at least once, we know the instruction doesn't need predicated to ensure fault safety. For stores, we do need to ensure that the values visible in memory are the same with and without predication. The easiest sub-case to check for is that all the values being stored are the same. Since we know that at least one lane is active, this tells us that the value must be visible.

Warning on confusing terminology: "uniform" vs "uniform mem op" mean two different things here, and this patch is specific to the later. It would *not* be legal to make this same change for merely "uniform" operations.

Differential Revision: https://reviews.llvm.org/D130637
2022-07-28 08:55:52 -07:00
Philip Reames
15c645f7ee [RISCV] Enable (scalable) vectorization by default
This change enables vectorization (using scalable vectorization only, fixed vectors are not yet enabled) for RISCV when vector instructions are available for the target configuration.

At this point, the resulting configuration should be both stable (e.g. no crashes), and profitable (i.e. few cases where scalar loops beat vector ones), but is not going to be particularly well tuned (i.e. we emit the best possible vector loop). The goal of this change is to align testing across organizations and ensure the default configuration matches what downstreams are using as closely as possible.

This exposes a large amount of code which hasn't otherwise been on by default, and thus may not have been fully exercised.  Given that, having issues fall out is not unexpected.  If you find issues, please make sure to include as much information as you can when reverting this change.

Differential Revision: https://reviews.llvm.org/D129013
2022-07-27 12:36:04 -07:00
Philip Reames
e8ceadd0ce [LV][RISCV] Add a test case for a quality problem mixing vector index and data types
The problem here is target independent, but particularly painful on RISCV.  If we chose to vectorize such that vscale x 2 x i32 is our widest type and fits in a register, a naive expansion of i64 comparisons results in comparisons and index types at <scalabe x 2 x i64>.  This requires both an LMUL of 2, and a VSETVLI toggle in the loop.  Note that we could have used <vscale x 2 x i32> for the compairons legally given the range of the trip count.
2022-07-27 11:42:28 -07:00
Florian Hahn
16e0620d6d [VPlan] Mark VPPredInstPHIRecipe as not having side-effects.
Now that all uses of VPPredInstPHIRecipes are properly modeled, they can
be treated as not having side-effects, enabling removal.
2022-07-27 19:29:26 +01:00
Philip Reames
43b5e12159 [LV] Refresh an autogened test to pickup naming changes 2022-07-27 10:54:15 -07:00
Philip Reames
ebee4fbb34 [RISCV][LV] Add basic tests for default configuration
All of our other tests are functionality tests constrained to some
specific configuration.  This one is intended to float with the
default configuration so that changes in that default are visible
in reviews.  Note that our current default does not enable
vectorization at all; thus the current output is unvectorized.
2022-07-27 09:16:44 -07:00
Florian Hahn
a8fdc247e9 [LV] Add missing uses to test to make them more robust.
The changes ensure the VPPredInstPHIRecipes are actually used and cannot
be remove by VP-DCE.
2022-07-27 16:06:52 +01:00
Sanjay Patel
bfb9b8e075 [Passes] add a tail-call-elim pass near the end of the opt pipeline
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.

In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.

I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031

Differential Revision: https://reviews.llvm.org/D130374
2022-07-25 15:25:47 -04:00
Nuno Lopes
a30e77b6f6 fix tests for commit 9df0b254d2 2022-07-23 22:32:30 +01:00
Nuno Lopes
9df0b254d2 [NFC] Switch a few uses of undef to poison as placeholders for unreachable code 2022-07-23 21:50:11 +01:00
Philip Reames
bd75350180 [LV] Fix a conceptual mistake around meaning of uniform in isPredicatedInst
This code confuses LV's "Uniform" and LVL/LAI's "Uniform".  Despite the
common name, these are different.
* LVs notion means that only the first lane *of each unrolled part* is
  required.  That is, lanes within a single unroll factor are considered
  uniform.  This allows e.g. widenable memory ops to be considered
  uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.

IsUniformMem is in turn defined in terms of LAI's notion.  Thus a
UniformMemOpmeans is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.

The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity.  In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.

This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible.  The
closest I've come in the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR.  Both before and after IR are correct so the
differences aren't "interesting".

The other test changes are uninteresting.  They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
2022-07-21 15:44:34 -07:00
Philip Reames
54cb87964d [LV] Add a load focused version of the r45679 test
This a reproducer for bug in predicated instruction handling.  The final result code is correct, but the reasoning by which we get there isn't.
2022-07-21 15:33:42 -07:00
Philip Reames
83993d666b [LV][SVE] Autogen a test for ease of update 2022-07-21 13:12:53 -07:00
Philip Reames
27945f9282 [RISCV][LV] Split coverage of uniform load with outside use
Turns out this has a large effect of tail folding, so split out a single test to cover that case and remove it from the others.
2022-07-21 12:07:26 -07:00
Philip Reames
bb5dc2918f {RISCV][LV] Add tail folding coverage of uniform load store cases 2022-07-21 11:15:36 -07:00
Philip Reames
56a25ed208 {RISCV][LV] Add a test for uniform store of a loop varying value 2022-07-21 11:15:36 -07:00
Philip Reames
0ae46693f0 {RISCV][LV] Split out and expand tests for uniform loads and stores 2022-07-21 10:42:18 -07:00
David Sherwood
f15b6b2907 [AArch64] Add target hook for preferPredicateOverEpilogue
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:

1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.

Currently the default option is "disabled", but this will be
changed in a later patch.

I've added new tests to show the options behave as expected here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D129560
2022-07-21 17:20:06 +01:00
David Sherwood
ceb6c23b70 [NFC][LoopVectorize] Explicitly disable tail-folding on some SVE tests
This patch is in preparation for enabling vectorisation with tail-folding
by default for SVE targets. Once we do that many existing tests will
break that depend upon having normal unpredicated vector loops. For
all such tests I have added the flag:

  -prefer-predicate-over-epilogue=scalar-epilogue

Differential Revision: https://reviews.llvm.org/D129137
2022-07-21 15:23:00 +01:00
Philip Reames
f934b9b073 [LV] Refresh a couple of autogen tests for naming change
These appear to just be changes in temporary identifiers; bit suprising we have so many.
2022-07-20 14:47:52 -07:00
Philip Reames
1a73ef75fa [LV] Autogen a test for ease of update 2022-07-20 08:19:38 -07:00
Philip Reames
be25f52fec [LV] Autogen several tests for ease of update in upcoming change 2022-07-20 07:17:51 -07:00
Philip Reames
523a526a02 [LV] Fix miscompile due to srem/sdiv speculation safety condition
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.

Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.

Differential Revision: https://reviews.llvm.org/D130106
2022-07-20 05:35:23 -07:00
David Sherwood
79660d339e [LoopVectorize][AArch64] Add TTI hook preferPredicatedReductionSelect
By default if SVE is enabled we want the select instruction used for
reductions to be inside the loop, rather than outside. This makes it
possible for the backend to fold the select into the operation to
produce a single predicated add, fadd, etc.

Differential Revision: https://reviews.llvm.org/D129763
2022-07-20 09:33:29 +01:00
Philip Reames
f1243fa193 [LV] Autogen a partially autogened test for ease of update 2022-07-19 14:18:53 -07:00
Philip Reames
8353403f08 [LV] Add test for generic predicated sdiv 2022-07-19 12:33:36 -07:00
Philip Reames
2247fe856a [LV] Add test coverage for a bug in srem handling 2022-07-19 11:29:17 -07:00
Philip Reames
b7d3ba4bdb [LV] Add test coverage for scalable div/rem patterns 2022-07-19 11:02:14 -07:00
David Sherwood
34f81cfa3d [LoopVectorize][NFC] Split reductions out from sve-tail-folding into new file
In sve-tail-folding-reductions.ll I've also added an extra RUN line
to test normal reductions, i.e. not in-loop. This patch is a pre-commit
in preparation for a follow-on patch that changes how reduction selects
are generated in the vector loop.

Differential Revision: https://reviews.llvm.org/D129761
2022-07-18 13:56:39 +01:00
David Sherwood
1e77b0c871 [AArch64][NFC] Simplify loop vectoriser tail-folding tests
I've simplified all of the SVE vectoriser tail-folding tests to
only care about testing the flag:

  -prefer-predicate-over-epiloge=predicate-else-scalar-epilogue

In practice we always want to fall back on unpredicated vector
loops if tail-folding is not possible.

Differential Revision: https://reviews.llvm.org/D129843
2022-07-18 13:37:29 +01:00
Graham Hunter
db8fcb2c25 [LAA] Add recursive IR walker for forked pointers
This builds on the previous forked pointers patch, which only accepted
a single select as the pointer to check. A recursive function to walk
through IR has been added, which searches for either a loop-invariant
or addrec SCEV.

This will only handle a single fork at present, so selects of selects
or a GEP with a select for both the base and offset will be rejected.

There is also a recursion limit with a cli option to change it.

Reviewed By: fhahn, david-arm

Differential Revision: https://reviews.llvm.org/D108699
2022-07-18 12:06:17 +01:00
Florian Hahn
105032f549 [LV] Use PHI recipe instead of PredRecipe for subsequent uses.
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D129436
2022-07-18 09:35:34 +01:00
Florian Hahn
6813b41d57 [LV] Avoid creating new run-time VF expression for each runtime checks.
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
2022-07-16 17:24:07 +01:00
Florian Hahn
aa00fb02c9 [LV] Use umax(VF * UF, MinProfTC) for scalable vectors.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.

This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
2022-07-15 10:23:14 -07:00
Florian Hahn
4c85a01758 [LV] Add scalable vector test showing incorrect min-trip count check.
The test shows a case where the minimum trip count check incorrectly
only checks the minimum profitable trip count computed due to runtime
checks. This is incorrect for scalable VFs, because the VF * UF may
exceed the minimum profitable trip count for vscale > 1.

This is the likely reason for
https://lab.llvm.org/buildbot/#/builders/197/builds/2262 failing.
2022-07-15 10:02:55 -07:00
Mel Chen
bd404fbcc8 [LV][NFC] Fix the condition for printing debug messages
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D128523
2022-07-15 01:47:33 -07:00
Mel Chen
ae15e6a952 [LV] Pre-commit test case for D128523, NFC 2022-07-15 01:22:06 -07:00
David Sherwood
307ace7f20 [LoopVectorize] Ensure the VPReductionRecipe is placed after all it's inputs
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.

Added a new RUN with tail-folding enabled to the existing test:

  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D129550
2022-07-13 09:29:25 +01:00
Nikita Popov
a5ee62a141 [IndVars] Call replaceLoopPHINodesWithPreheaderValues() for already constant exits
Currently we only call replaceLoopPHINodesWithPreheaderValues() if
optimizeLoopExits() replaces the exit with an unconditional exit.
However, it is very common that this already happens as part of
eliminateIVComparison(), in which case we're leaving behind the
dead header phi.

Tweak the early bailout for already-constant exits to also call
replaceLoopPHINodesWithPreheaderValues().

Differential Revision: https://reviews.llvm.org/D129214
2022-07-13 09:43:21 +02:00
David Sherwood
6b694d600a [LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.

Test added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll

Differential Revision: https://reviews.llvm.org/D128272
2022-07-12 14:53:20 +01:00
Nikita Popov
4bb7b6fae3 [IR] Remove support for float binop constant expressions
As part of https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179,
this removes support for the floating-point binop constant expressions
fadd, fsub, fmul, fdiv and frem.

As part of this change, the C APIs LLVMConstFAdd, LLVMConstFSub,
LLVMConstFMul, LLVMConstFDiv and LLVMConstFRem are removed.
The LLVMBuild APIs should be used instead.

Differential Revision: https://reviews.llvm.org/D129478
2022-07-12 09:40:49 +02:00
David Sherwood
03fee6712a [LoopVectorize] Add option to use active lane mask for loop control flow
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Differential Revision: https://reviews.llvm.org/D125301
2022-07-11 13:46:55 +01:00
Philip Reames
b12930e133 [RISCV] Switch to using get.active.lane.mask when tail folding
The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change.

The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs an unsaturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyways, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice.

Differential Revision: https://reviews.llvm.org/D129221
2022-07-08 10:24:59 -07:00
Florian Hahn
12f9c7b270 [LV] Update RISCV test missed by bc19b7c3cc. 2022-07-07 08:51:15 -07:00
Florian Hahn
bc19b7c3cc [LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE.
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.

Depends D127580.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128408
2022-07-07 08:40:27 -07:00
David Green
438ffdb821 [ARM] Switch the costs of mve1beat and mve4beat
These three subtarget features are meant to control where MVE
instructions take 1 vs 2 vs 4 architectural beats. The mve1beat feature
is described as "Model MVE instructions as a 1 beat per tick
architecture", meaning MVE instruction will execute over 4 cycles.
mve4beat is the opposite where the entire 4 beats of the MVE instruction
execute in a single cycle. The costs for the two were backwards though,
not matching the cycle counts like they should. This patch switches the
costs on the two to bring them in-line with expectations.

Differential Revision: https://reviews.llvm.org/D129141
2022-07-07 16:10:00 +01:00
Florian Hahn
17d48c3169 [VPlan] Move remove dead recipes before merging regions.
This can enable additional region merging,  while not losing
opportunities as region merging does not produce dead recipes.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128831
2022-07-06 20:38:38 -07:00
Florian Hahn
6826047381 [LV] Remove redundant checks from recurrence test.
The removed CHECK configurations are tested as well below, modulo the
dce/instcombine runs. This makes them redundant, and removing them
removes a substantial amount of uneeded checks.
2022-07-06 15:31:57 -07:00
Philip Reames
b9513a70e1 [RISCV] Autogen a vectorizer test for ease of update 2022-07-06 09:35:02 -07:00
Florian Hahn
e4613d8e1b [LV] Remove unnecessary memory checks from recurrence test
The tests are focused on code-gen for first-order recurrences. There are
plenty of tests specifically for runtime check generation. Using noalias
to avoid runtime checks slightly simplifies the test output and ensures
the checks focus on the relevant bits and ensures the checks focus on
the relevant bits and ensures the checks focus on the relevant bits and
ensures the checks focus on the relevant bits.
2022-07-06 09:08:29 -07:00