The new mask represents the order, not the mask itself. At first, need
to treat as the order, convert to mask and only after that reorder
gathered scalars to build correct clustered order.
Differential Revision: https://reviews.llvm.org/D141161
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
This patch adds metadata to disable runtime unrolling to the vectorized
loop. If runtime unrolling/interleaving is considered profitable, LV
will interleave the loop directly. There should be no need to perform
runtime unrolling at a later stage.
Note that we already add metadata to disable runtime unrolling to the
scalar loop after vectorization.
The additional unrolling unnecessarily increases code size and compile
time. In addition to that we have several bug reports of unncessary
runtime unrolling for vectorized loops, e.g. PR40961
Compile-time improvements:
NewPM-O3: -1.04%
NewPM-ReleaseThinLTO: -0.59%
NewPM-ReleaseLTO-g: -0.97%
https://llvm-compile-time-tracker.com/compare.php?from=ce1be13a868d0f8afa367975558c1a6175cce33a&to=78bc2e67f22e9e10e61cdb6cdac4bb857d95eb1b&stat=instructions:uFixes#40306.
Reviewed By: lebedev.ri, nikic
Differential Revision: https://reviews.llvm.org/D115261
Make a separate routine for GEPs cost calculation and make
the approach uniform across load, store and GEP tree nodes.
Additional issue fixed is GEP cost savings were applied twice
for ScatterVectorize nodes (aka gather load) making them look
unrealistically profitable for vectorization.
Differential Revision: https://reviews.llvm.org/D140789
Use deduction guides instead of helper functions.
The only non-automatic changes have been:
1. ArrayRef(some_uint8_pointer, 0) needs to be changed into ArrayRef(some_uint8_pointer, (size_t)0) to avoid an ambiguous call with ArrayRef((uint8_t*), (uint8_t*))
2. CVSymbol sym(makeArrayRef(symStorage)); needed to be rewritten as CVSymbol sym{ArrayRef(symStorage)}; otherwise the compiler is confused and thinks we have a (bad) function prototype. There was a few similar situation across the codebase.
3. ADL doesn't seem to work the same for deduction-guides and functions, so at some point the llvm namespace must be explicitly stated.
4. The "reference mode" of makeArrayRef(ArrayRef<T> &) that acts as no-op is not supported (a constructor cannot achieve that).
Per reviewers' comment, some useless makeArrayRef have been removed in the process.
This is a follow-up to https://reviews.llvm.org/D140896 that introduced
the deduction guides.
Differential Revision: https://reviews.llvm.org/D140955
The validation of vplans could fail if an inloop reduction was created
with a block-in mask that did not dominate the reduction. This makes
sure that the insert point is set when creating the mask, to ensure it
dominates the reduction.
Differential Revision: https://reviews.llvm.org/D141003
This reverts commit aa2414729e.
Previously-valid IR from a tensorflow test case (as shown on the
Diffusion revision for aa2414729e) started
hanging in the loop-vectorize pass. Reverting to keep everyone working.
analysis.
Missed the analysis of the shuffle mask when trying to analyze the
operands of the shuffle instruction during peeking through shuffle
instructions.
We incorrectly assume intrinsic as a function call and it prevents us from
the opportunity to vectorize. On Aarch64 Cortex-A53 we think that
llvm.fmuladd.f64 is a function call which is wrong.
Differential Revision: https://reviews.llvm.org/D140392
Adjust mergeReplicateRegions to be in line with
mergeBlocksIntoPredecessors added in 36d70a6aea by collecting only the
valid candidates first.
Also rename to mergeReplicateRegionsIntoSuccessors and add missing
doc-comment.
This addresses post-commit suggestions by @Ayal.
This reduces the size of VPlan.h and avoids future growth of the file
when the graph traits are extended in future patches.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D140500
Even if the the sink candidate is already in the target block, its
operands can be candidates for sinking. Queue them up as well. Also
moves the queuing logic to a helper.
We do not need to emit many extractelements for each particular use, we
can reuse the only one, just need to adjust it to make it dominate on
all uses.
Differential Revision: https://reviews.llvm.org/D140580
We can use ShuffleInstructionBuilder now for shrinking shuffle emission.
It allows to remove extra shuffle from the emitted code and reuse
original vector.
Part of D110978
Differential Revision: https://reviews.llvm.org/D140499
Merging regions can enable new sinking opportunities (e.g. if users of a
scalar value are moved from different VPBBs into the same VPBB). Sinking
in turn can also enable new merging opportunities (e.g. if a recipe
between to merge-able regions is moved.
To enable more sinking opportunities, repeat sinking & merging if
regions could be merged.
Also fix mergeReplicateRegions to return the correct Changed status.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D139788
of extractvalues.
No need to get the last instruction only for vectorized extractvalues,
for gathered(buildvector sequence) still need to get the insertion
point.
Add and run VPlan transform to fold blocks with a single predecessor
into the predecessor. This remove redundant blocks and addresses a TODO
to replace special handling for the vector latch VPBB.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D139927
Just comparing constant trip counts causes LV to miss cases where the
vector loop body only executes once.
The motivation for this is to remove the need for unrolling to remove
vector loop back-edges, if the body only executes once in more cases.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133017
This sets the stage for D133017 by moving out the code that performs
VPlan based simplifications to a separate transform that takes the
chosen VF & UF as arguments.
The main advantage is that this transform runs before any changes to
the CFG are being made. This allows using SCEV without worrying about
making queries while the IR is in an incomplete state.
Note that this patch switches the reasoning to use SCEV, but still only
simplifies loops with constant trip counts. Using SCEV here is needed to
access the backedge taken count, because the trip count IR value has not
been created yet.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D135017
Explicitly track the UFs supported in a VPlan. This is needed to
allow transformations to restrict the UFs which are supported.
Discussed as separate improvement in D135017.
The VFs and UFs may be more constrained as the plans are transformed
(e.g. see D135017 for an example).
To make sure the VFs/UFs included in the VPlan dump are accurate,
generate them when accessing a plan's name, rather than include them in
the name string set after initial construction.
The function name was misleading - the expectation set both by the name
and by other members of Function (like isDeclaration or isIntrinsic)
would be that the function somehow would "be" "debug info for
profiling". But that's not the case - the property indicates (as the
comment over the declaration also explains) whether debug info should be
emitted (for profiling).
Added BaseShuffleAnalysis as a base class for ShuffleInstructionBuilder
and integrated shuffle logic from shuffles for externally used scalars
into this class. This class is used as the main container that
implements smart shuffle instruction builder logic.
ShuffleInstructionBuilder uses this logic.
ShuffleInstructionBuilder is also used in building of the shuffle for
the externally used scalars instead of lambdas, which are now part of BaseShuffleAnalysis class.
Differential Revision: https://reviews.llvm.org/D140100
Code generation now uses the start VPValue of induction recipes.
This makes it possible to adjust the start value of the epilogue
vector loop to use the 'resume' value of the main vector loop.
Fixes#59459.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D92132
value() has undesired exception checking semantics and calls
__throw_bad_optional_access in libc++. Moreover, the API is unavailable without
_LIBCPP_NO_EXCEPTIONS on older Mach-O platforms (see
_LIBCPP_AVAILABILITY_BAD_OPTIONAL_ACCESS).
This fixes clang.
Add a VP_CLASSOF_IMPL macro to define common classof implementations for
recipes. This reduces duplication and also adds missing implementations
to existing recipes.
In scalar plans, replicate recipes will only generate a single value per
UF, independent of whether they are uniform or not. So don't consider
uniformity for plans with scalar VFs only.
This allows us to handle a few additional cases in VPlan sinking instead
of non-VPlan sinkScalarOperands.
Depends on D133762.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D134218
The patch redesigns ShuffleInstructionBuilder so it could later be used
for reshuffling of the buildvector sequences and vectorized parts of
externally used scalars. Also will allow to generalize cost model for
the gathers/buildvectors.
Part of D110978.
Differential Revision: https://reviews.llvm.org/D139718