BinomialCoefficient computes the value of a W-bit IV at iteration It of a loop. When W is 1, we can end up calling the multiplicative inverse on 0, which triggers an assert since 1b76120.
Since the arithmetic is supposed to wrap if It or K does not fit in W bits, do the truncation into W bits after we do the shift.
Fixes #87798
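As a rough illustration of the wrapping arithmetic described above, here is a minimal sketch on plain 64-bit integers. It is not the SCEV implementation: the names binomialMod2W and inverseOdd are hypothetical, and it assumes W plus the number of factors of two in K! fits in 64 bits. The product is formed in W + T bits, shifted right by T, and only then truncated to W bits.

```cpp
#include <cassert>
#include <cstdint>

// Inverse of an odd x modulo 2^64 via Newton's iteration.
// For odd x, x * x == 1 (mod 8), so x is its own inverse to 3 bits;
// each step doubles the number of correct low bits.
uint64_t inverseOdd(uint64_t x) {
  uint64_t inv = x;
  for (int i = 0; i < 5; ++i) // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
    inv *= 2 - x * inv;
  return inv;
}

// C(it, k) modulo 2^w (hypothetical helper, not LLVM's code).
// Assumes w >= 1 and w + t <= 64, where k! = 2^t * odd.
uint64_t binomialMod2W(uint64_t it, unsigned k, unsigned w) {
  unsigned t = 0;
  uint64_t odd = 1;
  for (unsigned i = 1; i <= k; ++i) {
    unsigned m = i;
    while (m % 2 == 0) {
      m /= 2;
      ++t;
    }
    odd *= m; // wraps modulo 2^64, which is fine for the inverse
  }
  assert(w >= 1 && w + t <= 64 && "sketch only handles small widths");
  unsigned calcBits = w + t;
  uint64_t calcMask = calcBits == 64 ? ~0ULL : ((1ULL << calcBits) - 1);
  uint64_t wMask = w == 64 ? ~0ULL : ((1ULL << w) - 1);

  // it * (it - 1) * ... * (it - k + 1), kept in w + t bits.
  uint64_t prod = 1;
  for (unsigned i = 0; i < k; ++i)
    prod = (prod * (it - i)) & calcMask;

  // Divide out 2^t first, then truncate to w bits.
  uint64_t shifted = (prod >> t) & wMask;
  return (shifted * inverseOdd(odd)) & wMask;
}
```

With w = 1 the odd part of K! is still inverted modulo a power of two, so no inverse of 0 is ever requested in this sketch.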
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
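For reference, the kind of mask an interleaving shuffle carries can be recognised with a check like the one below. This is only a sketch under the assumption that the inputs were concatenated into one vector first; isInterleaveMask is a hypothetical helper, not the function used by the cost model.

```cpp
#include <cassert>
#include <vector>

// True if `mask` interleaves `factor` concatenated vectors of equal length:
// result element j * factor + i must read element j of input vector i,
// which lives at index i * lanes + j of the concatenation.
bool isInterleaveMask(const std::vector<int> &mask, unsigned factor) {
  assert(factor >= 2 && "interleaving needs at least two inputs");
  if (mask.empty() || mask.size() % factor != 0)
    return false;
  unsigned lanes = static_cast<unsigned>(mask.size()) / factor;
  for (unsigned j = 0; j < lanes; ++j)
    for (unsigned i = 0; i < factor; ++i)
      if (mask[j * factor + i] != static_cast<int>(i * lanes + j))
        return false;
  return true;
}
```

For three 4-element inputs, the ST3-style mask is 0,4,8,1,5,9,2,6,10,3,7,11.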
Add a number of LAA test cases with both forward and backward
dependences with non-constant strides and dependence distances.
This includes test coverage for
https://github.com/llvm/llvm-project/issues/87336
Also includes a LoopLoadElimination test to make sure the pass does not
crash on non-constant dependence distances.
Using <X x i1> undef masks means they are treated as constants, which underestimates the scalar costs as it assumes that the masks/branches will fold away.
Changes in Recommit:
Add an additional check on sign/zero extend to the same type.
Original message:
Use the destination data type to measure the LMUL size for
latency/throughput cost
Rename the intrinsics to be close to the instruction mnemonic names:
Use global_load_tr_b64 and global_load_tr_b128 instead of
global_load_tr.
This patch also removes f16/bf16 versions of builtins/intrinsics. To
simplify the design, we should avoid enumerating all possible types in
implementing builtins. We can always use bitcast.
The ‘llvm.vector.reduce.fmaximum/fminimum.*’ intrinsics propagate NaNs
if any element of the vector is a NaN.
Following #79402, the patch adds the cost of the NaN check (vmfne + vcpop).
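The NaN-propagating semantics can be pictured with a scalar sketch like the following. reduceFMaximum is a hypothetical stand-in written for illustration; the explicit NaN test plays the role of the extra check the cost now accounts for, and the sketch says nothing about how the backend actually lowers it.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Scalar model of a NaN-propagating maximum reduction (illustrative only).
float reduceFMaximum(const std::vector<float> &v) {
  assert(!v.empty() && "reduction needs at least one element");
  float best = v[0];
  for (float x : v) {
    // Any NaN element makes the whole result NaN.
    if (std::isnan(x))
      return std::numeric_limits<float>::quiet_NaN();
    // Treat -0.0 as smaller than +0.0.
    if (x > best || (x == best && std::signbit(best) && !std::signbit(x)))
      best = x;
  }
  return best;
}
```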
In 2fe81edef6
  [NFC][RemoveDIs] Insert instruction using iterators in Transforms/
we changed
       if (*req_idx != *i)
         return FindInsertedValue(I->getAggregateOperand(), idx_range,
-                                 InsertBefore);
+                                 *InsertBefore);
     }
but there is no guarantee that InsertBefore is non-empty at that
point, as can be seen, e.g., in the added testcase.
Instead just pass on the optional InsertBefore in the recursive call to
FindInsertedValue, as we do at several other places already.
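The difference between forwarding an optional and dereferencing it can be sketched with std::optional on a toy structure. Node, buildValue and findValue are hypothetical names made up for this illustration; they do not mirror the real FindInsertedValue signature.

```cpp
#include <cassert>
#include <optional>

// Toy nested aggregate: each node may wrap an inner node.
struct Node {
  int value;
  const Node *inner;
};

// Creating a new value genuinely needs an insertion point.
int buildValue(const std::optional<int> &insertBefore) {
  assert(insertBefore.has_value() && "building requires an insertion point");
  return *insertBefore;
}

// Pure lookups never touch insertBefore, so the recursive call forwards the
// optional unchanged. Writing *insertBefore there would dereference an empty
// optional whenever the caller did not provide an insertion point.
std::optional<int> findValue(const Node *n, unsigned depth,
                             std::optional<int> insertBefore) {
  if (!n) {
    if (!insertBefore)
      return std::nullopt; // nothing found and nothing may be built
    return buildValue(insertBefore);
  }
  if (depth == 0)
    return n->value;
  return findValue(n->inner, depth - 1, insertBefore);
}
```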
At the moment, getUnderlyingObjects simply continues for phis that do
not refer to the same underlying object in loops, without adding them to
the list of underlying objects, effectively ignoring those phis.
Instead of ignoring those phis, add them to the list of underlying
objects. This fixes a miscompile where LoopAccessAnalysis fails to
identify a memory dependence, because no underlying objects can be found
for a set of memory accesses.
Fixes https://github.com/llvm/llvm-project/issues/82665.
PR: https://github.com/llvm/llvm-project/pull/84339
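A toy worklist walk shows the behavioural change. Everything here is a simplified assumption: Value, Kind and the differsPerIteration flag are hypothetical stand-ins for the real IR classes and the loop check, and underlyingObjects is not the LLVM function.

```cpp
#include <cassert>
#include <set>
#include <vector>

enum class Kind { Object, Gep, Phi };

struct Value {
  Kind kind;
  std::vector<const Value *> operands;
  // Hypothetical flag: a loop phi that may refer to a different underlying
  // object on each iteration.
  bool differsPerIteration;
};

std::vector<const Value *> underlyingObjects(const Value *v) {
  assert(v != nullptr);
  std::vector<const Value *> result;
  std::vector<const Value *> worklist{v};
  std::set<const Value *> visited;
  while (!worklist.empty()) {
    const Value *cur = worklist.back();
    worklist.pop_back();
    // Look through address computations.
    while (cur->kind == Kind::Gep)
      cur = cur->operands[0];
    if (!visited.insert(cur).second)
      continue;
    if (cur->kind == Kind::Phi) {
      if (cur->differsPerIteration) {
        // Previously such a phi was skipped; now it is reported itself.
        result.push_back(cur);
        continue;
      }
      for (const Value *op : cur->operands)
        worklist.push_back(op);
      continue;
    }
    result.push_back(cur);
  }
  return result;
}
```

With the phi reported as its own underlying object, a client no longer sees an empty list for accesses based on it.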
The exact flag basically allows us to set an upper bound on shift
amount when we have a known 1 in `LHS`.
Typically we deduce exact using knownbits (on non-exact incoming
shifts), so this is not particularly impactful, but may be useful in
some circumstances.
Closes #84254
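The reasoning is that an exact right shift may not shift out any set bit, so the shift amount cannot exceed the position of the lowest bit known to be one. A minimal 32-bit sketch, with hypothetical helper names and not the knownbits code itself:

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on the amount of an exact right shift, given the bits of the
// left-hand side that are known to be one (32-bit values assumed).
unsigned maxExactShiftAmount(uint32_t knownOnes) {
  if (knownOnes == 0)
    return 31; // no known one: only the bit-width bound applies
  unsigned lowest = 0;
  while ((knownOnes & 1u) == 0) {
    knownOnes >>= 1;
    ++lowest;
  }
  return lowest;
}

// A right shift is exact when no set bits are shifted out.
bool isExactShift(uint32_t lhs, unsigned amount) {
  assert(amount < 32 && "shift amount must be below the bit width");
  return ((lhs >> amount) << amount) == lhs;
}
```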
This commit provides better cost estimates for
the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all
vector lengths and integer types up to i128. For integer types larger
than i128, we fall back to the default cost estimate.
This has the effect of lowering the estimated costs of most common
instances of the intrinsic. The expected performance impact of this is
minimal with a tendency to slightly improve performance of some
benchmarks.
This commit also provides a test to check the proper computation of the
new estimates, as well as the fallback for types larger than i128.
Fix gap in the cost estimation for length changing shuffles, by adjusting the shuffle mask and either widening the shuffle inputs or extracting the lower elements of the result.
A small step towards moving some of this implementation inside improveShuffleKindFromMask and/or target getShuffleCost handlers (and reduce the diffs in cost estimation depending on whether coming from a ShuffleVectorInst or the raw operands / mask components)
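One way to picture the adjustment is the sketch below. It is an assumption about the general shape of the normalisation, not the code in the patch; NormalizedShuffle and normalizeShuffle are hypothetical names, and -1 stands for an undefined mask element.

```cpp
#include <cassert>
#include <vector>

struct NormalizedShuffle {
  std::vector<int> mask; // mask with as many elements as each operand
  unsigned width;        // element count of the (possibly widened) operands
  unsigned extractLow;   // number of low result elements actually used
};

// Turn a length-changing shuffle into a same-length one.
NormalizedShuffle normalizeShuffle(const std::vector<int> &mask,
                                   unsigned srcElts) {
  assert(srcElts > 0 && "operands must have at least one element");
  unsigned resElts = static_cast<unsigned>(mask.size());
  NormalizedShuffle out{mask, srcElts, resElts};
  if (resElts > srcElts) {
    // Widen both operands to resElts elements. The second operand now starts
    // at index resElts instead of srcElts, so shift its indices up.
    for (int &idx : out.mask)
      if (idx >= static_cast<int>(srcElts))
        idx += static_cast<int>(resElts - srcElts);
    out.width = resElts;
  } else if (resElts < srcElts) {
    // Pad the mask with undefined elements and keep only the low part.
    out.mask.resize(srcElts, -1);
  }
  return out;
}
```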
In our analysis of guarding conditions, we were converting a-b == 0 into
the alternate form a == b, but we were only checking for one of the two
forms of the sub. There's no requirement that the multiply only be on
the LHS of the add.
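Since a subtraction is represented as an add of a and (-1 * b), the matcher has to accept the multiply on either side. A toy expression tree makes this concrete; Expr, matchNeg and matchSub are hypothetical and only loosely modelled on how such expressions are structured.

```cpp
#include <cassert>

struct Expr {
  enum Kind { Var, Const, Add, Mul } kind;
  long value;      // used by Const only
  const Expr *lhs; // used by Add and Mul
  const Expr *rhs;
};

// Returns b if e has the shape (-1 * b), otherwise nullptr.
const Expr *matchNeg(const Expr *e) {
  if (e->kind != Expr::Mul)
    return nullptr;
  if (e->lhs->kind == Expr::Const && e->lhs->value == -1)
    return e->rhs;
  return nullptr;
}

// Matches a - b written as either a + (-1 * b) or (-1 * b) + a.
bool matchSub(const Expr *e, const Expr *&a, const Expr *&b) {
  assert(e != nullptr);
  if (e->kind != Expr::Add)
    return false;
  if (const Expr *n = matchNeg(e->rhs)) {
    a = e->lhs;
    b = n;
    return true;
  }
  if (const Expr *n = matchNeg(e->lhs)) {
    a = e->rhs;
    b = n;
    return true;
  }
  return false;
}
```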
These tests highlight that we have missed opportunities for proving
trip count bounds when our start/end values are sign extended
from smaller types and we have either a loop guard to relate our
start vs end, or a nsw/nuw fact to bound end.
This extends the work from 7755c26 to all of the different backedge
taken count kinds that we print for the SCEV analysis printer. As
before, the goal is to cut down on confusion as i4 -1 is a very
different (unsigned) value from i32 -1.
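To make the size of that difference concrete, the unsigned value of -1 at a given bit width is simply the all-ones pattern. The helper below is a hypothetical illustration, not part of the printer.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned interpretation of -1 in a `bits`-wide integer (1 <= bits <= 64).
uint64_t minusOneAsUnsigned(unsigned bits) {
  assert(bits >= 1 && bits <= 64);
  return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}
```

So i4 -1 is 15 as an unsigned count, while i32 -1 is 4294967295.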
When printing the result of SCEV's analysis, we can avoid printing
the predicated backedge taken count and the predicates if the predicates
are empty and no new information is provided. This helps to reduce the
verbosity of the output.
When printing the result of the analysis, i8 -1 and i64 -1 are quite
different in terms of analysis quality. In a recent conversation with
a new contributor, we ran into exactly this confusion.
Adding the type for constant scevs more globally seems worthwhile, but
introduces a much larger test diff. I'm splitting this off first since
it addresses the immediate need, and then going to do some further
changes to clarify a few related bits of analysis result output.
A few notes:
* pr34538.ll has bitrotten. The original test printed the analysis after transforms in some cases, but this appears to have been lost during migration to the new pass manager. Remove the now redundant pass invocations and simplify the test setup.