This adds costs for the shuffle in an ST3/ST4 instruction, which is
represented in LLVM IR as a store(interleaving shuffle). In order to
detect the store, a CxtI context instruction is added so the users of
the shuffle can be checked. LD3 and LD4 are added too; LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
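As a minimal sketch (not taken from the patch; names and types are illustrative), this is the kind of store(interleaving shuffle) pattern that can lower to a single st3:

```llvm
define void @store_interleave3(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, ptr %p) {
  %ab = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %cc = shufflevector <4 x i32> %c, <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ; Interleave a0,b0,c0,a1,b1,c1,... and store as one wide vector.
  %il = shufflevector <8 x i32> %ab, <8 x i32> %cc, <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9, i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
  store <12 x i32> %il, ptr %p
  ret void
}
```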
Using <X x i1> undef masks means they are treated as constants, which underestimates the scalar cost because it assumes that the masks/branches will fold away.
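For instance (a hypothetical case, not from the patch), costing this masked load with an undef mask treats the mask as a foldable constant:

```llvm
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)

define <4 x i32> @undef_mask(ptr %p) {
  ; With an undef mask, the scalarized cost assumes the per-lane
  ; branches fold away, so the reported cost is too low.
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %p, i32 4, <4 x i1> undef, <4 x i32> poison)
  ret <4 x i32> %v
}
```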
Changes in Recommit:
Add an additional check for a sign/zero extend to the same type.
Original message:
Use the destination data type to measure the LMUL size for the
latency/throughput cost.
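For example (types are illustrative), the extend below would be costed using the destination type's LMUL under the standard RISC-V scalable mapping:

```llvm
define <vscale x 4 x i64> @ext(<vscale x 4 x i16> %x) {
  ; Destination <vscale x 4 x i64> occupies four vector registers
  ; (LMUL=4); the source <vscale x 4 x i16> occupies one (LMUL=1).
  %e = sext <vscale x 4 x i16> %x to <vscale x 4 x i64>
  ret <vscale x 4 x i64> %e
}
```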
The ‘llvm.vector.reduce.fmaximum/fminimum.*’ intrinsics propagate NaNs
if any element of the vector is a NaN.
Following #79402, the patch adds the cost of the NaN check (vmfne + vcpop).
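E.g. for a reduction like the following (an illustrative case), the reported cost now includes the mask compare and population count used to detect NaN lanes:

```llvm
declare float @llvm.vector.reduce.fmaximum.v4f32(<4 x float>)

define float @fmax_reduce(<4 x float> %v) {
  ; Must return NaN if any lane is NaN, so the RISC-V cost adds a
  ; vmfne.vv + vcpop.m check on top of the reduction itself.
  %r = call float @llvm.vector.reduce.fmaximum.v4f32(<4 x float> %v)
  ret float %r
}
```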
This commit provides better cost estimates for
the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all
vector lengths and integer types up to i128. For integer types larger
than i128, we fall back to the default cost estimate.
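For illustration (assuming "integer types" refers to the element type), the first reduction below gets the new SystemZ estimate while the second falls back:

```llvm
declare i64 @llvm.vector.reduce.add.v4i64(<4 x i64>)
declare i256 @llvm.vector.reduce.add.v1i256(<1 x i256>)

define i64 @covered(<4 x i64> %v) {
  ; i64 <= i128: costed by the new SystemZ-specific estimate.
  %r = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %v)
  ret i64 %r
}

define i256 @fallback(<1 x i256> %v) {
  ; Wider than i128: falls back to the default cost estimate.
  %r = call i256 @llvm.vector.reduce.add.v1i256(<1 x i256> %v)
  ret i256 %r
}
```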
This has the effect of lowering the estimated costs of the most common
instances of the intrinsic. The expected performance impact of this is
minimal, with a tendency to slightly improve the performance of some
benchmarks.
This commit also provides a test to check the proper computation of the
new estimates, as well as the fallback for types larger than i128.
Fix a gap in the cost estimation for length-changing shuffles by adjusting the shuffle mask and either widening the shuffle inputs or extracting the lower elements of the result.
A small step towards moving some of this implementation inside improveShuffleKindFromMask and/or the target getShuffleCost handlers (and reducing the differences in cost estimation depending on whether we come from a ShuffleVectorInst or from the raw operands / mask components).
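A sketch of the case being handled (illustrative types): a shuffle whose result length differs from its inputs, which can now be costed by widening the inputs to the result length:

```llvm
define <8 x i32> @widen(<4 x i32> %a, <4 x i32> %b) {
  ; Length-changing shuffle: <4 x i32> inputs, <8 x i32> result. The
  ; mask is adjusted so existing same-length shuffle costs apply.
  %s = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x i32> %s
}
```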
Currently we model subvector inserts and extracts as shuffles,
potentially going as far as scalarizing them. If the types are legal,
they can just be simple zip/unzip operations, or possibly even no-ops.
Change the cost to a relatively small one to ensure that simple loops
featuring such operations between fixed and scalable vector types that
are effectively the same at a given SVE width can be unrolled and
further optimized.
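As an illustrative sketch, at a 128-bit SVE width the operations below move between equivalently sized fixed and scalable types and should be costed as cheap, not as scalarized shuffles:

```llvm
declare <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32>, <4 x i32>, i64)
declare <4 x i32> @llvm.vector.extract.v4i32.nxv4i32(<vscale x 4 x i32>, i64)

define <4 x i32> @roundtrip(<4 x i32> %f) {
  ; Effectively type conversions when vscale == 1 (128-bit SVE).
  %s = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32> poison, <4 x i32> %f, i64 0)
  %e = call <4 x i32> @llvm.vector.extract.v4i32.nxv4i32(<vscale x 4 x i32> %s, i64 0)
  ret <4 x i32> %e
}
```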
We can use a small amount of integer arithmetic to round FP32 to BF16
and extend BF16 to FP32.
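A minimal sketch of the round-to-nearest-even trick in IR (it ignores the signaling-NaN quieting discussed below, and is not the patch's exact sequence):

```llvm
define i16 @round_f32_to_bf16(float %x) {
  %bits  = bitcast float %x to i32
  ; Round to nearest even: add 0x7FFF plus the low kept bit.
  %hi    = lshr i32 %bits, 16
  %odd   = and i32 %hi, 1
  %bias  = add i32 %odd, 32767
  %round = add i32 %bits, %bias
  %top   = lshr i32 %round, 16
  %bf    = trunc i32 %top to i16
  ret i16 %bf
}

define float @extend_bf16_to_f32(i16 %b) {
  ; Extension is just a 16-bit shift into the high half.
  %w = zext i16 %b to i32
  %s = shl i32 %w, 16
  %f = bitcast i32 %s to float
  ret float %f
}
```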
While a number of operations still require promotion, this could be
reduced for some rather simple operations like abs, copysign, and fneg,
but those can be done in a follow-up.
A few neat optimizations are implemented:
- round-inexact-to-odd is used for F64 to BF16 rounding.
- quieting signaling NaNs for f32 -> bf16 tries to detect if a prior
operation makes it unnecessary.
When the MVT is not a vector type, TCK_CodeSize should return an invalid
cost. This patch adds a check at the beginning to make sure all cost
kinds return invalid costs consistently.
Before this patch, TCK_CodeSize returned a valid cost on scalar MVTs but
the other cost kinds didn't.
This fixes issue #83294, where a loop contains vector instructions and
the MVT is scalar after type legalization because the vector extension
is not enabled.
This is mostly NFC, but some output does change due to consistently
inserting into poison rather than undef and using i64 as the index
type for inserts.
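E.g. the checked output now looks like the first insert below rather than the second (an illustrative pair):

```llvm
define <4 x float> @example(float %x) {
  ; New form: insert into poison with an i64 index.
  %n = insertelement <4 x float> poison, float %x, i64 0
  ; Old form, for contrast: insert into undef with an i32 index.
  %o = insertelement <4 x float> undef, float %x, i32 0
  ret <4 x float> %n
}
```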
In most cases, SETCC lowering will be able to simplify/commute the comparison by adjusting the constant.
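For example (an illustrative pair, not from the patch), these two compares are equivalent, so lowering can pick whichever predicate is cheaper by bumping the constant:

```llvm
define <4 x i1> @adjust(<4 x i32> %x) {
  ; x > 7 is the same as x >= 8 for signed i32.
  %a = icmp sgt <4 x i32> %x, <i32 7, i32 7, i32 7, i32 7>
  %b = icmp sge <4 x i32> %x, <i32 8, i32 8, i32 8, i32 8>
  ret <4 x i1> %a
}
```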
TODO: We still need to adjust ExtraCost based on CostKind
Fixes #80122
If we have exact vlen knowledge, we can figure out which indices
correspond to register boundaries. Our lowering uses this knowledge to
replace the vslidedown.vi with a sub-register extract. Our costs can
reflect that as well.
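An illustrative case: if VLEN is known to be exactly 128, index 2 of the LMUL=2 source below is the start of its second register, so the extract becomes a sub-register extract rather than a vslidedown.vi:

```llvm
declare <2 x i64> @llvm.vector.extract.v2i64.nxv2i64(<vscale x 2 x i64>, i64)

define <2 x i64> @second_reg(<vscale x 2 x i64> %v) {
  ; With VLEN=128, elements 2..3 occupy exactly the second register.
  %e = call <2 x i64> @llvm.vector.extract.v2i64.nxv2i64(<vscale x 2 x i64> %v, i64 2)
  ret <2 x i64> %e
}
```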
This is another piece split off from
https://github.com/llvm/llvm-project/pull/80164
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
P9 has VX-form `Vector Extract Element` instructions like `vextuwrx`, and
P10 has VX-form `Vector Insert Element` instructions like `vinsd`. Update
the instruction costs to reflect these instructions.
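Roughly the operations these instructions cover (an illustrative sketch, not the patch's tests):

```llvm
define i32 @var_extract(<4 x i32> %v, i64 %i) {
  ; A variable-index word extract, which P9's vextuwrx can handle.
  %e = extractelement <4 x i32> %v, i64 %i
  ret i32 %e
}

define <2 x i64> @dword_insert(<2 x i64> %v, i64 %s) {
  ; A doubleword element insert, which P10's vinsd can handle.
  %w = insertelement <2 x i64> %v, i64 %s, i64 1
  ret <2 x i64> %w
}
```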
Fixes https://github.com/llvm/llvm-project/issues/50249
This reverts commit 834d11c215, which was a revert of my 3a626937b1;
I had failed to rebase after new tests were added overnight by
fc0b67e1d7.
Original commit message follows:
Extracting a subvector at index zero corresponds to a type conversion and
possibly a subregister operation. We will not emit a vslidedown. As such,
they are free.
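For instance (illustrative types), this extract needs no vslidedown and is now costed as free:

```llvm
declare <2 x i64> @llvm.vector.extract.v2i64.nxv2i64(<vscale x 2 x i64>, i64)

define <2 x i64> @lo(<vscale x 2 x i64> %v) {
  ; Index 0: just a type conversion, possibly a subregister operation.
  %e = call <2 x i64> @llvm.vector.extract.v2i64.nxv2i64(<vscale x 2 x i64> %v, i64 0)
  ret <2 x i64> %e
}
```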
As an aside, it looks like we're not passing an index in for cases where
the subvec type is scalable. For at least index zero, we probably should be.
Revert "Revert "[RISCV][TTI] Extract subvector at index zero is free (#81751)""
Extracing a subvector at index zero corresponds to a type conversion and
possibly a subregister operation. We will not emit a vslidedown. As
such, they are free.
As an aside, it looks like we're not passing an index in for cases where
the subvec type is scalable. For at least index zero, we probably should
be.
For llvm.vector.extract, this tests combinations of extracting at a zero
and a non-zero index, and extracting from a fixed or scalable vector.
For llvm.vector.insert, this tests the same combinations as extracts but with
an additional configuration for an undef vector. This is because we can use a
subregister insert if the index is 0 and the vector is undef, which should be
free.
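A sketch of that extra configuration (illustrative types; not the literal test):

```llvm
declare <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32>, <4 x i32>, i64)

define <vscale x 4 x i32> @insert_undef_idx0(<4 x i32> %sv) {
  ; Index 0 into an undef vector: expressible as a plain subregister
  ; insert, so it should be free.
  %v = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32> undef, <4 x i32> %sv, i64 0)
  ret <vscale x 4 x i32> %v
}
```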
Similar to d39b4ce3ce
Using "eabi" or "gnueabi" for aarch64 targets is a common mistake and
warned by Clang Driver. We want to avoid them elsewhere as well. Just
use the common "aarch64" without other triple components.
Many targets do not have a cost for the extract-subvector shuffle kind,
but do have costs for single-source permutes. If there is no cost
estimate for extract subvector, it is better to switch to a single
source permute for a better cost estimate.
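E.g. (illustrative) the shuffle below is an extract-subvector; if the target reports no cost for that kind, it can be costed as the equivalent single-source permute:

```llvm
define <4 x i32> @upper_half(<8 x i32> %v) {
  ; Extracts elements 4..7; also expressible as a single-source permute.
  %hi = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  ret <4 x i32> %hi
}
```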
Reviewers: RKSimon, davemgreen, arsenm
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/79837
This should be the final portion of shaping up the test suite to be
ready for turning on non-intrinsic debug-info:
* Pin CostModel tests that expect to see intrinsics in their -debug
output to not use RemoveDIs. This is a spurious test output difference.
* Add 'tail' to a bunch of intrinsic calls in UpdateTestChecks. We're
canonicalising intrinsics to be printed with "tail" in the RemoveDIs
conversion, as dbg.values usually pick that up while being optimised.
This is another spurious output difference.
* The "DebugInfoDrop" pass used in the debugify unit-tests happens to
operate inside the pass manager, thus it sees non-intrinsic debug-info.
Update it to correctly drop it.
The primary motivation of this patch is to add testing infrastructure
atop the recently landed 8ad14b6d90, so
that we can separate the costing aspects of strided memory operations
from the SLP implementation details.
I want to be clear that I am *not* proposing that we use the
vp.strided.* forms as our canonical IR representation. I'm merely using
them as a testing vehicle to exercise the costing machinery. The
canonical IR form remains a masked.gather or masked.scatter. I do want
to explore adding a non-vp strided load/store intrinsic, but that's a
separate line of work.
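The test vehicle looks roughly like this (illustrative types and names):

```llvm
declare <4 x i32> @llvm.experimental.vp.strided.load.v4i32.p0.i64(ptr, i64, <4 x i1>, i32)

define <4 x i32> @strided(ptr %p, i64 %stride) {
  ; Used purely to exercise the costing; the canonical IR form remains
  ; masked.gather / masked.scatter.
  %v = call <4 x i32> @llvm.experimental.vp.strided.load.v4i32.p0.i64(ptr %p, i64 %stride, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, i32 4)
  ret <4 x i32> %v
}
```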
There is one costing change included in this. As I wrote my test, I
discovered that the default implementation was scalarized (if invoked
via generic routines such as getInstructionCost), and when adding the
call into the strided-specific costing I discovered that we hadn't
modeled the fallback to scalarization properly in the initial patch.
After fixing that, there is a minor difference in the scalarization cost
reported for the unaligned case, but I believe that to be uninteresting.
For the record, I did confirm that vp.strided.store is lowered to a
strided store on RISCV. :)