Commit Graph

1742 Commits

Author SHA1 Message Date
LiqinWeng
c3edeaa61b [Test] Rename the test function name suffix. NFC (#114504) 2024-11-01 13:49:34 +08:00
Luke Lau
9c7188871c [RISCV] Cost ordered bf16/f16 w/ zvfhmin reductions as invalid (#114250)
In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16
without zvfh, and marked the cost as invalid to prevent the vectorizers
from emitting them. However it inadvertently didn't change the cost for
ordered reductions, so this moves the check earlier to fix this.

This also uses BasicTTIImpl instead which now assigns a valid but
expensive cost for fixed-length vectors, which reflects how codegen will
actually scalarize them.
2024-10-31 23:36:09 +08:00
Luke Lau
e989e31a47 [RISCV] Mark f16/bf16 lrint and llrint cost as invalid (#113924)
We currently can't lower scalable vector lrint and llrint nodes for bf16
and f16, even with zvfh, and will crash.

Mark the cost as invalid for now to prevent the vectorizers from
emitting them.

Note that we can actually lower fixed-length vectors fine by scalarizing
them, but we were still undercosting these too so I've also included
them. I presume there's an opportunity to improve the codegen later on.
2024-10-30 17:21:18 +02:00
David Sherwood
7f498a865f [CostModel][LoopVectorize] Move some loop vectoriser tests (#113702)
Many tests that were in test/Analysis/CostModel were actually
loop vectoriser tests. I've moved them as follows:

Analysis/CostModel/X86 -> Transforms/LoopVectorize/X86/CostModel
Analysis/CostModel/AArch64/arith-fp-frem.ll ->
  Transforms/LoopVectorize/AArch64/arith-fp-frem-costs.ll
2024-10-30 13:50:02 +00:00
Luke Lau
8b55162e19 [RISCV] Add cost model tests for scalable FP reductions. NFC
There are already some in reduce-scalable-fp.ll but this makes it a
bit easier to see the difference alongside their fixed-length
counterparts.
2024-10-29 23:58:06 +02:00
Luke Lau
40363d506d [RISCV] Add cost model tests for fp rounding ops for bf16. NFC 2024-10-28 14:59:06 +00:00
Paul Walker
5bb34803a4 [NFC] Migrate tests to use autoupdate for CHECK lines. 2024-10-22 12:55:15 +00:00
Han-Kuan Chen
12bcea3292 [RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#111459)
reference: https://github.com/llvm/llvm-project/pull/110457
2024-10-18 20:16:56 +07:00
Elvis Wang
566012a64e [RISCV][TTI] Implement instruction cost for vp_merge. (#112327)
This patch implements the instruction cost for `vp_merge`, which generates
an instruction sequence similar to that of the `select` instruction.
2024-10-17 07:47:43 +08:00
Luke Lau
2b6b7f664d [RISCV] Mark math functions as expanded for zvfhmin/zvfbfmin (#112508)
For regular floating point types we mark these as expanded on scalable
vectors so they're not legal in the cost model, so this does the same
for f16 w/ zvfhmin and bf16.
2024-10-16 21:40:37 +01:00
Luke Lau
e88bcc1204 [RISCV] Lower vector_splice on zvfhmin/zvfbfmin (#112579)
Similar to other permutation ops, we can just reuse the existing
lowering.
2024-10-16 21:40:18 +01:00
Luke Lau
1d40fefb08 [RISCV] Add zvfhmin/zvfbfmin cost model tests for libcall ops. NFC 2024-10-16 10:09:34 +01:00
Elvis Wang
f3648046ec [RISCV] Fix vp-intrinsics args in cost model tests. NFC (#112463)
This patch contains the following changes to fix the vp intrinsics tests.
1. v\*float -> v\*f32, v\*double -> v\*f64 and v\*half -> v\*f16
2. Fix the order of the vp-intrinsics.
2024-10-16 12:57:43 +08:00
Luke Lau
4c894730a1 [RISCV] Fix bf16 cost model tests. NFC
These were inadvertently changed in #112393
2024-10-15 23:01:53 +01:00
Luke Lau
a3cd269fbe [RISCV] Remove {s,u}int_to_fp custom op action for f16/bf16 (#111471)
It turns out that {s,u}int_to_fp nodes get their operation action from
their operand's type, not the result type, so we don't need to set it
for fp16 or bf16. vp_{s,u}int_to_fp uses the result type though so we
need to keep it.

This also means that we can lower int_to_fp for fixed length bf16
vectors already, so this adds tests for that.

The cost model test changes are due to BasicTTIImpl's getCastInstrCost
not taking into account that int_to_fp needs its legal type swapped.
This can be fixed in a later patch, but it's worth noting that the
affected types in the tests currently crash when lowered anyway (due to
them needing to be split at LMUL > 8)
2024-10-10 14:40:24 +01:00
Philip Reames
f11568bcb0 Revert "[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457)"
This reverts commit 554eaec639.  Change was not approved when landed.
2024-10-07 11:31:57 -07:00
Luke Lau
20864d2cf6 [ValueTypes][RISCV] Add v1bf16 type (#111112)
When trying to add RISC-V fadd reduction cost model tests for bf16, I
noticed a crash when the vector was of <1 x bfloat>.

It turns out that this was being scalarized because unlike f16/f32/f64,
there's no v1bf16 value type, and the existing cost model code assumed
that the legalized type would always be a vector.

This adds v1bf16 to bring bf16 in line with the other fp types.

It also adds some more RISC-V bf16 reduction tests which previously
crashed, including tests to ensure that SLP won't emit fadd/fmul
reductions for bf16 or f16 w/ zvfhmin after #111000.
2024-10-06 22:20:51 +08:00
Han-Kuan Chen
554eaec639 [RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457) 2024-10-05 14:58:44 +08:00
Luke Lau
3b0e120336 [RISCV] Add tests for @llvm.vector.reduce.fmul. NFC 2024-10-04 14:27:45 +08:00
RolandF77
06c8210a67 update P7 32-bit partial vector load cost (#108261)
Update cost model to reflect codegen change to use lfiwzx 
for 32-bit partial vector loads on pwr7 with
https://github.com/llvm/llvm-project/pull/104507.
2024-10-03 12:28:43 -04:00
Luke Lau
487686b82e [SDAG][RISCV] Don't promote VP_REDUCE_{FADD,FMUL} (#111000)
In https://reviews.llvm.org/D153848, promotion was added for a variety
of f16 ops with zvfhmin, including VP reductions.

However I don't believe it's correct to promote f16 fadd or fmul
reductions to f32 since we need to round the intermediate results.

Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two
different results depending on whether we compiled with +zvfh or
+zvfhmin, for example with a 3 element reduction:

	; v9 = [0.1563, 5.97e-8, 0.00006104]

	; zvfh
	vsetivli x0, 3, e16, m1, ta, ma
	vmv.v.i v8, 0
	vfredosum.vs v8, v9, v8
	vfmv.f.s fa0, v8
	; fa0 = 0.1563

	; zvfhmin
	vsetivli x0, 3, e16, m1, ta, ma
	vfwcvt.f.f.v v10, v9
	vsetivli x0, 3, e32, m1, ta, ma
	vmv.v.i v8, 0
	vfredosum.vs v8, v10, v8
	vfmv.f.s fa0, v8
	fcvt.h.s fa0, fa0
	; fa0 = 0.1564

This same thing happens with reassociative reductions e.g. vfredusum.vs,
and this also applies for bf16.

I couldn't find anything in the LangRef for reductions that suggests the
excess precision is allowed. There may be something we can do in Clang
with -fexcess-precision=fast, but I haven't looked into this yet.

I presume the same precision issue occurs with fmul, but not with
fmin/fmax/fminimum/fmaximum.

I can't think of another way of lowering these other than scalarizing,
and we can't scalarize scalable vectors, so this just removes the
promotion and adjusts the cost model to return an invalid cost. (It
looks like we also don't currently cost fmul reductions, so presumably
they also have an invalid cost?)

I think this should be enough to stop the loop vectorizer or SLP from
emitting these intrinsics.
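
The rounding difference between the two sequences above can be reproduced outside the compiler. This is a minimal Python sketch, assuming the three printed elements are exactly 0.15625, 2^-24 and 2^-14 (the nearest f16 values to the decimals shown). The helper names are hypothetical, and the code only models IEEE round-to-nearest-even through `struct`; it is not the RISC-V lowering itself.

```python
import struct

def round_f16(x: float) -> float:
    """Round a Python float (f64) to the nearest f16 value, ties to even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value, ties to even."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ordered_fadd(values, start, rnd):
    """Ordered (sequential) sum that rounds every intermediate result with rnd."""
    acc = rnd(start)
    for v in values:
        # For these small inputs the f64 addition is exact, so rnd() is the
        # only rounding step applied to each partial sum.
        acc = rnd(acc + v)
    return acc

# Assumed exact f16 values behind "0.1563, 5.97e-8, 0.00006104".
V9 = [0.15625, 2.0 ** -24, 2.0 ** -14]

# zvfh-style: accumulate in f16, rounding after every add.
native_f16 = ordered_fadd(V9, 0.0, round_f16)

# zvfhmin-style promotion: accumulate in f32, round to f16 once at the end.
promoted = round_f16(ordered_fadd(V9, 0.0, round_f32))

print(native_f16)  # 0.15625
print(promoted)    # 0.1563720703125
```

In the f16 accumulation the last add is an exact tie and rounds to even (back to 0.15625), whereas the f32 accumulation keeps the 2^-24 term, which pushes the final conversion just above the tie and up to the next f16 value.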
2024-10-04 00:17:45 +08:00
Philip Reames
50afafbf29 [RISCV][TTI] Adjust constant materialization cost for (z/s)ext from i1 (#110282)
When we're lowering to a split sequence, we only need one
materialization of the zero constant. Our codegen looks something like
this:

  vmv.v.i	v24, 0
  vmerge.vim	v8, v24, -1, v0
  vmv1r.v	v0, v16
  vmerge.vim	v16, v24, -1, v0

Note: Doing this specific case since it was pointed out in
https://github.com/llvm/llvm-project/pull/110164#discussion_r1778268391,
but it's worth noting that we have the same basic problem (over costing
split operations with split invariant terms) at multiple places through
this file.
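
The over-costing described above can be illustrated with a toy count. This is a sketch only: the function name and the unit costs are hypothetical, the mask move (`vmv1r.v`) is ignored, and none of this is the actual RISCVTTIImpl code.

```python
def split_ext_cost(num_parts: int, op_cost: int = 1, const_cost: int = 1,
                   share_constant: bool = True) -> int:
    """Toy cost of a (z/s)ext from i1 that is split into num_parts pieces.

    Each piece needs one merge-style operation. The zero constant is either
    materialized once and reused, or (over-)counted once per piece.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    const_total = const_cost if share_constant else const_cost * num_parts
    return num_parts * op_cost + const_total

# Two-way split as in the sequence above: one vmv.v.i, two vmerge.vim.
print(split_ext_cost(2, share_constant=False))  # 4
print(split_ext_cost(2, share_constant=True))   # 3
```

The gap between the two counts grows linearly with the number of pieces, which is why split-invariant terms matter more at high LMUL.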
2024-09-27 10:53:45 -07:00
Philip Reames
1a9569c4f0 [RISCV][TTI] Avoid an infinite recursion issue in getCastInstrCost (#110164)
Calling into BasicTTI is not always safe. In particular, BasicTTI does
not have a full legalization implementation (vector widening is
missing), and falls back on scalarization. The problem is that
scalarization for <N x i1> vectors is costed in terms of the cast API and
we can end up in an infinite recursive cycle.

The "right" fix for this would be to teach BasicTTI how to model the full
legalization state machine, but several attempts at doing so have
resulted in dead ends or undesirable cost changes for targets I don't
understand.

This patch instead papers over the issue by avoiding the call to the
base class when dealing with an i1 source or dest. This doesn't
necessarily produce correct costs, but it should at least return
something semi-sensible and not crash.

Fixes https://github.com/llvm/llvm-project/issues/108708
2024-09-27 07:47:09 -07:00
Philip Reames
d288574363 [TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824)
This follows in the spirit of 7d82c99403,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.

This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
2024-09-25 07:25:57 -07:00
Luke Lau
f43ad88ae1 [RISCV] Handle zvfhmin and zvfbfmin promotion to f32 in half arith costs (#108361)
Arithmetic half or bfloat ops on zvfhmin and zvfbfmin respectively will
be promoted and carried out in f32, so this updates
getArithmeticInstrCost to check for this.
2024-09-25 18:50:16 +08:00
Philip Reames
bcbdf7ad6b [RISCV][TTI/SLP] Add test coverage for select of constants costing
Provides coverage for an upcoming change which accounts for the cost
of materializing the vector constants in the vector select.
2024-09-24 08:15:40 -07:00
Sushant Gokhale
c5672e21ca [AArch64][CostModel] Reduce the cost of fadd reduction with fast flag (#108791)
An fadd reduction with
1. the fast flag set, and
2. a power-of-2 number of elements in the input vector
results in a series of faddp instructions. The faddp instruction has
latency/throughput identical to the fadd instruction, hence we set a
relative cost of 1 for faddp as well.

The change didn't show any regression with SPEC17-FP(C/C++),
llvm-test-suite on Neoverse-V2.
2024-09-24 14:35:01 +05:30
Luke Lau
cce1fa39ea [RISCV] Add zvfbfmin arithmetic cost model test coverage. NFC 2024-09-23 23:28:25 +08:00
Philip Reames
0b524efa95 [RISCV][TTI] Reduce cost of a <N x i1> build_vector pattern (#109449)
This is a follow up to 7f6bbb3. When lowering a <N x i1> build_vector,
we currently choose to extend to i8, perform the build_vector there, and
then truncate back in vector. Our costing on the other hand accounts for
it as if we performed a vector extend, an insert, and a vector extract
for every element. This significantly overestimates the cost.

Note that we can likely do better in our build_vector lowering here by
packing the bits in scalar, and doing a build_vector of the packed bits.
Regardless, our costing should match our lowering.
2024-09-23 07:21:54 -07:00
Elvis Wang
80b44517f5 [RISCV][TTI] Add instruction cost for vp.select. (#109381)
This patch makes the instruction cost for vp.select the same as that of
its non-vp counterpart.
2024-09-23 15:06:04 +08:00
Philip Reames
f7c3309b13 [RISCV] Add coverage for <N x i1> vp.strided.load
These are currently scalarized, and I need something to exercise
<N x i1> scalarization costing.  We should probably consider adding
a buildvector intrinsic for this purpose.
2024-09-20 10:30:21 -07:00
Philip Reames
7f6bbb3c4f [RISCV][TTI] Reduce cost of a build_vector pattern (#108419)
This change is actually two related changes, but they're very hard to
meaningfully separate as the second balances the first, and yet doesn't
do much good on its own.

First, we can reduce the cost of a build_vector pattern. Our current
costing for this defers to generic insertelement costing which isn't
unreasonable, but also isn't correct. While inserting N elements
requires N-1 slides and N vmv.s.x, doing the full build_vector only
requires N vslide1down. (Note there are other cases that our build
vector lowering can do more cheaply, this is simply the easiest upper
bound which appears to be "good enough" for SLP costing purposes.)

Second, we need to tell SLP that calls don't preserve vector registers.
Without this, SLP will vectorize scalar code which performs e.g. 4 x
float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the
costing works out that this is in fact the optimal choice - except that
we don't actually have a <2 x float> @exp, and unroll during DAG. This
would be fine (or at least cost neutral) except that the libcall for the
scalar @exp blows all vector registers. So the net effect is we added a
bunch of spills that SLP had no idea about. Thankfully, AArch64 has a
similar problem, and has taught SLP how to reason about spill cost once
the right TTI hook is implemented.

Now, for some implications...

The SLP solution for spill costing has some inaccuracies. In particular,
it basically just guesses whether an intrinsic will be lowered to a call
or not, and can be wrong in both directions. It also has no mechanism to
differentiate on calling convention.

This has the effect of making partial vectorization (i.e. starting in
scalar) more profitable. In practice, the major effect of this is to
make it more likely that SLP will vectorize part of a tree in an
intersecting forest, and then vectorize the remaining tree once those uses have been
removed.

This has the effect of biasing us slightly away from strided or indexed
loads during vectorization - because the scalar cost is more accurately
modeled, and these instructions look relatively less profitable.
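
The instruction-count argument for the first change can be written down directly. This is a minimal sketch with hypothetical function names; it counts instructions as stated in the message and does not reproduce the cost values TTI actually returns.

```python
def insertelement_sequence_ops(n: int) -> int:
    """Inserting N elements one by one: N-1 slides plus N vmv.s.x."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) + n

def build_vector_ops(n: int) -> int:
    """Full build_vector upper bound: N vslide1down."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n

for n in (2, 4, 8):
    print(n, insertelement_sequence_ops(n), build_vector_ops(n))
# 2 3 2
# 4 7 4
# 8 15 8
```

The per-element insert sequence is close to twice as long as the vslide1down chain for larger N, which is the gap the new build_vector costing removes.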
2024-09-20 08:34:36 -07:00
Luke Lau
400b725c27 [RISCV] Remove -riscv-v-vector-bits-min from cost model tests. NFC
It looks like they were added to prevent fixed length vectors from
being expanded, but that's no longer the case today:
https://reviews.llvm.org/D121447#3376520
2024-09-20 15:02:42 +08:00
Sam Tebbs
b49a6b2a9d [AArch64] Consider histcnt smaller than i32 in the cost model (#108521)
This PR updates the AArch64 cost model to consider the cheaper cost of
<i32 histograms to reflect the improvements from
https://github.com/llvm/llvm-project/pull/101017 and
https://github.com/llvm/llvm-project/pull/103037

Work by Max Beck-Jones (@DevM-uk)

---------

Co-authored-by: DevM-uk <max.beck-jones@arm.com>
2024-09-19 13:56:52 +01:00
Elvis Wang
edc71e22c0 [RISCV][TTI] Add instruction cost for vp.load/store. (#109245)
This patch makes the instruction cost of vp.load/store the same as that
of their non-vp counterparts.
2024-09-19 16:00:21 +08:00
Luke Lau
8d7d4c25cb [RISCV] Split fp rounding ops with zvfhmin nxv32f16 (#108765)
This adds zvfhmin test coverage for fceil, ffloor, fnearbyint, frint,
fround and froundeven and splits them at nxv32f16 to avoid crashing,
similarly to what we do for other nodes that we promote.

This also sets ftrunc to promote which was previously missing. We
already promote the VP version of it, vp_froundtozero.
Marking it as promoted affects some of the cost model tests since
they're no longer expanded.
2024-09-18 16:36:13 +08:00
Sushant Gokhale
090850f15d [AArch64][CostModel] Add NFC tests for extractelement cost (#108941)
A subsequent patch aims to reduce the extractelement cost where the only
user(s) are fmul instructions.
2024-09-17 22:57:05 +05:30
Philip Reames
2e7c7d20d5 [RISCV][TTI] Adjust cost for extract/insert element when VLEN is known (#108595)
If we know an exact VLEN, then the index is effectively modulo the
number of elements in a single vector register. Our lowering performs
this subvector optimization.

A bit of context. This change may look a bit strange on its own given
we are currently *not* scaling insert/extract cost by LMUL. This costing
decision needs to change, but is very intertwined with SLP
profitability, and is thus a bit hard to adjust. I'm hoping that
https://github.com/llvm/llvm-project/pull/108419 will let me start to
untangle this. This change is basically a case of finding a subset I can
tackle before other dependencies are in place which does no real harm in
the meantime.
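
The "index modulo the number of elements in a single vector register" observation can be sketched as follows. The function name and the example VLEN are assumptions for illustration; the real logic lives in the RISC-V lowering and TTI code.

```python
def locate_element(index: int, vlen_bits: int, elt_bits: int) -> tuple[int, int]:
    """Map an element index in a register group to (register, index in register).

    Assumes an exactly known VLEN, so each vector register holds
    vlen_bits // elt_bits elements.
    """
    if vlen_bits <= 0 or elt_bits <= 0 or vlen_bits % elt_bits != 0:
        raise ValueError("vlen_bits must be a positive multiple of elt_bits")
    if index < 0:
        raise ValueError("index must be non-negative")
    elts_per_reg = vlen_bits // elt_bits
    return divmod(index, elts_per_reg)

# With VLEN=128 and i32 elements there are 4 elements per register, so
# element 9 of an LMUL=4 group is element 1 of the third register.
print(locate_element(9, vlen_bits=128, elt_bits=32))  # (2, 1)
```

Because the extract or insert then only touches one register of the group, its cost depends on the index within that register rather than on the full group index.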
2024-09-17 08:43:40 -07:00
David Green
960c975acd [AArch64] Expand scmp/ucmp vector operations with sub (#108830)
Unlike scalar, where AArch64 prefers expanding scmp/ucmp with select,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably it also prevents the scalarization of vselect
during vector-legalization.
2024-09-16 18:44:52 +01:00
Sushant Gokhale
7a6945fcf6 [AArch64][SLP] Add NFC test cases for floating point reductions (#106507)
A subsequent patch will be added to fix some of the tests.

Pull request: #106507
2024-09-12 23:07:12 +05:30
Luke Lau
89c10e27d8 [RISCV] Add zvfhmin cost model test coverage. NFC
This adds test coverage for zvfhmin and half types in general in the cost
model tests.

Some existing half tests were split into separate functions so that if
the check prefixes diverge it won't affect the rest of the non-half
instructions.

Whilst we're here, also remove the redundant
-riscv-vector-bits-min=128 and declares.
2024-09-12 18:41:47 +08:00
Elvis Wang
1b3e64a9d2 [RISCV][TTI] Add vp.cmp intrinsic cost with functionalOPC. (#107504)
This patch makes the instruction cost of VP compare intrinsics the same as
that of their non-VP counterparts.
2024-09-12 07:06:36 +08:00
Florian Hahn
ea83e1c05a [LV] Assign cost to all interleave members when not interleaving.
At the moment, the full cost of all interleave group members is assigned
to the instruction at the group's insert position, even if the decision
was to not form an interleave group.

This can lead to inaccurate cost estimates, e.g. if the instruction at
the insert position is dead. If the decision is to not vectorize but
scalarize or scatter/gather, then the cost will be the total cost for all
members. In those cases, assign the individual cost per member, to more
closely reflect the choice per instruction.

This fixes a divergence between legacy and VPlan-based cost model.

Fixes https://github.com/llvm/llvm-project/issues/108098.
2024-09-11 21:04:34 +01:00
Florian Hahn
1741b9c3d7 [LV] Generalize check lines for interleave group costs.
Check cost of all instructions in an interleave group, to prepare for
follow-up changes.
2024-09-11 15:21:32 +01:00
Florian Hahn
70ff6501e6 [AArch64] Auto-generate check-lines in cost model test.
Auto-generate check lines for easier updating.
2024-09-06 22:38:02 +01:00
Florian Hahn
b0ae93e847 [AArch64] Add more type combinations to vector fp conversion cost tests.
Generalize test coverage for https://github.com/llvm/llvm-project/pull/107303

Also adjust the name to reflect the fact that it is not limited to
vectors with 3 elements now.
2024-09-06 14:49:45 +01:00
Jon Roelofs
bded3b3ea9 [llvm][AArch64] Improve the cost model for i128 div's (#107306) 2024-09-05 07:42:23 -07:00
Elvis Wang
845d8d909c [RISCV][TTI] Add cost of typebased cast VPIntrinsics with functionalOPC. (#97797)
This patch makes the instruction cost of type-based cast VP intrinsics
the same as that of their non-VP counterparts.
This is a follow-up to
[#93435](https://github.com/llvm/llvm-project/pull/93435)
2024-09-05 13:05:01 +08:00
Florian Hahn
34f2c9a9ce [AArch64] Add tests for FP conversion with 3 element vectors.
Add tests showing a number of cases where costs for floating point
conversions are overestimated for vectors with 3 elements.
2024-09-04 20:44:14 +01:00
Elvis Wang
0ad6cee926 [RISCV] Fix missing i64 to double tests in the cast.ll. (NFC) (#106972) 2024-09-04 11:29:50 +08:00