Each vslide1down operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique inserts, each with a cost linear in LMUL, the overall cost is O(VL*LMUL). Since VL is a linear function of LMUL, this means the current lowering is quadratic in both LMUL and VL. To avoid the degenerate case, fall back to the stack if the cost is more than a fixed (linear) threshold.
For context, here are the sifive-x280 llvm-mca results for the current lowering and the stack-based lowering at each LMUL (using e64). This assumes the code was compiled for V (i.e. zvl128b).
buildvector_m1_via_stack.mca:Total Cycles: 1904
buildvector_m2_via_stack.mca:Total Cycles: 2104
buildvector_m4_via_stack.mca:Total Cycles: 2504
buildvector_m8_via_stack.mca:Total Cycles: 3304
buildvector_m1_via_vslide1down.mca:Total Cycles: 804
buildvector_m2_via_vslide1down.mca:Total Cycles: 1604
buildvector_m4_via_vslide1down.mca:Total Cycles: 6400
buildvector_m8_via_vslide1down.mca:Total Cycles: 25599
There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, stack based seems to cost better on all LMULs, so we can just go with the simpler scheme.
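The growth argument above can be sketched with a toy cost model. This is an illustration only: the unit cost per LMUL and the function names are assumptions, not values taken from the sifive-x280 scheduling model.

```python
# Toy cost model (hypothetical units): one vslide1down costs LMUL,
# and a build_vector of VL unique elements needs VL of them.

def vl_for(lmul, vlen=128, sew=64):
    """Number of e<sew> elements in an LMUL register group at the given VLEN."""
    return lmul * vlen // sew


def vslide1down_cost(lmul, vlen=128, sew=64):
    """Total modeled slide cost: VL inserts, each linear in LMUL."""
    return vl_for(lmul, vlen, sew) * lmul


if __name__ == "__main__":
    for lmul in (1, 2, 4, 8):
        print(lmul, vl_for(lmul), vslide1down_cost(lmul))
```

In this model, doubling LMUL quadruples the cost, which is roughly the factor between the m4 and m8 vslide1down figures listed above.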
Arguably, this patch fixes a regression introduced by my D149667: before that change, we'd always fall back to the stack, and thus didn't have the non-linearity.
Differential Revision: https://reviews.llvm.org/D159332
Now that the codegen for the expanded ISD::ROTL sequence has been improved,
it's probably profitable to lower a shuffle that's a rotate to the
vsll+vsrl+vor sequence to avoid a vrgather where possible, even if we don't
have the vror instruction.
This patch relaxes the restriction on ISD::ROTL being legal in
lowerVECTOR_SHUFFLEAsRotate. It also attempts to do the lowering twice: Once
if zvbb is enabled before any of the interleave/deinterleave/vmerge lowerings,
and a second time unconditionally just before it falls back to the vrgather.
This way it doesn't interfere with any of the above patterns that may be more
profitable than the expanded ISD::ROTL sequence.
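As a reminder of what the expanded sequence computes, here is a minimal scalar model of a rotate built from a left shift, a right shift and an or. The function name and the per-element integer model are illustrative assumptions; this is not the backend code.

```python
def rotl(x, amt, bits):
    """Rotate left, composed the way a vsll + vsrl + vor sequence would do it."""
    mask = (1 << bits) - 1
    amt %= bits
    left = (x << amt) & mask                     # vsll: bits shifted past the top are dropped
    right = (x & mask) >> ((bits - amt) % bits)  # vsrl: brings the dropped bits to the bottom
    return left | right                          # vor: combine both halves


if __name__ == "__main__":
    print(hex(rotl(0x12345678, 8, 32)))
```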
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D159353
We currently have log, log2, log10, exp and exp2 intrinsics. Add exp10
to fix this asymmetry. AMDGPU already has most of the code for f32
exp10 expansion implemented alongside exp, so the current
implementation is duplicating nearly identical effort between the
compiler and library which is inconvenient.
https://reviews.llvm.org/D157871
If the high and low 32 bits are the same, we try to use
(ADD X, (SLLI X, 32)) but that only works if bit 31 is clear since
the low 32 bits will be sign extended.
If we have Zba we can use add.uw to zero the sign extended bits.
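The sign-extension problem and the add.uw fix can be checked with a small 64-bit integer model. The helper names are hypothetical; the model assumes X is held as the sign-extended low 32 bits, as described above.

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Sign-extend the low 32 bits of x to 64 bits."""
    x &= 0xFFFFFFFF
    return (x | 0xFFFFFFFF00000000) if x & 0x80000000 else x


def slli(x, shamt):
    """64-bit logical shift left."""
    return (x << shamt) & MASK64


def add(a, b):
    """64-bit wrapping add."""
    return (a + b) & MASK64


def add_uw(rs1, rs2):
    """Zba add.uw: rs2 plus the zero-extended low 32 bits of rs1."""
    return ((rs1 & 0xFFFFFFFF) + rs2) & MASK64


def via_add(lo32):
    """(ADD X, (SLLI X, 32)) with X sign-extended from 32 bits."""
    x = sext32(lo32)
    return add(x, slli(x, 32))


def via_add_uw(lo32):
    """Same sequence, but using add.uw to ignore the sign-extended upper bits of X."""
    x = sext32(lo32)
    return add_uw(x, slli(x, 32))
```

With bit 31 clear both sequences agree; with bit 31 set only the add.uw form reproduces the value whose high and low halves match.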
Reviewed By: reames, wangpc
Differential Revision: https://reviews.llvm.org/D159253
A shuffle of v256i1 with a large enough minimum vlen might make it through type
legalization and into lowering. In this case, zvl1024b was enough. The
bitreverse shuffle lowering would then try to convert this to a v1i256 type
which is invalid (v1i128 exists though, which is why the existing v128i1 tests
were fine).
This patch checks to make sure that the new type is not only legal but also
valid.
Reviewed By: craig.topper, reames
Differential Revision: https://reviews.llvm.org/D159215
Now that DAG.getConstant uses splat_vector_parts if needed on RV32, we can use
it directly without having to manually lower to a vmv_v_x_vl.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D159287
This re-implements the special casing we had in lowerScalarSplat as a DAG combine. As can be seen in the tests, this ends up triggering in a bunch more cases.
The semantically interesting bit of this change is the use of the implicit truncate semantics for when XLEN > SEW. We'd already been doing this for vmv.v.x, but this change extends e.g. the constant matching to make the same assumption about vmv.s.x. Per my reading of the specification, this should be fine, and if anything, is more obviously true of vmv.s.x than vmv.v.x.
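A minimal sketch of the implicit truncation being relied on, assuming the scalar is an XLEN-wide integer and only its low SEW bits reach the element. The function name is hypothetical.

```python
def vmv_s_x_element(scalar, sew, xlen=64):
    """Value written to element 0: the low SEW bits of the XLEN-wide scalar."""
    xlen_value = scalar & ((1 << xlen) - 1)  # the scalar as held in an XLEN-bit register
    return xlen_value & ((1 << sew) - 1)     # upper XLEN-SEW bits are ignored
```

Two constants that agree in their low SEW bits therefore produce the same element, which is what lets the constant matching treat them as interchangeable.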
Differential Revision: https://reviews.llvm.org/D158874
We'd discussed this in the original set of patches months ago, but decided against it. I think we should reverse ourselves here as the code is significantly more readable, and we do pick up cases we'd missed by not calling the appropriate helper routine.
Differential Revision: https://reviews.llvm.org/D158854
A rotate of 8 bits of an e16 vector in either direction is equivalent to a
byteswap, i.e. vrev8. There is a generic combine on ISD::ROT{L,R} to
canonicalize these rotations to byteswaps, but on fixed vectors they are
legalized before they have the chance to be combined. This patch teaches the
rotate vector_shuffle lowering to emit these rotations as byteswaps to match
the scalable vector behaviour.
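The equivalence is easy to confirm exhaustively for 16-bit elements. The helpers below are an illustrative scalar model, not the lowering code.

```python
def rotl16(x, amt):
    """Rotate a 16-bit value left by amt bits."""
    amt %= 16
    return ((x << amt) | (x >> ((16 - amt) % 16))) & 0xFFFF


def rotr16(x, amt):
    """Rotate a 16-bit value right by amt bits."""
    return rotl16(x, (16 - amt) % 16)


def bswap16(x):
    """Swap the two bytes of a 16-bit value."""
    return ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)


def rotate_by_8_is_bswap():
    """Check every 16-bit value in both rotate directions."""
    return all(rotl16(x, 8) == bswap16(x) == rotr16(x, 8) for x in range(1 << 16))
```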
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158195
Given a shuffle mask like <3, 0, 1, 2, 7, 4, 5, 6> for v8i8, we can
reinterpret it as a shuffle of v2i32 where the two i32s are bit rotated, and
lower it as a vror.vi (if legal with zvbb enabled).
We also need to make sure that the larger element type is a valid SEW, hence
the tests for zve32x.
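The reinterpretation can be demonstrated on concrete bytes. This sketch assumes a little-endian element layout (element 0 is the low byte of each i32) and uses hypothetical helper names.

```python
def shuffle(elems, mask):
    """result[i] = elems[mask[i]], as in a single-source vector_shuffle."""
    return [elems[i] for i in mask]


def rotl32(x, amt):
    """Rotate a 32-bit value left by amt bits."""
    amt %= 32
    return ((x << amt) | (x >> ((32 - amt) % 32))) & 0xFFFFFFFF


def shuffle_as_bytes(data, mask):
    """Apply the v8i8 shuffle directly."""
    return bytes(shuffle(list(data), mask))


def rotate_as_words(data, amt):
    """View the same 8 bytes as two i32s and rotate each one left by amt bits."""
    words = [int.from_bytes(data[i:i + 4], "little") for i in (0, 4)]
    return b"".join(rotl32(w, amt).to_bytes(4, "little") for w in words)


MASK = [3, 0, 1, 2, 7, 4, 5, 6]
DATA = bytes(range(0x10, 0x18))
```

For this mask, the byte shuffle and a left rotate by 8 bits of each i32 give the same result.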
X86 already did this, so I've extracted the logic for it and put it inside
ShuffleVectorSDNode so it could be reused by RISC-V. I originally tried to add
this as a generic combine in DAGCombiner.cpp, but it ended up causing worse
codegen on X86 and PPC.
Reviewed By: reames, pengfei
Differential Revision: https://reviews.llvm.org/D157417
If doubling the VL will fit in a vsetivli, use it. It will be cheap
to change and cheap to change back.
This improves codegen from D158896.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158896
We can use a 32-bit splat and bitcast to i64 vector.
This only handles the case where we are using vlmax so that the new
vl is cheap to compute. This could be generalized to double the VL.
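A small model of the splat-and-bitcast idea, assuming (this is inferred from context, not stated above) that the high and low 32-bit halves of the i64 value are identical. The function name is hypothetical.

```python
import struct


def splat_i64_via_i32(value32, vl):
    """Splat a 32-bit value over 2*vl e32 elements, then reinterpret as vl e64 elements."""
    lanes = [value32 & 0xFFFFFFFF] * (2 * vl)      # the e32 splat uses twice the element count
    raw = struct.pack("<%dI" % (2 * vl), *lanes)   # little-endian register contents
    return list(struct.unpack("<%dQ" % vl, raw))   # bitcast: same bytes viewed as e64
```

The doubled element count is why the new VL needs to be cheap to compute.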
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158879
There was quite a bit of duplication between splatPartsI64WithVL
and the scalable vector handling in lowerSPLAT_VECTOR_PARTS, but
scalable vector had one additional case. Move that case to
splatPartsI64WithVL which improves some fixed vector tests.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158876
FCLASS_VL was added in D151176, but there is still no vp.fpclass. This patch adds support for it.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D152993
When lowering a splat_vector_parts, if the hi bits are undefined then we can
splat the lo bits without having to check if it's going to be sign extended or
not, because those bits will be undefined anyway.
I've handled it for both fixed and scalable vectors, but there's no diff
on the scalable vror tests, since the hi bits aren't combined away to
undef in SimplifyDemanded for scalable vectors. I'm not sure why that is.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D158625
At some point a merge operand was added to the binary vl ops, so this combine
was using the mask for the VL. This causes a crash when trying to
select the vmv_v_x_vl, which showed up locally when messing about with
selectVSplat, but thankfully in ToT the vmv_v_x_vl gets pattern matched
away into the .vx and .vi operands every time, so there's no noticeable
change.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D158634
For most fp16 vector ops, we can promote to an fp32 vector when zvfhmin is enabled but zvfh is not.
For nxv32f16, however, we need to split first, since nxv32f32 is not a valid MVT.
Reviewed By: michaelmaitland
Differential Revision: https://reviews.llvm.org/D153848
This extends the concat_vector of loads to strided_load transform to handle a reversed index pattern. The previous code expected indexing of the form (a0, a1+S, a2+S,...). However, we can also see indexing of the form (a1+S, a2+S, a3+S, .., aS). This form is a strided load starting at address aN + S*(n-1) with stride -S.
Note that this also fixes what looks to be a bug in the memory location reasoning for the forward strided case. A strided load with negative stride accesses eltsize bytes past base ptr, and then bytes *before* base ptr. (That is, the range should extend from before base ptr to after base ptr.)
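The address pattern and the accessed byte range can be sketched as follows. The function names and the concrete numbers are illustrative assumptions.

```python
def strided_addresses(base, stride, n):
    """Element addresses touched by an n-element strided load."""
    return [base + i * stride for i in range(n)]


def accessed_range(base, stride, n, eltsize):
    """Half-open byte range [lo, hi) covered by the strided load."""
    addrs = strided_addresses(base, stride, n)
    return min(addrs), max(addrs) + eltsize


if __name__ == "__main__":
    # Four loads concatenated in descending address order, 16 bytes apart.
    print(strided_addresses(100 + 3 * 16, -16, 4))
    print(accessed_range(100 + 3 * 16, -16, 4, 4))
```

With a negative stride, the range ends eltsize bytes past the base pointer and begins well before it, so reasoning that only looks forward from the base pointer misses most of the accessed memory.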
Differential Revision: https://reviews.llvm.org/D157886
If we have a known (or bounded) index which definitely fits in a smaller LMUL register group size, we can reduce the LMUL of the slide and extract instructions. This loosens constraints on register allocation, and allows the hardware to do less work, at the potential cost of some additional VTYPE toggles. In practice, we appear (after prior patches) to do a decent job of eliminating the additional VTYPE toggles in most cases.
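One way to picture the LMUL choice is the sketch below, which assumes integer LMULs only, a known minimum VLEN, and a hypothetical helper name; the actual patch may use different conditions.

```python
def smallest_lmul_for_index(max_index, sew, min_vlen, container_lmul):
    """Smallest LMUL (1, 2, 4, 8) whose register group is guaranteed to hold max_index."""
    elems_per_reg = min_vlen // sew  # elements guaranteed to fit in one vector register
    lmul = 1
    # Grow the group until the index fits, but never beyond the original container.
    while lmul < container_lmul and max_index >= lmul * elems_per_reg:
        lmul *= 2
    return lmul
```

For example, with e64 elements and a minimum VLEN of 128, indices 0 and 1 fit in a single register even if the source vector is an m8 group.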
Differential Revision: https://reviews.llvm.org/D158460
Preparation for developing a new rounding mode insertion algorithm
that is going to be different between them since VXRM doesn't need
to be saved/restored.
This also unifies the FRM handling in RISCVISelLowering.cpp between
scalar and vector.
Fixes outdated comments in RISCVAsmPrinter and sorts the predicate
function by the reverse order of the operands being skipped.
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D158326
clang recently started checking for INT64_MIN being passed to 64-bit std::abs.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D158304
If we can fit an entire vector of i1 into a single element, e.g. v32i1 ->
v1i32, then we can reverse it via vbrev.v.
We need to handle the case where the vector doesn't exactly fit into the larger
element type, e.g. v4i1 -> v1i8. In this case we shift up the reversed bits
afterwards.
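A bit-level model of the partial-fit case, for a mask of at most 8 elements packed into one byte. In this model element 0 sits in bit 0, so the correction after the full-width reversal moves the reversed bits back into the low positions; the helper names are hypothetical and the shift used by the real lowering is not shown here.

```python
def brev8(x):
    """Reverse the bits of an 8-bit value (a stand-in for vbrev.v on one e8 element)."""
    return int(format(x & 0xFF, "08b")[::-1], 2)


def reverse_mask(bits, n):
    """Reverse an n-element i1 vector packed into the low n bits of a byte (n <= 8)."""
    reversed_full = brev8(bits)      # element 0 is now in bit 7
    return reversed_full >> (8 - n)  # drop the padding so element n-1 lands in bit 0
```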
Reviewed By: fakepaper56, 4vtomat
Differential Revision: https://reviews.llvm.org/D157614
The constants can be with larger bit width, so we need to truncate
them to EltSize or we will exceed the width of fixed-length vector.
Fixes #64588
Reviewed By: luke, craig.topper, bjope, michaelmaitland
Differential Revision: https://reviews.llvm.org/D157603
FRINT was added to matchRoundingOp after this function was written.
So FRINT was not tested originally.
For vectors, folding this causes us to create a CSR swap that tries
to write 7 to FRM. This is an illegal value and will cause the CSR
write to fail.
While this might be a legal fold we could do, I'm disabling it for
now so we can backport to LLVM 17 with the least risk.
Differential Revision: https://reviews.llvm.org/D157583
This reuses the same strategy for fixed vectors as other ops, i.e. custom lower
to a scalable *_vl SD node.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D157294
We have a variant of this for splats already, but hadn't handled the case where a single copy of the wider element can be inserted producing the entire required bit pattern. This shows up mostly in very small vector shuffle tests.
Differential Revision: https://reviews.llvm.org/D157299
There are cases where the -1 doesn't become visible until lowering
so the folding doesn't have a chance to run.
I think in these cases there is a missed DAGCombine for truncate (undef),
which I may fix separately, but RISC-V backend should protect itself.
Fixes #64503.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D157314
If we have a dominant value, we can still use a v(f)slide1down to handle the last value in the vector if that value is neither undef nor the dominant value.
Note that we can extend this idea to any tail of elements, but that ends up being a near-complete merge of the v(f)slide1down insert path, and requires a bit more untangling of the profitability heuristics first.
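The single-trailing-element case can be modeled with lists. This is a sketch with hypothetical names that assumes every element except the last equals the dominant value.

```python
def vslide1down(vec, scalar):
    """Shift every element down one index and place scalar in the last slot."""
    return vec[1:] + [scalar]


def build_with_dominant(values):
    """Splat the dominant value, then slide in the differing last element."""
    dominant, last = values[0], values[-1]
    if any(v != dominant for v in values[:-1]):
        raise ValueError("only the last element may differ from the dominant value")
    splat = [dominant] * len(values)
    return vslide1down(splat, last)
```

One splat plus one slide replaces what would otherwise be a separate insert for the last element.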
Differential Revision: https://reviews.llvm.org/D157120
This ports over the test cases half-convert.ll and implements patterns
or RISCVISelLowering.cpp changes for all of the most straight-forward
cases (those that don't require changes outside of lib/Target/RISCV).
The remaining cases and noted poor codegen for saturating conversions
will be handled in follow-up patches.
Differential Revision: https://reviews.llvm.org/D156943
Part of this test file was stolen from D156895. We should merge them
when committing.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D156926
This doesn't bring us to parity with the test/CodeGen/RISCV/half-* test
cases; it simply picks off an initial set that can be supported
especially easily. In order to make the review more manageable, I'll
follow up with other cases.
There is zero innovation in the test cases - they simply take the
existing half/float cases and replace f16->bf16 and half->bfloat.
Differential Revision: https://reviews.llvm.org/D156895
isOperationLegalOrCustomOrPromote returns true only if VT is MVT::Other or legal
and the operation action is Legal, Custom or Promote.
This patch permits converting a vector binary operation to a scalar binary operation that is custom lowered with an illegal type.
One such case: i32 isn't a legal type on RV64 and its ALU operations are set to custom lowering,
so a vadd with element type i32 can be converted to addw.
Reviewed By: jacquesguan, craig.topper
Differential Revision: https://reviews.llvm.org/D156692
D155929 taught lowerScalarInsert to handle a start value of the form (extractelement scalable_vector, 0)
and specifically converts fixed extracted vectors to scalable vectors when
lowering vector reductions. That is not enough, because there is another way to
create (extractelement fixed_vector, 0) as a start value of lowerScalarInsert,
as in #64327.
#64327: https://github.com/llvm/llvm-project/issues/64327.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D156863
These test cases previously caused an error. RISCVInstrInfo::copyPhysReg also needed a tweak in order to account for copying bf16 values in FPR16 registers.
Differential Revision: https://reviews.llvm.org/D156883
I want these to have RISC-V semantics not LLVM IR semantics. Specifically
that -0.0 comes before +0.0.
This is needed to emulate FMAXIMUM/FMINIMUM for vectors.
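A scalar sketch of the ordering being asked for, where -0.0 is treated as smaller than +0.0. The NaN handling shown (return the non-NaN operand) reflects the RISC-V fmin/fmax instructions as I understand them and goes beyond what is stated above; the function names are hypothetical.

```python
import math


def riscv_fmin(a, b):
    """Minimum where -0.0 orders before +0.0; a single NaN operand is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == 0.0 and b == 0.0:
        return a if math.copysign(1.0, a) < 0 else b  # prefer the negative zero
    return min(a, b)


def riscv_fmax(a, b):
    """Maximum where +0.0 orders after -0.0; a single NaN operand is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == 0.0 and b == 0.0:
        return a if math.copysign(1.0, a) > 0 else b  # prefer the positive zero
    return max(a, b)
```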
As noted in <https://github.com/llvm/llvm-project/issues/64090>, it's
more efficient to lower a partword `atomicrmw xchg a, 0` to an amoand
with an appropriate mask. There are a range of possible ways to go about
this - e.g. writing a combine based on the
`llvm.riscv.masked.atomicrmw.xchg` intrinsic, or introducing a new
interface to AtomicExpandPass to allow target-specific atomics
conversions, or trying to lift the conversion into AtomicExpandPass
itself based on querying some target hook. Ultimately I've gone with
what appears to be the simplest approach - just covering this case in
emitMaskedAtomicRMWIntrinsic. I perhaps should have given that hook a
different name way back when it was introduced.
This also handles the `atomicrmw xchg a, -1` case suggested by Craig
during review.
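The equivalence behind both cases can be modeled on a single aligned 32-bit word. This is a non-atomic sketch with hypothetical helper names; treating the -1 case as an OR with the mask is my reading of the approach, not something spelled out above.

```python
WORD_MASK = 0xFFFFFFFF


def partword_mask(byte_offset, width_bytes):
    """Mask selecting the narrow field inside its aligned 32-bit word (little-endian)."""
    return (((1 << (8 * width_bytes)) - 1) << (8 * byte_offset)) & WORD_MASK


def xchg_zero_via_and(word, byte_offset, width_bytes):
    """Model of xchg with 0 as an AND with ~mask. Returns (old field, new word)."""
    mask = partword_mask(byte_offset, width_bytes)
    old_field = (word & mask) >> (8 * byte_offset)
    return old_field, word & ~mask & WORD_MASK


def xchg_ones_via_or(word, byte_offset, width_bytes):
    """Model of xchg with -1 as an OR with mask. Returns (old field, new word)."""
    mask = partword_mask(byte_offset, width_bytes)
    old_field = (word & mask) >> (8 * byte_offset)
    return old_field, (word | mask) & WORD_MASK
```

In both cases the neighbouring bytes in the word are left unchanged, so a single AMO on the containing word suffices and no compare-and-retry loop is needed.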
Fixes https://github.com/llvm/llvm-project/issues/64090
Differential Revision: https://reviews.llvm.org/D156801
This patch implements the getOptimalMemOpType callback which is used by the generic mem* lowering in SelectionDAG to pick the widest type used. This patch only changes the behavior when vector instructions are available, as the default is reasonable for scalar.
Without this change, we were emitting either XLEN sized stores (for aligned operations) or byte sized stores (for unaligned operations.) Interestingly, the final codegen was nowhere near as bad as that would seem to imply. Generic load combining and store merging kicked in, and frequently (but not always) produced pretty reasonable vector code.
The primary effects of this change are:
* Enable the use of vector operations for memset of non-constant. Our generic store merging logic doesn't know how to merge a broadcast store, and thus we were seeing the generic (and awful) byte expansion lowering for unaligned memset.
* Enable the generic misaligned overlap trick where we write to some of the same bytes twice. The alternative is to either a) use an increasingly small sequence of stores for the tail or b) use VL to restrict the vector store. The latter is not implemented at this time, so the former is what previously happened. Interestingly, I'm not sure that changing VL (as opposed to the overlap trick) is even obviously profitable here.
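The overlap trick from the second bullet can be sketched on a byte buffer. The store width, function names and buffer model are illustrative assumptions and do not reflect how SelectionDAG represents the stores.

```python
def store_offsets_with_overlap(n, width):
    """Offsets of width-byte stores covering n bytes; the last store may overlap the previous one."""
    if n < width:
        raise ValueError("this sketch only handles n >= width")
    offsets = list(range(0, n - width + 1, width))  # full-width stores from the start
    if offsets[-1] + width < n:
        offsets.append(n - width)                   # final store, aligned to the end of the region
    return offsets


def memset_overlap(buf, value, n, width):
    """Fill the first n bytes of buf using only width-byte stores."""
    for off in store_offsets_with_overlap(n, width):
        buf[off:off + width] = bytes([value]) * width
    return buf
```

A 20-byte memset with 16-byte stores then takes two stores at offsets 0 and 4, rewriting 12 bytes, instead of one 16-byte store followed by a tail of smaller stores.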
Differential Revision: https://reviews.llvm.org/D156249