Commit Graph

7842 Commits

Author SHA1 Message Date
Fangrui Song
fee52fe0ad [X86] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off build. NFC 2021-11-15 13:10:47 -08:00
Simon Pilgrim
441de2536b [X86] Add generic splitVectorOp helper. NFC
Update splitVectorIntUnary/splitVectorIntBinary to use this internally, after some operand type sanity checks.

Avoid code duplication and makes it easier to split other vector instruction forms in the future.
2021-11-15 17:59:23 +00:00
Simon Pilgrim
fc7c1cebbc [X86] LowerFunnelShift - pull out repeated EltSizeInBits variable. NFC. 2021-11-15 17:11:44 +00:00
Sanjay Patel
3d01507c2d [x86] fold vector (X > -1) & Y to shift+andn (2nd try)
The first try at this patch ( bf5748a1af ) was reverted ( 5be64d4164 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.

This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.

A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.

Original commit message:

and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y

This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).

We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.

Differential Revision: https://reviews.llvm.org/D113603
2021-11-15 11:09:32 -05:00
Simon Pilgrim
ea9e6aa423 [X86] getAVX512Node() - find constant broadcasts to encourage load-folding
If an operand is a bitcasted or widended constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.

I've refactored getAVX512Node so that VLX targets can make better use of this as well.

NOTE: In the future, I think we should consider removing the broadcast of constant data from DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets has similar problems with missed (whole vector) folds that need to be improved as well.

Differential Revision: https://reviews.llvm.org/D113845
2021-11-15 15:52:03 +00:00
Hans Wennborg
5be64d4164 Revert "[x86] fold vector (X > -1) & Y to shift+andn"
This casued assertion failures:

  llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
  void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
  Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
  && "Cannot use this version of ReplaceAllUsesWith!"' failed.

See comment on the code review.

(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
 manually due to other changes having landed after the reverted one.)

> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603

This reverts commit bf5748a1af.
2021-11-15 12:35:49 +01:00
Simon Pilgrim
c3a772fdf5 [X86] Add getPack helper
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.

It also optionally packs the upper half instead of the lower half.
2021-11-14 21:27:15 +00:00
Simon Pilgrim
f4143ffed7 [X86] Widen 128/256-bit VPTERNLOG patterns to 512-bit on non-VLX targets
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.

Fixes regressions in D113192

Differential Revision: https://reviews.llvm.org/D113827
2021-11-14 13:40:53 +00:00
Simon Pilgrim
a310cbae02 [X86] Add getAVX512Node helper. NFC.
For AVX512 targets without VLX, we have to widen 128/256-bit vectors to 512-bits to use some specific AVX512 instructions (or some other instructions with predicates etc.).

I've pulled out the widening code from LowerFunnelShift into the helper function, so we can convert some other widening patterns in the future.
2021-11-13 13:59:42 +00:00
Craig Topper
82bc6a094e [X86] Promote f16 STRICT_FROUND to f32 and call libc.
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D113817
2021-11-12 21:37:03 -08:00
Kazu Hirata
efa896e5f7 [Target] Use SDNode::uses (NFC) 2021-11-12 21:23:04 -08:00
Simon Pilgrim
6bb71738e2 [X86] convertShiftLeftToScale - improve vXi8 constant handling
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.

We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
2021-11-12 16:48:10 +00:00
Simon Pilgrim
59087dce3b [X86] combineX86ShufflesConstants - constant fold from target shuffles unless optsize = true
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.

This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.

This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.

Differential Revision: https://reviews.llvm.org/D113748
2021-11-12 14:02:43 +00:00
Sanjay Patel
bf5748a1af [x86] fold vector (X > -1) & Y to shift+andn
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y

This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).

We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.

Differential Revision: https://reviews.llvm.org/D113603
2021-11-12 08:17:46 -05:00
Phoebe Wang
74b979abcd [X86][FP16] Avoid to generate VZEXT_MOVL with i16
This fixes the crash due to lacking VZEXT_MOVL support with i16.

Reviewed By: LuoYuanke, RKSimon

Differential Revision: https://reviews.llvm.org/D113661
2021-11-12 09:32:29 +08:00
Simon Pilgrim
94a901a50a [X86] Move LowerFunnelShift below LowerShift. NFC.
Makes it easier to reuse the various vector shift helpers defined above LowerShift
2021-11-11 18:45:51 +00:00
Sanjay Patel
a8abd19b10 [x86] simplify code; NFC
We bail out if the types don't match, so it's clearer to
have a single variable to show that common type.
2021-11-10 13:29:57 -05:00
Sanjay Patel
5424fb164a [x86] fix formatting; NFC 2021-11-10 13:29:57 -05:00
Simon Pilgrim
a1e0aa75ca [X86] combineMulToPMADDWD - remove useless TODO
We should always be able to use PMULUDQ/PMULDQ in PMADDWD patterns with greater than 32-bit extended integer sources
2021-11-10 16:56:44 +00:00
Sanjay Patel
be9e892e9d [x86] shorten function name; NFC 2021-11-10 09:44:55 -05:00
Simon Pilgrim
d510fd2bed [X86] combineMulToPMADDWD - handle any pow2 vector type and split to legal types
combineMulToPMADDWD is currently limited to legal types, but there's no reason why we can't handle any larger type that the existing SplitOpsAndApply code can use to split to legal X86ISD::VPMADDWD ops.

This also exposed a missed opportunity for pre-SSE41 targets to handle SEXT ops from types smaller than vXi16 - without PMOVSX instructions these will always be expanded to unpack+shifts, so we can cheat and convert this into a ZEXT(SEXT()) sequence to make it a valid PMADDWD op.

Differential Revision: https://reviews.llvm.org/D110995
2021-11-09 15:20:43 +00:00
Simon Pilgrim
7f32edea23 [X86] combineMulToPMADDWD - use ComputeMinSignedBits(). NFCI.
Use ComputeMinSignedBits() to ensure the mul source operands at least sign-extend down from the bottom 16 bits.

This will make it easier if/when we try to support handling of source types larger than 32-bits.
2021-11-08 15:28:31 +00:00
Simon Pilgrim
f059b04f7b [DAG] Add SelectionDAG::ComputeMinSignedBits helper
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.

I've included some usage, its not exhaustive, just the more obvious cases where the intention is obvious.

Differential Revision: https://reviews.llvm.org/D113396
2021-11-08 14:12:45 +00:00
Simon Pilgrim
f60d3ec0c7 [DAG] Add BuildVectorSDNode::getConstantRawBits helper
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.

This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.

Differential Revision: https://reviews.llvm.org/D113351
2021-11-08 12:07:38 +00:00
Simon Pilgrim
55e4cd8485 [X86][AVX2] Recognise 256-bit truncation shuffles and mask 256-bit source
For v8i16 shuffle patterns that are lowered with AND+PACKUS, check to see if the sources are from a 256-bit vector and perform the masking using BLENDW at the 256-bit level.

With the test changes we can see more examples of duplicate XMM/YMM zero vectors (PR26018) :(
2021-11-07 21:24:55 +00:00
Kazu Hirata
41ef3187e0 [ARM, X86] Use MachineBasicBlock::{predecessors,successors} (NFC) 2021-11-07 09:53:16 -08:00
Kazu Hirata
2c4ba3e9d3 [Target] Use make_early_inc_range (NFC) 2021-11-05 09:14:32 -07:00
Simon Pilgrim
5e9ac7c0a5 [X86] Enable v32i16 rotate lowering on non-BWI targets
Fixes one of the regressions in D113192
2021-11-05 11:00:31 +00:00
Simon Pilgrim
87d5bb66eb [X86][SSE] Improve PMADDWD SimplifyDemandedVectorElts handling
Check both operands for zero elements to remove unnecessary demanded elts.

Try to help reduce some minor regressions noticed in D110995
2021-11-04 12:56:31 +00:00
Harald van Dijk
889c2b97bd [X86] Fix X32 indirect call generation
The check for whether a zero extension was needed was subtly wrong and
saw a value that was already 64 bits, so did not extend.

Fixes PR52357.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112860
2021-11-03 16:43:44 +00:00
Phoebe Wang
8f101971b6 [X86][VARARG] Assign MMO earlier to avoid prolog insert point been sunk across VASTART_SAVE_XMM_REGS
The changes in D80163 defered the assignment of MachineMemOperand (MMO)
until the X86ExpandPseudo pass. This will result in crash due to prolog
insert point been sunk across the pseudo instruction VASTART_SAVE_XMM_REGS.

Moving the assignment to the creation of the node can avoid the problem.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D112859
2021-11-03 10:13:32 +08:00
Simon Pilgrim
53900a19fd [X86][AVX] combineConcatVectorOps - use getBROADCAST_LOAD helper for splat of normal vector loads. NFCI.
Reapplied from rG1cfecf4fc427 with fix for PR51226 - ensure the load is a normal (non-ext) load.
2021-11-02 20:03:25 +00:00
Simon Pilgrim
82e0eb22af [X86][AVX] combineConcatVectorOps - use getBROADCAST_LOAD helper. NFCI.
This is part of rG1cfecf4fc427 that was reverted to fix PR51226 - concating the broadcasts is OK, its the splatted loads that crash (we're not detecting extloads). I'm still creating a reduced test case so haven't added the load handling again yet.
2021-11-02 18:04:35 +00:00
Simon Pilgrim
e173631dd1 [X86][AVX] SimplifyDemandedVectorEltsForTargetNode - use getBROADCAST_LOAD helper. NFCI.
Reduce width of X86ISD::SUBV_BROADCAST_LOAD node.
2021-11-02 14:07:22 +00:00
Simon Pilgrim
8ca666a280 [X86][AVX] lowerV2X128Shuffle - use getBROADCAST_LOAD helper. NFCI. 2021-11-02 14:07:21 +00:00
Simon Pilgrim
fd485d8cda [X86][AVX] Prefer VINSERTF128 over VPERM2F128 for 128->256 subvector concatenations
The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.

This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.

There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?

Noticed while investigating the quality of interleaved load/store codegen.

Differential Revision: https://reviews.llvm.org/D111960
2021-11-01 10:45:50 +00:00
Sanjay Patel
837518d6a0 [x86] make mayFold* helpers visible to more files; NFC
The first function is needed for D112464, but we might
as well keep these together in case the others can be
used someday.
2021-10-29 15:48:35 -04:00
Simon Pilgrim
154c036ebb [X86] combineX86GatherScatter - only fold scale if the index isn't extended
As mentioned on D108539, when the gather indices are smaller than the pointer size, they are sign-extended BEFORE scale is applied, making the general fold unsafe.

If the index have sufficient sign-bits then folding the scale could be safe - I'll investigate this.
2021-10-29 11:48:05 +01:00
Simon Pilgrim
d29ccbecd0 [X86][AVX] Attempt to fold a scaled index into a gather/scatter scale immediate (PR13310)
If the index operand for a gather/scatter intrinsic is being scaled (self-addition or a shl-by-immediate) then we may be able to fold that scaling into the intrinsic scale immediate value instead.

Fixes PR13310.

Differential Revision: https://reviews.llvm.org/D108539
2021-10-28 14:07:17 +01:00
Phoebe Wang
2bc28c6f82 [X86] Add a dependency breaking xor before any gathers with an undef passthru value.
In the instruction encoding, the passthru register is always
tied to the destination register. The CPU scheduler has to wait
for the last writer of this register to finish executing before
the gather can start. This is true even if the initial mask is
all ones so that the passthru will never be used.

By explicitly zeroing the register we can break the false
dependency. The zero idiom is executed completing by the
register renamer and so is immedately considered ready.

Authored by Craig.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D112505
2021-10-28 11:44:52 +08:00
Sanjay Patel
6c0a2c2804 [x86] enhance mayFoldLoad to check alignment
As noted in D112464, a pre-AVX target may not be able to fold an
under-aligned vector load into another op, so we shouldn't report
that as a load folding candidate. I only found one caller where
this would make a difference -- combineCommutableSHUFP() -- so
that's where I added a test to show the (minor) regression.

Differential Revision: https://reviews.llvm.org/D112545
2021-10-27 07:54:25 -04:00
Sanjay Patel
2ab0148c14 [x86] use cast instead of dyn_cast for unchecked usage; NFC
This was noted as an independent clean-up in D112464.
2021-10-26 08:20:19 -04:00
Phoebe Wang
79f9dfef0d [X86] Move splat addends from the gather/scatter index operand to the base address
This can avoid a vector add and a constant pool load. Or an explicit broadcast in case of non-constant.

Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues.

Authored by Craig.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D111595
2021-10-26 12:35:57 +08:00
Simon Pilgrim
b09f2ee57c [X86] findEltLoadSrc - fix shift amount variable name. NFCI.
Fix the copy + paste, renaming shift amt from Idx to Amt
2021-10-23 21:24:37 +01:00
Craig Topper
cd824f9e39 [X86] Fix bad formatting. NFC 2021-10-22 13:16:35 -07:00
Sanjay Patel
40163f1df8 [x86] add special-case lowering for usubsat for AVX512
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.

That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.

Differential Revision: https://reviews.llvm.org/D112138
2021-10-20 16:41:13 -04:00
Sanjay Patel
3efd2a0bec [x86] make helper for useVPTERNLOG; NFC
See D112085 for another use case.
2021-10-20 10:26:53 -04:00
Sanjay Patel
92a0389b04 [x86] add special-case lowering for usubsat for pre-SSE4
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)

This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD

Differential Revision: https://reviews.llvm.org/D112095
2021-10-19 17:13:16 -04:00
Simon Pilgrim
71e39e3f18 [ADT] Add APInt::isNegatedPowerOf2() helper
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....

Differential Revision: https://reviews.llvm.org/D111998
2021-10-19 14:38:21 +01:00
Simon Pilgrim
a83384498b [X86] combineMulToPMADDWD - replace ASHR(X,16) -> LSHR(X,16)
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.

The lshr also has a better chance of folding with shuffles etc.
2021-10-18 22:12:56 +01:00