Commit Graph

881 Commits

Author SHA1 Message Date
Roman Lebedev
3f1f08f0ed Revert @llvm.isnan intrinsic patchset.
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)

TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.

While the end result of discussion may lead back to the current design,
it may also not lead to the current design.

Therefore i take it upon myself
to revert the tree back to last known good state.

This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
2021-09-02 13:53:56 +03:00
David Sherwood
d581d94385 [SVE] Fix the FP arithmetic instruction costs for SVE
Several FP instructions (fadd, fsub, etc.) were incorrectly assigned
a higher cost for SVE because they have custom lowering, however we
know they are legal. This patch explicitly assigns a cost of 2 to
these opcodes.

Tests added here:

  Analysis/CostModel/AArch64/arith-fp-sve.ll

Differential Revision: https://reviews.llvm.org/D108993
2021-09-02 09:55:13 +01:00
David Sherwood
f024a4818d [NFC] Re-run update_analyze_test_checks on Analysis/CostModel/AArch64/sve-intrinsics.ll 2021-09-01 12:09:58 +01:00
David Sherwood
930d5077f4 Revert "[NFC] Re-run update_analyze_test_checks on Analysis/CostModel/AArch64/sve-intrinsics.ll"
This reverts commit aeb2bd68dc.
2021-09-01 11:52:29 +01:00
David Sherwood
aeb2bd68dc [NFC] Re-run update_analyze_test_checks on Analysis/CostModel/AArch64/sve-intrinsics.ll 2021-09-01 11:44:02 +01:00
Daniil Fukalov
5b3fad4966 [AMDGPU][CostModel] Update shuffle instruction tests. NFC.
New tests ported over from test/Analysis/CostModel/AArch64/shuffle-other.ll.
2021-08-30 19:17:27 +03:00
Matthew Devereau
9b830c798e [AArch64][SVE] Teach cost model masked gathers/scatters are cheap
Tell the cost model to use the scalable calculation for non-neon fixed vector.
This results in a cheaper cost for fixed-length SVE masked gathers/scatters
allowing the vectorizor to emit them more frequently.
2021-08-26 11:17:47 +01:00
Simon Pilgrim
9efda541bf [CostModel][X86] Add costs for f32/f64 scalar and vector types.
The f16 half types are still pretty useless as we don't have it as a legal type (we treat them as i16 most of the time)
2021-08-20 14:31:12 +01:00
Simon Pilgrim
72ebcd3198 [CostModel][X86] Add isnan half/float/double costs tests 2021-08-19 18:07:06 +01:00
Simon Pilgrim
9419729b6a [CostModel][X86] Add VPOPCNTDQ/BITALG ctpop costs
VPOPCNTDQ + BITALG add ctpop instructions for vXi64/vXi32 + vXi16/vXi8 vector types respectively
2021-08-19 15:40:09 +01:00
Simon Pilgrim
2d60fdd7aa [CostModel][X86] Add VPOPCNT/BITALG test coverage for ctpop/cttz costs 2021-08-19 14:05:58 +01:00
Matthew Devereau
734708e04f [AArch64][SVE] Teach cost model that masked loads/stores are cheap
Reduce the cost of VLS masked loads/stores to make the vectorizor emit them more frequently.
2021-08-19 13:01:33 +01:00
David Sherwood
219d4518fc [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive
For tight loops like this:

  float r = 0;
  for (int i = 0; i < n; i++) {
    r += a[i];
  }

it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the resulting number of instructions in the
generated code ends up being comparable to not vectorising at all, there
may be additional costs on some CPUs, for example perhaps the scheduling
is worse. It makes sense to deter vectorisation in tight loops.

Differential Revision: https://reviews.llvm.org/D108292
2021-08-18 17:01:56 +01:00
Dylan Fleming
ef198cd99e [SVE] Remove usage of getMaxVScale for AArch64, in favour of IR Attribute
Removed AArch64 usage of the getMaxVScale interface, replacing it with
the vscale_range(min, max) IR Attribute.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D106277
2021-08-17 14:42:47 +01:00
Florian Hahn
f999312872 Recommit "[Matrix] Overload stride arg in matrix.columnwise.load/store."
This reverts the revert 28c04794df.

The failing MLIR test that caused the revert should be fixed  in this
version.

Also includes a PPC test fix previously in 1f87c7c478.
2021-08-12 18:31:57 +01:00
Florian Hahn
a72cd6353c Revert "[Matrix] Update column.major.load call in PPC test."
Dependent commit a1ef81de35 has been reverted in a1ef81de35.
2021-08-12 13:13:52 +01:00
Florian Hahn
1f87c7c478 [Matrix] Update column.major.load call in PPC test.
a1ef81de35 adjusted the definition of the intrinsic, but did not
update a PowerPC test. Fix the test by updating the call & declaration
of @llvm.matrix.column.major.load.
2021-08-12 11:26:33 +01:00
Archibald Elliott
b764b1ef2f [NFC][X86] New Test Requires Asserts
D105263 introduced this new test. It fails when asserts are disabled,
due to using a debug option on opt.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D107805
2021-08-10 10:22:04 +01:00
Wang, Pengfei
6f7f5b54c8 [X86] AVX512FP16 instructions enabling 1/6
1. Enable FP16 type support and basic declarations used by following patches.
2. Enable new instructions VMOVW and VMOVSH.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105263
2021-08-10 12:46:01 +08:00
David Green
649cf4514d [AArch64] Expand the SVE min/max reduction costs to NEON
This takes the existing SVE costing for the various min/max reduction
intrinsics and expands it to NEON, where I believe it applies equally
well.

In the process it changes the lowering to use min/max cost, as opposed
to summing up the cost of ICmp+Select.

Differential Revision: https://reviews.llvm.org/D106239
2021-08-05 23:23:24 +01:00
Irina Dobrescu
b01417d3c5 [AArch64] Optimise min/max lowering in ISel
Differential Revision: https://reviews.llvm.org/D106561
2021-08-02 13:40:21 +01:00
Sjoerd Meijer
46a861af3d [CostModel][AArch64] Add some shuffle concat tests. NFC.
Test ported over from test/Analysis/CostModel/ARM/shuffle.ll.
2021-08-02 12:11:00 +01:00
Simon Pilgrim
872a950033 [CostModel] Treat 'widen subvector' patterns as zero cost
As discussed on D107228, widening a subvector by inserting the whole subvector into the bottom a larger undef vector should always be cheap enough that we can treat it as zero cost.

NOTE: If this proves to cause issues we have the option of introducing a "SK_WidenSubvector" shuffle kind enum that targets could override the zero cost, but that doesn't seem necessary atm.

Differential Revision: https://reviews.llvm.org/D107228
2021-08-02 11:43:10 +01:00
Simon Pilgrim
7397dcb403 [TTI] Add basic SK_InsertSubvector shuffle mask recognition
This patch adds an initial ShuffleVectorInst::isInsertSubvectorMask helper to recognize 2-op shuffles where the lowest elements of one of the sources are being inserted into the "in-place" other operand, this includes "concat_vectors" patterns as can be seen in the Arm shuffle cost changes. This also helped fix a x86 issue with irregular/length-changing SK_InsertSubvector costs - I'm hoping this will help with D107188

This doesn't currently attempt to work with 1-op shuffles that could either be a "widening" shuffle or a self-insertion.

The self-insertion case is tricky, but we currently always match this with the existing SK_PermuteSingleSrc logic.

The widening case will be addressed in a follow up patch that treats the cost as 0.

Masks with a high number of undef elts will still struggle to match optimal subvector widths - its currently bounded by minimum-width possible insertion, whilst some cases would benefit from wider (pow2?) subvectors.

Differential Revision: https://reviews.llvm.org/D107228
2021-08-02 11:23:44 +01:00
David Green
098984a80c [AArch64] Update and expand min-max cost model test. NFC
This expands the cost model test for min/max to many more types,
including floating point minnum/maxnum and minimum/maximum, and FP16
with and without fullfp16.  The old llc run lines are removed, as those
are better tested by CodeGen tests.
2021-07-27 18:48:58 +01:00
Simon Pilgrim
77c5e6ba90 [Analysis] Fix getOrderedReductionCost to call target's getArithmeticInstrCost implementation
The getOrderedReductionCost implementation introduced in D105432 calls the CRTP base version getArithmeticInstrCost instead of the redirecting to the target version.

Differential Revision: https://reviews.llvm.org/D106795
2021-07-26 17:15:43 +01:00
David Sherwood
0aff1798b5 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
Sander de Smalen
c3277a8828 [BasicTTI] Set scalarization cost of scalable vector casts to Invalid.
When BasicTTIImpl::getCastInstrCost can't determine the cost of a
vector cast operation when the types need legalization, it falls
back to calculating scalarization costs. Instead of crashing on
`cast<FixedVectorType>(DstVTy)` when the type is a scalable vector,
return an Invalid cost.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106655
2021-07-24 14:13:21 +01:00
David Green
38986c6782 [AArch64] Add worst case shuffle costs
This adds some missing single source shuffle costs for AArch64, of i16
and i8 vectors. v4i16 are the same as v4i32 with a worse case cost of 3
coming from the perfect shuffle tables. The larger vector sizes expand
into a constant pool, plus a load (and adrp) and a tbl. I arbitrarily
chose 8 for the cost to be expensive but not too expensive.

Differential Revision: https://reviews.llvm.org/D106241
2021-07-23 09:01:58 +01:00
Simon Pilgrim
4185c5502c [CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695 - many of the 128-bit shifts (usually where integer multiplies aren't used) have similar behaviour to AVX1 so we can merge them.
2021-07-22 20:07:32 +01:00
Simon Pilgrim
2657fe1721 [CostModel][X86] Fix funnel shift check prefixes
We'd lost AVX1 test coverage due to bulldozer (XOP) trying to use the same check prefixes - we really need to fix the update script to avoid this!
2021-07-22 20:07:31 +01:00
David Green
c9cebda772 [AArch64] Adjust the cost of integer sum reductions
This changes the cost to (LT.first-1) * cost(add) + 2, where the cost of
an add is assumed to be 1. This brings it inline with the other
reductions.

Differential Revision: https://reviews.llvm.org/D106240
2021-07-22 18:19:54 +01:00
Simon Pilgrim
e1bdb57958 [CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports.
Update shl/lshr/ashr costs based on the worst case costs from the script in D103695.
2021-07-22 18:12:49 +01:00
David Green
a92974bfdf [AArch64] Add and update reduction and shuffle costs. NFC 2021-07-22 10:22:42 +01:00
Simon Pilgrim
5939c642ae [CostModel][X86] Add fast math tests for float reductions
As noticed on D105432 we didn't have any coverage to distinguish between fast/exact float reductions
2021-07-19 13:01:28 +01:00
Sander de Smalen
eac1670739 [CostModel][AArch64] Make loads/stores of <vscale x 1 x eltty> invalid.
At the moment, <vscale x 1 x eltty> are not yet fully handled by the
code-generator, so to avoid vectorizing loops with that VF, we mark the
cost for these types as invalid.
The reason for not adding a new "TTI::getMinimumScalableVF" is because
the type is supposed to be a type that can be legalized. It partially is,
although the support for these types need some more work.

Reviewed By: paulwalker-arm, dmgreen

Differential Revision: https://reviews.llvm.org/D103882
2021-07-14 16:44:22 +01:00
Simon Pilgrim
ee71c1bbcc [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
Simon Pilgrim
ae0d73ac3b [CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXf32/vXf64 -> vXi8/vXi16 fptosi/fptoui costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 20:38:25 +01:00
Simon Pilgrim
96b4117d51 [CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports.
Update truncation costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 13:50:43 +01:00
David Green
38c9a4068d [TTI] Remove IsPairwiseForm from getArithmeticReductionCost
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.

This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.

Differential Revision: https://reviews.llvm.org/D105484
2021-07-09 11:51:16 +01:00
Simon Pilgrim
8ef67fa9d2 [CostModel][X86] Account for older SSE targets with slow fp->int conversions
Both the conversion cost and the xmm->gpr transfer cost tend to be a lot higher on early SSE targets
2021-07-08 18:08:24 +01:00
Sander de Smalen
97215fe3f4 [CostModel] Express cost(urem) as cost(div+mul+sub) when set to Expand.
The Legalizer expands the operations of urem/srem into a div+mul+sub or divrem
when those are legal/custom. This patch changes the cost-model to reflect that
cost.

Since there is no 'divrem' Instruction in LLVM IR, the cost of divrem
is assumed to be the same as div+mul+sub since the three operations will
need to be executed at runtime regardless.

Patch co-authored by David Sherwood (@david-arm)

Reviewed By: RKSimon, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D103799
2021-07-07 14:40:28 +01:00
Simon Pilgrim
4c7e9a3852 [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports.
Update costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 13:58:27 +01:00
Simon Pilgrim
a7da0296a6 [CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 12:03:45 +01:00
Simon Pilgrim
b298308ba2 [CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32
Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.
2021-07-06 17:28:03 +01:00
Simon Pilgrim
6f3f9535fc [CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp
Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.

We get the extension for free for non-vector loads.
2021-07-06 13:58:52 +01:00
Caroline Concatto
a2c5c56055 [AArch64][CostModel] Add cost model for experimental.vector.splice
This patch adds a new  ShuffleKind SK_Splice and then handle the cost in
getShuffleCost, as in experimental.vector.reverse.

Differential Revision: https://reviews.llvm.org/D104630
2021-07-05 14:30:24 +01:00
Simon Pilgrim
5db826e4ce [CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack
Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.
2021-07-05 13:26:53 +01:00
Simon Pilgrim
65e4240fa1 [CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner).
Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 > f32/f64 being lowered as sitofp i64 -> f32/f64
2021-07-05 13:26:53 +01:00
Sjoerd Meijer
ee752134ac [AArch64] Cost-model i8 vector loads/stores
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.

Differential Revision: https://reviews.llvm.org/D103629
2021-07-05 11:25:10 +01:00