Commit Graph

1444 Commits

Author SHA1 Message Date
David Sherwood
9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
Arthur Eubanks
15fefcb9eb [opt] Directly translate -O# to -passes='default<O#>'
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.

Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>` but disallow anything more than just -O#.

Tests need to be updated to not use `opt -O# -pass-name`.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D112036
2021-10-18 16:48:10 -07:00
Florian Hahn
74c4d44d47 [LV] Update test that was missed in e844f05397. 2021-10-18 18:23:00 +01:00
Florian Hahn
e844f05397 [LoopUtils] Simplify addRuntimeCheck to return a single value.
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.

The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is not reason to track FirstInst and return
it.

Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.

Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
2021-10-18 18:03:09 +01:00
Simon Pilgrim
85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim
6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Simon Pilgrim
d5f5121ea6 [LV][X86] Add PR47437 test case 2021-10-16 13:40:54 +01:00
Roman Lebedev
d137f1288e [X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's `TTI::prefersVectorizedAddressing()`hoop, which confusingly defaults to `true`,
which tells LV to try to vectorize the addresses that lead to loads,
but X86 generally can not deal with vectors of addresses,
the only instructions that support that are GATHER/SCATTER,
but even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Roman Lebedev
3d7bf6625a [X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of last ~70 patches, (wow)
`BaseT::getInterleavedMemoryOpCos()` is absolutely bogus,
it is usually almost an order of magnitude overestimation,
so i would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then i'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Roman Lebedev
a8a64eaafc [NFC][X86][LV] Autogenerate checklines in cost-model.ll to simplify further updates 2021-10-13 22:47:43 +03:00
Roman Lebedev
18eef13dad [X86][Costmodel] Fix X86TTIImpl::getGSScalarCost()
`X86TTIImpl::getGSScalarCost()` has (at least) two issues:
* it naively computes the cost of sequence of `insertelement`/`extractelement`.
  If we are operating not on the XMM (but YMM/ZMM),
  this widely overestimates the cost of subvector insertions/extractions.
* Gather/scatter takes a vector of pointers, and scalarization results in us performing
  scalar memory operation for each of these pointers, but we never account for the cost
  of extracting these pointers out of the vector of pointers.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111222
2021-10-13 22:35:39 +03:00
Ayal Zaks
15692fd6b5 [LV] Fix 2nd crash for reverse interleaved groups under mask/fold-tail.
This patch fixes another crash revealed by PR51614:
when *deciding* to vectorize with masked interleave groups, check if the access
is reverse (which is currently not supported).

Differential Revision: https://reviews.llvm.org/D108900
2021-10-12 21:44:42 +03:00
Kerry McLaughlin
1439ef1a3f [LoopVectorize] Classify pointer induction updates as scalar only if they have one use
collectLoopScalars collects pointer induction updates in ScalarPtrs, assuming
that the instruction will be scalar after vectorization. This may crash later
in VPReplicateRecipe::execute() if there there is another user of the instruction
other than the Phi node which needs to be widened.

This changes collectLoopScalars so that if there are any other users of
Update other than a Phi node, it is not added to ScalarPtrs.

Reviewed By: david-arm, fhahn

Differential Revision: https://reviews.llvm.org/D111294
2021-10-12 13:24:49 +01:00
Florian Hahn
ab33427c86 [VPlan] Print live-in backedge taken count as part of plan.
At the moment, a VPValue is created for the backedge-taken count, which
is used by some recipes. To make it easier to identify the operands of
recipes using the backedge-taken count, print it at the beginning of the
VPlan if it is used.

Reviewed By: a.elovikov

Differential Revision: https://reviews.llvm.org/D111298
2021-10-11 20:13:01 +01:00
David Sherwood
26b7d9d622 [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-11 09:41:38 +01:00
Florian Hahn
09fdfd03ea [VPlan] Replace hard-coded VPValue ids with patterns in tests.
This makes the tests a bit more robust with respect to small changes in
the value numbering.
2021-10-07 09:52:01 +01:00
Roman Lebedev
62d67d9e7c [NFC][X86][LoopVectorize] Autogenerate check lines in a few tests for ease of updating
For D111220
2021-10-06 22:54:15 +03:00
Philip Reames
d652724c0b [test] refresh a couple of autogen tests 2021-10-05 18:41:24 -07:00
Roman Lebedev
3a0643e9c2 [X86][Costmodel] Load/store i32/f32 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110755
2021-10-01 17:48:13 +03:00
Roman Lebedev
b12aeaec9a [X86][Costmodel] Load/store i32/f32 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `2`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110754
2021-10-01 17:48:13 +03:00
Roman Lebedev
f44d9009c2 [X86][Costmodel] Load/store i32/f32 Stride=2 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/4rY96hnGT - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/vbo37Y3r9 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: =0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110753
2021-10-01 17:48:13 +03:00
Krasimir Georgiev
685f1bfd0a Revert "[LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns"
It appears to cause stage2 clang build failures, e.g.,
https://lab.llvm.org/buildbot/#/builders/74/builds/7145.

This reverts commit 1fb37334bd.
2021-10-01 11:39:43 +02:00
David Sherwood
1fb37334bd [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-01 08:41:03 +01:00
Florian Hahn
1fbdbb5595 Revert "Recommit "[SCEV] Look through single value PHIs." (take 2)"
This reverts commit 764d9aa979.

This patch exposed a few additional cases where SCEV expressions are not
properly invalidated.

See PR52024, PR52023.
2021-09-30 20:53:51 +01:00
Craig Topper
765348298c [CostModel] Update default cost model for sadd/ssub overflow to match TargetLowering
The expansion for these was updated in https://reviews.llvm.org/D47927 but the cost model was not adjusted.

I believe the cost model was also incorrect for the old expansion.
The expansion prior to D47927 used 3 icmps using LHS, RHS, and Result
to calculate theirs signs. Then 2 icmps to compare the signs. Followed
by an And. The previous cost model was using 3 icmps and 2 selects.
Digging back through git blame, those 2 selects in the cost model used to
be 2 icmps, but were changed in https://reviews.llvm.org/D90681

Differential Revision: https://reviews.llvm.org/D110739
2021-09-30 09:41:14 -07:00
Simon Pilgrim
17f1fc1e54 [TTI] BasicTTI::getInterleavedMemoryOpCost(): use getScalarizationOverhead()
getScalarizationOverhead() results in a somewhat better cost estimation than counting the insertion/extraction costs directly. Notably, this is still overestimating the costs.

Original Patch by: @lebedev.ri (Roman Lebedev)

Differential Revision: https://reviews.llvm.org/D110713
2021-09-29 16:41:53 +01:00
Florian Hahn
764d9aa979 Recommit "[SCEV] Look through single value PHIs." (take 2)
This reverts commit 8fdac7cb7a.

The issue causing the revert has been fixed a while ago in 60b852092c.

Original message:

    Now that SCEVExpander can preserve LCSSA form,
    we do not have to worry about LCSSA form when
    trying to look through PHIs. SCEVExpander will take
    care of inserting LCSSA PHI nodes as required.

    This increases precision of the analysis in some cases.

    Reviewed By: mkazantsev, bmahjour

    Differential Revision: https://reviews.llvm.org/D71539
2021-09-28 10:32:17 +01:00
Florian Hahn
4b581e87df [LV] Add tests where rt checks may make vectorization unprofitable.
Add a few additional tests which require a large number of runtime
checks for D109368.
2021-09-27 10:32:28 +01:00
Simon Pilgrim
8c83bd3bd4 [CostModel][X86] Adjust vXi32 multiply costs if it can be performed using PMADDWD
Update the costs to match the codegen from combineMulToPMADDWD - not only can we use PMADDWD is its zero-extended, but also if its a constant or sign-extended from a vXi16 (which can be replaced with a zero-extension).
2021-09-25 16:28:48 +01:00
Simon Pilgrim
41492d77ba [LoopVectorize][X86] Add operands to make it more obvious what line the CHECK concerns
As we're checking the cost debug analysis these should match the original IR line - so we shouldn't have any variable naming issues.

I'm investigating v4i32 mul -> PMADDDW costs handling (for PR47437) and these CHECK lines were proving tricky to keep track of
2021-09-22 10:08:32 +01:00
Ayal Zaks
ab6a69dfea [LV] Fix crash for reverse interleaved loads with gap under fold-tail.
This patch fixes the crash found by PR51614:
whenever doing tail folding, interleave groups must be considered under mask.

Another fix D108900 follows for targets that support masked loads and stores:
when *deciding* to vectorize with masked interleave groups, check if the access
is reverse - which is currently not supported; rather than (only) asserting when
computing cost and generating code.

Differential Revision: https://reviews.llvm.org/D108891
2021-09-21 20:13:32 +03:00
Usman Nadeem
f417d9d821 [InstCombine] Eliminate vector reverse if all inputs/outputs to an instruction are reverses
Differential Revision: https://reviews.llvm.org/D109808

Change-Id: I1a10d2bc33acbe0ea353c6cb3d077851391fe73e
2021-09-20 18:32:24 -07:00
David Sherwood
f988f68064 [Analysis] Add support for vscale in computeKnownBitsFromOperator
In ValueTracking.cpp we use a function called
computeKnownBitsFromOperator to determine the known bits of a value.
For the vscale intrinsic if the function contains the vscale_range
attribute we can use the maximum and minimum values of vscale to
determine some known zero and one bits. This should help to improve
code quality by allowing certain optimisations to take place.

Tests added here:

  Transforms/InstCombine/icmp-vscale.ll

Differential Revision: https://reviews.llvm.org/D109883
2021-09-20 15:01:59 +01:00
Nikita Popov
80110aafa0 [Tests] Fix incorrect noalias metadata
Mostly this fixes cases where !noalias or !alias.scope were passed
a scope rather than a scope list. In some cases I opted to drop
the metadata entirely instead, because it is not really relevant
to the test.
2021-09-18 20:51:00 +02:00
David Green
61cc873a8e [LV] Recognize intrinsic min/max reductions
This extends the reduction logic in the vectorizer to handle intrinsic
versions of min and max, both the floating point variants already
created by instcombine under fastmath and the integer variants from
D98152.

As a bonus this allows us to match a chain of min or max operations into
a single reduction, similar to how add/mul/etc work.

Differential Revision: https://reviews.llvm.org/D109645
2021-09-15 10:45:50 +01:00
David Green
bddfbf91ed [LV] Min/max intrinsic reduction test cases. 2021-09-15 09:56:19 +01:00
Florian Hahn
e90d55e1c9 [VPlan] Support sinking recipes with uniform users outside sink target.
This is a first step towards addressing the last remaining limitation of
the VPlan version of sinkScalarOperands: the legacy version can
partially sink operands. For example, if a GEP has uniform users outside
the sink target block, then the legacy version will sink all scalar
GEPs, other than the one for lane 0.

This patch works towards addressing this case in the VPlan version by
detecting such cases and duplicating the sink candidate. All users
outside of the sink target will be updated to use the uniform clone.

Note that this highlights an issue with VPValue naming. If we duplicate
a replicate recipe, they will share the same underlying IR value and
both VPValues will have the same name ir<%gep>.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D104254
2021-09-15 09:21:39 +01:00
Florian Hahn
e248d69036 Recommit "[LAA] Support pointer phis in loop by analyzing each incoming pointer."
SCEV does not look through non-header PHIs inside the loop. Such phis
can be analyzed by adding separate accesses for each incoming pointer
value.

This results in 2 more loops vectorized in SPEC2000/186.crafty and
avoids regressions when sinking instructions before vectorizing.

Fixes PR50296, PR50288.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D102266
2021-09-14 11:19:12 +01:00
Florian Hahn
4b342268c0 [VPlan] Add test that requires duplicating recipe for sinking. 2021-09-13 14:21:20 +01:00
Florian Hahn
368af7558e [VPlan] Fix crash caused by not updating all users properly.
Users of VPValues are managed in a vector, so we need to be more
careful when iterating over users while updating them. For now, just
copy them.

Fixes 51798.
2021-09-12 18:10:53 +01:00
Nikita Popov
90ec6dff86 [OpaquePtr] Forbid mixing typed and opaque pointers
Currently, opaque pointers are supported in two forms: The
-force-opaque-pointers mode, where all pointers are opaque and
typed pointers do not exist. And as a simple ptr type that can
coexist with typed pointers.

This patch removes support for the mixed mode. You either get
typed pointers, or you get opaque pointers, but not both. In the
(current) default mode, using ptr is forbidden. In -opaque-pointers
mode, all pointers are opaque.

The motivation here is that the mixed mode introduces additional
issues that don't exist in fully opaque mode. D105155 is an example
of a design problem. Looking at D109259, it would probably need
additional work to support mixed mode (e.g. to generate GEPs for
typed base but opaque result). Mixed mode will also end up
inserting many casts between i8* and ptr, which would require
significant additional work to consistently avoid.

I don't think the mixed mode is particularly valuable, as it
doesn't align with our end goal. The only thing I've found it to
be moderately useful for is adding some opaque pointer tests in
between typed pointer tests, but I think we can live without that.

Differential Revision: https://reviews.llvm.org/D109290
2021-09-10 15:18:23 +02:00
Rosie Sumpter
9d1bea9c88 [SVE][LoopVectorize] Optimise code generated by widenPHIInstruction
For SVE, when scalarising the PHI instruction the whole vector part is
generated as opposed to creating instructions for each lane for fixed-
width vectors. However, in some cases the lane values may be needed
later (e.g for a load instruction) so we still need to calculate
these values to avoid extractelement being called on the vector part.

Differential Revision: https://reviews.llvm.org/D109445
2021-09-10 11:58:04 +01:00
Simon Pilgrim
f114ef3731 [CostModel][X86] Add generic costs for vXi32 MUL -> v2Xi16 PMADDDW folds
Based off the improved fold in D108522

This should eventually allow us to replace the SLM only cost patterns with generic versions.
2021-09-05 16:08:11 +01:00
Nikita Popov
02f74eadbe [IVDescriptors] Make pointer inductions compatible with opaque pointers
Store the used element type in the InductionDescriptor. For typed
pointers, it remains the pointer element type. For opaque pointers,
we always use an i8 element type, such that the step is a simple
offset.

A previous version of this patch instead tried to guess the element
type from an induction GEP, but this is not reliable, as the GEP
may be hidden (see @both in iv_outside_user.ll).

Differential Revision: https://reviews.llvm.org/D104795
2021-09-01 21:02:05 +02:00
Simon Pilgrim
10c982e0b3 Revert rG1c9bec727ab5c53fa060560dc8d346a911142170 : [InstCombine] Fold (gep (oneuse(gep Ptr, Idx0)), Idx1) -> (gep Ptr, (add Idx0, Idx1)) (PR51069)
Reverted (manually due to merge conflicts) while regressions reported on PR51540 are investigated

As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.

I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.

https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)

Differential Revision: https://reviews.llvm.org/D106450
2021-08-23 21:09:26 +01:00
Florian Hahn
d024a01511 Recommit "[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64"
This reverts the revert ab9296f13b.

The issue causing the revert should be fixed in 9baed023b4.
2021-08-23 11:25:27 +01:00
Florian Hahn
9baed023b4 [LV] Adjust reduction recipes before recurrence handling.
Adjusting the reduction recipes still relies on references to the
original IR, which can become outdated by the first-order recurrence
handling. Until reduction recipe construction does not require IR
references, move it before first-order recurrence handling, to prevent a
crash as exposed by D106653.
2021-08-22 11:02:33 +01:00
Florian Hahn
ab9296f13b Revert "[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64"
This reverts commit f4122398e7 to
investigate a crash exposed by it.

The patch breaks building the code below with `clang -O2 --target=aarch64-linux`

     int a;
     double b, c;
     void d() {
       for (; a; a++) {
         b += c;
         c = a;
       }
     }
2021-08-20 21:24:28 +01:00
David Sherwood
f4122398e7 [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64
I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:

  -force-ordered-reductions=true|false

I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:

  Transforms/LoopVectorize/AArch64/strict-fadd.ll
  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D106653
2021-08-19 09:29:40 +01:00
David Sherwood
219d4518fc [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive
For tight loops like this:

  float r = 0;
  for (int i = 0; i < n; i++) {
    r += a[i];
  }

it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the resulting number of instructions in the
generated code ends up being comparable to not vectorising at all, there
may be additional costs on some CPUs, for example perhaps the scheduling
is worse. It makes sense to deter vectorisation in tight loops.

Differential Revision: https://reviews.llvm.org/D108292
2021-08-18 17:01:56 +01:00