Commit Graph

195 Commits

Author SHA1 Message Date
Florian Hahn
68469a80cb [LV] Disable runtime unrolling for vectorized loops.
This patch adds metadata to disable runtime unrolling to the vectorized
loop. If runtime unrolling/interleaving is considered profitable, LV
will interleave the loop directly. There should be no need to perform
runtime unrolling at a later stage.

Note that we already add metadata to disable runtime unrolling to the
scalar loop after vectorization.

The additional unrolling unnecessarily increases code size and compile
time. In addition to that we have several bug reports of unncessary
runtime unrolling for vectorized loops, e.g. PR40961

Compile-time improvements:

  NewPM-O3: -1.04%
  NewPM-ReleaseThinLTO: -0.59%
  NewPM-ReleaseLTO-g: -0.97%

https://llvm-compile-time-tracker.com/compare.php?from=ce1be13a868d0f8afa367975558c1a6175cce33a&to=78bc2e67f22e9e10e61cdb6cdac4bb857d95eb1b&stat=instructions:u

Fixes #40306.

Reviewed By: lebedev.ri, nikic

Differential Revision: https://reviews.llvm.org/D115261
2023-01-06 10:56:17 +00:00
Sanjay Patel
d5f8878a6e [InstCombine] canonicalize insertelement order based on index
This puts lower insert indexes before higher. This is independent
of endian, so it requires an adjustment to a fold added with
4446f71ce3, but it makes that fold more robust.
That's also where this patch was suggested - D139668.

This matches what we already do in DAGCombiner, but there is one
more constraint because there's an existing canonicalization for
insert-of-scalar-constant. I'm not sure if that is still needed,
so it may be adjusted/removed as a follow-up.
2022-12-18 07:08:48 -05:00
Sanjay Patel
4446f71ce3 [InstCombine] try to fold a pair of insertelements into one insertelement
This replaces patches that tried to convert related patterns to shuffles
(D138872, D138873, D138874 - reverted/abandoned) but caused codegen
problems and were questionable as a canonicalization because an
insertelement is a simpler op than a shuffle.

This detects a larger pattern -- insert-of-insert -- and replaces with
another insert, so this hopefully does not cause any problems.

As noted by TODO items in the code and tests, this could go a lot further.
But this is enough to reduce the motivating test from issue #17113.

Example proofs:
https://alive2.llvm.org/ce/z/NnUv3a

I drafted a version of this for AggressiveInstCombine, but it seems that
would uncover yet another phase ordering gap. If we do generalize this to
handle the full range of potential patterns, that may be worth looking at
again.

Differential Revision: https://reviews.llvm.org/D139668
2022-12-12 10:39:58 -05:00
Sanjay Patel
05dbdb0088 Revert "[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1 (2nd try)"
This reverts commit e71b81cab0.

As discussed in the planned follow-on to this patch (D138874),
this and the subsequent patches in this set can cause trouble for
the backend, and there's probably no quick fix. We may even
want to canonicalize in the opposite direction (towards insertelt).
2022-12-08 14:16:46 -05:00
Roman Lebedev
75d1a815c3 [NFC] Port all PhaseOrdering tests to -passes= syntax 2022-12-08 02:38:50 +03:00
Sanjay Patel
e71b81cab0 [InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1 (2nd try)
The first attempt was reverted because a clang test changed
unexpectedly - the file is already marked with a FIXME, so
I just updated it this time to pass.

Original commit message:
This is the main patch for converting a truncated scalar that is
inserted into a vector to bitcast+shuffle. We could go either way
on patterns like this, but this direction will allow collapsing a
pair of these sequences on the motivating example from issue

The patch is split into 3 parts to make it easier to see the
progression of tests diffs. We allow inserting/shuffling into a
different size vector for flexibility, so there are several test
variations. The length-changing is handled by shortening/padding
the shuffle mask with undef elements.

In part 1, handle the basic pattern:
inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask

Proof for the endian-dependency behaving as expected:
https://alive2.llvm.org/ce/z/BsA7yC

The TODO items for handling shifts and insert into an arbitrary base
vector value are implemented as follow-ups.

Differential Revision: https://reviews.llvm.org/D138872
2022-11-30 14:52:20 -05:00
Sanjay Patel
5eacdcff06 Revert "[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1"
This reverts commit a4c466766d.
This broke clang tests that are wrongly dependent on the optimizer.
2022-11-30 14:10:50 -05:00
Sanjay Patel
a4c466766d [InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1
This is the main patch for converting a truncated scalar that is
inserted into a vector to bitcast+shuffle. We could go either way
on patterns like this, but this direction will allow collapsing a
pair of these sequences on the motivating example from issue

The patch is split into 3 parts to make it easier to see the
progression of tests diffs. We allow inserting/shuffling into a
different size vector for flexibility, so there are several test
variations. The length-changing is handled by shortening/padding
the shuffle mask with undef elements.

In part 1, handle the basic pattern:
inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask

Proof for the endian-dependency behaving as expected:
https://alive2.llvm.org/ce/z/BsA7yC

The TODO items for handling shifts and insert into an arbitrary base
vector value are implemented as follow-ups.

Differential Revision: https://reviews.llvm.org/D138872
2022-11-30 13:22:04 -05:00
Sanjay Patel
c7bd82dfd8 [PhaseOrdering] add test for vector load combining; NFC
This is another example from issue #17113
2022-11-28 16:00:06 -05:00
Matt Arsenault
1c55cc600e PhaseOrdering: Convert tests to opaque pointers
Required manually running update_test_checks:
  AArch64/hoisting-sinking-required-for-vectorization.ll
  AArch64/peel-multiple-unreachable-exits-for-vectorization.ll
  ARM/arm_mult_q15.ll
  X86/hoist-load-of-baseptr.ll
  X86/spurious-peeling.ll
2022-11-27 21:26:41 -05:00
Sanjay Patel
163bb6d64e [Passes][VectorCombine] enable early run generally and try load folds
An early run of VectorCombine was added with D102496 specifically to
deal with unnecessary vector ops produced with the C matrix extension.
This patch is proposing to try those folds in general and add a pair
of load folds to the menu.

The load transform will partly solve (see PhaseOrdering diffs) a
longstanding vectorization perf bug by removing redundant loads via GVN:
issue #17113

The main reason for not enabling the extra pass generally in the initial
patch was compile-time cost. The cost of VectorCombine was significantly
(surprisingly) improved with:
87debdadaf
https://llvm-compile-time-tracker.com/compare.php?from=ffe05b8f57d97bc4340f791cb386c8d00e0739f2&to=87debdadaf18f8a5c7e5d563889e10731dc3554d&stat=instructions:u

...so the extra run is going to cost very little now - the total cost of
the 2 runs should be less than the 1 run before that micro-optimization:
https://llvm-compile-time-tracker.com/compare.php?from=5e8c2026d10e8e2c93c038c776853bed0e7c8fc1&to=2c4b68eab5ae969811f422714e0eba44c5f7eefb&stat=instructions:u

It may be possible to reduce the cost slightly more with a few more
earlier-exits like that, but it's probably in the noise based on timing
experiments.

Differential Revision: https://reviews.llvm.org/D138353
2022-11-21 13:57:55 -05:00
Roman Lebedev
8adfa29706 [Pipelines] Introduce SROA after (final, run-time) loop unrolling
Now that we are done with loop unrolling, be it either by LoopVectorizer,
or LoopUnroll passes, some variable-offset GEP's into alloca's could have
become constant-offset, thus enabling SROA and alloca promotion,
yet we don't capitalize on that, which is surprizing.

While it would be good to not introduce one more SROA invocation,
but instead move the one from `PassBuilder::buildFunctionSimplificationPipeline()`,
the existing test coverage says that is a bad idea,
though it would be fine compile-time wise: https://llvm-compile-time-tracker.com/compare.php?from=b150d34c47efbd8fa09604bce805c0920360f8d7&to=5a9a5c855158b482552be8c7af3e73d67fa44805&stat=instructions

So instead, i add yet another SROA run.
I have checked, and it needs to be at least after said final loop unrolling.
This is still fine compile-time wise: https://llvm-compile-time-tracker.com/compare.php?from=70324cd88328c0924e605fa81b696572560aa5c9&to=fb489bbef687ad821c3173a931709f9cad9aee8a&stat=instructions

I've encountered this in a real code, `SROA-after-final-loop-unrolling.ll` has been reduced from https://godbolt.org/z/fsdMhETh3

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D136806
2022-11-17 21:31:30 +03:00
Sanjay Patel
3682180e5f [PhaseOrdering] add test for load combining; NFC
Based on an example in issue #17113
2022-11-17 10:33:35 -05:00
Arthur Eubanks
70dc3b811e [AggressiveInstCombine] Remove legacy PM pass
As part of legacy PM optimization pipeline removal.

This shouldn't be used in codegen pipelines so it should be ok to remove.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D137116
2022-11-15 14:35:15 -08:00
Bjorn Pettersson
893e351f2f [test] Avoid legacy PM default pipelines (O0,O1 etc) when running opt
Two lit tests were found running something like this:
  opt -O<n> -pass-locked-to-legacy-PM ...

The expand-atomicrmw-xchg-fp.ll seem to have used -O1 just to ensure
that the -atomic-expand pass were thinking that it wasn't running at
O0 level. Same thing can be ensured by using the -codegen-opt-level=1
option, making it possible to avoid using O1 in that test case.

In the vector-reductions-expanded.ll test case it was possible to
split the RUN line into using two opt invocations. First running
"opt -O2" using the new PM, and then running "opt -expand-reductions"
using the legacy PM.

I think that given this patch we get closer to removing code related
to 'AddOptimizationPasses' in opt.cpp.

Differential Revision: https://reviews.llvm.org/D137626
2022-11-09 09:57:57 +01:00
Roman Lebedev
e14f30584d [NFC][PhaseOrdering] Add one more test for SROA after partial unroll
https://reviews.llvm.org/D136806
2022-10-27 19:10:27 +03:00
Roman Lebedev
70324cd883 [NFC][PhaseOrdering] Add new test for SROA misplacement 2022-10-27 03:11:23 +03:00
Yaxun (Sam) Liu
9d5adc7e49 Revert "reland e5581df60a [SimplifyCFG] accumulate bonus insts cost"
This reverts commit bd7949bcd8.

Revert this patch since reviwers have different opinions regarding
the approach in post-commit review.

Will open RFC for further discussion.

Differential Revision: https://reviews.llvm.org/D132408
2022-10-25 12:15:39 -04:00
Yaxun (Sam) Liu
bd7949bcd8 reland e5581df60a [SimplifyCFG] accumulate bonus insts cost
Fixed compile time increase due to always constructing LocalCostTracker.
Now only construct LocalCostTracker when needed.
2022-10-24 15:43:53 -04:00
bipmis
38f3e44997 [AggressiveInstCombine] Load merge the reverse load pattern of consecutive loads.
This patch extends the load merge/widen in AggressiveInstCombine() to handle reverse load patterns.

Differential Revision: https://reviews.llvm.org/D135137
2022-10-19 11:22:58 +01:00
bipmis
82e3056255 Add test for combinations of four i8-loads spliced into a 32-bit value 2022-10-18 15:40:56 +01:00
Nikita Popov
627a0c6b40 [PhaseOrdering] Name instructions in test (NFC)
Run through opt -instnamer.
2022-10-05 17:04:11 +02:00
Simon Pilgrim
5849fcb635 Revert rG1b7089fe67b924bdd5ecef786a34bdba7a88778f "[SLP] Add ScalarizationOverheadBuilder helper to track vector extractions"
Revert rGef89409a59f3b79ae143b33b7d8e6ee6285aa42f "Fix 'unused-lambda-capture' gcc warning. NFCI."
Revert rG926ccfef032d206dcbcdf74ca1e3a9ebf4d1be45 "[SLP] ScalarizationOverheadBuilder - demand all elements for scalarization if the extraction index is unknown / out of bounds"

Revert ScalarizationOverheadBuilder sequence from D134605 - when accumulating extraction costs by Type (instead of specific Value), we are not distinguishing enough when they are coming from the same source or not, and we always just count the cost once. This needs addressing before we can use getScalarizationOverhead properly.
2022-09-30 11:22:48 +01:00
Simon Pilgrim
1b7089fe67 [SLP] Add ScalarizationOverheadBuilder helper to track vector extractions
Instead of accumulating all extraction costs separately and then adjusting for repeated subvector extractions, this patch collects all the extractions and then converts to calls to getScalarizationOverhead to improve the accuracy of the costs.

I'm not entirely satisfied with the getExtractWithExtendCost handling yet - this still just adds all the getExtractWithExtendCost costs together - it really needs to be replaced with a "getScalarizationOverheadWithExtend", but that will require further refactoring first.

This replaces my initial attempt in D124769.

Differential Revision: https://reviews.llvm.org/D134605
2022-09-27 14:49:07 +01:00
Sanjay Patel
271f3b91bb [PhaseOrdering] add test for issue #50778; NFC
Several different passes are involved to get the expected IR,
and we don't want that to break again.
2022-09-23 12:12:13 -04:00
Sanjay Patel
34f8112b79 Revert "[PhaseOrdering] add test for issue #50778; NFC"
This reverts commit cdc012fa26.
This accidentally deleted a test file (not sure how that became
part of the commit).
2022-09-23 12:06:29 -04:00
Sanjay Patel
cdc012fa26 [PhaseOrdering] add test for issue #50778; NFC
Several different passes are involved to get the expected IR,
and we don't want that to break again.
2022-09-23 12:03:03 -04:00
Nikita Popov
dd61726d5b Revert "[SimplifyCFG] accumulate bonus insts cost"
This reverts commit e5581df60a.

This causes major compile-time regressions, about 2-3% end-to-end
on CTMark.
2022-09-19 14:46:43 +02:00
Yaxun (Sam) Liu
e5581df60a [SimplifyCFG] accumulate bonus insts cost
SimplifyCFG folds

bool foo() {
  if (cond1) return false;
  if (cond2) return false;
  return true;
}

as

bool foo() {
  if (cond1 | cond2) return false
  return true;
}

'cond2' is called 'bonus insts' in branch folding since they introduce overhead
since the original CFG could do early exit but the folded CFG always executes
them. SimplifyCFG calculates the costs of 'bonus insts' of a folding a BB into
its predecessor BB which shares the destination. If it is below bonus-inst-threshold,
SimplifyCFG will fold that BB into its predecessor and cond2 will always be executed.

When SimplifyCFG calculates the cost of 'bonus insts', it only consider 'bonus' insts
in the current BB to be considered for folding. This causes issue for unrolled loops
which share destinations, e.g.

bool foo(int *a) {
  for (int i = 0; i < 32; i++)
    if (a[i] > 0) return false;
  return true;
}

After unrolling, it becomes

bool foo(int *a) {
  if(a[0]>0) return false
  if(a[1]>0) return false;
  //...
  if(a[31]>0) return false;
  return true;
}

SimplifyCFG will merge each BB with its predecessor BB,
and ends up with 32 'bonus insts' which are always executed, which
is much slower than the original CFG.

The root cause is that SimplifyCFG does not consider the
accumulated cost of 'bonus insts' which are folded from
different BB's.

This patch fixes that by introducing a ValueMap to track
costs of 'bonus insts' coming from different BB's into
the same BB, and cuts off if the accumulated cost
exceeds a threshold.

Reviewed by: Artem Belevich, Florian Hahn, Nikita Popov, Matt Arsenault

Differential Revision: https://reviews.llvm.org/D132408
2022-09-18 20:21:14 -04:00
Sanjay Patel
d6498abc24 [InstCombine] remove multi-use add demanded constant fold
This was originally part of D133788. There are no visible
regressions. All of the diffs show a large unsigned constant
becoming a small negative constant. This should be better
for analysis (and slightly less compile-time) and codegen.
2022-09-18 14:23:43 -04:00
Valery N Dmitriev
18dde772d6 [SLP] Unify main/alternate selection for CmpInst instructions
Make main/alternate operation selection logic for CmpInst
consistent across SLP vectorizer.

Differential Revision: https://reviews.llvm.org/D133430
2022-09-13 09:20:25 -07:00
Simon Pilgrim
626a84db47 [CostModel][X86] getTypeBasedIntrinsicInstrCost - convert to CostKindTblEntry
Begin the refactoring to use CostKindTblEntry and return real latency/codesize/sizelatency costs instead of reusing the throughput numbers
2022-09-04 17:59:08 +01:00
Simon Pilgrim
8dc99180a6 [PhaseOrdering] Move X86 unsigned-multiply-overflow-check.ll test under X86 2022-09-04 17:54:32 +01:00
Simon Pilgrim
3edec9ba60 [CostModel][X86] Support cost kind specific look up tables (REAPPLIED)
Most of our cost model tables have been created assuming cost kind == recip-throughput. But we're starting to see passes wanting to get accurate costs for the other kinds as well. Some of these can be determined procedurally (e.g. codesize by default could just be the split count after type legalization), but others are going to need to be handled in cost tables - this is especially true for x86 which has so many ISA combinations.

I've created a 'CostKindCosts' struct which can hold cost values for the 4 cost kinds, defaulting to -1U for unknown cost, this can be used with the existing CostTblEntryT/CostTableLookup template code. I've also added a [TargetCostKind] accessor to make it much easier to look up individual <Optional> costs.

This just changes the ISD::SELECT costs to check the effect (and also to check that the ISD::SETCC are correctly handled for default/None cost kinds) - the plan would be to slowly extend this and move the CostKindTblEntry type somewhere generic to allow other targets to use it once its matured.

I'm also going to resurrect D103695 so that it can help with latency/codesize/sizelatency coverage testing.

For sizelatency - IIRC the definition was vague to let it be target specific - I've tried to use typical uop counts so they're comparable to MicroOpBufferSize etc.

REAPPLIED: Added early out to prevent getCmpSelInstrCost being used for anything but generic integer/float scalar/vector types - getTypeLegalizationCost can't handle the "exotic" TypeID enums that some passes attempt to get a costs for (aggregates etc.).

Differential Revision: https://reviews.llvm.org/D132216
2022-08-25 16:49:17 +01:00
Benjamin Kramer
ab85996e47 Revert "[CostModel][X86] Support cost kind specific look up tables"
This reverts commit 45846854a2.

This triggers an assertion failure during Clang selfhost

Unknown type!
UNREACHABLE executed at llvm/lib/CodeGen/ValueTypes.cpp:548!
*** SIGABRT received by PID 6107 (TID 6107) on cpu 218 from PID 6107; stack trace: ***
    @     0x556c8827c2d1         64  llvm::llvm_unreachable_internal()
    @     0x556c82a5542a         32  llvm::MVT::getVT()
    @     0x556c82a54a28         80  llvm::EVT::getEVT()
    @     0x556c7dda1526         80  llvm::TargetLoweringBase::getValueType()
    @     0x556c8174dd38        112  llvm::BasicTTIImplBase<>::getTypeLegalizationCost()
    @     0x556c81755e72        144  llvm::X86TTIImpl::getCmpSelInstrCost()
    @     0x556c8174cadf        512  llvm::TargetTransformInfoImplCRTPBase<>::getInstructionCost()
    @     0x556c84ab4dd2         32  llvm::TargetTransformInfo::getInstructionCost()
    @     0x556c82ead283       1968  llvm::sinkRegion()
2022-08-25 15:42:44 +02:00
Simon Pilgrim
2e5f16516a [CostModel][X86] Add CodeSize handling for fdiv ops
Eventually this will be part of the cost table lookup
2022-08-25 14:08:03 +01:00
Simon Pilgrim
45846854a2 [CostModel][X86] Support cost kind specific look up tables
Most of our cost model tables have been created assuming cost kind == recip-throughput. But we're starting to see passes wanting to get accurate costs for the other kinds as well. Some of these can be determined procedurally (e.g. codesize by default could just be the split count after type legalization), but others are going to need to be handled in cost tables - this is especially true for x86 which has so many ISA combinations.

I've created a 'CostKindCosts' struct which can hold cost values for the 4 cost kinds, defaulting to -1U for unknown cost, this can be used with the existing CostTblEntryT/CostTableLookup template code. I've also added a [TargetCostKind] accessor to make it much easier to look up individual <Optional> costs.

This just changes the ISD::SELECT costs to check the effect (and also to check that the ISD::SETCC are correctly handled for default/None cost kinds) - the plan would be to slowly extend this and move the CostKindTblEntry type somewhere generic to allow other targets to use it once its matured.

I'm also going to resurrect D103695 so that it can help with latency/codesize/sizelatency coverage testing.

For sizelatency - IIRC the definition was vague to let it be target specific - I've tried to use typical uop counts so they're comparable to MicroOpBufferSize etc.

Differential Revision: https://reviews.llvm.org/D132216
2022-08-25 12:23:36 +01:00
Jay Foad
2754ff883d [InstCombine] Try not to demand low order bits for Add
Don't demand low order bits from the LHS of an Add if:
- they are not demanded in the result, and
- they are known to be zero in the RHS, so they can't possibly
  overflow and affect higher bit positions

This is intended to avoid a regression from a future patch to change
the order of canonicalization of ADD and AND.

Differential Revision: https://reviews.llvm.org/D130075
2022-08-22 20:03:53 +01:00
Simon Pilgrim
53c0be28a7 [PhaseOrdering][X86] Regenerate vdiv.ll
Noticed while cleaning up x86 cost tables for upcoming cost kind support and it affected this test
2022-08-21 13:39:55 +01:00
Simon Pilgrim
08d153d806 [ValueTracking] computeKnownBits - attempt to use a branch condition feeding a phi to improve known bits range (PR38280)
If computeKnownBits encounters a phi node, and we fail to determine any known bits through direct analysis, see if the incoming value is part of a branch condition feeding the phi.

Handle cases where icmp(IncomingValue PRED Constant) is driving a branch instruction feeding that phi node - at the moment this only handles EQ/ULT/ULE predicate cases as they are the most straightforward to handle and most likely for branch-loop 'max upper bound' cases - we can extend this if/when necessary.

I investigated a more general icmp(LHS PRED RHS) KnownBits system, but the hard limits we put on value tracking depth through phi nodes meant that we were mainly catching constants anyhow.

Fixes the pointless vectorization in PR38280 / Issue #37628 (excessive unrolling still needs handling though)

Differential Revision: https://reviews.llvm.org/D131838
2022-08-16 16:54:44 +01:00
Florian Hahn
1638ad1ebf [PhaseOrdering] Add test showing excessive unrolling of vector loop.
Test cases based on #42332 showing excessive unrolling with both known
and runtime trip counts.
2022-08-16 16:29:15 +01:00
Simon Pilgrim
c771d0fd4b [Instcombine] Add (simplified) pointless loop unroll / vectorization test for Issue #37628 2022-08-13 14:34:08 +01:00
Sanjay Patel
bfb9b8e075 [Passes] add a tail-call-elim pass near the end of the opt pipeline
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.

In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.

I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031

Differential Revision: https://reviews.llvm.org/D130374
2022-07-25 15:25:47 -04:00
Nikita Popov
41d5033eb1 [IR] Enable opaque pointers by default
This enabled opaque pointers by default in LLVM. The effect of this
is twofold:

* If IR that contains *neither* explicit ptr nor %T* types is passed
  to tools, we will now use opaque pointer mode, unless
  -opaque-pointers=0 has been explicitly passed.
* Users of LLVM as a library will now default to opaque pointers.
  It is possible to opt-out by calling setOpaquePointers(false) on
  LLVMContext.

A cmake option to toggle this default will not be provided. Frontends
or other tools that want to (temporarily) keep using typed pointers
should disable opaque pointers via LLVMContext.

Differential Revision: https://reviews.llvm.org/D126689
2022-06-02 09:40:56 +02:00
Nikita Popov
1721ff1dfd [GVN] Enable enable-split-backedge-in-load-pre option by default
This option was added in D89854. It prevents GVN from performing
load PRE in a loop, if doing so would require critical edge
splitting on the backedge. From the review:

> I know that GVN Load PRE negatively impacts peeling,
> loop predication, so the passes expecting that latch has
> a conditional branch.

In the PhaseOrdering test in this patch, splitting the backedge
negatively affects vectorization: After critical edge splitting,
the loop gets rotated, effectively peeling off the first loop
iteration. The effect is that the first element is handled
separately, then the bulk of the elements use a vectorized
reduction (but using unaligned, off-by-one memory accesses) and
then a tail of 15 elements is handled separately again.

It's probably worth noting that the loop load PRE from D99926 is
not affected by this change (as it does not need backedge
splitting). This is about normal load PRE that happens to occur
inside a loop.

Differential Revision: https://reviews.llvm.org/D126382
2022-05-30 09:55:58 +02:00
Nikita Popov
6346a026af [PhaseOrdering] Add test for unprofitable loop load PRE backedge splitting (NFC) 2022-05-25 16:06:41 +02:00
Nikita Popov
b2a13d3e2d [InstCombine] Use IRBuilder in freeze pushing transform (PR55619)
Use IRBuilder so that the newly created freeze instructions
automatically gets inserted back into the IC worklist.

The changed worklist processing order leads to some cosmetic
differences in tests.

Fixes https://github.com/llvm/llvm-project/issues/55619.
2022-05-24 15:48:28 +02:00
Florian Hahn
b7315ffc3c [LAA,LV] Add initial support for pointer-diff memory checks.
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.

The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.

Note that the initial version is restricted in multiple ways:

1. Pointers must only either be read or written, by a single
   instruction (this allows re-constructing source/sink for
   dependences with the available information)
 2. Source and sink pointers must be add-recs, with matching steps
 3. The step must be a constant.
 3. abs(step) == AccessSize.

Most of those restrictions can be relaxed in the future.

See https://github.com/llvm/llvm-project/issues/53590.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D119078
2022-05-16 15:27:22 +01:00
Alexey Bataev
7ea03f0b4e [SLP]Improve reductions analysis and emission, part 1.
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.

The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.

Differential Revision: https://reviews.llvm.org/D114171
2022-05-02 12:03:58 -07:00
Simon Pilgrim
6f80830f06 [PhaseOrdering][X86] Use passes="" instead of passes='' so DOS can evaluate the cmd lines
Fix regenerating the tests on windows builds
2022-04-30 19:56:49 +01:00