This patch adds metadata to disable runtime unrolling to the vectorized
loop. If runtime unrolling/interleaving is considered profitable, LV
will interleave the loop directly. There should be no need to perform
runtime unrolling at a later stage.
Note that we already add metadata to disable runtime unrolling to the
scalar loop after vectorization.
The additional unrolling unnecessarily increases code size and compile
time. In addition to that we have several bug reports of unncessary
runtime unrolling for vectorized loops, e.g. PR40961
Compile-time improvements:
NewPM-O3: -1.04%
NewPM-ReleaseThinLTO: -0.59%
NewPM-ReleaseLTO-g: -0.97%
https://llvm-compile-time-tracker.com/compare.php?from=ce1be13a868d0f8afa367975558c1a6175cce33a&to=78bc2e67f22e9e10e61cdb6cdac4bb857d95eb1b&stat=instructions:uFixes#40306.
Reviewed By: lebedev.ri, nikic
Differential Revision: https://reviews.llvm.org/D115261
As per the post-commit feedback - that was not the correct precondition
to avoid it here. I think we should generally start changing mentality
about `freeze`, the fact that we have been conditioned to be afraid of it
(or of anything in LLVM in general) is the key problem here.
This kind of thing happens really frequently in LLVM's very own
shuffle combining methods, and it is even considered bad practice
to use `%` there, instead of using this expansion directly.
Though, many of the cases there have variable divisors,
so this won't help everything.
Simple case: https://alive2.llvm.org/ce/z/PjvYf-
There's alternative expansion via `umin`:
https://alive2.llvm.org/ce/z/hWCVPb
BUT while we can transform the first expansion
into the `umin` one (e.g. for SCEV):
https://alive2.llvm.org/ce/z/iNxKmJ
... we can't go in the opposite direction.
Also, the non-`umin` expansion seems somewhat more codegen-friendly:
https://godbolt.org/z/qzjx5bqWKhttps://godbolt.org/z/a7bj1axbx
There's second variant of precondition:
https://alive2.llvm.org/ce/z/zE6cbM
but there the numerator must be non-undef / must be frozen.
This puts lower insert indexes before higher. This is independent
of endian, so it requires an adjustment to a fold added with
4446f71ce3, but it makes that fold more robust.
That's also where this patch was suggested - D139668.
This matches what we already do in DAGCombiner, but there is one
more constraint because there's an existing canonicalization for
insert-of-scalar-constant. I'm not sure if that is still needed,
so it may be adjusted/removed as a follow-up.
* This is a recommit of 3c4d2a0396,
* which was reverted in 25f01d593c,
because it exposed a miscompile in PPC backend, which was resolved
in https://reviews.llvm.org/D140089 / cb3f415cd2.
* which was a recommit of cf624b23bc,
* which was reverted in 5cfc22cafe,
because the cut-off on the number of vector elements was not low enough,
and it triggered both SDAG SDNode operand number assertions,
5and caused compile time explosions in some cases.
Let's try with something really *REALLY* conservative first,
just to get somewhere, and try to bump it later.
FIXME: should this respect TTI reg width * num vec regs?
Original commit message:
Now, there's a big caveat here - these bytes
are abstract bytes, not the i8 we have in LLVM,
so strictly speaking this is not exactly legal,
see e.g. https://github.com/AliveToolkit/alive2/issues/860
^ the "bytes" "could" have been a pointer,
and loading it as an integer inserts an implicit ptrtoint.
But at the same time,
InstCombine's `InstCombinerImpl::SimplifyAnyMemTransfer()`
would expand a memtransfer of 1/2/4/8 bytes
into integer-typed load+store,
so this isn't exactly a new problem.
Note that in memory, poison is byte-wise,
so we really can't widen elements,
but SROA seems to be inconsistent here.
Fixes#59116.
This replaces patches that tried to convert related patterns to shuffles
(D138872, D138873, D138874 - reverted/abandoned) but caused codegen
problems and were questionable as a canonicalization because an
insertelement is a simpler op than a shuffle.
This detects a larger pattern -- insert-of-insert -- and replaces with
another insert, so this hopefully does not cause any problems.
As noted by TODO items in the code and tests, this could go a lot further.
But this is enough to reduce the motivating test from issue #17113.
Example proofs:
https://alive2.llvm.org/ce/z/NnUv3a
I drafted a version of this for AggressiveInstCombine, but it seems that
would uncover yet another phase ordering gap. If we do generalize this to
handle the full range of potential patterns, that may be worth looking at
again.
Differential Revision: https://reviews.llvm.org/D139668
This reverts commit e71b81cab0.
As discussed in the planned follow-on to this patch (D138874),
this and the subsequent patches in this set can cause trouble for
the backend, and there's probably no quick fix. We may even
want to canonicalize in the opposite direction (towards insertelt).
The first attempt was reverted because a clang test changed
unexpectedly - the file is already marked with a FIXME, so
I just updated it this time to pass.
Original commit message:
This is the main patch for converting a truncated scalar that is
inserted into a vector to bitcast+shuffle. We could go either way
on patterns like this, but this direction will allow collapsing a
pair of these sequences on the motivating example from issue
The patch is split into 3 parts to make it easier to see the
progression of tests diffs. We allow inserting/shuffling into a
different size vector for flexibility, so there are several test
variations. The length-changing is handled by shortening/padding
the shuffle mask with undef elements.
In part 1, handle the basic pattern:
inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask
Proof for the endian-dependency behaving as expected:
https://alive2.llvm.org/ce/z/BsA7yC
The TODO items for handling shifts and insert into an arbitrary base
vector value are implemented as follow-ups.
Differential Revision: https://reviews.llvm.org/D138872
This is the main patch for converting a truncated scalar that is
inserted into a vector to bitcast+shuffle. We could go either way
on patterns like this, but this direction will allow collapsing a
pair of these sequences on the motivating example from issue
The patch is split into 3 parts to make it easier to see the
progression of tests diffs. We allow inserting/shuffling into a
different size vector for flexibility, so there are several test
variations. The length-changing is handled by shortening/padding
the shuffle mask with undef elements.
In part 1, handle the basic pattern:
inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask
Proof for the endian-dependency behaving as expected:
https://alive2.llvm.org/ce/z/BsA7yC
The TODO items for handling shifts and insert into an arbitrary base
vector value are implemented as follow-ups.
Differential Revision: https://reviews.llvm.org/D138872
This is a recommit of cf624b23bc,
which was reverted in 5cfc22cafe,
because the cut-off on the number of vector elements was not low enough,
and it triggered both SDAG SDNode operand number assertions,
and caused compile time explosions in some cases.
Let's try with something really *REALLY* conservative first,
just to get somewhere, and try to bump it (to 64/128) later.
FIXME: should this respect TTI reg width * num vec regs?
Original commit message:
Now, there's a big caveat here - these bytes
are abstract bytes, not the i8 we have in LLVM,
so strictly speaking this is not exactly legal,
see e.g. https://github.com/AliveToolkit/alive2/issues/860
^ the "bytes" "could" have been a pointer,
and loading it as an integer inserts an implicit ptrtoint.
But at the same time,
InstCombine's `InstCombinerImpl::SimplifyAnyMemTransfer()`
would expand a memtransfer of 1/2/4/8 bytes
into integer-typed load+store,
so this isn't exactly a new problem.
Note that in memory, poison is byte-wise,
so we really can't widen elements,
but SROA seems to be inconsistent here.
Fixes#59116.
Now, there's a big caveat here - these bytes
are abstract bytes, not the i8 we have in LLVM,
so strictly speaking this is not exactly legal,
see e.g. https://github.com/AliveToolkit/alive2/issues/860
^ the "bytes" "could" have been a pointer,
and loading it as an integer inserts an implicit ptrtoint.
But at the same time,
InstCombine's `InstCombinerImpl::SimplifyAnyMemTransfer()`
would expand a memtransfer of 1/2/4/8 bytes
into integer-typed load+store,
so this isn't exactly a new problem.
Note that in memory, poison is byte-wise,
so we really can't widen elements,
but SROA seems to be inconsistent here.
Fixes#59116.
As part of legacy PM optimization pipeline removal.
This shouldn't be used in codegen pipelines so it should be ok to remove.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D137116
Two lit tests were found running something like this:
opt -O<n> -pass-locked-to-legacy-PM ...
The expand-atomicrmw-xchg-fp.ll seem to have used -O1 just to ensure
that the -atomic-expand pass were thinking that it wasn't running at
O0 level. Same thing can be ensured by using the -codegen-opt-level=1
option, making it possible to avoid using O1 in that test case.
In the vector-reductions-expanded.ll test case it was possible to
split the RUN line into using two opt invocations. First running
"opt -O2" using the new PM, and then running "opt -expand-reductions"
using the legacy PM.
I think that given this patch we get closer to removing code related
to 'AddOptimizationPasses' in opt.cpp.
Differential Revision: https://reviews.llvm.org/D137626
https://alive2.llvm.org/ce/z/EfHlWN
In the motivating case from issue #58313,
this allows forming a duplicate 'not' op
which then gets CSE'd and simplifyCFG'd
and combined into the expected 'xor'.
In InstCombine we treat i8/i16 as desirable, even if they are not legal.
The current logic in shouldChangeType will decide to convert from an
illegal but desirable type (such as an i8) to an illegal and undesirable
type (such as i3). This patch prevents changing the switch conditions to
an irregular type on like Arm/AArch64 where i8/i16 are not legal.
This is the same issue as https://reviews.llvm.org/D54115. In the case I
was looking it is was converting an i32 switch to an i8 switch, which
then became a i3 switch.
Differential Revision: https://reviews.llvm.org/D136763
This reverts commit bd7949bcd8.
Revert this patch since reviwers have different opinions regarding
the approach in post-commit review.
Will open RFC for further discussion.
Differential Revision: https://reviews.llvm.org/D132408
Canonicalize GEP of GEP by swapping GEP with some suffix constant indices to the back (and GEP with all constant indices to the back of that), this allows more constant index GEP merging to happen. Exceptions are: If swapping violates use-def relations, or anti-optimizes LICM
For constant indexed GEP of GEP, if they cannot be merged directly, they will be casted to i8* and merged.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D125845
This patch extends the load merge/widen in AggressiveInstCombine() to handle reverse load patterns.
Differential Revision: https://reviews.llvm.org/D135137
(X << Z) / (Y << Z) --> X / Y
https://alive2.llvm.org/ce/z/E5eaxU
This fixes the motivating example from issue #58137,
but it is not the most general transform. We should
probably also convert left-shift in the divisor to
right-shift in the dividend for that, but that exposes
another missed canonicalization for shifts and adds.
This is an extension of the existing min/max+select fold (which already
has a very large number of variations) to allow a vector shuffle because
that's what we have in the motivating example from issue #42100.
A couple of Alive2 checks of variants (I don't know how to generalize
these in Alive):
https://alive2.llvm.org/ce/z/jUFAqT
And verify the PR42100 test:
https://alive2.llvm.org/ce/z/3EcASf
It's possible there is some generalization of the fold or a
VectorCombine/SLP answer for the motivating test, but I haven't found a
better/smaller solution yet.
We can also add even more variants here as follow-up patches. For example,
we can have shuffle followed by min/max; we also don't have this
canonicalization or the reverse:
https://alive2.llvm.org/ce/z/StHD9f
Differential Revision: https://reviews.llvm.org/D134879
Revert rGef89409a59f3b79ae143b33b7d8e6ee6285aa42f "Fix 'unused-lambda-capture' gcc warning. NFCI."
Revert rG926ccfef032d206dcbcdf74ca1e3a9ebf4d1be45 "[SLP] ScalarizationOverheadBuilder - demand all elements for scalarization if the extraction index is unknown / out of bounds"
Revert ScalarizationOverheadBuilder sequence from D134605 - when accumulating extraction costs by Type (instead of specific Value), we are not distinguishing enough when they are coming from the same source or not, and we always just count the cost once. This needs addressing before we can use getScalarizationOverhead properly.
The phase ordering test is the almost unoptimized IR for the example
in issue #42100; it was passed through -mem2reg to reduce obvious
excessive load/store and other noise.
D134879
Instead of accumulating all extraction costs separately and then adjusting for repeated subvector extractions, this patch collects all the extractions and then converts to calls to getScalarizationOverhead to improve the accuracy of the costs.
I'm not entirely satisfied with the getExtractWithExtendCost handling yet - this still just adds all the getExtractWithExtendCost costs together - it really needs to be replaced with a "getScalarizationOverheadWithExtend", but that will require further refactoring first.
This replaces my initial attempt in D124769.
Differential Revision: https://reviews.llvm.org/D134605