This is an attempt to reland D42600 and enabling this optimisation by default.
This also resolves the issue pointed out in the context of PGO build.
Differential Revision: https://reviews.llvm.org/D42600
This is a follow-up to b71edfaa4e
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.
The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.
The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. ptr
addrspace(8). These pointers are 128-bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.
This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.
Depends on D143437
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D145441
This patch recommits 0827e2fa3f after
reverting it in ed7ada259f. Added
workround for `Targetlowering::AddrMode` no longer being an aggregate
in C++20.
`AArch64TargetLowering::isLegalAddressingMode` has a number of
defects, including accepting an addressing mode, which consists of
only an immediate operand, or not checking the offset range for an
addressing mode in the form `1*ScaledReg + Offs`.
This patch fixes the above issues.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D143895
Change-Id: I41a520c13ce21da503ca45019979bfceb8b648fa
`AArch64TargetLowering::isLegalAddressingMode` has a number of
defects, including accepting an addressing mode which consists of only
an immediate operand, or not checking the offset range for an
addressing mode in the form `1*ScaledReg + Offs`.
This patch fixes the above issues.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D143895
Change-Id: I756fa21941844ded44f082ac7eea4391219f9851
This patch splits a restore point to allow it to only post-dominate blocks reachable by use
or def of CSRs(Callee Saved Registers)/FI(Frame Index).
Benchmarking this on SPEC2017, this gives around 4% improvement on povray and no significant change
for others.
Co-authored-by: junbuml
Differential Revision: https://reviews.llvm.org/D42600
Fixes https://github.com/llvm/llvm-project/issues/61182.
LoopStrengthReduce may sometimes break LCSSA form when applying a rewrite
for an instruction used in a PHI.
It happens if:
- The PHI is in a loop exit block,
- The edge from the corresponding exiting block to that exit is critical,
- The PHI has at least two inputs coming from loop blocks,
- and the rewritten instruction is inserted in the loop.
In such case we split the critical edge and then replace PHI inputs
with the rewritten instruction. However ExitBlock is no longer
a loop exit, so LCSSA form is broken.
This patch fixes it by collecting all inserted instructions for PHIs
whose parent block is not a loop exit and then forming LCSSA for them.
Differential Revision: https://reviews.llvm.org/D146811
Reported in https://reviews.llvm.org/D146415. I rewrote the patch and aded the test case. Per that report, spec2006.483.xalancbmk crashes without this fix.
This models the approach used in LFTR. The short summary is that we need to prove the IV is not dead first, and then we have to either prove the poison flag is valid after the new user or delete it.
There are two key differences between this and LFTR.
First, I allow a non-concrete start to the IV. The goal of LFTR is to canonicalize and IVs with constant starts are canonical, so the very restrictive definition there is mostly okay. Here on the other hand, we're explicitly moving *away* from the canonical form, and thus need to handle non-constant starts.
Second, LFTR bails out instead of removing inbounds on a GEP. This is a pragmatic tradeoff since inbounds is hard to infer and assists aliasing. This pass runs very late, and I think the tradeoff runs the other way.
A different approach we could take for the post-inc check would be to perform a pre-inc check instead of a post-inc check. We would still have to check the pre-inc IV, but that would avoid the need to drop inbounds. Doing the pre-inc check would basically trade killing a whole IV for an extra register move in the loop. I'm open to suggestions on the right approach here.
Note that this analysis is quite expensive compile time wise. I have made no effort to optimize (yet).
Differential Revision: https://reviews.llvm.org/D146464
This is a follow up to one of the side discussions on D146429. There are two semantic changes contained here.
The motivation for the change to the legality condition introduced in D146429 comes from the fact that we only check the post-inc form. As such, as long as the values of the post-inc variable don't self wrap, it's actually okay if we wrap past the starting value of the pre-inc IV.
Second, Nikic noticed during review that the test changes changed behavior for TC=0 (i.e. N=0 in the tests). On more careful inspection, it became apparent that the previous manual expansion code was incorrect in the case where the primary IV could wrap without poison, and started with the limit value (i.e. i8 post-inc starts at 255 for 0 exit test, implying pre-inc starts with 0). See @wrap_around test for an example of the (previous) miscompile.
Differential Revision: https://reviews.llvm.org/D146457
The existing logic was unsound, in two ways.
First, due to wrapping on the trip count computation, it could compute a value which convert a loop which exiting on iteration 256, to one which exited at 255. (With i8 trip counts.)
Second, it allowed rewriting when the trip count implies wrapping around the alternate IV. As a trivial example, it allowed rewriting an i128 exit test in terms of an i64 IV. This is obviously wrong.
Note that the test change is fairly minimal - i.e. only the targeted test - but that's only because I precommitted a change which switched the test from 32 to 64 bit pointers. For 32 bit point architectures with 32 bit primary inductions, this transform is almost always unsound to perform.
Differential Revision: https://reviews.llvm.org/D146429
Also, add a comment to highlight that the "good" result on this test is accidental, and not based on a principled decision. I matched the original behavior to make this nfc, but selecting the last legal IV is not well motivated here.
Main benefit here is making the logic easier to follow, slightly more efficient, and more in line with LFTR. This is not NFC. There are three semantic changes here.
First, we drop handling for constants on the LHS of the comparison. These are non-canonical, and we're very late in the optimization pipeline here, so there's no point in supporting this. I removed a test which covered this case.
Second, we don't need the almost dead IV to be an addrec. We just need SCEV to be able to compute a trip count for it.
Third, we require a simple IV for the almost dead IV. In theory, this removes cases we could have previously handled, but given a) zero testing and b) multiple known correctness issues, I'm adopting an attidute of narrowing this down to something which works correctly, and *then* expanding.
There were two major problems with the tests.
First, with the pointer size being 32 bit and the original IVs also being 32 bit, almost all of the positive tests were actually unsound. An upcoming change will add the appropriate safety check, but the test diffs are really hard to understand without switching the tests to 64 bit pointers first.
Second, checking debug messages for failures is a major bad practice. This should not have been accepted in review at all. The reason is that it makes the *order* of legality checks visibile and modifying any of them becomes annoying and tedious.
Autogen for naming change, and remove comments about C code inspiration. Multiple of these are out of sync with the actual IR, and these are IR tests anyways.
DFAJumpThreading
JumpThreading
LibCallsShrink
LoopVectorize
SLPVectorizer
DeadStoreElimination
AggressiveDCE
CorrelatedValuePropagation
IndVarSimplify
These are part of the optimization pipeline, of which the legacy version is deprecated and being removed.
Like what has been done in AArch64 (D125335).
We enable this under `-O2` to show the codegen diffs here but we
may only do this under `-O3` like AArch64.
There are two cases that we may produce these eliminable copies:
1. ISel of `FrameIndex`. Like `rvv/fixed-vectors-calling-conv.ll`.
2. Tail duplication. Like `select-optimize-multiple.ll`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144535
The vsetvli insertion pass can replace it with MU if needed by
a using instruction. The vsetvli insertion pass will not convert
MU to MA so we need to start at MA.
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D143790
Original code will cause crash when the load/store memory type is structure because isIndexedLoadLegal/isIndexedStore doesn't support struct type.
So we limit the load/store memory type to integer.
Origin commit message:
When the latch block is different from header block, IVInc will be expanded in the latch loop. We can't generate the post index load/store this case.
But if the IVInc only used in the loop, actually we still can use the post index load/store because when exit loop we don't care the last IVInc value.
So, trying to hoist IVInc to help backend to generate more post index load/store.
Fix#53625
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D138636
This matches the behavior from a number of other targets, including e.g. X86. This does have the effect of increasing register pressure slightly, but we have a relative abundance of registers in the ISA compared to other targets which use the same heuristic.
The motivation here is that our current cost heuristic treats number of registers as the dominant cost. As a result, an extra use outside of a loop can radically change the LSR result. As an example consider test4 from the recently added test/Transforms/LoopStrengthReduce/RISCV/lsr-cost-compare.ll. Without a use outside the loop (see test3), we convert the IV into a pointer increment. With one, we leave the gep in place.
The pointer increment version both decreases number of instructions in some loops, and creates parallel chains of computation (i.e. decreases critical path depth). Both are generally profitable.
Arguably, we should really be using a more sophisticated model here - such as e.g. using profile information or explicitly modeling parallelism gains. However, as a practical matter starting with the same mild hack that other targets have used seems reasonable.
Differential Revision: https://reviews.llvm.org/D142227
There's no need to require the start value to come directly from the loop predecessor. This was sometimes covering up a latent miscompile in this off-by-default option, but the miscompile needs fixed anyways and the issue has been raised on the original review.
Differential Revision: https://reviews.llvm.org/D142240
IR is now always parsed in opaque pointer mode, unless
-opaque-pointers=0 is explicitly given. There is no automatic
detection of typed pointers anymore.
The -opaque-pointers=0 option is added to any remaining IR tests
that haven't been migrated yet.
Differential Revision: https://reviews.llvm.org/D141912
This patch relaxes the restriction that both reassociate operands must
be in the same block as the root instruction.
The comment indicates that the reason for this restriction was that the
operands not in the same block won't have a depth in the trace.
I believe this is outdated; if the operand is in a different block, it
must dominate the current block (otherwise it would need to be phi),
which in turn means the operand's block must be included in the current
rance, and depths must be available.
There's a test case (no_reassociate_different_block) added in
70520e2f1c which shows that we have accurate depths for operands
defined in other blocks.
This allows reassociation of code that computes the final reduction
value after vectorization, among other things.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D141302
When the latch block is different from header block, IVInc will be expanded in the latch loop. We can't generate the post index load/store this case.
But if the IVInc only used in the loop, actually we still can use the post index load/store because when exit loop we don't care the last IVInc value.
So, trying to hoist IVInc to help backend to generate more post index load/store.
Fix#53625
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D138636