This patch adds support for unrolling inner loops using epilogue unrolling. The basic issue is that the original latch exit block of the inner loop could be outside the outer loop. When we clone the inner loop and split the latch exit, the cloned blocks need to be in the outer loop.
Differential Revision: https://reviews.llvm.org/D108476
This is a followup to D104662 to generate slightly nicer code for
pointer overflow checks. Bypass expandAddToGEP and instead
explicitly generate i8 GEPs. This saves some bitcasts and negates
the value in a more obvious way. In particular, this prevents SCEV
from looking through the umul.with.overflow, same as in the integer
case.
The wrapping-pointer-ni.ll test deserves a comment: Previously,
this generated a typed GEP which used the umulo argument rather
than the multiplication result. This results in more compact IR in
that case, but effectively does the multiplication twice, the
second one is just hidden in the GEP. Reusing the umulo result
seems pretty reasonable to me.
Differential Revision: https://reviews.llvm.org/D109093
This is a case I'd missed in 6a8237. The odd bit here is that missing the edge removal update seems to produce MemorySSA which verifies, but is still corrupt in a way which bothers following passes. I wasn't able to reduce a single pass test case, which is why the reported test case is taken as is.
Differential Revision: https://reviews.llvm.org/D109068
We'd special cased this logic to use pointer types for non-integral pointers, but there's no reason we can't do that for all pointer types. Doing it this was has a few advantages:
a) The code itself becomes more straight forward, and easier to test.
b) We avoid introducing ptrtoint into programs which didn't have them in the source.
c) The resulting codegen is easier to analyze and simplify (mostly due to lack of ptrtoint).
Note that there are some test diffs, but a) running them through instcombine helps a ton, and b) there's enough missing obvious transforms on both before and after IR that it's clear this isn't performance sensitive.
This is mostly motivated by cleaning up mentions of non-integrals to have a clearer idea of what we actually need to support.
Differential Revision: https://reviews.llvm.org/D104662
The runtime unroller will try to produce a non-loop if the unroll count is 2 and thus the prolog/epilog loop would only run at most one iteration. The old implementation did this by avoiding loop construction entirely. This patches instead constructs the trivial loop and then explicitly breaks the backedge and simplifies. This does result in some additional code churn when triggered, but a) results in better quality code and b) removes a codepath which didn't work properly for multiple exit epilogs.
One oddity that I want to draw to reviewer attention is that this somehow changes revisit order. The new order looks equivalent to me, but I don't understand how creating and erasing an extra loop here creates this effect.
Differential Revision: https://reviews.llvm.org/D108521
Previously, we'd expand *ALL* the SCEV's eagerly, because we needed to
check with `isValidRewrite()`, and discard bad rewrite candidates,
but now that we do not do that, we also don't need to always expand.
In particular, this avoids expanding potentially-huge SCEV's that we
would discard anyways because they are high-cost and we aren't
rewriting aggressively.
`isValidRewrite()` checks that the both the original SCEV,
and the rewrite SCEV have the same base pointer.
I //believe//, after all the recent SCEV improvements,
this invariant is already enforced by SCEV itself.
I originally tried changing it into an assert in D108043,
but that showed that it triggers on e.g. https://reviews.llvm.org/D108043#2946621,
where SCEV manages to forward the store to load,
test added.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D108655
ExposePointerBase() in SCEVExpander implements basically the same
functionality as removePointerBase() in SCEV, so reuse it.
The SCEVExpander code assumes that the pointer operand on adds is
the last one -- I'm not sure that always holds. As such this might
not be strictly NFC.
There can only be one pointer operand in an add expression, and
we have sorted operands to guarantee that it is the first. As
such, the pointer check for other operands is dead code.
Changes since aec08e:
* Adjust placement of a closing brace so that the general case actually runs. Turns out we had *no* coverage of the switch case. I added one in eae90fd.
* Drop .llvm.loop.* metadata from the new branch as there is no longer a loop to annotate.
Original commit message:
This special cases an unconditional latch and a conditional branch latch exit to improve codegen and test readability. I am hoping to reuse this function in the runtime unroll code, but without this change, the test diffs are far too complex to assess.
The Code Extractor does not provide an easy mechanism for determining the
inputs and outputs after extraction has occurred, this patch gives the
ability to pass in empty SetVectors to be filled with the inputs and
outputs if they need to be analyzed.
Added Tests:
- InputOutputMonitoring in unittests/Transforms/Utils/CodeExtractorTests.cpp
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D106991
Support for peeling with multiple exit blocks was added in D63921/77bb3a486fa6.
So far it has only been enabled for loops where all non-latch exits are
'de-optimizing' exits (D63923). But peeling of multi-exit loops can be
highly beneficial in other cases too, like if all non-latch exiting
blocks are unreachable.
The motivating case are loops with runtime checks, like the C++ example
below. The main issue preventing vectorization is that the invariant
accesses to load the bounds of B is conditionally executed in the loop
and cannot be hoisted out. If we peel off the first iteration, they
become dereferenceable in the loop, because they must execute before the
loop is executed, as all non-latch exits are terminated with
unreachable. This subsequently allows hoisting the loads and runtime
checks out of the loop, allowing vectorization of the loop.
int sum(std::vector<int> *A, std::vector<int> *B, int N) {
int cost = 0;
for (int i = 0; i < N; ++i)
cost += A->at(i) + B->at(i);
return cost;
}
This gives a ~20-30% increase of score for Geekbench5/HDR on AArch64.
Note that this requires a follow-up improvement to the peeling cost
model to actually peel iterations off loops as above. I will share that
shortly.
Also, peeling of multi-exits might be beneficial for exit blocks with
other terminators, but I would like to keep the scope limited to known
high-reward cases for now.
I removed the option to disable peeling for multi-deopt exits because
the code is more general now. Alternatively, the option could also be
generalized, but I am not sure if there's much value in the option?
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D108108
This special cases an unconditional latch and a conditional branch latch exit to improve codegen and test readability. I am hoping to reuse this function in the runtime unroll code, but without this change, the test diffs are far too complex to assess.
The purpose of __attribute__((disable_sanitizer_instrumentation)) is to
prevent all kinds of sanitizer instrumentation applied to a certain
function, Objective-C method, or global variable.
The no_sanitize(...) attribute drops instrumentation checks, but may
still insert code preventing false positive reports. In some cases
though (e.g. when building Linux kernel with -fsanitize=kernel-memory
or -fsanitize=thread) the users may want to avoid any kind of
instrumentation.
Differential Revision: https://reviews.llvm.org/D108029
The only thing that function should do as per it's semantic,
is to ensure that the switch's default is a block consisting only of
an `unreachable` terminator.
So let's just create such a block and update switch's default
to point to it. There should be no need for all this weird dance
around predecessors/successors.
This patch extends the runtime unrolling infrastructure to support unrolling a loop with multiple exiting blocks branching to the same exit block used by the latch. It intentionally does not include a cost model change to enable this functionality unless appropriate force flags are used.
This is the prolog companion to D107381. Since this was LGTMed, a problem with DT updating was reported against that patch. I roled in the analogous fix here as it seemed obvious, and not worth re-review.
As an aside, our prolog form leaves a lot of potential value on the floor when there is an invariant load or invariant condition in the loop being runtime unrolled. We should probably consider a "required prolog" heuristic. (Alternatively, maybe we should be peeling these cases more aggressively?)
Differential Revision: https://reviews.llvm.org/D108262
In 94d0914, I added support for unrolling of multiple exit loops which have multiple exits reaching the latch. Per reports on the review post commit, I'd missed updating the domtree for one case. This fix addresses that ommission.
There's no new test as this is covered by existing tests with expensive verification turned on.
This reverts commit 9934a5b2ed.
This patch may cause miscompiles because it missed a constraint
as shown in the examples from:
https://llvm.org/PR51531
This patch extends the runtime unrolling infrastructure to support unrolling a loop with multiple exiting blocks branching to the same exit block used by the latch. It intentionally does not include a cost model change to enable this functionality unless appropriate force flags are used.
I decided to restrict this to the epilogue case. Given the changes ended up being pretty generic, we may be able to unblock the prolog case too, but I want to do that in a separate change to reduce the amount of code we all have to understand at one time.
Differential Revision: https://reviews.llvm.org/D107381
his is a fix for PR43678, and is an alternate patch to D105723.
The basic issue we're running into is that LSR + SCEVExpander are moving the very instruction whose operand we're in the process of expanding. This breaks the subtle and ill-documented invariant which let LSR work. (Full story can be found here: https://reviews.llvm.org/D105723#2878473)
Rather than attempting a fix, this change just removes the optimization entirely. The code is entirely untested, and removing it appears to have no impact I can find. This code was added back in 2014 by 1e12f8563d with a single test which does not seem to actually test the hoisting logic.
From a philosophical standpoint, it also seems very strange to have the expander implementing optimizations which should live in a dedicated transform pass.
Differential Revision: https://reviews.llvm.org/D106178
This option has been enabled by default for quite a while now.
The practical impact of removing the option is that MSSA use
cannot be disabled in default pipelines (both LPM and NPM) and
in manual LPM invocations. NPM can still choose to enable/disable
MSSA using loop vs loop-mssa.
The next step will be to require MSSA for LICM and drop the
AST-based implementation entirely.
Differential Revision: https://reviews.llvm.org/D108075
LoopLoadElimination, LoopVersioning and LoopVectorize currently
fetch MemorySSA when construction LoopAccessAnalysis. However,
LoopAccessAnalysis does not actually use MemorySSA and we can pass
nullptr instead.
This saves one MemorySSA calculation in the default pipeline, and
thus improves compile-time.
Differential Revision: https://reviews.llvm.org/D108074
Currently/previously, while SCEV guaranteed that it produces the same value,
the way it was produced may be illegal IR, so we have an ugly check that
the replacement is valid.
But now that the SCEV strictness wrt the pointer/integer types has been improved,
i believe this invariant is already upheld by the SCEV itself, natively.
I think we should add an assertion, wait for a week, and then, if all is good,
rip out all this checking.
Or we could just do the latter directly i guess.
This reverts commit rL127839.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D108043
Previously we would allow promotion even if the byval/inalloca
attributes on the call and the callee didn't match.
It's ok if the byval/inalloca types aren't the same. For example, LTO
importing may rename types.
Fixes PR51397.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107998