It seems the crashes we saw wasn't caused by this (see comments on the review).
> This is basically D108837 but for jump threading. Free instructions
> should be ignored for the threading decision. JumpThreading already
> skips some free instructions (like pointer bitcasts), but does not
> skip various free intrinsics -- in fact, it currently gives them a
> fairly large cost of 2.
>
> Differential Revision: https://reviews.llvm.org/D110290
This reverts commit 4604695d7c.
It caused compiler crashes, see comment on the code review for repro.
> This is basically D108837 but for jump threading. Free instructions
> should be ignored for the threading decision. JumpThreading already
> skips some free instructions (like pointer bitcasts), but does not
> skip various free intrinsics -- in fact, it currently gives them a
> fairly large cost of 2.
>
> Differential Revision: https://reviews.llvm.org/D110290
This reverts commit 1e3c6fc7cb.
This is basically D108837 but for jump threading. Free instructions
should be ignored for the threading decision. JumpThreading already
skips some free instructions (like pointer bitcasts), but does not
skip various free intrinsics -- in fact, it currently gives them a
fairly large cost of 2.
Differential Revision: https://reviews.llvm.org/D110290
This patch is for fixing potential shufflevector-related bugs like D93818.
As D93818, this patch change shufflevector's default placeholder to poison.
To reduce risk, it was divided into several patches, and this patch is for InstCombineCompares and InstructionCombining.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D110227
IR with matrix intrinsics is likely to also contain large vector
operations, which can benefit from early simplifications.
This is the last step in a series of changes to improve code-gen for
code using matrix subscript operators with the C/C++ matrix extension in
CLang, like
using matrix_t = double __attribute__((matrix_type(15, 15)));
void foo(unsigned i, matrix_t &A, matrix_t &B) {
for (unsigned j = 0; j < 4; ++j)
for (unsigned k = 0; k < i; k++)
B[k][j] -= A[k][j] * B[i][j];
}
https://clang.godbolt.org/z/6dKxK1Ed7
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D102496
This patch updates VectorCombine to use a worklist to allow iterative
simplifications where a combine enables other combines.
Suggested in D100302.
The main use case at the moment is foldSingleElementStore and
scalarizeLoadExtract working together to improve scalarization.
Note that we now also do not run SimplifyInstructionsInBlock on the
whole function if there have been changes. This means we fail to
remove/simplify instructions not related to any of the vector combines.
IMO this is fine, as simplifying the whole function seems more like a
workaround for not tracking the changed instructions.
Compile-time impact looks neutral:
NewPM-O3: +0.02%
NewPM-ReleaseThinLTO: -0.00%
NewPM-ReleaseLTO-g: -0.02%
http://llvm-compile-time-tracker.com/compare.php?from=52832cd917af00e2b9c6a9d1476ba79754dcabff&to=e66520a4637290550a945d528e3e59573485dd40&stat=instructions
Reviewed By: spatel, lebedev.ri
Differential Revision: https://reviews.llvm.org/D110171
This makes some tests in vector-reductions-logical.ll more stable when
applying D108837.
The cost of branching is higher when vector ops are involved due to
potential SLP transformations.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D108935
I can't seem to wrap my head around the proper fix here,
we should be fine without this requirement, iff we can form this form,
but the naive attempt (https://reviews.llvm.org/D106317) has failed.
So just to unblock the release, put up a restriction.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51125
The min/max intrinsics are not yet canonical, but when they are the tail
predications analysis will change from treating them like icmp to
treating them like intrinsics. Unfortunately, they can currently produce
better code by not being tail predicated thanks to the vectorizer picking
higher VF's and the backend folding to better instructions (especially
for saturate patterns). In the long run we will need to improve the
vectorizers cost modelling, recognizing the instruction directly, but in
the meantime this treats min/max as before to prevent performance
regressions.
This test illustrates missed vectorization of loops with multiple
std::vector::at calls, like
int sum(std::vector<int> *A, std::vector<int> *B, int N) {
int cost = 0;
for (int i = 0; i < N; ++i)
cost += A->at(i) + B->at(i);
return cost;
}
https://clang.godbolt.org/z/KbYoaPhvq
This tests code starting from smin/smax, as opposed to the icmp/select
form. Also adds a ARM MVE phase ordering test for vectorizing to
sadd.sat from the original IR.
Proposed alternative to D105338.
This is ugly, but short-term I think it's the best way forward: first,
let's formalize the hacks into a coherent model. Then we can consider
extensions of that model (we could have different flavors of volatile
with different rules).
Differential Revision: https://reviews.llvm.org/D106309
In D106041, a freeze was added before the branch condition to solve the miscompilation problem of SimpleLoopUnswitch.
However, I found that the added freeze disturbed other optimizations in the following situations.
```
arg.fr = freeze(arg)
use(arg.fr)
...
use(arg)
```
It is a problem that occurred when arg and arg.fr were recognized as different values.
Therefore, changing to use arg.fr instead of arg throughout the function eliminates the above problem.
Thus, I add a function that changes all uses of arg to freeze(arg) to visitFreeze of InstCombine.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D106233
This function is called when some predecessor of an empty return block
ends with a conditional branch, with both successors being empty ret blocks.
Now, because of the way SimplifyCFG works, it might happen to simplify
one of the blocks in a way that makes a conditional branch
into an unconditional one, since it's destinations are now identical,
but it might not have actually simplified said conditional branch
into an unconditional one yet.
So, we have to check that ourselves first,
especially now that SimplifyCFG aggressively tail-merges
all ret and resume blocks.
Even if it was an unconditional branch already,
`SimplifyCFGOpt::simplifyReturn()` doesn't call `FoldReturnIntoUncondBranch()`
by default.
This pattern is visible in unrolled and vectorized loops.
Although the backend seems to be able to reassociate to
ideal form in the examples I looked at, we might as well
do that in IR for efficiency.
This bug was introduced with D105730 / 25ee55c0ba .
If we are not converting all of the operations of a reduction
into a vector op, we need to preserve the existing select form
of the remaining ops. Otherwise, we are potentially leaking
poison where it did not in the original code.
Alive2 agrees that the version that freezes some inputs
and then falls back to scalar is correct:
https://alive2.llvm.org/ce/z/erF4K2
`SinkCommonCodeFromPredecessors()` doesn't itself ensure that duplicate PHI nodes aren't created.
I suppose, we could teach it to do that on-the-fly (& account for the already-existing PHI nodes,
& adjust costmodel), the diff will be bigger than this.
The alternative is to schedule a new EarlyCSE pass invocation somewhere later in the pipeline.
Clearly, we don't have any EarlyCSE runs in module optimization passline, so this pattern isn't cleaned up...
That would perhaps better, but it will again have some compile time impact.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D106010
This has been a work-in-progress for a long time...we finally have all of
the pieces in place to handle vectorization of compare code as shown in:
https://llvm.org/PR41312
To do this (see PhaseOrdering tests), we converted SimplifyCFG and
InstCombine to the poison-safe (select) forms of the logic ops, so now we
need to have SLP recognize those patterns and insert a freeze op to make
a safe reduction:
https://alive2.llvm.org/ce/z/NH54Ah
We get the minimal patterns with this patch, but the PhaseOrdering tests
show that we still need adjustments to get the ideal IR in some or all of
the motivating cases.
Differential Revision: https://reviews.llvm.org/D105730
In the spirit of TRegions [0], this patch creates a custom state
machine for a generic target region based on the potentially called
parallel regions.
The code analysis is done interprocedurally via an abstract attribute
(AAKernelInfo). All outermost parallel regions are collected and we
check if there might be unknown outermost parallel regions for which
we need an indirect call. Other AAKernelInfo extensions are expected.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D101977
These are based on PR41312. There needs to be effort
from all of SimplifyCFG, InstCombine, SLP, and possibly
VectorCombine to get this into ideal form.
The metadata added in D102361 introduces a module flag that we can check
to determine if the module was compiled with `-fopenmp` enables. We can
now check for the precense of this instead of scanning the call graph
for OpenMP runtime functions.
Depends on D102361
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102423
Based ontop of D104598, which is a NFCI-ish refactoring.
Here, a restriction, that only empty blocks can be merged, is lifted.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D104597
After landing the globalization optimizations, the precense of globalization on
the device that was not put in shared or stack memory is a failed optimization
with performance consequences so it should indicate a missed remark.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104735
Summary:
The changes to globalization introduced in D97680 introduce a large amount of overhead by default. The old globalization method would always ignore globalization code if executing in SPMD mode. This wasn't strictly correct as data sharing is still possible in SPMD mode. The new interface is correct but introduces globalization code even when unnecessary. This optimization will use the existing HeapToStack transformation in the attributor to allow for unneeded globalization to be replaced with thread-private stack memory. This is done using the newly introduced library instances for the RTL functions added in D102087.
Depends on D97818
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102197
Summary:
Memory globalization is required to maintain OpenMP standard semantics for data sharing between
worker and master threads. The GPU cannot share data between its threads so must allocate global or
shared memory to store the data in. Currently this is implemented fully in the frontend using the
`__kmpc_data_sharing_push_stack` and __kmpc_data_sharing_pop_stack` functions to emulate standard
CPU stack sharing. The front-end scans the target region for variables that escape the region and
must be shared between the threads. Each variable then has a field created for it in a global record
type.
This patch replaces this functinality with a single allocation command, effectively mimicing an
alloca instruction for the variables that must be shared between the threads. This will be much
slower than the current solution, but makes it much easier to optimize as we can analyze each
variable independently and determine if it is not captured. In the future, we can replace these
calls with an `alloca` and small allocations can be pushed to shared memory.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D97680
A backedge-taken count doesn't refer to memory; returning a pointer type
is nonsense. So make sure we always return an integer.
The obvious way to do this would be to just convert the operands of the
icmp to integers, but that doesn't quite work out at the moment:
isLoopEntryGuardedByCond currently gets confused by ptrtoint operations.
So we perform the ptrtoint conversion late for lt/gt operations.
The test changes are mostly innocuous. The most interesting changes are
more complex SCEV expressions of the form "(-1 * (ptrtoint i8* %ptr to
i64)) + %ptr)". This is expected: we can't fold this to zero because we
need to preserve the pointer base.
The call to isLoopEntryGuardedByCond in howFarToZero is less precise
because of ptrtoint operations; this shows up in the function
pr46786_c26_char in ptrtoint.ll. Fixing it here would require more
complex refactoring. It should eventually be fixed by future
improvements to isImpliedCond.
See https://bugs.llvm.org/show_bug.cgi?id=46786 for context.
Differential Revision: https://reviews.llvm.org/D103656
Addition of this pass has been botched.
There is no particular reason why it had to be sold as an inseparable part
of new-pm transition. It was added when old-pm was still the default,
and very *very* few users were actually tracking new-pm,
so it's effects weren't measured.
Which means, some of the turnoil of the new-pm transition
are actually likely regressions due to this pass.
Likewise, there has been a number of post-commit feedback
(post new-pm switch), namely
* https://reviews.llvm.org/D37467#2787157 (regresses HW-loops)
* https://reviews.llvm.org/D37467#2787259 (should not be in middle-end, should run after LSR, not before)
* https://reviews.llvm.org/D95789 (an attempt to fix bad loop backedge metadata)
and in the half year past, the pass authors (google) still haven't found time to respond to any of that.
Hereby it is proposed to backout the pass from the pipeline,
until someone who cares about it can address the issues reported,
and properly start the process of adding a new pass into the pipeline,
with proper performance evaluation.
Furthermore, neither google nor facebook reports any perf changes
from this change, so i'm dropping the pass completely.
It can always be re-reverted should/if anyone want to pick it up again.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D104099
This is split off from D102002, and I think it is clear that
the difference in behavior was not intended. Options were
added to SimplifyCFG over time, but different chunks of
the pass pipelines were not kept in sync.
We can only scalarize memory accesses if we know the index is valid.
This patch adjusts canScalarizeAcceess to fall back to
computeConstantRange to check if the index is known to be valid.
Reviewed By: nlopes
Differential Revision: https://reviews.llvm.org/D102476
This was reverted due to performance regressions in ARM benchmarks,
which have since been addressed by D101196 (SCEV analysis improvement)
and D101778 (CGP reverse transform).
-----
The single-use case is handled implicity by converting the icmp
into a mask check first. When comparing with zero in particular,
we don't need the one-use restriction, as we only produce a single
icmp.
https://alive2.llvm.org/ce/z/MSixcmhttps://alive2.llvm.org/ce/z/GwpG0M
This patch adjusts the LTO pipeline in the new PM to run GlobalsAA
before LICM to match the legacy PM.
This fixes a regression where the new PM failed to vectorize loops that
require hoisting/sinking by LICM depending on GlobalsAA info.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D102345