The AMDGPU target has run into a situation that can be illustrated with the
following test case:
define void @dont_merge_cbranches(i32 %V) {
%divergent_cond = icmp ne i32 %V, 0
%uniform_cond = call i1 @uniform_result(i1 %divergent_cond)
br i1 %uniform_cond, label %bb2, label %exit, !prof !0
bb2:
br i1 %divergent_cond, label %bb3, label %exit
bb3:
call void @bar( )
br label %exit
exit:
ret void
}
!0 = !{!"branch_weights", i32 1, i32 100000}
SimplifyCFG merges the branches on %uniform_cond and %divergent_cond, which is undesirable because the first branch to bb2 is taken extremely rarely and the second branch is expensive. The merged branch becomes as expensive as the second.
This patch prevents such merging if the branch to the second branch is unlikely to be taken.
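The check can be pictured as follows. This is a minimal sketch, not the SimplifyCFG code: the name `shouldMergeBranches` and the 1% cutoff are hypothetical and chosen only for illustration. It derives the probability of reaching the second branch from the first branch's `branch_weights` and refuses to merge when that probability is below the cutoff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decide whether two conditional branches may be merged.
// TakenWeight is the weight of the edge that leads to the second branch,
// OtherWeight is the weight of the other edge of the first branch.
// The 1/100 cutoff is an assumption of this sketch, not a value from the patch.
bool shouldMergeBranches(uint64_t TakenWeight, uint64_t OtherWeight) {
  const uint64_t Total = TakenWeight + OtherWeight;
  assert(Total > 0 && "branch weights must not both be zero");
  // Merge only if the second branch is reached in at least 1% of executions.
  // The integer comparison avoids floating point; it assumes 32-bit weights
  // stored in 64-bit variables, so the multiplication cannot overflow.
  return TakenWeight * 100 >= Total;
}
```

With the weights from the test case, `shouldMergeBranches(1, 100000)` is false, so the two branches stay separate.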
This patch adds #include "llvm/ADT/SmallSet.h" to a couple of files
that are relying on transitive includes of SmallSet.h. It in turn
unblocks the removal of unnecessary includes of llvm/ADT/SmallSet.h in
several other files.
Close https://github.com/llvm/llvm-project/issues/56980.
This patch tries to introduce a light-weight optimization attribute for
coroutines which are guaranteed to be destroyed only after they have reached
the final suspend point.
The rationale behind the patch is simple. See the example:
```C++
A foo() {
dtor d;
co_await something();
dtor d1;
co_await something();
dtor d2;
co_return 43;
}
```
Generally the generated .destroy function may be:
```C++
void foo.destroy(foo.Frame *frame) {
switch(frame->suspend_index()) {
case 1:
frame->d.~dtor();
break;
case 2:
frame->d.~dtor();
frame->d1.~dtor();
break;
case 3:
frame->d.~dtor();
frame->d1.~dtor();
frame->d2.~dtor();
break;
default: // coroutine completed or hasn't started
break;
}
frame->promise.~promise_type();
delete frame;
}
```
This is because the compiler needs to be ready for every valid state in
which the coroutine may be destroyed.
However, from the user's perspective, we may know that certain
coroutine types are only ever destroyed after they have reached the final
suspend point, and we need a way to tell the compiler about this.
That is what this patch provides. Once the compiler knows that the
coroutine can only be destroyed after it has completed, it can optimize the
above example to:
```C++
void foo.destroy(foo.Frame *frame) {
frame->promise.~promise_type();
delete frame;
}
```
I spent a lot of time experimenting with this downstream, and the
numbers are really good. In a real-world coroutine-heavy workload, the
size of the build dir (including .o files) is reduced by 14%, and the
size of the final libraries (excluding the .o files) is reduced by 8% in
Debug mode and 1% in Release mode.
Fix the crash from the last landing of PR70542.
Note:
For '%add = add nuw i32 %x, 1', we can only infer that the LowerBound is 1,
but the UpperBound wraps to 0 in computeConstantRange,
so we can't assume the UpperBound is a valid bound when its value is 0.
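The wrap can be reproduced with plain 32-bit arithmetic. The sketch below is illustrative only: `Range32` and `rangeOfAddNuwOne` are hypothetical stand-ins for a half-open wrapped range and do not use LLVM's ConstantRange class.

```cpp
#include <cassert>
#include <cstdint>

// Half-open unsigned range [Lower, Upper) over 32-bit values.
// Hypothetical stand-in for illustration, not LLVM's ConstantRange.
struct Range32 {
  uint32_t Lower;
  uint32_t Upper;

  bool contains(uint32_t V) const {
    if (Lower < Upper)
      return Lower <= V && V < Upper; // ordinary range
    return V >= Lower || V < Upper;   // wrapped (or full) range
  }
};

// Range of `add nuw i32 %x, 1`: every value except 0, i.e. [1, 2^32).
// The exclusive upper bound 2^32 does not fit in 32 bits and wraps to 0.
Range32 rangeOfAddNuwOne() {
  const uint32_t Lower = 1;
  const uint32_t Upper =
      static_cast<uint32_t>(static_cast<uint64_t>(UINT32_MAX) + 1);
  assert(Upper == 0 && "2^32 truncated to 32 bits is 0");
  return {Lower, Upper};
}
```

Here an Upper of 0 does not mean "all values are below 0"; the range still contains UINT32_MAX, which is why 0 cannot be used as an ordinary upper bound.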
Fix https://github.com/llvm/llvm-project/issues/71329.
Reviewed By: zmodem, nikic
This reverts commit 957efa4ce4.
Original commit message below. In this follow-up, I've replaced
unnecessary inclusions of DebugProgramInstruction.h with forward
declarations (which I hope fixes the clang compile time), and fixed a
memory leak in the DebugInfoTest.cpp IR unittests.
I also tracked down a compile-time regression in D154080 (more explanation
there), as a result of which some of the changes are hidden behind the
EXPERIMENTAL_DEBUGINFO_ITERATORS compile-time flag. This is tested by the
"new-debug-iterators" buildbot.
[DebugInfo][RemoveDIs] Add prototype storage classes for "new" debug-info
This patch adds a variety of classes needed to record variable location
debug-info without using the existing intrinsic approach, see the rationale
at [0].
The two added files and corresponding unit tests are the majority of the
plumbing required for this, but at this point it isn't accessible from the
rest of LLVM as we need to stage it into the repo gently. An overview is
that classes are added for recording variable information attached to Real
(TM) instructions, in the form of DPValues and DPMarker objects. The
metadata-uses of DPValues are plumbed into the metadata hierarchy, and a
field is added to class Instruction, all of which are exercised in the unit
tests. The next few patches in this series add utilities to convert to/from
this new debug-info format and add instruction/block utilities to have
debug-info automatically updated in the background when various operations
occur.
This patch was reviewed in Phab in D153990 and D154080, I've squashed them
together into this commit as there are dependencies between the two
patches, and there's little profit in landing them separately.
[0] https://discourse.llvm.org/t/rfc-instruction-api-changes-needed-to-eliminate-debug-intrinsics-from-ir/68939
The current code structure results in cases where if a) we can't clone
the IV user (because it's not in our whitelist) or b) can't prove the
SCEV expressions are identical, we'd sometimes leave both the original
unwidened IV and the partially widened IV in code. Instead, just
truncate the wide IV to the use - same as what we'd do if we couldn't
find an addrec to start with.
Noticed this while playing with changing how we produce addrecs. The
current structure results in a very tight interlock between SCEV's
internal capabilities and indvars code.
See RFC for details:
https://discourse.llvm.org/t/rfc-for-moving-swift-s-merge-function-pass-to-llvm/73778
We will need to refactor the extension to FunctionComparator/FunctionHash
to StructuralHash. This patch adds a new pass which is ported from Swift,
and we will need to discuss how to migrate Swift’s pass over after we
land this in LLVM.
This PR was created to get some early review on the patch.
---------
Co-authored-by: Manman Ren <mren@meta.com>
IndVars has the existing notion of a narrow definition which is known to
be positive and thus both sign and zero extension kinds are actually the
same operations. There's existing logic for forming a SCEV based on the
extension kind and the no-wrap flags. This change extends that logic to
form the opposite extension kind for a positive def if doing so is
allowed by the flags. Note that we already do something analogous for
the getWideRecurrence case as well.
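The equivalence this relies on can be checked with ordinary integer casts. This is a minimal sketch in plain C++, not IndVars code; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend a 32-bit value to 64 bits.
uint64_t signExtend(uint32_t V) {
  return static_cast<uint64_t>(static_cast<int64_t>(static_cast<int32_t>(V)));
}

// Zero-extend a 32-bit value to 64 bits.
uint64_t zeroExtend(uint32_t V) { return static_cast<uint64_t>(V); }

// True when the sign bit of the narrow value is clear.
bool isNonNegative(uint32_t V) { return (V >> 31) == 0; }

// Extension of a value the caller has proved non-negative. Either kind of
// extension gives the same bits here, so the cheaper or more convenient one
// may be chosen.
uint64_t extendNonNegative(uint32_t V) {
  assert(isNonNegative(V) && "caller must have proved V is non-negative");
  return zeroExtend(V); // identical to signExtend(V) for such V
}
```

For a value with the sign bit set, such as 0x80000000, the two extensions differ, which is why the non-negativity proof is required.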
zext nneg was recently added to the IR in #67982. This patch teaches
SimplifyIndVars to prefer zext nneg over *both* sext and plain zext,
when a local SCEV query indicates the source is non-negative.
The choice to prefer zext nneg over sext looks slightly aggressive
here, but probably isn't so much in practice. For cases where we'd
"remember" the range fact, instcombine would convert the sext into
a zext nneg anyways. The only cases where this produces a different
result overall are when SCEV knows a non-local fact, and it doesn't
get materialized into the IR. Those are exactly the cases where
using zext nneg is most useful. We do run the risk of e.g. a
missing combine - since we haven't updated most of them yet - but
that seems like a manageable risk.
Note that there are much deeper algorithmic changes we could make
to this code to exploit zext nneg, but this seemed like a reasonable
and low risk starting point.
BOLT fails to process binaries in non-LBR mode, as some blocks are marked as
having a zero execution count. Adjust code layout to process such blocks
without assertions. This is NFC for all other use cases.
And some intervening fixups. There are two remaining problems:
* A memory leak via https://lab.llvm.org/buildbot/#/builders/236/builds/7120/steps/10/logs/stdio
* A performance slowdown with -g where I'm not completely sure what the cause is
These might be fairly straightforward to fix, but it's the end of the day
here, so I figure I'll clear the buildbots until tomorrow.
This reverts commit 7d77bbef4a.
This reverts commit 9026f35afe.
This reverts commit d97b2b389a.
This patch adds a variety of classes needed to record variable location
debug-info without using the existing intrinsic approach, see the rationale
at [0].
The two added files and corresponding unit tests are the majority of the
plumbing required for this, but at this point it isn't accessible from the
rest of LLVM as we need to stage it into the repo gently. An overview is
that classes are added for recording variable information attached to Real
(TM) instructions, in the form of DPValues and DPMarker objects. The
metadata-uses of DPValues are plumbed into the metadata hierarchy, and a
field is added to class Instruction, all of which are exercised in the unit
tests. The next few patches in this series add utilities to convert to/from
this new debug-info format and add instruction/block utilities to have
debug-info automatically updated in the background when various operations
occur.
This patch was reviewed in Phab in D153990 and D154080, I've squashed them
together into this commit as there are dependencies between the two
patches, and there's little profit in landing them separately.
[0] https://discourse.llvm.org/t/rfc-instruction-api-changes-needed-to-eliminate-debug-intrinsics-from-ir/68939
There was a silly mistake in the expandBounds function that was using
the wrong type when calling expandCodeFor and always assuming the stride
is 64 bits. I've added the following test to defend this fix:
Transforms/LoopVectorize/ARM/mve-hoist-runtime-checks.ll
CloneModule is not currently designed to handle un-materialized Modules,
for example one created via a lazy initializer like
getLazyBitcodeModule(). In this case we get a somewhat cryptic
segmentation fault without a clear path forward.
In this patch, we add a comment to inform CloneModule users of this
shortcoming, and an assert to test for empty function bodies before the
segmentation fault is triggered.
This adds a writable attribute, which in conjunction with
dereferenceable(N) states that a spurious store of N bytes is
introduced on function entry. This implies that this many bytes
are writable without trapping or introducing data races. See
https://llvm.org/docs/Atomics.html#optimization-outside-atomic for
why the second point is important.
This attribute can be added to sret arguments. I believe Rust will
also be able to use it for by-value (moved) arguments. Rust likely
won't be able to use it for &mut arguments (tree borrows does not
appear to allow spurious stores).
In this patch the new attribute is only used by LICM scalar promotion.
However, the actual motivation for this is to fix a correctness issue
in call slot optimization, which needs this attribute to avoid
optimization regressions.
Followup to the discussion on D157499.
Differential Revision: https://reviews.llvm.org/D158081
zext nneg was recently added to the IR in #67982. Teaching SCEVExpander
to emit nneg when possible is valuable since SCEV may have proved
non-trivial facts about loop bounds which would otherwise be lost when
materializing the value.
In https://reviews.llvm.org/D64235 a new algorithm has been introduced
for updating the branch weights of latch blocks and their copies.
It increases the probability of going to the exit block for each next
peel iteration, calculating weights by (F - I * E, E), where:
- F is the weight of the edge from latch to header.
- E is the weight of the edge from latch to exit.
- I is the number of the peeling iteration.
E.g: Let's say the latch branch weights are (100,300) and the estimated
trip count is 4. If we peel off all 4 iterations the weights of the
copied branches will be:
0: (100,300)
1: (100,200)
2: (100,100)
3: (100,1)
https://godbolt.org/z/93KnoEsT6
So we make the original loop almost unreachable from the 3rd peeled copy
according to the profile data. But that's only true if the profiling
data is accurate.
An underestimated trip count can lead to performance issues with the
register allocator, which may decide to spill intervals inside the loop
assuming it's unreachable.
Since we don't know how accurate the profiling data is, it seems better
to set neutral 1/1 weights on the last peeled latch branch. After this
change, the weights in the example above will look like this:
0: (100,300)
1: (100,200)
2: (100,100)
3: (100,100)
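Both weight tables can be reproduced with a small sketch. It is not the loop peeling code: `peeledLatchWeights` and its parameters are hypothetical. It assumes the pairs in the example are written as (E, F - I * E) and that the second weight is clamped to 1 when the subtraction would reach zero; both assumptions are inferred from the example numbers rather than stated above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: weights of the latch branch in each peeled copy.
// E is the latch-to-exit weight, F the latch-to-header weight.
// Each pair is (E, remaining header weight).
// NeutralLast selects the new behaviour: 1/1 weights on the last peeled copy.
std::vector<std::pair<uint64_t, uint64_t>>
peeledLatchWeights(uint64_t E, uint64_t F, unsigned PeelCount,
                   bool NeutralLast) {
  assert(E > 0 && "exit weight must be positive");
  std::vector<std::pair<uint64_t, uint64_t>> Weights;
  for (unsigned I = 0; I < PeelCount; ++I) {
    const uint64_t Subtracted = static_cast<uint64_t>(I) * E;
    // F - I * E, clamped to 1 so the weight never becomes zero (assumption).
    uint64_t Header = F > Subtracted ? F - Subtracted : 1;
    if (NeutralLast && I + 1 == PeelCount)
      Header = E; // neutral 1/1 ratio on the last peeled latch branch
    Weights.emplace_back(E, Header);
  }
  return Weights;
}
```

Calling `peeledLatchWeights(100, 300, 4, false)` gives the first table, and passing `true` for the last argument gives the second.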
Co-authored-by: Aleksandr Popov <apopov@azul.com>
When the small mask value is less than 64, we can eliminate the check
for the upper limit of the range by enlarging the lookup table to cover the
maximum index value. (The final table size then grows to the next power of two.)
```
bool f(unsigned x) {
switch (x % 8) {
case 0: return 1;
case 1: return 0;
case 2: return 0;
case 3: return 1;
case 4: return 1;
case 5: return 0;
case 6: return 1;
// This would remove the range check: case 7: return 0;
}
return 0;
}
```
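For illustration, the switch above can be written by hand as a bitmap lookup. This is a sketch of the idea rather than the code SimplifyCFG emits; `fTable` is a hypothetical name. Because `x % 8` is always below 8, an 8-entry table covers every index and no range check is needed.

```cpp
#include <cassert>
#include <cstdint>

// Hand-written equivalent of f(): bit I of Table holds the result for case I.
// Cases 0, 3, 4 and 6 return 1. Case 7 has no explicit label in the switch,
// so its table entry takes the fall-through result, 0.
bool fTable(unsigned X) {
  constexpr uint8_t Table = 0b01011001;
  const unsigned Index = X % 8;
  assert(Index < 8 && "the mask guarantees an in-range index");
  return (Table >> Index) & 1u;
}
```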
Use WouldFitInRegister instead of fitsInLegalInteger to support
more result types besides bool.
Fixes https://github.com/llvm/llvm-project/issues/65120
Reviewed By: zmodem, nikic, RKSimon
Change cc2fbc648d introduced a -Wdangling
warning; use temporary variables to resolve it.
llvm/lib/Transforms/Utils/CodeLayout.cpp:764:27: error: temporary whose address is used as value of local variable '[minDensity, maxDensity]' will be destroyed at the end of the full-expression [-Werror,-Wdangling]
764 | std::minmax(ChainPred->density(), ChainSucc->density());
llvm/lib/Transforms/Utils/CodeLayout.cpp:764:49: error: temporary whose address is used as value of local variable '[minDensity, maxDensity]' will be destroyed at the end of the full-expression [-Werror,-Wdangling]
764 | std::minmax(ChainPred->density(), ChainSucc->density());
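The underlying issue is that the two-argument `std::minmax` returns a pair of references to its arguments, so passing the results of calls that return by value leaves the pair referring to destroyed temporaries. A minimal sketch of the usual fix follows; the `Chain` struct and `orderedDensities` are simplified stand-ins, not the CodeLayout.cpp code.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Simplified stand-in for the chain type. density() returns by value, so
// every call produces a temporary.
struct Chain {
  double Size;
  double ExecutionCount;
  double density() const {
    assert(Size > 0 && "chain must not be empty");
    return ExecutionCount / Size;
  }
};

std::pair<double, double> orderedDensities(const Chain &Pred,
                                           const Chain &Succ) {
  // Store the values in named locals first. std::minmax then returns
  // references to these locals, which outlive the structured binding.
  const double PredDensity = Pred.density();
  const double SuccDensity = Succ.density();
  auto [MinDensity, MaxDensity] = std::minmax(PredDensity, SuccDensity);
  return {MinDensity, MaxDensity};
}
```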
Aggressive inlining might produce huge functions with >10K of basic
blocks. Since BFI treats _all_ blocks and jumps as "hot" having
non-negative (but perhaps small) weight, the current implementation can
be slow, taking minutes to produce a layout. This change introduces a
few modifications that significantly (up to 50x on some instances)
speed up the computation. Some notable changes:
- reduced the maximum chain size to 512 (from the prior 4096);
- introduced MaxMergeDensityRatio param to avoid merging chains with
very different densities;
- dropped a couple of params that seem unnecessary.
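A density-ratio filter of this kind could look roughly like the sketch below. Only the parameter name MaxMergeDensityRatio comes from this change; the function `canMergeByDensity` and the exact comparison are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical filter: allow merging two chains only when the denser one is
// at most MaxMergeDensityRatio times as dense as the other.
bool canMergeByDensity(double Density1, double Density2,
                       double MaxMergeDensityRatio) {
  assert(Density1 >= 0 && Density2 >= 0 && "densities must be non-negative");
  assert(MaxMergeDensityRatio >= 1 && "ratio below 1 would reject everything");
  const double MinDensity = std::min(Density1, Density2);
  const double MaxDensity = std::max(Density1, Density2);
  // Written as a multiplication so a zero density needs no special case.
  return MaxDensity <= MinDensity * MaxMergeDensityRatio;
}
```

Skipping such pairs early means fewer candidate merges have to be scored, which is where the running-time saving would come from.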
Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure perf
difference for such small changes. I did test the performance of a clang-14
binary and did not record any perf or i-cache-related differences.
My 6 benchmarks, with ext-tsp runtime (the lower the better) and
"tsp-score" (the higher the better).
**Before**:
- benchmark 1:
num functions: 13,047
reordering running time is 2.4 seconds
score: 125503458 (128.3102%)
- benchmark 2:
num functions: 16,438
reordering running time is 3.4 seconds
score: 12613997277 (129.7495%)
- benchmark 3:
num functions: 12,359
reordering running time is 1.9 seconds
score: 1315881613 (105.8991%)
- benchmark 4:
num functions: 96,588
reordering running time is 7.3 seconds
score: 89513906284 (100.3413%)
- benchmark 5:
num functions: 1
reordering running time is 372 seconds
score: 21292505965077 (99.9979%)
- benchmark 6:
num functions: 71,155
reordering running time is 314 seconds
score: 29795381626270671437824 (102.7519%)
**After**:
- benchmark 1:
reordering running time is 2.2 seconds
score: 125510418 (128.3130%)
- benchmark 2:
reordering running time is 2.6 seconds
score: 12614502162 (129.7525%)
- benchmark 3:
reordering running time is 1.6 seconds
score: 1315938168 (105.9024%)
- benchmark 4:
reordering running time is 4.9 seconds
score: 89518095837 (100.3454%)
- benchmark 5:
reordering running time is 4.8 seconds
score: 21292295939119 (99.9971%)
- benchmark 6:
reordering running time is 104 seconds
score: 29796710925310302879744 (102.7565%)
C++20 comes with std::erase to erase a value from std::vector. This
patch renames llvm::erase_value to llvm::erase for consistency with
C++20.
We could make llvm::erase more similar to std::erase by having it
return the number of elements removed, but I'm not doing that for now
because nobody seems to care about that in our code base.
Since there are only 50 occurrences of erase_value in our code base,
this patch replaces all of them with llvm::erase and deprecates
llvm::erase_value.
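For reference, a helper with this shape can be written with the erase-remove idiom. The version below is a sketch placed in a hypothetical `sketch` namespace; it is not claimed to be the exact llvm::erase definition. Like the helper described above, it returns nothing.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

namespace sketch {
// Remove every element equal to V from container C (erase-remove idiom).
template <typename Container, typename ValueType>
void erase(Container &C, const ValueType &V) {
  C.erase(std::remove(C.begin(), C.end(), V), C.end());
}
} // namespace sketch

// Usage example: drop all 2s from a vector.
std::vector<int> eraseExample() {
  std::vector<int> Values = {1, 2, 3, 2, 4};
  sketch::erase(Values, 2);
  assert(std::count(Values.begin(), Values.end(), 2) == 0);
  return Values;
}
```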
Some passes have the limitation that they only support simple terminators:
branch/unreachable/return. Right now, they ask the pass manager to add the
LowerSwitch pass to eliminate `switch`. Let's manage this kind of pass
dependency ourselves. Also add an assertion in the related passes.
In emscripten we have a build mode (the default actually) where the
runtime never exits and therefore `__cxa_atexit` is a dummy/stub
function that does nothing. In this case we would like to be able to
completely DCE any otherwise-unused global dtor functions.
Fixes: https://github.com/emscripten-core/emscripten/issues/19993
When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).
```
4.6% 20.7Mi .text.hot
3.5% 15.9Mi .text
3.4% 15.2Mi .text.unknown
```
Add cl option `cdsort-max-chain-size`, which is similar to
`ext-tsp-max-chain-size`, and set it to 128, to improve performance.
In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort
--time-trace` builds, the "Total Sort sections" time is measured as follows:
* -mllvm -cdsort-max-chain-size=64: 1.321813
* -mllvm -cdsort-max-chain-size=128: 2.030425
* -mllvm -cdsort-max-chain-size=256: 2.927684
* -mllvm -cdsort-max-chain-size=512: 5.493106
* unlimited: 9 minutes
The remaining part takes 6.8s.