This diff speeds up CDSplit by not considering any hot-warm splitting
point that could break a fall-through branch from a basic block to its
most likely successor.
Co-authored-by: spupyrev <spupyrev@fb.com>
We would always allocate the maximum amount for the vector containing
DWARFUnitInfo. In real use cases what ends up happening is that we allocate a
giant vector when processing one CU, or, in the ThinLTO case, multiple CUs.
This led to a lot of memory overhead and a 2x BOLT processing slowdown
for at least one service built with monolithic DWARF.
For binaries built with LTO with clang, all CUs that have cross
references will share an abbrev table and will be processed in one
batch. The rest of the CUs are processed in batches of
--cu-processing-batch-size, which defaults to 1.
For theoretical cases where cross-CU references are present but the CUs
do not share an abbrev table, the size of CloneUnitCtxMap will increase
as each CU is processed.
Apparently some BOLT bots build with a pre-installed system clang, and others
use the just-built one. These two clangs now behave slightly differently when
it comes to ifunc codegen after https://github.com/llvm/llvm-project/pull/74902.
Change the test to accept both patterns.
There was an assumption that TUs and CUs share a .debug_str_offsets
contribution. For ThinLTO builds this is not the case. Changed so that we
parse contributions for TUs also, and did some refactoring so that we
don't re-parse contributions that were not modified.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
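A minimal illustration of the rename (the old spellings remain available
until the deprecation lands):

```cpp
#include "llvm/ADT/StringRef.h"
#include <cassert>

void example() {
  llvm::StringRef S("starts_with");
  assert(S.starts_with("starts")); // was S.startswith("starts")
  assert(S.ends_with("with"));     // was S.endswith("with")
}
```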
PR 75095 introduced some changes to lld that broke some dwarf tests that
were being incorrectly linked as a PIE. Add flags to disable any PIC/PIE
compilation, so the linker can succeed and the tests can run as
intended.
This patch fixes:
bolt/lib/Core/BinaryFunctionProfile.cpp:222:10: error: variable
'BBMergeSI' set but not used [-Werror,-Wunused-but-set-variable]
bolt/lib/Passes/VeneerElimination.cpp:67:12: error: variable
'VeneerCallers' set but not used [-Werror,-Wunused-but-set-variable]
Provide backwards compatibility for YAML profiles that use `std::hash`:
the xxh3 hash is the default for newly produced profiles (which set
`std-hash: false`),
whereas a profile that doesn't specify `std-hash` will be treated as
`std-hash: true`, preserving the old behavior.
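As a rough sketch of the intended default handling (hypothetical type and
field names, not the actual BOLT YAML schema; only the `std-hash` key comes
from above):

```cpp
#include "llvm/Support/YAMLTraits.h"

struct ProfileHeader {
  bool UseStdHash = true; // absent `std-hash` key => legacy std::hash behavior
};

template <> struct llvm::yaml::MappingTraits<ProfileHeader> {
  static void mapping(llvm::yaml::IO &IO, ProfileHeader &Header) {
    // Newly produced profiles write `std-hash: false` since xxh3 is now the
    // default; older profiles omit the key and are read back as true.
    IO.mapOptional("std-hash", Header.UseStdHash, true);
  }
};
```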
If a local stub is out-of-range, at LongJmp we will try to find another
local stub first. However, the original implementation does not work as
expected and leads to an infinite loop between replaceTargetWithStub
and fixBranches.
After this patch, we first convert the target of the BB back to the target
of the local stub, and then look for other valid local stubs, and so
on.
This diff implements the main splitting logic of CDSplit. CDSplit
processes functions in a binary in parallel. For each function BF, it
assumes that all other functions are hot-cold split. For each possible
hot-warm split point of BF, it computes its corresponding SplitScore,
and chooses the split point with the best SplitScore. The SplitScore of
each split point is computed in the following way: each call edge or
jump edge has an edge score that is proportional to its execution count,
and inversely proportional to its distance. The SplitScore of a split
point is a sum of edge scores over a fixed set of edges whose distance
can change due to hot-warm splitting BF. This set contains all cover
calls in the form of X->Y or Y->X given function order [... X ... BF ...
Y ...]; we refer to the sum of edge scores over the set of cover calls
as CoverCallScore. This set also contains all jump edges (branches)
within BF as well as all call edges originating from BF; we refer to the
sum of edge scores over this set of edges as LocalScore. CDSplit finds
the split index maximizing CoverCallScore + LocalScore.
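A rough sketch of the scoring scheme (hypothetical types and a simplified
edge-score formula; the actual implementation in SplitFunctions differs in
detail):

```cpp
#include <cstdint>
#include <vector>

struct Edge {
  uint64_t Count;    // execution count of the call/jump edge
  uint64_t Distance; // source-to-target distance for a candidate split point
};

// Edge score: proportional to execution count, inversely proportional to
// distance (zero distance is guarded only to keep this sketch well-defined).
static double edgeScore(const Edge &E) {
  return E.Distance == 0 ? 0.0 : double(E.Count) / double(E.Distance);
}

// SplitScore of one candidate split point = CoverCallScore + LocalScore.
static double splitScore(const std::vector<Edge> &CoverCalls,
                         const std::vector<Edge> &LocalEdges) {
  double Score = 0;
  for (const Edge &E : CoverCalls)
    Score += edgeScore(E); // calls X->Y or Y->X spanning BF's hot fragment
  for (const Edge &E : LocalEdges)
    Score += edgeScore(E); // jumps within BF and calls originating from BF
  return Score;
}
```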
This diff defines and initializes auxiliary variables used by CDSplit
and implements two important helper functions. The first helper function
approximates the block level size increase if a function is hot-warm
split at a given split index (X86 specific). The second helper function
finds all calls in the form of X->Y or Y->X for each BF given function
order [... X ... BF ... Y ...]. These calls are referred to as "cover
calls". Their distance will decrease if BF's hot fragment size is
further reduced by hot-warm splitting. NFC.
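A rough sketch of cover-call identification under a fixed function order
(container and field names are hypothetical, not BOLT's data structures):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct CallEdge {
  std::size_t FromIndex; // caller's position in the function order
  std::size_t ToIndex;   // callee's position in the function order
};

// A call X->Y (or Y->X) "covers" BF if BF lies strictly between X and Y in
// the layout: shrinking BF's hot fragment shortens that call's distance.
static std::vector<CallEdge> coverCalls(const std::vector<CallEdge> &AllCalls,
                                        std::size_t BFIndex) {
  std::vector<CallEdge> Result;
  for (const CallEdge &C : AllCalls) {
    std::size_t Lo = std::min(C.FromIndex, C.ToIndex);
    std::size_t Hi = std::max(C.FromIndex, C.ToIndex);
    if (Lo < BFIndex && BFIndex < Hi)
      Result.push_back(C);
  }
  return Result;
}
```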
When option --dwarf-output-path is specified, if the path does not exist
BOLT will now create it. This is what also happens when
--plugin-opt=dwo_dir=<value> is specified to LLD.
This commit explicitly adds a warm code section, .text.warm, when
-split-functions -split-strategy=cdsplit is used. This replaces the
previous approach of using .text.cold.0 as warm and .text.cold.1 as cold
in 3-way function splitting. NFC.
Simplify code in fixBranches(). Mostly NFC, except that the x86-specific
check for code fragments now takes into account the presence of more than
two fragments. Should only matter when we split code into multiple
fragments and can run fixBranches() more than once.
Also, don't replace a branch target with the same one, as such an operation
may allocate memory for an extra MCSymbolRefExpr.
This commit establishes the general structure of the CDSplit strategy in
SplitFunctions without incorporating the exact splitting logic. With
-split-functions -split-strategy=cdsplit, the SplitFunctions pass will
run twice: the first time is before function reordering and functions
are hot-cold split; the second time is after function reordering and
functions are hot-warm-cold split based on the fixed function ordering.
Currently, all functions are hot-warm split after the entry block in the
second splitting pass. Subsequent commits will introduce the precise
splitting logic. NFC.
Fixed handling of DWP as input. Before, BOLT crashed. Now it will write
out the correct CU and all the TUs. A potential future improvement is to
scan all the TUs used in this CU and only include those.
std::hash and ADT/Hashing::hash_value are non-deterministic functions whose
results might vary across implementation/process/execution. Use xxh3 instead
for computing hashes of BinaryFunctions and BinaryBasicBlocks for stale
profile matching.
(A possible alternative is to use ADT/StableHashing.h based on FNV hashing,
but xxh3 seems to be more popular in LLVM.)
This is to address https://github.com/llvm/llvm-project/issues/65241.
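A rough sketch of hashing an opaque byte string with LLVM's xxh3 helper from
llvm/Support/xxhash.h (illustrative only; BOLT's actual block and function
hashing composes more structure than this):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/xxhash.h"
#include <cstdint>
#include <string>

static uint64_t hashBytes(const std::string &S) {
  // Unlike std::hash, xxh3 is deterministic across processes and executions.
  return llvm::xxh3_64bits(llvm::ArrayRef<uint8_t>(
      reinterpret_cast<const uint8_t *>(S.data()), S.size()));
}
```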
NFC processing time script identifies tests by output filename.
When `/dev/null` is used as output filename, we're unable to tell the
source test, and the reports are unhelpful.
Replace `/dev/null/` with `%t.null` which resolves the issue.
This is a follow-up to #73076. We need to reset output addresses for
deleted blocks, otherwise the address translation may mistakenly
attribute input address of a deleted block to a non-zero address.
While working on a test case, I've discovered that DWARF output ranges
were already broken for deleted basic blocks: #73428. I will provide a
test case for this PR with a DWARF address range fix PR.
This commit modifies BinaryContext::calculateEmittedSize() to update
the BinaryBasicBlock::OutputAddressRange of each basic block in the
function in place. BinaryBasicBlock::getOutputSize() now gives the
emitted size of the basic block.
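A rough sketch of how the updated information could be consumed (illustrative
only; BC and Function stand for a BinaryContext and a BinaryFunction, and the
accessors are the ones named above):

```cpp
BC.calculateEmittedSize(Function);
for (BinaryBasicBlock &BB : Function) {
  // OutputAddressRange now holds the block's emitted [start, end) addresses,
  // so getOutputSize() is simply their difference.
  uint64_t BlockSize = BB.getOutputSize();
  (void)BlockSize;
}
```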
Whenever LPStartEncoding was different from DW_EH_PE_omit, we used to
miscalculate LPStart. As a result, landing pads were assigned wrong
addresses. Fix that.
The LTO flag is not needed for the test to work properly. However, it
may not build on a system where compiler and linker versions don't match
one another. Remove the LTO flag.
PIE is the default with clang 14 and newer. This causes a parsing error when
using perf2bolt because the base address cannot be obtained correctly.
Fix the method of getting the base address. If SegInfo.Alignment is not
equal to the page size, alignDown(SegInfo.FileOffset, SegInfo.Alignment)
may not equal FileOffset. So SegInfo.FileOffset and FileOffset should be
aligned by SegInfo.Alignment first before judging whether they are equal.
The .text segment's offset from the base address in the virtual address
space is aligned by the page size. So MMapAddress's offset from the base
address is alignDown(SegInfo.Address, pagesize) instead of
alignDown(SegInfo.Address, SegInfo.Alignment), and the base address
calculation should be changed accordingly.
Co-authored-by: Li Zhuohang <lizhuohang3@huawei.com>
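A rough sketch of the alignment logic described above using llvm::alignDown
from llvm/Support/MathExtras.h (the SegInfo field names mirror the text, but
the surrounding function is illustrative, not the actual implementation):

```cpp
#include "llvm/Support/MathExtras.h"
#include <cstdint>
#include <optional>

struct SegmentInfo {
  uint64_t Address;    // virtual address of the segment
  uint64_t FileOffset; // offset of the segment in the file
  uint64_t Alignment;  // segment alignment; may be smaller than the page size
};

static std::optional<uint64_t>
computeBaseAddress(uint64_t MMapAddress, uint64_t FileOffset,
                   const SegmentInfo &SegInfo, uint64_t PageSize) {
  // Compare the two file offsets after aligning both down by the segment
  // alignment, rather than requiring alignDown(SegInfo.FileOffset, Alignment)
  // to equal FileOffset exactly.
  if (llvm::alignDown(SegInfo.FileOffset, SegInfo.Alignment) !=
      llvm::alignDown(FileOffset, SegInfo.Alignment))
    return std::nullopt;
  // The segment's VA offset from the base address is page-aligned, so relate
  // MMapAddress to the base via the page size, not SegInfo.Alignment.
  return MMapAddress - llvm::alignDown(SegInfo.Address, PageSize);
}
```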
Previously, HasFixedIndirectBranch was set in a BF in order to later set
isSimple to false, because the unreachable-BB elimination pass might remove
a BB whose symbols are accessed by instructions other than calls. It seems
that a better solution is to add an extra entry point at the target offset
instead of marking the BF as non-simple.
Currently BOLT finds the LSDA section by its name, .gcc_except_table.
But sometimes it might have a suffix, e.g. .gcc_except_table.main. Find the
LSDA section by its address rather than by its name.
Fixes #71804
Use MCAsmBackend::writeNopData() interface to emit NOP instructions on
x86. There are multiple forms of NOP instruction on x86 with different
sizes. Currently, LLVM's assembly/disassembly does not support all forms
correctly which can lead to a breakage of input code semantics, e.g. if
the program relies on NOP instructions for reserving a patch space.
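A rough sketch of asking the backend for padding through this interface (the
wrapper function is hypothetical; actual emission in BOLT goes through the MC
layer):

```cpp
#include "llvm/MC/MCAsmBackend.h"
#include "llvm/MC/MCSubtargetInfo.h"
#include "llvm/Support/raw_ostream.h"
#include <cstdint>

// Returns false if the backend cannot produce `Count` bytes of NOPs; on x86
// the backend may combine multi-byte NOP encodings to reach the exact size.
static bool emitNopPadding(llvm::MCAsmBackend &MAB, llvm::raw_ostream &OS,
                           uint64_t Count, const llvm::MCSubtargetInfo *STI) {
  return MAB.writeNopData(OS, Count, STI);
}
```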
Add "--keep-nops" option to preserve NOP instructions.
When NOP instructions are used to reserve space in the code, e.g. for
patching, it becomes critical to preserve their original size while
emitting the code. On x86, we rely on the "Size" annotation for NOP
instruction sizes, as the original instruction size is lost in the
disassembly/assembly process.
This change makes instruction size a first-class annotation and is
effectively NFCI. A follow-up diff will use the annotation for code
emission.
After #70147, all primary annotation types are stored directly in the
instruction and hence there's no need for the temporary storage we've
used previously for repopulating preserved annotations.