We would always allocate maximum amount for vector containing
DWARFUnitInfo. In real usecases what ends up hapenning is we allocate a
giant vector when processing one CU, or for thin-lto case multiple CUs.
This lead to a lot of memory overhead, and 2x BOLT processing slowdown
for at least one service built with monolithic DWARF.
For binaries built with LTO with clang all of CUs that have cross
references will share an abbrev table and will be processed in one
batch. Rest of CUs are processesd in --cu-processing-batch-size size.
Which defaults to 1.
For theoretical cases where cross-cu references are present, but they do
not share abbrev will increase the size of CloneUnitCtxMap as each CU is
being processsed.
There was an assumpiton that TUs and CUs share .debug_str_offsets
contribution. For ThinLTO builds it is not the case. Changed so that we
parse contributions for TUs also, and did some refactoring so that we
don't re-parse contributions that were not modified.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
This patch fixes:
bolt/lib/Core/BinaryFunctionProfile.cpp:222:10: error: variable
'BBMergeSI' set but not used [-Werror,-Wunused-but-set-variable]
bolt/lib/Passes/VeneerElimination.cpp:67:12: error: variable
'VeneerCallers' set but not used [-Werror,-Wunused-but-set-variable]
Provide backwards compatibility for YAML profile that uses `std::hash`:
xxh3 hash is the default for newly produced profile (sets `std-hash:
false`),
whereas the profile that doesn't specify `std-hash` will be treated as
`std-hash: true`, preserving old behavior.
This diff defines and initializes auxiliary variables used by CDSplit
and implements two important helper functions. The first helper function
approximates the block level size increase if a function is hot-warm
split at a given split index (X86 specific). The second helper function
finds all calls in the form of X->Y or Y->X for each BF given function
order [... X ... BF ... Y ...]. These calls are referred to as "cover
calls". Their distance will decrease if BF's hot fragment size is
further reduced by hot-warm splitting. NFC.
This commit explicitly adds a warm code section, .text.warm, when
-split-functions -split-strategy=cdsplit is used. This replaces the
previous approach of using .text.cold.0 as warm and .text.cold.1 as cold
in 3-way function splitting. NFC.
Simplify code in fixBranches(). Mostly NFC, accept the x86-specific
check for code fragments now takes into account presence of more than
two fragments. Should only matter when we split code into multiple
fragments and can run fixBranches() more than once.
Also, don't replace a branch target with the same one, as such operation
may allocate memory for extra MCSymbolRefExpr.
Fixed handling of DWP as input. Before BOLT crashed. Now it will write
out
correct CU, and all the TUs. Potential future improvement is to scan all
the TUs
used in this CU, and only include those.
std::hash and ADT/Hashing::hash_value are non-deterministic functions
whose
results might vary across implementation/process/execution. Using xxh3
instead
for computing hashes of BinaryFunctions and BinaryBasicBlock for stale
profile
matching.
(A possible alternative is to use ADT/StableHashing.h based on FNV
hashing but
xxh3 seems to be more popular in LLVM)
This is to address https://github.com/llvm/llvm-project/issues/65241.
This is a follow-up to #73076. We need to reset output addresses for
deleted blocks, otherwise the address translation may mistakenly
attribute input address of a deleted block to a non-zero address.
While working on a test case, I've discovered that DWARF output ranges
were already broken for deleted basic blocks: #73428. I will provide a
test case for this PR with a DWARF address range fix PR.
This commit modifies BinaryContext::calculateEmittedSize() to update
the BinaryBasicBlock::OutputAddressRange of each basic block in the
function in place. BinaryBasicBlock::getOutputSize() now gives the
emitted size of the basic block.
Whenever LPStartEncoding was different from DW_EH_PE_omit, we used to
miscalculate LPStart. As a result, landing pads were assigned wrong
addresses. Fix that.
Now PIE is default supported after clang 14. It cause parsing error when
using perf2bolt. The reason is the base address can not get correctly.
Fix the method of geting base address. If SegInfo.Alignment is not equal
to pagesize, alignDown(SegInfo.FileOffset, SegInfo.Alignment) can not
equal to FileOffset. So the SegInfo.FileOffset and FileOffset should be
aligned by SegInfo.Alignment first and then judge whether they are
equal.
The .text segment's offset from base address in VAS is aligned by
pagesize. So MMapAddress's offset from base address is
alignDown(SegInfo.Address, pagesize) instead of
alignDown(SegInfo.Address, SegInfo.Alignment). So the base address
calculate way should be changed.
Co-authored-by: Li Zhuohang <lizhuohang3@huawei.com>
Previously HasFixedIndirectBranch was set in BF to set isSimple to false
later because of unreachable bb ellimination pass which might remove the
BB with it's symbols accessed by other instructions than calls. It seems
to be that better solution would be to add extra entry point on target
offset instead of marking BF as non-simple.
Use MCAsmBackend::writeNopData() interface to emit NOP instructions on
x86. There are multiple forms of NOP instruction on x86 with different
sizes. Currently, LLVM's assembly/disassembly does not support all forms
correctly which can lead to a breakage of input code semantics, e.g. if
the program relies on NOP instructions for reserving a patch space.
Add "--keep-nops" option to preserve NOP instructions.
When NOP instructions are used to reserve space in the code, e.g. for
patching, it becomes critical to preserve their original size while
emitting the code. On x86, we rely on "Size" annotation for NOP
instructions size, as the original instruction size is lost in the
disassembly/assembly process.
This change makes instruction size a first-class annotation and is
affectively NFCI. A follow-up diff will use the annotation for code
emission.
Closes https://github.com/llvm/llvm-project/issues/63097
Before merging please make sure the change to
bolt/include/bolt/Passes/StokeInfo.h is correct.
bolt/include/bolt/Passes/StokeInfo.h
```diff
// This Pass solves the two major problems to use the Stoke program without
- // proting its code:
+ // probing its code:
```
I'm still not happy about the awkward wording in this comment.
bolt/include/bolt/Passes/FixRelaxationPass.h
```
$ ed -s bolt/include/bolt/Passes/FixRelaxationPass.h <<<'9,12p'
// This file declares the FixRelaxations class, which locates instructions with
// wrong targets and fixes them. Such problems usually occures when linker
// relaxes (changes) instructions, but doesn't fix relocations types properly
// for them.
$
```
bolt/docs/doxygen.cfg.in
bolt/include/bolt/Core/BinaryContext.h
bolt/include/bolt/Core/BinaryFunction.h
bolt/include/bolt/Core/BinarySection.h
bolt/include/bolt/Core/DebugData.h
bolt/include/bolt/Core/DynoStats.h
bolt/include/bolt/Core/Exceptions.h
bolt/include/bolt/Core/MCPlusBuilder.h
bolt/include/bolt/Core/Relocation.h
bolt/include/bolt/Passes/FixRelaxationPass.h
bolt/include/bolt/Passes/InstrumentationSummary.h
bolt/include/bolt/Passes/ReorderAlgorithm.h
bolt/include/bolt/Passes/StackReachingUses.h
bolt/include/bolt/Passes/StokeInfo.h
bolt/include/bolt/Passes/TailDuplication.h
bolt/include/bolt/Profile/DataAggregator.h
bolt/include/bolt/Profile/DataReader.h
bolt/lib/Core/BinaryContext.cpp
bolt/lib/Core/BinarySection.cpp
bolt/lib/Core/DebugData.cpp
bolt/lib/Core/DynoStats.cpp
bolt/lib/Core/Relocation.cpp
bolt/lib/Passes/Instrumentation.cpp
bolt/lib/Passes/JTFootprintReduction.cpp
bolt/lib/Passes/ReorderData.cpp
bolt/lib/Passes/RetpolineInsertion.cpp
bolt/lib/Passes/ShrinkWrapping.cpp
bolt/lib/Passes/TailDuplication.cpp
bolt/lib/Rewrite/BoltDiff.cpp
bolt/lib/Rewrite/DWARFRewriter.cpp
bolt/lib/Rewrite/RewriteInstance.cpp
bolt/lib/Utils/CommandLineOpts.cpp
bolt/runtime/instr.cpp
bolt/test/AArch64/got-ld64-relaxation.test
bolt/test/AArch64/unmarked-data.test
bolt/test/X86/Inputs/dwarf5-cu-no-debug-addr-helper.s
bolt/test/X86/Inputs/linenumber.cpp
bolt/test/X86/double-jump.test
bolt/test/X86/dwarf5-call-pc-function-null-check.test
bolt/test/X86/dwarf5-split-dwarf4-monolithic.test
bolt/test/X86/dynrelocs.s
bolt/test/X86/fallthrough-to-noop.test
bolt/test/X86/tail-duplication-cache.s
bolt/test/runtime/X86/instrumentation-ind-calls.s
When NOP instructions are removed by BOLT and a DWARF address range
falls past the removed instructions, it may lead to invalid DWARF ranges
in the output binary. E.g. the range may fall outside of the basic block
boundaries.
This fix makes sure the modified range fits within the containing basic
block. A proper fix requires tracking instructions within the block and
will come in a different PR.
In 8244ff6739, I've introduced an
assertion that incorrectly used BasicBlock::empty(). Some basic blocks
may contain only pseudo instructions and thus BB->empty() will evaluate
to false, while the actual code size will be zero.
BOLT currently hooks its its instrumentation finalization function via
`DT_FINI`. However, this method of calling finalization routines is not
supported anymore on newer ABIs like RISC-V. `DT_FINI_ARRAY` is
preferred there.
This patch adds support for hooking into `DT_FINI_ARRAY` instead if the
binary does not have a `DT_FINI` entry. If it does, `DT_FINI` takes
precedence so this patch should not change how the currently supported
instrumentation targets behave.
`DT_FINI_ARRAY` points to an array in memory of `DT_FINI_ARRAYSZ` bytes.
It consists of pointer-length entries that contain the addresses of
finalization functions. However, the addresses are only filled-in by the
dynamic linker at load time using relative relocations. This makes
hooking via `DT_FINI_ARRAY` a bit more complicated than via `DT_FINI`.
The implementation works as follows:
- While scanning the binary: find the section where `DT_FINI_ARRAY`
points to, read its first dynamic relocation and use its addend to find
the address of the fini function we will use to hook;
- While writing the output file: overwrite the addend of the dynamic
relocation with the address of the runtime library's fini function.
Updating the dynamic relocation required a bit of boiler plate: since
dynamic relocations are stored in a `std::multiset` which doesn't
support getting mutable references to its items, functions were added to
`BinarySection` to take an existing relocation and insert a new one.
When annotating MCInst instructions, attach extra annotation operands
directly to the annotated instruction, instead of attaching them to an
instruction pointed to by a special kInst operand.
With this change, it's no longer necessary to allocate MCInst and most
of the first-class annotations come with free memory as currently MCInst
is declared with:
SmallVector<MCOperand, 10> Operands;
i.e. more operands than are normally being used.
We still create a kInst operand with a nullptr instruction value to
designate the beginning of annotation operands. However, this special
operand might not be needed if we can rely on MCInstrDesc::NumOperands.
We emit a symbol before an instruction for a number of reasons, e.g. for
tracking LocSyms, debug line, or if the instruction has a label
annotation. Currently, we may emit multiple symbols per instruction.
Reuse the same label instead of creating and emitting new ones when
possible. I'm planning to refactor EH labels as well in a separate diff.
Change getLabel() to return a pointer instead of std::optional<> since
an empty label should be treated identically to no label.
When we create new code for indirect code promotion optimization, we
should mark it as originating from the indirect jump instruction for
BOLT address translation (BAT) to map it to the original instruction.
Create BinaryFunction::translateInputToOutputRange() and use it for
updating DWARF debug ranges and location lists while de-duplicating the
existing code. Additionally, move DWARF-specific code out of
BinaryFunction and add print functions to facilitate debugging.
Note that this change is deliberately kept "bug-level" compatible with
the existing solution to keep it NFCI and make it easier to track any
possible regressions in the future updates to the ranges-handling code.
Some optimization passes may duplicate basic blocks and assign the same
input offset to a number of different blocks in a function. This is done
e.g. to correctly map debugging ranges for duplicated code.
However, duplicate input offsets present a problem when we use
AddressMap to generate new addresses for basic blocks. The output
address is calculated based on the input offset and will be the same for
blocks with identical offsets. The result is potentially incorrect debug
info and BAT records.
To address the issue, we have to eliminate the dependency on input
offsets while generating output addresses for a basic block. Each block
has a unique label, hence we extend AddressMap to include address lookup
based on MCSymbol and use the new functionality to update block
addresses.
We used to hard-code target features for RISC-V. However, most features
(with the exception of relax) are stored in the object file. This patch
extracts those features to ensure BOLT's output doesn't use any features
not present in the input file.
On RISC-V, GNU as produces the following initial instruction in CIE's:
```
DW_CFA_def_cfa_register: r2
```
While I believe it is technically illegal to use this instruction
without first using a `DW_CFA_def_cfa` (since the offset is undefined),
both `readelf` and `llvm-dwarfdump` accept this and implicitly set the
offset to 0.
In BOLT, however, this triggers an assert (in `CFISnapshot::advanceTo`)
as it (correctly) believes the offset is not set. This patch fixes this
by setting the offset to 0 whenever executing `DW_CFA_def_cfa_register`
while the offset is undefined.
Note that this is probably the simplest workaround but it has a
downside: while emitting CFI start, we check if the initial instructions
are contained within `MCAsmInfo::getInitialFrameState` and omit them if
they are. This will not be true for GNU CIE's (since they differ from
LLVM's) which causes an unnecessary `DW_CFA_def_cfa_register` to be
emitted.
While technically correct, it would probably be better to replace the
GNU CIE with the one used by LLVM to avoid this situation. This would
solve the problem this patch solves while also preventing unnecessary
CFI instructions. However, this is a bit trickier to implement correctly
so I propose to keep this for a later time.
Note on testing: the test creates a simple function with three basic
blocks and forces the CFI state of the last one to be different from the
others using an arbitrary CFI instruction. Then,
`--reorder-blocks=reverse` is used to force `CFISnapshot::advanceTo` to
be called. This causes an assert on the current main branch.
RISC-V supports mapping syms for code that encode the exact ISA for
which the code is valid. They have the form `$x<ISA>` where `<ISA>` is
the textual encoding of an ISA specification.
BOLT currently doesn't recognize these mapping symbols causing many
binaries compiled with newer versions of GCC (which emits them) to not
be properly processed. This patch makes sure BOLT recognizes them as
code markers.
Note that LLVM does not emit these kinds of mapping symbols yet so the
test is based on a binary produced by GCC.
Note that llvm::support::endianness has been renamed to
llvm::endianness while becoming an enum class as opposed to an
enum. This patch replaces support::{big,little,native} with
llvm::endianness::{big,little,native}.
Currently minimal alignment of function is hardcoded to 2 bytes.
Add 2 more cases:
1. In case BF is data in code return the alignment of CI as minimal
alignment
2. For aarch64 and riscv platforms return the minimal value of 4 (added
test for aarch64)
Otherwise fallback to returning the 2 as it previously was.
On RISC-V, there are certain relocations that target a specific
instruction instead of a more abstract location like a function or basic
block. Take the following example that loads a value from symbol `foo`:
```
nop
1: auipc t0, %pcrel_hi(foo)
ld t0, %pcrel_lo(1b)(t0)
```
This results in two relocation:
- auipc: `R_RISCV_PCREL_HI20` referencing `foo`;
- ld: `R_RISCV_PCREL_LO12_I` referencing to local label `1` which points
to the auipc instruction.
It is of utmost importance that the `R_RISCV_PCREL_LO12_I` keeps
referring to the auipc instruction; if not, the program will fail to
assemble. However, BOLT currently does not guarantee this.
BOLT currently assumes that all local symbols are jump targets and
always starts a new basic block at symbol locations. The example above
results in a CFG the looks like this:
```
.BB0:
nop
.BB1:
auipc t0, %pcrel_hi(foo)
ld t0, %pcrel_lo(.BB1)(t0)
```
While this currently works (i.e., the `R_RISCV_PCREL_LO12_I` relocation
points to the correct instruction), it has two downsides:
- Too many basic blocks are created (the example above is logically only
one yet two are created);
- If instructions are inserted in `.BB1` (e.g., by instrumentation),
things will break since the label will not point to the auipc anymore.
This patch proposes to fix this issue by teaching BOLT to track labels
that should always point to a specific instruction. This is implemented
as follows:
- Add a new annotation type (`kLabel`) that allows us to annotate
instructions with an `MCSymbol *`;
- Whenever we encounter a relocation type that is used to refer to a
specific instruction (`Relocation::isInstructionReference`), we
register it without a symbol;
- During disassembly, whenever we encounter an instruction with such a
relocation, create a symbol for its target and store it in an offset
to symbol map (to ensure multiple relocations referencing the same
instruction use the same label);
- After disassembly, iterate this map to attach labels to instructions
via the new annotation type;
- During emission, emit these labels right before the instruction.
I believe the use of annotations works quite well for this use case as
it allows us to reliably track instruction labels. If we were to store
them as offsets in basic blocks, it would be error prone to keep them
updated whenever instructions are inserted or removed.
I have chosen to add labels as first-class annotations (as opposed to a
generic one) because the documentation of `MCAnnotation` suggests that
generic annotations are to be used for optional metadata that can be
discarded without affecting correctness. As this is not the case for
labels, a first-class annotation seemed more appropriate.
Long tail calls use the following instruction sequence on RISC-V:
```
1: auipc xi, %pcrel_hi(sym)
jalr zero, %pcrel_lo(1b)(xi)
```
Since the second instruction in isolation looks like an indirect branch,
this confused BOLT and most functions containing a long tail call got
marked with "unknown control flow" and didn't get optimized as a
consequence.
This patch fixes this by detecting long tail call sequence in
`analyzeIndirectBranch`. `FixRISCVCallsPass` also had to be updated to
expand long tail calls to `PseudoTAIL` instead of `PseudoCALL`.
Besides this, this patch also fixes a minor issue with compressed tail
calls (`c.jr`) not being detected.
Note that I had to change `BinaryFunction::postProcessIndirectBranches`
slightly: the documentation of `MCPlusBuilder::analyzeIndirectBranch`
mentions that the [`Begin`, `End`) range contains the instructions
immediately preceding `Instruction`. However, in
`postProcessIndirectBranches`, *all* the instructions in the BB where
passed in the range. This made it difficult to find the preceding
instruction so I made sure *only* the preceding instructions are passed.
Handle the following relocations related to TLS local-exec and
initial-exec:
- R_RISCV_TLS_GOT_HI20
- R_RISCV_TPREL_HI20
- R_RISCV_TPREL_ADD
- R_RISCV_TPREL_LO12_I
- R_RISCV_TPREL_LO12_S
In addition, GNU ld has a quirk where after TLS le relaxation, two
unofficial relocation types may be emitted:
- R_RISCV_TPREL_I
- R_RISCV_TPREL_S
Since they are unofficial (defined in the reserved range of relocation
types), LLVM does not define them. Hence, I've defined them locally in
BOLT in a private namespace.