When the linker relaxes a GOT load, it changes an ADRP+LDR instruction pair
into ADRP+ADD, e.g. `adrp x0, :got:sym; ldr x0, [x0, :got_lo12:sym]` becomes
`adrp x0, sym; add x0, x0, :lo12:sym`. It is relatively straightforward to
detect and symbolize the second instruction in the disassembler. However, it
is not always possible to properly symbolize the ADRP instruction without
looking at the second instruction. Hence, we have the FixRelaxationPass that
adjusts the operand of ADRP by looking at the corresponding ADD.
This PR tries to properly symbolize ADRP earlier in the pipeline, i.e.
in AArch64MCSymbolizer. This change makes it easier to adjust the
instruction once we add AArch64 support in `scanExternalRefs()`.
Additionally, we get the benefit of seeing properly symbolized operands when
examining the function state prior to running FixRelaxationPass.
To disambiguate an ADRP operand that has a GOT relocation against it, we
look at the operand's contents/value. If it contains the address of a page
that is valid for GOT, we assume that the operand wasn't modified by the
linker and leave it to FixRelaxationPass to make the proper adjustment. If
the page referenced by the ADRP cannot point into GOT, that's an indication
that the linker has modified the operand, and we substitute the operand with
a non-GOT reference to the symbol.
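
To make the heuristic concrete, here is a minimal sketch of the page check,
assuming hypothetical GOTStart/GOTEnd bounds and a pre-decoded ADRP page
immediate (not BOLT's actual interface):

```cpp
#include <cstdint>

// Hypothetical sketch of the page check; GOTStart/GOTEnd and the pre-decoded
// ADRP immediate are assumptions, not BOLT's actual interface.
bool operandMayPointToGOT(uint64_t InstAddress, int64_t PageImm,
                          uint64_t GOTStart, uint64_t GOTEnd) {
  // ADRP materializes the 4KiB page (PC & ~0xfff) + imm * 4096.
  const uint64_t TargetPage =
      (InstAddress & ~uint64_t(0xfff)) + (uint64_t(PageImm) << 12);
  // The operand is "valid for GOT" if its page overlaps [GOTStart, GOTEnd).
  return TargetPage >= (GOTStart & ~uint64_t(0xfff)) &&
         TargetPage <= ((GOTEnd - 1) & ~uint64_t(0xfff));
}
```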
When an indirect branch instruction is decoded, the analyzeIndirectBranch
method is asked whether the instruction matches a well-known code pattern.
On AArch64, the only special pattern detected is a jump table, emitted as a
branch to the sum of a constant base address and a variable offset.
Therefore, if `Inst.getOpcode()` is one of `AArch64::BRA*`, Inst cannot
belong to such a jump table pattern, and we return early.
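
A hedged sketch of that early check (the opcode list is illustrative; the
real logic lives in analyzeIndirectBranch):

```cpp
#include "MCTargetDesc/AArch64MCTargetDesc.h" // AArch64 opcode enum
#include "llvm/MC/MCInst.h"

// Hedged sketch, not the exact BOLT code: authenticated branches (BRA*) can
// never be part of the jump-table dispatch pattern, so the analysis can
// return early for them.
static bool isAuthenticatedIndirectBranch(const llvm::MCInst &Inst) {
  switch (Inst.getOpcode()) {
  case llvm::AArch64::BRAA:
  case llvm::AArch64::BRAAZ:
  case llvm::AArch64::BRAB:
  case llvm::AArch64::BRABZ:
    return true; // branch with pointer authentication: not a jump table
  default:
    return false;
  }
}
```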
Instead of filtering and modifying relocations in readRelocations(),
preserve the relocation info and use it in the symbolizing disassembler.
This change mostly affects AArch64, where we need to look at the original
linker relocations in order to properly symbolize instruction operands.
These tests have been failing since:
commit 1cfca53b9f
Author: Arthur Eubanks <aeubanks@google.com>
Date: Wed Mar 12 16:20:13 2025 -0700
This patch works around the failures by removing some FileCheck
directives. Hopefully, BOLT folks can chime in and commit a proper
fix.
We used to filter out relocations corresponding to NOP+ADR instruction
pairs that were the result of the linker's "relaxation" optimization.
However, these relocations will be useful for reversing that optimization.
Keep the relocations, but ignore them while symbolizing ADR instruction
operands.
`perf2bolt` and `llvm-boltdiff` are no longer separate targets but symlinks
to `llvm-bolt`, created when `llvm-bolt` itself is installed. As a result,
when you try to build either of them directly, ninja reports that no such
targets exist.
Add AArch64MCSymbolizer that symbolizes `MCInst` operands during
disassembly. The symbolization was previously done in
`BinaryFunction::disassemble()`, but it is also required by
`scanExternalRefs()` for "lite" mode functionality. Hence, similar to
x86, I've implemented the symbolizer interface that uses
`BinaryFunction` relocations to properly create instruction operands. I
expect the result of the disassembly to be identical after the change.
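
Sketched below is the general shape of such a symbolizer (the relocation
lookup and operand rewrite are assumptions for illustration, not the exact
implementation):

```cpp
#include "bolt/Core/BinaryFunction.h"
#include "llvm/MC/MCDisassembler/MCSymbolizer.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCInst.h"

// Hedged sketch: the relocation lookup and operand rewrite below are
// illustrative assumptions, not the exact BOLT implementation.
class AArch64MCSymbolizer : public llvm::MCSymbolizer {
  llvm::bolt::BinaryFunction &Function;

public:
  AArch64MCSymbolizer(llvm::MCContext &Ctx, llvm::bolt::BinaryFunction &BF)
      : llvm::MCSymbolizer(Ctx, /*RelInfo=*/nullptr), Function(BF) {}

  bool tryAddingSymbolicOperand(llvm::MCInst &Inst, llvm::raw_ostream &CStream,
                                int64_t Value, uint64_t Address, bool IsBranch,
                                uint64_t Offset, uint64_t OpSize,
                                uint64_t InstSize) override {
    // If the function has a relocation against this instruction, emit a
    // symbolic operand instead of the raw immediate.
    const llvm::bolt::Relocation *Rel =
        Function.getRelocationAt(Address + Offset - Function.getAddress());
    if (!Rel || !Rel->Symbol)
      return false;
    Inst.addOperand(llvm::MCOperand::createExpr(
        llvm::MCSymbolRefExpr::create(Rel->Symbol, Ctx)));
    return true;
  }

  void tryAddingPcLoadReferenceComment(llvm::raw_ostream &CStream,
                                       int64_t Value,
                                       uint64_t Address) override {}
};
```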
The AArch64 disassembler was not calling `tryAddingSymbolicOperand()` for
`MOV` instructions. Fix that. Additionally, the disassembler marks `ldr`
instructions as branches by setting the `IsBranch` parameter to true. Ignore
the parameter and rely on the `MCPlusBuilder` interface instead.
I've modified the `--check-encoding` flag to verify the symbolization of
operands of instructions that have relocations against them.
Add BinaryContext::createInstructionPatch() interface for patching parts
of the original binary with new instruction sequences. Refactor
PatchEntries pass to use the new interface.
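
A hypothetical usage sketch (the exact signature and the createTrap helper
are assumptions):

```cpp
#include "bolt/Core/BinaryContext.h"

// Hypothetical usage sketch: createInstructionPatch() is assumed to take the
// patch address and the replacement instruction sequence.
void patchWithTrap(llvm::bolt::BinaryContext &BC, uint64_t PatchAddress) {
  llvm::bolt::InstructionListType Seq;
  llvm::MCInst Trap;
  BC.MIB->createTrap(Trap); // build a single trap instruction via MCPlusBuilder
  Seq.push_back(Trap);
  // Registers a patch that the rewriter emits over the original bytes.
  BC.createInstructionPatch(PatchAddress, Seq);
}
```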
In analyzeInstructionForFuncReference(), use MCPlusBuilder interface
while scanning symbolic operands of MCInst. Should be NFC on x86, but
will make the function work on other architectures. Note that it's
currently unused on non-x86 as its functionality is exclusive to safe
ICF that runs on x86 only.
Add two additional profile quality stats for CG (call graph) and CFG
(control flow graph) flow conservations besides the CFG discontinuity
stats introduced in #109683. The two new stats quantify how different
"in-flow" is from "out-flow" in the following cases where they should be
equal. The smaller the reported stats, the better the flow conservations
are.
CG flow conservation: for each function that is not a program entry, the
number of times the function is called according to CG ("in-flow")
should be equal to the number of times the transition from an entry
basic block of the function to another basic block within the function
is recorded ("out-flow").
CFG flow conservation: for each basic block that is not a function entry
or exit, the number of times the transition into this basic block from
another basic block within the function is recorded ("in-flow") should
be equal to the number of times the transition from this basic block to
another basic block within the function is recorded ("out-flow").
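
For the CFG case, a minimal sketch of the conservation measure might look as
follows; the exact statistic the pass reports is an assumption here:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A minimal sketch of the CFG flow-conservation measure; the exact reported
// statistic in the pass is an assumption.
struct Edge { uint64_t Count; }; // profiled transition count along a CFG edge

double flowConservationGap(const std::vector<Edge> &InEdges,
                           const std::vector<Edge> &OutEdges) {
  uint64_t InFlow = 0, OutFlow = 0;
  for (const Edge &E : InEdges)
    InFlow += E.Count;  // recorded transitions into the block
  for (const Edge &E : OutEdges)
    OutFlow += E.Count; // recorded transitions out of the block
  const uint64_t Max = std::max(InFlow, OutFlow);
  if (Max == 0)
    return 0.0; // no profile data: trivially conserved
  // Relative difference: 0 means perfectly conserved flow.
  return double(Max - std::min(InFlow, OutFlow)) / double(Max);
}
```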
Use `-v=1` for more detailed bucketed stats, and use `-v=2` to dump
functions / basic blocks with poor flow conservation.
A BOLT-instrumented binary today has a readable (R), writeable (W), and
executable (X) segment, which the Android system won't load due to its WX
attribute. The RWX segment was produced because BOLT performs two-step
linking: first for everything in the updated or rewritten input binary, and
next for the runtime library. Each link lays out sections in the order of RX
sections, followed by RO sections, followed by RW sections. So we could end
up with the RW section `.bolt.instr.counters` surrounded by a number of RO
and RX sections; a new text segment was then formed to include all RX
sections, which pulled in the RW section in the middle, hence the RWX
segment. One way to fix this is to separate the RW `.bolt.instr.counters`
section into its own segment by a) assigning regular page-aligned starting
addresses to `.bolt.instr.counters` and the section that follows it, and
b) creating two extra program headers accordingly.
When processing BOLTed binaries with a BAT section, we used to
indiscriminately use `BAT->getFallthroughsInTrace` to record
fall-throughs, even if the function is not covered by BAT.
Fix that by using the non-BAT, CFG-based `getFallthroughsInTrace` if the
function is not in BAT.
Test Plan: updated bolt-address-translation-yaml.test
BOLT used to mark multi-entry functions as non-simple in non-relocation
mode, with the reasoning that we can't move them due to potentially
undetected references. However, this doesn't apply in aggregation mode, as
BOLT doesn't perform optimizations there.
Relax this constraint in the case of an aggregation job.
Test Plan: added entry-point-fallthru.s
... which is caused by a seemingly recent change in BOLT's basic block
calculation, where function calls now appear to end basic blocks. I
don't have a pointer to the commit that caused this change; I'll be
looking for it later. For now, I'm trying to get the regression tests
passing again.
Sometimes we need to know the size of a symbol in addition to its address,
so we can start using the existing `BOLTLinker::lookupSymbolInfo()`
(which returns symbol address and size) and remove
`BOLTLinker::lookupSymbol()` (which only returns the symbol address). For
both, we need to check the return value, as it is wrapped in
`std::optional<>`, which makes the difference between them even smaller.
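
For illustration, migrating a call site might look like this (a sketch; the
surrounding code is hypothetical):

```cpp
#include "bolt/Core/Linker.h"
#include "llvm/ADT/StringRef.h"
#include <cstdint>
#include <optional>

// Sketch of migrating a lookupSymbol() call site to the unified interface.
std::optional<uint64_t> getSymbolAddress(const llvm::bolt::BOLTLinker &Linker,
                                         llvm::StringRef Name) {
  // lookupSymbolInfo() returns both address and size; callers that only need
  // the address can simply ignore the size field.
  if (auto Info = Linker.lookupSymbolInfo(Name))
    return Info->Address;
  return std::nullopt; // symbol not found: caller must handle this
}
```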
BOLT currently links and initializes all LLVM targets. This
substantially increases the binary size, and the link time if LTO is used.
Instead, only link the targets specified by BOLT_TARGETS_TO_BUILD. We
also have to initialize only those targets, so generate a
TargetConfig.def file with the necessary information. The way the
initialization is done mirrors what llvm-exegesis does.
This reduces llvm-bolt size from 137MB to 78MB for me.
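
The pattern is the usual .def-file X-macro; a hedged sketch (the macro name
and the generated file's contents are assumptions):

```cpp
#include "llvm/Support/TargetSelect.h"

// A generated TargetConfig.def would contain one line per configured
// backend, e.g.:
//   BOLT_TARGET(AArch64)
//   BOLT_TARGET(X86)
#define BOLT_TARGET(target)                                                    \
  LLVMInitialize##target##TargetInfo();                                        \
  LLVMInitialize##target##TargetMC();                                          \
  LLVMInitialize##target##AsmParser();                                         \
  LLVMInitialize##target##Disassembler();                                      \
  LLVMInitialize##target##Target();                                            \
  LLVMInitialize##target##AsmPrinter();

// Expands to initialization calls for exactly the configured targets.
void initializeConfiguredTargets() {
#include "TargetConfig.def"
}
#undef BOLT_TARGET
```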
Traces are triplets of branch source, target, and fall-through end (next
branch).
Traces simplify the differentiation of fall-throughs into local- and
external-origin ones, which improves performance over a profile with
undifferentiated fall-throughs by eliminating profile discontinuity in
call-to-continuation fall-throughs. This makes it possible to avoid
converting return profile into call-to-continuation profile, which may
introduce statistical biases.
The existing format makes provisions for local- (F) and external-origin (f)
fall-throughs, but the profile producer needs to know function
boundaries. BOLT has that information readily available, so providing
the origin branch of a fall-through is a functional replacement for the
fall-through kind (f or F). This also has the effect of combining
branches and fall-throughs into a single record.
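
For illustration, a trace record could look like the following (the `T`
mnemonic and the trailing count field are assumptions; only the
source/target/fall-through-end triplet is described above):

```
# T <branch_source> <branch_target> <fallthrough_end> <count>
T 400a10 400b20 400b3c 17
```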
As traces subsume other pre-aggregated profile kinds, BOLT may drop
support for them soon. Users of pre-aggregated profile format are
advised to migrate to the trace format.
Test Plan: Updated callcont-fallthru.s
`addPendingRelocation` is now the only way to add a pending relocation;
`addRelocation` can no longer be used for this.
Update the only user (`BinaryContextTester`).
Use LLVM's getMainExecutable() helper instead of rolling our own. This
will result in standard behavior across platforms, such as making sure
that symlinks are always resolved.
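
For reference, the idiom looks like this (a usage sketch, not BOLT's exact
call site):

```cpp
#include "llvm/Support/FileSystem.h"
#include <string>

// The standard LLVM idiom: the address of any symbol in the main executable
// anchors the lookup, and symlinks are resolved for us.
std::string getBinaryPath(const char *Argv0) {
  static int Anchor; // its address identifies the running executable
  return llvm::sys::fs::getMainExecutable(Argv0, (void *)&Anchor);
}
```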
When printing disassembly of a function with constant islands, include
the island info in the dump.
At the moment, only print islands in pre-CFG state. Include islands that
are interleaved with instructions.
A few tests generate a statically-linked position-independent executable
with `-nostdlib -Wl,--unresolved-symbols=ignore-all -pie` (`%clang`) and
test PLT handling. (--unresolved-symbols=ignore-all suppresses undefined
symbol errors and serves as a convenience hack.)
This relies on an unguaranteed linker behavior: a statically-linked PIE
does not necessarily generate PLT entries.
While current lld generates a PLT entry, it will change to suppress the
PLT entry to simplify internal handling and improve consistency.
(The behavior is inconsistent in GNU ld: some ports generate a .dynsym entry
while others don't; most seem to generate a PLT entry, but some ports use a
weird `R_*_NONE` relocation.)
When a function has an indirect branch with unknown control flow, we
preserve nops in order to keep all instruction offsets (from the start
of the function) the same in case the indirect branch is used by a
PC-relative jump table. However, when we know the control flow of the
function, we should be able to safely remove nops.
The code for jump table detection on AArch64 asserts liberally whenever
the input instruction sequence does not match the expected pattern. As a
result, BOLT fails to process binaries with such sequences instead of
ignoring functions with unknown control flow.
Remove asserts in analyzeIndirectBranchFragment() and mark indirect
jumps as instructions with unknown control flow instead.
Remove the options to generate autofdo data (unused) and `use-event-pc`
(not beneficial).
Cuts down perf2bolt time for an 11GB perf.data by 40s (11:10->10:30).
This fixes the following tests when clang is compiled with
`-DENABLE_LINKER_BUILD_ID=ON`:
BOLT :: AArch64/check-init-not-moved.s
BOLT :: X86/dwarf5-dwarf4-types-backward-forward-cross-reference.test
BOLT :: X86/dwarf5-locexpr-referrence.test
This fixes a number of issues introduced in #97130 when
LLVM_LIBDIR_SUFFIX is a non-empty string. Make sure that the libdir is
always referenced as `lib${LLVM_LIBDIR_SUFFIX}`, not as just `lib` or
`${CMAKE_INSTALL_LIBDIR}${LLVM_LIBDIR_SUFFIX}`.
This is the standard libdir convention for all LLVM subprojects. Using
`${CMAKE_INSTALL_LIBDIR}${LLVM_LIBDIR_SUFFIX}` would result in a
duplicate suffix.