Expand the `-order_file` also accept cstrings to order.
The purpose is to order hot cstrings for performance (implemented in
this diff), and then later on we can also order cold cstrings for
compression size win.
Due to the speciality of cstrings, there's no way to pass in symbol
names in the order file as the existing -order_file, so we expect `<hash
of cstring literal content>` to represent/identify each cstring.
```
// An order file has one entry per line, in the following format:
//
// <cpu>:<object file>:[<symbol name> | CStringEntryPrefix <cstring hash>]
//
// <cpu> and <object file> are optional.
// If not specified, then that entry tries to match either,
//
// 1) any symbol of the <symbol name>;
// Parsing this format is not quite straightforward because the symbol name
// itself can contain colons, so when encountering a colon, we consider the
// preceding characters to decide if it can be a valid CPU type or file path.
// If a symbol is matched by multiple entries, then it takes the
// lowest-ordered entry (the one nearest to the front of the list.)
//
// or 2) any cstring literal with the given hash, if the entry has the
// CStringEntryPrefix prefix defined below in the file. <cstring hash> is the
// hash of cstring literal content.
//
// Cstring literals are not symbolized, we can't identify them by name
// However, cstrings are deduplicated, hence unique, so we use the hash of
// the content of cstring literals to identify them and assign priority to it.
// We use the same hash as used in StringPiece, i.e. 31 bit:
// xxh3_64bits(string) & 0x7fffffff
//
```
The ordering of cstring has to happen during/before the finalizing of
the cstring section content in the `finalizeContents()` function, which
happens before the writer is run
---------
Co-authored-by: Sharon Xu <sharonxu@fb.com>
When using the linker flags `--icf=safe_thunks` and `--keep-icf-stabs`
together, an issue arises with the STABS debugging entries in the linked
output. The problem affects STABS entries for functions that are folded
via ICF using thunks.
For instance, if `func1` is merged into `func2` through a thunk, the
STABS entry for `func1` incorrectly points to the object file of
`func2`. This is incorrect behavior—each function’s STABS entry should
consistently point to its own original object file (e.g., the STABS
entry for `func1` should reference `func1`’s object file). This issue
causes `dsymutil` to not be able to retrieve the debug information for
the problematic function.
This patch corrects this behavior - making it so that STABS entries
always point to the correct object file.
The symbol string table does not have deduplication.
Here we add code to deduplicate the symbol string table.
This has a rather large size impact (20-30%) on unstripped binaries
(typically debug binaries) but no size impact on stripped
binaries(typically release binaries).
We enable deduplication by default and add a flag to disable it
(`-no-deduplicate-symbol-strings`).
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses cast
because we know expect isa<Symbol *>(rel.referent) to be true.
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Currently when `--icf=safe_thunks` is used, `STABS` entries cannot be
generated for ICF'ed functions. This is because if ICF converts a full
function into a thunk and then we generate a `STABS` entry for the
thunk, `dsymutil` will expect to find the entire function body at the
location of the thunk. Because just a thunk will be present at the
location of the `STABS` entry - dsymutil will generate invalid debug
info for such scenarios.
With this change, if `--icf=safe_thunks` is used and `--keep-icf-stabs`
is also specified, STABS entries will be created for all functions, even
merged ones. However, the STABS entries will point at the actual (full)
function body while having the name of the thunk. This way we still get
program correctness as well as correct DWARF data. When doing this, the
debug data will be identical to the scenario where we're using
`--icf=all` and `--keep-icf-stabs`, but the actual program will also
contain thunks, which won't show up in the DWARF data.
There is a bug in the current implementation of `--icf=safe_thunks`
where a STABS entry is emitted for generated thunks. This is problematic
as we end up generating invalid DWARF as dsymutil will think the entire
function body is at the thunk location, when in actuality there will
only be a single branch present. This will end up causing overlapping
DWARF entries.
To fix this we never generate STABS entries for such thunks.
The existing `--icf=safe_thunks` test is updated to also generate debug
info and we add a check that no corrupt DWARF is generated.
As a future TODO we need to make `--keep-icf-stabs` compatible with
`--icf=safe_thunks`.
Currently, our `safe` ICF mode only merges non-address-significant code,
leaving duplicate address-significant functions in the output. This
patch introduces `safe_thunks` ICF mode, which keeps a single master
copy of each function and replaces address-significant duplicates with
thunks that branch to the master copy.
Currently `--icf=safe_thunks` is only supported for `arm64`
architectures.
**Perf stats for a large binary:**
| ICF Option | Total Size | __text Size | __unwind_info | % total |
|-------------------|------------|-------------|---------------------|---------------------------|
| `--icf=none` | 91.738 MB | 55.220 MB | 1.424 MB | 0% |
| `--icf=safe` | 85.042 MB | 49.572 MB | 1.168 MB | 7.30% |
| `--icf=safe_thunks` | 84.650 MB | 49.219 MB | 1.143 MB | 7.72% |
| `--icf=all` | 82.060 MB | 48.726 MB | 1.111 MB | 10.55% |
So overall we can expect a `~0.45%` binary size reduction for a typical
large binary compared to the `--icf=safe` option.
**Runtime:**
Linking the above binary took ~10 seconds. Comparing the link
performance of --icf=safe_thunks vs --icf=safe, a ~2% slowdown was
observed.
This was comparing some .size() (uint64_t) against the sizeof a size_t
which changes with system bitness. This produced a warning that
brought this to my attention.
These tests were failing too on 32 bit Arm only:
lld :: MachO/objc-category-merging-complete-test.s
lld :: MachO/objc-category-merging-minimal.s
The assert I think meant to check the value of target->wordSize,
not the size of its type. Which is a type that changes size between
systems.
Local data is referenced in Objective-C metadata via section + offset
relocations on x86-64 rather than via symbols. Without this change, we
would crash on incorrect casts of the referents to `Defined`.
A basic test based on the existing `objc-relative-method-lists-simple.s`
adopted to x86-64 is added.
The default `dyld_chained_import` entry format only allocates 8 bits to
the library ordinal, of which values 0xF1 - 0xFF are reserved for
special ordinals (`BIND_SPECIAL_DYLIB_*` values). If there are more than
240 imported dylibs (as in the case of component builds of Chromium), we
need to switch to `dyld_chained_import_addend64`, which stores 16-bit
ordinals.
This expands on the fix in 4e572db. The issue is pretty similar: we
might put symbols in the GOT which don't need run-time binding, locally
defined personality symbols in this case. We should set their indirect
symbol table entries to `INDIRECT_SYMBOL_LOCAL` to help `strip` remove
these local names from the symbol table.
Checking if the symbol is private-extern doesn't cover all cases; it can
also be a non-weak extern function too, for instance; use the
`needsBinding()` helper to determine it. This was the case for the
personality function in statically linked Rust executables.
The extra non-`LOCAL` symbols triggered a bug in Apple's `strip`
implementation. As the indirect value for the personality function was
not set to the flag, but the symbol didn't require binding, it tried to
make the symbol local, overwriting the GOT entry with the function's
address in the process. This normally wouldn't be a problem, but if
chained fixups are used, the fixup also encodes the offset to the next
fixup, and it effectively zeroed this offset out, causing the remaining
relocations on the page to not be performed by dyld.
This caused the crash in https://issues.chromium.org/issues/325410295
The change in tests is a bit ugly, as a lot of symbol information is now
removed by turning more symbols `LOCAL`.
This change adds the `--keep-icf-stabs` which, when specified, preserves
symbols that were folded by ICF in the binary's stabs entries.
This allows `dsymutil` to process debug information for the folded
symbols.
Currently, when moving symbols from one `InputSection` to another (like
in ICF) we directly update the symbol's `isec`, `unwindEntry` and
`size`. By doing this we lose the original information. This information
will be needed in a future change. Since when moving symbols we always
set the symbol's `wasCoalesced` and `isec-> replacement`, we can just
use this info to conditionally get the information we need at access
time.
Previously, `makeSyntheticInputSection` would create a new
`ConcatInputSection` without setting `live` explicitly for it. Without
`-dead_strip` this would be OK since `live` would default to `true`.
However, with `-dead_strip`, `live` would default to false, and it would
remain set to `false`.
This hasn't resulted in any issues so far since no code paths that
exposed this issue were present.
However a recent change - ObjC relative method lists
(https://github.com/llvm/llvm-project/pull/86231) exposes this issue by
creating relocations to the `SyntheticInputSection`.
When these relocations are attempted to be written, this ends up with a
crash(assert), since the `SyntheticInputSection` they refer to is marked
as dead (`live` = `false`).
With this change, we set the correct behavior - `live` will always be
`true`. We add a test case that before this change would trigger an
assert in the linker.
The MachO format supports relative offsets for ObjC method lists. This
support is present already in ld64. With this change we implement this
support in lld also.
Relative method lists can be identified by a specific flag (0x80000000)
in the method list header. When this flag is present, the method list
will contain 32-bit relative offsets to the current Program Counter
(PC), instead of absolute pointers.
Additionally, when relative method lists are used, the offset to the
selector name will now be relative and point to the selector reference
(selref) instead of the name itself.
In a previous PR: https://github.com/llvm/llvm-project/pull/83878, the
intent was to make no functional changes, just refactor out the code for
reuse.
However, by creating `ObjCSelRefsSection` as a `SyntheticSection` - this
slightly changed the functionality of the application as the
`SyntheticSection` constructor registers the `SyntheticSection` as a
functional one - with an associated `SyntheticInputSection`.
With this change we remove this unintended consequence by making the
code not use a `SyntheticSection` as base, but just by having it be a
static helper.
Before this change, after `InputSection` objects are created, they need
to be added to the appropriate container for tracking.
The logic for selecting the appropriate container lives in `Driver.cpp`
/ `gatherInputSections`, where the `InputSection` is added to the
matching container depending on the input config and the type of
`InputSection`.
Also, multiple other locations also insert directly into `inputSections`
array - assuming that that is the appropriate container for the
`InputSection`'s they create. Currently this is the correct assumption,
however an upcoming feature will change this.
For an upcoming feature (relative method lists), we need to route
`InputSection`'s either to `inputSections` array or to a synthetic
section, depending on weather the relative method list optimization is
enabled or not.
We can achieve the above either by duplicating some of the logic or
refactoring the routing and `InputSection`'s and reusing that.
The refactoring & code sharing approach seems the correct way to go - as
such this diff performs the refactoring while not introducing any
functional changes. Later on we can just call `addInputSection` and not
have to worry about routing logic.
---------
Currently ObjCStubsSection is handling both the logic for the
"__objc_stubs" section, as well as the logic for the "__objc_selrefs"
section.
While this is OK for now, it will be an issue for other features that
want to interact with the "__objc_selrefs" section, such as upcoming
relative method lists feature - which will also want to create /
reference entries in the "__objc_selrefs" section.
In this PR we split the logic relating to handling the "__objc_selrefs"
section into a new SyntheticSection (ObjCSelRefsSection). Non-functional
change - neither the behavior nor implementation changes, the interface
is just made more friendly to not have "__objc_selrefs" so bound to
"__objc_stubs".
---------
Co-authored-by: Alex B <alexborcan@meta.com>
When the data-in-code entries are in separate sections, they are not
guaranteed to be sorted. In particular, 68b1cc36f3df marked some libc++
string functions as noinline, which leads to global ctors involving
strings now producing data-in-code sections in __TEXT,__StaticInit,
which is why this now happens in practice.
Since data-in-code entries are relatively rare and small, just sort
them. No observed performance impact.
See also crbug.com/41487860
This patch implements `-objc_stubs_small` targeting arm64, aiming to
align with ld64's behavior.
1. `-objc_stubs_fast`: As previously implemented, this always uses the
Global Offset Table (GOT) to invoke `objc_msgSend`. The alignment of the
objc stub is 32 bytes.
2. `-objc_stubs_small`: This behavior depends on whether `objc_msgSend`
is defined. If it is, it directly jumps to `objc_msgSend`. If not, it
creates another stub to indirectly jump to `objc_msgSend`, minimizing
the size. The alignment of the objc stub in this case is 4 bytes.
Detail:
LD64 uses the name provided via -[dylib]install_name as "Identifier", when available.
For compatiblity, LLD should do that too.
Differential Revision: https://reviews.llvm.org/D155508
xxh3 is substantially faster than xxh64.
For lld/ELF, there is substantial speedup in `.debug_str` duplicate
elimination (D154813). Use xxh3 for lld-macho as well.
Reviewed By: #lld-macho, oontvoo
Differential Revision: https://reviews.llvm.org/D155677
Apple deprecated bitcode in the deployment process in Xcode 14.0. Last
month Apple started requiring Xcode 14.1+ to submit apps to the App
Store. Since there isn't a use for bundling bitcode outside of
submitting to the App Store we should be safe to delete this handling
entirely from LLD.
Differential Revision: https://reviews.llvm.org/D150697
We never really supported 32-bit ARM arch entirely, and partial support was added for
very specific features. Regardless, it fails to even link the most basic applications that at
this point, it might be better to move this arch as unsupported. Given that Apple will be
moving towards arm64 long term, I don't see any reason for anyone to invest time in
supporting this either, and for those who still need it should use apple's ld64 linker.
Fixes#62691
Reviewed By: #lld-macho, int3
Differential Revision: https://reviews.llvm.org/D150544
Specifically, we support this:
ld64.lld -dylib foo.o libbar.dylib -exported_symbol _bar -o libfoo.dylib
Where `_bar` is defined in libbar.dylib.
Reviewed By: #lld-macho, smeenai
Differential Revision: https://reviews.llvm.org/D144153
At some point PlatformInfo's Target changed types to a type that also
has minimum deployment target info. This caused ambiguity if you tried
to get the target triple from the Target, as the actual minimum version
info was being stored separately. This bulk of this change is changing
the parsing of these values to support this.
Differential Revision: https://reviews.llvm.org/D145263
I love C++17!
chromium_framework_less_dwarf on my 16-core Mac Pro shows no stat sig change in wall time but a slight decrease in user time:
```
base diff difference (95% CI)
sys_time 1.759 ± 0.037 1.761 ± 0.033 [ -0.9% .. +1.1%]
user_time 4.920 ± 0.043 4.886 ± 0.051 [ -1.2% .. -0.2%]
wall_time 5.950 ± 0.117 5.900 ± 0.116 [ -1.8% .. +0.2%]
samples 26 37
```
Reviewed By: #lld-macho, thakis
Differential Revision: https://reviews.llvm.org/D136518
... instead of mapping them to the intermediate object file.
This matches ld64.
Reviewed By: #lld-macho, Roger
Differential Revision: https://reviews.llvm.org/D136380
This commit adds support for chained fixups, which were introduced in
Apple's late 2020 OS releases. This format replaces the dyld opcodes
used for supplying rebase and binding information, and encodes most of
that data directly in the memory location that will have the fixup
applied.
This reduces binary size and is a requirement for page-in linking, which
will be available starting with macOS 13.
A high-level overview of the format and my implementation can be found
in SyntheticSections.h.
This feature is currently gated behind the `-fixup_chains` flag, and
will be enabled by default for supported targets in a later commit.
Like in ld64, lazy binding is disabled when chained fixups are in use,
and the `-init_offsets` transformation is performed by default.
Differential Revision: https://reviews.llvm.org/D132560
So @keith observed
[here](https://reviews.llvm.org/D128108#inline-1263900) that the
StringRefs we were returning from `CStringInputSection::getStringRef()`
included the null terminator in their total length, but regular
StringRefs do not. Let's fix that so these StringRefs are less confusing
to use.
Reviewed By: #lld-macho, keith, Roger
Differential Revision: https://reviews.llvm.org/D133728
Previously, we would add entries to DataInCodeSection in the order they
appeared in input files. Because of this, entries would not be sorted if
sections were reordered due to e.g. `-order_file` or call graph profile
sorting. ld64 always keeps data-in-code information sorted.
This commit also fixes an incorrect assertion. The original assertion
from D103006 used to check that data-in-code entries are sorted in the
input objects -- likely because we use binary search on that data. In
D115556, the assertion was moved into `collectDataInCodeEntries`, but
the checked variable's name was not changed, so it ended up checking the
final contents of the DataInCodeSection.
We no longer crash when building LLVM with PGO using an asserts build of
LLD as the linker.
Fixes https://bugs.chromium.org/p/chromium/issues/detail?id=1265937
Numbers for linking the Chromium Framework reproducer from #48001, which
has 6829 data-in-code entries:
x before
+ after
N Min Max Median Avg Stddev
x 20 2.1076453 2.3059683 2.1132485 2.1350302 0.049905767
+ 20 2.1069031 2.3915262 2.14465 2.1728429 0.084065898
No difference proven at 95.0% confidence
Differential Revision: https://reviews.llvm.org/D133581
This section stores 32-bit `__TEXT` segment offsets of initializer
functions, and is used instead of `__mod_init_func` when chained fixups
are enabled.
Storing the offsets lets us avoid emitting fixups for the initializers.
Differential Revision: https://reviews.llvm.org/D132947