Reuse data structures used by perf data reader for pre-aggregated data.
Combined with #136531 this allows using pre-aggregated data for heatmap.
Test Plan: heatmap-preagg.test
Remove duplicate profile parsing in heatmap construction, switching to
using parsed profile. #138798 adds support for using pre-aggregated
profile for heatmap construction.
Test Plan: added heatmap.test in
0868850a15
Sample is a general term covering both basic (IP) and branch (LBR)
profiles. Find and replace ambiguous uses of sample in a basic sample
sense.
Rename `RawBranchCount` into `RawSampleCount` reflecting its use for
both kinds of profile.
Rename `PF_LBR` profile type as `PF_BRANCH` reflecting non-LBR based
branch profiles (non-brstack SPE, synthesized brstack ETM/PT).
Follow-up to #137644.
Test Plan: NFC
For profile with LBR samples, binary function profile density is
computed as a ratio of executed bytes to function size in bytes.
For profile with IP samples, use the size of basic block containing the
sample IP as a numerator.
Test Plan: updated perf_test.test
The workaround was not implemented for BAT case, and it is no longer
needed with pre-aggregated traces, alternatively, the effect can be
achieved with `infer-fall-throughs` with old pre-aggregated format
(branches + ranges).
Test Plan: updated callcont-fallthru.s
DataAggregator supports reading different kinds of profile data:
- perf data: branch records or IP samples,
- pre-aggregated branch data.
Make profile quality reporting uniform across all kinds of input:
- out-of-range and mismatching samples,
- samples in cold code in BAT mode (profiled BOLTed binary).
Test Plan: NFCI
We used to report PLT traces as invalid (mismatching disassembled
function contents) because PLT functions are marked as pseudo and
ignored, thus missing CFG. However, such traces are not mismatching
the function contents. Accept them without attaching the profile.
Test Plan: updated callcont-fallthru.s
When processing BOLTed binaries with BAT section, we used to
indiscriminately use `BAT->getFallthroughsInTrace` to record
fall-throughs, even if the function is not covered by BAT.
Fix that by using non-BAT CFG-based `getFallthroughsInTrace` if the
function is not in BAT.
Test Plan: updated bolt-address-translation-yaml.test
Traces are triplets of branch source, target, and fall-through end (next
branch).
Traces simplify differentiation of fall-throughs into local- and
external-origin, which improves performance over profile with
undifferentiated fall-throughs by eliminating profile discontinuity in
call to continuation fall-throughs. This makes it possible to avoid
converting return profile into call to continuation profile which may
introduce statistical biases.
The existing format makes provisions for local- (F) and external- (f)
origin fall-throughs, but the profile producer needs to know function
boundaries. BOLT has that information readily available, so providing
the origin branch of a fall-through is a functional replacement of the
fall-through kind (f or F). This also has an effect of combining
branches and fall-throughs into a single record.
As traces subsume other pre-aggregated profile kinds, BOLT may drop
support for them soon. Users of pre-aggregated profile format are
advised to migrate to the trace format.
Test Plan: Updated callcont-fallthru.s
Remove options to generate autofdo data (unused) and `use-event-pc`
(not beneficial).
Cuts down perf2bolt time for 11GB perf.data by 40s (11:10->10:30).
When a binary has multiple text segments, the Size is computed as the
difference of the last address of these segments from the BaseAddress.
The base addresses of all text segments must be the same.
Introduces flag 'perf-script-events' for testing, which allows passing
perf events without BOLT having to parse them by invoking 'perf script'.
The flag is used to pass a mock perf profile that has two memory
mappings for a mock binary that has two text segments. The mapping
size is updated as `parseMMapEvents` now processes all text segments.
This caused test failures, see comment on the PR:
Failed Tests (2):
BOLT-Unit :: Core/./CoreTests/AArch64/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0
BOLT-Unit :: Core/./CoreTests/X86/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0
> When a binary has multiple text segments, the Size is computed as the
> difference of the last address of these segments from the BaseAddress.
> The base addresses of all text segments must be the same.
>
> Introduces flag 'perf-script-events' for testing. It allows passing perf events
> without BOLT having to parse them using 'perf script'. The flag is used to
> pass a mock perf profile that has two memory mappings for a mock binary
> that has two text segments. The size of the mapping is updated as this
> change `parseMMapEvents` processes all text segments.
This reverts commit 4b71b3782d.
When a binary has multiple text segments, the Size is computed as the
difference of the last address of these segments from the BaseAddress.
The base addresses of all text segments must be the same.
Introduces flag 'perf-script-events' for testing. It allows passing perf events
without BOLT having to parse them using 'perf script'. The flag is used to
pass a mock perf profile that has two memory mappings for a mock binary
that has two text segments. The size of the mapping is updated as this
change `parseMMapEvents` processes all text segments.
#109683 identified an issue with pre-aggregated profile where a call to
continuation fallthrough edge count is missing (profile discontinuity).
This issue only affects pre-aggregated profile but not perf data since
LBR stack has the necessary information to determine if the trace (fall-
through) starts at call continuation, whereas pre-aggregated fallthrough
lacks this information.
The solution is to look at branch records in pre-aggregated profiles
that correspond to returns and assign counts to call to continuation
fallthrough:
- BranchFrom is in another function or DSO,
- BranchTo may be a call continuation site:
- not an entry point/landing pad.
Note that we can't directly check if BranchFrom corresponds to a return
instruction if it's in external DSO.
Keep call continuation handling for perf data (`getFallthroughsInTrace`)
[1] as-is due to marginally better performance. The difference is that
return-converted call to continuation fallthrough is slightly more
frequent than other fallthroughs since the former only requires one LBR
address while the latter need two that belong to the profiled binary.
Hence return-converted fallthroughs have larger "weight" which affects
code layout.
[1] `DataAggregator::getFallthroughsInTrace`
fea18afeed/bolt/lib/Profile/DataAggregator.cpp (L906-L915)
Test Plan: added callcont-fallthru.s
Reviewers: maksfb, ayermolo, ShatianWang, dcci
Reviewed By: maksfb, ShatianWang
Pull Request: https://github.com/llvm/llvm-project/pull/109486
Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
to the static function size in bytes,
- profile density:
- functions are sorted by density in decreasing order, accumulating
their respective sample counts,
- profile density is the smallest density covering 99% of total sample
count.
In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in tail 1% sample
count) which is sufficient to optimize the binary well.
The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.
perf2bolt would print the warning if the density is below the threshold
and suggest to increase the sampling duration and/or frequency to reach
a given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```
Test Plan: updated pre-aggregated-perf.test
Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe
Reviewed By: WenleiHe, wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/101094
Align DataAggregator (Linux perf and pre-aggregated profile reader) to
DataReader (fdata profile reader) behavior: set BF->RawBranchCount which
is used in profile density computation (#101094).
Reviewers: ayermolo, maksfb, dcci, rafaelauler, WenleiHe
Reviewed By: WenleiHe
Pull Request: https://github.com/llvm/llvm-project/pull/101093
… segments in Elf binary.
The heuristic is improved by also taking into account that only
executable segments should contain instructions.
Fixes#109384.
Add probe inline tree information to YAML profile, at function level:
- function GUID,
- checksum,
- parent node id,
- call site in the parent.
This information is used for pseudo probe block matching (#99891).
The encoding adds/changes probe information in multiple levels of
YAML profile:
- BinaryProfile: add pseudo_probe_desc with GUIDs and Hashes, which
permits deduplication of data:
- many GUIDs are duplicate as the same callee is commonly inlined
into multiple callers,
- hashes are also very repetitive, especially for functions with
low block counts.
- FunctionProfile: add inline tree (see above). Top-level function
is included as root of function inline tree, which makes guid and
pseudo_probe_desc_hash fields redundant.
- BlockProfile: densely-encoded block probe information:
- probes reference their containing inline tree node,
- separate lists for block, call, indirect call probes,
- block probe encoding is specialized: ids are encoded as bitset
in uint64_t. If only block probe with id=1 is present, it's
encoded as implicit entry (id=0, omitted).
- inline tree nodes with identical probes share probe description
where node indices are combined into a list.
On top of #107970, profile with new probe encoding has the following
characteristics (profile for a large binary):
- Profile without probe information: 33MB, 3.8MB compressed (baseline).
- Profile with inline tree information: 92MB, 14MB compressed.
Profile processing time (YAML parsing, inference, attaching steps):
- profile without pseudo probes: 5s,
- profile with pseudo probes, without pseudo probe matching: 11s,
- with pseudo probe matching: 12.5s.
Test Plan: updated pseudoprobe-decoding-inline.test
Reviewers: wlei-llvm, ayermolo, rafaelauler, dcci, maksfb
Reviewed By: wlei-llvm, rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/107137
The flag currently controls writing of probe information in YAML
profile. #99891 adds a separate flag to use probe information for stale
profile matching. Thus `profile-use-pseudo-probes` becomes a misnomer
and `profile-write-pseudo-probes` better captures the intent.
Reviewers: maksfb, WenleiHe, ayermolo, rafaelauler, dcci
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/106364
Replace the map from addresses to list of probes with a flat vector
containing probe references sorted by their addresses.
Reduces pseudo probe parsing time from 9.56s to 8.59s and peak RSS from
9.66 GiB to 9.08 GiB as part of perf2bolt processing a large binary.
Test Plan:
```
bin/llvm-lit -sv test/tools/llvm-profgen
```
Reviewers: maksfb, rafaelauler, dcci, ayermolo, wlei-llvm
Reviewed By: wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/102904
Read pseudo probes in regular and BAT YAML profile generation, and
attach them to YAML profile basic blocks. This exposes GUID, probe id,
and probe type in profile for future use in stale profile matching.
Test Plan: updated pseudoprobe-decoding-inline.test
Reviewers: dcci, rafaelauler, ayermolo, maksfb
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/99554
Add a BinaryFunction field for pseudo probe function GUID.
Populate it during pseudo probe section parsing, and emit it in YAML
profile (both regular and BAT), along with function checksum.
To be used for stale function matching.
Test Plan: update pseudoprobe-decoding-inline.test
Disambiguate local functions using the containing file symbol in BAT
mode. Make local function naming consistent across BAT fdata and YAML
profiles.
Test Plan: updated register-fragments-bolt-symbols.s
Align the name to its counterpart `FTInfo` which avoids name aliasing
with llvm::bolt::BranchInfo and allows to drop namespace specifier.
Test Plan: NFC
Reviewers: maksfb, rafaelauler, ayermolo, dcci
Reviewed By: dcci
Pull Request: https://github.com/llvm/llvm-project/pull/92017
Switch from FuncBranchData intermediate maps (Intra/InterIndex)
to aggregated Data, same as one used by DataReader:
e62ce1f884/bolt/lib/Profile/DataReader.cpp (L385-L389)
This aligns the order of the output between YAMLProfileWriter and
writeBATYAML.
Test Plan: updated bolt-address-translation-yaml.test
Reviewers: rafaelauler, dcci, ayermolo, maksfb
Reviewed By: ayermolo, maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/91289
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
276 under llvm-project/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
Fix an issue where the profile for all branches that have a BRANCHENTRY
is dropped. If the branch has an entry in BAT, it will be translated to
its input offset. We used to only permit the basic block offset as a
branch source. Perform a lookup of containing basic block instead.
Test Plan: Updated bolt-address-translation-yaml.test
Reviewers: maksfb, dcci, rafaelauler, ayermolo
Reviewed By: maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/91273
Returns are ignored in perf/pre-aggregated/fdata profile reader (see
DataReader::convertBranchData). They are also omitted in
YAMLProfileWriter by virtue of not having the profile attached to them
in the reader, and YAMLProfileWriter converting the profile attached to
BinaryFunctions. Thus, return profile is universally ignored across all
profile types except BAT YAML.
To make returns ignored for YAML produced in BAT mode, we can:
1) ignore them in YAMLProfileReader,
2) omit them from YAML profile in profile conversion/writing.
The first option is prone to profile staleness issue, where the profiled
binary doesn't match the one to be optimized, and thus returns in the
profile can no longer be reliably detected (as we don't distinguish them
from calls in the profile).
The second option is robust to staleness but requires disassembling the
branch source instruction.
Test Plan: Updated bolt-address-translation-yaml.test
Reviewers: rafaelauler, dcci, ayermolo, maksfb
Reviewed By: maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/90807
Call site information setting was conditioned on branch information
presence for a given block. However, it's possible to have sampled
profile lacking one or the other for a given basic block.
Iterate over branch profiles and call profiles independently to cover
all recorded profile data.
Depends on https://github.com/llvm/llvm-project/pull/87569
Test Plan: Updated bolt/test/X86/yaml-secondary-entry-discriminator.s
Reviewers: ayermolo, dcci, maksfb, rafaelauler
Reviewed By: maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/87743
Move BAT parent function lookup outside `getLocationName`, to the
scope where we retrieve `FuncBranchData` linked with the function.
Previously DataAggregator would store branch profile recorded in the
split fragment in `FuncBranchData` associated with the fragment, and
perform name translation in `getLocationName` for symbol name only.
This works for fdata profile which is printed out as-is, but doesn't
work with BAT YAML profile writer which requires a combined profile.
The issue necessitated `fixupBATProfile` which partially addressed the
issue (reassigned inter-fragment calls back into intra-function
branches). However, `fixupBATProfile` fails to address disjoint
profiles (i.e. doesn't merge `FuncBranchData` for fragments back
into parent). This diff eliminates the need for `fixupBATProfile` by
removing the root cause of the issue.
Test Plan: NFC for existing tests
Reviewers: ayermolo, dcci, rafaelauler, maksfb
Reviewed By: maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/87569
Provide a mechanism to resolve call target information for calls from non-BAT
functions to BAT functions (`YAMLProfileWriter::convert`). Make it generic for
future use in BAT-to-BAT calls.
Test Plan: Updated bolt/test/X86/bolt-address-translation-yaml.test
Reviewers: ayermolo, maksfb, rafaelauler, dcci
Reviewed By: maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/86219
Attach call counters to YAML profile, covering inter-function control
flow.
Depends on: https://github.com/llvm/llvm-project/pull/86218
Test Plan:
Updated bolt/test/X86/bolt-address-translation-yaml.test
Relax assumptions that YAML output is not supported in BAT mode.
Set up basic infrastructure for emitting YAML for functions not covered
by BAT, such as from `.bolt.org.text` section (code identical to input binary
sans external refs), or non-rewritten functions in non-relocation mode (where
the function stays in the same section but BAT mapping is not emitted).
This diff only produces YAML profile for non-BAT functions (skipped,
non-simple). YAML profile for BAT functions is added in follow-up diffs:
- https://github.com/llvm/llvm-project/pull/76911 emits YAML profile with
internal control flow information only (branch profile),
- https://github.com/llvm/llvm-project/pull/76896 adds cross-function profile
(calls profile).
Test Plan: Added bolt/test/X86/bolt-address-translation-yaml.test
Reviewers: ayermolo, dcci, maksfb, rafaelauler
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/76910
Check if program header addresses fall into the kernel space to detect a
Linux kernel binary on x86-64.
Delete opts::LinuxKernelMode and use BinaryContext::IsLinuxKernel
instead.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
If you have a perf.data with Arm ETM data the only way to use perf2bolt
with Branch Aggregation is to first run `perf inject --itrace=l64i1us -o
perf-brstack.data` and then pass the new perf-brstack.data into
perf2bolt. perf2bolt then runs `perf script -F pid,ip,brstack` to
produce the brstacks.
This PR adds `--itrace` arg to perf2bolt to enable Itrace Aggregation.
It takes a string which is what is passed to the `perf script -F
pid,ip,brstack --itrace={0}`. This command produces the brstacks without
having to run perf inject and creating a new perf.data file.
perf2bolt launches a few perf script commands and stores the output in
temporary files before processing the output and cleaning them up before
it exits.
The command `perf script --show-mmap-events` outputs PERF_RECORD_MMAP2
and instruction tracing data but when processed it only looks for
PERF_RECORD_MMAP2 and the instruction tracing data is ignored. This is
fine for small amounts of instruction trace data but when I've recorded
Arm ETM or Intel PT AUX I get lots of it
By adding `--no-itrace` is will just show the PERF_RECORD_MMAP2 records
and will save on time running the `perf script`, disk space storing the
output & time parsing the output.
It is the same for `perf script --show-task-events` where BOLT is only
interested in the PERF_RECORD_COMM & PERF_RECORD_FORK records.
### Data
| Perf Record | Perf Data Size | MMap Size | MMap No Itrace Size |
|---|---|---|---|
| perf record -e cs_etm/@tmc_etr0/u | 137K | 4468K | 0.632K |
| perf record -e intel_pt//u | 890K | 33378K | 0.673K |