Summary:
Profile reading was tightly coupled with building the CFG. Since I plan
to move to a new profile format that will be associated with the CFG,
it is critical to decouple the two phases.
We now read the profile right after the CFG is constructed, but
before it is "canonicalized", i.e. conditional tail calls (CTCs) will
still be there.
After reading the profile, we run a post-processing pass that fixes
the CFG and does some post-processing for debug info, such as
inference of fall-throughs, which is still required with the current
format.
Another good reason for decoupling is that we can use the profile
together with the CFG to more accurately record fall-through branches
during aggregation.
At the moment we use "Offset" annotations to facilitate location
of instructions corresponding to the profile. This might not be
super efficient. However, once we switch to the new profile format,
the offsets will no longer be needed. We might keep them for
the aggregator, but if we trust the LBR data, they might
not be strictly necessary.
I've tried to make the changes while keeping backwards compatibility.
This makes it easier to verify the correctness of the changes, but it
also means that we lose some accuracy of the profile.
Some refactoring is included.
Flag "-prof-compat-mode" (on by default) is used for bug-level
backwards compatibility. Disable it for more accurate tracing.
(cherry picked from FBD6506156)
Summary:
Here's an implementation of an abstract instruction iterator for the branch/call
analysis code in MCInstrAnalysis. I'm posting it up to see what you guys think.
It's a bit sloppy with constness and probably needs more tidying up.
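For the record, here is a minimal sketch of what such a type-erased iterator could look like (hypothetical; not the actual interface in this diff, and equality/constness are elided for brevity):

#include <memory>

namespace llvm { class MCInst; }

// Type-erased forward iterator over MCInst sequences, so the branch/call
// analysis can walk instructions without knowing the underlying container.
class MCInstIteratorSketch {
  struct Concept {
    virtual ~Concept() = default;
    virtual const llvm::MCInst &deref() const = 0;
    virtual void next() = 0;
    virtual std::unique_ptr<Concept> clone() const = 0;
  };
  template <typename It> struct Model : Concept {
    It Iter;
    explicit Model(It I) : Iter(I) {}
    const llvm::MCInst &deref() const override { return *Iter; }
    void next() override { ++Iter; }
    std::unique_ptr<Concept> clone() const override {
      return std::unique_ptr<Concept>(new Model(Iter));
    }
  };
  std::unique_ptr<Concept> Impl;

public:
  template <typename It>
  explicit MCInstIteratorSketch(It I) : Impl(new Model<It>(I)) {}
  MCInstIteratorSketch(const MCInstIteratorSketch &O) : Impl(O.Impl->clone()) {}
  const llvm::MCInst &operator*() const { return Impl->deref(); }
  MCInstIteratorSketch &operator++() { Impl->next(); return *this; }
};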
(cherry picked from FBD6244012)
Summary:
Add functionality to support reordering bzip2 compiled for
AArch64, with function splitting but without relocations:
* Expand the AArch64 backend to support inverting and analyzing
branches so the BOLT reordering machinery is able to shuffle
blocks and fix branches correctly;
* Add a new pass named LongJmp to insert stubs whenever code needs to
jump to the cold area when using function splitting, because of the
limited branch-target encoding range of AArch64 (as a RISC
architecture); see the sketch below.
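A minimal, self-contained sketch of the range check behind stub insertion (names hypothetical; the constant reflects the +/-128 MiB reach of AArch64 B/BL, which encode a 26-bit signed word offset):

#include <cstdint>

// Direct AArch64 branches (B/BL) can reach +/-128 MiB from the branch site.
constexpr int64_t kMaxDirectBranchRange = 128 * 1024 * 1024;

// If the target (e.g. in the cold area after function splitting) lies
// outside the encodable range, the branch must be routed through a stub
// that materializes the full target address.
bool needsStub(uint64_t From, uint64_t To) {
  const int64_t Delta = static_cast<int64_t>(To - From);
  return Delta < -kMaxDirectBranchRange || Delta >= kMaxDirectBranchRange;
}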
(cherry picked from FBD5748184)
Summary:
If you attempted to use a function filter on a binary in relocation mode, the resulting binary would probably crash because we weren't calling fixBranches() on all functions. This was breaking bughunter.sh.
I also strengthened the validation of basic blocks: the conditional branch should always be non-null when there are two successors.
(cherry picked from FBD6261930)
Summary:
Add support to read profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.
This patch also introduces different options to convert basic block
execution counts to edge counts, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.
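For illustration only (not necessarily the algorithm this patch implements), one simple conversion strategy distributes each block's execution count over its outgoing edges in proportion to the successors' counts:

#include <cstdint>
#include <vector>

struct Block {
  uint64_t Count = 0;               // sampled execution count of the block
  std::vector<size_t> Succs;        // indices of successor blocks
  std::vector<uint64_t> EdgeCounts; // inferred counts, parallel to Succs
};

void inferEdgeCounts(std::vector<Block> &CFG) {
  for (Block &B : CFG) {
    uint64_t SuccTotal = 0;
    for (size_t S : B.Succs)
      SuccTotal += CFG[S].Count;
    B.EdgeCounts.assign(B.Succs.size(), 0);
    for (size_t I = 0; I < B.Succs.size(); ++I)
      B.EdgeCounts[I] = SuccTotal != 0
                            ? B.Count * CFG[B.Succs[I]].Count / SuccTotal
                            : B.Count / B.Succs.size();
  }
}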
(cherry picked from FBD5664735)
Summary:
Add an implementation for shrink wrapping, a frame optimization
that moves callee-saved register spills from hot prologues to cold
successors.
(cherry picked from FBD4983706)
Summary:
Add jump table support to ICP. The optimization is basically the same
as ICP for tail calls. The big difference is that the profiling data
comes from the jump table and the targets are local symbols rather than
global.
I've removed an instruction from ICP for tail calls. The code used to
have a conditional jump to a block with a direct jump to the target, i.e.
B1: cmp foo,(%rax)
    jne B3
B2: jmp foo
B3: ...
this code is now:
B1: cmp foo,(%rax)
    je foo
B2: ...
The other changes in this diff:
- Move ICP + new jump table support to separate file in Passes.
- Improve the CFG validation to handle jump tables.
- Fix the double jump peephole so that the successor of the modified
block is updated properly. Also make sure that any existing branches
in the block are modified to properly reflect the new CFG.
- Add an invocation of the double jump peephole to SCTC. This allows
us to remove a call to peepholes/UCE occurring after fixBranches() in
the pass manager.
- Miscellaneous cleanups to BOLT output.
(cherry picked from FBD4727757)
Summary:
When we merge the original branch counts, we have to make sure
both of them have a profile; otherwise, set the count to COUNT_NO_PROFILE.
The misprediction count should be 0 in that case.
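The rule boils down to roughly this (hypothetical helper; COUNT_NO_PROFILE is the sentinel mentioned above):

#include <cstdint>
#include <limits>

constexpr uint64_t COUNT_NO_PROFILE = std::numeric_limits<uint64_t>::max();

// The merged count is valid only when both inputs carry a profile.
uint64_t mergeBranchCounts(uint64_t A, uint64_t B) {
  if (A == COUNT_NO_PROFILE || B == COUNT_NO_PROFILE)
    return COUNT_NO_PROFILE;
  return A + B;
}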
(cherry picked from FBD4837774)
Summary:
I split some of this out from the jumptable diff since it fixes the
double jump peephole.
I've changed the pass manager so that UCE and peepholes are no longer
called after SCTC. I've incorporated a call to the double jump fixer
into SCTC since it is needed to fix things up afterwards.
While working on fixing the double jump peephole I discovered a few
useless conditional branches that could be removed as well. I highly
doubt that removing them will improve perf at all but it does seem
odd to leave in useless conditional branches.
There are also some minor logging improvements.
(cherry picked from FBD4751875)
Summary:
Fix validateCFG to handle BBs that were generated from code that used
__builtin_unreachable().
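For example, code like the following yields a basic block with no terminator and no successors, which validateCFG now accepts:

int classify(int X) {
  if (X > 0)
    return 1;
  if (X < 0)
    return -1;
  if (X == 0)
    return 0;
  __builtin_unreachable(); // BB ends here with no successors
}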
Add -verify-cfg option to run CFG validation after every optimization
pass.
(cherry picked from FBD4641174)
Summary:
The new interface for handling Call Frame Information:
* CFI state at any point in a function (in CFG state) is defined by
CFI state at basic block entry and CFI instructions inside the
block. The state is independent of the basic block layout order
(this is implied by the CFG state but wasn't always true in the past).
* Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst *Inst) to
get CFI state at any given instruction in the program.
* No need to call fixCFIState() after any given pass. fixCFIState()
is called only once during function finalization, and any function
transformations after that point are prohibited.
* When introducing new basic blocks, make sure CFI state at entry
is set correctly and matches CFI instructions in the basic block
(if any).
* When splitting basic blocks, use getCFIStateAtInstr() to get
the state at the split point, and set the new basic block's CFI
state to this value; see the sketch below.
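A minimal, self-contained sketch of the state computation (hypothetical types; the real API is BinaryBasicBlock::getCFIStateAtInstr()):

#include <vector>

struct Instr {
  bool IsCFI = false;
  int StateAfter = 0; // CFI state in effect after this CFI instruction
};

struct Block {
  int EntryCFIState = 0; // CFI state at basic block entry
  std::vector<Instr> Insts;
};

// State at an instruction = entry state advanced by every CFI instruction
// preceding it in the block; the layout order of blocks never matters.
int getCFIStateAtInstr(const Block &BB, const Instr *Target) {
  int State = BB.EntryCFIState;
  for (const Instr &I : BB.Insts) {
    if (&I == Target)
      break;
    if (I.IsCFI)
      State = I.StateAfter;
  }
  return State;
}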
Introduce CFG_Finalized state to indicate that no further optimizations
are allowed on the function. This state is reached after we have synced
CFI instructions and updated EH info.
Rename "-print-after-fixup" option to "-print-finalized".
This diff fixes CFI for cases when we split conditional tail calls,
and for the indirect call promotion optimization.
(cherry picked from FBD4629307)
Summary:
Add pass to strip 'repz' prefix from 'repz retq' sequence. The prefix
is not used in Intel CPUs afaik. The pass is on by default.
(cherry picked from FBD4610329)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions during CFG creation for all
indirect calls. Right now indirect tail calls are not handled
since those functions are marked as not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When a candidate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
frequency of the top two call targets exceeds the threshold.
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; };
struct B : public A { virtual void foo() { ++B_count; } };
struct C : public A { virtual void foo() { ++C_count; } };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
Re-worked the way ICF operates. The pass now checks not just call
instructions but all references, including function pointers. Jump
tables are handled too.
(cherry picked from FBD4372491)
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches. It adds the following two command line options:
-simplify-conditional-tail-calls - simplify conditional tail calls by removing unnecessary jumps
-sctc-mode - mode for simplifying conditional tail calls
  =always - always perform sctc
  =preserve - only perform sctc when the branch direction is preserved
  =heuristic - use branch prediction data to control sctc
This optimization considers both of the following cases:
foo: ...
     jcc L1          # original
     ...
L1:  jmp bar         # TAILJMP
->
foo: ...
     jcc bar         # only if jcc L1 is expected to be taken
     ...
L1 is unreachable
OR
foo: ...
     jcc L2
L1:  jmp dest        # TAILJMP
L2:  ...
->
foo: jncc dest       # TAILJMP
L2:  ...
L1 is unreachable
For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true. The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function. We don't care about the contents of the fall-through block.
(cherry picked from FBD3719617)
Summary:
Allow UCE when blocks have EH info. Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them. We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed. The debugging code just skips
any BBs that have an invalid index.
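A sketch of the bookkeeping with hypothetical names:

#include <memory>
#include <vector>

struct Block {
  static constexpr unsigned InvalidIndex = ~0u;
  unsigned Index = 0;
  bool isValid() const { return Index != InvalidIndex; }
};

struct Function {
  std::vector<std::unique_ptr<Block>> Blocks;        // current layout
  std::vector<std::unique_ptr<Block>> DeletedBlocks; // kept until ~Function

  // UCE "removes" a block by invalidating its index and parking it aside,
  // so debugging-info structures that reference it remain safe to walk.
  void eraseBlock(size_t I) {
    Blocks[I]->Index = Block::InvalidIndex;
    DeletedBlocks.push_back(std::move(Blocks[I]));
    Blocks.erase(Blocks.begin() + I);
  }
};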
Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fall-through. I've added
a new routine to clean up these jmps, though @maks is working on
changing fixBranches() so that it can be used instead.
(cherry picked from FBD3793259)
Summary:
Add level for "-jump-tables=<n>" option:
1 - all jump tables are output in the same section (default).
2 - basic splitting: if the table is used, it is output to the hot
section, otherwise to the cold one.
3 - aggressively split compound jump tables and collect profile for
all entries.
Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.
(cherry picked from FBD3912119)
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock. I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.
(cherry picked from FBD3860899)
Summary:
Replace jumps to other unconditional jumps with the final
destination, e.g.
B0: ...
    jmp B1 (or jcc B1)
B1: jmp B2
->
B0: ...
    jmp B2 (or jcc B2)
This peephole removes 8928 double jumps from a test binary.
Note: after filtering out double jumps found in EH code and infinite
loops, the number of double jumps patched is 49 (24 for a clang
compiled test). The 24 in the clang build are all from external
libraries which have probably been compiled with gcc. This peephole
is still useful for cleaning up after ICP though.
(cherry picked from FBD3815420)
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations, right after fixBranches().
Since it's the last pass, it has to do its own UCE.
(cherry picked from FBD3838051)
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:
BOLT-INFO: program-wide dynostats after optimizations:
executed forward branches : 109706407 (+8.1%)
taken forward branches : 13769074 (-55.5%)
executed backward branches : 24517582 (-25.0%)
taken backward branches : 15330256 (-27.2%)
executed unconditional branches : 6009826 (-35.5%)
function calls : 17192114 (+0.0%)
executed instructions : 837733057 (-0.4%)
total branches : 140233815 (-2.3%)
taken branches : 35109156 (-42.8%)
Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number stays
synchronized with the real number of pseudo instructions.
(cherry picked from FBD3826995)
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).
We do require a conditional branch at the end of a basic block if
the block has 2 successors, as the CFG currently lacks support for
condition codes (and will probably stay that way). We only use this
branch instruction for its condition code; the destination is
determined by the CFG: the first successor represents the true/taken
branch, while the second successor represents the false/fall-through
branch.
When we reverse the branch condition, the CFG is updated accordingly.
The previous version used to insert jumps after some terminating
instructions, sometimes resulting in larger code than needed. As a
result, with the new version one extra function becomes overwritten
for the HHVM binary.
With this diff we also convert conditional branches with one successor
(the result of code using __builtin_unreachable()) into unconditional
jumps.
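A minimal, self-contained sketch of that contract (hypothetical types; the real fixBranches() also reverses the branch condition when needed, updating the CFG as described above):

#include <vector>

struct Block;

struct Branch {
  bool IsConditional;
  Block *Target;
};

struct Block {
  std::vector<Block *> Succs;       // [0] = true/taken, [1] = fall-through
  std::vector<Branch> Terminators;  // rewritten from scratch below
  Block *LayoutSuccessor = nullptr; // next block in the current layout
};

// Rewrite terminators purely from the CFG: the conditional branch targets
// the first successor; a jump to the second successor is emitted only when
// it is not the next block in the layout.
void fixBranches(Block &BB) {
  BB.Terminators.clear();
  if (BB.Succs.size() == 2) {
    BB.Terminators.push_back({/*IsConditional=*/true, BB.Succs[0]});
    if (BB.Succs[1] != BB.LayoutSuccessor)
      BB.Terminators.push_back({/*IsConditional=*/false, BB.Succs[1]});
  } else if (BB.Succs.size() == 1 && BB.Succs[0] != BB.LayoutSuccessor) {
    BB.Terminators.push_back({/*IsConditional=*/false, BB.Succs[0]});
  }
}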
(cherry picked from FBD3802062)
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
and I've saved level 2 for the noisiest of messages. Error messages should
never be suppressed by the verbosity level, only warnings and info messages.
The rationale behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().
With the exception of a few of the level 2 messages, I don't have any strong feelings about the others.
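In code the convention looks roughly like this (opts::Verbosity is a hypothetical flag name; outs()/errs()/dbgs()/DEBUG() are the LLVM facilities named above):

#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#define DEBUG_TYPE "bolt"

using namespace llvm;

namespace opts { extern int Verbosity; } // assumed cl::opt-backed flag

void reportFunctionStatus(bool HasError) {
  if (opts::Verbosity >= 1)
    outs() << "BOLT-INFO: shown at verbosity level 1 and above\n";
  if (HasError)
    errs() << "BOLT-ERROR: never suppressed by the verbosity level\n";
  DEBUG(dbgs() << "detailed dump, only visible inside DEBUG()\n");
}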
(cherry picked from FBD3814259)
Summary:
Added an ICF pass to BOLT that recognizes identical functions
and replaces references to these functions with references to just one
representative.
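The core idea, sketched self-contained (the real pass compares instruction streams and their references rather than strings):

#include <map>
#include <string>
#include <vector>

struct Function {
  std::string Key;              // stand-in for a hash of code + references
  Function *ReplacedBy = nullptr;
};

// The first function with a given key becomes the representative; later
// duplicates point at it, and all references (calls, function pointers)
// get retargeted to the representative.
void foldIdenticalFunctions(std::vector<Function> &Funcs) {
  std::map<std::string, Function *> Representative;
  for (Function &F : Funcs) {
    auto Ins = Representative.insert({F.Key, &F});
    if (!Ins.second)
      F.ReplacedBy = Ins.first->second;
  }
}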
(cherry picked from FBD3460297)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary: The inference algorithm for counts of fall-through edges now takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.
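Conceptually (hypothetical names, not the actual implementation), the adjusted inference is:

#include <cstdint>

// A block's fall-through count is what remains of its execution count after
// subtracting profiled taken branches and jumps/throws into landing pads.
uint64_t inferFallthroughCount(uint64_t BlockCount, uint64_t TakenBranches,
                               uint64_t LandingPadJumps) {
  const uint64_t Leaving = TakenBranches + LandingPadJumps;
  return BlockCount > Leaving ? BlockCount - Leaving : 0;
}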
(cherry picked from FBD3350727)
Summary:
A simple optimization to prevent branch misprediction for tail calls.
Convert the sequence:
j<cc> L1
...
L1: jmp foo # tail call
into:
j<cc> foo
but only if 'j<cc> foo' turns out to be a forward branch.
(cherry picked from FBD3234207)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges, except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h); it will also serve for location lists. See the sketch after this list.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
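A minimal sketch of what a block-relative range can look like (hypothetical fields; BasicBlockOffsetRanges.h holds the real representation):

#include <cstdint>

struct BinaryBasicBlock; // opaque here

// A DWARF range may begin or end mid-block, so each endpoint is stored as a
// basic block plus a byte offset into it. After optimizations move blocks,
// output addresses are recomputed from each block's new address plus the
// stored offset.
struct BasicBlockOffsetRange {
  const BinaryBasicBlock *BeginBB;
  uint16_t BeginOffset;
  const BinaryBasicBlock *EndBB;
  uint16_t EndOffset;
};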
(cherry picked from FBD3113181)
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFGs are built, allowing
us to test whether the builder is sane with LLVM LIT tests.
(cherry picked from FBD2534675)