Summary:
Profile reading was tightly coupled with building the CFG. Since I plan
to move to a new profile format that will be associated with the CFG,
it is critical to decouple the two phases.
We now read the profile right after the CFG is constructed, but
before it is "canonicalized", i.e. conditional tail calls (CTCs) will
still be there.
After reading the profile, we run a post-processing pass that fixes
the CFG and does some post-processing for debug info, such as
inference of fall-throughs, which is still required with the current
format.
Another good reason for decoupling is that we can use the profile
together with the CFG to more accurately record fall-through branches
during aggregation.
At the moment we use "Offset" annotations to facilitate location
of instructions corresponding to the profile. This might not be
super efficient. However, once we switch to the new profile format,
the offsets will no longer be needed. We might keep them for
the aggregator, but if we trust the LBR data, they might
not be strictly necessary.
I've tried to make the changes while keeping backwards compatibility.
This makes it easier to verify the correctness of the changes, but it
also means that we lose some accuracy of the profile.
Some refactoring is included.
Flag "-prof-compat-mode" (on by default) is used for bug-level
backwards compatibility. Disable it for more accurate tracing.
(cherry picked from FBD6506156)
Summary:
Here's an implementation of an abstract instruction iterator for the branch/call
analysis code in MCInstrAnalysis. I'm posting it up to see what you guys think.
It's a bit sloppy with constness and probably needs more tidying up.
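For the record, here is a minimal sketch of what such a type-erased iterator could look like (hypothetical; not the actual interface in this diff, and equality/constness are elided for brevity):

#include <memory>

namespace llvm { class MCInst; }

// Type-erased forward iterator over MCInst sequences, so the branch/call
// analysis can walk instructions without knowing the underlying container.
class MCInstIteratorSketch {
  struct Concept {
    virtual ~Concept() = default;
    virtual const llvm::MCInst &deref() const = 0;
    virtual void next() = 0;
    virtual std::unique_ptr<Concept> clone() const = 0;
  };
  template <typename It> struct Model : Concept {
    It Iter;
    explicit Model(It I) : Iter(I) {}
    const llvm::MCInst &deref() const override { return *Iter; }
    void next() override { ++Iter; }
    std::unique_ptr<Concept> clone() const override {
      return std::unique_ptr<Concept>(new Model(Iter));
    }
  };
  std::unique_ptr<Concept> Impl;

public:
  template <typename It>
  explicit MCInstIteratorSketch(It I) : Impl(new Model<It>(I)) {}
  MCInstIteratorSketch(const MCInstIteratorSketch &O) : Impl(O.Impl->clone()) {}
  const llvm::MCInst &operator*() const { return Impl->deref(); }
  MCInstIteratorSketch &operator++() { Impl->next(); return *this; }
};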
(cherry picked from FBD6244012)
Summary:
Add functionality to support reordering bzip2 compiled for
AArch64, with function splitting but without relocations:
* Expand the AArch64 backend to support inverting and analyzing
branches so the BOLT reordering machinery is able to shuffle
blocks and fix branches correctly;
* Add a new pass named LongJmp to insert stubs whenever code needs to
jump to the cold area when using function splitting, because of the
limited branch-target encoding range of AArch64 (as a RISC
architecture); see the sketch below.
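A minimal, self-contained sketch of the range check behind stub insertion (names hypothetical; the constant reflects the +/-128 MiB reach of AArch64 B/BL, which encode a 26-bit signed word offset):

#include <cstdint>

// Direct AArch64 branches (B/BL) can reach +/-128 MiB from the branch site.
constexpr int64_t kMaxDirectBranchRange = 128 * 1024 * 1024;

// If the target (e.g. in the cold area after function splitting) lies
// outside the encodable range, the branch must be routed through a stub
// that materializes the full target address.
bool needsStub(uint64_t From, uint64_t To) {
  const int64_t Delta = static_cast<int64_t>(To - From);
  return Delta < -kMaxDirectBranchRange || Delta >= kMaxDirectBranchRange;
}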
(cherry picked from FBD5748184)
Summary:
If you attempted to use a function filter on a binary in relocation mode, the resulting binary would probably crash because we weren't calling fixBranches() on all functions. This was breaking bughunter.sh.
I also strengthened the validation of basic blocks: the conditional branch should always be non-null when there are two successors.
(cherry picked from FBD6261930)
Summary:
Add support to read profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.
This patch also introduces different options to convert basic block
execution counts to edge counts, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.
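For illustration only (not necessarily the algorithm this patch implements), one simple conversion strategy distributes each block's execution count over its outgoing edges in proportion to the successors' counts:

#include <cstdint>
#include <vector>

struct Block {
  uint64_t Count = 0;               // sampled execution count of the block
  std::vector<size_t> Succs;        // indices of successor blocks
  std::vector<uint64_t> EdgeCounts; // inferred counts, parallel to Succs
};

void inferEdgeCounts(std::vector<Block> &CFG) {
  for (Block &B : CFG) {
    uint64_t SuccTotal = 0;
    for (size_t S : B.Succs)
      SuccTotal += CFG[S].Count;
    B.EdgeCounts.assign(B.Succs.size(), 0);
    for (size_t I = 0; I < B.Succs.size(); ++I)
      B.EdgeCounts[I] = SuccTotal != 0
                            ? B.Count * CFG[B.Succs[I]].Count / SuccTotal
                            : B.Count / B.Succs.size();
  }
}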
(cherry picked from FBD5664735)
Summary:
Add an implementation for shrink wrapping, a frame optimization
that moves callee-saved register spills from hot prologues to cold
successors.
(cherry picked from FBD4983706)
Summary:
Add jump table support to ICP. The optimization is basically the same
as ICP for tail calls. The big difference is that the profiling data
comes from the jump table and the targets are local symbols rather than
global.
I've removed an instruction from ICP for tail calls. The code used to
have a conditional jump to a block with a direct jump to the target, i.e.
B1: cmp foo,(%rax)
    jne B3
B2: jmp foo
B3: ...
this code is now:
B1: cmp foo,(%rax)
    je foo
B2: ...
The other changes in this diff:
- Move ICP + new jump table support to separate file in Passes.
- Improve the CFG validation to handle jump tables.
- Fix the double jump peephole so that the successor of the modified
block is updated properly. Also make sure that any existing branches
in the block are modified to properly reflect the new CFG.
- Add an invocation of the double jump peephole to SCTC. This allows
us to remove a call to peepholes/UCE occurring after fixBranches() in
the pass manager.
- Miscellaneous cleanups to BOLT output.
(cherry picked from FBD4727757)
Summary:
When we merge the original branch counts, we have to make sure
both of them have a profile; otherwise, set the count to COUNT_NO_PROFILE.
The misprediction count should be 0 in that case.
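The rule boils down to roughly this (hypothetical helper; COUNT_NO_PROFILE is the sentinel mentioned above):

#include <cstdint>
#include <limits>

constexpr uint64_t COUNT_NO_PROFILE = std::numeric_limits<uint64_t>::max();

// The merged count is valid only when both inputs carry a profile.
uint64_t mergeBranchCounts(uint64_t A, uint64_t B) {
  if (A == COUNT_NO_PROFILE || B == COUNT_NO_PROFILE)
    return COUNT_NO_PROFILE;
  return A + B;
}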
(cherry picked from FBD4837774)
Summary:
I split some of this out from the jumptable diff since it fixes the
double jump peephole.
I've changed the pass manager so that UCE and peepholes are no longer
called after SCTC. I've incorporated a call to the double jump fixer
into SCTC since it is needed to fix things up afterwards.
While working on fixing the double jump peephole I discovered a few
useless conditional branches that could be removed as well. I highly
doubt that removing them will improve perf at all but it does seem
odd to leave in useless conditional branches.
There are also some minor logging improvements.
(cherry picked from FBD4751875)
Summary:
Fix validateCFG to handle BBs that were generated from code that used
__builtin_unreachable().
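For example, code like the following yields a basic block with no terminator and no successors, which validateCFG now accepts:

int classify(int X) {
  if (X > 0)
    return 1;
  if (X < 0)
    return -1;
  if (X == 0)
    return 0;
  __builtin_unreachable(); // BB ends here with no successors
}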
Add -verify-cfg option to run CFG validation after every optimization
pass.
(cherry picked from FBD4641174)
Summary:
The new interface for handling Call Frame Information:
* CFI state at any point in a function (in CFG state) is defined by
CFI state at basic block entry and CFI instructions inside the
block. The state is independent of the basic block layout order
(this is implied by the CFG state but wasn't always true in the past).
* Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst *Inst) to
get CFI state at any given instruction in the program.
* No need to call fixCFIState() after any given pass. fixCFIState()
is called only once during function finalization, and any function
transformations after that point are prohibited.
* When introducing new basic blocks, make sure CFI state at entry
is set correctly and matches CFI instructions in the basic block
(if any).
* When splitting basic blocks, use getCFIStateAtInstr() to get
the state at the split point, and set the new basic block's CFI
state to this value; see the sketch below.
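A minimal, self-contained sketch of the state computation (hypothetical types; the real API is BinaryBasicBlock::getCFIStateAtInstr()):

#include <vector>

struct Instr {
  bool IsCFI = false;
  int StateAfter = 0; // CFI state in effect after this CFI instruction
};

struct Block {
  int EntryCFIState = 0; // CFI state at basic block entry
  std::vector<Instr> Insts;
};

// State at an instruction = entry state advanced by every CFI instruction
// preceding it in the block; the layout order of blocks never matters.
int getCFIStateAtInstr(const Block &BB, const Instr *Target) {
  int State = BB.EntryCFIState;
  for (const Instr &I : BB.Insts) {
    if (&I == Target)
      break;
    if (I.IsCFI)
      State = I.StateAfter;
  }
  return State;
}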
Introduce CFG_Finalized state to indicate that no further optimizations
are allowed on the function. This state is reached after we have synced
CFI instructions and updated EH info.
Rename "-print-after-fixup" option to "-print-finalized".
This diff fixes CFI for cases when we split conditional tail calls,
and for the indirect call promotion optimization.
(cherry picked from FBD4629307)
Summary:
Add pass to strip 'repz' prefix from 'repz retq' sequence. The prefix
is not used in Intel CPUs afaik. The pass is on by default.
(cherry picked from FBD4610329)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions during CFG creation for all
indirect calls. Right now indirect tail calls are not handled
since those functions are marked as not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When a candidate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
frequency of the top two call targets exceeds the threshold.
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; };
struct B : public A { virtual void foo() { ++B_count; } };
struct C : public A { virtual void foo() { ++C_count; } };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
Re-worked the way ICF operates. The pass now checks not just call
instructions but all references, including function pointers. Jump
tables are handled too.
(cherry picked from FBD4372491)
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches. It adds the following two command line options:
-simplify-conditional-tail-calls - simplify conditional tail calls by removing unnecessary jumps
-sctc-mode - mode for simplifying conditional tail calls
  =always - always perform sctc
  =preserve - only perform sctc when the branch direction is preserved
  =heuristic - use branch prediction data to control sctc
This optimization considers both of the following cases:
foo: ...
     jcc L1          # original
     ...
L1:  jmp bar         # TAILJMP
->
foo: ...
     jcc bar         # only if jcc L1 is expected to be taken
     ...
L1 is unreachable
OR
foo: ...
     jcc L2
L1:  jmp dest        # TAILJMP
L2:  ...
->
foo: jncc dest       # TAILJMP
L2:  ...
L1 is unreachable
For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true. The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function. We don't care about the contents of the fall-through block.
(cherry picked from FBD3719617)
Summary:
Allow UCE when blocks have EH info. Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them. We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed. The debugging code just skips
any BBs that have an invalid index.
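A sketch of the bookkeeping with hypothetical names:

#include <memory>
#include <vector>

struct Block {
  static constexpr unsigned InvalidIndex = ~0u;
  unsigned Index = 0;
  bool isValid() const { return Index != InvalidIndex; }
};

struct Function {
  std::vector<std::unique_ptr<Block>> Blocks;        // current layout
  std::vector<std::unique_ptr<Block>> DeletedBlocks; // kept until ~Function

  // UCE "removes" a block by invalidating its index and parking it aside,
  // so debugging-info structures that reference it remain safe to walk.
  void eraseBlock(size_t I) {
    Blocks[I]->Index = Block::InvalidIndex;
    DeletedBlocks.push_back(std::move(Blocks[I]));
    Blocks.erase(Blocks.begin() + I);
  }
};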
Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fall-through. I've added
a new routine to clean up these jmps, though @maks is working on
changing fixBranches() so that it can be used instead.
(cherry picked from FBD3793259)
Summary:
Add level for "-jump-tables=<n>" option:
1 - all jump tables are output in the same section (default).
2 - basic splitting: if the table is used, it is output to the hot
section, otherwise to the cold one.
3 - aggressively split compound jump tables and collect profile for
all entries.
Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.
(cherry picked from FBD3912119)
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock. I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.
(cherry picked from FBD3860899)
Summary:
Replace jumps to other unconditional jumps with the final
destination, e.g.
B0: ...
    jmp B1 (or jcc B1)
B1: jmp B2
->
B0: ...
    jmp B2 (or jcc B2)
This peephole removes 8928 double jumps from a test binary.
Note: after filtering out double jumps found in EH code and infinite
loops, the number of double jumps patched is 49 (24 for a clang
compiled test). The 24 in the clang build are all from external
libraries which have probably been compiled with gcc. This peephole
is still useful for cleaning up after ICP though.
(cherry picked from FBD3815420)
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations, right after fixBranches().
Since it's the last pass, it has to do its own UCE.
(cherry picked from FBD3838051)
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:
BOLT-INFO: program-wide dynostats after optimizations:
executed forward branches : 109706407 (+8.1%)
taken forward branches : 13769074 (-55.5%)
executed backward branches : 24517582 (-25.0%)
taken backward branches : 15330256 (-27.2%)
executed unconditional branches : 6009826 (-35.5%)
function calls : 17192114 (+0.0%)
executed instructions : 837733057 (-0.4%)
total branches : 140233815 (-2.3%)
taken branches : 35109156 (-42.8%)
Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number stays
synchronized with the real number of pseudo instructions.
(cherry picked from FBD3826995)
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).
We do require a conditional branch at the end of a basic block if
the block has 2 successors, as the CFG currently lacks support for
condition codes (and will probably stay that way). We only use this
branch instruction for its condition code; the destination is
determined by the CFG: the first successor represents the true/taken
branch, while the second successor represents the false/fall-through
branch.
When we reverse the branch condition, the CFG is updated accordingly.
The previous version used to insert jumps after some terminating
instructions, sometimes resulting in larger code than needed. As a
result, with the new version one extra function becomes overwritten
for the HHVM binary.
With this diff we also convert conditional branches with one successor
(the result of code using __builtin_unreachable()) into unconditional
jumps.
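A minimal, self-contained sketch of that contract (hypothetical types; the real fixBranches() also reverses the branch condition when needed, updating the CFG as described above):

#include <vector>

struct Block;

struct Branch {
  bool IsConditional;
  Block *Target;
};

struct Block {
  std::vector<Block *> Succs;       // [0] = true/taken, [1] = fall-through
  std::vector<Branch> Terminators;  // rewritten from scratch below
  Block *LayoutSuccessor = nullptr; // next block in the current layout
};

// Rewrite terminators purely from the CFG: the conditional branch targets
// the first successor; a jump to the second successor is emitted only when
// it is not the next block in the layout.
void fixBranches(Block &BB) {
  BB.Terminators.clear();
  if (BB.Succs.size() == 2) {
    BB.Terminators.push_back({/*IsConditional=*/true, BB.Succs[0]});
    if (BB.Succs[1] != BB.LayoutSuccessor)
      BB.Terminators.push_back({/*IsConditional=*/false, BB.Succs[1]});
  } else if (BB.Succs.size() == 1 && BB.Succs[0] != BB.LayoutSuccessor) {
    BB.Terminators.push_back({/*IsConditional=*/false, BB.Succs[0]});
  }
}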
(cherry picked from FBD3802062)
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
and I've saved level 2 for the noisiest of messages. Error messages should
never be suppressed by the verbosity level, only warnings and info messages.
The rationale behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().
With the exception of a few of the level 2 messages, I don't have any strong feelings about the others.
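In code the convention looks roughly like this (opts::Verbosity is a hypothetical flag name; outs()/errs()/dbgs()/DEBUG() are the LLVM facilities named above):

#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#define DEBUG_TYPE "bolt"

using namespace llvm;

namespace opts { extern int Verbosity; } // assumed cl::opt-backed flag

void reportFunctionStatus(bool HasError) {
  if (opts::Verbosity >= 1)
    outs() << "BOLT-INFO: shown at verbosity level 1 and above\n";
  if (HasError)
    errs() << "BOLT-ERROR: never suppressed by the verbosity level\n";
  DEBUG(dbgs() << "detailed dump, only visible inside DEBUG()\n");
}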
(cherry picked from FBD3814259)
Summary:
Added an ICF pass to BOLT that recognizes identical functions
and replaces references to these functions with references to just one
representative.
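The core idea, sketched self-contained (the real pass compares instruction streams and their references rather than strings):

#include <map>
#include <string>
#include <vector>

struct Function {
  std::string Key;              // stand-in for a hash of code + references
  Function *ReplacedBy = nullptr;
};

// The first function with a given key becomes the representative; later
// duplicates point at it, and all references (calls, function pointers)
// get retargeted to the representative.
void foldIdenticalFunctions(std::vector<Function> &Funcs) {
  std::map<std::string, Function *> Representative;
  for (Function &F : Funcs) {
    auto Ins = Representative.insert({F.Key, &F});
    if (!Ins.second)
      F.ReplacedBy = Ins.first->second;
  }
}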
(cherry picked from FBD3460297)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary: The inference algorithm for counts of fall-through edges now takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.
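Conceptually (hypothetical names, not the actual implementation), the adjusted inference is:

#include <cstdint>

// A block's fall-through count is what remains of its execution count after
// subtracting profiled taken branches and jumps/throws into landing pads.
uint64_t inferFallthroughCount(uint64_t BlockCount, uint64_t TakenBranches,
                               uint64_t LandingPadJumps) {
  const uint64_t Leaving = TakenBranches + LandingPadJumps;
  return BlockCount > Leaving ? BlockCount - Leaving : 0;
}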
(cherry picked from FBD3350727)
Summary:
A simple optimization to prevent branch misprediction for tail calls.
Convert the sequence:
j<cc> L1
...
L1: jmp foo # tail call
into:
j<cc> foo
but only if 'j<cc> foo' turns out to be a forward branch.
(cherry picked from FBD3234207)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges, except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h); it will also serve for location lists. See the sketch after this list.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
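A minimal sketch of what a block-relative range can look like (hypothetical fields; BasicBlockOffsetRanges.h holds the real representation):

#include <cstdint>

struct BinaryBasicBlock; // opaque here

// A DWARF range may begin or end mid-block, so each endpoint is stored as a
// basic block plus a byte offset into it. After optimizations move blocks,
// output addresses are recomputed from each block's new address plus the
// stored offset.
struct BasicBlockOffsetRange {
  const BinaryBasicBlock *BeginBB;
  uint16_t BeginOffset;
  const BinaryBasicBlock *EndBB;
  uint16_t EndOffset;
};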
(cherry picked from FBD3113181)
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFGs are built, allowing
us to test whether the builder is sane with LLVM LIT tests.
(cherry picked from FBD2534675)