Replace `BPFMIPeepholeTruncElim` by adding an overload for `TargetLowering::isZExtFree()` aware that zero extension is free for `ISD::LOAD`. Short description ================= The `BPFMIPeepholeTruncElim` handles two patterns: Pattern #1: %1 = LDB %0, ... %1 = LDB %0, ... %2 = AND_ri %1, 0xff -> %2 = MOV_ri %1 <-- (!) Pattern #2: bb.1: bb.1: %a = LDB %0, ... %a = LDB %0, ... br %bb3 br %bb3 bb.2: bb.2: %b = LDB %0, ... -> %b = LDB %0, ... br %bb3 br %bb3 bb.3: bb.3: %1 = PHI %a, %b %1 = PHI %a, %b %2 = AND_ri %1, 0xff %2 = MOV_ri %1 <-- (!) Plus variations: - AND_ri_32 instead of AND_ri - SLL/SLR instead of AND_ri - LDH, LDW, LDB32, LDH32, LDW32 Both patterns could be handled by built-in transformations at instruction selection phase if suitable `isZExtFree()` implementation is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`. When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM test cases this revisions performs slightly better than BPFMIPeepholeTruncElim, see "Impact" section below for details. Commit also adds a few test cases to make sure that patterns in question are handled. Long description ================ Why this works: Pattern #1 -------------------------- Consider the following example: define i1 @foo(ptr %p) { entry: %a = load i8, ptr %p, align 1 %cond = icmp eq i8 %a, 0 ret i1 %cond } Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command: ... Type-legalized selection DAG: %bb.0 'foo:entry' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64 t19: i64 = and t16, Constant:i64<255> t17: i64 = setcc t19, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Replacing.1 t19: i64 = and t16, Constant:i64<255> With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64 and 0 other values ... Optimized type-legalized selection DAG: %bb.0 'foo:entry' SelectionDAG has 11 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64 t17: i64 = setcc t20, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Note: - Optimized type-legalized selection DAG: - `t19 = and t16, 255` had been replaced by `t16` (load). - Patterns like `(and (load ... i8), 255)` are replaced by `load` in `DAGCombiner::BackwardsPropagateMask` called from `DAGCombiner::visitAND`. - Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge, look for `TLI.shouldFoldConstantShiftPairToMask()` call). Why this works: Pattern #2 -------------------------- Consider the following example: define i1 @foo(ptr %p) { entry: %a = load i8, ptr %p, align 1 br label %next next: %cond = icmp eq i8 %a, 0 ret i1 %cond } Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command. Log for first basic block: Initial selection DAG: %bb.0 'foo:entry' SelectionDAG has 9 nodes: t0: ch,glue = EntryToken t3: i64 = Constant<0> t2: i64,ch = CopyFromReg t0, Register:i64 %1 t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64 t6: i64 = zero_extend t5 t8: ch = CopyToReg t0, Register:i64 %0, t6 ... Replacing.1 t6: i64 = zero_extend t5 With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64 and 0 other values ... Optimized lowered selection DAG: %bb.0 'foo:entry' SelectionDAG has 7 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %1 t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64 t8: ch = CopyToReg t0, Register:i64 %0, t9 Note: - Initial selection DAG: - `%a = load ...` is lowered as `t6 = (zero_extend (load ...))` w/o special `isZExtFree()` overload added by this commit it is instead lowered as `t6 = (any_extend (load ...))`. - The decision to generate `zero_extend` or `any_extend` is done in `RegsForValue::getCopyToRegs` called from `SelectionDAGBuilder::CopyValueToVirtualRegister`: - if `isZExtFree()` for load returns true `zero_extend` is used; - `any_extend` is used otherwise. - Optimized lowered selection DAG: - `t6 = (any_extend (load ...))` is replaced by `t9 = load ..., zext from i8` This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from `DAGCombiner::visitZERO_EXTEND`. Log for second basic block: Initial selection DAG: %bb.1 'foo:next' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t4: i64 = AssertZext t2, ValueType:ch:i8 t5: i8 = truncate t4 t8: i1 = setcc t5, Constant:i8<0>, seteq:ch t9: i64 = any_extend t8 t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Replacing.2 t18: i64 = and t4, Constant:i64<255> With: t4: i64 = AssertZext t2, ValueType:ch:i8 ... Type-legalized selection DAG: %bb.1 'foo:next' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t4: i64 = AssertZext t2, ValueType:ch:i8 t18: i64 = and t4, Constant:i64<255> t16: i64 = setcc t18, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Optimized type-legalized selection DAG: %bb.1 'foo:next' SelectionDAG has 11 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t4: i64 = AssertZext t2, ValueType:ch:i8 t16: i64 = setcc t4, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Note: - Initial selection DAG: - `t0` is an input value for this basic block, it corresponds load instruction (`t9`) from the first basic block. - It is accessed within basic block via `t4` (AssertZext (CopyFromReg t0, ...)). - The `AssertZext` is generated by RegsForValue::getCopyFromRegs called from SelectionDAGBuilder::getCopyFromRegs, it is generated only when `LiveOutInfo` with known number of leading zeros is present for `t0`. - Known register bits in `LiveOutInfo` are computed by `SelectionDAG::computeKnownBits` called from `SelectionDAGISel::ComputeLiveOutVRegInfo`. - `computeKnownBits()` generates leading zeros information for `(load ..., zext from ...)` but *does not* generate leading zeros information for `(load ..., anyext from ...)`. This is why `isZExtFree()` added in this commit is important. - Type-legalized selection DAG: - `t5 = truncate t4` is replaced by `t18 = and t4, 255` - Optimized type-legalized selection DAG: - `t18 = and t4, 255` is replaced by `t4`, this is done by `DAGCombiner::SimplifyDemandedBits` called from `DAGCombiner::visitAND`, which simplifies patterns like `(and (assertzext ...))` Impact ------ This change covers all remove_truncate_*.ll test cases: - for -mcpu=v4 there are no changes in the generated code; - for -mcpu=v2 code generated for remove_truncate_7 and remove_truncate_8 improved slightly, for other tests it is unchanged. For remove_truncate_7: Before this revision After this revision -------------------- ------------------- r1 <<= 0x20 r1 <<= 0x20 r1 >>= 0x20 r1 >>= 0x20 if r1 == 0x0 goto +0x2 <LBB0_2> if r1 == 0x0 goto +0x2 <LBB0_2> r1 = *(u32 *)(r2 + 0x0) r0 = *(u32 *)(r2 + 0x0) goto +0x1 <LBB0_3> goto +0x1 <LBB0_3> <LBB0_2>: <LBB0_2>: r1 = *(u32 *)(r2 + 0x4) r0 = *(u32 *)(r2 + 0x4) <LBB0_3>: <LBB0_3>: r0 = r1 exit exit For remove_truncate_8: Before this revision After this revision -------------------- ------------------- r2 = *(u32 *)(r1 + 0x0) r2 = *(u32 *)(r1 + 0x0) r3 = r2 r3 = r2 r3 <<= 0x20 r3 <<= 0x20 r4 = r3 r3 s>>= 0x20 r4 s>>= 0x20 if r4 s> 0x2 goto +0x5 <LBB0_3> if r3 s> 0x2 goto +0x4 <LBB0_3> r4 = *(u32 *)(r1 + 0x4) r3 = *(u32 *)(r1 + 0x4) r3 >>= 0x20 if r3 >= r4 goto +0x2 <LBB0_3> if r2 >= r3 goto +0x2 <LBB0_3> r2 += 0x2 r2 += 0x2 *(u32 *)(r1 + 0x0) = r2 *(u32 *)(r1 + 0x0) = r2 <LBB0_3>: <LBB0_3>: r0 = 0x3 r0 = 0x3 exit exit For kernel BPF selftests statistics is as follows: (-mcpu=v4): - For -mcpu=v4: 9 out of 655 object files have differences, in all cases total number of instructions marginally decreased (-27 instructions). - For -mcpu=v2: 9 out of 655 object files have differences: - For 19 object files number of instruction decreased (-129 instruction in total): some redundant `rX &= 0xffff` and register to register assignments removed; - For 2 object files number of instructions increased +2 instructions in each file. Both -mcpu=v2 instruction increases could be reduced to the same example: define void @foo(ptr %p) { entry: %a = load i32, ptr %p, align 4 %b = sext i32 %a to i64 %c = icmp ult i64 1, %b br i1 %c, label %next, label %end next: call void inttoptr (i64 62 to ptr)(i32 %a) br label %end end: ret void } Note that this example uses value loaded to `%a` both as a sign extended (`%b`) and as zero extended (`%a` passed as parameter). Here is the difference in final assembly code: Before this revision After this revision -------------------- ------------------- r1 = *(u32 *)(r1 + 0) r1 = *(u32 *)(r1 + 0) r1 <<= 32 r1 <<= 32 r1 s>>= 32 r1 s>>= 32 if r1 < 2 goto <LBB0_2> if r1 < 2 goto <LBB0_2> r1 <<= 32 r1 >>= 32 call 62 call 62 <LBB0_2>: <LBB0_2>: exit exit Before this commit `%a` is passed to call as a sign extended value, after this commit `%a` is passed to call as a zero extended value, both are correct as 32-bit sub-register is the same. The difference comes from `DAGCombiner` operation on the initial DAG: Initial selection DAG before this commit: t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64 t6: i64 = any_extend t5 <--------------------- (1) t8: ch = CopyToReg t0, Register:i64 %0, t6 t9: i64 = sign_extend t5 t12: i1 = setcc Constant:i64<1>, t9, setult:ch Initial selection DAG after this commit: t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64 t6: i64 = zero_extend t5 <--------------------- (2) t8: ch = CopyToReg t0, Register:i64 %0, t6 t9: i64 = sign_extend t5 t12: i1 = setcc Constant:i64<1>, t9, setult:ch The node `t9` is processed before node `t6` and `load` instruction is combined to load with sign extension: Replacing.1 t9: i64 = sign_extend t5 With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64 and 0 other values Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64 With: t31: i32 = truncate t30 and 1 other values This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from `DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which is `any_extend` in (1) and `zero_extend` in (2). `tryToFoldExtOfLoad()` rewrites such uses of `t5` differently: - `any_extend` is simply removed - `zero_extend` is replaced by `and t30, 0xffffffff`, which is later converted to a pair of shifts. This pair of shifts survives till the end of translation. Differential Revision: https://reviews.llvm.org/D157870
166 lines
5.7 KiB
C++
166 lines
5.7 KiB
C++
//===-- BPFTargetMachine.cpp - Define TargetMachine for BPF ---------------===//
|
|
//
|
|
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
// See https://llvm.org/LICENSE.txt for license information.
|
|
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
//
|
|
//===----------------------------------------------------------------------===//
|
|
//
|
|
// Implements the info about BPF target spec.
|
|
//
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
#include "BPFTargetMachine.h"
|
|
#include "BPF.h"
|
|
#include "BPFTargetTransformInfo.h"
|
|
#include "MCTargetDesc/BPFMCAsmInfo.h"
|
|
#include "TargetInfo/BPFTargetInfo.h"
|
|
#include "llvm/CodeGen/Passes.h"
|
|
#include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
|
|
#include "llvm/CodeGen/TargetPassConfig.h"
|
|
#include "llvm/IR/PassManager.h"
|
|
#include "llvm/MC/TargetRegistry.h"
|
|
#include "llvm/Passes/PassBuilder.h"
|
|
#include "llvm/Support/FormattedStream.h"
|
|
#include "llvm/Target/TargetOptions.h"
|
|
#include "llvm/Transforms/Scalar.h"
|
|
#include "llvm/Transforms/Scalar/SimplifyCFG.h"
|
|
#include "llvm/Transforms/Utils/SimplifyCFGOptions.h"
|
|
#include <optional>
|
|
using namespace llvm;
|
|
|
|
static cl::
|
|
opt<bool> DisableMIPeephole("disable-bpf-peephole", cl::Hidden,
|
|
cl::desc("Disable machine peepholes for BPF"));
|
|
|
|
extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeBPFTarget() {
|
|
// Register the target.
|
|
RegisterTargetMachine<BPFTargetMachine> X(getTheBPFleTarget());
|
|
RegisterTargetMachine<BPFTargetMachine> Y(getTheBPFbeTarget());
|
|
RegisterTargetMachine<BPFTargetMachine> Z(getTheBPFTarget());
|
|
|
|
PassRegistry &PR = *PassRegistry::getPassRegistry();
|
|
initializeBPFCheckAndAdjustIRPass(PR);
|
|
initializeBPFMIPeepholePass(PR);
|
|
initializeBPFDAGToDAGISelPass(PR);
|
|
}
|
|
|
|
// DataLayout: little or big endian
|
|
static std::string computeDataLayout(const Triple &TT) {
|
|
if (TT.getArch() == Triple::bpfeb)
|
|
return "E-m:e-p:64:64-i64:64-i128:128-n32:64-S128";
|
|
else
|
|
return "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128";
|
|
}
|
|
|
|
static Reloc::Model getEffectiveRelocModel(std::optional<Reloc::Model> RM) {
|
|
return RM.value_or(Reloc::PIC_);
|
|
}
|
|
|
|
BPFTargetMachine::BPFTargetMachine(const Target &T, const Triple &TT,
|
|
StringRef CPU, StringRef FS,
|
|
const TargetOptions &Options,
|
|
std::optional<Reloc::Model> RM,
|
|
std::optional<CodeModel::Model> CM,
|
|
CodeGenOpt::Level OL, bool JIT)
|
|
: LLVMTargetMachine(T, computeDataLayout(TT), TT, CPU, FS, Options,
|
|
getEffectiveRelocModel(RM),
|
|
getEffectiveCodeModel(CM, CodeModel::Small), OL),
|
|
TLOF(std::make_unique<TargetLoweringObjectFileELF>()),
|
|
Subtarget(TT, std::string(CPU), std::string(FS), *this) {
|
|
initAsmInfo();
|
|
|
|
BPFMCAsmInfo *MAI =
|
|
static_cast<BPFMCAsmInfo *>(const_cast<MCAsmInfo *>(AsmInfo.get()));
|
|
MAI->setDwarfUsesRelocationsAcrossSections(!Subtarget.getUseDwarfRIS());
|
|
}
|
|
|
|
namespace {
|
|
// BPF Code Generator Pass Configuration Options.
|
|
class BPFPassConfig : public TargetPassConfig {
|
|
public:
|
|
BPFPassConfig(BPFTargetMachine &TM, PassManagerBase &PM)
|
|
: TargetPassConfig(TM, PM) {}
|
|
|
|
BPFTargetMachine &getBPFTargetMachine() const {
|
|
return getTM<BPFTargetMachine>();
|
|
}
|
|
|
|
void addIRPasses() override;
|
|
bool addInstSelector() override;
|
|
void addMachineSSAOptimization() override;
|
|
void addPreEmitPass() override;
|
|
};
|
|
}
|
|
|
|
TargetPassConfig *BPFTargetMachine::createPassConfig(PassManagerBase &PM) {
|
|
return new BPFPassConfig(*this, PM);
|
|
}
|
|
|
|
void BPFTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
|
|
PB.registerPipelineParsingCallback(
|
|
[](StringRef PassName, FunctionPassManager &FPM,
|
|
ArrayRef<PassBuilder::PipelineElement>) {
|
|
if (PassName == "bpf-ir-peephole") {
|
|
FPM.addPass(BPFIRPeepholePass());
|
|
return true;
|
|
}
|
|
return false;
|
|
});
|
|
PB.registerPipelineStartEPCallback(
|
|
[=](ModulePassManager &MPM, OptimizationLevel) {
|
|
FunctionPassManager FPM;
|
|
FPM.addPass(BPFAbstractMemberAccessPass(this));
|
|
FPM.addPass(BPFPreserveDITypePass());
|
|
FPM.addPass(BPFIRPeepholePass());
|
|
MPM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));
|
|
});
|
|
PB.registerPeepholeEPCallback([=](FunctionPassManager &FPM,
|
|
OptimizationLevel Level) {
|
|
FPM.addPass(SimplifyCFGPass(SimplifyCFGOptions().hoistCommonInsts(true)));
|
|
});
|
|
PB.registerPipelineEarlySimplificationEPCallback(
|
|
[=](ModulePassManager &MPM, OptimizationLevel) {
|
|
MPM.addPass(BPFAdjustOptPass());
|
|
});
|
|
}
|
|
|
|
void BPFPassConfig::addIRPasses() {
|
|
addPass(createBPFCheckAndAdjustIR());
|
|
TargetPassConfig::addIRPasses();
|
|
}
|
|
|
|
TargetTransformInfo
|
|
BPFTargetMachine::getTargetTransformInfo(const Function &F) const {
|
|
return TargetTransformInfo(BPFTTIImpl(this, F));
|
|
}
|
|
|
|
// Install an instruction selector pass using
|
|
// the ISelDag to gen BPF code.
|
|
bool BPFPassConfig::addInstSelector() {
|
|
addPass(createBPFISelDag(getBPFTargetMachine()));
|
|
|
|
return false;
|
|
}
|
|
|
|
void BPFPassConfig::addMachineSSAOptimization() {
|
|
addPass(createBPFMISimplifyPatchablePass());
|
|
|
|
// The default implementation must be called first as we want eBPF
|
|
// Peephole ran at last.
|
|
TargetPassConfig::addMachineSSAOptimization();
|
|
|
|
const BPFSubtarget *Subtarget = getBPFTargetMachine().getSubtargetImpl();
|
|
if (!DisableMIPeephole) {
|
|
if (Subtarget->getHasAlu32())
|
|
addPass(createBPFMIPeepholePass());
|
|
}
|
|
}
|
|
|
|
void BPFPassConfig::addPreEmitPass() {
|
|
addPass(createBPFMIPreEmitCheckingPass());
|
|
if (getOptLevel() != CodeGenOpt::None)
|
|
if (!DisableMIPeephole)
|
|
addPass(createBPFMIPreEmitPeepholePass());
|
|
}
|