Commit Graph

29893 Commits

Author SHA1 Message Date
Bjorn Pettersson
a89d751fb4 Add intrinsics for saturating float to int casts
This patch adds support for the fptoui.sat and fptosi.sat intrinsics,
which provide basically the same functionality as the existing fptoui
and fptosi instructions, but will saturate (or return 0 for NaN) on
values unrepresentable in the target type, instead of returning
poison. Related mailing list discussion can be found at:
https://groups.google.com/d/msg/llvm-dev/cgDFaBmCnDQ/CZAIMj4IBAAJ

The intrinsics have overloaded source and result type and support
vector operands:

    i32 @llvm.fptoui.sat.i32.f32(float %f)
    i100 @llvm.fptoui.sat.i100.f64(double %f)
    <4 x i32> @llvm.fptoui.sat.v4i32.v4f16(half %f)
    // etc

On the SelectionDAG layer two new ISD opcodes are added,
FP_TO_UINT_SAT and FP_TO_SINT_SAT. These opcodes have two operands
and one result. The second operand is an integer constant specifying
the scalar saturation width. The idea here is that initially the
second operand and the scalar width of the result type are the same,
but they may change during type legalization. For example:

    i19 @llvm.fptsi.sat.i19.f32(float %f)
    // builds
    i19 fp_to_sint_sat f, 19
    // type legalizes (through integer result promotion)
    i32 fp_to_sint_sat f, 19

I went for this approach, because saturated conversion does not
compose well. There is no good way of "adjusting" a saturating
conversion to i32 into one to i19 short of saturating twice.
Specifying the saturation width separately allows directly saturating
to the correct width.

There are two baseline expansions for the fp_to_xint_sat opcodes. If
the integer bounds can be exactly represented in the float type and
fminnum/fmaxnum are legal, we can expand to something like:

    f = fmaxnum f, FP(MIN)
    f = fminnum f, FP(MAX)
    i = fptoxi f
    i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN

If the bounds cannot be exactly represented, we expand to something
like this instead:

    i = fptoxi f
    i = select f ult FP(MIN), MIN, i
    i = select f ogt FP(MAX), MAX, i
    i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN

It should be noted that this expansion assumes a non-trapping fptoxi.

Initial tests are for AArch64, x86_64 and ARM. This exercises all of
the scalar and vector legalization. ARM is included to test float
softening.

Original patch by @nikic and @ebevhan (based on D54696).

Differential Revision: https://reviews.llvm.org/D54749
2020-12-18 11:09:41 +01:00
Rong Xu
3733463dbb [IR][PGO] Add hot func attribute and use hot/cold attribute in func section
Clang FE currently has hot/cold function attribute. But we only have
cold function attribute in LLVM IR.

This patch adds support of hot function attribute to LLVM IR.  This
attribute will be used in setting function section prefix/suffix.
Currently .hot and .unlikely suffix only are added in PGO (Sample PGO)
compilation (through isFunctionHotInCallGraph and
isFunctionColdInCallGraph).

This patch changes the behavior. The new behavior is:
(1) If the user annotates a function as hot or isFunctionHotInCallGraph
    is true, this function will be marked as hot. Otherwise,
(2) If the user annotates a function as cold or
    isFunctionColdInCallGraph is true, this function will be marked as
    cold.

The changes are:
(1) user annotated function attribute will used in setting function
    section prefix/suffix.
(2) hot attribute overwrites profile count based hotness.
(3) profile count based hotness overwrite user annotated cold attribute.

The intention for these changes is to provide the user a way to mark
certain function as hot in cases where training input is hard to cover
all the hot functions.

Differential Revision: https://reviews.llvm.org/D92493
2020-12-17 18:41:12 -08:00
Layton Kifer
385e9a2a04 [DAGCombiner] Improve shift by select of constant
Clean up a TODO, to support folding a shift of a constant by a
select of constants, on targets with different shift operand sizes.

Reviewed By: RKSimon, lebedev.ri

Differential Revision: https://reviews.llvm.org/D90349
2020-12-18 02:21:42 +00:00
dfukalov
9ed8e0caab [NFC] Reduce include files dependency and AA header cleanup (part 2).
Continuing work started in https://reviews.llvm.org/D92489:

Removed a bunch of includes from "AliasAnalysis.h" and "LoopPassManager.h".

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D92852
2020-12-17 14:04:48 +03:00
Krasimir Georgiev
e71a4cc207 fix a -Wunused-variable warning in release build 2020-12-17 11:52:00 +01:00
Barry Revzin
92310454bf Make LLVM build in C++20 mode
Part of the <=> changes in C++20 make certain patterns of writing equality
operators ambiguous with themselves (sorry!).
This patch goes through and adjusts all the comparison operators such that
they should work in both C++17 and C++20 modes. It also makes two other small
C++20-specific changes (adding a constructor to a type that cases to be an
aggregate, and adding casts from u8 literals which no longer have type
const char*).

There were four categories of errors that this review fixes.
Here are canonical examples of them, ordered from most to least common:

// 1) Missing const
namespace missing_const {
    struct A {
    #ifndef FIXED
        bool operator==(A const&);
    #else
        bool operator==(A const&) const;
    #endif
    };

    bool a = A{} == A{}; // error
}

// 2) Type mismatch on CRTP
namespace crtp_mismatch {
    template <typename Derived>
    struct Base {
    #ifndef FIXED
        bool operator==(Derived const&) const;
    #else
        // in one case changed to taking Base const&
        friend bool operator==(Derived const&, Derived const&);
    #endif
    };

    struct D : Base<D> { };

    bool b = D{} == D{}; // error
}

// 3) iterator/const_iterator with only mixed comparison
namespace iter_const_iter {
    template <bool Const>
    struct iterator {
        using const_iterator = iterator<true>;

        iterator();

        template <bool B, std::enable_if_t<(Const && !B), int> = 0>
        iterator(iterator<B> const&);

    #ifndef FIXED
        bool operator==(const_iterator const&) const;
    #else
        friend bool operator==(iterator const&, iterator const&);
    #endif
    };

    bool c = iterator<false>{} == iterator<false>{} // error
          || iterator<false>{} == iterator<true>{}
          || iterator<true>{} == iterator<false>{}
          || iterator<true>{} == iterator<true>{};
}

// 4) Same-type comparison but only have mixed-type operator
namespace ambiguous_choice {
    enum Color { Red };

    struct C {
        C();
        C(Color);
        operator Color() const;
        bool operator==(Color) const;
        friend bool operator==(C, C);
    };

    bool c = C{} == C{}; // error
    bool d = C{} == Red;
}

Differential revision: https://reviews.llvm.org/D78938
2020-12-17 10:44:10 +00:00
Simon Pilgrim
cdb692ee0c [X86] Add X86ISD::SUBV_BROADCAST_LOAD and begin removing X86ISD::SUBV_BROADCAST (PR38969)
Subvector broadcasts are only load instructions, yet X86ISD::SUBV_BROADCAST treats them more generally, requiring a lot of fallback tablegen patterns.

This initial patch replaces constant vector lowering inside lowerBuildVectorAsBroadcast with direct X86ISD::SUBV_BROADCAST_LOAD loads which helps us merge a number of equivalent loads/broadcasts.

As well as general plumbing/analysis additions for SUBV_BROADCAST_LOAD, I needed to wrap SelectionDAG::makeEquivalentMemoryOrdering so it can handle result chains from non generic LoadSDNode nodes.

Later patches will continue to replace X86ISD::SUBV_BROADCAST usage.

Differential Revision: https://reviews.llvm.org/D92645
2020-12-17 10:25:25 +00:00
QingShan Zhang
ebdd20f430 Expand the fp_to_int/int_to_fp/fp_round/fp_extend as libcall for fp128
X86 and AArch64 expand it as libcall inside the target. And PowerPC also
want to expand them as libcall for P8. So, propose an implement in the
legalizer to common the logic and remove the code for X86/AArch64 to
avoid the duplicate code.

Reviewed By: Craig Topper

Differential Revision: https://reviews.llvm.org/D91331
2020-12-17 07:59:30 +00:00
Fangrui Song
c70f36865e Use basic_string::find(char) instead of basic_string::find(const char *s, size_type pos=0)
Many (StringRef) cannot be detected by clang-tidy performance-faster-string-find.
2020-12-16 23:28:32 -08:00
Xiang1 Zhang
39584ae5b5 [Debugify] Support checking Machine IR debug info
Add mir-check-debug pass to check MIR-level debug info.

For IR-level, currently, LLVM have debugify + check-debugify to generate
and check debug IR. Much like the IR-level pass debugify, mir-debugify
inserts sequentially increasing line locations to each MachineInstr in a
Module, But there is no equivalent MIR-level check-debugify pass, So now
we support it at "mir-check-debug".

Reviewed By: djtodoro

Differential Revision: https://reviews.llvm.org/D91595
2020-12-16 22:17:25 -08:00
Kazu Hirata
5501b92957 [IR, CodeGen] Use llvm::is_contained (NFC) 2020-12-16 21:30:44 -08:00
Xiang1 Zhang
1e42ad9d62 Revert "[Debugify] Support checking Machine IR debug info"
This reverts commit 50aaa8c274.
2020-12-16 20:12:33 -08:00
Xiang1 Zhang
50aaa8c274 [Debugify] Support checking Machine IR debug info
Add mir-check-debug pass to check MIR-level debug info.

For IR-level, currently, LLVM have debugify + check-debugify to generate
and check debug IR. Much like the IR-level pass debugify, mir-debugify
inserts sequentially increasing line locations to each MachineInstr in a
Module, But there is no equivalent MIR-level check-debugify pass, So now
we support it at "mir-check-debug".

Reviewed By: djtodoro

Differential Revision: https://reviews.llvm.org/D91595
2020-12-16 18:04:05 -08:00
Guozhi Wei
687e80be7f [MBP] Add whole chain to BlockFilterSet instead of individual BB
Currently we add individual BB to BlockFilterSet if its frequency satisfies

  LoopFreq / Freq <= LoopToColdBlockRatio

LoopFreq is edge frequency from outside to loop header.
LoopToColdBlockRatio is a command line parameter.

It doesn't make sense since we always layout whole chain, not individual BBs.

It may also cause a tricky problem. Sometimes it is possible that the LoopFreq
of an inner loop is smaller than LoopFreq of outer loop. So a BB can be in
BlockFilterSet of inner loop, but not in BlockFilterSet of outer loop,
like .cold in the test case. So it is added to the chain of inner loop. When
work on the outer loop, .cold is not added to BlockFilterSet, so the edge to
successor .problem is not counted in UnscheduledPredecessors of .problem chain.
But other blocks in the inner loop are added BlockFilterSet, so the whole inner
loop chain can be layout, and markChainSuccessors is called to decrease
UnscheduledPredecessors of following chains. markChainSuccessors calls
markBlockSuccessors for every BB, even it is not in BlockFilterSet, like .cold,
so .problem chain's UnscheduledPredecessors is decreased, but this edge was not
counted on in fillWorkLists, so .problem chain's UnscheduledPredecessors
becomes 0 when it still has an unscheduled predecessor .pred! And it causes
problems in following various successor BB selection algorithms.

Differential Revision: https://reviews.llvm.org/D89088
2020-12-16 15:45:47 -08:00
Bardia Mahjour
6eff12788e [DDG] Data Dependence Graph - DOT printer - recommit
This is being recommitted to try and address the MSVC complaint.

This patch implements a DDG printer pass that generates a graph in
the DOT description language, providing a more visually appealing
representation of the DDG. Similar to the CFG DOT printer, this
functionality is provided under an option called -dot-ddg and can
be generated in a less verbose mode under -dot-ddg-only option.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D90159
2020-12-16 12:37:36 -05:00
diggerlin
a1e1dcabe4 [XCOFF][AIX] Emit EH information in traceback table
SUMMARY:

In order for the runtime on AIX to find the compact unwind section(EHInfo table),
we would need to set the following on the traceback table:

The 6th byte's longtbtable field to true to signal there is an Extended TB Table Flag.
The Extended TB Table Flag to be 0x08 to signal there is an exception handling info presents.
Emit the offset between ehinfo TC entry and TOC base after all other optional portions of traceback table.

The patch is authored by Jason Liu.

Reviewers: David Tenty, Digger Lin
Differential Revision: https://reviews.llvm.org/D92766
2020-12-16 09:34:59 -05:00
Matt Arsenault
60eba8161b RegisterCoalescer: Remove phi-only subranges when erasing identity copies
Undef subranges are not present in the live range values, except when
they cross block boundaries. In this situation, a identity copy is
inside a loop, and one of the lanes is undefined. It only appears
alive inside the loop due to the copy. Once the copy was erased, it
would leave behind a segment inside the loop body with no
corresponding def anywhere in the program.

When RenameIndependentSubregs processed this dummy interval, it would
introduce a "Multiple connected components in live interval" verifier
error when IMPLICIT_DEFs were added to the other two blocks. I believe
there is a missing verifier check for this type of dummy interval.

I have found additional cases from the same fundamental problem in
other areas I haven't managed to fix yet (e.g. the commented out
prune_subrange_phi_value_* cases).
2020-12-15 17:36:32 -05:00
Fangrui Song
8c4e55762d [docs][unittest][Go][StackProtector] Migrate deprecated DebugInfo::get to DILocation::get 2020-12-15 14:17:04 -08:00
Matt Arsenault
e7e7d371fd GlobalISel: Fix generic handling of single outgoing call arguments
Simply call the argument handler like is done for the incoming
case. This will allow removal of hacks in the AMDGPU call lowering in
a future change.
2020-12-15 17:00:27 -05:00
Paul Walker
6d35bd1d48 [CodeGenPrepare] Update optimizeGatherScatterInst for scalable vectors.
optimizeGatherScatterInst does nothing specific to fixed length vectors
but uses FixedVectorType to extract the number of elements.  This patch
simply updates the code to use VectorType and getElementCount instead.

For testing I just copied Transforms/CodeGenPrepare/X86/gather-scatter-opt.ll
replacing `<4 x ` with `<vscale x 4`.

Differential Revision: https://reviews.llvm.org/D92572
2020-12-15 10:57:51 +00:00
Amara Emerson
a69b76c500 [GlobalISel][IRTranslator] Ensure branch probabilities are added when translating invoke edges.
This uses a straightforward port of findUnwindDestinations() from SelectionDAG.

Differential Revision: https://reviews.llvm.org/D93256
2020-12-14 23:36:54 -08:00
Nico Weber
a852ee199c Reland "[MachineDebugify] Insert synthetic DBG_VALUE instructions"
This reverts commit 841f9c937f.
The change landed many months ago; something else broke those tests.
2020-12-14 22:34:23 -05:00
Nico Weber
841f9c937f Revert "[MachineDebugify] Insert synthetic DBG_VALUE instructions"
This reverts commit 2a5675f11d.
The tests it adds fail: https://reviews.llvm.org/D78135#2453736
2020-12-14 22:14:48 -05:00
Nico Weber
da2551f3d1 Revert "[Debugify] Support checking Machine IR debug info"
This reverts commit c4d2d4337d.
Necessary to revert 2a5675f11d.
2020-12-14 22:14:48 -05:00
Qiu Chaofan
38b4442198 [NFC] [Legalizer] Use common method for expanding fp-to-int operands
Reviewed By: RKSimon, steven.zhang

Differential Revision: https://reviews.llvm.org/D92481
2020-12-15 10:45:40 +08:00
Xiang1 Zhang
c4d2d4337d [Debugify] Support checking Machine IR debug info
Add mir-check-debug pass to check MIR-level debug info.

For IR-level, currently, LLVM have debugify + check-debugify to generate
and check debug IR. Much like the IR-level pass debugify, mir-debugify
inserts sequentially increasing line locations to each MachineInstr in a
Module, But there is no equivalent MIR-level check-debugify pass, So now
we support it at "mir-check-debug".

Reviewed By: djtodoro

Differential Revision: https://reviews.llvm.org/D91595
2020-12-14 17:53:46 -08:00
Xiang1 Zhang
fc0f4010bb Revert "[Debugify] Support checking Machine IR debug info"
This reverts commit 57a3d9ec4a.
2020-12-14 17:48:49 -08:00
Xiang1 Zhang
57a3d9ec4a [Debugify] Support checking Machine IR debug info
Add mir-check-debug pass to check MIR-level debug info.

For IR-level, currently, LLVM have debugify + check-debugify to generate
and check debug IR. Much like the IR-level pass debugify, mir-debugify
inserts sequentially increasing line locations to each MachineInstr in a
Module, But there is no equivalent MIR-level check-debugify pass, So now
we support it at "mir-check-debug".

Reviewed By: djtodoro

Differential Revision: https://reviews.llvm.org/D95195
2020-12-14 17:38:01 -08:00
Bardia Mahjour
a29ecca781 Revert "[DDG] Data Dependence Graph - DOT printer"
This reverts commit fd4a10732c, to
investigate the failure on windows: http://lab.llvm.org:8011/#/builders/127/builds/3274
2020-12-14 16:54:20 -05:00
Bardia Mahjour
fd4a10732c [DDG] Data Dependence Graph - DOT printer
This patch implements a DDG printer pass that generates a graph in
the DOT description language, providing a more visually appealing
representation of the DDG. Similar to the CFG DOT printer, this
functionality is provided under an option called -dot-ddg and can
be generated in a less verbose mode under -dot-ddg-only option.

Differential Revision: https://reviews.llvm.org/D90159
2020-12-14 16:41:14 -05:00
Matt Arsenault
2e0e03c6a0 OpaquePtr: Require byval on x86_intrcc parameter 0
Currently the backend special cases x86_intrcc and treats the first
parameter as byval. Make the IR require byval for this parameter to
remove this special case, and avoid the dependence on the pointee
element type.

Fixes bug 46672.

I'm not sure the IR is enforcing all the calling convention
constraints. clang seems to ignore the attribute for empty parameter
lists, but the IR tolerates it.
2020-12-14 16:34:37 -05:00
Zequan Wu
b6b522c4db [NFC] cleanup cg-profile emission on TargetLowerinng
Differential Revision: https://reviews.llvm.org/D93150
2020-12-14 13:07:44 -08:00
Guozhi Wei
d50d7c37a1 [MBP] Prevent rotating a chain contains entry block
The entry block should always be the first BB in a function.
So we should not rotate a chain contains the entry block.

Differential Revision: https://reviews.llvm.org/D92882
2020-12-14 12:48:55 -08:00
Kerry McLaughlin
c5ced82c8e [SVE][CodeGen] Lower scalable floating-point vector reductions
Changes in this patch:
-  Minor changes to the LowerVECREDUCE_SEQ_FADD function added by @cameron.mcinally
   to also work for scalable types
- Added TableGen patterns for FP reductions with unpacked types (nxv2f16, nxv4f16 & nxv2f32)
- Asserts added to expandFMINNUM_FMAXNUM & expandVecReduceSeq for scalable types

Reviewed By: cameron.mcinally

Differential Revision: https://reviews.llvm.org/D93050
2020-12-14 11:45:42 +00:00
QingShan Zhang
08e287aaf3 [PowerPC][FP128] Fix the incorrect signature for math library call
The runtime library has two family library implementation for ppc_fp128 and fp128.
For IBM Long double(ppc_fp128), it is suffixed with 'l', i.e(sqrtl). For
IEEE Long double(fp128), it is suffixed with "ieee128" or "f128".
We miss to map several libcall for IEEE Long double.

Reviewed By: qiucf

Differential Revision: https://reviews.llvm.org/D91675
2020-12-14 07:52:56 +00:00
Chen Zheng
4830d458dd [MachineCombiner][NFC] Add MustReduceRegisterPressure goal
add a new goal MustReduceRegisterPressure for machine combiner pass.

PowerPC will use this new goal to do some register pressure related optimization.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D92068
2020-12-14 00:02:42 -05:00
Kazu Hirata
ee5b5b7a35 [CodeGen] Use llvm::erase_value (NFC) 2020-12-13 20:05:48 -08:00
Amara Emerson
21de99d43c [[GlobalISel][IRTranslator] Fix a crash when the use of an extractvalue is a non-dominated metadata use.
We don't expect uses to come before defs in the CFG, so allocateVRegs() asserted.

Fixes PR48211
2020-12-12 14:58:54 -08:00
diggerlin
997d286f2d [AIX][XCOFF] emit traceback table for function in aix
SUMMARY:
 1. added a new option -xcoff-traceback-table to control whether generate traceback table for function.
 2. implement the functionality of emit traceback table of a function.

Reviewers: hubert.reinterpretcast, Jason Liu
Differential Revision: https://reviews.llvm.org/D92398
2020-12-11 17:50:25 -05:00
Fangrui Song
b5ad32ef5c Migrate deprecated DebugLoc::get to DILocation::get
This migrates all LLVM (except Kaleidoscope and
CodeGen/StackProtector.cpp) DebugLoc::get to DILocation::get.

The CodeGen/StackProtector.cpp usage may have a nullptr Scope
and can trigger an assertion failure, so I don't migrate it.

Reviewed By: #debug-info, dblaikie

Differential Revision: https://reviews.llvm.org/D93087
2020-12-11 12:45:22 -08:00
Joe Ellis
d863a0ddeb [SelectionDAG] Implement SplitVecOp_INSERT_SUBVECTOR
This function is needed for when it is necessary to split the subvector
operand of an llvm.experimental.vector.insert call. Splitting the
subvector operand means performing two insertions: one inserting the
lower part of the split subvector into the destination vector, and
another for inserting the upper part.

Through experimenting, it seems quite rare to need split the subvector
operand, but this is necessary to avoid assertion errors.

Differential Revision: https://reviews.llvm.org/D92760
2020-12-11 11:07:59 +00:00
David Sherwood
9b76160e53 [Support] Introduce a new InstructionCost class
This is the first in a series of patches that attempts to migrate
existing cost instructions to return a new InstructionCost class
in place of a simple integer. This new class is intended to be
as light-weight and simple as possible, with a full range of
arithmetic and comparison operators that largely mirror the same
sets of operations on basic types, such as integers. The main
advantage to using an InstructionCost is that it can encode a
particular cost state in addition to a value. The initial
implementation only has two states - Normal and Invalid - but these
could be expanded over time if necessary. An invalid state can
be used to represent an unknown cost or an instruction that is
prohibitively expensive.

This patch adds the new class and changes the getInstructionCost
interface to return the new class. Other cost functions, such as
getUserCost, etc., will be migrated in future patches as I believe
this to be less disruptive. One benefit of this new class is that
it provides a way to unify many of the magic costs in the codebase
where the cost is set to a deliberately high number to prevent
optimisations taking place, e.g. vectorization. It also provides
a route to represent the extremely high, and unknown, cost of
scalarization of scalable vectors, which is not currently supported.

Differential Revision: https://reviews.llvm.org/D91174
2020-12-11 08:12:54 +00:00
Hongtao Yu
705a4c149d [CSSPGO] Pseudo probe encoding and emission.
This change implements pseudo probe encoding and emission for CSSPGO. Please see RFC here for more context: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s

Pseudo probes are in the form of intrinsic calls on IR/MIR but they do not turn into any machine instructions. Instead they are emitted into the binary as a piece of data in standalone sections.  The probe-specific sections are not needed to be loaded into memory at execution time, thus they do not incur a runtime overhead. 

**ELF object emission**

The binary data to emit are organized as two ELF sections, i.e, the `.pseudo_probe_desc` section and the `.pseudo_probe` section. The `.pseudo_probe_desc` section stores a function descriptor for each function and the `.pseudo_probe` section stores the actual probes, each fo which corresponds to an IR basic block or an IR function callsite. A function descriptor is stored as a module-level metadata during the compilation and is serialized into the object file during object emission.

Both the probe descriptors and pseudo probes can be emitted into a separate ELF section per function to leverage the linker for deduplication.  A `.pseudo_probe` section shares the same COMDAT group with the function code so that when the function is dead, the probes are dead and disposed too. On the contrary, a `.pseudo_probe_desc` section has its own COMDAT group. This is because even if a function is dead, its probes may be inlined into other functions and its descriptor is still needed by the profile generation tool.

The format of `.pseudo_probe_desc` section looks like:

```
.section   .pseudo_probe_desc,"",@progbits
.quad   6309742469962978389  // Func GUID
.quad   4294967295           // Func Hash
.byte   9                    // Length of func name
.ascii  "_Z5funcAi"          // Func name
.quad   7102633082150537521
.quad   138828622701
.byte   12
.ascii  "_Z8funcLeafi"
.quad   446061515086924981
.quad   4294967295
.byte   9
.ascii  "_Z5funcBi"
.quad   -2016976694713209516
.quad   72617220756
.byte   7
.ascii  "_Z3fibi"
```

For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :

```
FUNCTION BODY (one for each outlined function present in the text section)
    GUID (uint64)
        GUID of the function
    NPROBES (ULEB128)
        Number of probes originating from this function.
    NUM_INLINED_FUNCTIONS (ULEB128)
        Number of callees inlined into this function, aka number of
        first-level inlinees
    PROBE RECORDS
        A list of NPROBES entries. Each entry contains:
          INDEX (ULEB128)
          TYPE (uint4)
            0 - block probe, 1 - indirect call, 2 - direct call
          ATTRIBUTE (uint3)
            reserved
          ADDRESS_TYPE (uint1)
            0 - code address, 1 - address delta
          CODE_ADDRESS (uint64 or ULEB128)
            code address or address delta, depending on ADDRESS_TYPE
    INLINED FUNCTION RECORDS
        A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
        callees.  Each record contains:
          INLINE SITE
            GUID of the inlinee (uint64)
            ID of the callsite probe (ULEB128)
          FUNCTION BODY
            A FUNCTION BODY entry describing the inlined function.
```

To support building a context-sensitive profile, probes from inlinees are grouped by their inline contexts. An inline context is logically a call path through which a callee function lands in a caller function. The probe emitter builds an inline tree based on the debug metadata for each outlined function in the form of a trie tree. A tree root is the outlined function. Each tree edge stands for a callsite where inlining happens. Pseudo probes originating from an inlinee function are stored in a tree node and the tree path starting from the root all the way down to the tree node is the inline context of the probes. The emission happens on the whole tree top-down recursively. Probes of a tree node will be emitted altogether with their direct parent edge. Since a pseudo probe corresponds to a real code address, for size savings, the address is encoded as a delta from the previous probe except for the first probe. Variant-sized integer encoding, aka LEB128, is used for address delta and probe index.

**Assembling**

Pseudo probes can be printed as assembly directives alternatively. This allows for good assembly code readability and also provides a view of how optimizations and pseudo probes affect each other, especially helpful for diff time assembly analysis.

A pseudo probe directive has the following operands in order: function GUID, probe index, probe type, probe attributes and inline context. The directive is generated by the compiler and can be parsed by the assembler to form an encoded `.pseudoprobe` section in the object file.

A example assembly looks like:

```
foo2: # @foo2
# %bb.0: # %bb0
pushq %rax
testl %edi, %edi
.pseudoprobe 837061429793323041 1 0 0
je .LBB1_1
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 6 2 0
callq foo
.pseudoprobe 837061429793323041 3 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
.LBB1_1: # %bb1
.pseudoprobe 837061429793323041 5 1 0
callq *%rsi
.pseudoprobe 837061429793323041 2 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
# -- End function
.section .pseudo_probe_desc,"",@progbits
.quad 6699318081062747564
.quad 72617220756
.byte 3
.ascii "foo"
.quad 837061429793323041
.quad 281547593931412
.byte 4
.ascii "foo2"
```

With inlining turned on, the assembly may look different around %bb2 with an inlined probe:

```
# %bb.2:                                # %bb2
.pseudoprobe    837061429793323041 3 0
.pseudoprobe    6699318081062747564 1 0 @ 837061429793323041:6
.pseudoprobe    837061429793323041 4 0
popq    %rax
retq
```

**Disassembling**

We have a disassembling tool (llvm-profgen) that can display disassembly alongside with pseudo probes. So far it only supports ELF executable file.

An example disassembly looks like:

```
00000000002011a0 <foo2>:
  2011a0: 50                    push   rax
  2011a1: 85 ff                 test   edi,edi
  [Probe]:  FUNC: foo2  Index: 1  Type: Block
  2011a3: 74 02                 je     2011a7 <foo2+0x7>
  [Probe]:  FUNC: foo2  Index: 3  Type: Block
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  [Probe]:  FUNC: foo   Index: 1  Type: Block  Inlined: @ foo2:6
  2011a5: 58                    pop    rax
  2011a6: c3                    ret
  [Probe]:  FUNC: foo2  Index: 2  Type: Block
  2011a7: bf 01 00 00 00        mov    edi,0x1
  [Probe]:  FUNC: foo2  Index: 5  Type: IndirectCall
  2011ac: ff d6                 call   rsi
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  2011ae: 58                    pop    rax
  2011af: c3                    ret
```

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D91878
2020-12-10 17:29:28 -08:00
Mitch Phillips
7ead5f5aa3 Revert "[CSSPGO] Pseudo probe encoding and emission."
This reverts commit b035513c06.

Reason: Broke the ASan buildbots:
  http://lab.llvm.org:8011/#/builders/5/builds/2269
2020-12-10 15:53:39 -08:00
Mitch Phillips
9aafa9fc15 Revert "[NFC] Fix a gcc build break by not using an initializer."
This reverts commit 1dc0a8521f.

Reason: Dependency of patch that broke the ASan buildbots:
  http://lab.llvm.org:8011/#/builders/5/builds/2269
2020-12-10 15:53:38 -08:00
Hongtao Yu
1dc0a8521f [NFC] Fix a gcc build break by not using an initializer.
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
2020-12-10 11:54:41 -08:00
Hongtao Yu
b035513c06 [CSSPGO] Pseudo probe encoding and emission.
This change implements pseudo probe encoding and emission for CSSPGO. Please see RFC here for more context: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s

Pseudo probes are in the form of intrinsic calls on IR/MIR but they do not turn into any machine instructions. Instead they are emitted into the binary as a piece of data in standalone sections.  The probe-specific sections are not needed to be loaded into memory at execution time, thus they do not incur a runtime overhead. 

**ELF object emission**

The binary data to emit are organized as two ELF sections, i.e, the `.pseudo_probe_desc` section and the `.pseudo_probe` section. The `.pseudo_probe_desc` section stores a function descriptor for each function and the `.pseudo_probe` section stores the actual probes, each fo which corresponds to an IR basic block or an IR function callsite. A function descriptor is stored as a module-level metadata during the compilation and is serialized into the object file during object emission.

Both the probe descriptors and pseudo probes can be emitted into a separate ELF section per function to leverage the linker for deduplication.  A `.pseudo_probe` section shares the same COMDAT group with the function code so that when the function is dead, the probes are dead and disposed too. On the contrary, a `.pseudo_probe_desc` section has its own COMDAT group. This is because even if a function is dead, its probes may be inlined into other functions and its descriptor is still needed by the profile generation tool.

The format of `.pseudo_probe_desc` section looks like:

```
.section   .pseudo_probe_desc,"",@progbits
.quad   6309742469962978389  // Func GUID
.quad   4294967295           // Func Hash
.byte   9                    // Length of func name
.ascii  "_Z5funcAi"          // Func name
.quad   7102633082150537521
.quad   138828622701
.byte   12
.ascii  "_Z8funcLeafi"
.quad   446061515086924981
.quad   4294967295
.byte   9
.ascii  "_Z5funcBi"
.quad   -2016976694713209516
.quad   72617220756
.byte   7
.ascii  "_Z3fibi"
```

For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :

```
FUNCTION BODY (one for each outlined function present in the text section)
    GUID (uint64)
        GUID of the function
    NPROBES (ULEB128)
        Number of probes originating from this function.
    NUM_INLINED_FUNCTIONS (ULEB128)
        Number of callees inlined into this function, aka number of
        first-level inlinees
    PROBE RECORDS
        A list of NPROBES entries. Each entry contains:
          INDEX (ULEB128)
          TYPE (uint4)
            0 - block probe, 1 - indirect call, 2 - direct call
          ATTRIBUTE (uint3)
            reserved
          ADDRESS_TYPE (uint1)
            0 - code address, 1 - address delta
          CODE_ADDRESS (uint64 or ULEB128)
            code address or address delta, depending on ADDRESS_TYPE
    INLINED FUNCTION RECORDS
        A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
        callees.  Each record contains:
          INLINE SITE
            GUID of the inlinee (uint64)
            ID of the callsite probe (ULEB128)
          FUNCTION BODY
            A FUNCTION BODY entry describing the inlined function.
```

To support building a context-sensitive profile, probes from inlinees are grouped by their inline contexts. An inline context is logically a call path through which a callee function lands in a caller function. The probe emitter builds an inline tree based on the debug metadata for each outlined function in the form of a trie tree. A tree root is the outlined function. Each tree edge stands for a callsite where inlining happens. Pseudo probes originating from an inlinee function are stored in a tree node and the tree path starting from the root all the way down to the tree node is the inline context of the probes. The emission happens on the whole tree top-down recursively. Probes of a tree node will be emitted altogether with their direct parent edge. Since a pseudo probe corresponds to a real code address, for size savings, the address is encoded as a delta from the previous probe except for the first probe. Variant-sized integer encoding, aka LEB128, is used for address delta and probe index.

**Assembling**

Pseudo probes can be printed as assembly directives alternatively. This allows for good assembly code readability and also provides a view of how optimizations and pseudo probes affect each other, especially helpful for diff time assembly analysis.

A pseudo probe directive has the following operands in order: function GUID, probe index, probe type, probe attributes and inline context. The directive is generated by the compiler and can be parsed by the assembler to form an encoded `.pseudoprobe` section in the object file.

A example assembly looks like:

```
foo2: # @foo2
# %bb.0: # %bb0
pushq %rax
testl %edi, %edi
.pseudoprobe 837061429793323041 1 0 0
je .LBB1_1
# %bb.2: # %bb2
.pseudoprobe 837061429793323041 6 2 0
callq foo
.pseudoprobe 837061429793323041 3 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
.LBB1_1: # %bb1
.pseudoprobe 837061429793323041 5 1 0
callq *%rsi
.pseudoprobe 837061429793323041 2 0 0
.pseudoprobe 837061429793323041 4 0 0
popq %rax
retq
# -- End function
.section .pseudo_probe_desc,"",@progbits
.quad 6699318081062747564
.quad 72617220756
.byte 3
.ascii "foo"
.quad 837061429793323041
.quad 281547593931412
.byte 4
.ascii "foo2"
```

With inlining turned on, the assembly may look different around %bb2 with an inlined probe:

```
# %bb.2:                                # %bb2
.pseudoprobe    837061429793323041 3 0
.pseudoprobe    6699318081062747564 1 0 @ 837061429793323041:6
.pseudoprobe    837061429793323041 4 0
popq    %rax
retq
```

**Disassembling**

We have a disassembling tool (llvm-profgen) that can display disassembly alongside with pseudo probes. So far it only supports ELF executable file.

An example disassembly looks like:

```
00000000002011a0 <foo2>:
  2011a0: 50                    push   rax
  2011a1: 85 ff                 test   edi,edi
  [Probe]:  FUNC: foo2  Index: 1  Type: Block
  2011a3: 74 02                 je     2011a7 <foo2+0x7>
  [Probe]:  FUNC: foo2  Index: 3  Type: Block
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  [Probe]:  FUNC: foo   Index: 1  Type: Block  Inlined: @ foo2:6
  2011a5: 58                    pop    rax
  2011a6: c3                    ret
  [Probe]:  FUNC: foo2  Index: 2  Type: Block
  2011a7: bf 01 00 00 00        mov    edi,0x1
  [Probe]:  FUNC: foo2  Index: 5  Type: IndirectCall
  2011ac: ff d6                 call   rsi
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  2011ae: 58                    pop    rax
  2011af: c3                    ret
```

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D91878
2020-12-10 09:50:08 -08:00
Craig Topper
a1ae3c6ac9 [RISCV][LegalizeDAG] Expand SETO and SETUO comparisons. Teach LegalizeDAG to expand SETUO expansion when UNE isn't legal.
If SETUNE isn't legal, UO can use the NOT of the SETO expansion.

Removes some complex isel patterns. Most of the test changes are
from using XORI instead of SEQZ.

Differential Revision: https://reviews.llvm.org/D92008
2020-12-10 09:15:52 -08:00
David Green
0447f3508f [ARM][RegAlloc] Add t2LoopEndDec
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.

Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.

There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.

This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.

Differential Revision: https://reviews.llvm.org/D91358
2020-12-10 12:14:23 +00:00
Luo, Yuanke
f80b29878b [X86] AMX programming model.
This patch implements amx programming model that discussed in llvm-dev
 (http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
 Thank Hal for the good suggestion in the RA. The fast RA is not in the patch yet.
 This patch implemeted 7 components.

1. The c interface to end user.
2. The AMX intrinsics in LLVM IR.
3. Transform load/store <256 x i32> to AMX intrinsics or split the
   type into two <128 x i32>.
4. The Lowering from AMX intrinsics to AMX pseudo instruction.
5. Insert psuedo ldtilecfg and build the def-use between ldtilecfg to amx
   intruction.
6. The register allocation for tile register.
7. Morph AMX pseudo instruction to AMX real instruction.

Change-Id: I935e1080916ffcb72af54c2c83faa8b2e97d5cb0

Differential Revision: https://reviews.llvm.org/D87981
2020-12-10 17:01:54 +08:00