Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!`
error. More were added by cargo-culting existing declarations. Since the error
has been removed, cl::ZeroOrMore is unneeded.
Also remove cl::init(false) while touching the lines.
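A minimal before/after sketch for one such declaration (the option name is
hypothetical):

static cl::opt<bool> EnableFoo("enable-foo", cl::Hidden,
                               cl::ZeroOrMore, cl::init(false)); // before
static cl::opt<bool> EnableFoo("enable-foo", cl::Hidden);        // after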
This fixes an inconsistency between RV32 and RV64. Still considering
trying to do this peephole during isel, but wanted to fix the
inconsistency first.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126986
The AMDGPUResourceUsageAnalysis was previously a CGSCC pass, and assumed
that a function's callees were always analyzed prior to their callers.
When it was refactored into a module pass, this assumption no longer
always held. Calls were then erroneously identified as indirect, and
private segment space was reserved for them, resulting in significantly
slower kernel launch latency.
This patch changes the order in which the module's functions are analyzed
from the order in which they occur in the module to a post-order traversal
of the call graph. Perhaps Clang always generates the module's functions
in such an order, but this is not the case for the Cray Fortran compiler.
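A minimal sketch of the new iteration order, assuming a hypothetical
per-function helper named analyzeFunction (the actual patch may differ in
details):

// Post-order over the call graph visits callees before their callers,
// so a function's resource usage is known before its callers need it.
CallGraph CG(M);
for (CallGraphNode *Node : post_order(&CG))
  if (Function *F = Node->getFunction())
    if (!F->isDeclaration())
      analyzeFunction(*F); // hypothetical per-function resource analysis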
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D126025
We intentionally disable Thumb2SizeReduction for SEH
prologues/epilogues, to avoid needing to guess, in frame lowering, what
a potential later pass will do with the instructions.
But for this specific case, where we know we can express the
intent with a narrow instruction, switch to that instruction form
directly in frame lowering.
Differential Revision: https://reviews.llvm.org/D126949
We intentionally disable Thumb2SizeReduction for SEH
prologues/epilogues, to avoid needing to guess, in frame lowering, what
a potential later pass will do with the instructions.
But for this specific case, where we know we can express the
intent with a narrow instruction, switch to that instruction form
directly in frame lowering.
Differential Revision: https://reviews.llvm.org/D126948
Test changes are because isBaseWithConstantOffset uses computeKnownBits,
which is able to see that an earlier AND instruction guarantees
alignment, so that an OR can be treated as an ADD.
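A hedged illustration of the kind of pattern involved (IR shown in
comments; values illustrative):

// %base has its low bits cleared by the AND, so computeKnownBits can
// prove the OR only sets known-zero bits and thus acts like an ADD:
//   %base = and i64 %x, -8   ; low 3 bits are known zero
//   %addr = or  i64 %base, 4 ; equivalent to add %base, 4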
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126970
Previously we had 3 different isel patterns for every scalar load/store
instruction.
This reduces them to a single ComplexPattern that returns the Base
and Offset, or an offset of 0 if no offset was identified.
I've done a similar thing for the 2 isel patterns that match add/or
with FrameIndex and immediate. Using the offset of 0, I was also
able to remove the custom handler for FrameIndex. Happy to split that
into another patch.
We might be able to enhance this in the future to remove the post-isel
peephole or the special handling for ADD with constant added by D126576.
A nice side effect is that this removes nearly 3000 bytes from the isel
table.
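A hedged sketch of the shape of such a ComplexPattern callback
(simplified; FrameIndex and the other special cases mentioned above are
omitted):

bool RISCVDAGToDAGISel::SelectAddrRegImm(SDValue Addr, SDValue &Base,
                                         SDValue &Offset) {
  SDLoc DL(Addr);
  MVT VT = Addr.getSimpleValueType();
  if (CurDAG->isBaseWithConstantOffset(Addr)) {
    int64_t CVal = cast<ConstantSDNode>(Addr.getOperand(1))->getSExtValue();
    if (isInt<12>(CVal)) {
      Base = Addr.getOperand(0);
      Offset = CurDAG->getTargetConstant(CVal, DL, VT);
      return true;
    }
  }
  // No offset identified: use the whole address as the base, offset 0.
  Base = Addr;
  Offset = CurDAG->getTargetConstant(0, DL, VT);
  return true;
}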
Differential Revision: https://reviews.llvm.org/D126932
Patch enables custom lowering for MVT::nxv4bf16 because otherwise
the refactored test file triggers a selection failure.
The reason for the refactoring is to highlight cases where the
generated code is wrong.
This patch fixes issue 48588 (https://github.com/llvm/llvm-project/issues/48588).
Instruction selection produces different results in SelectionDAG and FastISel for `%mul = mul i32 %A, 4294967295`:
(seldag-isel) mul --> sub --> SUB32dp
(fast-isel) mul --> sub --> NEG32d
This patch fixes the issue by overriding the virtual function M68kDAGToDAGISel::IsProfitableToFold() to return `false` when the node is about to be folded into a SUB, so that it is matched as a NEG instead.
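A hedged sketch of the override (simplified; the exact condition in the
patch may differ):

bool M68kDAGToDAGISel::IsProfitableToFold(SDValue N, SDNode *U,
                                          SDNode *Root) const {
  // Declining to fold into a (sub 0, x) lets it be selected as NEG32d
  // instead of a SUB with an explicit zero operand.
  if (U->getOpcode() == ISD::SUB)
    if (auto *C = dyn_cast<ConstantSDNode>(U->getOperand(0)))
      if (C->isZero())
        return false;
  return SelectionDAGISel::IsProfitableToFold(N, U, Root);
}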
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D116886
The immediate range check for CSImm12MulBy8 included some values
covered by CSImm12MulBy4. I assume CSImm12MulBy4 had priority due
to pattern order in the td file, but this makes the priority
explicit in the predicate.
This patch improves the codegen of extractelement and insertelement for
vectors containing 8 elements. Before, a DAG combine transformation was
generating a sequence of 8 select/cmp pairs.
This patch changes the upper limit for this transformation so that the
movrel instruction is eventually used instead. Extractelement/insertelement
for vectors containing fewer than 8 elements are unchanged.
Differential Revision: https://reviews.llvm.org/D126389
If the imm is out of range for an ADDI, we will materialize it in
a register using multiple instructions. If the ADD is used by a
load/store, doPeepholeLoadStoreADDI can try to pull an ADDI from
the constant materialization into the load/store offset. This only
works if the ADD has a single use, otherwise the peephole would have
to rebuild multiple nodes.
This patch instead tries to solve the problem when the add is selected.
We check that the add is only used by loads/stores and if it is
we will select it to (ADDI (ADD X, Imm-Lo12), Lo12). This will enable
the simple case in doPeepholeLoadStoreADDI that can bypass an ADDI
used as a pointer. As a result we can remove the more complicated
peephole from doPeepholeLoadStoreADDI.
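A hedged illustration of the new selection (constants illustrative):

// (load (add X, 0x12345))  ; 0x12345 doesn't fit in a simm12
// is now selected as
// (load (ADDI (ADD X, 0x12000), 0x345))
// so doPeepholeLoadStoreADDI can fold the ADDI into the load's offset.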
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126576
This patch includes MC layer support for VOP3-encoded instructions and
generic VOP support classes.
Some VOP1 and VOP2 instructions which share an encoding with gfx10 and
use the AssemblerPredicate isGFX10Plus are also enabled. That predicate
will be changed to isGFX10Only in a later patch.
Patch 15/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126468
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D126475
MC layer support for DS instructions.
Contributors:
Piotr Sobczak <Piotr.Sobczak@amd.com>
Patch 14/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126463
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D126468
Once we've computed the incoming predecessor state, we should use the same compatibility check, with knowledge of MI, that we used in phase 2, in order to be consistent across all phases.
Differential Revision: https://reviews.llvm.org/D126574
LowerINSERT_SUBVECTOR emits AArch64ISD::UUNPK## when lowering
scalable vector floating point INSERT_SUBVECTOR. However, these
nodes only make sense for integer types and thus isel patterns do
not exist for floating point, which leads to isel failures.
This patch ensures floating point operands are cast to integer
before the core lowering takes place.
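A hedged sketch of the cast-through-integer fix (simplified; the
surrounding lowering is elided, and lowerIntInsertSubvector is a
hypothetical stand-in for the existing integer lowering path):

EVT VT = Op.getValueType();
if (VT.isFloatingPoint()) {
  // Redo the lowering on the equivalent integer types, for which the
  // UUNPK isel patterns exist, then cast the result back.
  EVT IntVT = VT.changeTypeToInteger();
  SDValue Vec = DAG.getBitcast(IntVT, Op.getOperand(0));
  SDValue SubVec = DAG.getBitcast(
      Op.getOperand(1).getValueType().changeTypeToInteger(),
      Op.getOperand(1));
  SDValue Res = lowerIntInsertSubvector(Vec, SubVec, Op.getOperand(2));
  return DAG.getBitcast(VT, Res);
}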
Fixes: #55037
Differential Revision: https://reviews.llvm.org/D126487
For functions that require restoring SP from FP (e.g. those that need to
align the stack, or that have variable-sized allocations), the prologue
and epilogue previously looked like this:
push {r4-r5, r11, lr}
add r11, sp, #8
...
sub r4, r11, #8
mov sp, r4
pop {r4-r5, r11, pc}
This is problematic, because this unwinding operation (restoring sp
from r11 - offset) can't be expressed with the SEH unwind opcodes
(probably because this unwind procedure doesn't map exactly to
individual instructions; note the detour via r4 in the epilogue too).
To make unwinding work, the GPR push is split into two; the first one
pushing all other registers, and the second one pushing r11+lr, so that
r11 can be set pointing at this spot on the stack:
push {r4-r5}
push {r11, lr}
mov r11, sp
...
mov sp, r11
pop {r11, lr}
pop {r4-r5}
bx lr
For the same setup, MSVC generates code that uses two registers;
r11 still pointing at the {r11,lr} pair, but a separate register
used for restoring the stack at the end:
push {r4-r5, r7, r11, lr}
add r11, sp, #12
mov r7, sp
...
mov sp, r7
pop {r4-r5, r7, r11, pc}
For cases with clobbered float/vector registers, they are pushed
after the GPRs, before the {r11,lr} pair.
Differential Revision: https://reviews.llvm.org/D125649
Skip inserting regular CFI instructions if using WinCFI.
This is based to a fair extent on the corresponding ARM64 implementation,
but instead of trying to insert the SEH opcodes one by one where
we generate other prolog/epilog instructions, we try to walk over the
whole prolog/epilog range and insert them. This is done because in
many cases, the exact number of instructions inserted is abstracted
away at deeper levels.
For some cases, we manually insert specific SEH opcodes directly where
instructions are generated, where the automatic mapping of instructions
to SEH opcodes doesn't hold up (e.g. for __chkstk stack probes).
Skip Thumb2SizeReduction for SEH prologs/epilogs, and force
tail calls to wide instructions (just like on MachO), to make sure
that the unwind info actually matches the width of the final
instructions, without heuristics about what later passes will do.
Mark SEH instructions as scheduling boundaries, to make sure that they
aren't reordered away from the instruction they describe by
PostRAScheduler.
Mark the SEH instructions with the NoMerge flag, to avoid doing
tail merging of functions that have multiple epilogs that all end
with the same sequence of "b <other>; .seh_nop_w, .seh_endepilogue".
Differential Revision: https://reviews.llvm.org/D125648
Rename CalleeSavedRegs defs to avoid being overly specific:
* CSR_AMDGPU_AGPRs_32_255 => CSR_AMDGPU_AGPRs
* CSR_AMDGPU_SGPRs_30_31 + CSR_AMDGPU_SGPRs_32_105 => CSR_AMDGPU_SGPRs
* CSR_AMDGPU_SI_Gfx_SGPRs_4_29 + CSR_AMDGPU_SI_Gfx_SGPRs_64_105 =>
CSR_AMDGPU_SI_Gfx_SGPRs
* CSR_AMDGPU_HighRegs => CSR_AMDGPU
* CSR_AMDGPU_HighRegs_With_AGPRs => CSR_AMDGPU_GFX90AInsts
* CSR_AMDGPU_SI_Gfx_With_AGPRs => CSR_AMDGPU_SI_Gfx_GFX90AInsts
Introduce a class RegMask to mark the cases where we use the
CalleeSavedRegs class purely as an expedient way to produce a mask.
Update the names of these masks to not mention "CSR". Other targets also
seem to do this, so a reasonable alternative is to actually update
TableGen to include a new class to do this explicitly, but the current
approach seems harmless, so I opted to just make it more explicit.
Reviewed By: arsenm, sebastian-ne
Differential Revision: https://reviews.llvm.org/D109008
We enable a custom handler to optimize conversions between scalars
and fixed vectors. Unfortunately, the custom handler picks up
scalar-to-scalar conversions as well. If the scalar types are both legal,
we wouldn't match any of the fixed vector cases and would return SDValue(),
causing LegalizeDAG to expand the bitcast through memory.
This patch fixes this by checking for a scalar-to-scalar conversion
and returning `Op` if both types are legal.
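A hedged sketch of the early-out (simplified):

// A scalar-to-scalar bitcast between two legal types needs no custom
// lowering; returning the original Op tells the legalizer the node is
// already fine, instead of falling back to expanding through memory.
EVT VT = Op.getValueType();
EVT Op0VT = Op.getOperand(0).getValueType();
if (!VT.isVector() && !Op0VT.isVector() &&
    isTypeLegal(VT) && isTypeLegal(Op0VT))
  return Op;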
Differential Revision: https://reviews.llvm.org/D126739
We may need to peek through a bitcast when identifying an fneg idiom
via its pool constant, but we can't allow a different-sized constant
in that match.
This is noted in issue #55758 with an example that needs fast-math,
but as the test here shows, this has potential to miscompile more
generally (no fast-math required).
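A hedged illustration of the hazard (types illustrative):

// An fneg idiom flips sign bits by XOR'ing with a pool constant, and
// the constant may be visible only through a bitcast:
//   (v4f32 (xor X, (bitcast (v2i64 splat C))))
// The v2i64 constant can only be accepted as a v4f32 sign mask if the
// match accounts for the differing element size.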
Differential Revision: https://reviews.llvm.org/D126775
Avoid the dependency on TargetInstrInfo, which depends on the subtarget
and therefore the individual function.
Currently AMDGPU is constructing PseudoSourceValue instances in MachineFunctionInfo.
In order to facilitate copying MachineFunctionInfo, we need to stop allocating these
there. Alternatively we could allow targets to subclass PseudoSourceValueManager,
and allocate them similarly to MachineFunctionInfo.
If the LHS/RHS selection operands can be cheaply concatenated back
together, then replace 2 x 128-bit selection nodes with 1 x 256-bit node.
Addresses the regression introduced in the bug fix from
rGd5af6a38082b39ae520a328e44dc29ebcb036bb2.
It's a fairly common issue that the generating code incorrectly marks
instructions as narrow or wide; check that the instruction lengths
add up to the expected value, and error out if they don't. This allows
catching code generation bugs.
Also check that prologs and epilogs are properly terminated, to
catch other code generation issues.
Differential Revision: https://reviews.llvm.org/D125647
This includes .seh_* directives for generating it from assembly.
It is designed fairly similarly to the ARM64 handling.
For .seh_handler directives, such as
".seh_handler __C_specific_handler, @except" (which is supported
on x86_64 and aarch64 so far), the "@except" bit doesn't work in
ARM assembly, as '@' is used as a comment character (on all current
platforms).
Allow using '%' instead of '@' for this purpose (e.g.
".seh_handler __C_specific_handler, %except"). This convention is
already used by GAS in similar contexts, e.g. [1]:
Note on targets where the @ character is the start of a comment
(eg ARM) then another character is used instead. For example the
ARM port uses the % character.
In practice, this unfortunately means that all such .seh_handler
directives will need ifdefs for ARM.
Contrary to ARM64, on ARM it's quite common that we can't evaluate
e.g. the function length at this point, due to instructions whose
length is finalized later. (Also, inline jump tables end with
a ".p2align 1".)
If unable to evaluate the function length immediately, emit
it as an MCExpr instead. If we were to implement splitting the unwind
info for a function (which isn't implemented for ARM64 yet either),
we wouldn't know at this point whether we need to split it, though.
Avoid calling getFrameIndexOffset() on an unset
FuncInfo.UnwindHelpFrameIdx, to avoid triggering asserts in the
preexisting testcase CodeGen/ARM/Windows/wineh-basic.ll. (Once
MSVC exception handling is fully implemented, those changes
can be reverted.)
[1] https://sourceware.org/binutils/docs/as/Section.html#Section
Differential Revision: https://reviews.llvm.org/D125645
For ARM SEH, the epilogs will need a little more associated data than
just the plain list of opcodes.
This is a preparatory refactoring for D125645.
Differential Revision: https://reviews.llvm.org/D125879
In CSKYConstantIslands, when fixing up an unconditional branch (CSKY::BR32)
whose destination is too far away to fit in its displacement field, we can
use BSR to implement a far jump if the R15 (LR) register has been spilled
in the prologue. So we need to estimate the function size, and spill
R15 (LR) when the function size exceeds the range an unconditional branch
(CSKY::BR32) can reach.
The EstimateFunctionSizeInBytes function adds up the sizes of all
instructions and constant pool entries (each entry is 4 bytes).
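A hedged sketch of the estimate (simplified):

uint64_t EstimateFunctionSizeInBytes(const MachineFunction &MF,
                                     const TargetInstrInfo &TII) {
  uint64_t Size = 0;
  // Add up the encoded size of every instruction in the function.
  for (const MachineBasicBlock &MBB : MF)
    for (const MachineInstr &MI : MBB)
      Size += TII.getInstSizeInBytes(MI);
  // Each constant pool entry is 4 bytes.
  Size += 4 * MF.getConstantPool()->getConstants().size();
  return Size;
}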
This patch implements a DAG mutation which adds edges between different groups of instructions. The purpose is to try to generate code that conforms to a pipeline (groupA instructions occur before groupB, groupB before groupC, and so on). Currently the pipeline order is hardcoded as VMEM->DSRead->MFMA->DSWrite, but the patch was designed to be easily extensible. Alias analysis is problematic for pipelining, as memory instructions will usually not be able to be reordered w.r.t. one another.
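A hedged sketch of the mutation shape; classifyGroups is a hypothetical
helper that buckets SUnits into the pipeline stages:

class PipelineOrderMutation : public ScheduleDAGMutation {
  void apply(ScheduleDAGInstrs *DAG) override {
    // Bucket the SUnits into VMEM/DSRead/MFMA/DSWrite groups, then add
    // artificial edges from each group to the next so the scheduler
    // keeps the pipeline order without affecting correctness.
    SmallVector<SmallVector<SUnit *, 8>, 4> Groups = classifyGroups(*DAG);
    for (unsigned I = 0; I + 1 < Groups.size(); ++I)
      for (SUnit *Pred : Groups[I])
        for (SUnit *Succ : Groups[I + 1])
          DAG->addEdge(Succ, SDep(Pred, SDep::Artificial));
  }
};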
Differential Revision: https://reviews.llvm.org/D125997
If the adjustment doesn't fit in 12 bits, try to break it into
two 12-bit values before falling back to movImm+add/sub.
This is based on a similar idea from isel.
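A hedged illustration (amounts illustrative):

// Adjusting sp by -3000, which doesn't fit in one 12-bit immediate:
//   addi sp, sp, -2048
//   addi sp, sp, -952
// instead of materializing 3000 in a scratch register and subtracting.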
Reviewed By: luismarques, reames
Differential Revision: https://reviews.llvm.org/D126392