VMEM soft clauses only contain VMEM and FLAT instructions. Teaching
GCNHazardRecognizer::checkSoftClauseHazards that other kinds of
instructions will naturally break the clause means there are far fewer
cases where it has to insert an s_nop instruction to forcibly break the
clause.
Differential Revision: https://reviews.llvm.org/D79353
Marking a section as ALLOC tells the ELF loader to load the section into memory.
As we do not want to load the notes into VRAM, the flag should not be there.
On AMDHSA, .note is still marked as ALLOC, apparently this is currently
needed for OpenCL (see https://reviews.llvm.org/D74995).
Differential Revision: https://reviews.llvm.org/D76278
This a hack to fix illegal 32 to 16 bit copies.
The problem is when we make 16 bit subregs legal it creates
a huge amount of failures which can only be resolved at once
without a temporary hack like this.
The next step is to change operands, instruction definitions
and patterns until this hack is not needed.
Differential Revision: https://reviews.llvm.org/D79119
Summary: This change enables all kind of carry out ISD opcodes to be selected according to the node divergence.
Reviewers: rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78091
The two code paths have the same goal, legalizing a load of a non-byte-sized vector by loading the "flattened" representation in memory, slicing off each single element and then building a vector out of those pieces.
The technique employed by `ExpandLoad` is slightly more convoluted and produces slightly better codegen on ARM, AMDGPU and x86 but suffers from some bugs (D78480) and is wrong for BE machines.
Differential Revision: https://reviews.llvm.org/D79096
VMEM loads of the same type (sampler vs no sampler) are guaranteed to
write their result registers in order, so there is no need for an
s_waitcnt even if they write to overlapping vgprs.
Differential Revision: https://reviews.llvm.org/D79176
It allows it not to crash and analyze 16 bit subregs if those
appear in the instructions. At the same time it does not attempt
to reassign these. It still can correctly identify register
banks to let larger registers to be reassigned.
More work will be needed here when real instructions will use
these registers and more tests as well.
Differential Revision: https://reviews.llvm.org/D78772
These are used in SReg_32 and when we start to use SGPR_LO16
there will be compaints that not all registers in RC support
all subreg indexes. For now it is NFC.
Unused regunits are reserved so that verifier does not complain
about missing phys reg live-ins.
Differential Revision: https://reviews.llvm.org/D78591
These are used in SReg_32 and when we start to use SGPR_LO16
there will be compaints that not all registers in RC support
all subreg indexes. For now it is NFC.
Unused regunits are reserved so that verifier does not complain
about missing phys reg live-ins.
Differential Revision: https://reviews.llvm.org/D78591
This is to fix performance regressions introduced by
86c944d790.
The old search would collect all potentially mergeable instructions in
the entire block. In this case, the same address is written in
multiple places in the block on the other side of a fence. When sorted
by offset, the two unmergeable, identical addresses would be next to
each other and the merge would give up.
Break the search space when we encounter an instruction we won't be
able to merge across. This will keep the identical addresses in
different merge attempts.
This may also improve compile time by reducing the merge list size.
Summary:
Frontend guarantees that coherent accesses have
corresponding cache policy bits set (glc, dlc).
Therefore there is no need for extra instructions
that invalidate cache.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78800
If f32 denormals were enabled pre-gfx9, we would still try to
implement this with v_max_f32. Pre-gfx9, these instructions ignored
the denormal mode and did not flush. Switch to the multiply form for
f32 as a workaround which should always work in any case.
This fixes conformance failures when the library implementation of
fmin/fmax were accidentally not inlined, forcing the assumption of no
flushing on targets where denormals are not enabled by default. This
is a workaround, since really we should not be mixing code with
different FP mode expectations, but prefer the lowering that will work
in any mode.
Now this will always use max to implement canonicalize on gfx9+. This
is only really beneficial for f64. For f32/f16 it's a neutral choice
(and worse in terms of code size in 1 case), but possibly worse for
the compiler since it does add an extra register use operand. Leave
this change for later.
12994a70cf did this for 128-bit classes:
SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
the additional non-allocatable TTMP registers. There's no point in
allocating SReg_128 vregs. This shrinks the size of the classes
regalloc needs to consider, which is usually good.
This patch extends it to all classes > 64 bits, for consistency.
Differential Revision: https://reviews.llvm.org/D78622
These are needed as a counterpart for VGPR subregs even though
there are no scalar instructions which can operate 16 bit values.
When we are materializing a constant that is done into an SGPR
and that SGPR may/will be copied into a 16 bit VGPR subreg. Such
copy is illegal. There are also similar problems if a source
operand of a 16 bit VALU instruction is an SGPR. In addition
we need to get a register with a lo16 subregister of an SGPR
RC during selection and this fails as well.
All of that makes me believe we need these subregisters as a
syntactic glue.
Differential Revision: https://reviews.llvm.org/D78250
Summary:
The INLINEASM MIR instructions use immediate operands to encode the values of some operands.
The MachineInstr pretty printer function already handles those operands and prints human readable annotations instead of the immediates. This patch adds similar annotations to the output of the MIRPrinter, however uses the new MIROperandComment feature.
Reviewers: SjoerdMeijer, arsenm, efriedma
Reviewed By: arsenm
Subscribers: qcolombet, sdardis, jvesely, wdng, nhaehnle, hiraditya, jrtc27, atanasyan, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78088
We generally only combine starting from users to defs in the artifact combiner,
but this doesn't catch cases where at the point of combining a G_UNMERGE we don't
yet have the opposite G_MERGE on input yet since we haven't legalized that far.
This change adds the users of a G_MERGE to the artifact combiner worklist if one
of the uses is a G_UNMERGE or G_TRUNC.
Differential Revision: https://reviews.llvm.org/D77931
Summary:
The combine for unmerge(cast(merge)) is only valid for vectors, but was
missing a corresponding check. Add a check that the operands are vectors
to avoid an invalid combine.
Without this check, the combiner would emit incorrect code for scalars
and pointers because the artifact cast (trunc/ext) only affects bits at
the end of the type, while this combine assumes that the casted bits
appear between meaningful bits.
This also uncovered a segmentation fault in the AMDGPU
InstructionSelector. The tests triggering this bug have been moved to
their own file and a check for the segmentation fault has been added.
Reviewers: arsenm, dsanders, aemerson, paquette, aditya_nandakumar
Reviewed By: arsenm
Subscribers: tpr, jvesely, wdng, nhaehnle, rovka, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78191
This was hitting the default instruction constraint code which uses
the register classes in the instruction def, which REG_SEQUENCE does
not have.
Fixes not constraining the register class for AMDGPU fneg/fabs
patterns, which would fail when the use was another generic,
unconstrained instruction.
Another oddity I noticed is that the temporary registers are created
with an unnecessary, but incorrect 16-bit LLT but this shouldn't
matter.
I'm also still unclear why root and sub-instructions have to be
handled differently.
Previously, getWithOffset() would drop the offset if the base was null.
Because of this, MachineMemOperand would return the wrong result from
getAlign() in these cases. MachineMemOperand stores the alignment of
the pointer without the offset.
A bunch of MIR tests changed because we print the offset now.
Split off from D77687.
Differential Revision: https://reviews.llvm.org/D78049
Ports the existing DAG combines, minus the simplify demanded bits
which seems to have no equivalent now. Without these, this isn't
particularly helpful in most of the IR sample cases.
This is equivalent in terms of LLVM IR semantics, but we want to
transition away from using MaybeAlign to represent the alignment of
these instructions.
Differential Revision: https://reviews.llvm.org/D77984
Since 1725f28841, this should check
isFMADLegalForFAddFSub rather than the the plain isOperationLegal.
This would assert in a subset of cases due to an oddity in how FMAD is
selected. We will allow FMA formation pre-legalize, but not FMAD even
in cases where it would be valid.
The current hook requires passing in the root fadd/fsub. However, in
this distributed case, this would be far more complicated to pass in
the relevant operand. AMDGPU doesn't get any value from the node, and
only needs the type and is the only implementor, so I'm not sure why
we have this complexity. Just rename and expand the assert to avoid
the more complicated checks spread through the distribution logic.
Selection would fail after the post legalize combiner put an illegal
zextload back together.
The base combiner has parameter to only allow legal operations, but
they appear to not be used. I also don't see a nice way to remove a
single entry from all_combines, so just hack around this.
These will be widened in the DAG. In the meanwhile early
widening prevents otherwise possible vectorization of
such loads.
Differential Revision: https://reviews.llvm.org/D77835