As otherwise SGPR+IMM instructions are not distinguishable to SGPR-only
ones in AsmParser, leading to ambiguities.
GFX12 doesn't have special SGPR-only variants, so we still allow
optional immediate offsets for the subtarget.
Also rename the offset operand classes while there.
Part of <https://github.com/llvm/llvm-project/issues/69256>.
One V_NOP or unrelated VALU instruction in between is required for
correctness when matrix A or B of current WMMA instruction overlaps with
matrix D of previous WMMA instruction.
Remaining cases of WMMA operand overlaps are handled by the hardware and
do not require handling in hazard recognizer.
Hardware may stall in cases where:
- matrix C of current WMMA instruction overlaps with matrix D of
previous WMMA instruction
- VALU instruction reads matrix D of previous WMMA instruction
- matrix A,B or C of WMMA instruction reads result of previous VALU
instruction
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.
This patch changes the spilling strategy by picking the VGPRs in the
reverse order, the highest available VGPR first and later after regalloc
shift them back to the lowest available range. With that, the initial
VGPRs would be available for allocation and possibility
of finding large number of contiguous registers will be more.
int_amdgcn_global_load_tr did not specify non-temporal load transpose,
thus we should
not genetrate the non-temporal hint for the load. We need to implement
getTgtMemIntrinsic
to create the corresponding MemSDNode. And we don't set the non-temporal
flag because
the intrinsic did not specify it.
NOTE: We need to implement getTgtMemIntrinsic for any memory intrinsics.
VGPRs used for spilling do not require explicit reservation with MRI.
freezeReservedRegs() executed before register allocation ensures these
are placed in the reserve set.
The only pass after SILowerSGPRSpills is SIPreAllocateWWMRegs which
explicitly tests for interference before register allocation so should
not reuse a WWM VGPR holding spill data. reserveReg prevents calculation
of correct liveness for physical registers which could be used to extend
SIPreAllocateWWMRegs.
This patch tweaks two AMDGPU passes to use iterators rather than
instruction pointers for expressing an insertion point. This is needed
to accurately support DPValues, the non-instruction storage object for
debug-info.
Two tests were sensitive to this change (variable assignments were being
put in the wrong place), and I've added extra run-lines with the "try
new debug-info..." flag. These get tested on our public buildbot to
ensure they continue to work accurately.
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.
This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
That instruction is not supported on GFX12.
Added a testcase which previously crashed without this change.
Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Support new amdgcn_global_load_tr instructions for load with transpose.
* MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128
* Intrinsic int_amdgcn_global_load_tr
* Clang builtins amdgcn_global_load_tr*
New pseudos were added for instructions that were natively VOP3 on
GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64,
V_LSHLREV_B64_pseudo
---------
Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).
---------
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on
this counter to only read memory after it is written. Wait count
insertion pass does not track memory dependencies, it tracks register
dependencies. To model the LDS dependency a pseudo register is used in
the scoreboard, acting like if LDS DMA writes it and LDS load reads it.
This patch adds 8 more pseudo registers to use for independent LDS
locations if we can prove they are disjoint using alias analysis.
Fixes: SWDEV-433427
We currently force users to use a negative contant in the intrinsic
call. Changing it zext would break existing programs, so just sign
extend an argument.