Commit Graph

8091 Commits

Author SHA1 Message Date
Jay Foad
92542f2a40 [AMDGPU] Add targets gfx1150 and gfx1151
This is the target definition only. Currently they are treated the same
as GFX 11.0.x.

Differential Revision: https://reviews.llvm.org/D155429
2023-07-17 13:06:12 +01:00
Jakub Chlanda
3cd3f11c17 [NFC][AMDGPU] Default initialize the Subtarget
This is to address a static analyzer warning:

The pointer field will point to an arbitrary memory location, any
attempt to write may cause corruption. In
<unnamed>::R600DAGToDAGISel::R600DAGToDAGISel(llvm::TargetMachine &,
llvm::CodeGenOpt::Level): A pointer field is not initialized in the
constructor (CWE-457)

Differential Revision: https://reviews.llvm.org/D154414
2023-07-17 11:39:29 +02:00
Jon Chesterfield
6043d4dfec [amdgpu] Accept an optional max to amdgpu-lds-size attribute for use in PromoteAlloca
2023-07-15 21:37:21 +01:00
Jon Chesterfield
a222951148 [amdgpu][nfc] Use unsigned for getIntegerPairAttribute to match the only call sites
2023-07-15 20:42:13 +01:00
pvanhout
e5296c52e5 [AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHIs
The previous heuristic rejected a PHI if one of its users was an unbreakable PHI, no matter what the other users were.

This worked well in most cases, but there's one case in rocRAND where
it doesn't work. In that case, a PHI node has 2 PHI users where one is
breakable but not the other. When that PHI node isn't broken, performance falls by 35%.

Relaxing the restriction to "require that half of the PHI node users are breakable" fixes the issue, and seems like a sensible change.

Solves SWDEV-409648, SWDEV-398393

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155184
2023-07-14 09:02:51 +02:00
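
The relaxed rule in e5296c52e5 can be illustrated with a small standalone model. This is only a sketch of the heuristic as the commit message words it, not the code from the pass: `PhiUser`, `shouldBreakOld` and `shouldBreakRelaxed` are hypothetical names, and "half" is assumed to mean "at least half".

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a PHI user: all we track is whether that user
// is itself a breakable PHI.
struct PhiUser {
  bool Breakable;
};

// Old rule as described in the message: a single unbreakable PHI user
// is enough to reject the PHI.
bool shouldBreakOld(const std::vector<PhiUser> &Users) {
  for (const PhiUser &U : Users)
    if (!U.Breakable)
      return false;
  return true;
}

// Relaxed rule (assumption: "half" means "at least half"): accept the
// PHI when at least half of its PHI users are breakable.
bool shouldBreakRelaxed(const std::vector<PhiUser> &Users) {
  std::size_t NumBreakable = 0;
  for (const PhiUser &U : Users)
    if (U.Breakable)
      ++NumBreakable;
  return NumBreakable * 2 >= Users.size();
}
```

For the rocRAND shape described above (two PHI users, one breakable), the old rule rejects the PHI while the relaxed rule accepts it.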
Jon Chesterfield
d3316bc111 [amdgpu] Delete elide-module-lds attribute
Requires D155190

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155238
2023-07-14 00:36:33 +01:00
Jon Chesterfield
74e928a081 [amdgpu][lds] Remove recalculation of LDS frame from backend
Do the LDS frame calculation once, in the IR pass, instead of repeating the work in the backend.

Prior to this patch:
The IR lowering pass sets up a per-kernel LDS frame and annotates the variables with absolute_symbol
metadata so that the assembler can build lookup tables out of it. There is a fragile association between
kernel functions and named structs which is used to recompute the frame layout in the backend, with
fatal_errors catching inconsistencies in the second calculation.

After this patch:
The IR lowering pass additionally sets a frame size attribute on kernels. The backend uses the same
absolute_symbol metadata that the assembler uses to place objects within that frame size.

Deleted the now dead allocation code from the backend. Left for a later cleanup:
- enabling lowering for anonymous functions
- removing the elide-module-lds attribute (test churn, it's not used by llc any more)
- adjusting the dynamic alignment check to not use symbol names

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155190
2023-07-13 23:54:38 +01:00
Stanislav Mekhanoshin
7972b9c829 [AMDGPU] Move SIEncodingFamily into SIDefines.h. NFC.
I need this for a future patch in the MC layer, where TII is not
available (llvm-mc). Besides, this is not the first time I have wanted it there.

Differential Revision: https://reviews.llvm.org/D155228
2023-07-13 12:42:28 -07:00
Jeffrey Byrnes
6b7805fcb1 [AMDGPU][IGLP] Add iglp_opt(1) strategy for single wave gemms
This adds the IGLP strategy for single-wave gemms. The SchedGroup pipeline is laid out in multiple phases, with each phase corresponding to a distinct pattern present in gemm kernels. The resilience of the optimization is dependent upon IR (as seen by pre-RA scheduling) continuing to have these patterns (as defined by instruction class and dependencies) in their current relative ordering.

The kernels of interest have these specific phases:
NT: 1, 2a, 2c
NN: 1, 2a, 2b
TT: 1, 2b, 2c
TN: 1, 2b

The general approach taken was to have a long SchedGroup pipeline. In this way the scheduler will have less capability of doing the wrong thing. In order to resolve the challenge of correctly fitting these long pipelines, we leverage the rules infrastructure to help the solver.

Differential Revision: https://reviews.llvm.org/D149773

Change-Id: I1a35962a95b4bdf740602b8f110d3297c6fb9d96
2023-07-13 12:03:04 -07:00
Ivan Kosarev
7b6e606dac [AMDGPU][AsmParser][NFC] Translate parsed MIMG instructions to MCInsts automatically.
Part of <https://github.com/llvm/llvm-project/issues/62629>.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D155061
2023-07-13 19:47:31 +01:00
Ivan Kosarev
289ae6525d [AMDGPU][MC] Fix handling of A16 operands in intersect_ray instructions.
The patch adds the support for 'noa16' operands in non-A16 variants of
the instructions, fixes validation of A16 operands and eliminates the
custom conversion to MCInst.

Part of <https://github.com/llvm/llvm-project/issues/62629>.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155057
2023-07-13 19:46:03 +01:00
Mateja Marjanovic
fa46feb314 [AMDGPU] Use V_FMA_MIX* more often
Combine mul (f32) + fptrunc (f32->f16) to "v_fma_mixlo_f16 mulSrc1, mulSrc2, 0".

Differential Revision: https://reviews.llvm.org/D153544
Reviewers: arsenm, foad
2023-07-13 16:56:16 +02:00
pvanhout
07c5920487 Reland "[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64"
This time without the extra `->dump()`

A recent addition to the device libs, `__ockl_dm_trim`, caused a series of
failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function.

The quick fix for this is to support codegen for this rare case.
A proper long-term fix for this type of issue is still being discussed.

Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155050
2023-07-13 15:58:48 +02:00
pvanhout
aec971adec Revert "[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64"
This reverts commit cfa2d0a3aa.
2023-07-13 15:52:27 +02:00
Mateja Marjanovic
701c4adcea Check for denormal flushing when selecting V_FMA/MAD_MIX*
2023-07-13 15:26:20 +02:00
pvanhout
cfa2d0a3aa [AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64
A recent addition to the device libs, `__ockl_dm_trim`, caused a series of
failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function.

The quick fix for this is to support codegen for this rare case.
A proper long-term fix for this type of issue is still being discussed.

Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155050
2023-07-13 15:20:58 +02:00
Jon Chesterfield
9418c40af7 [amdgpu][lds] Raise an explicit unimplemented error on absolute address LDS variables
These aren't implemented. They could be at moderate implementation
complexity. Raising an error is better than silently miscompiling.

Patching now because the patch at D155125 is a step towards using this metadata
more extensively as part of the lowering path and that will interact badly with
input variables with this annotation.

Lowering user defined variables at specific addresses would drop this error,
put them at the requested position in the frame during this pass, and then
use the same codegen that will be used for the kernel specific struct shortly.

Reviewed By: jmmartinez

Differential Revision: https://reviews.llvm.org/D155132
2023-07-13 11:32:03 +01:00
pvanhout
361e9eec51 [AMDGPU] Correctly emit AGPR copies in tryFoldPhiAGPR
- Don't create COPY instructions between PHI nodes.
- Don't create V_ACCVGPR_WRITE with operands that aren't AGPR_32

Solves SWDEV-410408

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155080
2023-07-13 08:55:22 +02:00
pvanhout
3c30179e98 [GlobalISel] Rename KnownBits field of InstructionSelector
`KnownBits` is also a type name. Having a field with this name
prevents derived classes from using the `KnownBits` type unless they use `struct KnownBits`.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D155082
2023-07-12 15:28:11 +02:00
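
The name clash described in 3c30179e98 can be reproduced outside LLVM. The sketch below uses stand-in types only: `KnownBits`, `Analysis`, `SelectorBefore` and `SelectorAfter` are hypothetical reductions, and the replacement field name `KB` is chosen for illustration rather than taken from the commit message.

```cpp
// Stand-in for the KnownBits type, reduced to a single member.
struct KnownBits {
  unsigned Zero = 0;
};

// Hypothetical analysis object that produces KnownBits values.
struct Analysis {
  KnownBits compute() const { return KnownBits{0xF0u}; }
};

// Before the rename: the field shares its name with the type.
struct SelectorBefore {
  Analysis *KnownBits = nullptr;
};

struct DerivedBefore : SelectorBefore {
  unsigned zeros(const Analysis &A) {
    // Writing `KnownBits K = ...` here would fail to compile, because
    // unqualified lookup finds the inherited field, not the type.
    struct KnownBits K = A.compute(); // elaborated specifier still works
    return K.Zero;
  }
};

// After the rename (field name is illustrative): the type name is
// usable directly in derived classes again.
struct SelectorAfter {
  Analysis *KB = nullptr;
};

struct DerivedAfter : SelectorAfter {
  unsigned zeros(const Analysis &A) {
    KnownBits K = A.compute();
    return K.Zero;
  }
};
```

Both derived classes compute the same value; the difference is only whether the `struct` keyword is needed to reach the type.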
Juan Manuel MARTINEZ CAAMAÑO
367b1f28db [NFC][AMDGPULowerModuleLDSPass] Fix sanitizer buildbot failing to compile
It seems that the sanitizer-x86_64-linux-android wasn't able to deduce
the template argument:

  AMDGPULowerModuleLDSPass.cpp:1192:53: error: no viable constructor or
  deduction guide for deduction of template arguments of 'vector'
        auto TableLookupVariablesOrdered = sortByName(std::vector(

This patch makes the template argument explicit.
2023-07-12 11:08:16 +02:00
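
As an illustration of the fix in 367b1f28db, the sketch below spells out the element type instead of relying on class template argument deduction for `std::vector`. It is not the code from the pass: this `sortByName` and `orderedNames` are hypothetical helpers operating on plain strings.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the names in sorted order.
std::vector<std::string> sortByName(std::vector<std::string> V) {
  std::sort(V.begin(), V.end());
  return V;
}

std::vector<std::string>
orderedNames(const std::unordered_set<std::string> &Names) {
  // Deduced form, which depends on the standard library providing a
  // usable deduction guide for the iterator-pair constructor:
  //   sortByName(std::vector(Names.begin(), Names.end()));
  // Explicit form, which does not depend on deduction at all:
  return sortByName(std::vector<std::string>(Names.begin(), Names.end()));
}
```

The explicit form costs a few extra characters and compiles with toolchains whose standard library lacks the deduction guide.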
Juan Manuel MARTINEZ CAAMAÑO
3a75551e85 Reland "[NFC][AMDGPULowerModuleLDSPass] Factorize repeated sort code"
Fixed compilation error and redundant copy warning

Differential Revision: https://reviews.llvm.org/D154977
2023-07-12 09:27:20 +02:00
Jay Foad
f7684d8510 [DAG] Use legal shift amount type in DAGTypeLegalizer::JoinIntegers
Documentation for TargetLowering::getShiftAmountTy says that LegalTypes
should generally be true during type legalization, so this patch does
that.

On AMDGPU the effect is that we use i32 (a sane type) instead of i64
(pointer sized type) for more shift amounts, which in turn allows more
formation of rotates and funnel shifts pre-legalization.

Differential Revision: https://reviews.llvm.org/D154960
2023-07-12 08:12:09 +01:00
Jon Chesterfield
e75ce77cd7 [amdgpu][lds] Fix missing markUsedByKernel calls and undef lookup table elements
More robust association between the kernels and lds struct.

Use poison instead of value() for lookup table elements introduced by dynamic lds lowering.

Extracted from D154946, new test from there verbatim. Segv fixed.

Fixes issues/63338

Fixes SWDEV-404491

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154972
2023-07-12 00:37:21 +01:00
Matt Arsenault
fbe4ff8149 AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
2023-07-11 15:14:52 -04:00
Juan Manuel MARTINEZ CAAMAÑO
ebdd610ad4 Revert "[NFC][AMDGPULowerModuleLDSPass] Factorize repeated sort code"
This reverts commit 125b90749a.
2023-07-11 17:08:59 +02:00
Juan Manuel MARTINEZ CAAMAÑO
125b90749a [NFC][AMDGPULowerModuleLDSPass] Factorize repeated sort code
Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154970
2023-07-11 17:03:58 +02:00
Juan Manuel MARTINEZ CAAMAÑO
70bb5d2b9d [NFC][AMDGPULowerModuleLDSPass] Add const to some variables/parameters
Moving out some changes not related to the bugfix in https://reviews.llvm.org/D154946

Reviewed By: JonChesterfield, arsenm

Differential Revision: https://reviews.llvm.org/D154959
2023-07-11 15:51:57 +02:00
Juan Manuel MARTINEZ CAAMAÑO
abf081975e [NFC][AMDGPULowerModuleLDSPass] Remove dead variable
2023-07-11 12:35:21 +02:00
pvanhout
8444038d16 [AMDGPU] Use GlobalISel MatchTable Combiner Backend
Use the new matchtable-based combiner backend for all AMDGPU combiners.
This is a drop-in replacement from the user's perspective; there are no test changes, the new combiner behaves exactly like the old one.

Depends on D153757

NOTE: This would land iff D153757 (RFC) lands too.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D153758
2023-07-11 11:27:13 +02:00
pvanhout
1fe7d9c799 [GlobalISel] Generalize InstructionSelector Match Tables
Makes `InstructionSelector.h`/`InstructionSelectorImpl.h` generic so the match tables can also be used for the combiner.

Some notes:
 - Coverage was made an optional parameter of `executeMatchTable`, combines won't use it for now.
 - `GIPFP_` -> `GICXXPred_` so it's more generic. Those are just C++ predicates and aren't PatFrag-specific.
 - Pass the MatcherState directly to testMIPredicate_MI, the combiner will need it.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D153755
2023-07-11 09:42:30 +02:00
Amara Emerson
3a80bdb316 [GlobalISel] Remove an erroneous oneuse check in the G_ADD reassociation combine.
This check was unnecessary/incorrect, it was already being done by the target
hook default implementation, and the one in the matcher was checking for a
completely different thing. This change:
 1) Removes the check and updates affected tests which now do some more reassociations.
 2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse
    check. Not sure why I didn't do this the first time.
2023-07-10 01:03:12 -07:00
Johannes Doerfert
02a4fcec6b [Attributor] Port AANonNull to the isImpliedByIR interface
AANonNull is now the first AA that is always queried via the new APIs
and not created manually. Others will follow shortly to avoid trivial
AAs whenever possible.

This commit introduced some helper logic that will make it simpler to
port the next one. It also untangles AADereferenceable and AANonNull
such that the former does not keep a handle on the latter. Finally,
we stop deducing `nonnull` for `undef`, which was incorrect.
2023-07-09 16:04:19 -07:00
Matt Arsenault
64d325454b AMDGPU: Delete custom combine on class intrinsic
This is no longer necessary as class-with-constant will always be
transformed to the generic class intrinsic.

https://reviews.llvm.org/D153901
2023-07-07 15:28:21 -04:00
Christudasan Devadasan
7a98f084c4 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions: SGPR spilling fails
when there aren't enough VGPRs, and we are then forced
to spill the leftover into memory during PEI. The custom
spill handling during PEI has many edge cases and breaks
the compiler from time to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

Differential Revision: https://reviews.llvm.org/D124196
2023-07-07 23:14:32 +05:30
Christudasan Devadasan
b4a62b1fa5 [AMDGPU] Enable whole wave register copy
So far, we haven't exposed the allocation of whole-wave
registers to regalloc. We hand-picked them for various
whole wave mode operations. With a future patch, we
want the allocator to efficiently allocate them rather
than using the custom pre-allocation pass.

Any liverange split of virtual registers involved in
whole-wave operations requires the resulting COPY
introduced with the split to be performed for all
lanes. It isn't implemented in the compiler yet.

This patch identifies all such copies and
manipulates the exec mask around them to enable all
lanes without affecting the value of exec mask
elsewhere.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D143762
2023-07-07 22:58:55 +05:30
Christudasan Devadasan
b78b36e1a2 [AMDGPU] Implement whole wave register spill
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.

This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually cause the whole
wave register spills during allocation.

Reviewed By: arsenm, cdevadas

Differential Revision: https://reviews.llvm.org/D143759
2023-07-07 22:51:45 +05:30
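
The exec handling described in b78b36e1a2 (enable all lanes just before the spill, then restore) can be pictured with a toy model. Everything here is an assumption made for illustration: the `Wave` struct, the 64-lane width, and both function names are hypothetical, and real spills are machine instructions rather than loops.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned NumLanes = 64; // assumption: wave64

// Toy model of one wave: an exec mask, one VGPR, and its spill slot.
struct Wave {
  std::uint64_t Exec = 0;
  std::array<std::uint32_t, NumLanes> VGPR{};
  std::array<std::uint32_t, NumLanes> SpillSlot{};
};

// Ordinary spill: only lanes enabled in exec are written to the slot.
void spillActiveLanes(Wave &W) {
  for (unsigned Lane = 0; Lane < NumLanes; ++Lane)
    if (W.Exec & (std::uint64_t{1} << Lane))
      W.SpillSlot[Lane] = W.VGPR[Lane];
}

// Whole-wave spill: save exec, set every bit, spill, restore exec.
void spillWholeWave(Wave &W) {
  const std::uint64_t SavedExec = W.Exec;
  W.Exec = ~std::uint64_t{0};
  spillActiveLanes(W);
  W.Exec = SavedExec;
}
```

In this model a lane that is inactive at the spill point keeps its value only with `spillWholeWave`, and the exec mask seen by the surrounding code is unchanged afterwards.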
Matt Arsenault
94e24624c2 AMDGPU: Remove attempt at simplifying the format string in printf lowering
This avoids computing the dominator tree by removing the
simplifyInstruction use.

This was applying simplification with some kind of questionable
load-store forwarding and looking for the global. This had to have
been an ancient hack copied from previous backends. In the OpenCL
case, this is always emitted as the required direct global reference
anyway.
2023-07-07 09:26:07 -04:00
Scott Linder
986001c827 [AMDGPU] Improve assembler + disassembler handling of kernel descriptors
* Relax the AsmParser to accept `.amdhsa_wavefront_size32 0` when the
  `.amdhsa_shared_vgpr_count` directive is present.
* Teach the KD disassembler to respect the setting of
  KERNEL_CODE_PROPERTY_ENABLE_WAVEFRONT_SIZE32 when calculating the
  value of `.amdhsa_next_free_vgpr`.
* Teach the KD disassembler to disassemble COMPUTE_PGM_RSRC3 for gfx90a
  and gfx10+.
* Include "pseudo directive" comments for gfx10 fields which are not
  controlled by any assembler directive.
* Fix disassembleObject failure diagnostic in llvm-objdump to not
  hard-code a comment string, and to follow the convention of not
  capitalizing the first sentence.

Reviewed By: rochauha

Differential Revision: https://reviews.llvm.org/D128014
2023-07-06 21:20:51 +00:00
Tom Stellard
4b36b2c23c [Support] Use C++11 attribute syntax for visibility attributes
The gnu extension __attribute syntax cannot be mixed with the
C++11 alignas specifier, so in order to use visibility attributes on
classes that also use alignas, we need to use the C++11 standard syntax.

Also fix a few warnings introduced by this change.

Reviewed By: compnerd

Differential Revision: https://reviews.llvm.org/D152043
2023-07-06 10:30:56 -07:00
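
A minimal sketch of the C++11 spelling that 4b36b2c23c refers to, assuming a GCC- or Clang-compatible compiler. `EXAMPLE_VISIBILITY` and `Widget` are hypothetical names for illustration; the actual macro definitions changed by the commit are not shown in this log.

```cpp
// Hypothetical macro: expands to the standard attribute syntax on
// compilers that understand the gnu:: attribute namespace.
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_VISIBILITY [[gnu::visibility("default")]]
#else
#define EXAMPLE_VISIBILITY
#endif

// Both [[...]] and alignas are attribute-specifiers in the grammar, so
// they can sit side by side after the class-key.
struct EXAMPLE_VISIBILITY alignas(16) Widget {
  char Bytes[16];
};

static_assert(alignof(Widget) == 16, "alignas must still take effect");
```

Because both specifiers belong to the same attribute-specifier sequence, the macro can be placed before or after `alignas` without changing the meaning.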
Matt Arsenault
9df70e4a4d AMDGPU: Fix not applying the correct default memcpy expansion threshold
Fixes 3c848194f2. The TTI hook name got
renamed at some point in the process and the target implementation was
left behind.

Fixes: SWDEV-407329
2023-07-06 12:14:14 -04:00
Matt Arsenault
c70cae6315 AMDGPU: Make SIFixVGPRCopies preserve everything
All this does is add uses of reserved registers, which
aren't tracked by anything. Saves a loop info computation.
2023-07-06 10:26:21 -04:00
Matt Arsenault
8ee1cc82c9 AMDGPU: Fold out sign bit ops on frexp_exp
The sign bit has no impact on the exponent, so strip these away. Saves
on the source modifier encoding cost. I left the GlobalISel handling
until there's a resolution to issue #62628.

We should do this in instcombine too, but legalization should be
introducing more frexps than it currently is where this would occur.
2023-07-06 10:26:21 -04:00
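
The observation behind 8ee1cc82c9 can be checked on the host with `std::frexp`, used here as an analogy for the target's frexp_exp operation rather than as a model of it. `frexpExp` is a hypothetical helper name.

```cpp
#include <cmath>

// Returns only the exponent part of std::frexp, i.e. the E such that
// X == M * 2^E with 0.5 <= |M| < 1 for finite, non-zero X.
int frexpExp(float X) {
  int Exp = 0;
  std::frexp(X, &Exp);
  return Exp;
}
```

For finite non-zero inputs, `frexpExp(X)`, `frexpExp(-X)` and `frexpExp(std::fabs(X))` agree, which is why fneg and fabs on the source can be dropped when only the exponent is wanted.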
Valery Pykhtin
98aa8439f5 [AMDGPU] Fix register class for a subreg in GCNRewritePartialRegUses.
1. Improved code that deduces register class from instruction definitions. Previously if some instruction didn't contain a reg class for an operand it was considered as no information on register class even if other instructions specified the class.

2. Added check on required size of resulting register because in some cases classes with smaller registers had been selected (for example VReg_1).

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D152832
2023-07-06 08:48:45 +02:00
Tom Stellard
62748e934c AMDGPU: Remove add_dependencies calls from CMakeLists.txt
These are redundant.  The same dependencies are being added as part
of the add_llvm_component_library() call.  I confirmed this by diff'ing
the build.ninja files before and after the change and saw no change.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D153166
2023-07-05 20:03:11 -07:00
Matt Arsenault
5491666248 AMDGPU: Correctly lower llvm.exp.f32
The library expansion has too many paths for all the permutations of
DAZ, unsafe and the 3 exp functions. It's easier to expand it in the
backend when we know all of these things. The library currently misses
the no-infinity check on the overflow, which this handles optimizing
out.

Some of the <3 x half> fast tests regress due to vector widening
dropping flags which will be fixed separately.

Apparently there is no exp10 intrinsic, but there should be. Adds some
deadish code in preparation for adding one while I'm following along
with the current library expansion.
2023-07-05 17:23:49 -04:00
Matt Arsenault
ed556a1ad5 AMDGPU: Correctly lower llvm.exp2.f32
Previously this did a fast math expansion only.
2023-07-05 17:23:48 -04:00
Matt Arsenault
9c82dc6a6b AMDGPU: Always use v_rcp_f16 and v_rsq_f16
These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.
2023-07-05 16:53:01 -04:00
Matt Arsenault
4e15f378ee AMDGPU: Correctly lower llvm.log.f32 and llvm.log10.f32
Previously we expanded these in a fast-math way and the device
libraries were relying on this behavior. The libraries have a pending
change to switch to the new target intrinsic.

Unlike the library version, this takes advantage of no-infinities on
the result overflow check.
2023-07-05 15:30:35 -04:00
Ivan Kosarev
7208fde09e [AMDGPU][AsmParser][NFC] Generate printers for named-bit operands automatically.
Part of <https://github.com/llvm/llvm-project/issues/62629>.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D154433
2023-07-05 10:53:33 +01:00
Ivan Kosarev
12460cf90f [AMDGPU][AsmParser] Simplify the implementation of SWZ operands.
Those are implicit helper operands and therefore don't need any parsers
or printers.

Part of <https://github.com/llvm/llvm-project/issues/62629>.

Reviewed By: piotr, foad

Differential Revision: https://reviews.llvm.org/D154432
2023-07-05 10:45:12 +01:00