6072 Commits

Author SHA1 Message Date
Matt Arsenault
270e96f435 Revert "AMDGPU: Invert handling of enqueued block detection"
This reverts commit 47288cc977.

The runtime is having trouble with this at -O0 when the inputs are
always enabled.
2023-01-07 21:48:07 -05:00
Matt Arsenault
47554a0c73 AMDGPU: Use more accurate IR type for block handle
The device library uses this as a struct with a pointer sized integer
and 2 ints.
2023-01-06 21:23:28 -05:00
Matt Arsenault
b7587ca837 AMDGPU: Add more opencl printf tests 2023-01-06 21:23:14 -05:00
Matt Arsenault
47288cc977 AMDGPU: Invert handling of enqueued block detection
Invert the sense of the attribute and let the attributor figure this
out like everything else. If needed we can have the not-OpenCL
languages set amdgpu-no-default-queue and amdgpu-no-completion-action
up front so they never have to pay the cost.

There are also so many of these now, the offset use API should
probably consider all of them at once. Maybe they should merge into
one attribute with used fields. Having separate functions for each
field in AMDGPUBaseInfo is also not the greatest API (might as well
fix this when the patch to get the object version from the module
lands).
2023-01-06 21:16:08 -05:00
Matt Arsenault
0416883dc1 AMDGPU: Fix enqueue block lowering for opaque pointers
This was looking for a specific constant cast of the function, when
the type doesn't matter. Doesn't bother trying to handle typed
pointers, it will just assert.

Things probably don't work completely correctly if the block kernel
address is captured somewhere else, but that wouldn't work before
either. The uses should really be loads out of the handle, and the
handle initializer should contain the kernel address.
2023-01-06 21:15:39 -05:00
Matt Arsenault
4ce5400a3f AMDGPU: Convert enqueue-kernel.ll to opaque pointers
This demonstrates the pass is broken with them, the follow up change
will fix it.
2023-01-06 21:15:39 -05:00
Matt Arsenault
8723836358 AMDGPU: Add additional printf string tests
Test various inputs passed to %s.
2023-01-06 17:22:13 -05:00
Matt Arsenault
b4d44322d9 AMDGPU/GlobalISel: Add missing test for implicit_def regbankselect 2023-01-06 08:58:10 -05:00
Matt Arsenault
6fe85933d4 AMDGPU/GlobalISel: Add wave32 checks to bool test 2023-01-06 08:58:10 -05:00
Juan Manuel MARTINEZ CAAMAÑO
543db09b97 [CodeGen][AMDGPU] EXTRACT_VECTOR_ELT: input vector element type can differ from output type
In function SITargetLowering::performExtractVectorElt,
the output type was not considered which could lead to type mismatches
later.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139943
2023-01-06 09:46:02 +01:00
Jeffrey Byrnes
33aba5d0d0 [AMDGPU] Switch to autogenerated checks 2023-01-05 16:27:18 -08:00
Vang Thao
25d72330ff [AMDGPU] Add .uniform_work_group_size metadata to v5
An AMDGPU kernel with the function attribute "uniform-work-group-size"="true"
requires a uniform work group size (i.e. each dimension of the global size is a
multiple of the corresponding dimension of the work group size).
hipExtModuleLaunchKernel allows launching a HIP kernel with a non-uniform
workgroup size, which makes it necessary for the runtime to check and enforce a
uniform workgroup size if the kernel requires it. To let the runtime enforce
that, this metadata is needed to indicate that the kernel requires a uniform
workgroup size.

Reviewed By: kzhuravl, arsenm

Differential Revision: https://reviews.llvm.org/D141012
2023-01-05 21:29:56 +00:00
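The uniformity condition described in 25d72330ff can be sketched as a small check. This is only an illustration of the arithmetic rule quoted in the commit message; the function name `is_uniform_launch` and the tuple-based launch description are assumptions made for the example, not part of any runtime API.

```python
def is_uniform_launch(global_size, work_group_size):
    """Return True when every global dimension is a multiple of the
    corresponding work-group dimension (hypothetical helper)."""
    if len(global_size) != len(work_group_size):
        raise ValueError("dimension counts differ")
    return all(
        wg > 0 and g % wg == 0
        for g, wg in zip(global_size, work_group_size)
    )


# A 1024x64x1 grid splits evenly into 256x8x1 groups; 1000x64x1 does not.
uniform = is_uniform_launch((1024, 64, 1), (256, 8, 1))
non_uniform = is_uniform_launch((1000, 64, 1), (256, 8, 1))
```

A runtime that sees the metadata could reject or adjust the second launch; how it actually reacts is not stated in the commit message.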
Alexander Timofeev
6daa983c9d [AMDGPU] MachineScheduler: schedule execution metric added for the UnclusteredHighRPStage
Since the divergence-driven ISel was fully enabled we have more VGPRs available.
MachineScheduler, trying to take advantage of that, bumps up the occupancy,
sacrificing the hiding of memory access latency. This really spoils the initially
good schedule. A new metric that reflects the latency hiding quality of the
schedule has been created to balance between occupancy and latency. The metric is
based on the latency model, which computes the bubble to working cycles ratio.
Then we use this ratio to decide if the higher occupancy schedule is profitable
as follows:

             Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D139710
2023-01-05 21:10:56 +01:00
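The profit formula quoted in 6daa983c9d can be worked through numerically. The sketch below only evaluates that formula; the helper names, the sample numbers, and the "profit above 1.0" threshold are assumptions for illustration, since the commit message does not say how the result is compared.

```python
def schedule_metric(bubble_cycles, working_cycles):
    """Bubble-to-working-cycles ratio; lower means better latency hiding."""
    if working_cycles <= 0:
        raise ValueError("working_cycles must be positive")
    return bubble_cycles / working_cycles


def occupancy_profit(old_occupancy, new_occupancy, old_metric, new_metric):
    """Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric."""
    if old_occupancy <= 0 or new_metric <= 0:
        raise ValueError("old_occupancy and new_metric must be positive")
    return (new_occupancy / old_occupancy) * (old_metric / new_metric)


# Hypothetical numbers: occupancy rises from 4 to 5 waves, but the bubble
# ratio doubles from 0.10 to 0.20, so the product falls below 1.
profit = occupancy_profit(4, 5, 0.10, 0.20)
keep_new_schedule = profit > 1.0  # assumed threshold, not from the commit
```

With these inputs the occupancy gain (1.25x) is outweighed by the worse latency hiding (0.5x), which is the trade-off the metric is meant to expose.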
Matt Arsenault
7c327c2fbb AMDGPU: Fix broken opaque pointer handling in printf pass
This was directly considering the pointee type, and also applying
special semantics to constant address space.
2023-01-05 13:48:32 -05:00
Matt Arsenault
1f93517b25 AMDGPU: Switch enqueue kernel test to generated checks 2023-01-05 11:39:23 -05:00
Matt Arsenault
7b922fc0c3 AMDGPU: Fix broken and permissive handling of printf format strings
This was completely broken with opaque pointers because it was
specifically looking for a constant expression with the global
variable as the first operand. Strip casts like normal, and properly
validate all of the restrictions rather than silently ignoring any
unhandled cases. Also be stricter that we aren't calling into some
unresolved or non-constant format string.

Also converts the test to opaque pointers and generated tests. There's
more broken initializer handling for strings inside the format string
processing too, but there's just no test coverage for this at all.
2023-01-05 09:18:00 -05:00
Nikita Popov
60442f0d44 [CodeGen] Convert some tests to opaque pointers (NFC)
These are mostly MIR tests, which I did not handle during previous
conversions.
2023-01-05 13:21:20 +01:00
Jay Foad
0d518ae50c [GlobalISel] New combine to commute constant operands to the RHS
Differential Revision: https://reviews.llvm.org/D140907
2023-01-05 11:12:40 +00:00
Diana Picus
6ee4f253b2 [GlobalISel] Add G_BUILD_VECTOR[_TRUNC] to CSE
Add G_BUILD_VECTOR and G_BUILD_VECTOR_TRUNC to the list of opcodes in
`shouldCSEOpc`. This simplifies the code generated for vector splats.

Differential Revision: https://reviews.llvm.org/D140965
2023-01-05 10:15:31 +01:00
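The effect of adding an opcode to a CSE allowlist, as 6ee4f253b2 does for `shouldCSEOpc`, can be shown with a toy model. This is not LLVM's CSE implementation: the tuple-based instruction format and the `cse` function are invented for the example, and the allowlist contains only the two opcodes named in the commit.

```python
# Toy allowlist; the real shouldCSEOpc covers many more opcodes.
CSE_OPCODES = {"G_BUILD_VECTOR", "G_BUILD_VECTOR_TRUNC"}


def cse(instructions):
    """Drop repeated (opcode, operands) pairs for allowlisted opcodes.

    Each instruction is a (dst, opcode, operands) tuple. Returns the
    surviving instructions and a map from removed dst to its replacement.
    """
    seen = {}
    replaced = {}
    kept = []
    for dst, opcode, operands in instructions:
        # Rewrite operands that referred to an already-removed value.
        operands = tuple(replaced.get(op, op) for op in operands)
        key = (opcode, operands)
        if opcode in CSE_OPCODES and key in seen:
            replaced[dst] = seen[key]
            continue
        if opcode in CSE_OPCODES:
            seen[key] = dst
        kept.append((dst, opcode, operands))
    return kept, replaced


# Two identical splats of %x collapse into one.
program = [
    ("%v0", "G_BUILD_VECTOR", ("%x", "%x")),
    ("%v1", "G_BUILD_VECTOR", ("%x", "%x")),
    ("%sum", "G_ADD", ("%v0", "%v1")),
]
kept, replaced = cse(program)
```

After the pass, `%sum` adds `%v0` to itself and the second splat is gone, which mirrors the simplification of vector splats mentioned in the commit message.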
Diana Picus
61c5775b36 [GlobalISel] Precommit a test for D140965
Add a test for CSE-ing G_BUILD_VECTOR. This will be enabled in D140965.
2023-01-05 09:59:27 +01:00
Matt Arsenault
8dfe60c356 AMDGPU: Set scratch_en if there is dynamic stack but no fixed stack 2023-01-04 20:51:18 -05:00
Anshil Gandhi
4bbcbdaee5 [AMDGPU] Unify divergent nodes if the PostDom tree has one root
This patch allows AMDGPUUnifyDivergenceExitNodes pass
to transform a function whose PDT has exactly one root
and ends in a branch instruction. Fixes
https://github.com/llvm/llvm-project/issues/58861.

Reviewed By: ruiling, arsenm

Differential Revision: https://reviews.llvm.org/D139780
2023-01-04 10:45:03 -07:00
Matt Arsenault
687e0e205e AMDGPU: Create alloca wide load/store with explicit alignment
This was introducing transient UB by using the default alignment of a
larger vector type.
2023-01-03 11:29:18 -05:00
Matt Arsenault
6fed2c90d3 AMDGPU: Diagnose which LDS global failed to lower
Also lowercase the message to start since that seems to be the
prevailing convention for error messages.
2023-01-03 09:31:07 -05:00
Dmitry Preobrazhensky
e7a306310b [AMDGPU][GFX11] Correct tied src2 of v_fmac_f16_e64
src2 was incorrectly defined as VSrc_f16 but it is tied to dst which is VGPR_32. As a result, disassembler failed to decode src2.

Differential Revision: https://reviews.llvm.org/D140299
2022-12-30 16:42:15 +03:00
Matt Arsenault
e630d9b299 AMDGPU/clang: Remove target features from address space test builtins
It turns out we can codegen these on targets without flat addressing,
although the runtime probably didn't put anything useful there. The
proper diagnostic would be to disallow flat pointer uses or languages
with them, not this one edge case. Allows removing one of the special
cases requiring subtarget support in the device libraries.
2022-12-29 18:46:41 -05:00
Craig Topper
8abd70081f [TargetLowering] Teach BuildUDIV to take advantage of leading zeros in the dividend.
If the dividend has leading zeros, we can use them to reduce the
size of the multiplier and avoid the fixup cases.

This patch is for scalars only, but we might be able to do this
for vectors in a follow up.

Differential Revision: https://reviews.llvm.org/D140750
2022-12-29 13:58:46 -08:00
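Why known leading zeros help, as described in 8abd70081f, can be seen with the textbook round-up method for unsigned division by a constant. The sketch below is not LLVM's BuildUDIV code; `magic_for_udiv` and `udiv_by_const` are hypothetical helpers that compute a multiplier of the form ceil(2^(N+l)/d), where N is the number of significant dividend bits and l = ceil(log2 d).

```python
def magic_for_udiv(divisor, dividend_bits):
    """Return (multiplier, shift) so that (x * multiplier) >> shift == x // divisor
    for every 0 <= x < 2**dividend_bits (round-up method)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    log2_ceil = (divisor - 1).bit_length()
    shift = dividend_bits + log2_ceil
    multiplier = -(-(1 << shift) // divisor)  # ceiling division
    return multiplier, shift


def udiv_by_const(x, multiplier, shift):
    """Divide by the constant using one multiply and one shift."""
    return (x * multiplier) >> shift


# Dividing a full 32-bit value by 7 needs a 33-bit multiplier, which does
# not fit a 32-bit register and is what forces a fixup sequence.
full_mult, full_shift = magic_for_udiv(7, 32)

# If the top 16 bits of the dividend are known to be zero, only 16
# significant bits remain and the multiplier shrinks to 17 bits.
small_mult, small_shift = magic_for_udiv(7, 16)
```

Here `full_mult.bit_length()` is 33 while `small_mult.bit_length()` is 17, so the narrower dividend lets the multiplier fit comfortably in 32 bits.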
Matt Arsenault
52c44a441c AMDGPU: Modernize sqrt f64 test
Use the readfirstlane hack for the scalar cases to combine
globalisel and sdag tests. gfx6 stores are a bit broken
in globalisel, and scalar returns are totally broken in sdag.
2022-12-22 13:01:41 -05:00
Matt Arsenault
5da812461a AMDGPU: Update constant address spaces used in printf test
This was never updated for the address space number shuffle.
2022-12-22 12:38:59 -05:00
Jay Foad
7e1e993816 [AMDGPU] Remove permlane discard vdst_in optimization from isel
D72845 implemented the equivalent IR optimization in InstCombine so it
seems that there's no advantage to doing it during isel too.

This partially reverts D72844.

Differential Revision: https://reviews.llvm.org/D140546
2022-12-22 15:49:26 +00:00
Yashwant Singh
9e0d8ab822 [AMDGPU][Test] Update perfhint test to use opaque pointers
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D140452
2022-12-22 09:49:03 +05:30
Mirko Brkusanin
a80edb7fc9 [AMDGPU][GlobalISel] Fix mapping G_FREEZE
Differential Revision: https://reviews.llvm.org/D140416
2022-12-21 15:25:04 +01:00
Christudasan Devadasan
a3028239a7 Revert "[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs"
This reverts commit 40ba0942e2.
2022-12-21 16:17:42 +05:30
Jay Foad
e73b35699b [SelectionDAG] Fix EmitCopyFromReg for cloned nodes
Change EmitCopyFromReg to check all users of cloned nodes (as well as
non-cloned nodes) instead of assuming that they all copy the defined
value back to the same physical register.

This partially reverts 968e2e7b3d (svn r62356) which claimed:

  CreateVirtualRegisters does trivial copy coalescing. If a node def is
  used by a single CopyToReg, it reuses the virtual register assigned to
  the CopyToReg. This won't work for SDNode that is a clone or is itself
  cloned. Disable this optimization for those nodes or it can end up
  with non-SSA machine instructions.

This is true for CreateVirtualRegisters but r62356 also updated
EmitCopyFromReg where it is not true. Firstly EmitCopyFromReg only
coalesces physical register copies, so the concern about SSA form does
not apply. Secondly making the loop over users in EmitCopyFromReg
conditional on `!IsClone && !IsCloned` breaks the handling of cloned
nodes, because it leaves MatchReg set to true by default, so it assumes
that all users will copy the defined value back to the same physical
register instead of actually checking.

Differential Revision: https://reviews.llvm.org/D140417
2022-12-21 10:44:45 +00:00
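The control-flow problem described in e73b35699b, a flag that defaults to true and is only re-checked inside a conditionally skipped loop, can be reduced to a toy model. This is not the SelectionDAG code: users are modeled simply as the physical register each one copies the value to, and both function names are invented for the example.

```python
def match_reg_before_fix(user_dest_regs, src_reg, is_clone, is_cloned):
    """Toy version of the old logic: the loop over users is skipped for
    clone/cloned nodes, so match_reg keeps its default of True."""
    match_reg = True
    if not is_clone and not is_cloned:
        for dest in user_dest_regs:
            if dest != src_reg:
                match_reg = False
    return match_reg


def match_reg_after_fix(user_dest_regs, src_reg):
    """Toy version of the fixed logic: always inspect every user."""
    return all(dest == src_reg for dest in user_dest_regs)


# A cloned node whose only user copies to a different physical register.
wrong = match_reg_before_fix(["r1"], "r0", is_clone=True, is_cloned=False)
right = match_reg_after_fix(["r1"], "r0")
```

The old path reports a match without looking at the user, while the unconditional check correctly reports a mismatch.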
Jay Foad
087cd5e5d1 [SelectionDAG] Precommit EmitCopyFromReg test for D140417 2022-12-21 10:44:45 +00:00
Leon Clark
daa022ca57 Enable roundeven.
Add support for roundeven and implement appropriate tests.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D137954
2022-12-20 15:40:20 +00:00
Jessica Del
5ee13e6c65 [AMDGPU] Wide multiplies tests for D140208
These tests show suboptimal code generation that will
be improved by the changes in D140208
2022-12-20 12:08:36 +01:00
Matt Arsenault
0dc4bdd888 GlobalISel: Enable CSE of G_SELECT
Stop trying to delete a select in one combine since it would
be deleting the CSE'd instruction if that happened.
2022-12-19 21:26:47 -05:00
Matt Arsenault
a20503caa1 AMDGPU: Add regression tests for fmin/fmax legacy matching 2022-12-19 11:36:13 -05:00
Matt Arsenault
c60e67b1f9 AMDGPU: Add more fneg combine tests 2022-12-19 10:34:55 -05:00
Matt Arsenault
7a3682f666 AMDGPU: Convert a few more special case tests to opaque pointers
lower-kernargs.ll needed a switch to use update_test_checks metadata
matching.
2022-12-19 09:42:42 -05:00
Matt Arsenault
262c2c0fd2 AMDGPU: Update some tests to use opaque pointers
vectorize-buffer-fat-pointer.ll required a manual check line fix.
vector-alloca-addrspacecast.ll required a manual fixup of a check
line. partial-regcopy-and-spill-missed-at-regalloc.ll required
re-running update_mir_test_checks. The HSA metadata tests required
avoiding the script touching the type name in the metadata.

annotate-noclobber.ll ran into one update script bug. It deleted a
check line with a 0 offset GEP, moving the following -NEXT check
logically up one line.
2022-12-19 09:28:58 -05:00
Matt Arsenault
04bd576f89 AMDGPU: Convert some amdgpu-codegenprepare tests to opaque pointers
amdgpu-late-codegenprepare.ll required running update_test_checks
after converting.
2022-12-19 09:28:58 -05:00
Matt Arsenault
ce096b2207 AMDGPU: Convert some tests to opaque pointers
These required update_mir_test_checks.
2022-12-19 09:04:17 -05:00
Nikita Popov
bdf2fbba9c [AMDGPU] Convert some tests to opaque pointers (NFC) 2022-12-19 12:41:13 +01:00
Matt Arsenault
012a85296b AMDGPU/GlobalISel: Use ptrtoint to legalize constant 32-bit addrspacecast
This was trying to merge 2 32-bit pointers into a 64-bit pointer. The
artifact combiner was assuming merges to pointers use scalar sources,
and ended up inserting invalid bitcast from a pointer to a scalar. It
should probably be a verifier error to have pointer merge sources with
a pointer result.

Fixes verifier errors with EXPENSIVE_CHECKS.
2022-12-18 13:15:58 -05:00
Matt Arsenault
1706960894 AMDGPU/R600: Special case addrspacecast lowering for null
Due to poor support for non-0 null pointers, clang always emits
addrspacecast from a null flat constant for private/local null. We can
trivially handle this case for old hardware.

Should fix issue 55679.
2022-12-18 08:02:45 -05:00
Matt Arsenault
9d6003c764 AMDGPU: Lower addrspacecast on gfx6
Fixes inconsistent handling of constant-32bit case. Turns out we can
lower all the casts just fine, it's just accessing the flat results
that's a problem.
2022-12-18 08:02:45 -05:00
Sameer Sahasrabuddhe
9c1b82599d [AAPointerInfo] handle multiple offsets in PHI
Previously reverted in 8b446ea2ba

Reapplying because this commit is NOT DEPENDENT on the reverted commit
fc21f2d7ba, which broke the ASAN buildbot.
See https://reviews.llvm.org/rGfc21f2d7bae2e0be630470cc7ca9323ed5859892 for
more information.

The arguments to a PHI may represent a recurrence by eventually using the output
of the PHI itself. This is now handled by checking for cycles in the control
flow. If a PHI is not in a recurrence, it is now able to report multiple offsets
instead of conservatively reporting unknown.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D138991
2022-12-18 10:51:20 +05:30
Christudasan Devadasan
40ba0942e2 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions, and SGPR spilling can
fail when there are not enough VGPRs, forcing us to
spill the leftover into memory during PEI. The custom
spill handling during PEI has many edge cases and
breaks the compiler from time to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124196
2022-12-17 11:56:32 +05:30