Commit Graph

4971 Commits

Author SHA1 Message Date
Jon Chesterfield
30b29db7c7 [amdgpu] Don't crash on empty global ctor/dtor
Global ctor/dtor can be an empty array, which is a Constant not a
ConstantArray. The cast<ConstantArray> therefore asserts / crashes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D113800
2021-11-16 14:36:08 +00:00
Matt Arsenault
659887b405 AMDGPU: Mark prolog/epilog SCC defs as dead
A future change will add SCC liveness checks. Since we are still
relying on forward register scavenging, add dead flags to avoid
spuriously detecting SCC as live.
2021-11-15 21:35:06 -05:00
Matt Arsenault
e6bfbd7e0d AMDGPU: Regenerate test checks 2021-11-15 21:35:06 -05:00
Dmitry Preobrazhensky
91f4650ebb [AMDGPU][MC][GFX10] Corrected global_atomic_fcmpswap*
Corrected src data size of global_atomic_fcmpswap and global_atomic_fcmpswap_x2 opcodes.

Differential Revision: https://reviews.llvm.org/D113746
2021-11-15 12:51:12 +03:00
Matt Arsenault
54172326e0 AMDGPU: Regenerate test checks
Regenerate with -NEXT checks to make a future diff clearer.
2021-11-13 11:35:35 -05:00
Simon Pilgrim
3170670541 [AMDGPU] Regenerate udiv.ll tests 2021-11-12 17:57:40 +00:00
Jay Foad
a70bbb5f7a [AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679
2021-11-12 15:48:41 +00:00
Jay Foad
8313b47a58 [AMDGPU] Regenerate some div/rem test checks 2021-11-11 15:26:22 +00:00
Jay Foad
9ba73b6099 [AMDGPU] Fix line endings 2021-11-11 15:18:22 +00:00
Jay Foad
491beae71d [TwoAddressInstruction] Update LiveIntervals after rewriting INSERT_SUBREG to COPY
Also add subranges to an existing live interval when introducing a new
subreg def.

Differential Revision: https://reviews.llvm.org/D113044
2021-11-11 12:24:59 +00:00
Jay Foad
6abbc3a420 [LiveIntervals] Update subranges in processTiedPairs
In TwoAddressInstructionPass::processTiedPairs when updating live
intervals after moving the last use of RegB back to the newly inserted
copy, update any affected subranges as well as the main range.

Differential Revision: https://reviews.llvm.org/D110411
2021-11-11 12:24:59 +00:00
kpyzhov
c9690092c8 [AMDGPU] Small correction in SITargetLowering::performOrCombine().
Differential Revision: https://reviews.llvm.org/D113203
2021-11-10 21:07:27 -05:00
Stanislav Mekhanoshin
476ab0f809 [AMDGPU] Fixed stack pointer init with architected flat scratch
Even if wave offset is not present we still need to do the rest
of the initialization. The mov into s32 was missing in the
kernels.

Fixes: SWDEV-310935

Differential Revision: https://reviews.llvm.org/D113628
2021-11-10 17:18:38 -08:00
Matt Arsenault
c7a0c2d0f7 AMDGPU: Report large stack usage for recursive calls
We were previously setting an ignored bit in the kernel headers. The
current behavior is to add the large amount on top of the statically
known size of a single stack frame. I'm not sure if we should just use
the large size as the entire reported size instead.
2021-11-10 20:02:01 -05:00
Yaxun (Sam) Liu
4b3881e9f3 Emit hidden hostcall argument for sanitized kernels
this patch - https://reviews.llvm.org/D110337 changes the way how hostcall
hidden argument is emitted for printf, but the sanitized kernels also use
hostcall buffer to report a error for invalid memory access, which is not
handled by the above patch and it leads to vdi runtime error:

Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT:
Agent attempted to access an inaccessible address. code: 0x2b

Patch by: Praveen Velliengiri

Reviewed by: Yaxun Liu, Matt Arsenault

Differential Revision: https://reviews.llvm.org/D112820
2021-11-10 17:05:57 -05:00
Matt Arsenault
90ff148719 AMDGPU: Account for implicit argument alignment for kernarg segment
If a kernel had no formal arguments but did have the implicit
arguments, we were reporting a required kernarg alignment of 4. For
some reason we require an 8-byte alignment for this, even though
there's no real advantage and I don't see where this is documented in
the ABI.

The code object header code also claims the minimum alignment is 16,
which is what I thought you always got at runtime anyway so I don't
know why this matters.
2021-11-09 17:48:37 -05:00
Matt Arsenault
62ffcc5f37 AMDGPU: Regenerate test checks
Update these to include -NEXT to avoid spurious changes in a future
commit.
2021-11-09 15:28:59 -05:00
Thomas Symalla
76cbe62262 [AMDGPU] Changes the AMDGPU_Gfx calling convention by making the SGPRs 4..29 callee-save. This is to avoid superfluous s_movs when executing amdgpu_gfx function calls as the callee is likely not going to change the argument values.
This patch changes the AMDGPU_Gfx calling convention. It defines the SGPR registers s[4:29] as callee-save and leaves some SGPRs usable for callers. The intention is to avoid unneccessary s_mov instructions for arguments the caller would otherwise save and restore in these registers.

Reviewed By: sebastian-ne

Differential Revision: https://reviews.llvm.org/D111637
2021-11-04 21:50:18 +01:00
Simon Pilgrim
53becf5df2 [AMDGPU] Regenerate shift-and-i128-ubfe.ll test checks 2021-11-04 14:27:30 +00:00
RamNalamothu
539f500e78 [AMDGPU] Do not add debug locations to the code inside prologue
There is no real source location for code inside prologue as it is
generated by compiler but source locations are being added to code
inside prologue as a side effect of https://reviews.llvm.org/D99269
because buildSpillLoadStore() is using source location of the real
instruction in the basic block if any.

Fixes: SWDEV-307590

Reviewed By: scott.linder, sebastian-ne

Differential Revision: https://reviews.llvm.org/D113100
2021-11-04 08:02:41 +05:30
alex-t
0a3d755ee9 [AMDGPU] Enable divergence-driven BFE selection
Detailed description: This change enables the bit field extract patterns
selection to s_bfe_u32 or v_bfe_u32 dependent on the pattern root node
divergence.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D110950
2021-11-03 23:26:59 +03:00
Abinav Puthan Purayil
fbe61fb0aa [AMDGPU] Fix SGPR checks in S_MOV_B64_IMM_PSEUDO generation.
The function to generate S_MOV_B64_IMM_PSEUDO was recently modified to
optimize AGPR to AGPR copy but it missed checking for the SGPR
clobbering for the S_MOV_B64_IMM_PSEUDO generation.

Differential Revision: https://reviews.llvm.org/D113005
2021-11-03 09:09:24 +05:30
Jay Foad
fce5a567c6 [AMDGPU] More robust checks in extract_vector_dynelt.ll 2021-11-02 13:26:31 +00:00
hsmahesha
e9ea992496 [IR] Replace *all* uses of a constant expression by corresponding instruction
When a constant expression CE is being converted into a corresponding instruction I,
CE is supposed to be replaced by I. However, it is possible that CE is being used multiple
times within a parent instruction PI. Make sure that *all* the uses of CE within PI are
replaced by I.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D112717
2021-11-02 10:01:46 +05:30
Jay Foad
b85995f6c4 [AMDGPU] Add tests for legacy multiply-add with immediate 2021-11-01 14:24:13 +00:00
Jay Foad
2b548b18c1 [AMDGPU] Shrink v_mac_legacy_f32 and v_fmac_legacy_f32
Differential Revision: https://reviews.llvm.org/D112917
2021-11-01 13:55:53 +00:00
Christudasan Devadasan
aa2d3b59ce GlobalISel/Utils: Use incoming regbank while constraining the superclasses
Register operands with superclasses can possibly have multiple regBanks
if they have different register types. The regBank ambiguity resolved
during regbankselect should be used to constrain the operand regclass
instead of obtaining one from the MCInstrDesc.

This is a prerequisite patch for D109300 that introduces allocatable AV_*
Superclasses for AMDGPU by combining both VGPRs and AGPRs and we want to
restrain the regclass to either A or V based on the incoming regbank.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D112323
2021-10-30 07:20:45 -04:00
Stanislav Mekhanoshin
e5340ed30c [AMDGPU] Fix global isel for kernels using agprs on gfx90a
With Global ISel getReservedRegs() is called before function is
regbank selected for the first time. Defer caching of usesAGPRs()
in this case.

Differential Revision: https://reviews.llvm.org/D112644
2021-10-29 14:23:14 -07:00
Matt Arsenault
52fc2edb53 AMDGPU: Check kernarg alignments in test
Strangely the kernel code object header clamps the value to a minimum
of 16, but the emitted metadata only clamps to a minimum of 4.
2021-10-29 12:42:36 -04:00
Neubauer, Sebastian
c78640ee6a [TailDuplicator] Fix merging block with terminator
The TailDuplicator merged two blocks, even if the first one ended with
a terminator, resulting in invalid MIR, where a terminator is in the
middle of a block.

Abort merging if the first block ends with a terminator.

Differential Revision: https://reviews.llvm.org/D112226
2021-10-29 10:52:46 +02:00
Vang Thao
52b43d1549 [AMDGPU] Fix cvt_f32_ubyte combine with shl
Shift node is still needed to check if the shift is shr or shl to increment/decrement offset. Do not override the node.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D112733
2021-10-28 21:43:06 -07:00
Abinav Puthan Purayil
2da6ef3664 [AMDGPU] Add 24-bit mulhi intrinsics in INTRINSIC_WO_CHAIN combine.
mul24 intrinsic's operands are simplified by
AMDGPUTargetLowering::performIntrinsicWOChainCombine(). This change adds
the mul24hi intrinsics in the combine since its operands can be
simplified like that of the mul24 intrinsics.

Differential Revision: https://reviews.llvm.org/D112702
2021-10-28 16:57:48 +05:30
Abinav Puthan Purayil
9f8e779b42 [AMDGPU] Fix rhs of the tests in amdgpu-codegenprepare-mul24.ll.
Differential Revision: https://reviews.llvm.org/D112685
2021-10-28 16:57:48 +05:30
Jay Foad
c6b4fb87c0 [AMDGPU] Add gfx10 uaddsat test coverage. NFC. 2021-10-28 10:24:12 +01:00
Sebastian Neubauer
fd1cfc9094 [AMDGPU][GlobalISel] Fix waterfall loops
- Move the `s_and exec` to its correct position before the content of
  the waterfall loop
- Use the SI_WATERFALL pseudo instruction, like for sdag, to benefit
  from optimizations
- Add support for indirect function calls

To support indirect calls, add a G_SI_CALL instruction without register
class restrictions and insert a waterfall loop when applying register
banks.

Differential Revision: https://reviews.llvm.org/D109052
2021-10-28 10:30:55 +02:00
Abinav Puthan Purayil
fa592180b3 [AMDGPU] Add more llc tests for 48-bit mul generation.
Differential Revision: https://reviews.llvm.org/D112554
2021-10-28 08:10:04 +05:30
Michael Liao
e6a4ba3aa6 [amdgpu] Handle the case where there is no scavenged register.
- When an unconditional branch is expanded into an indirect branch, if
  there is no scavenged register, an SGPR pair needs spilling to enable
  the destination PC calculation. In addition, before jumping into the
  destination, that clobbered SGPR pair need restoring.
- As SGPR cannot be spilled to or restored from memory directly, the
  spilling/restoring of that SGPR pair reuses the regular SGPR spilling
  support but without spilling it into memory. As that spilling and
  restoring points are fully controlled, we only need to spill that SGPR
  into the temporary VGPR, which needs spilling into its emergency slot.
- The target-specific hook is revised to take additional restore block,
  where the restoring code is filled. After that, the relaxation will
  place that restore block directly before the destination block and
  insert an unconditional branch in any fall-through block into the
  destination block.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D106449
2021-10-27 18:37:27 -04:00
Austin Kerbow
02e60f2e77 [AMDGPU] Use max waves for scheduler's initial occupancy target
The scheduler should set critical/excess register usage thresholds that
are guided by the maximum possible occupancy for the function. This
change is focused on setting proper lower bounds on register usage which
we would typically only see when a specific number of maximum waves is
requested with the "waves-per-eu" attribute, or by setting
"amdgpu-num-vgpr|sgpr" directly. This was broken previously. I have a
follow-on patch that will address issues with the scheduler not
targeting correct upper bounds on register usage which is typical with
launch bounds and min "waves-per-eu".

Changes by this patch:

Set the initial critical register usage thresholds to minimum values
that are determined by the maximum possible occupancy for the function,
or the number of allocatable registers, whichever is lower.

Avoid unisgned overflow if register limits are lower than the register
tracking "ErrorMargin", I.e. when using stress-regalloc=2.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D112373
2021-10-26 15:30:26 -07:00
Abinav Puthan Purayil
61e3b9fefe [AMDGPU] Add constrained shift pattern matches.
The motivation for this is due to clang's conformance to
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#operators-shift
which makes clang emit (<shift> a, (and b, <width> - 1)) for `a <shift> b`
in OpenCL where a is an int of bit width <width>.

Differential revision: https://reviews.llvm.org/D110231
2021-10-26 19:07:19 +05:30
Abinav Puthan Purayil
781dd39b7b [AMDGPU] Enable 48-bit mul in AMDGPUCodeGenPrepare.
We were bailing out of creating 24-bit muls for results wider than 32
bits in AMDGPUCodeGenPrepare. With the 24-bit mulhi intrinsic, this
change teaches AMDGPUCodeGenPrepare to generate the 48-bit mul
correctly.

Differential Revision: https://reviews.llvm.org/D112395
2021-10-26 18:53:07 +05:30
Abinav Puthan Purayil
9bd5cfeb1f [AMDGPU] Implement llvm.amdgcn.mulhi.[i,u]24 intrinsics.
These intrinsics maps to the 24-bit v_mul_hi instructions.

This change also fixes an incorrect assumption on the associativity of
24-bit mulhi in its SDNode record in tblgen.

Differential Revision: https://reviews.llvm.org/D112394
2021-10-26 18:53:07 +05:30
Neubauer, Sebastian
487f15603e [AMDGPU] Fix setcc combine for i128
The combine asserted if constants could not be represented as uint64_t.
Use APInts to fix this.

Differential Revision: https://reviews.llvm.org/D112416
2021-10-26 13:39:50 +02:00
Sanjay Patel
6e46b66e2a [DAGCombiner] make matching bit-hack form of usubsat more flexible
(i8 X ^ 128) & (i8 X s>> 7) --> usubsat X, 128

As suggested in D112085, we can substitute 'xor' with 'add'
in this pattern, and it is logically equivalent:
https://alive2.llvm.org/ce/z/eJtWWC

We canonicalize to 'xor' in IR, but SDAG does not do that
(and it probably should not - https://llvm.org/PR52267 ), so
it is possible to see either pattern in codegen. Note that
'sub' is a another potential pattern, but that is
canonicalized to 'add' in DAGCombiner, so we don't need to
worry about that variation.

Differential Revision: https://reviews.llvm.org/D112377
2021-10-25 09:01:52 -04:00
Thomas Symalla
f0331100f7 [AMDGPU] Regenerate some tests with the current version of update_mir_test_checks.py 2021-10-25 14:42:13 +02:00
Sanjay Patel
d34cad3196 [AMDGPU] add tests for alternate form of usubsat; NFC 2021-10-24 07:52:07 -04:00
Matt Arsenault
ec57b37551 AMDGPU: Use attributor to propagate amdgpu-flat-work-group-size
This can merge the acceptable ranges based on the call graph, rather
than the simple application of the attribute. Remove the handling from
the old pass.
2021-10-22 16:23:50 -04:00
Matt Arsenault
8d4b74ac3f AMDGPU: Don't consider whether amdgpu-flat-work-group-size was set
It should be semantically identical if it was set to the same value as
the default. Also improve the documentation.
2021-10-22 16:23:50 -04:00
Matt Arsenault
7d962f9ca3 AMDGPU: Regenerate MIR test checks
Recently this started using -NEXT checks, so regenerate these to avoid
extra test churn in a future change.
2021-10-22 15:36:50 -04:00
Matt Arsenault
ae698f89b8 AMDGPU: Fix hardcoded registers in tests 2021-10-22 15:36:50 -04:00
Jay Foad
58e7ec471c [AMDGPU] Run SIShrinkInstructions before post-RA scheduling
Run post-RA SIShrinkInstructions just before post-RA scheduling, instead
of afterwards. After the fixes in D112305 and D112317 this seems to make
no difference, but it paves the way for scheduler tweaks that are
sensitive to the e32 vs e64 encoding of VALU instructions.

Differential Revision: https://reviews.llvm.org/D112341
2021-10-22 20:24:03 +01:00