Commit Graph

653 Commits

Author SHA1 Message Date
Ahmed Bougacha
128f8732a5 [CodeGen] Add getBuildVector and getSplatBuildVector helpers. NFCI.
Differential Revision: http://reviews.llvm.org/D17176

llvm-svn: 267606
2016-04-26 21:15:30 +00:00
Konstantin Zhuravlyov
71515e57f9 [AMDGPU] Move reserved vgpr count for trap handler usage to SIMachineFunctionInfo + minor commenting changes
Differential Revision: http://reviews.llvm.org/D19537

llvm-svn: 267573
2016-04-26 17:24:40 +00:00
Konstantin Zhuravlyov
1d99c4d03c [AMDGPU] Reserve VGPRs for trap handler usage if instructed
Differential Revision: http://reviews.llvm.org/D19235

llvm-svn: 267563
2016-04-26 15:43:14 +00:00
Sam Kolton
3025e7f25f [AMDGPU] Assembler: basic support for SDWA instructions
Support for SDWA instructions for VOP1 and VOP2 encoding.
Not done yet:
  - converters for support optional operands and modifiers
  - VOPC
  - sext() modifier
  - intrinsics
  - VOP2b (see vop_dpp.s)
  - V_MAC_F32 (see vop_dpp.s)

Differential Revision: http://reviews.llvm.org/D19360

llvm-svn: 267553
2016-04-26 13:33:56 +00:00
Andrew Kaylor
7de74af929 Add optimization bisect opt-in calls for AMDGPU passes
Differential Revision: http://reviews.llvm.org/D19450

llvm-svn: 267485
2016-04-25 22:23:44 +00:00
Matt Arsenault
074ea2851c AMDGPU/SI: Optimize adjacent s_nop instructions
Use the operand for how long to wait. This is somewhat
distasteful, since it would be better to just emit s_nop
with the right argument in the first place. This would require
changing TII::insertNoop to emit N operands, which would be easy.
Slightly more problematic is the post-RA scheduler and hazard recognizer
represent nops as a single null node, and would require inventing
another way of representing N nops.

llvm-svn: 267456
2016-04-25 19:53:22 +00:00
Matt Arsenault
99c14524ec AMDGPU: Implement addrspacecast
llvm-svn: 267452
2016-04-25 19:27:24 +00:00
Matt Arsenault
48ab526f12 AMDGPU: Add queue ptr intrinsic
llvm-svn: 267451
2016-04-25 19:27:18 +00:00
Matt Arsenault
dfaf4261ab AMDGPU: Add DAG to debug dump
Also reorder case to match enum order

llvm-svn: 267449
2016-04-25 19:27:09 +00:00
Etienne Bergeron
06c14ec31e Fix incorrect redundant expression in target AMDGPU.
Summary:
The expression is detected as a redundant expression.
Turn out, this is probably a bug.

```
/home/etienneb/llvm/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:306:26: warning: both side of operator are equivalent [misc-redundant-expression]
  if (isSMRD(*FirstLdSt) && isSMRD(*FirstLdSt)) {
```

Reviewers: rnk, tstellarAMD

Subscribers: arsenm, cfe-commits

Differential Revision: http://reviews.llvm.org/D19460

llvm-svn: 267415
2016-04-25 15:06:33 +00:00
Artem Tamazov
d6468666b5 [AMDGPU][llvm-mc] s_getreg/setreg* - Add hwreg(...) syntax.
Added hwreg(reg[,offset,width]) syntax.
Default offset = 0, default width = 32.
Possibility to specify 16-bit immediate kept.
Added out-of-range checks.
Disassembling is always to hwreg(...) format.
Tests updated/added.

Differential Revision: http://reviews.llvm.org/D19329

llvm-svn: 267410
2016-04-25 14:13:51 +00:00
Craig Topper
855d182656 Fix a couple assertions that can never fire because they just contained the text string which always evaluates to true. Add a ! so they'll evaluate to false.
llvm-svn: 267312
2016-04-24 02:01:25 +00:00
Matt Arsenault
7e8de01f84 AMDGPU: sext_inreg (srl x, K), vt -> bfe x, K, vt.Size
llvm-svn: 267244
2016-04-22 22:59:16 +00:00
Matt Arsenault
efa3fe14d1 AMDGPU: Re-visit nodes in performAndCombine
This fixes test regressions when i64 loads/stores are made promote.

llvm-svn: 267240
2016-04-22 22:48:38 +00:00
Konstantin Zhuravlyov
a40d8358e7 [AMDGPU] Insert nop pass: take care of outstanding feedback
- Switch few loops to range-based for loops
- Fix nop insertion at the end of BB
- Fix formatting
- Check for endpgm

Differential Revision: http://reviews.llvm.org/D19380

llvm-svn: 267167
2016-04-22 17:04:51 +00:00
Nicolai Haehnle
b0c9748709 AMDGPU/SI: add llvm.amdgcn.ps.live intrinsic
Summary:
This intrinsic returns true if the current thread belongs to a live pixel
and false if it belongs to a pixel that we are executing only for derivative
computation. It will be used by Mesa to implement gl_HelperInvocation.

Note that for pixels that are killed during the shader, this implementation
also returns true, but it doesn't matter because those pixels are always
disabled in the EXEC mask.

This unearthed a corner case in the instruction verifier, which complained
about a v_cndmask 0, 1, exec, exec<imp-use> instruction. That's stupid but
correct code, so make the verifier accept it as such.

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19191

llvm-svn: 267102
2016-04-22 04:04:08 +00:00
Matt Arsenault
98f8394e7c AMDGPU: Fix debug name of pass to better match
I get this wrong every time I try to debug this.

llvm-svn: 267030
2016-04-21 18:21:54 +00:00
Nicolai Haehnle
97788020c5 Split IntrReadArgMem into IntrReadMem and IntrArgMemOnly
Summary:
IntrReadWriteArgMem simply becomes IntrArgMemOnly.

So there are fewer intrinsic properties that express their orthogonality
better, and correspond more closely to the corresponding IR attributes.

Suggested by: Philip Reames

Reviewers: joker.eph, reames, tstellarAMD

Subscribers: jholewinski, arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19291

llvm-svn: 267021
2016-04-21 17:48:02 +00:00
Sam Kolton
201398e8a3 [AMDGPU] Assembler: prevent parseDPPCtrlOps from eating invalid tokens
Reviewers: nhaustov, tstellarAMD

Subscribers: arsenm

Differential Revision: http://reviews.llvm.org/D19317

llvm-svn: 266984
2016-04-21 13:14:24 +00:00
Nikolay Haustov
fb5c307ccd AMDGPU/SI: Assembler: improvements to support trap handlers.
Add ParseAMDGPURegister which can be invoked recursively for parsing lists.
Rename getRegForName to getSpecialRegForName.
Support legacy SP3 register list syntax: [s2,s3,s4,s5] or [flat_scratch_lo,flat_scratch_hi].
Add 64-bit registers TBA, TMA where missing.
Add some tests.

Differential Revision: http://reviews.llvm.org/D19163

llvm-svn: 266865
2016-04-20 09:34:48 +00:00
Nicolai Haehnle
b48275f134 Add IntrWrite[Arg]Mem intrinsic property
Summary:
This property is used to mark an intrinsic that only writes to memory, but
neither reads from memory nor has other side effects.

An example where this is useful is the llvm.amdgcn.buffer.store.format.*
intrinsic, which corresponds to a store instruction that goes through a special
buffer descriptor rather than through a plain pointer.

With this property, the intrinsic should still be handled as having side
effects at the LLVM IR level, but machine scheduling can make smarter
decisions.

Reviewers: tstellarAMD, arsenm, joker.eph, reames

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18291

llvm-svn: 266826
2016-04-19 21:58:33 +00:00
Nicolai Haehnle
e2dda4f750 AMDGPU: Guard VOPC instructions against incorrect commute
Summary:
The added testcase, which triggered this, was derived from a shader-db case
via bugpoint. A separate question is why scalar branching wasn't used.

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19208

llvm-svn: 266825
2016-04-19 21:58:22 +00:00
Nicolai Haehnle
7483937bf0 AMDGPU/SI: SGPR accounting in getSIProgramInfo must ignore exec_lo/hi
Summary:
A shader stored the live mask (initial exec mask) in an SGPR which was then
spilled during register allocation. The allocator quite reasonably
optimized turned the spill into

  v_writelane_b32 %vgpr, exec_lo, N
  v_writelane_b32 %vgpr, exec_hi, N+1

at the beginning of the shader, confusing the SGPR accounting.

No test case, because si-sgpr-spill.ll together with an upcoming patch for
WQM handling exhibits the problem.

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19199

llvm-svn: 266824
2016-04-19 21:58:17 +00:00
Konstantin Zhuravlyov
8c273ad719 [AMDGPU] Add insert nops pass based on subtarget features instead of cl::opt
Also,
- Skip pass if machine module does not have debug info
- Minor comment changes
- Added test

Differential Revision: http://reviews.llvm.org/D19079

llvm-svn: 266626
2016-04-18 16:28:23 +00:00
Artem Tamazov
e2762423c2 [AMDGPU][llvm-mc] s_setreg* - Fix order of operands
Order should match the sp3 syntax, where destination (simm16 denoting the hwreg) is coming first.

Differential Revision: http://reviews.llvm.org/D19161

llvm-svn: 266617
2016-04-18 14:54:26 +00:00
Aaron Ballman
2eeefe8ed8 Silence some "initialized but unused" warnings from MSVC -- the function being called is a static function, so there's no need for an instance variable. NFC.
llvm-svn: 266616
2016-04-18 14:47:19 +00:00
Mehdi Amini
b550cb1750 [NFC] Header cleanup
Removed some unused headers, replaced some headers with forward class declarations.

Found using simple scripts like this one:
clear && ack --cpp -l '#include "llvm/ADT/IndexedMap.h"' | xargs grep -L 'IndexedMap[<]' | xargs grep -n --color=auto 'IndexedMap'

Patch by Eugene Kosov <claprix@yandex.ru>

Differential Revision: http://reviews.llvm.org/D19219

From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 266595
2016-04-18 09:17:29 +00:00
Matt Arsenault
c10783c42d AMDGPU: Enable LocalStackSlotAllocation pass
This resolves more frame indexes early and folds
the immediate offsets into the scratch mubuf instructions.

This cleans up a lot of the mess that's currently emitted,
such as emitting add 0s and repeatedly initializing the same
register to 0 when spilling.

llvm-svn: 266508
2016-04-16 02:13:37 +00:00
Matt Arsenault
b6be202779 AMDGPU: Use s_addk_i32 / s_mulk_i32
llvm-svn: 266506
2016-04-16 01:46:49 +00:00
Jun Bum Lim
4c5bd58ebe [MachineScheduler]Add support for store clustering
Perform store clustering just like load clustering. This change add
StoreClusterMutation in machine-scheduler. To control StoreClusterMutation,
added enableClusterStores() in TargetInstrInfo.h. This is enabled only on
AArch64 for now.

This change also add support for unscaled stores which were not handled in
getMemOpBaseRegImmOfs().

llvm-svn: 266437
2016-04-15 14:58:38 +00:00
Nicolai Haehnle
750082d1fe AMDGPU/SI: Fix regression with no-return atomics
Summary:
In the added test-case, the atomic instruction feeds into a non-machine
CopyToReg node which hasn't been selected yet, so guard against
non-machine opcodes here.

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19043

llvm-svn: 266433
2016-04-15 14:42:36 +00:00
Matt Arsenault
9c499c3a74 AMDGPU: Remove custom load/store scalarization
llvm-svn: 266385
2016-04-14 23:31:26 +00:00
Matt Arsenault
fd8ab09c0e AMDGPU: Include LDS size in printed comment
llvm-svn: 266382
2016-04-14 22:11:51 +00:00
Matt Arsenault
3d1c1deb04 AMDGPU: Run SIFoldOperands after PeepholeOptimizer
PeepholeOptimizer cleans up redundant copies, which makes
the operand folding more effective.

shader-db stats:

Totals:
SGPRS: 34200 -> 34336 (0.40 %)
VGPRS: 22118 -> 21655 (-2.09 %)
Code Size: 632144 -> 633460 (0.21 %) bytes
LDS: 11 -> 11 (0.00 %) blocks
Scratch: 10240 -> 11264 (10.00 %) bytes per wave
Max Waves: 8822 -> 8918 (1.09 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 7704 -> 7840 (1.77 %)
VGPRS: 5169 -> 4706 (-8.96 %)
Code Size: 234444 -> 235760 (0.56 %) bytes
LDS: 2 -> 2 (0.00 %) blocks
Scratch: 0 -> 1024 (0.00 %) bytes per wave
Max Waves: 1188 -> 1284 (8.08 %)
Wait states: 0 -> 0 (0.00 %)

Increases:
SGPRS: 35 (0.01 %)
VGPRS: 1 (0.00 %)
Code Size: 59 (0.02 %)
LDS: 0 (0.00 %)
Scratch: 1 (0.00 %)
Max Waves: 48 (0.02 %)
Wait states: 0 (0.00 %)

Decreases:
SGPRS: 26 (0.01 %)
VGPRS: 54 (0.02 %)
Code Size: 68 (0.03 %)
LDS: 0 (0.00 %)
Scratch: 0 (0.00 %)
Max Waves: 4 (0.00 %)
Wait states: 0 (0.00 %)

llvm-svn: 266378
2016-04-14 21:58:24 +00:00
Matt Arsenault
4ac341c8b3 AMDGPU: Directly emit m0 initialization with s_mov_b32
Currently what comes out of instruction selection is a
register initialized to -1, and then copied to m0.
MachineCSE doesn't consider copies, but we want these
to be CSEed. This isn't much of a problem currently,
because SIFoldOperands is run immediately after.

This avoids regressions when SIFoldOperands is run later
from leaving all copies to m0.

llvm-svn: 266377
2016-04-14 21:58:15 +00:00
Matt Arsenault
7900334dd5 AMDGPU: Fold bitcasts of scalar constants to vectors
This cleans up some messes since the individual scalar components
can be CSEed.

llvm-svn: 266376
2016-04-14 21:58:07 +00:00
Tom Stellard
000c5af3e6 AMDGPU: Add skeleton GlobalIsel implementation
Summary:
This adds the necessary target code to be able to run the ir translator.
Lowering function arguments and returns is a nop and there is no support
for RegBankSelect.

Reviewers: arsenm, qcolombet

Subscribers: arsenm, joker.eph, vkalintiris, llvm-commits

Differential Revision: http://reviews.llvm.org/D19077

llvm-svn: 266356
2016-04-14 19:09:28 +00:00
Nicolai Haehnle
05b127da06 [StructurizeCFG] Annotate branches that were treated as uniform
Summary:
This fully solves the problem where the StructurizeCFG pass does not
consider the same branches as uniform as the SIAnnotateControlFlow pass.
The patch in D19013 helps with this problem, but is not sufficient
(and, interestingly, causes a "regression" with one of the existing
test cases).

No tests included here, because tests in D19013 already cover this.

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19018

llvm-svn: 266346
2016-04-14 17:42:35 +00:00
Nicolai Haehnle
723b73b4eb AMDGPU: Remove SIFixSGPRLiveRanges pass
Summary:
This pass is unnecessary and overly conservative. It was motivated by
situations like

  def %vreg0:SGPR_32
  ...
if-block:
  ..
  def %vreg1:SGPR_32
  ...
else-block:
  ...
  use %vreg0:SGPR_32
  ...

and similar situations with uses after the non-uniform control flow, where
we are not allowed to assign %vreg0 and %vreg1 to the same physical register,
even though in the original, thread/workitem-based CFG, it looks like the
live ranges of these registers do not overlap.

However, by the time register allocation runs, we have moved to a wave-based
CFG that accurately represents the fact that the wave may run through both
the if- and the else-block. So the live ranges of %vreg0 and %vreg1 already
overlap even without the SIFixSGPRLiveRanges pass.

In addition to proving this change correct, I have tested it with Piglit
and a small number of other tests.

Reviewers: arsenm, tstellarAMD

Subscribers: MatzeB, arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19041

llvm-svn: 266345
2016-04-14 17:42:29 +00:00
Nicolai Haehnle
19f0f5177d AMDGPU: change a redundant if () to an assert(). NFC
Summary:
I've been carrying this change around with me for a while, because the if ()
managed to confuse me while following the code. All callers ensure that the
assertion holds.

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19042

llvm-svn: 266344
2016-04-14 17:42:18 +00:00
Tom Stellard
79a1fd718c AMDGPU: allow specifying a workgroup size that needs to fit in a compute unit
Summary:
For GL_ARB_compute_shader we need to support workgroup sizes of at least 1024. However, if we want to allow large workgroup sizes, we may need to use less registers, as we have to run more waves per SIMD.

This patch adds an attribute to specify the maximum work group size the compiled program needs to support. It defaults, to 256, as that has no wave restrictions.

Reducing the number of registers available is done similarly to how the registers were reserved for chips with the sgpr init bug.

Reviewers: mareko, arsenm, tstellarAMD, nhaehnle

Subscribers: FireBurn, kerberizer, llvm-commits, arsenm

Differential Revision: http://reviews.llvm.org/D18340

Patch By: Bas Nieuwenhuizen

llvm-svn: 266337
2016-04-14 16:27:07 +00:00
Tom Stellard
f110f8f9f7 AMDGPU/SI: Use the correct scratch wave offset register for shaders.
Summary:
The code previously always used s1 as it was using the user + system SGPR
information for compute kernels. This is incorrect for Mesa shaders though,

The register should be the next SGPR after all user and system SGPR's.
We use that Mesa adds arguments for all input and system SGPR's and
take the next available SGPR for the scratch wave offset register.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>

Reviewers: mareko, arsenm, nhaehnle, tstellarAMD

Subscribers: qcolombet, arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18941

Patch By: Bas Nieuwenhuizen

llvm-svn: 266336
2016-04-14 16:27:03 +00:00
Matt Arsenault
9cd90712f0 AMDGPU: Implement canonicalize
Also add generic DAG node for it.

llvm-svn: 266272
2016-04-14 01:42:16 +00:00
Tom Stellard
b951f10701 AMDGPU/SI: Add support for spilling VGPRs without having to scavenge registers
Summary:
When we are spilling SGPRs to scratch memory, we usually don't have
free SGPRs to do the address calculation, so we need to re-use the
ScratchOffset register for the calculation.

Reviewers: arsenm

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18917

llvm-svn: 266244
2016-04-13 20:44:16 +00:00
Artem Tamazov
eb4d5a9b0b [AMDGPU][llvm-mc] Support of Trap Handler registers (TTMP0..11 and TBA/TMA)git status
Tests added along with implemented feature.
Note that there is a small leftover of unecessary MI sheduling issue
(more info in the review). CodeGen/AMDGPU/salu-to-valu.ll updated to fix
the false regression.

TODO: Support for TTMP quads, comma-separated syntax in "[]" and more.

Differential Revision: http://reviews.llvm.org/D17825

llvm-svn: 266205
2016-04-13 16:18:41 +00:00
Tom Stellard
703b2ec43f AMDGPU/SI: Fix spilling of 96-bit registers
Summary:
It seems like this was broken in r252327.  I thought we had test cases
for this, but it's really hard to tirgger spills of this exact register
size since they aren't used very much.

Reviewers: arsenm, nhaehnle

Subscribers: nhaehnle, arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19021

llvm-svn: 266152
2016-04-12 23:57:30 +00:00
Nicolai Haehnle
df77c9ada4 AMDGPU: add llvm.amdgcn.buffer.load/store intrinsics
Summary:
They correspond to BUFFER_LOAD/STORE_DWORD[_X2,X3,X4] and mostly behave like
llvm.amdgcn.buffer.load/store.format. They will be used by Mesa for SSBO and
atomic counters at least when robust buffer access behavior is desired.
(These instructions perform no format conversion and do buffer range checking
per component.)

As a side effect of sharing patterns with llvm.amdgcn.buffer.store.format,
it has become trivial to add support for the f32 and v2f32 variants of that
intrinsic, so the patch does so.

Also DAG-ify (and fix) some tests that I noticed intermittent failures in
while developing this patch.

Some tests were (temporarily) adjusted for the required mayLoad/hasSideEffects
changes to the BUFFER_STORE_DWORD* instructions. See also
http://reviews.llvm.org/D18291.

Reviewers: arsenm, tstellarAMD, mareko

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18292

llvm-svn: 266126
2016-04-12 21:18:10 +00:00
Tom Stellard
ab1d3a9d50 AMDGPU/SI: Insert wait states required after v_readfirstlane on SI
Summary:
We will be able to handle this case much better once the hazard recognizer
is finished, but this conservative implementation  fixes a hang with the piglit
test:

spec/arb_arrays_of_arrays/execution/sampler/fs-nested-struct-arrays-nonconst-nested-arra

Reviewers: arsenm, nhaehnle

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18988

llvm-svn: 266105
2016-04-12 18:40:43 +00:00
Matt Arsenault
3b08238f78 AMDGPU: Eliminate half of i64 or if one operand is zero_extend from i32
This helps clean up some of the mess when expanding unaligned 64-bit
loads when changed to be promote to v2i32, and fixes situations
where or x, 0 was emitted after splitting 64-bit ors during moveToVALU.

I think this could be a generic combine but I'm not sure.

llvm-svn: 266104
2016-04-12 18:24:38 +00:00
Nicolai Haehnle
279970c0dc AMDGPU/SI: Fix a mis-compilation of multi-level breaks
Summary:
Under certain circumstances, multi-level breaks (or what is understood by
the control flow passes as such) could be miscompiled in a way that causes
infinite loops, by emitting incorrect control flow intrinsics.

This fixes a hang in
dEQP-GLES3.functional.shaders.loops.while_dynamic_iterations.conditional_continue_vertex

Reviewers: arsenm, tstellarAMD

Subscribers: arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18967

llvm-svn: 266088
2016-04-12 16:10:38 +00:00