We were trying to avoid using a FrameIndex operand in non-pointer
operands in a convoluted way, and would break because of
using TargetFrameIndex. The TargetFrameIndex should only be used
in the case where it makes sense to fold it as part of the addressing
mode, otherwise it requires materialization like a normal constant.
This wasn't working reliably and failed in the added testcase, hitting
the assert when processing the frame index.
The TargetFrameIndex was coming from trying to produce an AssertZext
limiting the maximum stack size. I'm not sure this was correct to begin
with, because it is apparently possible to have a single workitem
dispatch that requires all 4G of private memory.
llvm-svn: 281824
This reduces the number of copies and reg_sequences
when using fp constant vectors. This significantly
reduces the code size in local-stack-alloc-bug.ll
llvm-svn: 281822
Summary: i8, i16, and f16 values are not extended to 32-bit in the HSA kernel ABI.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, llvm-commits, yaxunl
Differential Revision: https://reviews.llvm.org/D24621
llvm-svn: 281789
These clean up some unnecessary or instructions in
cases with complex loops.
In the original testcase I noticed this, the same
or with exec was repeated 5 or 6 times in a row. With
this only one is emitted or sometimes a copy.
llvm-svn: 281786
Summary:
The main challenge in lowering kernel arguments for AMDGPU is determing the
memory type of the argument. The generic calling convention code assumes
that only legal register types can be stored in memory, but this is not the
case for AMDGPU.
This consolidates all the logic AMDGPU uses for deducing memory types into a single
function. This will make it much easier to support different ABIs in the future.
Reviewers: arsenm
Subscribers: arsenm, wdng, nhaehnle, llvm-commits, yaxunl
Differential Revision: https://reviews.llvm.org/D24614
llvm-svn: 281781
Summary:
mesa3d will use the same kernel calling convention as amdhsa, but it will
handle everything else like the default 'unknown' OS type.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D22783
llvm-svn: 281779
This addresses a TODO to handle operations besides and. This
also starts eliminating no-op operations with a constant that
can emerge later.
llvm-svn: 281488
If the literal is being folded into src0, it doesn't matter
if it's an SGPR because it's being replaced with the literal.
Also fixes initially selecting 32-bit versions of some instructions
which also confused commuting.
llvm-svn: 281117
We want each register to have a canonical type, which means the best place to
store this is in MachineRegisterInfo rather than on every MachineInstr that
happens to use or define that register.
Most changes following from this are pretty simple (you need an MRI anyway if
you're going to be doing any transformations, so just check the type there).
But legalization doesn't really want to check redundant operands (when, for
example, a G_ADD only ever has one type) so I've made use of MCInstrDesc's
operand type field to encode these constraints and limit legalization's work.
As an added bonus, more validation is possible, both in MachineVerifier and
MachineIRBuilder (coming soon).
llvm-svn: 281035
Summary:
Also removed duplicate code from AMDGPUTargetAsmStreamer.
This change only change how amd_kernel_code_t is parsed and printed. No variable names are changed.
Reviewers: vpykhtin, tstellarAMD
Subscribers: arsenm, wdng, nhaehnle
Differential Revision: https://reviews.llvm.org/D24296
llvm-svn: 281028
Add the ability to computeKnownBits and SimplifyDemandedBits to extract the known zero/one bits from BUILD_VECTOR, returning the known bits that are shared by every vector element.
This is an initial step towards determining the sign bits of a vector (PR29079).
Differential Revision: https://reviews.llvm.org/D24253
llvm-svn: 280927
OpenCL kernels have hidden kernel arguments for global offset and printf buffer. For consistency, these hidden argument should be included in the runtime metadata. Also updated kernel argument kind metadata.
Differential Revision: https://reviews.llvm.org/D23424
llvm-svn: 280829
- Implemented amdgpu-flat-work-group-size attribute
- Implemented amdgpu-num-active-waves-per-eu attribute
- Implemented amdgpu-num-sgpr attribute
- Implemented amdgpu-num-vgpr attribute
- Dynamic LDS constraints are in a separate patch
Patch by Tom Stellard and Konstantin Zhuravlyov
Differential Revision: https://reviews.llvm.org/D21562
llvm-svn: 280747
Summary:
This contains two changes that reduce the time spent in WQM, with the
intention of reducing bandwidth required by VMEM loads:
1. Sampling instructions by themselves don't need to run in WQM, only their
coordinate inputs need it (unless of course there is a dependent sampling
instruction). The initial scanInstructions step is modified accordingly.
2. When switching back from WQM to Exact, switch back as soon as possible.
This affects the logic in processBlock.
This should always be a win or at best neutral.
There are also some cleanups (e.g. remove unused ExecExports) and some new
debugging output.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: http://reviews.llvm.org/D22092
llvm-svn: 280590
Summary:
This fixes a rare bug in polygon stippling with non-monolithic pixel shaders.
The underlying problem is as follows: the prolog part contains the polygon
stippling sequence, i.e. a kill. The main part then enables WQM based on the
_reduced_ exec mask, effectively undoing most of the polygon stippling.
Since we cannot know whether polygon stippling will be used, the main part
of a non-monolithic shader must always return to exact mode to fix this
problem.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23131
llvm-svn: 280589
readlane/writelane do not support using m0 as the output/input.
Constrain the register class of spill vregs to try to avoid this,
but also handle spilling of the physreg when necessary by inserting
an additional copy to a normal SGPR.
llvm-svn: 280584
Subregister definitions are considered uses for the purpose of tracking
liveness of the whole register. At the same time, when calculating live
interval subranges, subregister defs should not be treated as uses.
Differential Revision: https://reviews.llvm.org/D24190
llvm-svn: 280532
Add runtime metdata for pointee alignment of pointer type kernel argument. The key is KeyArgPointeeAlign and the value is a 32 bit unsigned integer.
Differential Revision: https://reviews.llvm.org/D24145
llvm-svn: 280399
Summary: This fixes some OpenCV tests that were broken by libclc commit r276443.
Reviewers: arsenm, jvesely
Subscribers: arsenm, wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D24051
llvm-svn: 280274
Summary:
Simply replace usage of aliases to functions with aliasee.
This came up when bitcode linking to builtin library and
calls to aliases not being resolved.
Also made minor improvements to existing test.
Reviewers: tstellarAMD, alex-t, vpykhtin
Subscribers: arsenm, wdng, rampitec
Differential Revision: https://reviews.llvm.org/D24023
llvm-svn: 280221
Summary:
The SILoadStoreOptimizer can now look ahead more then one instruction when
looking for instructions to merge, which greatly improves the number of
loads/stores that we are able to merge.
Moving the pass before scheduling avoids increasing register pressure after
the scheduler, so that the scheduler's register pressure estimates will be
more accurate. It also gives more consistent results, since it is no longer
affected by minor scheduling changes.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, llvm-commits
Differential Revision: https://reviews.llvm.org/D23814
llvm-svn: 279991
Summary:
For shrinking SOPK instructions, we were creating a hint to tell the
register allocator to use the register allocated for src0 for the dst
operand as well. However, this seems to not work sometimes depending
on the order virtual registers are assigned physical registers.
To fix this, I've added a second allocation hint which does the reverse,
asks that the register allocated for dst is used for src0.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23862
llvm-svn: 279968
There's only one use of this for the convenience
of a pattern. I think v_mov_b64_pseudo should also be
moved, but SIFoldOperands does currently make use of it.
llvm-svn: 279901
Summary:
If the scheduler clusters the loads, then the offsets will be sorted,
but it is possible for the scheduler to scheduler loads together
without out explicitly clustering them, which would give us non-sorted
offsets.
Also, we will want to do this if we move the load/store optimizer before
the scheduler.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits, kzhuravl
Differential Revision: https://reviews.llvm.org/D23776
llvm-svn: 279870