As described on D111049, we're trying to remove the <string> dependency from error handling and replace uses of report_fatal_error(const std::string&) with the Twine() variant which can be forward declared.
Without this change _term instructions can be removed during
critical edge splitting.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D111126
The current way to detect hostcalls by looking for "ockl_hostcall_internal()" function in the module seems to be not reliable enough. The LTO may rename the "ockl_hostcall_internal()" function when an application is compiled with "-fgpu-rdc", and MetadataStreamer pass to fail to detect hostcalls, therefore it does not set the "hidden_hostcall_buffer" kernel argument.
This change adds a new module flag: hostcall that can be used to detect whether GPU functions use host calls for printf.
Differential revision: https://reviews.llvm.org/D110337
Scalarize before narrowing because the narrowing implementation does not
work on vectors. This matches what we do for regular G_MUL.
Differential Revision: https://reviews.llvm.org/D111129
The delayed stack protector feature which is currently used for SDAG (and thus
allows for more commonly generating tail calls) depends on being able to extract
the tail call into a separate return block. To do this it also has to extract
the vreg->physreg copies that set up the call's arguments, since if it doesn't
then the call inst ends up using undefined physregs in it's new spliced block.
SelectionDAG implementations can do this because they delay emitting register
copies until *after* the stack arguments are set up. GISel however just
processes and emits the arguments in IR order, so stack arguments always end up
last, and thus this breaks the code that looks for any register arg copies that
precede the call instruction.
This patch adds a thunk argument to the assignValueToReg() and custom assignment
hooks. For outgoing arguments, register assignments use this return param to
return a thunk that does the actual generating of the copies. We collect these
until all the outgoing stack assignments have been done and then execute them,
so that the copies (and perhaps some artifacts like G_SEXTs) are placed after
any stores.
Differential Revision: https://reviews.llvm.org/D110610
1. Splitted out some parts of R600 target to separate modules/headers.
2. Reduced some include lists in headers.
3. Minor forward declarations, redundant includes and flags in GCNSubtarget
cleanup.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D109351
It is now very simple and can go right into the header
allowing optimizer to combine callers, such as isVGPRClass
and similar.
It does not need anything from the TRI itself anymore, so
make it static class member along with the callers.
Differential Revision: https://reviews.llvm.org/D110762
This is used to fix wrong code generation of s_add_co_select_user in
test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll
s_addc_u32 s4, s6, 0
s_cselect_b64 vcc, 1, 0 <-- vcc set as 0x1 if SCC==1
v_mov_b32_e32 v1, s4
s_cmp_gt_u32 s6, 31
v_cndmask_b32_e32 v1, 0, v1, vcc
If the s_addc_u32 set SCC, then we will get value 0x1 in VCC.
The v_cndmask will do per thread selection with VCC as condition
register. As VCC only gets the first bit being set, only the first
thread/lane in destination register can get correct result if the
very first lane is active. In fact, we should broadcast the value to all
active lanes of the final register.
The idea here is doing this broadcast to vector boolean explicitly
instead of lowering it into a COPY from SCC which would be interpreted as
selecting between 0/1.
This is used to replace D109754.
Reviewed-by: foad, alex-t
Differential Revision: https://reviews.llvm.org/D109889
This was introduced in D32628 but it does not seem to be required any
more. At least it does not show any problems in check-llvm in an
LLVM_ENABLE_EXPENSIVE_CHECKS build.
Differential Revision: https://reviews.llvm.org/D110692
ASan device library functions (those starts with the prefix __asan_)
are at the moment undergoing through undesired optimizations due to
internalization. Hence, in order to avoid such undesired optimizations
on ASan device library functions, do not internalize them in the first
place.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D110468
HSA runtime fails to find the symbols for Init and Fini kernels as
they mark with internal linkage, changing the linkage to external
to fix those errors.
Differential Revision: https://reviews.llvm.org/D110054
KILL instructions are sometimes present and prevented hard
clauses from being formed.
Fix this by ignoring all meta instructions in clauses.
Differential Revision: https://reviews.llvm.org/D106042
Non-entry functions have 32 caller saved VGPRs available. If we
promote alloca to consume more registers we will have to spill
CSRs. There is no reason to eliminate scratch access to get
another scratch access instead.
Differential Revision: https://reviews.llvm.org/D110372
With architected flat scratch it becomes readonly. We must always
reserve SGPR pair for it even if we do not use scratch at all since
an attempt to write to SGPRs mapped to FLAT_SCRATCH results in
memory violation.
This is not needed since GFX10 with architected flat scratch though
since special SGPRs are not carving space from normal SGPRs.
Differential Revision: https://reviews.llvm.org/D110376
We don't allow an initializer for LDS variables
and there is an early abort during instruction
selection. This patch legalizes them by ignoring
the init values. During assembly emission, proper
error reporting already exists for such instances.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109901
On targets that do not support AGPR to AGPR copying directly, try to find the
defining accvgpr_write and propagate its source vgpr register to the copies
before register allocation so the source vgpr register does not get clobbered.
The postrapseudos pass also attempt to propagate the defining accvgpr_write but
if the register to propagate is clobbered, it will give up and create new
temporary vgpr registers instead.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108830
The pass amdgpu-propagate-attributes ("Early/Late propagate attributes
from kernels to functions") is currently run also for shaders, where
it does nothing. Modify the check so the pass only processes functions
for kernels.
Differential Revision: https://reviews.llvm.org/D109961
This simplifies the API and addresses a FIXME in
TwoAddressInstructionPass::convertInstTo3Addr.
Differential Revision: https://reviews.llvm.org/D110229
Based off a discussion on D110100, we should be avoiding default CostKinds whenever possible.
This initial patch removes them from the 'inner' target implementation callbacks - these should only be used by the main TTI calls, so this should guarantee that we don't cause changes in CostKind by missing it in an inner call. This exposed a few missing arguments in getGEPCost and reduction cost calls that I've cleaned up.
Differential Revision: https://reviews.llvm.org/D110242
Use of output modifiers forces VOP3 encoding for a VOP2 mac/fmac
instruction, so we might as well convert it to the more flexible VOP3-
only mad/fma form.
With this change, the only way we should emit VOP3-encoded mac/fmac is
if regalloc chooses registers that require the VOP3 encoding, e.g. sgprs
for both src0 and src1. In all other cases the mac/fmac should either be
converted to mad/fma or shrunk to VOP2 encoding.
Differential Revision: https://reviews.llvm.org/D110156
Add an overload to pass the flat workgroup range in separately. This
will allow the attributor to use the assumed value for
amdgpu-flat-workgroup-sizes when inferring amdgpu-waves-per-eu.
And always print it.
This makes some LLVM diagnostics match up better with Clang's diagnostics.
Updated some AMDGPU uses of DiagnosticInfoResourceLimit and now we print
better diagnostics for those.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D110204
Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* but divergent to V_CMP_*. Sometimes, for the sake of efficiency, SSA subgraphs may be converted to VALU to avoid repeatedly copying data back and forth. Hence we have to be able to sustain the correctness passing the i1 from VALU to SALU context and vice versa.
VALU operations only process the active lanes of the VGPR and ignore inactive ones.
Active lanes correspond to 1 bit in the EXEC mask register.
SALU represents i1 as just one bit but VALU as 64bits: 0/1 and 0/(0xffffffffffffffff & EXEC) respectively.
SALU uses one-bit conditional flag SCC but VALU - VCC that is a pair of 32-bit SGPRs
To expose SCC to the VALU context we need to convert the one-bit boolean value to the appropriate 64bit.
To return back to the SALU context we need to do the opposite.
To correctly convert 64bit VALU boolean to either 0 or 1 we need to filter out the bits corresponding to the inactive lanes.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D109900
When adding alias.scope and noalias metadata to a memcpy function,
the alias.scope and noalias metadata from the operands are merged.
The rule for merging alias.scope is to take the intersection of
the domains and the union of the scopes within those domains.
The rule for merging noalias is to take the intersection.
The bug is that AMDGPULowerModuleLDS was using concatenation for
both alias.scope and noalias. For example, when f1 and f2 are added
to the LDS structure and there is a memcpy(f2, f1, sizeof(f1)).
Then, concatenation creates noalias metadata for the memcpy that
includes both {f1, f2}. That means that the memcpy is assumed
not to alias a prior load of f2, which enables the optimizer to
remove a load of f2 that occurs after mempcy.
The function MDNode::getmostGenericAliasScope defines the semantics
for alias.scope. There is a function, combineMetadata in Local.cpp,
that uses intersect for noalias.
Differential Revision: https://reviews.llvm.org/D110049
FMA_W_CHAIN is used when lowering fdiv f32. Prefer to select it to fmac
if there are no source modifiers, just like we do for other mad/mac and
fma/fmac cases.
Differential Revision: https://reviews.llvm.org/D110074
v_fmac with source modifiers forces VOP3 encoding, but it is strictly
better to use the VOP3-only v_fma instead, because $dst and $src2 are
not tied so it gives the register allocator more freedom and avoids a
copy in some cases.
This is the same strategy we already use for v_mad vs v_mac and
v_fma_legacy vs v_fmac_legacy.
Differential Revision: https://reviews.llvm.org/D110070
Add eraseInstr(s) utility functions. Before deleting an instruction
collects its use instructions. After deletion deletes use instructions
that became trivially dead.
This patch clears all dead instructions in existing legalizer mir tests.
Differential Revision: https://reviews.llvm.org/D109154
Rework getConstantstVRegValWithLookThrough in order to make it clear if we
are matching integer/float constant only or any constant(default).
Add helper functions that get DefVReg and APInt/APFloat from constant instr
getIConstantVRegValWithLookThrough: integer constant, only G_CONSTANT
getFConstantVRegValWithLookThrough: float constant, only G_FCONSTANT
getAnyConstantVRegValWithLookThrough: either G_CONSTANT or G_FCONSTANT
Rename getConstantVRegVal and getConstantVRegSExtVal to getIConstantVRegVal
and getIConstantVRegSExtVal. These now only match G_CONSTANT as described
in comment.
Relevant matchers now return both DefVReg and APInt/APFloat.
Replace existing uses of getConstantstVRegValWithLookThrough and
getConstantVRegVal with new helper functions. Any constant match is
only required in:
ConstantFoldBinOp: for constant argument that was bit-cast of float to int
getAArch64VectorSplat: AArch64::G_DUP operands can be any constant
amdgpu select for G_BUILD_VECTOR_TRUNC: operands can be any constant
In other places use integer only constant match.
Differential Revision: https://reviews.llvm.org/D104409
Nonfunctional commit fixing several minor spelling errors in llvm/lib/Target/AMDGPU header files.
Testing workflow as a new contributor.
Differential Revision: https://reviews.llvm.org/D109733
In https://reviews.llvm.org/D100481, forceful inline of all non-kernel
functions using lds was disabled since AMDGPULowerModuleLDS pass now handles
static lds. However that pass does not handle extern lds so non-kernel
functions using extern lds must sill be inline.
Reviewed By: hsmhsm, arsenm
Differential Revision: https://reviews.llvm.org/D109773
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
The attributor can determine that some indirect calls do not require
special inputs. The special inputs will still be present in the ABI,
so we need to allocate the registers and pass undefs.
This reapplies commit 7dbba3376f, or, put
differently, this reverts commit d9a8d20827.
The test now requires the amdgpu and nvptx backend explicitly as it
won't work without properly.
Not all address spaces support initializers for globals and we can
therefore not set them without checking if they are allowed. This
patch adds a hook into TTI to check if an AS allows non-undef
initializers. We disable it for all but address space 0 by default,
NVPTX and AMDGPU targets allow all but address space 3.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D109337
This reverts commit 98f4713122.
Without any (theoretical/practical) guarantee that all the allocas within
*entry* basic block are clustered together at the beginning of the block,
this patch is doomed to fail. Hence reverting it.
Drop the legacy version in AMDGPUAnnotateKernelFeatures. This has the
side effect of now respecting the linkage, and not changing externally
visible functions.