AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.
Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.
Removing them seems to slightly increase code quality as well as
simplifying both the tablegen and C++ parts of the code.
Differential Revision: https://reviews.llvm.org/D149853
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.
The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.
One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.
In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.
This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.
Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.
Depends on D145441
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D147547
Remove the unnecessary `"llvm/IR/PatternMatch.h"` include directive from
`ComplexDeinterleavingPass.h` and move it to the corresponding source
file.
Add missing includes that were transitively included by this header to 3
other source files.
This reduces the total number of preprocessing tokens across the LLVM
source files in `lib` from (roughly) 1,964,876,961 to 1,935,091,611 - a
reduction of ~1.52%. This should result in a small improvement in
compilation time.
If a call to the llvm.fabs intrinsic has users in another reachable
BB, SelectionDAG will not apply the abs modifier to these users and
instead generate a v_and ..., 0x7fffffff instruction.
For fneg instructions, the issue is similar.
This patch implements `AMDGPUIselLowering::shouldSinkOperands`,
which allows CodegenPrepare to call `tryToSinkFreeOperands`.
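As a bit-level illustration of what the abs/neg modifiers and the fallback `v_and` compute, here is a minimal C++ sketch. It assumes `float` is IEEE-754 binary32; the helper names `fabsViaMask`, `fnegViaXor` and `checkSignBitHelpers` are hypothetical and are not part of this patch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: clear the sign bit, the same bit operation as
// and-ing the raw f32 bits with 0x7fffffff.
float fabsViaMask(float X) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  Bits &= 0x7fffffffu;
  std::memcpy(&X, &Bits, sizeof(X));
  return X;
}

// Hypothetical helper: flip the sign bit, the bit-level equivalent of fneg.
float fnegViaXor(float X) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  Bits ^= 0x80000000u;
  std::memcpy(&X, &Bits, sizeof(X));
  return X;
}

// Usage: both helpers agree with ordinary floating-point semantics.
void checkSignBitHelpers() {
  assert(fabsViaMask(-1.5f) == std::fabs(-1.5f));
  assert(fnegViaXor(3.0f) == -3.0f);
  assert(std::signbit(fnegViaXor(0.0f))); // +0.0 becomes -0.0
}
```

Because both operations touch only the sign bit, they can be folded into a source modifier on the using instruction instead of being emitted as a separate instruction.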
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D150347
Summary:
Registers for tail call return should not be clobbered by the callee.
So we need a sub-class of SGPR_64 (excluding callee saved registers (CSR)) to hold
the tail call return address.
Because GFX and C calling conventions have different CSR, we need to define
the sub-class separately. This work is an extension of D147096 with the
consideration of GFX calling convention.
Based on the calling convention, different instructions will be selected with
different sub-classes of SGPR_64 as the input.
Reviewers: arsenm, cdevadas and sebastian-ne
Differential Revision: https://reviews.llvm.org/D148824
A bitcast needs to be inserted back to the original type. Just
skip the multiple use case for a safer quick fix. Handling
the multiple use case seems to be beneficial in some but not
all cases.
The math libraries have a lot of code that performs
manual sign bit operations by bitcasting doubles to int2
and doing bithacking on them. This is a bad canonical form; we
should rewrite it to use high-level sign operations directly
on double. To avoid codegen regressions, we need to do a better
job of moving fnegs to operate only on the high 32 bits.
This is only halfway to fixing the real case.
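To make the "high 32 bits" point concrete, here is a minimal C++ sketch of negating a double by touching only its upper half. It assumes `double` is IEEE-754 binary64; the names `fnegViaHighWord` and `checkHighWordNegate` are hypothetical and only illustrate the bit layout, not the combine itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: split a double into 32-bit halves, flip the sign bit
// in the high half only, and reassemble. The low half is left untouched.
double fnegViaHighWord(double X) {
  std::uint64_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  std::uint32_t Lo = static_cast<std::uint32_t>(Bits);
  std::uint32_t Hi = static_cast<std::uint32_t>(Bits >> 32);
  Hi ^= 0x80000000u; // the sign bit of a double lives in the high word
  Bits = (static_cast<std::uint64_t>(Hi) << 32) | Lo;
  std::memcpy(&X, &Bits, sizeof(X));
  return X;
}

// Usage: the result matches ordinary negation.
void checkHighWordNegate() {
  assert(fnegViaHighWord(1.25) == -1.25);
  assert(fnegViaHighWord(-3.0) == 3.0);
  assert(std::signbit(fnegViaHighWord(0.0)));
}
```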
Similar to the existing SelectionDAG::SplitVector helper, this helper creates the EXTRACT_ELEMENT nodes for the LO/HI halves of the scalar source.
Differential Revision: https://reviews.llvm.org/D147264
Post ISel, LDS variables are absolute values. Representing them as
such is simpler than the frame recalculation currently used to build assembler
tables from their addresses.
This is a precursor to lowering dynamic/external LDS accesses from non-kernel
functions.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144221
We do match source modifiers for f32 typed selects already, but the
combiner code was never informed of this.
A long time ago the documentation lied and stated that source
modifiers don't work for v_cndmask_b32 when they in fact do. We had a
bunch of code operating under the assumption that they don't support
source modifiers, so we tried to move fnegs around to work around
this.
Gets a few small improvements here and there. The main hazard to watch
out for is infinite loops in the combiner since we try to move fnegs
up and down the DAG. For now, don't fold fneg directly into select.
The generic combiner does this for a restricted set of cases
when getNegatedExpression obviously shows an improvement for both
operands. It turns out to be trickier to avoid infinite loops in the
combiner in conjunction with pulling out source modifiers, so
leave this for a later commit.
Ignore the multiple use heuristics of the default
implementation, and report cost based on inline immediates. This
is mostly interesting for -0 vs. 0. Gets a few small improvements.
fneg_fadd_0_f16 is a small regression. We could probably avoid this
if we handled folding fneg into div_fixup.
This is NFC as it stands, since other combines will effectively
prevent this from being reachable. This will avoid regressions in a
future change which tries to make better use of select source
modifiers.
Didn't bother with the GlobalISel part for now, since the baseline
combine doesn't seem to work on the existing test.
C++17 allows us to call the constructors of pair and tuple directly
instead of the helper functions make_pair and make_tuple.
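A minimal sketch of the two spellings, using only the standard library. The function names `makeWithHelper`, `makeWithCtor` and `checkDeduction` are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

// Pre-C++17 spelling: the helper function deduces the element types.
std::pair<int, std::string> makeWithHelper() {
  return std::make_pair(1, std::string("one"));
}

// C++17 spelling: class template argument deduction on the constructor.
std::pair<int, std::string> makeWithCtor() {
  return std::pair(1, std::string("one"));
}

// Usage: both spellings produce the same value, and std::tuple deduces
// its element types the same way.
void checkDeduction() {
  auto T = std::tuple(1, 2.5, 'c');
  static_assert(std::is_same_v<decltype(T), std::tuple<int, double, char>>);
  assert(makeWithHelper() == makeWithCtor());
  assert(std::get<1>(T) == 2.5);
}
```

The two forms are not identical in every case: `std::make_pair` unwraps `std::reference_wrapper` arguments, while the deduced constructor does not.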
Differential Revision: https://reviews.llvm.org/D139828
- [Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
- Add Sema & CodeGen test cases.
- Also add cases that D138651 would have covered as this patch replaces it.
- [AMDGPU] Add BF16 storage-only support
- Support legalization/dealing with bf16 operations in DAGIsel.
- bf16 as a type remains illegal and is represented as i16 for storage purposes.
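To illustrate what "represented as i16 for storage" means at the bit level, here is a minimal C++ sketch that keeps a bf16 value in a 16-bit integer. It assumes `float` is IEEE-754 binary32 and uses plain truncation when narrowing, which is the simplest choice; the rounding behavior of the actual lowering is not described here, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: a bf16 value is the upper 16 bits of an f32 bit
// pattern. This version truncates (no rounding) when narrowing.
std::uint16_t floatToBF16Bits(float X) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  return static_cast<std::uint16_t>(Bits >> 16);
}

// Hypothetical helper: widen the stored 16 bits back to f32 by placing
// them in the upper half and zero-filling the lower half.
float bf16BitsToFloat(std::uint16_t Stored) {
  std::uint32_t Bits = static_cast<std::uint32_t>(Stored) << 16;
  float X;
  std::memcpy(&X, &Bits, sizeof(X));
  return X;
}

// Usage: values with at most 8 significant bits round-trip exactly;
// extra mantissa bits are dropped.
void checkBF16Storage() {
  assert(floatToBF16Bits(1.0f) == 0x3F80);
  assert(bf16BitsToFloat(0xC000) == -2.0f);
  assert(bf16BitsToFloat(floatToBF16Bits(1.00390625f)) == 1.0f);
}
```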
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D139398
Renames the current lowering scheme to "module" and introduces two new
ones, "kernel" and "table", plus a "hybrid" that chooses between those three
on a per-variable basis.
Unit tests are set up to pass with the default lowering of "module" or "hybrid"
with this patch defaulting to "module", which will be a less dramatic codegen
change relative to the current behavior. This reflects the sparsity of test coverage for
the table lowering method. Hybrid is better than module in every respect and
will be the default in a subsequent patch.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D139433
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
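A minimal sketch of the target spelling after the migration, written against the standard library only. The function `findIndex` is a hypothetical example and does not come from the patch; the comment marks the line that the mechanical replacement would touch.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical lookup that returns the position of Needle, if present.
std::optional<std::size_t> findIndex(const std::vector<int> &Values,
                                     int Needle) {
  for (std::size_t I = 0; I < Values.size(); ++I)
    if (Values[I] == Needle)
      return I;
  // With llvm::Optional this line would have read `return None;`.
  return std::nullopt;
}

// Usage: a hit yields the index, a miss yields an empty optional.
void checkFindIndex() {
  assert(findIndex({4, 5, 6}, 5) == std::size_t{1});
  assert(!findIndex({4, 5, 6}, 7).has_value());
}
```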
class support and introduce GlobalISel implementation for AMDGPU
Uses existing SelectionDAG lowering of the llvm.amdgcn.class intrinsic
for llvm.is.fpclass
A target can report whether a misaligned access is 'fast', as defined
by the target. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.
A target can still define it as it wants and the direct translation
of the current code uses 0 and 1 for current false and true. This
makes the change an NFC.
A subsequent patch will start using an actual speed value in
the load/store vectorizer to check whether a vectorized access is
going to be not just fast, but no slower than before.
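The shape of the change can be sketched with a toy model. Everything below is hypothetical: `toyAllowsMisalignedAccess` is a simplified stand-in, not the real `allowsMisalignedMemoryAccesses` signature, the speed values 1 and 2 are invented, and `vectorizedIsNotSlower` is only one plausible way a caller might compare speeds. Only `legacySpeed` reflects the 0/1 translation described above.

```cpp
#include <cassert>

// Direct translation of the old boolean: false -> 0, true -> 1.
unsigned legacySpeed(bool WasFast) { return WasFast ? 1u : 0u; }

// Hypothetical toy hook: reports whether the access is allowed and, through
// the optional out-parameter, a relative speed (0 means not fast).
bool toyAllowsMisalignedAccess(unsigned SizeInBits, unsigned AlignInBytes,
                               unsigned *Fast = nullptr) {
  bool Allowed = SizeInBits <= 128; // invented limit for this sketch
  if (Fast) {
    if (!Allowed)
      *Fast = 0;
    else
      *Fast = AlignInBytes >= 4 ? 2u : 1u; // invented speed levels
  }
  return Allowed;
}

// Hypothetical caller-side check: accept the wider access only if it is
// at least as fast as the accesses it replaces.
bool vectorizedIsNotSlower(unsigned VectorSpeed, unsigned ScalarSpeed) {
  return VectorSpeed >= ScalarSpeed;
}

// Usage: a 128-bit access at 1-byte alignment is allowed in this model but
// slower than the 4-byte-aligned 32-bit accesses it would replace.
void checkSpeedModel() {
  unsigned ScalarSpeed = 0, VectorSpeed = 0;
  assert(toyAllowsMisalignedAccess(32, 4, &ScalarSpeed));
  assert(toyAllowsMisalignedAccess(128, 1, &VectorSpeed));
  assert(!vectorizedIsNotSlower(VectorSpeed, ScalarSpeed));
}
```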
Differential Revision: https://reviews.llvm.org/D124217
This was disabled to prevent regressions, which appear to be just occurring on AMDGPU (at least in our current lit tests), which I've addressed by adding AMDGPUTargetLowering::isDesirableToCommuteWithShift overrides.
Fixes #57872
Differential Revision: https://reviews.llvm.org/D136042
The condition needs to be different for right-shifts, else we may lose information in some cases.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D136059