Conv3D has an existing linalg operation for floating point. This adds a quantized
variant and the corresponding lowering from TOSA. Numerical correctness was validated
using the TOSA conformance tests.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D140919
When converting to NVVM, lowering gpu.printf to vprintf allows us to
support printing when running on CUDA.
Differential Revision: https://reviews.llvm.org/D141049
When lowering tosa.resize, the input may have a unit dimension.
Lowering to a new tosa.resize and an explicit broadcast simplifies the
tosa.resize operation and avoids recomputing identical broadcasted values.
This change reworks the broadcast optimization to reuse the generic
tosa.resize implementation.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D139963
There's currently no way to get accurate cube roots in the math dialect.
powf(x, 1/3.0) is too inaccurate in some cases.
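A small sketch of why the pow-based form falls short, written in Python rather
than MLIR. The helper names `cbrt_via_pow` and `cbrt_refined` are hypothetical
and are not part of the math dialect; the refinement shown is one generic way
to improve a pow-based estimate, not the implementation added by this change.

```python
import math


def cbrt_via_pow(x: float) -> float:
    """Cube root as pow(x, 1/3.0).

    1/3.0 is not exactly representable in binary floating point, so the
    result can be off by a rounding error, and a negative x has no real
    result for a non-integer exponent (Python raises ValueError here).
    """
    return math.pow(x, 1.0 / 3.0)


def cbrt_refined(x: float) -> float:
    """Sign-aware cube root: pow-based guess plus one Newton step."""
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    a = abs(x)
    y = math.pow(a, 1.0 / 3.0)
    # One Newton iteration on f(y) = y^3 - a tightens the estimate.
    y = y - (y * y * y - a) / (3.0 * y * y)
    return math.copysign(y, x)
```

A dedicated cube root operation can handle negative inputs and exact cubes
directly, which the pow-based form cannot guarantee.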
Reviewed By: akuegel
Differential Revision: https://reviews.llvm.org/D140842
1. When converting from the GPU dialect to the ROCDL dialect, if the
function that contains a gpu.thread_id or gpu.block_id op is annotated
with gpu.known_{block,grid}_size, use that size to set a "range"
attribute on the corresponding rocdl intrinsic so that the LLVM
frontend can optimize based on that range information.
1b. When translating from the rocdl dialect to LLVM IR, use the
"range" attribute, if present, to set !range metadata on the relevant
function call.
2. Deprecate the old rocdl.max_flat_work_group_size attribute, which
was used in a tensorflow backend. Instead, use
rocdl.flat_work_group_size going forward to allow kernel generators to
specify the minimum and maximum work group sizes a kernel may be
launched with in one attribute, thus more closely matching the backend.
3. When translating from gpu.func to llvm.func within gpu-to-rocdl,
copy the known_block_size attribute as rocdl.reqd_work_group_size to
enable further translations to set the corresponding metadata on the
LLVM IR function. Also, set the rocdl.flat_work_group_size attribute
to ensure that the reqd_work_group_size metadata and the
amdgpu-flat-work-group-size metadata are consistent.
3b. Extend the ROCDL to LLVM IR translation to set the
!reqd_work_group_size metadata on LLVM functions.
Also update tests and add functions to the ROCDL dialect to ensure
attribute names are used consistently.
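The relationship between the attributes above can be modeled roughly as
follows. This is a Python sketch, not the conversion code: the helper names are
hypothetical, the range is assumed to be half-open like LLVM `!range` metadata,
and the flat work group size is assumed to be the product of the known block
dimensions, written as a "min,max" pair.

```python
from math import prod


def id_range(known_size: int) -> tuple[int, int]:
    """Half-open [lo, hi) range of an ID along a dimension of known size."""
    return (0, known_size)


def flat_work_group_size(known_block_size: list[int]) -> str:
    """Flat size consistent with a required (x, y, z) work group size.

    Assumption: a kernel with a required work group size is always launched
    with exactly x*y*z work items, so min and max coincide.
    """
    n = prod(known_block_size)
    return f"{n},{n}"
```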
Depends on D139865
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D139866
Depending on the target environment, we may need to emulate certain
types, which can cause issues with bitcast.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D140437
The requirement that divisor>0 is not enforced here outside of the
constant case, but how to enforce it? If I understand correctly, it is
UB and while it is nice to be able to deterministically intercept UB,
that isn't always feasible. Hopefully, keeping the existing
enforcement in the constant case is enough.
Differential Revision: https://reviews.llvm.org/D140079
When using a tosa.resize from ?x1x1x? to ?x1x?x?, we should avoid doing a 2D
interpolation, as only two unique values are loaded. As the extract operation
performs numerical computation on its values, the superfluous extracts may
fail to be coalesced. Instead, we only interpolate between the values if there
are multiple values to interpolate between.
For the integer case we also perform scaling by the scaling-factor to apply
the same integer scaling behavior as interpolation.
Reviewed By: jpienaar, NatashaKnk
Differential Revision: https://reviews.llvm.org/D139979
This handles loading from 2-D memrefs, in addition to the 1-D memref
cases handled in D139655.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D140136
This is now possible with transpose semantics on subgroup MMA
load ops.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D139655
Conversion from gpu.subgroup_mma_constant_matrix to spirv.MatrixTimesScalar didn't check that the op type was a multiplication and thus would incorrectly convert other elementwise scalar operations.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D140081
Power functions are implemented as linkonce_odr scalar functions
for FPowI operations met in a module.
Vector form of FPowI is linearized into a sequence of calls
of the scalar functions.
Option {min-width-of-fpowi-exponent} controls which FPowI operations
are converted by MathToFuncs: if the width of the exponent's integer
type is less than the specified value, then the operation is not converted.
Flang will specify {min-width-of-fpowi-exponent=33} to make sure that
math::FPowI operations with exponent wider than 32 bits will be converted
by MathToFuncs, and operations with more narrow exponent will be left
for MathToLLVM to convert them to LLVM::PowIOp.
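The following Python sketch models the behavior described above. It is not the
generated IR: `fpowi`, `fpowi_vector`, and `converted_by_math_to_funcs` are
hypothetical names, and exponentiation by squaring is assumed as the scalar
algorithm since the message does not spell it out.

```python
def fpowi(base: float, exponent: int) -> float:
    """Scalar power with an integer exponent, by repeated squaring.

    Assumes a nonzero base when the exponent is negative.
    """
    negative = exponent < 0
    n = -exponent if negative else exponent
    result = 1.0
    while n:
        if n & 1:
            result *= base
        base *= base
        n >>= 1
    return 1.0 / result if negative else result


def fpowi_vector(bases: list[float], exponents: list[int]) -> list[float]:
    """Vector form, linearized into one scalar call per element."""
    return [fpowi(b, e) for b, e in zip(bases, exponents)]


def converted_by_math_to_funcs(exponent_bits: int, min_width: int) -> bool:
    """Model of min-width-of-fpowi-exponent: narrower exponents are skipped."""
    return exponent_bits >= min_width
```

With `min_width=33`, a 64-bit exponent is converted while a 32-bit exponent is
left for another conversion, matching the Flang configuration described above.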
Reviewed By: Mogball
Differential Revision: https://reviews.llvm.org/D139804
Moved to using helper lambdas to avoid code repetition. The IR needed to be reordered to
accommodate this, which should be the only change to the existing tests.
This changes the quantized test to target `i48` types to guarantee that types are extended
correctly when necessary.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D136500
This avoids introducing 64-bit types, which may be difficult for some
targets to handle.
Reviewed By: rsuderman, antiagainst
Differential Revision: https://reviews.llvm.org/D139777
Conversion of CopySignOp to SPIRV is supported for scalars and vectors but not for 1D vectors with 1 element (aka vector<1xf32>). This revision adds support for them by treating them as scalars.
An alternative solution would be to allow 0D vectors for SPIRV, but the spec [0] strictly defines the vector type as non-0D.
"Vector: An ordered homogeneous collection of two or more scalars. Vector sizes are quite restrictive and dependent on the execution model."
[0] https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_types
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D139518
The goal is to make the naming of the future `_extended` ops more
consistent. With unsigned addition, the carry value/flag and overflow
bit are the same, but this is not true when it comes to signed addition.
Also rename the second result from `carry` to `overflow`.
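The distinction between carry and overflow can be checked with a small 8-bit
model. This is an illustrative Python sketch with hypothetical helper names,
not the semantics definition of the ops themselves.

```python
BITS = 8
MASK = (1 << BITS) - 1


def add_unsigned(a: int, b: int) -> tuple[int, bool]:
    """Wrapped unsigned sum and the carry out of the top bit."""
    total = (a & MASK) + (b & MASK)
    return total & MASK, total > MASK


def to_signed(value: int) -> int:
    """Reinterpret the low BITS bits as a two's complement integer."""
    value &= MASK
    return value - (1 << BITS) if value >> (BITS - 1) else value


def add_signed(a: int, b: int) -> tuple[int, bool]:
    """Wrapped signed sum and whether the exact sum was out of range."""
    exact = a + b
    wrapped = to_signed(exact)
    return wrapped, wrapped != exact
```

For unsigned operands the carry is exactly the overflow condition. For signed
operands the two diverge: 100 + 100 overflows without a carry out, while
-1 + -1 produces a carry out of the bit patterns without overflowing.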
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D139569
The implementation assumed an i32 accumulator. Fixed the implementation so
that it no longer relies on that assumption.
Reviewed By: NatashaKnk
Differential Revision: https://reviews.llvm.org/D139365
Since tosa.pad is lowered strictly to arith and tensor ops, move
ConvertPad from TosaToLinalg to TosaToTensor, benefitting non-Linalg
Tosa targets. TensorToLinalg exists, and is trivial, so nothing is lost.
Signed-off-by: Ramkumar Ramachandra <r@artagnon.com>
Differential Revision: https://reviews.llvm.org/D139091
Along the way, make the default pattern fail instead of crashing
when an elementwise op is not supported yet.
Reviewed By: kuhar
Differential Revision: https://reviews.llvm.org/D139280
Rounding of tosa.resize did not handle rounding to the nearest pixel correctly.
Rather than dividing the scale by 2 we should double the partial pixel to
guarantee we include a check on the lowest bit.
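The difference between the two integer formulations shows up when the scale is
odd. The sketch below is a Python model with hypothetical names; the exact
expressions in the lowering, and whether a tie rounds up, are assumptions.

```python
def rounds_up_halved_scale(remainder: int, scale: int) -> bool:
    """Compare against scale // 2; the integer division drops the lowest bit."""
    return remainder >= scale // 2


def rounds_up_doubled_remainder(remainder: int, scale: int) -> bool:
    """Compare 2 * remainder against scale; no bits are lost."""
    return 2 * remainder >= scale
```

With `scale = 5` and `remainder = 2` the true fraction is 0.4, so the pixel
should round down. The halved-scale check compares 2 against 2 and rounds up,
while the doubled-remainder check compares 4 against 5 and rounds down.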
Reviewed By: NatashaKnk
Differential Revision: https://reviews.llvm.org/D139162
Add support for loading, computing, and storing `gpu.subgroup` WMMA ops
in transpose mode as well. Update the GPU to NVVM lowerings to support
`transpose` mode and update integration tests as well.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D139021
This reverts commit d0650d1089.
Original commit message:
Subviews are supposed to be expanded before we hit the lowering
code.
The expansion is done with the pass called
expand-strided-metadata.
Add a test that demonstrates how these passes can be linked up to achieve
the desired lowering.
This patch is NFC in spirit but not in practice because `subview` gets
lowered into `reinterpret_cast(extract_strided_metadata, <some math>)`
which lowers into two memref descriptors (one for `reinterpret_cast` and
one for `extract_strided_metadata`), which creates some noise of the
form: `extractvalue(unrealized_cast(extractvalue[0]))[0]` that is
currently not simplified within MLIR but that is really just a noop in
that case.
Differential Revision: https://reviews.llvm.org/D136377
This reverts commit c8e15afa4c.
This breaks some integration tests, see
https://lab.llvm.org/buildbot/#/builders/220/builds/10446
I have to update a bunch of RUN lines in the tests to use the new
lowering scheme. Nothing complicated but let's keep the build clean
while I'm fixing that.
Subviews are supposed to be expanded before we hit the lowering
code.
The expansion is done with the pass called
expand-strided-metadata.
Add a test that demonstrates how these passes can be linked up to achieve
the desired lowering.
This patch is NFC in spirit but not in practice because `subview` gets
lowered into `reinterpret_cast(extract_strided_metadata, <some math>)`
which lowers into two memref descriptors (one for `reinterpret_cast` and
one for `extract_strided_metadata`), which creates some noise of the
form: `extractvalue(unrealized_cast(extractvalue[0]))[0]` that is
currently not simplified within MLIR but that is really just a noop in
that case.
Differential Revision: https://reviews.llvm.org/D136377
This patch fixes and simplifies the ldmatrix affine map arithmetic by
abstracting the affine expressions in terms of pitch-linear layout
(strided and contiguous dimensions). Then it applies the maps for
strided and contiguous dimensions in row-major and col-major.
LdMatrixOp collaboratively (32 threads in a warp) loads tiles
(8 rows x 128b columns) of data. It can load x1, x2, or x4 tiles.
Additionally, it can transpose at 16-bit granularity when moving
data from shared memory to registers.
This patch fixes the affine map
(laneid -> coordinate index that a thread points to in a tile):
- Loading x4 tiles needs all 32 lanes (T0-31) to each point to a contiguous
chunk of 128b. The issue was exposed when running this case.
- Loading x2 and x1 tiles needs threads T0-15 and T0-7, respectively, to
point to contiguous chunks of 128b. The patch is NFC for these cases.
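As a rough model of which lanes supply row pointers, the Python sketch below
assumes the PTX ldmatrix convention that each group of 8 consecutive lanes
addresses the 8 rows of one tile. The function name is hypothetical, and the
sketch deliberately stops short of the strided/contiguous affine maps in the
patch, which are not reproduced here.

```python
def ldmatrix_row_pointer(lane_id: int, num_tiles: int):
    """Return (tile, row) addressed by a lane, or None if the lane is unused.

    Each tile has 8 rows of 128b, so 8 * num_tiles lanes supply pointers.
    """
    if num_tiles not in (1, 2, 4):
        raise ValueError("ldmatrix loads x1, x2, or x4 tiles")
    if not 0 <= lane_id < 32:
        raise ValueError("a warp has lanes 0..31")
    if lane_id >= 8 * num_tiles:
        return None
    return (lane_id // 8, lane_id % 8)
```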
Differential Revision: https://reviews.llvm.org/D138978
* Fix type conversions around positions--we need to use the
converted value from the adaptor.
* Convert constant position cases to composite extract/insert.
Reviewed By: kuhar
Differential Revision: https://reviews.llvm.org/D139057
This commit extends the `ResourceLimitsAttr` to support specifying
a minimal and maximal subgroup size, and extends `EntryPointABIAttr`
to support specifying the requested subgroup size. This is possible
now in Vulkan with the VK_EXT_subgroup_size_control extension.
For OpenCL it's possible to use the `SubgroupSize` execution mode
directly.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D138962
Enables transposed gpu.subgroup_mma_load_matrix and updates the lowerings in Vector to GPU and GPU to SPIRV. Needed to enable B transpose matmuls lowering to wmma ops.
Taken over from author: stanley-nod <stanley@nod-labs.com>
Reviewed By: ThomasRaoux, antiagainst
Differential Revision: https://reviews.llvm.org/D138770
This patch is part of a larger simplification effort of vector transfer
operations. It removes the flag `lower-permutation-maps` from
VectorToSCF conversion and enables the lowering of permutation maps
by default. This means that VectorToSCF will always lower permutation
maps to independent broadcast/transpose operations before lowering
vector operations to SCF.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D138742
This is generated by running
```
sed --in-place 's/[[:space:]]\+$//' mlir/**/*.td
sed --in-place 's/[[:space:]]\+$//' mlir/**/*.mlir
```
Reviewed By: rriddle, dcaballe
Differential Revision: https://reviews.llvm.org/D138866
This patch adds the and, or, and xor bitwise operations to
the index dialect, with folders and LLVM lowerings.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D138590