Currently, the lowering for vector.step lives
under a folder. This is not ideal if we want
to transform it and defer the
materialization of the constants until much later.
This commit adds a rewrite pattern that
can be applied via the
`transform.structured.vectorize_children_and_apply_patterns`
transform dialect operation.
Moreover, the vector.step rewrite pattern is also
now used in the -convert-vector-to-llvm pass, where
it handles scalable and non-scalable types as
LLVM expects.
As a consequence of removing the vector.step
lowering from its folder, linalg vectorization
will keep vector.step intact.
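As a sketch (hypothetical IR), this is the kind of op that now survives
linalg vectorization instead of being folded into a constant right away:
```mlir
func.func @step() -> vector<4xindex> {
  %0 = vector.step : vector<4xindex>
  return %0 : vector<4xindex>
}
```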
Recently, we added an intrinsic for the elect.sync PTX instruction (PR
104780). This patch updates the corresponding Op in the NVVM dialect
to lower to the intrinsic instead of inline PTX.
The existing test under Conversion/ is migrated to check for the new
pattern. A separate test is added to verify the lowered intrinsic under
the Target/ directory.
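For illustration only (the exact assembly is assumed here), the op in
question looks roughly like this and now maps to the
llvm.nvvm.elect.sync intrinsic rather than an inline-PTX block:
```mlir
%pred = nvvm.elect.sync -> i1
```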
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The memref.reinterpret_cast operation used to accept input like:
%out = memref.reinterpret_cast %in to offset: [%offset], sizes: [10],
strides: [1]
: memref<?xf32> to memref<10xf32>
A problem arises: while lowering, the true offset of %out is %offset,
but its data type indicates an offset of 0. Permitting this
inconsistency can result in incorrect outcomes, as certain passes might
erroneously extract the offset from the data type of %out.
This patch fixes this by enforcing that the return value's data type is
consistent with the input parameters.
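As a sketch (hypothetical IR), the accepted form now carries the dynamic
offset in the result type's layout instead of implying a static offset
of 0:
```mlir
%out = memref.reinterpret_cast %in to offset: [%offset], sizes: [10], strides: [1]
    : memref<?xf32> to memref<10xf32, strided<[1], offset: ?>>
```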
This PR fixes an issue related to integer overflow when computing
`(intmax+1)` for `i64` during `tosa-to-linalg` pass for `tosa.cast`.
Found this issue while debugging a numerical mismatch for the
`deeplabv3` model from `torchvision`, represented in the `tosa` dialect
using the `TorchToTosa` pipeline in the `torch-mlir` repository.
`torch.aten.to.dtype` is converted to a `tosa.cast` that casts `f32` to
`i64`. Technically, by the specification, `tosa.cast` doesn't handle
casting `f32` to `i64`.
So it's possible to add a verifier to error out for such tosa ops
instead of producing incorrect code. However, I chose to fix the
overflow issue to still be able to represent the `deeplabv3` model with
`tosa` ops in the above-mentioned pipeline. Open to suggestions if
adding the verifier is more appropriate instead.
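For reference, a sketch of the kind of cast that triggered the overflow
(shapes are hypothetical); its clamping bound `(intmax + 1)` cannot be
represented in `i64`:
```mlir
%0 = tosa.cast %arg0 : (tensor<1x513x513xf32>) -> tensor<1x513x513xi64>
```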
This PR fixes a crash in `VectorToGPU` when the operand of `extOp` is a
function argument, whose defining op cannot be retrieved via
`getDefiningOp`.
Fixes #107967.
This PR updates the cast to bool from IntN to treat any non-zero value
as TRUE. This makes the cast more resilient to non-generic (i.e.
"non-1") TRUE values.
Signed-off-by: Dmitriy Smirnov <dmitriy.smirnov@arm.com>
Since ddf2d62c7d, 0-d vectors have been supported in VectorType. This
patch removes the 0-d vector handling with scalars for the
TransferOpReduceRank pattern. This pattern specifically introduces
tensor.extract_slice during vectorization, which prevents vectorization
from folding transfer_read/transfer_write slices properly. The changes
in the vectorization test files reflect this.
There are other places where lowering patterns still side-step handling
0-d vectors properly by turning them into scalars, but this patch only
focuses on the vector.transfer_x patterns.
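As a sketch (hypothetical IR), a 0-d transfer is now kept as a 0-d
vector rather than being rank-reduced to a scalar:
```mlir
%v = vector.transfer_read %src[], %pad : tensor<f32>, vector<f32>
```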
This patch adds operand bundle support for `llvm.intr.assume`.
This patch actually contains two parts:
- `llvm.intr.assume` now accepts operand bundle related attributes and
operands. `llvm.intr.assume` does not impose constraints on the operand
bundles, but only a small set of operand bundles is meaningful. I plan
to add some of those (e.g. `aligned` and `separate_storage` are what
interest me, but other people may be interested in other operand
bundles as well) in future patches.
- The definitions of `llvm.call`, `llvm.invoke`, and
`llvm.call_intrinsic` define `op_bundle_tags` as an operation property.
It turns out this approach would introduce unnecessary burden if applied
equally to the intrinsic operations, because properties are not
available through `Operation *`, yet we have to operate on
`Operation *` during the import/export of intrinsics. This PR therefore
changes it from a property to an array attribute.
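For illustration only (the operand bundle assembly is assumed here to
mirror the one used by `llvm.call`), an assume carrying an `align`
bundle would look roughly like:
```mlir
llvm.intr.assume %cond ["align"(%ptr, %alignment : !llvm.ptr, i32)] : i1
```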
This patch relands commit d8fadad07c.
I am removing the recently added integration test for various Arith Ops.
These operations and their lowerings are effectively already verified by
the Arith-to-LLVM conversion tests in:
* "mlir/test/Conversion/ArithToLLVM/arith-to-llvm.mlir"
I've noticed that a few variants of `arith.cmpi` were missing in that
file - those are added here as well.
This is a follow-up for this discussion:
* https://github.com/llvm/llvm-project/pull/92272
See also the recent update to our guidelines on e2e tests in MLIR:
* https://github.com/llvm/mlir-www/pull/203
This patch adds operand bundle support for `llvm.intr.assume`.
This patch actually contains two parts:
- `llvm.intr.assume` now accepts operand bundle related attributes and
operands. `llvm.intr.assume` does not impose constraints on the operand
bundles, but only a small set of operand bundles is meaningful. I plan
to add some of those (e.g. `aligned` and `separate_storage` are what
interest me, but other people may be interested in other operand
bundles as well) in future patches.
- The definitions of `llvm.call`, `llvm.invoke`, and
`llvm.call_intrinsic` define `op_bundle_tags` as an operation property.
It turns out this approach would introduce unnecessary burden if applied
equally to the intrinsic operations, because properties are not
available through `Operation *`, yet we have to operate on
`Operation *` during the import/export of intrinsics. This PR therefore
changes it from a property to an array attribute.
Adds tests with scalable vectors for the Vector-To-LLVM conversion pass.
Covers the following Ops:
* `vector.transfer_read`,
* `vector.transfer_write`.
In addition:
* Duplicate tests from "vector-mask-to-llvm.mlir" are removed.
* Tests for xfer_read/xfer_write are moved to a newly created test file,
"vector-xfer-to-llvm.mlir". This follows an existing pattern among
VectorToLLVM conversion tests.
* Tests that exercise both xfer_read and xfer_write have their names
updated to reflect that (e.g. @transfer_read_1d_mask ->
@transfer_read_write_1d_mask)
* @transfer_write_1d_scalable_mask and @transfer_read_1d_scalable_mask
are re-written as @transfer_read_write_1d_mask_scalable. This is to
make it clear that this case is meant to complement
@transfer_read_write_1d_mask.
* @transfer_write_tensor is updated to also test xfer_read.
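For reference, a sketch (hypothetical IR) of the scalable-vector
transfer shape now covered by these tests:
```mlir
%v = vector.transfer_read %mem[%i], %pad {in_bounds = [true]}
    : memref<?xf32>, vector<[4]xf32>
```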
This patch simplifies the representation of OpenMP loop wrapper
operations by introducing the `NoTerminator` trait and updating the
verifier for the `LoopWrapperInterface` accordingly.
Since loop wrappers are already limited to having exactly one region
containing exactly one block, and this block can only hold a single
`omp.loop_nest` or loop wrapper and an `omp.terminator` that does not
return any values, it makes sense to simplify the representation of loop
wrappers by removing the terminator.
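As a sketch (hypothetical IR), a wrapper region now holds only the
nested `omp.loop_nest` and no trailing `omp.terminator`:
```mlir
omp.wsloop {
  omp.loop_nest (%iv) : i32 = (%lb) to (%ub) step (%step) {
    omp.yield
  }
}
```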
There is an extensive list of Lit tests that needed updating to remove
`omp.terminator`s, which adds some noise to this patch, but the actual
changes are limited to the definition of the `omp.wsloop`, `omp.simd`,
`omp.distribute` and `omp.taskloop` loop wrapper ops, Flang lowering
for those ops, `LoopWrapperInterface::verifyImpl()`, the SCF to OpenMP
conversion, and the OpenMP dialect documentation.
Default to Global address space for memrefs that do not have an explicit address space set in the IR.
---------
Co-authored-by: Victor Perez <victor.perez@intel.com>
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Victor Perez <victor.perez@codeplay.com>
Relaxes the vector.transfer_write lowering to allow out-of-bounds
writes. This aligns the lowering with the current hardware
specification, which does not update bytes at out-of-bounds locations
during block stores.
The PR updates the math.powf lowering to produce a NaN result for a
negative base with a fractional exponent, which matches the behaviour
of the C/C++ implementation.
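As a sketch (names assumed), a case like the following, i.e.
(-2.0)^0.5, now lowers to code that yields NaN, matching C/C++ powf:
```mlir
%base = arith.constant -2.0 : f32
%exp = arith.constant 0.5 : f32
%r = math.powf %base, %exp : f32
```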
This was found by inspecting AMDGPU assembly: the arithmetic ops created
there were definitely making their way into the target ISA. A
`LLVM::BitcastOp` seems equivalent, and evaporates as expected in the
target asm.
Along the way, I thought this helper function `mfmaConcatIfNeeded`
could be renamed to `convertMFMAVectorOperand` to better convey its
contract; that way I don't need to think about whether a bitcast is a
legitimate "concat" :-)
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Adds tests with scalable vectors for the Vector-To-LLVM conversion pass.
Covers the following Ops:
* `vector.insert_strided_slice`
With this change, for every test with fixed-width vectors, there should
be a corresponding example with scalable vectors (for
`vector.insert_strided_slice`). In addition:
* Test function names are updated to more accurately reflect the case
being exercised (e.g. `@insert_strided_index_slice1` ->
`@insert_strided_index_slice_index_2d_into_3d`)
* For consistency, took the liberty of updating some of the function
names for `vector.extract_strided_slice`
* `@insert_strided_slice_scalable` is effectively replaced with
`@insert_strided_slice_f32_2d_into_3d_scalable`
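For reference, a sketch (hypothetical IR) of the scalable case now
exercised:
```mlir
%0 = vector.insert_strided_slice %src, %dst
    {offsets = [0, 0], strides = [1]}
    : vector<[4]xf32> into vector<2x[4]xf32>
```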
The `memref.alloca` lowering computed the allocation size incorrectly
when one of the dimensions had size 0.
Previously:
```
memref.alloca() : memref<10x0x2xf32>
--> llvm.alloca 20xf32
```
Now:
```
memref.alloca() : memref<10x0x2xf32>
--> llvm.alloca 0xf32
```
From the `llvm.alloca` documentation:
```
Allocating zero bytes is legal, but the returned pointer may not be unique.
```
NOTE: This is a follow-up for #97049 in which the `in_bounds` attribute
was made mandatory.
This PR updates the semantics of the `in_bounds` attribute so that
broadcast dimensions are no longer required to be "in bounds".
Specifically, these xfer_read/xfer_write Ops become valid after this
change:
```mlir
%read = vector.transfer_read %A[%base1, %base2], %pad
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : memref<?x?xf32>, vector<9xf32>
vector.transfer_write %vec, %A[%base1, %base2]
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : vector<9xf32>, memref<?x?xf32>
```
Note that the value `false` merely means "may run out-of-bounds", i.e.,
the corresponding access can still be "in bounds". In fact, the folder
for xfer Ops is also updated (*) and will update the attribute value
corresponding to broadcast dims to `true` if all non-broadcast dims
are marked as "in bounds".
Note that this PR doesn't change any of the lowerings. The changes in
"SuperVectorize.cpp", "Vectorization.cpp" and "AffineMap.cpp" are simple
reverts of recent changes in #97049. Those were only meant to facilitate
making `in_bounds` mandatory and to work around the extra requirements
for broadcast dims (those requirements were removed in this PR). All
changes in tests are also reverts of changes from #97049.
For context, here's a PR in which "broadcast" dims were forced to
always be "in-bounds":
* https://reviews.llvm.org/D102566
(*) See `foldTransferInBoundsAttribute`.
Support broadcasting of the depthwise conv2d bias in the tosa->linalg
named lowering in the case where the bias is a rank-1 tensor with
exactly one element. In this case, TOSA specifies that the value should
first be broadcast across the bias dimension and then across the result
tensor.
Add `lit` tests for depthwise conv2d with a scalar bias and for conv3d,
which was already supported but missing coverage.
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
This patch updates the printing and parsing of operations that include
clauses defining entry block arguments for the operation's region. This
impacts `in_reduction`, `map`, `private`, `reduction` and
`task_reduction`.
The proposed representation to be used by all such clauses is the
following:
```
<clause_name>([byref] [@<sym>] %value -> %block_arg [, ...] : <type>[, ...]) {
...
}
```
The `byref` tag is only allowed for reduction-like clauses, and the
`@<sym>` is required and only allowed for the `private` and
reduction-like clauses. The `map` clause does not accept either of
these.
This change fixes some currently broken op representations, like
`omp.teams` or `omp.sections` reduction:
```
omp.teams reduction([byref] @<sym> -> %value : <type>) {
^bb0(%block_arg : <type>):
...
}
```
Additionally, it addresses some redundancy in the representation of the
previously mentioned cases, as well as e.g. `map` in `omp.target`. The
problem is that the block argument name after the arrow is not checked
in any way, which makes some misleading representations legal:
```mlir
omp.target map_entries(%x -> %arg1, %y -> %arg0, %z -> %doesnt_exist : !llvm.ptr, !llvm.ptr, !llvm.ptr) {
^bb0(%arg0 : !llvm.ptr, %arg1 : !llvm.ptr, %arg2 : !llvm.ptr):
...
}
```
In that case, `%x` maps to `%arg0`, contrary to what the representation
states, and `%z` maps to `%arg2`. `%doesnt_exist` is not resolved, so it
would likely cause issues if used anywhere inside of the operation's
region.
The solution implemented in this patch makes it so that values
introduced after the arrow on the representation of these clauses
implicitly define the corresponding entry block arguments, removing the
potential for these problematic representations. This is what is already
implemented for the `private` and `reduction` clauses of `omp.parallel`.
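As a sketch (values and types assumed), the earlier `omp.target`
example would now be written without an explicit entry block header,
since the clause itself defines the block arguments:
```mlir
omp.target map_entries(%x -> %arg0, %y -> %arg1, %z -> %arg2 : !llvm.ptr, !llvm.ptr, !llvm.ptr) {
  ...
}
```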
There are a couple of consequences of this change:
- Entry block argument-defining clauses must come at the end of the
operation's representation and in alphabetical order. This is because
they are printed/parsed as part of the region and a standardized
ordering is needed to reliably match op arguments with their
corresponding entry block arguments via the `BlockArgOpenMPOpInterface`.
- We can no longer define per-clause assembly formats to be reused by
all operations that take these clauses, since they must be passed to a
custom printer including the region and arguments of all other entry
block argument-defining clauses. Code duplication and potential for
introducing issues is minimized by providing the generic
`{print,parse}BlockArgRegion` helpers and associated structures.
MLIR and Flang lowering unit tests are updated due to changes in the
order and formatting of impacted operations.
This change contains the following:
- Adds lowering of the printf op to the spirv.CL.printf op in the
GPUToSPIRV pass.
- Fixes Constant decoration parsing for spirv GlobalVariable.
- Makes a minor modification to the spirv.CL.printf op assembly format.
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Update the GPU to NVVM lowerings to correctly propagate range
information on IDs and dimension queries, either from
known_{block,grid}_size attributes or from `upperBound` annotations on
the operations themselves.
This patch modifies the representation of `OpenMP_Clause` to allow
definitions to incorporate both required and optional arguments, while
still allowing operations that include them to override the
`assemblyFormat` and take advantage of automatically-populated format
strings.
The proposed approach is to split the `assemblyFormat` clause property
into `reqAssemblyFormat` and `optAssemblyFormat`, and remove the
`isRequired` template and associated `required` property. The
`OpenMP_Op` class, in turn, populates the new `clausesReqAssemblyFormat`
and `clausesOptAssemblyFormat` properties in addition to
`clausesAssemblyFormat`. These properties can be used by clause-based
OpenMP operation definitions to reconstruct parts of the
clause-inherited format string in a more flexible way when overriding
it.
Clause definitions are updated to follow this new approach and some
operation definitions overriding the `assemblyFormat` are simplified by
taking advantage of the improved flexibility, reducing code duplication.
The `verify-openmp-ops` tablegen pass is updated for the new
`OpenMP_Clause` representation.
Some MLIR and Flang unit tests had to be updated due to changes to the
default printing order of clauses on updated operations.
The AMDGPU backend now implements LLVM's `bfloat` type. Therefore, we no
longer need to type convert MLIR's `bf16` to `i16` during lowerings to
ROCDL.
As a result of this change, we discovered that, while the code for MFMA
and WMMA intrinsics was mostly prepared for this change, we were failing
to bitcast the bf16 results of WMMA operations out of the i16 values
they are natively represented as. This commit also fixes that issue.
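For illustration only (lane count and types assumed), this is the kind
of bitcast now inserted to reinterpret a WMMA result, natively carried
as i16 lanes, as bf16:
```mlir
%res = llvm.bitcast %raw : vector<8xi16> to vector<8xbf16>
```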
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
* Fix a bug introduced by the Chipset refactoring in #107720 where
atomics emulation for adds was mistakenly applied to gfx11+
* Add the case needed for gfx11+ atomic emulation, namely that gfx11
doesn't support atomically adding a v2f16 or v2bf16, thus requiring
MLIR-level legalization for buffer intrinsics that attempt to do such an
addition
* Add tests, including tests for gfx11 atomic emulation
Co-authored-by: Manupa Karunaratne <manupa.karunaratne@amd.com>
This commit introduces a ConstantRange attribute to match the
ConstantRange attribute type present in LLVM IR.
It then refactors the LLVM_IntrOpBase so that the basic part of the
intrinsic builder code can be re-used without needing to copy it or
get rid of important context. This, along with adding code for
handling an optional `range` attribute to that same base, allows us to
make the support for range() annotations generic without adding
another bit to IntrOpBase.
This commit then updates the lowering of index intrinsic operations to
use the new ConstantRange attribute and fixes a bug (where we'd be
subtracting 1 from upper bounds instead of adding it on operations
like gpu.block_dim) along the way.
The point of these changes is to enable these range annotations to be
used for the corresponding NVVM operations in a future commit.
Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.
Re-lands #108238 (and addresses review comments from there)
Co-authored-by: Giuseppe Rossini <giuseppe.rossini@amd.com>