This patch updates the syntax for the nvgpu_arrive Op
in matmulBuilder.py, which fixes the compilation
error for this test.
For the warp-specialized matmul_kernel implementation,
removing the WaitGroupSyncOp (after the mma main loop)
fixes the observed hang.
With these two fixes, the test compiles and
executes successfully on an sm90a machine.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
I am removing the recently added integration test for various Arith Ops.
These operations and their lowerings are effectively already verified by
the Arith-to-LLVM conversion tests in:
* "mlir/test/Conversion/ArithToLLVM/arith-to-llvm.mlir"
I've noticed that a few variants of `arith.cmpi` were missing in that
file - those are added here as well.
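For reference, a hedged sketch (operand names are hypothetical) of the kind of `arith.cmpi` variant such a conversion test exercises:
```mlir
// One of the comparison predicates covered by the Arith-to-LLVM tests.
%0 = arith.cmpi ult, %arg0, %arg1 : i32
```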
This is a follow-up for this discussion:
* https://github.com/llvm/llvm-project/pull/92272
See also the recent update to our guidelines on e2e tests in MLIR:
* https://github.com/llvm/mlir-www/pull/203
This patch fixes the sm90 cluster test by:
* Fixing a typo in LowerGpuOpsToNVVMOps where one of the ClusterDim Op
conversion patterns should actually be for the
ClusterDimBlocks Op. This addresses the compilation error for this test.
* Changing the grid size from (2,2,1) to (4,4,1). This passes the
scf-if check against the threshold of 3 below and actually
generates the required prints from the GPU.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The memref.expand_shape Op now explicitly takes an output_shape argument.
This patch adds the argument to the Op in the failing test, which fixes it.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
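For illustration, a hedged sketch (shapes and names are hypothetical) of the updated syntax with an explicit output_shape:
```mlir
// The dynamic output dimension is now spelled out via output_shape.
%expanded = memref.expand_shape %src [[0, 1], [2]] output_shape [%sz0, 4, 32]
    : memref<?x32xf32> into memref<?x4x32xf32>
```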
NOTE: This is a follow-up for #97049 in which the `in_bounds` attribute
was made mandatory.
This PR updates the semantics of the `in_bounds` attribute so that
broadcast dimensions are no longer required to be "in bounds".
Specifically, these xfer_read/xfer_write Ops become valid after this
change:
```mlir
%read = vector.transfer_read %A[%base1, %base2], %pad
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : memref<?x?xf32>, vector<9xf32>

vector.transfer_write %vec, %A[%base1, %base2]
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : vector<9xf32>, memref<?x?xf32>
```
Note that the value `false` merely means "may run out-of-bounds", i.e.,
the corresponding access can still be "in bounds". In fact, the folder
for xfer Ops is also updated (*) and will update the attribute value
corresponding to broadcast dims to `true` if all non-broadcast dims
are marked as "in bounds".
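To illustrate the folder, a hedged before/after sketch (names are hypothetical); dim 0 is the broadcast dim and dim 1 is already in bounds, so the broadcast entry is promoted to `true`:
```mlir
// Before folding: the broadcast dim (dim 0) is conservatively out-of-bounds.
%read = vector.transfer_read %A[%base1, %base2], %pad
    {in_bounds = [false, true], permutation_map = affine_map<(d0, d1) -> (0, d1)>}
    : memref<?x?xf32>, vector<4x9xf32>

// After folding: all non-broadcast dims are in bounds, so the broadcast dim
// is marked in bounds as well.
%read = vector.transfer_read %A[%base1, %base2], %pad
    {in_bounds = [true, true], permutation_map = affine_map<(d0, d1) -> (0, d1)>}
    : memref<?x?xf32>, vector<4x9xf32>
```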
Note that this PR doesn't change any of the lowerings. The changes in
"SuperVectorize.cpp", "Vectorization.cpp" and "AffineMap.cpp" are simple
reverts of recent changes in #97049. Those were only meant to facilitate
making `in_bounds` mandatory and to work around the extra requirements
for broadcast dims (those requirements were removed in this PR). All
changes in tests are also reverts of changes from #97049.
For context, here's a PR in which "broadcast" dims were forced to
always be "in-bounds":
* https://reviews.llvm.org/D102566
(*) See `foldTransferInBoundsAttribute`.
When only all-dense "sparse" tensors occur in a function prototype, the
assembler would skip the method conversion purely based on input/output
counts. However, it should rewrite based on the presence of any annotation.
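For illustration, a hedged sketch (encoding and function are hypothetical) of a prototype where the only annotation is all-dense; the presence of the annotation, not the input/output counts, should drive the rewriting:
```mlir
// An "all-dense" sparse encoding: annotated, but with no compressed levels.
#AllDense = #sparse_tensor.encoding<{
  map = (d0, d1) -> (d0 : dense, d1 : dense)
}>

// Hypothetical prototype: should still get a converted method, despite
// looking like the plain dense case by argument/result counts alone.
func.func @foo(%arg0: tensor<8x8xf64, #AllDense>) -> tensor<8x8xf64, #AllDense> {
  return %arg0 : tensor<8x8xf64, #AllDense>
}
```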
This patch moves the stepvector intrinsic out of the experimental
namespace.
This intrinsic has existed in LLVM for several years now and is widely used.
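As a hedged sketch (assuming the MLIR LLVM-dialect op follows the same rename), the intrinsic is now spelled without the experimental prefix:
```mlir
// Assumed post-rename spelling; previously llvm.intr.experimental.stepvector.
%step = llvm.intr.stepvector : vector<[4]xi32>
```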
To run integration tests using qemu-aarch64 on an x64 host, the flags below
are added to the cmake command when building mlir/llvm:
-DMLIR_INCLUDE_INTEGRATION_TESTS=ON \
-DMLIR_RUN_ARM_SVE_TESTS=ON \
-DMLIR_RUN_ARM_SME_TESTS=ON \
-DARM_EMULATOR_EXECUTABLE="<...>/qemu-aarch64" \
-DARM_EMULATOR_OPTIONS="-L /usr/aarch64-linux-gnu" \
-DARM_EMULATOR_MLIR_CPU_RUNNER_EXECUTABLE="<llvm_arm64_build_top>/bin/mlir-cpu-runner-arm64" \
-DARM_EMULATOR_LLI_EXECUTABLE="<llvm_arm64_build_top>/bin/lli" \
-DARM_EMULATOR_UTILS_LIB_DIR="<llvm_arm64_build_top>/lib"
The last three above are prebuilt on, or cross-built for, an aarch64
host.
This patch introduces substitutions such as "%native_mlir_runner_utils" and uses
them in the SVE/SME integration tests. When configured to run using qemu-aarch64,
the mlir runtime utility libs will be loaded from ARM_EMULATOR_UTILS_LIB_DIR, if set.
Some tests marked with 'UNSUPPORTED: target=aarch64{{.*}}' are still run
when configured with ARM_EMULATOR_EXECUTABLE and the default target is
not aarch64.
A lit config feature 'mlir_arm_emulator' is added in
mlir/test/lit.site.cfg.py.in and to the UNSUPPORTED list of such tests.
This PR renames the `MemRef` integration test directory and the
`DecomposeMemref.s.cpp` file so that they can be found when doing a
case-sensitive search on file paths.
- Use vector.interleave rather than the LLVM intrinsic
- Remove dependency on LLVM dialect
- Remove manual outerproduct erases (these are now trivially dead)
- Remove comment explaining issues with previous tile allocator
- Update pipeline in `multi-tile-matmul-mixed-types.mlir`
Recent changes: #90448, #80965
Allow scalable vectorization of linalg::reduce and linalg::generic ops that
have reduction iterator(s), with two restrictions:
1. The reduction dim is the last (innermost) dim of the op; and
2. Only the reduction dim is requested for scalable vectorization.
One exception is that scalable vectorization of the reduction dim in
Matmul-like ops is not supported even when the above restrictions are met.
(A sketch follows the list of allowed combinations below.)
Allowed combinations of scalable flags and iterator types:
Matmul:
Iterators: ["parallel", "parallel", "reduction"]
Scalable Flags: ["true", "true", "false"]
["false", "true", "false"]
Matvec:
Iterators: ["parallel", "reduction"]
Scalable Flags: ["false", "true"]
["true", "false"]
At the moment, the in_bounds attribute has two confusing/contradicting
properties:
1. It is both optional _and_ has an effective default-value.
2. The default value is "out-of-bounds" for non-broadcast dims, and
"in-bounds" for broadcast dims.
(see the `isDimInBounds` vector interface method for an example of this
"default" behaviour [1]).
This PR aims to clarify the logic surrounding the `in_bounds` attribute
by:
* making the attribute mandatory (i.e. it is always present),
* always setting the default value to "out of bounds" (that's
consistent with the current behaviour for the most common cases).
#### Broadcast dimensions in tests
As per [2], a broadcast dimension requires the corresponding
`in_bounds` attribute entry to be `true`:
```
vector.transfer_read op requires broadcast dimensions to be in-bounds
```
The changes in this PR mean that we can no longer rely on the
default value in cases like the following (dim 0 is a broadcast dim):
```mlir
%read = vector.transfer_read %A[%base1, %base2], %f, %mask
{permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
memref<?x?xf32>, vector<4x9xf32>
```
Instead, the broadcast dimension has to be explicitly marked as "in
bounds":
```mlir
%read = vector.transfer_read %A[%base1, %base2], %f, %mask
{in_bounds = [true, false], permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
memref<?x?xf32>, vector<4x9xf32>
```
All tests with broadcast dims are updated accordingly.
#### Changes in "SuperVectorize.cpp" and "Vectorization.cpp"
The following patterns in "Vectorization.cpp" are updated to explicitly
set the `in_bounds` attribute to `false`:
* `LinalgCopyVTRForwardingPattern` and `LinalgCopyVTWForwardingPattern`
Also, `vectorizeAffineLoad` (from "SuperVectorize.cpp") and
`vectorizeAsLinalgGeneric` (from "Vectorization.cpp") are updated to
make sure that xfer Ops created by these hooks set the dimension
corresponding to broadcast dims as "in bounds". Otherwise, the Op
verifier would complain.
Note that there is no mechanism to verify whether the corresponding
memory accesses are indeed in bounds. Still, this is consistent with the
current behaviour where the broadcast dim would be implicitly assumed
to be "in bounds".
[1]
4145ad2bac/mlir/include/mlir/Interfaces/VectorInterfaces.td (L243-L246)
[2]
https://mlir.llvm.org/docs/Dialects/Vector/#vectortransfer_read-vectortransferreadop
This adds a new pattern that can legalize a multi-tile transfer_write as
a single store loop. This is done as part of type decomposition, since at
this level we know each tile write is disjoint, but that information is
lost after decomposition (without an analysis to reconstruct it).
Example (pseudo-MLIR):
```
vector.transfer_write %vector, %dest[%y, %x], %mask
: vector<[16]x[8]xi16>, memref<?x?xi16>
```
Is rewritten to:
```
scf.for %slice_idx = %c0 to %c8_vscale step %c1 {
%upper_slice_mask = vector.extract %mask[%slice_idx] ─┐
: vector<[8]xi1> from vector<[16]x[8]xi1> |
%upper_slice = vector.extract %upper_tile[%slice_idx] |- Store upper tile
: vector<[8]xi16> from vector<[8]x[8]xi16> |
vector.transfer_write %upper_slice, |
%dest[%slice_idx + %y, %x], %upper_slice_mask |
: vector<[8]xi16>, memref<?x?xi16> ┘
%lower_slice_idx = %slice_idx + %c8_vscale ─┐
%lower_slice_mask = vector.extract %mask[%lower_slice_idx] |
: vector<[8]xi1> from vector<[16]x[8]xi1> |
%lower_slice = vector.extract %lower_tile[%slice_idx] |- Store lower
: vector<[8]xi16> from vector<[8]x[8]xi16> | tile
vector.transfer_write %lower_slice, |
%dest[%lower_slice_idx + %y, %x], %lower_slice_mask |
: vector<[8]xi16>, memref<?x?xi16> ┘
}
```
This commit adds support for the `gpu.cluster_dim_blocks` and
`gpu.cluster_block_id` Ops to represent the number of blocks per cluster and
the block id inside a cluster, respectively. It also fixes the description of
the `gpu.cluster_dim` Op and updates the `cga_cluster.mlir` test file to use
`gpu.cluster_dim_blocks`.
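A hedged sketch (values are illustrative) of how the new Ops might appear in a clustered kernel, mirroring the existing `gpu.block_id`/`gpu.grid_dim` style:
```mlir
// Number of blocks per cluster along x, and this block's id within its cluster.
%cluster_dim_x = gpu.cluster_dim_blocks x
%cluster_block_x = gpu.cluster_block_id x
```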
Co-authored-by: pradeepku <pradeepku@nvidia.com>
Co-authored-by: Guray Ozen <guray.ozen@gmail.com>
This adds a new option
`-enable-arm-streaming=if-contains-scalable-vectors`, which only applies
the selected streaming/ZA modes if the function contains scalable vector
types.
As an NFC, this patch also removes the `only-` prefix from the
`if-required-by-ops` mode.
The "Emulated" sub-directories under "ArmSVE" and
"ArmSME" have been removed. Associated tests
have been moved up a directory and now include
the "REQUIRES" constraint for the arm-emulator.
To keep the test filenames consistent, this patch:
* removes "test-" from file names (there used to be a mix of
"test-feature-1.mlir" and "feature-2.mlir"),
* replaces "_" with "-" (there used to be a mix of "feature-3.mlir"
and "feature_4.mlir").
Only files under test/Integration/Dialect/Vector/CPU are updated.
This patch implements the lowering of vector.deinterleave
for 1D vectors.
For fixed vector types, the operation is lowered to two
llvm shufflevector operations: one for the even indexed
elements and the other for the odd indexed elements. A poison
operation is used as the second vector operand of the
shufflevector operations.
For scalable vectors, the llvm vector.deinterleave2
intrinsic is used for lowering. The two results are then
extracted from the struct returned by the intrinsic to form
the results of the op.
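For reference, a minimal sketch (types are illustrative) of the op being lowered:
```mlir
// Fixed-width case: lowered to two llvm.shufflevector ops (even/odd indices).
%even, %odd = vector.deinterleave %src : vector<8xi32> -> vector<4xi32>

// Scalable case: lowered via the llvm.vector.deinterleave2 intrinsic.
%even_s, %odd_s = vector.deinterleave %src_s : vector<[8]xi32> -> vector<[4]xi32>
```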
These passes have been deprecated for a long time and replaced by
one-shot bufferization. These passes are also unsafe because they do not
check for read-after-write conflicts.
Relands https://github.com/llvm/llvm-project/pull/93488 which failed on
buildbot. Fixes the failure by updating integration tests to use
one-shot-bufferize instead.
This is to make it more obvious what the result type is, especially
in less trivial cases like 0-d inputs resulting in 1-d results or
interaction with scalable vector types. Note that `vector.deinterleave`
uses the same format with an explicit result type.
Also improve examples and clean up surrounding code.
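A hedged sketch (types are illustrative) of the updated assembly with the explicit result type:
```mlir
// The result type is now spelled out, which also clarifies the 0-d case.
%r = vector.interleave %a, %b : vector<4xf32> -> vector<8xf32>
%s = vector.interleave %c, %d : vector<f32> -> vector<2xf32>
```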
It did not make sense that this said "all tile operations will go
through memory". Only the operations where the warning is emitted will
go through memory. The message has been updated to reflect that.
This patch rewrites the ArmSME tile allocator to use liveness
information to make better tile allocation decisions and improve the
correctness of the ArmSME dialect. The algorithm used here is a linear
scan over live ranges, where live ranges are assigned to tiles as they
appear in the program (chronologically). Live ranges release their
assigned tile ID once the current program point is past their end.
This is a greedy algorithm, mainly to keep the implementation
relatively straightforward, and it seems to be sufficient for
most kernels (e.g. matmuls) that use ArmSME. The general steps of this
are roughly from
https://link.springer.com/content/pdf/10.1007/3-540-45937-5_17.pdf,
though there have been a few simplifications and assumptions made for
our use case.
Hopefully, the only changes needed for a user of the ArmSME dialect
are:
- `-allocate-arm-sme-tiles` will no longer be a standalone pass
- `-test-arm-sme-tile-allocation` is only for unit tests
- `-convert-arm-sme-to-llvm` must happen after `-convert-scf-to-cf`
- SME tile allocation is now part of the LLVM conversion
By integrating this into the `ArmSME -> LLVM` conversion we can allow
high-level (value-based) ArmSME operations to be side-effect-free, as we
can guarantee nothing will rearrange ArmSME operations before we emit
intrinsics (which could invalidate the tile allocation).
The hope is for ArmSME operations to have no hidden state/side effects
and allow easily lowering dialects such as `vector` and `arith` to SME,
without making assumptions about how the input IR looks, as the
semantics of the operations will be the same. That is, no (new) side
effects, and the IR follows the rules of SSA (a value will never change).
The aim is correctness, so we have a base for working on optimizations.
This patch is a first pass at making consistent syntax across the
`LinalgTransformOp`s that use dynamic index lists for size parameters.
Previously, there were two different forms: either inline the types in the
list, or place them in the functional-style tuple. This patch goes for the
latter.
In order to do this, the `printPackedOrDynamicIndexList`,
`printDynamicIndexList` and their `parse` counterparts were modified so
that the types can be optionally provided to the corresponding custom
directives.
All affected ops now use tablegen `assemblyFormat`, so custom
`parse`/`print` functions have been removed. There are a couple of ops that
will likely add dynamic size support, and once that happens, care should be
taken that their assembly remains consistent with the changes in this
patch.
The affected ops are as follows: `pack`, `pack_greedily`,
`tile_using_forall`. The `tile_using_for` and `vectorize` ops already
used this syntax, but their custom assembly was removed.
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
Codegen "vectors" for pos/crd/val use the capacity as memref size, not
the actual used size. Although the sparsifier itself always uses just
the defined pos/crd/val parts, printing these and passing them back to a
runtime environment could benefit from wrapping the basic pos/crd/val
getters into a proper memref view that sets the right size.
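A hedged sketch (names are hypothetical) of the kind of right-sized view this enables, e.g. a subview over just the used prefix of a pos buffer:
```mlir
// %pos has the full capacity; %used is the number of valid entries.
// The view exposes only the used prefix to printing / the runtime environment.
%view = memref.subview %pos[0] [%used] [1]
    : memref<?xindex> to memref<?xindex, strided<[1]>>
```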