This change adds (runtime) bounds checks for `memref` ops using the
existing `RuntimeVerifiableOpInterface`. For `memref.load` and
`memref.store`, we check that the indices are in-bounds of the memref's
index space. For `memref.reinterpret_cast` and `memref.subview`, we check
that the resulting address space is in-bounds of the input memref's
address space.
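For illustration, the check for a 1-D load is conceptually equivalent to the following hand-written IR (sketch only; the exact ops and error message emitted by the pass may differ):
```mlir
// Hand-written equivalent of an in-bounds check for:
//   %v = memref.load %m[%i] : memref<?xf32>
%c0 = arith.constant 0 : index
%dim = memref.dim %m, %c0 : memref<?xf32>
%ge = arith.cmpi sge, %i, %c0 : index
%lt = arith.cmpi slt, %i, %dim : index
%ok = arith.andi %ge, %lt : i1
cf.assert %ok, "memref.load index out of bounds"
%v = memref.load %m[%i] : memref<?xf32>
```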
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, the names were picked to mostly mirror those of the LLVM
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear whether `<maxf>` had `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell or look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
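For example, with the new spelling of the combining kinds (illustrative snippet):
```mlir
// The kind now names the matching arith op, so the semantics are unambiguous:
%a = vector.reduction <maximumf>, %v : vector<4xf32> into f32  // arith.maximumf semantics
%b = vector.reduction <maxnumf>, %v : vector<4xf32> into f32   // arith.maxnumf semantics
```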
Issue: https://github.com/llvm/llvm-project/issues/72354
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.
This PR updates the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it into `InitAllPasses.h`. The aim is to make it callable from
Python while also providing a standardized compilation process for NVVM.
This moves the fix out of the IR and into the pass description, which
seems nicer. It also works as an integration test for the
`only-if-required-by-ops` flag :)
Note that tensor.empty may feed into a SPARSE output (meaning it truly has
no values yet), but for a DENSE output, it should always have an initial
value. We ran a verifier over all our tests and this is the only
remaining omission.
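For illustration (the `#SparseVector` encoding below is just an example):
```mlir
#SparseVector = #sparse_tensor.encoding<{ map = (d0) -> (d0 : compressed) }>

// SPARSE output: tensor.empty is fine, the tensor truly has no values yet.
%s = tensor.empty() : tensor<64xf32, #SparseVector>

// DENSE output: should be given an initial value, e.g. via linalg.fill.
%cst = arith.constant 0.0 : f32
%e = tensor.empty() : tensor<64xf32>
%d = linalg.fill ins(%cst : f32) outs(%e : tensor<64xf32>) -> tensor<64xf32>
```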
At the moment the logic to tile and vectorize `linalg.matmul` is
duplicated in multiple test files:
* matmul.mlir
* matmul_mixed_ty.mlir
Instead, this patch uses `transform.foreach` to apply the same sequence
to multiple functions within the same test file (e.g. `matmul_f32` and
`matmul_mixed_ty` as defined in the original files). This allows us to
merge relevant test files.
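The shared sequence has roughly the following shape (sketch only; the actual tiling and vectorization transforms are the ones already used in the original test files):
```mlir
transform.sequence failures(propagate) {
^bb0(%module: !transform.any_op):
  %funcs = transform.structured.match ops{["func.func"]} in %module
    : (!transform.any_op) -> !transform.any_op
  transform.foreach %funcs : !transform.any_op {
  ^bb1(%func: !transform.any_op):
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %func
      : (!transform.any_op) -> !transform.any_op
    // ... tile and vectorize %matmul, same sequence for every function ...
    transform.yield
  }
  transform.yield
}
```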
The current lowering of tosa.fully_connected produces a linalg.matmul
followed by a linalg.generic to add the bias. The IR looks like the
following:
```mlir
%init = tensor.empty()
%zero = linalg.fill ins(0 : f32) outs(%init)
%prod = linalg.matmul ins(%A, %B) outs(%zero)
// Add the bias
%initB = tensor.empty()
%result = linalg.generic ins(%prod, %bias) outs(%initB) {
  // add bias and product
}
```
This has two downsides:
1. The tensor.empty operations typically result in additional
allocations after bufferization
2. There is a redundant traversal of the data to add the bias to the
matrix product.
This extra work can be avoided by leveraging the out-param of
linalg.matmul. The new IR sequence is:
```mlir
%init = tensor.empty()
%broadcast = linalg.broadcast ins(%bias) outs(%init)
%prod = linalg.matmul ins(%A, %B) outs(%broadcast)
```
In my experiments, this eliminates one loop and one allocation (post
bufferization) from the generated code.
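For concreteness, a fully-typed version of the new sequence might look roughly like this (shapes chosen arbitrarily for illustration):
```mlir
%init = tensor.empty() : tensor<2x4xf32>
%broadcast = linalg.broadcast
    ins(%bias : tensor<4xf32>)
    outs(%init : tensor<2x4xf32>)
    dimensions = [0]
%prod = linalg.matmul
    ins(%A, %B : tensor<2x3xf32>, tensor<3x4xf32>)
    outs(%broadcast : tensor<2x4xf32>) -> tensor<2x4xf32>
```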
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:
* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so is required to be
a compile-time constant); `immargs` should be mapped to MLIR attributes
(as is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and to passes such as `-cse` performing incorrect optimizations)
As part of this patch we bid farewell to the following operations:
```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```
These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```
The new tile allocation works via operations implementing the
`ArmSMETileOpInterface`. This interface indicates that an operation needs
to be assigned a tile ID, and may conditionally allocate a new SME tile.
Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc.).
Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).
Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```
Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.
Once tile IDs have been allocated, subsequent rewrites can forward them
to any newly created operations.
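As a rough illustration (the exact attribute spelling is an implementation detail and may differ):
```mlir
// After tile allocation, each ArmSME tile op carries its assigned tile ID:
%tile = arm_sme.get_tile {tile_id = 0 : i32} : vector<[4]x[4]xi32>
%zero = arm_sme.zero {tile_id = 1 : i32} : vector<[4]x[4]xi32>
```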
Apart from the test itself, this patch also updates a few patterns to
fix how new VectorType(s) are created. Namely, it makes sure that
"scalability" is correctly propagated.
Regression tests will be updated separately while auditing Vector
dialect tests in the context of scalable vectors:
* https://github.com/orgs/llvm/projects/23
The "Dim" prefix is a legacy left-over that no longer makes sense, since
we have a very strict "Dimension" vs. "Level" definition for sparse
tensor types and their storage.
This patch adds an integration test lowering a linalg.matmul to SME via
vector.outerproduct.
It's similar to the linalg.matmul_transpose_a e2e test added recently,
and relies on TransferReadDropUnitDimsPattern, as well as vector
transpose canonicalizations, to lower the following sequence (taken from
the inner loop):
```mlir
%subview = memref.subview %arg0[%arg3, %arg5] [%2, 1] [1, 1] :
memref<?x?xf32, strided<[?, ?], offset: ?>> to memref<?x1xf32, strided<[?, ?], offset: ?>>
%mask = vector.create_mask %2, %c1 : vector<[4]x1xi1>
%0 = vector.transfer_read %subview[%c0, %c0], %pad, %mask {in_bounds = [true, true]} :
memref<?x1xf32, strided<[?, ?], offset: ?>>, vector<[4]x1xf32>
%1 = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
%3 = vector.extract %1[0] : vector<[4]xf32> from vector<1x[4]xf32>
```
Rank-2 vectors with a leading scalable dim can't be type converted to an
array. TransferReadDropUnitDimsPattern drops the unit dim on the
vector.transfer_read so it can be lowered via the generic path (to SVE).
The transpose canonicalizations lower the transpose to a shape_cast
which folds away.
This gives more flexibility over when these lowerings are performed,
without also lowering unrelated vector ops.
This is an NFC (other than adding the new `-convert-arm-sme-to-llvm` pass).
Note, this is a redo of https://github.com/llvm/llvm-project/pull/72712
which was reverted due to timeouts on the bot. I have timed the tests
with various settings, and they do not even hit the top 20 of integration
tests. To be safe, I removed the SIMD version of the tests, just keeping
the libgen/direct IR paths (which are the most important for us to test).
I will also keep an eye on
https://lab.llvm.org/buildbot/#/builders/264/builds after submitting to
make sure there is no repeat.
I always enjoy a good stress test. This end-to-end integration test
ensures that the major ordering of both the blocks and the elements
within each block is correctly handled (giving row-row, row-col,
col-row, and col-col as options).
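For reference, two of the four combinations could be expressed with encodings along these lines (2x2 blocks assumed; the test's exact encodings may differ):
```mlir
#BSR_row_row = #sparse_tensor.encoding<{
  map = (i, j) -> (i floordiv 2 : dense, j floordiv 2 : compressed,
                   i mod 2 : dense, j mod 2 : dense)
}>
#BSR_row_col = #sparse_tensor.encoding<{
  map = (i, j) -> (i floordiv 2 : dense, j floordiv 2 : compressed,
                   j mod 2 : dense, i mod 2 : dense)
}>
```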
This change implements the correct *level* size setup for the direct IR
codegen fields in the sparse storage scheme. This brings libgen and
codegen together again.
This is step 3 out of 3 in making sparse_tensor.new work for BSR.
The previous change no longer properly used the GPU libgen pass (even
though most tests still passed by falling back to the CPU). This revision
puts the proper pass order into place. It also includes a bit of cleanup
of the CPU codegen vs. libgen setup.
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).
Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.
Instead, this patch switches to using the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.
A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
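As a hypothetical sketch of what this looks like at the function level (the exact attribute placement may differ from this spelling):
```mlir
// ZA is enabled via the backend attribute; no za.enable/disable
// intrinsics are inserted at the MLIR level.
func.func @za_kernel() attributes {arm_new_za} {
  // ... ArmSME ops using ZA ...
  return
}
```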
Note that the (dis)assemble operations still make some simplifying
assumptions (e.g. trailing 2-D COO in AoS format) but now at least both
the direct IR and support library path behave exactly the same.
Generalizing the ops is still TBD.
This patch adds an integration test demonstrating the first e2e example
lowering a linalg.matmul to SME via vector.outerproduct.
The test uses a 'linalg.matmul_transpose_a' rather than 'linalg.matmul',
since the latter emits a 'vector.transfer_read' with a vector type of
'vector<[4]x1xf32>' that currently can't be lowered via the generic (SVE)
path, as it has a leading scalable dim.
The flag seems to be doing practically the same thing for zero-cost and
pinned DMA. In addition, according to Thomas, the host-register path is
not truly the right zero-cost mechanism. So we are simplifying the setup
for now, until we have a better definition of what to implement and test.
https://github.com/llvm/llvm-project/issues/64316
This patch extends ArmSMEToSCF to support lowering of masked tile_load
ops. Only masks created by 'vector.create_mask' are currently supported.
There are two lowerings depending on the pad.
For a pad of constant zero, the tile is first zeroed, then only the
active rows are loaded.
For a non-zero pad, the scalar pad is broadcast to a 1-D vector and a
regular 'vector.masked_load' (which will be lowered to SVE, not SME)
loads each slice, with the padding specified as a passthru and the 2-D
mask combined into a 1-D mask. The resulting slice is then inserted into
the tile with 'arm_sme.move_vector_to_tile_slice'.
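For reference, the kind of masked load this handles looks roughly like (types and values illustrative):
```mlir
// Masked load of an SME tile, with %pad as the padding value:
%mask = vector.create_mask %rows, %cols : vector<[4]x[4]xi1>
%tile = arm_sme.tile_load %src[%c0, %c0], %pad, %mask
          : memref<?x?xf32>, vector<[4]x[4]xf32>
```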
When the Powers That Be decided that the name "sparse compiler" should
be changed to "sparsifier", we neglected to change some of the comments
in the code; this pull request completes the name change.
The string symbols were replaced with 'vector.print str' calls in
061d978043 (#68973) but the addressof ops weren't removed. This was
missed as the test is currently XFAIL'ed.
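For context, the leftover pattern being removed is roughly (names illustrative):
```mlir
// Leftover from before the switch (now removed):
%ptr = llvm.mlir.addressof @str_result : !llvm.ptr
// Strings are printed directly instead:
vector.print str "result: "
```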