Transform interfaces are implemented, directly or via extensions, in
libraries belonging to multiple other dialects. Those dialects don't
need to depend on the non-interface part of the transform dialect, which
includes the growing number of ops and their transitive dependency
footprint. Split the interfaces out into a separate library. This in
turn requires flipping the dependency of the interfaces on the dialect,
which had crept in because both co-existed in one library; the
interfaces shouldn't depend on the transform dialect either.
As a consequence of splitting, the capability of the interpreter to
automatically walk the payload IR to identify payload ops of a certain
kind based on the type used for the entry point symbol argument is
disabled. This is a good move by itself as it simplifies the interpreter
logic. This functionality can be trivially replaced by a
`transform.structured.match` operation.
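For illustration, the explicit match looks roughly like this (a sketch;
`%root` stands for the entry point's payload argument and `func.func` is
just an example payload op name):
```mlir
%funcs = transform.structured.match ops{["func.func"]} in %root
  : (!transform.any_op) -> !transform.any_op
```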
Since `vector.print str` provides no punctuation control, it is
slightly more flexible to let the client of this operation decide
whether there should be a trailing newline. This allows printing
like
vector.print str "nse = "
vector.print %nse : index
as
nse = 42
The ArmSME compilation pipeline has evolved significantly and is now
sufficiently complex that it warrants a proper lowering pipeline
that encapsulates the various passes and orderings. Currently the
pipeline is loosely defined in our integration tests, but these have
diverged and are not using the same passes or ordering everywhere.
This patch introduces a test-lower-to-arm-sme pipeline, mirroring
test-lower-to-llvm, that provides some sanity when running e2e examples
and can be used as a reference for targeting ArmSME in MLIR.
All the integration tests are updated to use this pipeline. The
intention is to productize the pipeline once it becomes more mature.
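As a rough sketch (not the exact RUN lines used by the tests, which also
pass shared runtime libraries and emulator-specific flags), an e2e test
can now be driven as:
```
// RUN: mlir-opt %s -test-lower-to-arm-sme -test-lower-to-llvm | \
// RUN:   mlir-cpu-runner -e entry -entry-point-result=void
```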
This patch introduces support for 4-way widening outer products. This
enables the fusion of 4 'arm_sme.outerproduct' operations that are
chained via the accumulator into single widened operations.
Changes:
- Adds the following operations:
  - smopa_4way, smops_4way
  - umopa_4way, umops_4way
  - sumopa_4way, sumops_4way
  - usmopa_4way, usmops_4way
- Implements conversions for the above ops to intrinsics in ArmSMEToLLVM.
- Extends the 'arm-sme-outer-product-fusion' pass.
For a detailed description of these operations see the
'arm_sme.smopa_4way' description.
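For reference, the chain being fused looks roughly like this (a sketch
of the signed i8 -> i32 case; value names are illustrative): four
'arm_sme.outerproduct' ops over sign-extended operands, linked through
the accumulator, which the fusion pass can replace with a single
'arm_sme.smopa_4way' on the original i8 operands.
```mlir
%a0_ext = arith.extsi %a0 : vector<[4]xi8> to vector<[4]xi32>
%b0_ext = arith.extsi %b0 : vector<[4]xi8> to vector<[4]xi32>
%a1_ext = arith.extsi %a1 : vector<[4]xi8> to vector<[4]xi32>
%b1_ext = arith.extsi %b1 : vector<[4]xi8> to vector<[4]xi32>
%a2_ext = arith.extsi %a2 : vector<[4]xi8> to vector<[4]xi32>
%b2_ext = arith.extsi %b2 : vector<[4]xi8> to vector<[4]xi32>
%a3_ext = arith.extsi %a3 : vector<[4]xi8> to vector<[4]xi32>
%b3_ext = arith.extsi %b3 : vector<[4]xi8> to vector<[4]xi32>
%0 = arm_sme.outerproduct %a0_ext, %b0_ext : vector<[4]xi32>, vector<[4]xi32>
%1 = arm_sme.outerproduct %a1_ext, %b1_ext acc(%0) : vector<[4]xi32>, vector<[4]xi32>
%2 = arm_sme.outerproduct %a2_ext, %b2_ext acc(%1) : vector<[4]xi32>, vector<[4]xi32>
%3 = arm_sme.outerproduct %a3_ext, %b3_ext acc(%2) : vector<[4]xi32>, vector<[4]xi32>
```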
This adds a new pass (`-arm-sme-vector-legalization`) which legalizes
vector operations so that they can be lowered to ArmSME. This initial
patch adds decomposition for `vector.outerproduct`,
`vector.transfer_read`, and `vector.transfer_write` when they operate on
vector types larger than a single SME tile. For example, a [8]x[8]xf32
outer product would be decomposed into four [4]x[4]xf32 outer products,
which could then be lowered to ArmSME. These three ops have been picked
as supporting them alone allows lowering matmuls that use all ZA
accumulators to ArmSME.
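As a sketch of the decomposition (value names are illustrative):
```mlir
// A single outer product on a type four times the size of one SME tile.
%0 = vector.outerproduct %lhs, %rhs, %acc : vector<[8]xf32>, vector<[8]xf32>
// After -arm-sme-vector-legalization this becomes four vector<[4]x[4]xf32>
// outer products, one per [4]x[4] quadrant of the original result, each
// operating on the corresponding [4]-element halves of %lhs and %rhs.
```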
For a vector type to be legalizable, its shape has to be a multiple
of an SME tile size, but other than that any shape can be used. E.g.
`vector<[8]x[8]xf32>`, `vector<[4]x[16]xf32>`, `vector<[16]x[4]xf32>`
can all be lowered to four `vector<[4]x[4]xf32>` operations.
In future, this pass will be extended with more SME-specific rewrites to
legalize unrolling the reduction dimension of matmuls (which is not
type-decomposition), which is why the pass has quite a general name.
This patch introduces support for 2-way widening outer products. This
enables the fusion of 2 'arm_sme.outerproduct' operations that are
chained via the accumulator into a 2-way widening outer product
operation.
Changes:
- Adds 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for the 2-way
  variants. These map to instruction variants added in SME2 and use
  different intrinsics from the SME1 widening variants, which are
  already implemented.
- Adds the following operations:
  - fmopa_2way, fmops_2way
  - smopa_2way, smops_2way
  - umopa_2way, umops_2way
- Implements conversions for the above ops to intrinsics in
  ArmSMEToLLVM.
- Adds a pass 'arm-sme-outer-product-fusion' that fuses
  'arm_sme.outerproduct' operations.
For a detailed description of these operations see the
'arm_sme.fmopa_2way' description.
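As a sketch of the pattern being matched (the f16 -> f32 case; value
names are illustrative), two 'arm_sme.outerproduct' ops over extended
operands chained via the accumulator can be replaced with a single
'arm_sme.fmopa_2way' on the original f16 operands:
```mlir
%a0_ext = arith.extf %a0 : vector<[4]xf16> to vector<[4]xf32>
%b0_ext = arith.extf %b0 : vector<[4]xf16> to vector<[4]xf32>
%a1_ext = arith.extf %a1 : vector<[4]xf16> to vector<[4]xf32>
%b1_ext = arith.extf %b1 : vector<[4]xf16> to vector<[4]xf32>
%0 = arm_sme.outerproduct %a0_ext, %b0_ext : vector<[4]xf32>, vector<[4]xf32>
%1 = arm_sme.outerproduct %a1_ext, %b1_ext acc(%0) : vector<[4]xf32>, vector<[4]xf32>
```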
The reason for introducing many operations rather than one is that the
signed/unsigned variants can't be distinguished by types (e.g., ui16,
si16), since 'arith.extui' and 'arith.extsi' only support signless
integers. A single operation would require this information, and an
attribute (for example) for the sign doesn't feel right when
floating-point types are also supported, where this wouldn't apply.
Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32)
introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but
no subtract variant. Whilst these are not supported in this patch, it
felt simpler to have separate ops for add/subtract given this.
This adds a new `mlir_arm_runner_utils` library that contains utils
specific to Arm/AArch64. This is for use in MLIR integration tests.
This initial patch adds `setArmVLBits()` and `setArmSVLBits()`, which
allow changing the vector length or streaming vector length at runtime
(or setting it to a known minimum, i.e. 128 bits).
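Usage from a test looks roughly like this (a sketch; it assumes the
function takes a 32-bit integer and is declared as an external func):
```mlir
func.func private @setArmVLBits(i32)

func.func @entry() {
  %c256 = arith.constant 256 : i32
  func.call @setArmVLBits(%c256) : (i32) -> ()
  // The rest of the test now runs with a 256-bit SVE vector length.
  return
}
```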
The existing 'arith::ConstantOp' conversion and tests are moved from
VectorToArmSME. Only a single op is converted at the moment, but this
will grow in the future as things like in-tile add
are implemented. Also, 'createLoopOverTileSlices' is moved to ArmSME
utils since it's relevant for both conversions.
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
Issue: https://github.com/llvm/llvm-project/issues/72354
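For example (assuming `maxnumf` is one of the renamed kinds), the
semantics can now be read directly off the op:
```mlir
// Combines elements with 'arith.maxnumf' (not 'arith.maximumf') semantics.
%max = vector.reduction <maxnumf>, %v : vector<4xf32> into f32
```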
This moves the fix out of the IR and into the pass description, which
seems nicer. It also works as an integration test for the
`only-if-required-by-ops` flag :)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:
* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so it is required to
  be a compile-time constant); `immargs` should be mapped to MLIR
  attributes (this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)
As part of this patch we bid farewell to the following operations:
```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```
These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```
The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.
Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).
Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).
Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```
Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.
Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
This gives more flexibility in when these lowerings are performed,
without also lowering unrelated vector ops.
This is an NFC (other than adding a new `-convert-arm-sme-to-llvm` pass).
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).
Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.
Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.
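As a sketch (assuming the unit attribute is attached directly to the
`func.func`, as with the streaming-mode attributes; the exact attachment
may differ):
```mlir
// ZA is enabled by tagging the function, not by inserting za.enable/disable
// intrinsics around its body.
func.func @za_kernel() attributes { arm_new_za } {
  // ... operations that use ZA ...
  return
}
```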
A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
This patch extends ArmSMEToSCF to support lowering of masked tile_load
ops. Only masks created by 'vector.create_mask' are currently supported.
There are two lowerings depending on the pad.
For pad of constant zero, the tile is first zeroed, then only active
rows are loaded.
For a non-zero pad, the scalar pad is broadcast to a 1-D vector and a
regular 'vector.masked_load' (which will be lowered to SVE, not SME) loads
each slice, with padding specified as a passthru and the 2-D mask
combined into a 1-D mask. The resulting slice is then inserted into the
tile with 'arm_sme.move_vector_to_tile_slice'.
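For reference, the kind of op this lowering handles looks roughly like
this (a sketch; `%mem`, `%c0`, `%rows` and `%cols` are assumed to be
defined earlier):
```mlir
%pad = arith.constant 0.0 : f32
%mask = vector.create_mask %rows, %cols : vector<[4]x[4]xi1>
%tile = arm_sme.tile_load %mem[%c0, %c0], %pad, %mask
  : memref<?x?xf32>, vector<[4]x[4]xf32>
```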
The string symbols were replaced with 'vector.print str' calls in
061d978043 (#68973) but the addressof ops weren't removed. This was
missed as the test is currently XFAIL'ed.
This patch adds support for lowering masked outer products to SME. This
is done in two stages. First, vector.outerproducts (both masked and
non-masked) are rewritten to arm_sme.outerproducts. The
arm_sme.outerproduct op is close to vector.outerproduct, but supports
masking on the operands rather than the result. It also limits the cases
it handles to things that could be (directly) lowered to SME.
This currently requires that the source of the mask is a
vector.create_mask op. E.g.:
```mlir
%mask = vector.create_mask %dimA, %dimB : vector<[4]x[4]xi1>
%result = vector.mask %mask {
vector.outerproduct %vecA, %vecB
: vector<[4]xf32>, vector<[4]xf32>
} : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32>
```
This is rewritten to:
```mlir
%maskA = vector.create_mask %dimA : vector<[4]xi1>
%maskB = vector.create_mask %dimB : vector<[4]xi1>
%result = arm_sme.outerproduct %vecA, %vecB masks(%maskA, %maskB)
: vector<[4]xf32>, vector<[4]xf32>
```
(The same rewrite works for non-masked vector.outerproducts too)
The arm_sme.outerproduct can then be directly lowered to SME intrinsics.
Adds basic integration tests for `vector.contract` for the dot product
and matvec operations. These tests exercise scalable vectors.
Depends on https://github.com/llvm/llvm-project/pull/69845
… emulation
Currently the expected CHECK values are not correct for
`fcst_maskedload` from
mlir/test/Integration/Dialect/Vector/CPU/test-rewrite-narrow-types.mlir
Adds an end-to-end test for `vector.contract` that targets SVE (i.e.
scalable vectors). Note that this requires lifting the restriction on
`vector.outerproduct` (to which `vector.contract` is lowered) that
would deem the following invalid by the op verifier:
```
vector.outerproduct %27, %28, %26 {kind = #vector.kind<add>} : vector<3xf32>, vector<[2]xf32>
```
This is indeed valid as the end-to-end test demonstrates (at least when
compiling for SVE).
This patch adds a pass that ensures that loads, stores, and allocations
of SVE vector types will be legal in the LLVM backend. It does this at
the memref level, so this pass must be applied before lowering all the
way to LLVM.
This pass currently fixes two issues.
## Loading and storing predicate types
It is only legal to load/store predicate types equal to (or greater
than) a full predicate register, which in MLIR is `vector<[16]xi1>`.
Smaller predicate types (`vector<[1|2|4|8]xi1>`) must be converted
to/from a full predicate type (referred to as a `svbool`) before and
after storing and loading respectively. This pass does this by widening
allocations and inserting conversion intrinsics.
For example:
```mlir
%alloca = memref.alloca() : memref<vector<[4]xi1>>
%mask = vector.constant_mask [4] : vector<[4]xi1>
memref.store %mask, %alloca[] : memref<vector<[4]xi1>>
%reload = memref.load %alloca[] : memref<vector<[4]xi1>>
```
Becomes:
```mlir
%alloca = memref.alloca() {alignment = 1 : i64} : memref<vector<[16]xi1>>
%mask = vector.constant_mask [4] : vector<[4]xi1>
%svbool = arm_sve.convert_to_svbool %mask : vector<[4]xi1>
memref.store %svbool, %alloca[] : memref<vector<[16]xi1>>
%reload_svbool = memref.load %alloca[] : memref<vector<[16]xi1>>
%reload = arm_sve.convert_from_svbool %reload_svbool : vector<[4]xi1>
```
## Relax alignments for SVE vector allocas
The storage for SVE vector types only needs to have an alignment that
matches the element type (for example 4 byte alignment for `f32`s).
However, the LLVM backend currently defaults to aligning to `base size x
element size` bytes. For non-legal vector types like `vector<[8]xf32>`
this results in 8 x 4 = 32-byte alignment, but the backend only supports
up to 16-byte alignment for SVE vectors on the stack. Explicitly setting
a smaller alignment prevents this issue.
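A sketch of the result, assuming the pass expresses this via the
alloca's alignment attribute:
```mlir
// Element-aligned (4 bytes for f32) instead of the backend's 32-byte default.
%alloca = memref.alloca() {alignment = 4 : i64} : memref<vector<[8]xf32>>
```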
Depends on: #68586 and #68695 (for testing)
Update most tests to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical, rewriting an unnamed sequence into
a named one and wrapping the transform IR into a module when necessary.
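Roughly, the rewrite looks like this (a sketch with the transform ops
elided):
```mlir
// Before: unnamed top-level sequence run by test-transform-dialect-interpreter.
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  // ... transform ops ...
  transform.yield
}
// After: named entry point, wrapped in a module, run by transform-interpreter.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %arg0: !transform.any_op {transform.readonly}) {
    // ... transform ops ...
    transform.yield
  }
}
```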
Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.
Only the tests in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of testing coverage of the transform interpreter logic.
Printing strings within integration tests is currently quite annoyingly
verbose, and can't be tucked into shared helpers as the types depend on
the length of the string:
```
llvm.mlir.global internal constant @hello_world("Hello, World!\0")
func.func @entry() {
%0 = llvm.mlir.addressof @hello_world : !llvm.ptr<array<14 x i8>>
%1 = llvm.mlir.constant(0 : index) : i64
%2 = llvm.getelementptr %0[%1, %1]
: (!llvm.ptr<array<14 x i8>>, i64, i64) -> !llvm.ptr<i8>
llvm.call @printCString(%2) : (!llvm.ptr<i8>) -> ()
return
}
```
So this patch adds a simple extension to `vector.print` to simplify
this:
```
func.func @entry() {
// Print a vector of characters ;)
vector.print str "Hello, World!"
return
}
```
Most of the logic for this is now shared with `cf.assert` which already
does something similar.
Depends on #68694
This patch prefixes tile slice layout with `layout` in the
assemblyFormat:
- `<vertical>` -> `layout<vertical>`
- `<horizontal>` -> `layout<horizontal>`
The reason for this change is that the current format doesn't play nicely
with additional optional operands, required to support padding and
masking (#69148), as it becomes ambiguous.
This affects the following ops:
- arm_sme.tile_load
- arm_sme.tile_store
- arm_sme.load_tile_slice
- arm_sme.store_tile_slice
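For example, on arm_sme.tile_load the vertical layout is now spelled as
follows (a sketch of the new format):
```mlir
%tile = arm_sme.tile_load %src[%c0, %c0] layout<vertical>
  : memref<?x?xi32>, vector<[4]x[4]xi32>
```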
The vector.extract assembly format currently only contains the source
type, for example:
%1 = vector.extract %0[1] : vector<3x7x8xf32>
It's not immediately obvious whether this is the source or the result
type. This patch improves the assembly format to make this clearer, so
the above becomes:
%1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>
This adds a custom lowering for SME that loops over each row of the
tile, extracting it via an SME MOVA, then printing with a normal 1D
vector.print.
This makes writing SME integration tests easier and less verbose.
Depends on: #66910, #66911
This patch adds support for lowering vector.transpose to ArmSME. It's
implemented by storing the input tile of the transpose to memory and
reloading it vertically, building on top of the tile slice layout
support. Transposing via memory is obviously expensive; the current
intention is to avoid the transpose if possible, so this is intended as
a fallback and to provide base support for Vector ops. If it turns out
transposes can't be avoided then this should be replaced with a more
optimal implementation, perhaps with tile <-> vector (MOVA) ops.
Depends on https://github.com/llvm/llvm-project/pull/66758.
In SME a ZA tile slice is a one-dimensional set of horizontally or
vertically contiguous elements within a ZA tile. Currently the load and
store ops only support horizontal tile slices. This patch adds a tile
slice layout attribute to the load and store ops to support both
horizontal and vertical tile slices.
When lowering from the Vector dialect, the horizontal layout is the default.
…cast) expansion
This revision adds a rewrite for sequences of vector `ext(bitcast)` to
use a more efficient sequence of vector operations comprising `shuffle`
and `bitwise` ops.
Such patterns appear naturally when writing quantization /
dequantization functionality with the vector dialect.
The rewrite performs a simple enumeration of each of the bits in the
result vector and determines its provenance in the source vector. The
enumeration is used to generate the proper sequence of `shuffle`,
`andi` and `ori` with shifts.
The rewrite currently only applies to 1-D non-scalable vectors and bails
out if the bitwidth of the final vector element type is not a multiple
of 8. This is a
failsafe heuristic determined empirically: if the resulting type is not
an even number of bytes, further complexities arise that are not
improved by this pattern: the heavy lifting still needs to be done by
LLVM.
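For example (a sketch of an assumed i4 unpacking case), the rewrite
targets sequences such as:
```mlir
%unpacked = vector.bitcast %packed : vector<4xi8> to vector<8xi4>
%extended = arith.extui %unpacked : vector<8xi4> to vector<8xi32>
```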
…(trunci) expansion
This revision adds a rewrite for sequences of vector `bitcast(trunci)`
to use a more efficient sequence of vector operations comprising
`shuffle` and `bitwise` ops.
Such patterns appear naturally when writing quantization /
dequantization functionality with the vector dialect.
The rewrite performs a simple enumeration of each of the bits in the
result vector and determines its provenance in the pre-trunci vector.
The enumeration is used to generate the proper sequence of `shuffle`,
`andi`, `ori` followed by an optional final `trunci`/`extui`.
The rewrite currently only applies to 1-D non-scalable vectors and bails
out if the bitwidth of the final vector element type is not a multiple
of 8. This is a
failsafe heuristic determined empirically: if the resulting type is not
an even number of bytes, further complexities arise that are not
improved by this pattern: the heavy lifting still needs to be done by
LLVM.
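For example (a sketch of the mirrored i4 packing case), the rewrite
targets sequences such as:
```mlir
%truncated = arith.trunci %input : vector<8xi32> to vector<8xi4>
%packed = vector.bitcast %truncated : vector<8xi4> to vector<4xi8>
```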
This patch adds support for lowering vector.outerproduct to the ArmSME
MOPA intrinsic for the following types:
vector<[8]xf16>, vector<[8]xf16> -> vector<[8]x[8]xf16>
vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16>
vector<[4]xf32>, vector<[4]xf32> -> vector<[4]x[4]xf32>
vector<[2]xf64>, vector<[2]xf64> -> vector<[2]x[2]xf64>
The FP variants are lowered to FMOPA (non-widening) [1] and the BFloat
variant to BFMOPA (non-widening) [2].
Note that at the ISA level these variants are implemented by different
architecture features; these are listed below:
FMOPA (non-widening)
* half-precision - +sme2p1,+sme-f16f16
* single-precision - +sme
* double-precision - +sme-f64f64
BFMOPA (non-widening)
* half-precision - +sme2p1,+b16b16
There's currently no way to target different features when lowering to
ArmSME. Integration tests are added for F32 and F64. We use QEMU to run
the integration tests, but SME2 support isn't available yet (it's
targeted for QEMU 9.0), so integration tests for these variants are
excluded. Masking is currently unsupported.
Depends on #65450.
[1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-
[2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-
There are two motivations for this change:
1. It considerably simplifies adding support for the realloc operation to the
new buffer deallocation pass by lowering the realloc such that no
deallocation operation is inserted and the deallocation pass itself can
insert that dealloc
2. The lowering is expressed on a higher level and thus easier to understand,
and the lowerings of the memref operations it is composed of don't have to
be duplicated in the MemRefToLLVM lowering (also see discussion in
https://reviews.llvm.org/D133424)
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D159430
This adds a 'move_vector_to_tile_slice' op to the ArmSME dialect that
moves a 1-D scalable vector to a slice of a 2-D tile at a given index.
This is lowered to the 'llvm.aarch64.sme.write.horiz' intrinsic that
maps to the MOVA (vector to tile, single) SME instruction [1] when
lowering to LLVM. Like the SME load and store instructions this operates
on ZA tile slices, which are 1D vectors of horizontally or vertically
contiguous elements within a ZA tile.
This patch extends the lowering of 'arith.constant' to SME to support
non-zero constants using this new op. This requires materializing a
loop that broadcasts the constant to each tile slice with the
'move_vector_to_tile_slice' op.
conversion from Vector to ArmSME, rather than ArmSME to SCF. The latter
would require a higher-level custom op in the ArmSME dialect like
'tile_load' and 'tile_store' and this isn't necessary. We may also
remove the load and store ops in the future in favour of lowering
straight from Vector, at which point this would converge.
Currently only horizontal tile slices are supported. A future patch will
extend this mechanism to support 'vector.broadcast'.
Depends on D156980 D157004
[1] https://developer.arm.com/documentation/ddi0602
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D157005
For consistency with other tests and to simplify the `RUN` lines, switch
to using `mlir-cpu-runner` instead of `lli` in integration tests
targeting SSVE and SME.
Differential Revision: https://reviews.llvm.org/D158719
This follows from D155306.
Loads and stores of 128-bit tiles have been confirmed to work in the
`load-store-128-bit-tile.mlir` integration test. However, there is
currently a bug in QEMU (see: https://gitlab.com/qemu-project/qemu/-/issues/1833)
which means this test produces incorrect results (a patch for this issue
is available but not yet in any released version of QEMU). Until a
fixed version of QEMU is available the integration test is expected to fail.
Reviewed By: c-rhodes, awarzynski
Differential Revision: https://reviews.llvm.org/D158418
This replaces the manual print loops with `vector.print` which now supports
scalable vectors.
Reviewed By: awarzynski
Differential Revision: https://reviews.llvm.org/D157978
Reland of the original patch after updating the Python binding tests,
a few CUDA/GPU MLIR tests, and ensuring the assembly format is
round-trippable.
This patch splits the lowering of vector.print into first converting
an n-D print into a loop of scalar prints of the elements, then a second
pass that converts those scalar prints into the runtime calls. The
former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example:
vector.print punctuation <comma>
lowers to
llvm.call @printComma() : () -> ()
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519