Commit Graph

202 Commits

Author SHA1 Message Date
Andrzej Warzyński
56d6b56739 [mlir][vector] Relax the requirements on broadcast dims (#99341)
NOTE: This is a follow-up for #97049 in which the `in_bounds` attribute
was made mandatory.

This PR updates the semantics of the `in_bounds` attribute so that
broadcast dimensions are no longer required to be "in bounds".
Specifically, these xfer_read/xfer_write Ops become valid after this
change:

```mlir
  %read = vector.transfer_read %A[%base1, %base2], %pad
      {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
      : memref<?x?xf32>, vector<9xf32>

  vector.transfer_write %vec, %A[%base1, %base2]
      {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
      : vector<9xf32>, memref<?x?xf32>
```

Note that the value `false` merely means "may run out-of-bounds", i.e.,
the corresponding access can still be "in bounds". In fact, the folder
for xfer Ops is also updated (*) and will update the attribute value
corresponding to broadcast dims to `true` if all non-broadcast dims
are marked as "in bounds". 

Note that this PR doesn't change any of the lowerings. The changes in
"SuperVectorize.cpp", "Vectorization.cpp" and "AffineMap.cpp" are simple
reverts of recent changes in #97049. Those were only meant to facilitate
making `in_bounds` mandatory and to work around the extra requirements
for broadcast dims (those requirements were removed in this PR). All
changes in tests are also reverts of changes from #97049.

For context, here's a PR in which "broadcast" dims were forced to
always be "in-bounds":
  * https://reviews.llvm.org/D102566

(*) See `foldTransferInBoundsAttribute`.
2024-10-04 07:41:20 +01:00
Benjamin Maxwell
833ce5d27b [mlir][ArmSME] Fix test after #98043 (NFC) 2024-08-30 10:16:44 +00:00
Maciej Gabka
95d2d1cba0 Move stepvector intrinsic out of experimental namespace (#98043)
This patch moves the stepvector intrinsic out of the experimental
namespace.

This intrinsic has existed in LLVM for several years now, and is widely used.
2024-08-28 12:48:20 +01:00
Benjamin Maxwell
ea28668f8c [mlir][ArmSME] Remove XFAILs (#104758)
Buildbots have been updated to QEMU 9.0.2: 
https://github.com/llvm/llvm-project/pull/104758#issuecomment-2296592845
2024-08-19 14:55:09 +01:00
Zhaoshi Zheng
fe55c34d19 [MLIR][test] Run SVE and SME Integration tests using qemu-aarch64 (#101568)
To run integration tests using qemu-aarch64 on x64 host, below flags are
added to the cmake command when building mlir/llvm:

      -DMLIR_INCLUDE_INTEGRATION_TESTS=ON \
      -DMLIR_RUN_ARM_SVE_TESTS=ON \
      -DMLIR_RUN_ARM_SME_TESTS=ON \
      -DARM_EMULATOR_EXECUTABLE="<...>/qemu-aarch64" \
      -DARM_EMULATOR_OPTIONS="-L /usr/aarch64-linux-gnu" \
      -DARM_EMULATOR_MLIR_CPU_RUNNER_EXECUTABLE="<llvm_arm64_build_top>/bin/mlir-cpu-runner-arm64" \
      -DARM_EMULATOR_LLI_EXECUTABLE="<llvm_arm64_build_top>/bin/lli" \
      -DARM_EMULATOR_UTILS_LIB_DIR="<llvm_arm64_build_top>/lib"

The last three above are prebuilt on, or cross-built for, an aarch64
host.

This patch introduces substitutions such as "%native_mlir_runner_utils" and
uses them in SVE/SME integration tests. When configured to run using
qemu-aarch64, MLIR runtime util libs will be loaded from
ARM_EMULATOR_UTILS_LIB_DIR, if set.

Some tests marked with 'UNSUPPORTED: target=aarch64{{.*}}' are still run
when ARM_EMULATOR_EXECUTABLE is configured and the default target is
not aarch64. A lit config feature 'mlir_arm_emulator' is added in
mlir/test/lit.site.cfg.py.in and appended to the UNSUPPORTED list of
such tests.
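
As a sketch of how these substitutions might appear in a test's RUN lines
(the pipeline and the surrounding substitutions are assumptions for
illustration, not copied from the patch):

```mlir
// RUN: mlir-opt %s -test-lower-to-llvm | \
// RUN:   %mcr_aarch64_cmd -e=entry -entry-point-result=void \
// RUN:     -shared-libs=%native_mlir_runner_utils | \
// RUN: FileCheck %s
```

When running under qemu-aarch64, %native_mlir_runner_utils would resolve to
the runner utils library under ARM_EMULATOR_UTILS_LIB_DIR.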
2024-08-15 21:37:51 -07:00
Andrzej Warzyński
2ee5586ac7 [mlir][vector] Make the in_bounds attribute mandatory (#97049)
At the moment, the in_bounds attribute has two confusing/contradicting
properties:
  1. It is both optional _and_ has an effective default value.
  2. The default value is "out-of-bounds" for non-broadcast dims, and
     "in-bounds" for broadcast dims.

(see the `isDimInBounds` vector interface method for an example of this
"default" behaviour [1]).

This PR aims to clarify the logic surrounding the `in_bounds` attribute
by:
  * making the attribute mandatory (i.e. it is always present),
  * always setting the default value to "out of bounds" (that's
    consistent with the current behaviour for the most common cases).

#### Broadcast dimensions in tests

As per [2], a broadcast dimension requires the corresponding
`in_bounds` attribute to be `true`:
```
  vector.transfer_read op requires broadcast dimensions to be in-bounds
```

The changes in this PR mean that we can no longer rely on the
default value in cases like the following (dim 0 is a broadcast dim):
```mlir
  %read = vector.transfer_read %A[%base1, %base2], %f, %mask
      {permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
    memref<?x?xf32>, vector<4x9xf32>
```

Instead, the broadcast dimension has to be explicitly marked as "in
bounds":

```mlir
  %read = vector.transfer_read %A[%base1, %base2], %f, %mask
      {in_bounds = [true, false], permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
    memref<?x?xf32>, vector<4x9xf32>
```

All tests with broadcast dims are updated accordingly.

#### Changes in "SuperVectorize.cpp" and "Vectorization.cpp"

The following patterns in "Vectorization.cpp" are updated to explicitly
set the `in_bounds` attribute to `false`:
* `LinalgCopyVTRForwardingPattern` and `LinalgCopyVTWForwardingPattern`

Also, `vectorizeAffineLoad` (from "SuperVectorize.cpp") and
`vectorizeAsLinalgGeneric` (from "Vectorization.cpp") are updated to
make sure that xfer Ops created by these hooks mark the dimensions
corresponding to broadcast dims as "in bounds". Otherwise, the Op
verifier would complain.

Note that there is no mechanism to verify whether the corresponding
memory accesses are indeed in bounds. Still, this is consistent with the
current behaviour where the broadcast dim would be implicitly assumed
to be "in bounds".

[1]
4145ad2bac/mlir/include/mlir/Interfaces/VectorInterfaces.td (L243-L246)
[2]
https://mlir.llvm.org/docs/Dialects/Vector/#vectortransfer_read-vectortransferreadop
2024-07-16 16:49:52 +01:00
Mubashar Ahmad
7d69095fd5 [mlir][vector] Remove Emulated Sub-directory (#94742)
The "Emulated" sub-directories under "ArmSVE" and
"ArmSME" have been removed. Associated tests
have been moved up a directory and now include
the "REQUIRES" constraint for the arm-emulator.
2024-06-07 15:38:50 +01:00
Andrzej Warzyński
435114f9fe [mlir][test] Rename Vector integration tests for CPU (nfc) (#93521)
To keep the test filenames consistent, this patch:
  * removes "test-" from  file names (there used to be a mix of
    "test-feature-1.mlir" and "feature-2.mlir"),
  * replaces "_" with "-" (there used to be a mix of "feature-3.mlir"
    and "feature_4.mlir").

Only files under test/Integration/Dialect/Vector/CPU are updated.
2024-05-30 18:06:43 +01:00
Mubashar Ahmad
bc946f5287 [mlir][vector] Add 1D vector.deinterleave lowering (#93042)
This patch implements the lowering of vector.deinterleave 
for 1D vectors.

For fixed vector types, the operation is lowered to two
llvm shufflevector operations: one for even indexed
elements and the other for odd indexed elements. A poison
operation is used to satisfy the second vector operand
of each shufflevector.
    
For scalable vectors, the llvm vector.deinterleave2
intrinsic is used for lowering. The intrinsic returns
a struct of two vectors; the op's results are obtained
by extracting the elements of that struct.
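
As an illustrative sketch (types and value names assumed), a fixed-length
deinterleave such as:

```mlir
%even, %odd = vector.deinterleave %src : vector<8xi8> -> vector<4xi8>
```

would lower to roughly:

```mlir
// Even- and odd-indexed elements are selected by two shuffles; the second
// shufflevector operand is unused, so poison satisfies it.
%poison = llvm.mlir.poison : vector<8xi8>
%even = llvm.shufflevector %src, %poison [0, 2, 4, 6] : vector<8xi8>
%odd = llvm.shufflevector %src, %poison [1, 3, 5, 7] : vector<8xi8>
```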
2024-05-30 09:42:35 +01:00
Kunwar Grover
debdbeda15 [mlir] Remove dialect specific bufferization passes (Reland) (#93535)
These passes have been deprecated for a long time and replaced by
one-shot bufferization. These passes are also unsafe because they do not
check for read-after-write conflicts.

Relands https://github.com/llvm/llvm-project/pull/93488 which failed on
buildbot. Fixes the failure by updating integration tests to use
one-shot-bufferize instead.
2024-05-28 20:04:27 +01:00
Cullen Rhodes
ea20647023 [mlir][ArmSME] NFC: -force-streaming-compatible-sve rename fixup (#93177)
-force-streaming-compatible-sve was renamed in #92774, but this test was
missed. It is no longer required, so it is removed.
2024-05-28 09:07:07 +01:00
Jakub Kuderski
714aee31e1 [mlir][vector] Add result type to interleave assembly format (#93392)
This is to make it more obvious what the result type is, especially
with some less trivial cases like 0-d inputs producing 1-d results or
interaction with scalable vector types. Note that `vector.deinterleave`
uses the same format with explicit result type.

Also improve examples and clean up surrounding code.
2024-05-27 11:03:36 -04:00
Benjamin Maxwell
041baf2f60 [mlir][ArmSME] Use liveness information in the tile allocator (#90448)
This patch rewrites the ArmSME tile allocator to use liveness
information to make better tile allocation decisions and improve the
correctness of the ArmSME dialect. The algorithm used here is a linear
scan over live ranges, where live ranges are assigned to tiles as they
appear in the program (chronologically). Live ranges release their
assigned tile ID once the current program point is past their end.
This is a greedy algorithm, chosen mainly to keep the implementation
relatively straightforward, and because it seems to be sufficient for
most kernels (e.g. matmuls) that use ArmSME. The general steps of this
are roughly from
https://link.springer.com/content/pdf/10.1007/3-540-45937-5_17.pdf,
though there have been a few simplifications and assumptions made for
our use case.

Hopefully, the only changes needed for a user of the ArmSME dialect are
that:

- `-allocate-arm-sme-tiles` will no longer be a standalone pass
  - `-test-arm-sme-tile-allocation` is only for unit tests
- `-convert-arm-sme-to-llvm` must happen after `-convert-scf-to-cf`
  - SME tile allocation is now part of the LLVM conversion

By integrating this into the `ArmSME -> LLVM` conversion we can allow
high-level (value-based) ArmSME operations to be side-effect-free, as we
can guarantee nothing will rearrange ArmSME operations before we emit
intrinsics (which could invalidate the tile allocation).

The hope is for ArmSME operations to have no hidden state/side effects
and allow easily lowering dialects such as `vector` and `arith` to SME,
without making assumptions about how the input IR looks, as the
semantics of the operations will be the same. That is, there are no (new)
side effects and the IR follows the rules of SSA (a value will never change).
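
As a sketch (value names assumed) of the value-based IR this enables the
allocator to handle, with a live range flowing through loop iter_args:

```mlir
%init = arm_sme.get_tile : vector<[4]x[4]xf32>
%tile = scf.for %i = %c0 to %c10 step %c1
    iter_args(%acc = %init) -> (vector<[4]x[4]xf32>) {
  // The accumulator chain forms a single live range, so every
  // iteration reuses the same tile ID.
  %next = arm_sme.outerproduct %lhs, %rhs acc(%acc)
            : vector<[4]xf32>, vector<[4]xf32>
  scf.yield %next : vector<[4]x[4]xf32>
}
```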

The aim is correctness, so we have a base for working on optimizations.
2024-05-14 14:59:01 +01:00
Oleksandr "Alex" Zinenko
5a9bdd85ee [mlir] split transform interfaces into a separate library (#85221)
Transform interfaces are implemented, directly or via extensions, in
libraries belonging to multiple other dialects. Those dialects don't
need to depend on the non-interface part of the transform dialect, which
includes the growing number of ops and transitive dependency footprint.

Split out the interfaces into a separate library. This in turn requires
flipping the dependency from the interface on the dialect that has crept
in because both co-existed in one library. The interface shouldn't
depend on the transform dialect either.

As a consequence of splitting, the capability of the interpreter to
automatically walk the payload IR to identify payload ops of a certain
kind based on the type used for the entry point symbol argument is
disabled. This is a good move by itself as it simplifies the interpreter
logic. This functionality can be trivially replaced by a
`transform.structured.match` operation.
2024-03-20 22:15:17 +01:00
Benjamin Maxwell
780d55690e [mlir][ArmSVE] Fix test-setArmVLBits.mlir after #83213 2024-02-29 13:02:50 +00:00
Aart Bik
c1b8c6cf41 [mlir][vector][print] do not append newline to printing pure strings (#83213)
Since the vector.print str provides no punctuation control, it is
slightly more flexible to let the client of this operation decide
whether there should be a trailing newline. This allows for printing
like

vector.print str "nse = "
vector.print %nse : index

as

nse = 42
2024-02-28 10:18:21 -08:00
Oleksandr "Alex" Zinenko
5468f88413 [mlir] update remaining transform tests to main pass (#81279)
Use the main transform interpreter pass instead of the test pass. The
only tests that are not updated are specific to the operation of the
test pass.
2024-02-28 11:06:53 +01:00
David Spickett
575bf33a27 [mlir][ArmSME] Remove XFAIL for load-store-128-bit-tile
https://gitlab.com/qemu-project/qemu/-/issues/1833 was fixed by
4b3520fd93
which was included in v8.1.1.

The buildbot is now using v8.1.3 so the test is passing.

https://lab.llvm.org/buildbot/#/builders/179/builds/9468
2024-02-27 08:45:01 +00:00
Cullen Rhodes
b39f5660a4 [mlir][ArmSME] Add test-lower-to-arm-sme pipeline (#81732)
The ArmSME compilation pipeline has evolved significantly and is now
sufficiently complex that it warrants a proper lowering pipeline
that encapsulates the various passes and orderings. Currently the
pipeline is loosely defined in our integration tests, but these have
diverged and are not using the same passes or ordering everywhere.

This patch introduces a test-lower-to-arm-sme pipeline mirroring
test-lower-to-llvm that provides some sanity when running e2e examples
and can be used as a reference for targeting ArmSME in MLIR.

All the integration tests are updated to use this pipeline. The
intention is to productize the pipeline once it becomes more mature.
2024-02-23 09:42:08 +00:00
Benjamin Maxwell
08eced5fcc [mlir][test] Add -march=aarch64 -mattr=+sve to test-scalable-interleave
Fix for https://lab.llvm.org/buildbot/#/builders/179/builds/9438
2024-02-22 15:30:53 +00:00
Benjamin Maxwell
f01719afaa [mlir][test] Add integration tests for vector.interleave (#80969) 2024-02-22 10:21:12 +00:00
Cullen Rhodes
fff86c6111 [mlir][ArmSME] Support 4-way widening outer products (#79288)
This patch introduces support for 4-way widening outer products. This
enables the fusion of 4 'arm_sme.outerproduct' operations that are
chained via the accumulator into single widened operations.

Changes:

- Adds the following operations:
  - smopa_4way, smops_4way
  - umopa_4way, umops_4way
  - sumopa_4way, sumops_4way
  - usmopa_4way, usmops_4way
- Implements conversions for the above ops to intrinsics in ArmSMEToLLVM.
- Extends 'arm-sme-outer-product' pass.

For a detailed description of these operations see the
'arm_sme.smopa_4way' description.
2024-02-07 08:17:47 +00:00
Benjamin Maxwell
042800a4dd [mlir][ArmSME] Add initial SME vector legalization pass (#79152)
This adds a new pass (`-arm-sme-vector-legalization`) which legalizes
vector operations so that they can be lowered to ArmSME. This initial
patch adds decomposition for `vector.outerproduct`,
`vector.transfer_read`, and `vector.transfer_write` when they operate on
vector types larger than a single SME tile. For example, a [8]x[8]xf32
outer product would be decomposed into four [4]x[4]xf32 outer products,
which could then be lowered to ArmSME. These three ops have been picked
as supporting them alone allows lowering matmuls that use all ZA
accumulators to ArmSME.
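
Conceptually (a sketch of the idea rather than the pass's literal output,
with value names assumed), an [8]x[8]xf32 outer product decomposes like so:

```mlir
// Halve each vector<[8]xf32> operand...
%lhs0 = vector.scalable.extract %lhs[0] : vector<[4]xf32> from vector<[8]xf32>
%lhs1 = vector.scalable.extract %lhs[4] : vector<[4]xf32> from vector<[8]xf32>
%rhs0 = vector.scalable.extract %rhs[0] : vector<[4]xf32> from vector<[8]xf32>
%rhs1 = vector.scalable.extract %rhs[4] : vector<[4]xf32> from vector<[8]xf32>
// ...then form one tile-sized outer product per quadrant (four in total).
%q00 = vector.outerproduct %lhs0, %rhs0 : vector<[4]xf32>, vector<[4]xf32>
%q01 = vector.outerproduct %lhs0, %rhs1 : vector<[4]xf32>, vector<[4]xf32>
%q10 = vector.outerproduct %lhs1, %rhs0 : vector<[4]xf32>, vector<[4]xf32>
%q11 = vector.outerproduct %lhs1, %rhs1 : vector<[4]xf32>, vector<[4]xf32>
```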

For it to be possible to legalize a vector type it has to be a multiple
of an SME tile size, but other than that any shape can be used. E.g.
`vector<[8]x[8]xf32>`, `vector<[4]x[16]xf32>`, `vector<[16]x[4]xf32>`
can all be lowered to four `vector<[4]x[4]xf32>` operations.

In future, this pass will be extended with more SME-specific rewrites to
legalize unrolling the reduction dimension of matmuls (which is not
type-decomposition), which is why the pass has quite a general name.
2024-01-31 11:55:22 +00:00
Cullen Rhodes
95ef8e3868 [mlir][ArmSME] Support 2-way widening outer products (#78975)
This patch introduces support for 2-way widening outer products. This
enables the fusion of 2 'arm_sme.outerproduct' operations that are
chained via the accumulator into a 2-way widening outer product
operation.

Changes:

- Add 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for 2-way variants.
  These map to instruction variants added in SME2 and use different
  intrinsics. Intrinsics are already implemented for widening variants
  from SME1.
- Adds the following operations:
  - fmopa_2way, fmops_2way
  - smopa_2way, smops_2way
  - umopa_2way, umops_2way
- Implements conversions for the above ops to intrinsics in
ArmSMEToLLVM.
- Adds a pass 'arm-sme-outer-product-fusion' that fuses
  'arm_sme.outerproduct' operations.

For a detailed description of these operations see the
'arm_sme.fmopa_2way' description.
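
As an illustrative sketch (modeled on the 'arm_sme.fmopa_2way'
documentation; value names assumed), two outer products chained via the
accumulator:

```mlir
%a0_ext = arith.extf %a0 : vector<[4]xf16> to vector<[4]xf32>
%b0_ext = arith.extf %b0 : vector<[4]xf16> to vector<[4]xf32>
%a1_ext = arith.extf %a1 : vector<[4]xf16> to vector<[4]xf32>
%b1_ext = arith.extf %b1 : vector<[4]xf16> to vector<[4]xf32>
%0 = arm_sme.outerproduct %a0_ext, %b0_ext : vector<[4]xf32>, vector<[4]xf32>
%1 = arm_sme.outerproduct %a1_ext, %b1_ext acc(%0) : vector<[4]xf32>, vector<[4]xf32>
```

fuse into a single 2-way widening outer product on interleaved inputs:

```mlir
%a = vector.interleave %a0, %a1 : vector<[4]xf16> -> vector<[8]xf16>
%b = vector.interleave %b0, %b1 : vector<[4]xf16> -> vector<[8]xf16>
%0 = arm_sme.fmopa_2way %a, %b : vector<[8]xf16>, vector<[8]xf16> into vector<[4]x[4]xf32>
```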

The reason for introducing many operations rather than one is the
signed/unsigned variants can't be distinguished with types (e.g., ui16,
si16) since 'arith.extui' and 'arith.extsi' only support signless
integers. A single operation would require this information, and an
attribute for the sign (for example) doesn't feel right when
floating-point types are also supported, where this wouldn't apply.
Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32)
introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but
no subtract variant. Whilst these are not supported in this patch, it
felt simpler to have separate ops for add/subtract given this.
2024-01-31 09:13:18 +00:00
Benjamin Maxwell
e280c287e4 [mlir] Add mlir_arm_runner_utils library for use in integration tests (#78583)
This adds a new `mlir_arm_runner_utils` library that contains utils
specific to Arm/AArch64. This is for use in MLIR integration tests.

This initial patch adds `setArmVLBits()` and `setArmSVLBits()`. This
allows changing vector length or streaming vector length at runtime (or
setting it to a known minimum, i.e. 128-bits).
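
A minimal usage sketch (assuming the i32 signature these utils expose to
MLIR):

```mlir
func.func private @setArmVLBits(i32)

func.func @entry() {
  // Pin the SVE vector length to the 128-bit minimum for the test.
  %c128 = arith.constant 128 : i32
  func.call @setArmVLBits(%c128) : (i32) -> ()
  return
}
```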
2024-01-22 09:28:13 +00:00
Cullen Rhodes
9f7fff7f13 [mlir][ArmSME] Add arith-to-arm-sme conversion pass (#78197)
Existing 'arith::ConstantOp' conversion and tests are moved from
VectorToArmSME. Only a single op is converted at the moment, but this
will grow in the future as things like in-tile add are implemented.
Also, 'createLoopOverTileSlices' is moved to ArmSME
utils since it's relevant for both conversions.
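
As a sketch of the kind of rewrite involved (the exact pattern set may
differ):

```mlir
// Before: a tile-sized zero constant...
%tile = arith.constant dense<0> : vector<[4]x[4]xi32>
// After: ...materialized as a zeroed SME tile.
%tile = arm_sme.zero : vector<[4]x[4]xi32>
```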
2024-01-22 09:23:11 +00:00
Benjamin Maxwell
dc974573a8 [mlir][ArmSME][test] Make use of arm_sme.streaming_vl (NFC) (#77322) 2024-01-11 10:24:55 +00:00
Guray Ozen
ace69e6b94 [mlir][gpu] Improve gpu-lower-to-nvvm-pipeline Documentation (#77062)
This PR improves the documentation for the `gpu-lower-to-nvvm-pipeline`
(as it was the remaining item for #75775).

- Changes pipeline `gpu-lower-to-nvvm` -> `gpu-lower-to-nvvm-pipeline`
- Adds a section in GPU Dialect in website. It clarifies the pipeline's
functionality in lowering primary dialects to NVVM targets.
2024-01-05 12:51:25 +01:00
Jakub Kuderski
560564f51c [mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
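
For example (a sketch):

```mlir
// Previously spelled <maxf>; the new kind names map 1:1 to arith ops:
%max = vector.reduction <maxnumf>, %v : vector<4xf32> into f32
// NaN-propagating semantics now have a distinct, explicit spelling:
%vmax = vector.reduction <maximumf>, %v : vector<4xf32> into f32
```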

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-12-20 00:14:43 -05:00
Guray Ozen
5caae72d1a [mlir][gpu] Productize test-lower-to-nvvm as gpu-lower-to-nvvm (#75775)
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.

This PR updates the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it within `InitAllPasses.h`. The aim is to call it from
Python, while also having a standardized compilation process for nvvm.
2023-12-19 08:40:46 +01:00
Benjamin Maxwell
9505cf457f [mlir][ArmSME][test] Use only-if-required-by-ops rather than enable_arm_streaming_ignore (NFC) (#75209)
This moves the fix out of the IR and into the pass description, which
seems nicer. It also works as an integration test for the
`only-if-required-by-ops` flag :)
2023-12-13 10:29:28 +00:00
Benjamin Maxwell
eaff02f28e [mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
  `-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g. `scf.for` loops that
  yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
  lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
  values
  - The tile ID on the SME intrinsics is an `immarg` (so is required to
    be a compile-time constant); `immargs` should be mapped to MLIR
    attributes (this is already the case for intrinsics in the LLVM
    dialect)
  - Using MLIR values for `immargs` can lead to invalid LLVM IR being
    generated (and passes such as -cse making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
2023-11-30 10:22:22 +00:00
Benjamin Maxwell
dff97c1e4c [mlir][ArmSME] Move ArmSME -> intrinsics lowerings to convert-arm-sme-to-llvm pass (#72890)
This gives more flexibility over when these lowerings are performed,
without also lowering unrelated vector ops.

This is a NFC (other than adding a new `-convert-arm-sme-to-llvm` pass)
2023-11-22 13:36:36 +00:00
Benjamin Maxwell
783ac3b6fb [mlir][ArmSME] Make use of backend function attributes for enabling ZA storage (#71044)
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).

Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.

Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.

A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
2023-11-14 12:50:38 +00:00
Cullen Rhodes
4240b1790f [mlir][ArmSME] Lower transfer_write + transpose to vertical store (#71181)
This patch extends the lowering of vector.transfer_write in
VectorToArmSME to support in-flight transpose via SME vertical store.
2023-11-10 07:51:06 +00:00
Cullen Rhodes
9783cf448a [mlir][ArmSME] Add support for lowering masked tile_load ops (#70915)
This patch extends ArmSMEToSCF to support lowering of masked tile_load
ops. Only masks created by 'vector.create_mask' are currently supported.
There are two lowerings depending on the pad.

For pad of constant zero, the tile is first zeroed, then only active
rows are loaded.

For non-zero pad, the scalar pad is broadcast to a 1-D vector and a
regular 'vector.masked_load' (will be lowered to SVE, not SME) loads
each slice, with padding specified as a passthru and the 2-D mask
combined into a 1-D mask. The resulting slice is then inserted into the
tile with 'arm_sme.move_vector_to_tile_slice'.
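
A sketch of a masked tile_load input (value names assumed):

```mlir
%mask = vector.create_mask %c3, %c2 : vector<[4]x[4]xi1>
%tile = arm_sme.tile_load %src[%c0, %c0], %pad, %mask
          : memref<?x?xf32>, vector<[4]x[4]xf32>
```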
2023-11-08 09:02:09 +00:00
Cullen Rhodes
fbc70c5a9e [mlir][ArmSME] remove addressof ops to undefined symbols (NFC)
The string symbols were replaced with 'vector.print str' calls in
061d978043 (#68973) but the addressof ops weren't removed. This was
missed as the test is currently XFAIL'ed.
2023-11-06 11:43:43 +00:00
Cullen Rhodes
ed350bb3d8 [mlir][ArmSME] Add support for lowering masked tile_store ops (#71180)
This patch extends ArmSMEToSCF to support lowering of masked tile_store
ops. Only masks created by 'vector.create_mask' are currently supported.

Example:

  %mask = vector.create_mask %c3, %c2 : vector<[4]x[4]xi1>
  arm_sme.tile_store %tile, %dest[%c0, %c0], %mask
    : memref<?x?xi32>, vector<[4]x[4]xi32>

Produces:

  %num_rows = arith.constant 3 : index
  %num_cols = vector.create_mask %c2 : vector<[4]xi1>
  scf.for %slice_idx = %c0 to %num_rows step %c1
    arm_sme.store_tile_slice %tile, %slice_idx, %num_cols, %dest[%slice_idx, %c0]
      : memref<?x?xi32>, vector<[4]xi1>, vector<[4]x[4]xi32>
2023-11-06 11:18:57 +00:00
Christian Ulmann
52491c99fa [MLIR][LLVM] Remove typed pointer remnants from integration tests (#71208)
This commit removes all LLVM dialect typed pointers from the integration
tests. Typed pointers have been deprecated for a while now and it's
planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-11-03 21:21:25 +01:00
Benjamin Maxwell
e666295011 [mlir][ArmSME] Support lowering masked vector.outerproduct ops to SME (#69604)
This patch adds support for lowering masked outer products to SME. This
is done in two stages. First, vector.outerproducts (both masked and
non-masked) are rewritten to arm_sme.outerproducts. The
arm_sme.outerproduct op is close to vector.outerproduct, but supports
masking on the operands rather than the result. It also limits the cases
it handles to things that could be (directly) lowered to SME.

This currently requires that the source of the mask is a
vector.create_mask op. E.g.:

```mlir
%mask = vector.create_mask %dimA, %dimB : vector<[4]x[4]xi1>
%result = vector.mask %mask {
             vector.outerproduct %vecA, %vecB
              : vector<[4]xf32>, vector<[4]xf32>
          } : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32>
```
Is rewritten to:
```
%maskA = vector.create_mask %dimA : vector<[4]xi1>
%maskB = vector.create_mask %dimB : vector<[4]xi1>
%result = arm_sme.outerproduct %vecA, %vecB masks(%maskA, %maskB)
              : vector<[4]xf32>, vector<[4]xf32>
```
(The same rewrite works for non-masked vector.outerproducts too)

The arm_sme.outerproduct can then be directly lowered to SME intrinsics.
2023-10-31 09:06:21 +00:00
Andrzej Warzyński
e9478b167f [mlir][SVE] Add more e2e test for vector.contract (#70367)
Adds basic integration tests for `vector.contract` for the dot product
and matvec operations. These tests exercise scalable vectors.

Depends on https://github.com/llvm/llvm-project/pull/69845
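
A representative scalable dot product in `vector.contract` form (a sketch,
not the test verbatim):

```mlir
%dot = vector.contract {
         indexing_maps = [affine_map<(d0) -> (d0)>,
                          affine_map<(d0) -> (d0)>,
                          affine_map<(d0) -> ()>],
         iterator_types = ["reduction"], kind = #vector.kind<add>}
       %a, %b, %acc : vector<[4]xf32>, vector<[4]xf32> into f32
```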
2023-10-27 15:00:01 +01:00
tyb0807
4d4f603793 [mlir][Vector] Fix integration test for vector.maskedload narrow type emulation (#70431)

Currently the expected CHECK values are not correct for
`fcst_maskedload` from
mlir/test/Integration/Dialect/Vector/CPU/test-rewrite-narrow-types.mlir
2023-10-27 11:26:20 +02:00
tyb0807
674261b203 [mlir][Vector] Add narrow type emulation pattern for vector.maskedload (#68443) 2023-10-27 10:49:58 +02:00
Andrzej Warzyński
f24c443e82 [mlir][SVE] Add an e2e test for vector.contract (#69845)
Adds an end-to-end test for `vector.contract` that targets SVE (i.e.
scalable vectors). Note that this requires lifting the restriction on
`vector.outerproduct` (to which `vector.contract` is lowered) that
would deem the following as invalid by the Op verifier (*):

```
vector.outerproduct %27, %28, %26 {kind = #vector.kind<add>} : vector<3xf32>, vector<[2]xf32>
```

This is indeed valid as the end-to-end test demonstrates (at least when
compiling for SVE).
2023-10-26 20:57:49 +01:00
Benjamin Maxwell
96e040acee [mlir][ArmSVE] Add -arm-sve-legalize-vector-storage pass (#68794)
This patch adds a pass that ensures that loads, stores, and allocations
of SVE vector types will be legal in the LLVM backend. It does this at
the memref level, so this pass must be applied before lowering all the
way to LLVM.

This pass currently fixes two issues.

## Loading and storing predicate types

It is only legal to load/store predicate types equal to (or greater
than) a full predicate register, which in MLIR is `vector<[16]xi1>`.
Smaller predicate types (`vector<[1|2|4|8]xi1>`) must be converted
to/from a full predicate type (referred to as an `svbool`) before and
after storing and loading respectively. This pass does this by widening
allocations and inserting conversion intrinsics.

For example:


```mlir
%alloca = memref.alloca() : memref<vector<[4]xi1>>
%mask = vector.constant_mask [4] : vector<[4]xi1>
memref.store %mask, %alloca[] : memref<vector<[4]xi1>>
%reload = memref.load %alloca[] : memref<vector<[4]xi1>>
```
Becomes:
```mlir
%alloca = memref.alloca() {alignment = 1 : i64} : memref<vector<[16]xi1>>
%mask = vector.constant_mask [4] : vector<[4]xi1>
%svbool = arm_sve.convert_to_svbool %mask : vector<[4]xi1>
memref.store %svbool, %alloca[] : memref<vector<[16]xi1>>
%reload_svbool = memref.load %alloca[] : memref<vector<[16]xi1>>
%reload = arm_sve.convert_from_svbool %reload_svbool : vector<[4]xi1>
```

## Relax alignments for SVE vector allocas

The storage for SVE vector types only needs to have an alignment that
matches the element type (for example 4 byte alignment for `f32`s).
However, the LLVM backend currently defaults to aligning to `base size x
element size` bytes. For non-legal vector types like `vector<[8]xf32>`
this results in 8 x 4 = 32-byte alignment, but the backend only supports
up to 16-byte alignment for SVE vectors on the stack. Explicitly setting
a smaller alignment prevents this issue.

Depends on: #68586 and #68695 (for testing)
2023-10-26 12:18:58 +01:00
Benjamin Maxwell
061d978043 [mlir][test] Update tests to use vector.print str (NFC) (#68973)
This cuts down on a fair amount of boilerplate.

Depends on: #68695
2023-10-25 10:14:34 +01:00
Oleksandr "Alex" Zinenko
e4384149b5 [mlir] use transform-interpreter in test passes (#70040)
Update most test passes to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter-pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical, rewriting an unnamed sequence into
a named one and wrapping the transform IR into a module when necessary.

Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.

Only the tests in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of testing coverage of the transform interpreter logic.
2023-10-24 16:12:34 +02:00
Benjamin Maxwell
3be3883e6d [mlir][VectorOps] Support string literals in vector.print (#68695)
Printing strings within integration tests is currently quite annoyingly
verbose, and can't be tucked into shared helpers as the types depend on
the length of the string:

```
llvm.mlir.global internal constant @hello_world("Hello, World!\0")

func.func @entry() {
  %0 = llvm.mlir.addressof @hello_world : !llvm.ptr<array<14 x i8>>
  %1 = llvm.mlir.constant(0 : index) : i64
  %2 = llvm.getelementptr %0[%1, %1]
    : (!llvm.ptr<array<14 x i8>>, i64, i64) -> !llvm.ptr<i8>
  llvm.call @printCString(%2) : (!llvm.ptr<i8>) -> ()
  return
}
```

So this patch adds a simple extension to `vector.print` to simplify
this:
```
func.func @entry() {
   // Print a vector of characters ;)
   vector.print str "Hello, World!"
   return
}
```

Most of the logic for this is now shared with `cf.assert` which already
does something similar.

Depends on #68694
2023-10-24 09:34:14 +01:00
Cullen Rhodes
d86047cb66 [mlir][ArmSME] Update tile slice layout syntax (#69151)
This patch prefixes tile slice layout with `layout` in the
assemblyFormat:

  - `<vertical>`   -> `layout<vertical>`
  - `<horizontal>` -> `layout<horizontal>`

The reason for this change is that the current format doesn't play
nicely with additional optional operands, required to support padding
and masking (#69148), as it becomes ambiguous.

This affects the following ops:

  - arm_sme.tile_load
  - arm_sme.tile_store
  - arm_sme.load_tile_slice
  - arm_sme.store_tile_slice
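
For example (operand names assumed), a vertical tile load changes from:

```mlir
%tile = arm_sme.tile_load %base[%c0, %c0] <vertical>
          : memref<?x?xi32>, vector<[4]x[4]xi32>
```

to:

```mlir
%tile = arm_sme.tile_load %base[%c0, %c0] layout<vertical>
          : memref<?x?xi32>, vector<[4]x[4]xi32>
```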
2023-10-16 10:55:30 +01:00
Cullen Rhodes
9816edc9f3 [mlir][vector] add result type to vector.extract assembly format (#66499)
The vector.extract assembly format currently only contains the source
type, for example:

  %1 = vector.extract %0[1] : vector<3x7x8xf32>

it's not immediately obvious if this is the source or result type. This
patch improves the assembly format to make this clearer, so the above
becomes:

  %1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>
2023-09-28 11:11:16 +01:00