Commit Graph

764 Commits

Author SHA1 Message Date
Ryan Holt
847a6f8f0a [mlir][MemRef] Add runtime bounds checking (#75817)
This change adds (runtime) bounds checks for `memref` ops using the
existing `RuntimeVerifiableOpInterface`. For `memref.load` and
`memref.store`, we check that the indices are in-bounds of the memref's
index space. For `memref.reinterpret_cast` and `memref.subview` we check
that the resulting address space is in-bounds of the input memref's
address space.
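
A minimal sketch (not taken from the patch) of the kind of guard such runtime
verification could materialize in front of a `memref.load`; the exact ops,
bound computation, and message are assumptions:

```mlir
func.func @checked_load(%m: memref<4xf32>, %i: index) -> f32 {
  // Assumed expansion: verify 0 <= %i < dim before the actual load.
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %ge = arith.cmpi sge, %i, %c0 : index
  %lt = arith.cmpi slt, %i, %c4 : index
  %ok = arith.andi %ge, %lt : i1
  cf.assert %ok, "memref.load index out of bounds"
  %v = memref.load %m[%i] : memref<4xf32>
  return %v : f32
}
```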
2023-12-22 11:49:15 +09:00
Jakub Kuderski
560564f51c [mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.

Issue: https://github.com/llvm/llvm-project/issues/72354
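
A small sketch of the renamed kinds in use; the function and types are
illustrative only:

```mlir
func.func @reduce(%v: vector<4xf32>) -> (f32, f32) {
  // Was <maxf>; the new kind names map 1:1 to arith ops.
  %a = vector.reduction <maxnumf>, %v : vector<4xf32> into f32   // arith.maxnumf semantics
  %b = vector.reduction <maximumf>, %v : vector<4xf32> into f32  // arith.maximumf semantics
  return %a, %b : f32, f32
}
```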
2023-12-20 00:14:43 -05:00
Guray Ozen
5caae72d1a [mlir][gpu] Productize test-lower-to-nvvm as gpu-lower-to-nvvm (#75775)
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.

This PR updates the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it within `InitAllPasses.h`. The aim is to call it from
Python and to have a standardized compilation process for nvvm.
2023-12-19 08:40:46 +01:00
Yinying Li
7bc6c4abe8 [mlir][print]Add functions for printing memref f16/bf16/i16 (#75094)
1. Added functions for printMemrefI16/f16/bf16.
2. Added a new integration test for all the printMemref functions.
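
A hedged sketch of how one of the new helpers might be declared and called
from a test; the exact symbol name and signature are assumed from the
description above:

```mlir
// Declaration of the runner-utils helper (assumed spelling).
func.func private @printMemrefF16(memref<*xf16>)

func.func @dump(%m: memref<2x2xf16>) {
  // Erase the static shape so the generic print helper can be used.
  %u = memref.cast %m : memref<2x2xf16> to memref<*xf16>
  call @printMemrefF16(%u) : (memref<*xf16>) -> ()
  return
}
```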
2023-12-14 13:06:25 -05:00
Benjamin Maxwell
9505cf457f [mlir][ArmSME][test] Use only-if-required-by-ops rather than enable_arm_streaming_ignore (NFC) (#75209)
This moves the fix out of the IR and into the pass description, which
seems nicer. It also works as an integration test for the
`only-if-required-by-ops` flag :)
2023-12-13 10:29:28 +00:00
Matthias Springer
95d6aa21fb [mlir][SparseTensor][NFC] Use tensor.empty for dense tensors (#74804)
Use `tensor.empty` + initialization for dense tensors instead of
`bufferization.alloc_tensor`.
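
A minimal sketch of the replacement pattern, with illustrative shapes:

```mlir
func.func @init_dense() -> tensor<8x8xf64> {
  // Was: %t = bufferization.alloc_tensor() : tensor<8x8xf64>
  %zero = arith.constant 0.0 : f64
  %empty = tensor.empty() : tensor<8x8xf64>
  %init = linalg.fill ins(%zero : f64) outs(%empty : tensor<8x8xf64>) -> tensor<8x8xf64>
  return %init : tensor<8x8xf64>
}
```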
2023-12-12 08:56:47 +09:00
Aart Bik
21213f39e2 [mlir][sparse] fix uninitialized dense tensor out in conv2d test (#74884)
Note that tensor.empty may feed into a SPARSE output (meaning it truly has no
values yet), but for a DENSE output, it should always have an initial
value. We ran a verifier over all our tests and this is the only
remaining omission.
2023-12-08 12:44:57 -08:00
Aart Bik
ec9e49796d [mlir][sparse] add sparse convolution with 5x5 kernel (#74793)
Also unifies some of the test setup in other conv tests.
2023-12-07 18:11:04 -08:00
Aart Bik
7003e255d3 [mlir][sparse] code formatting (NFC) (#74779) 2023-12-07 15:46:24 -08:00
Peiming Liu
78e2b74f96 [mlir][sparse] fix bugs when generate sparse conv_3d kernels. (#74561) 2023-12-06 15:59:10 -08:00
Sang Ik Lee
7fc792cba7 [MLIR] Enable GPU Dialect to SYCL runtime integration (#71430)
GPU Dialect lowering to SYCL runtime is driven by spirv.target_env
attached to gpu.module. As a result of this, spirv.target_env remains as
an input to LLVMIR Translation.
A SPIRVToLLVMIRTranslation without any actual translation is added to
avoid an unregistered error in mlir-cpu-runner.
SelectObjectAttr.cpp is updated to
1) Pass binary size argument to getModuleLoadFn
2) Pass parameter count to getKernelLaunchFn
This change does not impact CUDA and ROCM usage since both
mlir_cuda_runtime and mlir_rocm_runtime are already updated to accept
and ignore the extra arguments.
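
A hedged sketch of a gpu.module carrying the spirv.target_env attribute that
drives this lowering path; the attribute values and the kernel are
illustrative only:

```mlir
gpu.module @kernels attributes {
  spirv.target_env = #spirv.target_env<
    #spirv.vce<v1.0, [Kernel, Addresses], []>, #spirv.resource_limits<>>
} {
  // Placeholder kernel; the real tests launch actual workloads.
  gpu.func @noop() kernel {
    gpu.return
  }
}
```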
2023-12-05 16:55:24 -05:00
Peiming Liu
8206b75a1e [mlir][sparse] fix crash when generate rotated convolution kernels. (#74146) 2023-12-01 14:13:57 -08:00
Andrzej Warzyński
bc802407d1 [mlir][sve][nfc] Merge the integration tests for linalg.matmul (#74059)
At the moment the logic to tile and vectorize `linalg.matmul` is
duplicated in multiple test files:
  * matmul.mlir
  * matmul_mixed_ty.mlir

Instead, this patch uses `transform.foreach` to apply the same sequence
to multiple functions within the same test file (e.g. `matmul_f32` and
`matmul_mixed_ty` as defined in the original files). This allows us to
merge relevant test files.
2023-12-01 17:39:48 +00:00
Spenser Bauman
0d87e25779 [mlir][tosa] Improve lowering to tosa.fully_connected (#73049)
The current lowering of tosa.fully_connected produces a linalg.matmul
followed by a linalg.generic to add the bias. The IR looks like the
following:

    %init = tensor.empty()
    %zero = linalg.fill ins(0 : f32) outs(%init)
    %prod = linalg.matmul ins(%A, %B) outs(%zero)

    // Add the bias
    %initB = tensor.empty()
    %result = linalg.generic ins(%prod, %bias) outs(%initB) {
       // add bias and product
    }

This has two downsides:

1. The tensor.empty operations typically result in additional
allocations after bufferization
2. There is a redundant traversal of the data to add the bias to the
matrix product.

This extra work can be avoided by leveraging the out-param of
linalg.matmul. The new IR sequence is:

    %init = tensor.empty()
    %broadcast = linalg.broadcast ins(%bias) outs(%init)
    %prod = linalg.matmul ins(%A, %B) outs(%broadcast)

In my experiments, this eliminates one loop and one allocation (post
bufferization) from the generated code.
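
A hedged, self-contained sketch of the new sequence with concrete
(illustrative) shapes and names:

```mlir
func.func @fully_connected(%A: tensor<4x8xf32>, %B: tensor<8x16xf32>,
                           %bias: tensor<16xf32>) -> tensor<4x16xf32> {
  %init = tensor.empty() : tensor<4x16xf32>
  // Broadcast the bias into the accumulator instead of adding it afterwards.
  %bcast = linalg.broadcast ins(%bias : tensor<16xf32>)
                            outs(%init : tensor<4x16xf32>) dimensions = [0]
  %prod = linalg.matmul ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
                        outs(%bcast : tensor<4x16xf32>) -> tensor<4x16xf32>
  return %prod : tensor<4x16xf32>
}
```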
2023-12-01 15:16:51 +00:00
Andrzej Warzyński
f42ce1621f [mlir][sve][nfc] Update a test to use transform-interpreter (#73771)
This is a follow-up of #70040 in which the test updated here was missed.

Includes a few additional NFC changes in preparation for extending this
test.
2023-12-01 10:08:00 +00:00
Benjamin Maxwell
eaff02f28e [mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g. `scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so it is required to be
a compile-time constant); `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
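
A minimal sketch (shapes illustrative) of the now-supported SSA form for
control flow, with an `scf.for` loop that takes and yields a tile:

```mlir
func.func @loop_over_tile(%n: index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  // Allocating op: becomes a root for the tile allocation pass.
  %init = arm_sme.get_tile : vector<[4]x[4]xi32>
  %res = scf.for %i = %c0 to %n step %c1
      iter_args(%tile = %init) -> (vector<[4]x[4]xi32>) {
    // Ops updating the tile would go here; all transitive uses of the root
    // currently receive the same tile ID.
    scf.yield %tile : vector<[4]x[4]xi32>
  }
  return
}
```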
2023-11-30 10:22:22 +00:00
Andrzej Warzyński
4b2ba5a61a [mlir][sve] Add an e2e for linalg.matmul with mixed types (#73773)
Apart from the test itself, this patch also updates a few patterns to
fix how new VectorType(s) are created. Namely, it makes sure that
"scalability" is correctly propagated.

Regression tests will be updated separately while auditing Vector
dialect tests in the context of scalable vectors:
  * https://github.com/orgs/llvm/projects/23
2023-11-29 21:21:10 +00:00
Aart Bik
1944c4f76b [mlir][sparse] rename DimLevelType to LevelType (#73561)
The "Dim" prefix is a legacy left-over that no longer makes sense, since
we have a very strict "Dimension" vs. "Level" definition for sparse
tensor types and their storage.
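
For illustration, a hedged sketch of an encoding where (semantic) dimensions
map onto (storage) levels, which is what the renamed LevelType describes; the
alias and shapes are assumptions:

```mlir
#CSR = #sparse_tensor.encoding<{
  // Two dimensions (d0, d1) mapped to two levels: dense, then compressed.
  map = (d0, d1) -> (d0 : dense, d1 : compressed)
}>

func.func @pass_through(%t: tensor<8x8xf64, #CSR>) -> tensor<8x8xf64, #CSR> {
  return %t : tensor<8x8xf64, #CSR>
}
```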
2023-11-27 14:27:52 -08:00
Jakub Kuderski
7eccd52842 Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit dd09221a29 and relands
https://github.com/llvm/llvm-project/pull/73423.

* Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.
2023-11-27 11:38:18 -05:00
Guray Ozen
edf5cae739 [mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA),
a new level of parallelism that allows clusters of Cooperative Thread Arrays
(CTAs) to synchronize and communicate through shared memory while running
concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms that we
already have in the GPU dialect, such as outlining and kernel launching, which
makes this a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Cullen Rhodes
fae3964cbc [mlir][linalg] Add an e2e test for linalg.matmul to ArmSME (#72144)
This patch adds an integration test lowering a linalg.matmul to SME via
vector.outerproduct.

It's similar to the linalg.matmul_transpose_a e2e test added recently, and
relies on TransferReadDropUnitDimsPattern as well as vector transpose
canonicalizations to lower the following sequence (taken from the inner loop):
```
  %subview = memref.subview %arg0[%arg3, %arg5] [%2, 1] [1, 1] :
    memref<?x?xf32, strided<[?, ?], offset: ?>> to memref<?x1xf32, strided<[?, ?], offset: ?>>
  %mask = vector.create_mask %2, %c1 : vector<[4]x1xi1>
  %0 = vector.transfer_read %subview[%c0, %c0], %pad, %mask {in_bounds = [true, true]} :
    memref<?x1xf32, strided<[?, ?], offset: ?>>, vector<[4]x1xf32>
  %1 = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
  %2 = vector.extract %1[0] : vector<[4]xf32> from vector<1x[4]xf32>
```
Rank-2 vectors with a leading scalable dim can't be type-converted to an
array. TransferReadDropUnitDimsPattern drops the unit dim on the
vector.transfer_read so it can be lowered via the generic path (to SVE).
The transpose canonicalizations lower the transpose to a shape_cast
which folds away.
2023-11-23 08:53:43 +00:00
Benjamin Maxwell
dff97c1e4c [mlir][ArmSME] Move ArmSME -> intrinsics lowerings to convert-arm-sme-to-llvm pass (#72890)
This gives more flexibility over when these lowerings are performed,
without also lowering unrelated vector ops.

This is an NFC (other than adding a new `-convert-arm-sme-to-llvm` pass)
2023-11-22 13:36:36 +00:00
Aart Bik
c97e4273e2 [mlir][sparse] test on read/convert permuted 3d sparse tensors (#72925)
3! = 6
2023-11-21 09:26:04 -08:00
Peiming Liu
b52eb7c2fe [mlir][sparse] add a csr x bsr matmul test case (#73012) 2023-11-21 09:14:45 -08:00
Aart Bik
6352a07ba6 [mlir][sparse] test four row/col major versions of BSR (#72898)
Note, this is a redo of https://github.com/llvm/llvm-project/pull/72712,
which was reverted due to timeouts on the bot. I have timed the tests
with various settings, and it does not even hit the top 20 of integration
tests. To be safe, I removed the SIMD version of the tests, keeping just
the libgen/directIR paths (which are the most important for us to test).

I will also keep an eye on
https://lab.llvm.org/buildbot/#/builders/264/builds after submitting to
make sure there is no repeat.
2023-11-20 12:28:16 -08:00
Mehdi Amini
2b71f91b06 Revert "[mlir][sparse] stress test BSR" (#72735)
Reverts llvm/llvm-project#72712

This causes timeouts on the bots.
2023-11-17 19:06:49 -08:00
Aart Bik
813aaf39f9 [mlir][sparse] stress test BSR (#72712)
I always enjoy a good stress test. This end-to-end integration test
ensures that the major ordering of both the blocks and the elements within
each block is handled correctly (giving row-row, row-col, col-row, and
col-col as options).
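
For illustration, a hedged sketch of one such blocked encoding (the row-row
variant); the alias, block size, and shapes are assumptions, and permuting the
outer (block) and inner (within-block) level pairs yields the other orderings:

```mlir
#BSR_row_row = #sparse_tensor.encoding<{
  map = (i, j) -> (i floordiv 2 : dense,
                   j floordiv 2 : compressed,
                   i mod 2     : dense,
                   j mod 2     : dense)
}>

func.func @pass_through(%a: tensor<4x4xf64, #BSR_row_row>)
    -> tensor<4x4xf64, #BSR_row_row> {
  return %a : tensor<4x4xf64, #BSR_row_row>
}
```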
2023-11-17 15:47:38 -08:00
Aart Bik
6b56dd6a93 [mlir][sparse] enable 2:4 test for both directIR/libgen path (#72593) 2023-11-17 09:40:32 -08:00
Aart Bik
83cf0dc982 [mlir][sparse] implement direct IR alloc/empty/new for non-permutations (#72585)
This change implements the correct *level* size setup for the direct
IR codegen fields in the sparse storage scheme. This brings libgen and
codegen together again.

This is step 3 out of 3 to make sparse_tensor.new work for BSR
2023-11-16 17:17:41 -08:00
Aart Bik
5535e48be2 [mlir][sparse] Capitalize class comment (#72436) 2023-11-15 13:04:27 -08:00
Aart Bik
58090617c6 [mlir][sparse] fix broken test (merge conflict marker was left) (#72438) 2023-11-15 13:01:43 -08:00
Tim Harvey
dce7a7cf69 Changed all code and comments that used the phrase "sparse compiler" to instead use "sparsifier" (#71875)
The changes in this PR mostly center around the tests that use the
flag sparse_compiler (also: sparse-compiler).
2023-11-15 20:12:35 +00:00
Aart Bik
a89c15aa2e [mlir][sparse] enable Python BSR test (#72325) 2023-11-14 15:35:03 -08:00
Aart Bik
a40900211a [mlir][sparse] set rwx permissions to consistent values (#72311)
Some files had the "x" permission set; others were missing "r".
2023-11-14 13:32:55 -08:00
Aart Bik
5f32bcfbae [mlir][sparse][gpu] re-enable all GPU libgen tests (#72185)
The previous change no longer properly used the GPU libgen pass (even though
most tests still passed by falling back to the CPU). This revision puts the
proper pass order into place, along with a bit of cleanup of the CPU codegen
vs. libgen setup.
2023-11-14 09:06:15 -08:00
Benjamin Maxwell
783ac3b6fb [mlir][ArmSME] Make use of backend function attributes for enabling ZA storage (#71044)
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).

Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.

Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.

A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
2023-11-14 12:50:38 +00:00
Peiming Liu
269685545e [mlir][sparse] remove filter-loop based algorithm support to handle affine subscript expressions (#71840)
2023-11-13 11:36:49 -08:00
Aart Bik
af8428c0d9 [mlir][sparse] unify support of (dis)assemble between direct IR/lib path (#71880)
Note that the (dis)assemble operations still make some simplifying
assumptions (e.g. trailing 2-D COO in AoS format), but now at least both
the direct IR and support library paths behave exactly the same.

Generalizing the ops is still TBD.
2023-11-13 10:05:00 -08:00
Peiming Liu
bfe08c094d [mlir][sparse] support sparsifying 2:4 block sparsity (#71749) 2023-11-10 12:25:53 -08:00
Guray Ozen
51916f0c92 [mlir] Add sm_90a GEMM test 128x128x128 (F32 += F16 * F16) (#69913)
This PR adds a test that performs GEMM 128x128x128 (F32 += F16 * F16).
It uses `sm_90a` features in NVGPU dialect.

Simplified algorithm is as follows:

**Prologue** 
```
mgroup = mbarriers.init x 2
tma.load ... shmem_buffer_lhs<0 x 128 x 64>
tma.load ... shmem_buffer_rhs<0 x 64 x 64>
tma.load ... shmem_buffer_rhs<0 x 64 x 64>
mbarrier.expect_tx 32768
tma.load ... shmem_buffer_lhs<1 x 128 x 64>
tma.load ... shmem_buffer_rhs<1 x 64 x 64>
tma.load ... shmem_buffer_rhs<1 x 64 x 64>
mbarrier.expect_tx 32768
```
**Mainloop**
```
matrixD = 
 for(i = 0;...2) {   
   mbarrier.try_wait [i]
   lhs = shmem_buffer_lhs<pipe x 128 x 64>
   rhs = shmem_buffer_rhs<pipe x 64 x 128>
   yield nvgpu.warpgroup.mma (lhs, rhs)

//   Expanded : nvgpu.warpgroup.mma [128][128]+=[128][64]*[64][128]
//                  wgmma.m64n128k16(A[0:64][0:16]  *  B[0:16][0:128])
//                  wgmma.m64n128k16(A[0:64][16:32] *  B[16:32][0:128])
//                  wgmma.m64n128k16(A[0:64][32:48] *  B[32:48][0:128])
//                  wgmma.m64n128k16(A[0:64][48:64] *  B[48:64][0:128])
//                  wgmma.m64n128k16(A[64:128][0:16]  *  B[0:16][0:128])
//                  wgmma.m64n128k16(A[64:128][16:32] *  B[16:32][0:128])
//                  wgmma.m64n128k16(A[64:128][32:48] *  B[32:48][0:128])
//                  wgmma.m64n128k16(A[64:128][48:64] *  B[48:64][0:128])
```

**Epilogue** 
```
//reg->shmem
warpgroup.mma.store matrixD, shmem
//shmem->glbmem
parallel-for(i=0;...128)
 parallel-for(j=0;...128)
   store shmem, globalmem
```
2023-11-10 16:53:43 +01:00
Guray Ozen
a00caad6bf [mlir] Add sm_90a GEMM test 128x128x128 (F32 =F16*F16) with predicate (#70028)
PR #69913 added a GEMM test (128x128x128 F32 += F16 * F16) with an
if-statement. This PR adds the same test using predicates in PTX.
Predicate support is enabled using _BasicPtxBuilderInterface_
`(nvgpu.opcode ..., predicate = %pred)`.

The predicate condition is computed in `Step 2. [GPU] Elect fastest
thread in CTA` inspired by cutlass. It is as follows:
```
lane_predicate = nvvm.elect.sync
warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0)
warp_idx_in_warp_group = warp_idx % 4
predicate = (lane_predicate & warp_idx_in_warp_group)
```

Depends on #70027 #69934 #69935 #69584
2023-11-10 16:52:00 +01:00
Guray Ozen
f4d59522cf [mlir] Fix sm90 test for new verifier
#70923 improved the verifier. The verifier caught that the tensor map type in the TMA descriptor in this test isn't correct. The program was working correctly anyway since the offset is calculated correctly.

This work fixes the test.
2023-11-10 16:50:01 +01:00
Cullen Rhodes
fe8c649d01 [mlir][linalg] Add an e2e test for linalg.matmul_transpose_a to ArmSME (#71644)
This patch adds an integration test demonstrating the first e2e example
lowering a linalg.matmul to SME via vector.outerproduct.

The test uses a 'linalg.matmul_transpose_a' rather than 'linalg.matmul'
since the latter emits a 'vector.transfer_read' with a vector type of
'vector<[4]x1xf32>' that currently can't be lowered via the generic (SVE)
path, since it has a leading scalable dim.
2023-11-10 07:52:39 +00:00
Cullen Rhodes
4240b1790f [mlir][ArmSME] Lower transfer_write + transpose to vertical store (#71181)
This patch extends the lowering of vector.transfer_write in
VectorToArmSME to support in-flight transpose via SME vertical store.
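
A hedged sketch (types illustrative) of the kind of transpose feeding a
transfer_write that can now be lowered to an SME vertical store:

```mlir
func.func @write_transposed(%vec: vector<[4]x[4]xf32>, %dest: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  // The transpose is folded into the store (vertical/column-wise form).
  %t = vector.transpose %vec, [1, 0] : vector<[4]x[4]xf32> to vector<[4]x[4]xf32>
  vector.transfer_write %t, %dest[%c0, %c0] {in_bounds = [true, true]}
    : vector<[4]x[4]xf32>, memref<?x?xf32>
  return
}
```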
2023-11-10 07:51:06 +00:00
Peiming Liu
5a6ffc5503 [mlir][sparse] temporarily disable BSR GPU libgen tests. (#71870) 2023-11-09 13:54:02 -08:00
Peiming Liu
a2d9d2e1d9 [mlir][sparse] re-enable aarch64 test. (#71855)
Should have been fixed by initializing output tensor to zeros in
https://github.com/llvm/llvm-project/pull/71845
2023-11-09 11:46:52 -08:00
Peiming Liu
30e4b09d49 [mlir][sparse] try fix flanky test. (#71845) 2023-11-09 11:10:59 -08:00
Peiming Liu
4eb01f7d5e [mlir][sparse] disable aarch64 test to fix buildbot error. (#71818)
To fix https://github.com/llvm/llvm-project/pull/71448
2023-11-09 10:50:58 -08:00
Peiming Liu
c99951d491 [mlir][sparse] end-to-end matmul between Dense and BSR tensors (#71448) 2023-11-08 11:28:00 -08:00
Aart Bik
5ef446790f [mlir][sparse][gpu] cleanup GPUDataTransferStrategy (#71615)
The flag seems to be doing practically the same thing for zero-cost and
pinned DMA. In addition, the register-host approach is not truly the right
zero-cost mechanism, according to Thomas. So we are simplifying the setup
for now, until we have a better definition of what to implement and test.
    
https://github.com/llvm/llvm-project/issues/64316
2023-11-08 09:45:11 -08:00