Commit Graph

238 Commits

Author SHA1 Message Date
Alexander Belyaev
bfcd3fa825 [mlir] Add result name for gpu.block_id and gpu.thread_id ops. (#83393)
expand-arith-ops.mlir fails on Windows, but this is unrelated to this PR.
2024-02-29 10:57:09 +01:00
Guray Ozen
d7f59c8fb8 [mlir] Lower math dialect later in gpu-lower-to-nvvm-pipeline (#81489)
This PR moves the lowering of the math dialect later in the pipeline,
because the math dialect is lowered correctly by `createConvertGpuOpsToNVVMOps`
for GPU targets, and that pass needs to run first.

Reland #78556
2024-02-13 08:31:42 +01:00
Benjamin Kramer
98dbc688de Revert "[mlir] Lower math dialect later in gpu-lower-to-nvvm-pipeline (#78556)"
This reverts commit 74bf0b1cd9. The test
always fails.

 | mlir/test/Dialect/GPU/test-nvvm-pipeline.mlir:23:16: error: CHECK-PTX: expected string not found in input
 |  // CHECK-PTX: __nv_expf

https://lab.llvm.org/buildbot/#/builders/61/builds/53789
2024-01-31 17:41:21 +01:00
Guray Ozen
74bf0b1cd9 [mlir] Lower math dialect later in gpu-lower-to-nvvm-pipeline (#78556)
This PR moves the lowering of the math dialect later in the pipeline,
because the math dialect is lowered correctly by `createConvertGpuOpsToNVVMOps`
for GPU targets, and that pass needs to run first.
2024-01-31 15:24:32 +01:00
Matthias Springer
ce7cc723b9 [mlir][memref] memref.subview: Verify result strides
The `memref.subview` verifier currently checks result shape, element type, memory space and offset of the result type. However, the strides of the result type are currently not verified. This commit adds verification of result strides for non-rank reducing ops and fixes invalid IR in test cases.

Verification of result strides for ops with rank reductions is more complex (and there could be multiple possible result types). That is left for a separate commit.

Also refactor the implementation a bit:
* If `computeMemRefRankReductionMask` could not compute the dropped dimensions, there must be something wrong with the op. Return `FailureOr` instead of `std::optional`.
* `isRankReducedMemRefType` did much more than just checking whether the op has rank reductions or not. Inline the implementation into the verifier and add better comments.
* `produceSubViewErrorMsg` does not have to be templatized.
* Fix comment and add additional assert to `ExpandStridedMetadata.cpp`, to make sure that the memref.subview verifier is in sync with the memref.subview -> memref.reinterpret_cast lowering.

Note: This change is identical to #79865, but with a fixed comment and an additional assert in `ExpandStridedMetadata.cpp`. (I reverted #79865 in #80116, but the implementation was actually correct, just the comment in `ExpandStridedMetadata.cpp` was confusing.)
2024-01-31 09:28:53 +00:00
Matthias Springer
96c907dbce Revert "[mlir][memref] memref.subview: Verify result strides" (#80116)
Reverts llvm/llvm-project#79865

I think there is a bug in the stride computation in
`SubViewOp::inferResultType`. (Was already there before this change.)

Reverting this commit for now and updating the original pull request
with a fix and more test cases.
2024-01-31 09:35:13 +01:00
Matthias Springer
db49319264 [mlir][memref] memref.subview: Verify result strides (#79865)
The `memref.subview` verifier currently checks result shape, element
type, memory space and offset of the result type. However, the strides
of the result type are currently not verified. This commit adds
verification of result strides for non-rank reducing ops and fixes
invalid IR in test cases.

Verification of result strides for ops with rank reductions is more
complex (and there could be multiple possible result types). That is
left for a separate commit.

Also refactor the implementation a bit:
* If `computeMemRefRankReductionMask` could not compute the dropped
dimensions, there must be something wrong with the op. Return
`FailureOr` instead of `std::optional`.
* `isRankReducedMemRefType` did much more than just checking whether the
op has rank reductions or not. Inline the implementation into the
verifier and add better comments.
* `produceSubViewErrorMsg` does not have to be templatized.
2024-01-31 09:14:48 +01:00
Matthias Springer
fbb62d449c [mlir][bufferization] Buffer deallocation: Make op preconditions stricter (#75127)
The buffer deallocation pass checks the IR ("operation preconditions")
to make sure that there is no IR that is unsupported. In such a case,
the pass signals a failure.

The pass now rejects all ops with unknown memory effects. We do not know
whether such an op allocates memory or not. Therefore, the buffer
deallocation pass does not know whether a deallocation op should be
inserted or not.

Memory effects are queried from the `MemoryEffectOpInterface` interface.
Ops that do not implement this interface but have the
`RecursiveMemoryEffects` trait do not have any side effects (apart from
the ones that their nested ops may have).

Unregistered ops are now rejected by the pass because they do not
implement the `MemoryEffectOpInterface` and neither do we know if they
have `RecursiveMemoryEffects` or not. All test cases that currently have
unregistered ops are updated to use registered ops.
2024-01-21 11:10:09 +01:00
Fabian Mora
5b4f2b906b [mlir][gpu] Add an offloading handler attribute to gpu.module (#78047)
This patch adds an optional offloading handler attribute to
the `gpu.module` op. This attribute will be used during the
`gpu-module-to-binary` pass to override the offloading handler used in
the `gpu.binary` op.
2024-01-15 16:58:10 -05:00
Fabian Mora
a1eaed7a21 [mlir][gpu] Fix GPU YieldOP format and traits (#78006)
This patch adds assembly format to `gpu::YieldOp`. It also adds the
return like trait, to make it compatible with `RegionBranchOpInterface`.
2024-01-14 21:19:20 -05:00
Guray Ozen
5b33cff397 [mlir][gpu] Add Support for Cluster of Thread Blocks in gpu.launch (#76924)
2024-01-06 11:17:01 +01:00
Andrzej Warzyński
ca5d34ec71 [mlir][TD] Fix the order of return handles (#76929)
Replace (in tests and docs):

    %forall, %tiled = transform.structured.tile_using_forall

with (updated order of return handles):

    %tiled, %forall = transform.structured.tile_using_forall

Similar change is applied to (in the TD tutorial):

    transform.structured.fuse_into_containing_op

This update makes sure that the tests/documentation are consistent with
the Op specifications. Follow-up for #67320 which updated the order of
the return handles for `tile_using_forall`.
2024-01-04 12:54:16 +00:00
Jakub Kuderski
c0345b4648 [mlir][gpu] Add subgroup_reduce to shuffle lowering (#76530)
This supports both the scalar and the vector multi-reduction cases.
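
As a rough illustration of how a reduction can be built from shuffles, the Python sketch below simulates a subgroup in which every lane repeatedly combines its value with the value held by another lane. It assumes an XOR-butterfly exchange pattern and a power-of-two subgroup size; the function name `subgroup_reduce` and the pattern are assumptions made for illustration, not taken from the patch.

```python
import operator

def subgroup_reduce(values, combine=operator.add):
    """Reduce one value per lane using XOR-butterfly exchanges.

    Assumes the subgroup size is a power of two. Returns the per-lane
    results and the number of shuffle rounds that were needed.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "subgroup size must be a power of two"
    lanes = list(values)
    rounds = 0
    offset = 1
    while offset < n:
        # Each lane reads the value currently held by lane (i XOR offset).
        shuffled = [lanes[i ^ offset] for i in range(n)]
        lanes = [combine(a, b) for a, b in zip(lanes, shuffled)]
        offset *= 2
        rounds += 1
    return lanes, rounds

lanes, rounds = subgroup_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(lanes, rounds)
```

In this model every lane ends up holding the full reduction after log2(N) rounds. The vector case can be modelled the same way by passing an element-wise `combine` function.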
2024-01-02 16:14:22 -05:00
Jakub Kuderski
2af186f9bd [mlir][gpu] Add patterns to break down subgroup reduce (#76271)
The new patterns break down subgroup reduce ops with vector values into
a sequence of subgroup reductions that fit the native shuffle size. The
maximum/native shuffle size is parametrized.

The overall goal is to be able to perform multi-element reductions with
a sequence of `gpu.shuffle` ops.
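
The breakdown can be pictured as slicing the vector into pieces that each fit one shuffle. The Python sketch below is only a model of that idea: the helper `break_down` and its parameters `element_bits` and `max_shuffle_bits` are hypothetical names, and the 32-bit default is an assumption.

```python
def break_down(vector, element_bits, max_shuffle_bits=32):
    """Split a vector into chunks that each fit in one shuffle.

    max_shuffle_bits is a hypothetical parameter standing in for the
    native shuffle size; 32 is an assumed default.
    """
    per_chunk = max(1, max_shuffle_bits // element_bits)
    return [vector[i:i + per_chunk] for i in range(0, len(vector), per_chunk)]

# Five 16-bit elements with a 32-bit shuffle: two full chunks and a remainder.
chunks = break_down([1.0, 2.0, 3.0, 4.0, 5.0], element_bits=16)
print(chunks)
```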
2023-12-28 14:39:46 -05:00
Jakub Kuderski
72003adf6b [mlir][gpu] Allow subgroup reductions over 1-d vector types (#76015)
Each vector element is reduced independently, which is a form of
multi-reduction.

The plan is to allow for gradual lowering of multi-reduction that
results in fewer `gpu.shuffle` ops at the end:
1d `vector.multi_reduction` --> 1d `gpu.subgroup_reduce` --> smaller 1d
`gpu.subgroup_reduce` --> packed `gpu.shuffle` over i32

For example we can perform 2 independent f16 reductions with a series of
`gpu.shuffles` over i32, reducing the final number of `gpu.shuffles` by 2x.
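
The 2x saving comes from two f16 values fitting in one 32-bit word. A minimal Python sketch of that packing follows; it uses the standard `struct` module with an assumed little-endian layout, and the helper names are hypothetical.

```python
import struct

def pack_f16x2(a, b):
    """Pack two half-precision floats into one 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<ee", a, b))[0]

def unpack_f16x2(word):
    """Recover the two half-precision floats from a 32-bit word."""
    return struct.unpack("<ee", struct.pack("<I", word))

# One 32-bit word carries both values, so one i32 shuffle moves two f16 lanes.
word = pack_f16x2(1.5, -2.0)
print(hex(word), unpack_f16x2(word))
```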
2023-12-21 11:55:43 -05:00
Matthias Springer
f7096428b4 [mlir][GPU] Add RecursiveMemoryEffects to gpu.launch (#75315)
Infer the side effects of `gpu.launch` from its body.
2023-12-20 15:25:25 +09:00
Jakub Kuderski
560564f51c [mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
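
To make the ambiguity concrete, the two arith semantics differ in how they treat NaN. The Python sketch below models that difference only; the function names mirror the arith op names for readability, and signed-zero handling is deliberately left out.

```python
import math

def maximumf(a, b):
    """NaN-propagating maximum: the result is NaN if either operand is NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)

def maxnumf(a, b):
    """Number-preferring maximum: a single NaN operand is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)

print(maximumf(1.0, math.nan), maxnumf(1.0, math.nan))
```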

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-12-20 00:14:43 -05:00
Fabian Mora
419c45a325 [mlir][gpu] Fix crash in gpu-module-to-binary (#75477)
This patch fixes the error in issue #75434. The crash was being caused
by not checking for a lack of target attributes in a GPU module. It's
now considered an error to invoke the pass with a GPU module with no
target attributes.
2023-12-14 14:03:10 -05:00
Jakub Kuderski
7eccd52842 Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit dd09221a29 and relands
https://github.com/llvm/llvm-project/pull/73423.

* Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.
2023-11-27 11:38:18 -05:00
Jakub Kuderski
dd09221a29 Revert "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit e0aac8c88d.

I'm seeing some nvidia integration test failures:
https://lab.llvm.org/buildbot/#/builders/61/builds/52334.
2023-11-27 11:29:23 -05:00
Jakub Kuderski
e0aac8c88d [mlir][gpu] Align reduction operations with vector combining kinds (#73423)
The motivation for this change is explained in
https://github.com/llvm/llvm-project/issues/72354.

Before this change, we could not tell between signed/unsigned
minimum/maximum and NaN treatment for floating point values.

The mapping of old reduction operations to the new ones is as follows:
*  `min` --> `minsi` for ints, `minf` for floats
*  `max` --> `maxsi` for ints, `maxf` for floats

New reduction kinds not represented in the old enum: `minui`, `maxui`,
`minimumf`, `maximumf`.
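
The mapping above is small enough to express as a lookup table. The Python sketch below is a hypothetical migration helper, not part of the change; it assumes that kinds other than `min` and `max` keep their names.

```python
# (old kind, operand is floating point) -> new kind, per the mapping above.
OLD_TO_NEW = {
    ("min", False): "minsi",
    ("min", True): "minf",
    ("max", False): "maxsi",
    ("max", True): "maxf",
}

def migrate_reduction_kind(old_kind, is_float):
    """Return the new reduction kind; unlisted kinds are assumed unchanged."""
    return OLD_TO_NEW.get((old_kind, is_float), old_kind)

print(migrate_reduction_kind("min", is_float=False))
print(migrate_reduction_kind("max", is_float=True))
```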

As a next step, I would like to have a common definition of combining
kinds used by the `vector` and `gpu` dialects. Separately, the GPU to
SPIR-V lowering does not yet properly handle zero and NaN values -- the
behavior of floating point min/max group reductions is not specified by
the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459. 

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-11-27 11:19:20 -05:00
Guray Ozen
edf5cae739 [mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clustering of Cooperative
Thread Arrays (CTA) to synchronize and communicate through shared memory
while running concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms
that we have in the GPU dialects such as outlining and kernel launching,
making it a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Guray Ozen
ea84897ba3 [mlir][gpu] Introduce gpu.dynamic_shared_memory Op (#71546)
While the `gpu.launch` Op allows setting the size via the
`dynamic_shared_memory_size` argument, accessing the dynamic shared
memory is very convoluted. This PR implements the proposed Op,
`gpu.dynamic_shared_memory`, which aims to simplify the use of
dynamic shared memory.

RFC:
https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/

**Proposal from RFC**
This PR proposes the `gpu.dynamic_shared_memory` Op to make efficient use
of the dynamic shared memory feature. It is a powerful feature that enables
the allocation of shared memory at runtime with the kernel launch on the
host. Afterwards, the memory can be accessed directly from the device. I
believe a similar story exists for AMDGPU.

**Current Way of Using Dynamic Shared Memory with MLIR**

Let me illustrate the challenges of using dynamic shared memory in MLIR
with an example below. The process involves several steps:
- `memref.global`: the 0-sized array that LLVM's NVPTX backend expects
- `dynamic_shared_memory_size`: sets the size of dynamic shared memory
- `memref.get_global`: accesses the global symbol
- `reinterpret_cast` and `subview`: many Ops for pointer arithmetic

```
// Step 1. Create 0-sized global symbol. Manually set the alignment
memref.global "private" @dynamicShmem  : memref<0xf16, 3> { alignment = 16 }
func.func @main() {
  // Step 2. Allocate shared memory
  gpu.launch blocks(...) threads(...)
    dynamic_shared_memory_size %c10000 {
    // Step 3. Access the global object
    %shmem = memref.get_global @dynamicShmem : memref<0xf16, 3>
    // Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations.
    %4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128],  strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3>
    %5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3>
    %6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3>
    %7 = memref.subview %6[0, 0][64, 64][1,1]  : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>
    %8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>
    // Step 5. Use
    "test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index)
    "test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index)
    gpu.terminator
  }
  return
}
```

With the new `gpu.dynamic_shared_memory` Op, the program above can be written as:

```
func.func @main() {
  gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 {
    %i = arith.constant 18 : index
    // Step 1: Obtain shared memory directly
    %shmem = gpu.dynamic_shared_memory : memref<?xi8, 3>
    %c147456 = arith.constant 147456 : index
    %c155648 = arith.constant 155648 : index
    %7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3>
    %8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3>

    // Step 2: Utilize the shared memory
    "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index)
    "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index)
    gpu.terminator
  }
  return
}
```

This PR resolves #72513
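
The byte offsets passed to `memref.view` in the second listing can be cross-checked against the element offsets in the first one. The Python sketch below is illustrative only and not part of the PR. The helper name `subview_offset` is hypothetical, and the calculation assumes that an f16 element occupies 2 bytes.

```python
F16_BYTES = 2  # assumption: an f16 element is 2 bytes wide

def subview_offset(base_offset, strides, starts):
    """Element offset of a subview: base + sum(start_i * stride_i)."""
    return base_offset + sum(s * st for s, st in zip(starts, strides))

# memref<14x64x128xf16> with strides [8192, 128, 1], as in the first listing.
off5 = subview_offset(0, [8192, 128, 1], [7, 0, 0])
off6 = subview_offset(off5, [8192, 128, 1], [2, 0, 0])
# %6 has rank 2 with strides [128, 1].
off7 = subview_offset(off6, [128, 1], [0, 0])
off8 = subview_offset(off6, [128, 1], [32, 0])

# Convert element offsets to the byte offsets used on the i8 buffer.
byte_offsets = [off7 * F16_BYTES, off8 * F16_BYTES]
print(off5, off6, off7, off8)
print(byte_offsets)
```

The element offsets match the `strided<..., offset: ...>` annotations in the first listing, and the byte offsets match the constants `%c147456` and `%c155648` in the second.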
2023-11-16 14:42:17 +01:00
drazi
9a3d3c7093 generalize pass gpu-kernel-outlining for symbol op (#72074)
This PR generalizes the gpu-kernel-outlining pass to handle ops that implement
`SymbolOpInterface` instead of just `func::FuncOp`.

Before this change, the gpu-kernel-outlining pass skipped `llvm.func`:
```mlir
module {
  llvm.func @main() {
    %c1 = arith.constant 1 : index
    gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
      gpu.terminator
    }
    llvm.return
  }
}
```

After this change, the gpu-kernel-outlining pass can handle `llvm.func` as well.
2023-11-12 21:48:49 -08:00
spaceotter
00c3c73189 [mlir][gpu] Separate the barrier elimination code from transform ops (#71762)
Allows the barrier elimination code to be run from C++ as well. The code
from the transform dialect is copied as-is; the pass and populate functions
have been added at the end.

Co-authored-by: Eric Eaton <eric@nod-labs.com>
2023-11-10 17:59:09 -08:00
spaceotter
51af040b22 [mlir][gpu] Eliminate redundant gpu.barrier ops (#71575)
Adds a canonicalizer for gpu.barrier that gets rid of duplicates.
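
A toy model of the idea, in Python: treat a block as a list of op names and drop a barrier when it immediately follows another barrier. This sketch assumes that "duplicates" means directly adjacent barriers; the function name is hypothetical, and the real canonicalization pattern may differ in detail.

```python
def drop_redundant_barriers(ops):
    """Remove a gpu.barrier that directly follows another gpu.barrier."""
    result = []
    for op in ops:
        if op == "gpu.barrier" and result and result[-1] == "gpu.barrier":
            continue  # the previous barrier already synchronizes here
        result.append(op)
    return result

ops = ["memref.store", "gpu.barrier", "gpu.barrier", "memref.load", "gpu.barrier"]
cleaned = drop_redundant_barriers(ops)
print(cleaned)
```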

Co-authored-by: Eric Eaton <eric@nod-labs.com>
2023-11-09 18:06:20 -05:00
Fabian Mora
42630689e2 [mlir][gpu] Clean GPU Passes.h from external SPIRV includes (#71331)
Removes the `SPIRVAttributes.h` header from `GPU/Transforms/Passes.h`
2023-11-05 17:06:04 -08:00
Sang Ik Lee
2dace04521 [mlir][spirv] Implement gpu::TargetAttrInterface (#69949)
This commit implements gpu::TargetAttrInterface for SPIR-V target
attribute. The plan is to use this to enable GPU compilation pipeline
for OpenCL kernels later.

The changes do not impact Vulkan shaders using mlir-vulkan-runner.
A new GPU dialect transform pass, spirv-attach-target, is implemented for
attaching the attribute from the CLI.

The gpu-module-to-binary pass now works with a GPU module that contains a
SPIR-V module with OpenCL kernel functions inside.
2023-11-05 08:11:53 -08:00
Christian Ulmann
7ed96b1c0d [MLIR][LLVM] Remove last typed pointer remnants from tests (#71232)
This commit removes all LLVM dialect typed pointers from the lit tests.
Typed pointers have been deprecated for a while now and it's planned to
soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-11-04 14:13:31 +01:00
Oleksandr "Alex" Zinenko
e4384149b5 [mlir] use transform-interpreter in test passes (#70040)
Update most test passes to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter-pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical, rewriting an unnamed sequence into
a named one and wrapping the transform IR into a module when necessary.

Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.

Only the test in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of testing coverage of the transform interpreter logic.
2023-10-24 16:12:34 +02:00
Aviad Cohen
7060422265 [mlir][Linalg]: Optimize linalg generic in transform::PromoteOp to avoid unnecessary copies (#68555)
If the operands are not used in the payload of linalg generic operations, there is no need to copy them before the operation.
2023-10-14 10:40:45 +03:00
Aart Bik
39038177ee [mlir][sparse][gpu] add CSC and BSR format to cuSparse GPU ops (#67509)
This adds two cuSparse formats to the GPU dialect support, together with
proper lowering and runtime CUDA support. Also fixes a few minor
omissions.
2023-09-27 09:32:25 -07:00
Oleksandr "Alex" Zinenko
96ff0255f2 [mlir] cleanup of structured.tile* transform ops (#67320)
Rename and restructure tiling-related transform ops from the structured
extension to be more homogeneous. In particular, all ops now follow a
consistent naming scheme:

 - `transform.structured.tile_using_for`;
 - `transform.structured.tile_using_forall`;
 - `transform.structured.tile_reduction_using_for`;
 - `transform.structured.tile_reduction_using_forall`.

This drops the "_op" naming artifact from `tile_to_forall_op` that
shouldn't have been included in the first place, consistently specifies
the name of the control flow op to be produced for loops (instead of
`tile_reduction_using_scf` since `scf.forall` also belongs to `scf`),
and opts for the `using` connector to avoid ambiguity.

The loops produced by tiling are now systematically placed as *trailing*
results of the transform op. While this required changing 3 out of 4 ops
(except for `tile_using_for`), this is the only choice that makes sense
when producing multiple `scf.for` ops that can be associated with a
variadic number of handles. This choice is also most consistent with
*other* transform ops from the structured extension, in particular with
fusion ops, that produce the structured op as the leading result and the
loop as the trailing result.
2023-09-26 09:14:29 +02:00
Tobias Gysi
85175edd4e [mlir][llvm] Replace NullOp by ZeroOp (#67183)
This revision replaces the LLVM dialect NullOp by the recently
introduced ZeroOp. The ZeroOp is more generic in the sense that it
represents zero values of any LLVM type rather than null pointers only.

This is a follow-up to https://github.com/llvm/llvm-project/pull/65508
2023-09-25 11:11:52 +02:00
Martin Erhart
522c1d0eea [mlir][gpu][bufferization] Implement BufferDeallocationOpInterface for gpu.terminator (#66880)
This is necessary to support deallocation of IR with gpu.launch
operations because it does not implement the RegionBranchOpInterface.
Implementing the interface would require it to support regions with
unstructured control flow and produced arguments/results.
2023-09-20 12:28:28 +02:00
Fabian Mora
5093413a50 [mlir][gpu][NVPTX] Enable NVIDIA GPU JIT compilation path (#66220)
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to add any possible
extra options.

2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.

3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.

4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
2023-09-14 18:00:27 -04:00
Nicolas Vasilache
92f088d335 [mlir][gpu][transform] Provide better error messages and avoid crashing in MapForallToBlocks.
This revision addresses issues surfaced in https://reviews.llvm.org/D159093
2023-09-04 14:11:38 +00:00
Aart Bik
289f7231f9 [mlir][sparse][gpu] minor code cleanup for sparse gpu ops
Consistent order of ops and related methods.
Also, renamed SpGEMMGetSizeOp to SpMatGetSizeOp
since this is a general utility for sparse matrices,
not specific to GEMM ops only.

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D157922
2023-08-14 15:08:57 -07:00
Fabian Mora
43752a2aa3 [mlir][gpu] Add the gpu-module-to-binary pass.
**For an explanation of these patches see D154153.**

Commit message:
This pass converts GPU modules into GPU binaries, serializing all targets present
in a GPU module by invoking the `serializeToObject` target attribute method.

Depends on D154147

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154149
2023-08-12 00:24:53 +00:00
Fabian Mora
8ae074b195 [mlir][gpu] Add the Select Object compilation attribute.
**For an explanation of these patches see D154153.**

Commit message:
This patch adds the default offloading handler for GPU binary ops: `#gpu.select_object`,
it selects the object to embed based on an index or a target attribute, embedding
the object as a global string and launches the kernel using the scheme used in the
GPU to LLVM pass.

Depends on D154137

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154147
2023-08-11 22:00:35 +00:00
Fabian Mora
a63db3f5f5 [mlir][gpu] Modifies gpu.launch_func to allow lowering it after gpu-to-llvm.
**For an explanation of these patches see D154153.**

Commit message:
In order to lower `gpu.launch_func` after running `gpu-to-llvm`, it must be
able to handle lowered types (e.g., index -> i64). This patch also allows the op
to refer to GPU binaries and not only GPU modules.

Depends on D154132.

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154137
2023-08-11 21:56:37 +00:00
Fabian Mora
068213130d [mlir][ROCDL] Adds the ROCDL target attribute.
**For an explanation of these patches see D154153.**

Commit message:
This patch adds the ROCDL target attribute for serializing GPU modules into
strings containing HSAco.

Depends on D154117

Differential Revision: https://reviews.llvm.org/D154129
2023-08-11 21:44:05 +00:00
Fabian Mora
1e77536e1d Revert "[mlir][ROCDL] Adds the ROCDL target attribute."
This reverts commit 6a0feb1503.
2023-08-11 19:50:05 +00:00
Fabian Mora
bf24fb81ac [mlir][gpu] Add gpu.binary op and #gpu.object attribute.
**For an explanation of these patches see D154153.**

Commit message:
Adds the `#gpu.object` attribute for holding a binary object and the target
attribute used to create it. Also adds the `gpu.binary` operation used to
store GPU objects.

Depends on D154108

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154132
2023-08-11 19:48:18 +00:00
Fabian Mora
6a0feb1503 [mlir][ROCDL] Adds the ROCDL target attribute.
**For an explanation of these patches see D154153.**

Commit message:
This patch adds the ROCDL target attribute for serializing GPU modules into
strings containing HSAco.

Depends on D154117

Reviewed By: mehdi_amini, krzysz00

Differential Revision: https://reviews.llvm.org/D154129
2023-08-11 19:43:59 +00:00
Aart Bik
6c4cd7a13e [mlir][sparse][gpu] refine sparse gpu round-trip and lowering test
Tests had become inconsistent and contained a few slip-ups
(e.g. non-async versions did not lower).

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157666
2023-08-10 17:18:59 -07:00
Aart Bik
95a6c509c9 [mlir][sparse][gpu] add set csr pointers, remove estimate op, fix bugs
Rationale:
Since we only support default algorithm for SpGEMM, we can remove the
estimate op (for now at least). This also introduces the set csr pointers
op, and fixes a few bugs in the existing lowering for the SpGEMM breakdown.
This revision paves the way for actual recognition of SpGEMM in the sparsifier.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157645
2023-08-10 13:52:47 -07:00
Ivan Butygin
793ee2bf08 [mlir][gpu] Add DecomposeMemrefsPass
Some GPU backends (SPIR-V) lower memrefs to bare pointers, so lowering fails for dynamically sized/strided memrefs.
This pass extracts sizes and strides via `memref.extract_strided_metadata` outside the `gpu.launch` body, does the index/offset calculation explicitly, and then reconstructs memrefs via `memref.reinterpret_cast`.

`memref.reinterpret_cast` is then lowered via https://reviews.llvm.org/D155011

Differential Revision: https://reviews.llvm.org/D155247
2023-08-10 22:28:05 +02:00
Mehdi Amini
363b655920 Finish renaming getOperandSegmentSizeAttr() from operand_segment_sizes to operandSegmentSizes
This renaming started with the native ODS support for properties; this commit completes it.

A mass automated textual rename seems safe for most codebases.
Also drop the ods prefix to keep the accessors the same as they were before
this change:
 properties.odsOperandSegmentSizes
reverts back to:
 properties.operandSegmentSizes

The ODS prefix was creating divergence between all the places and made it harder to
be consistent.

Reviewed By: jpienaar

Differential Revision: https://reviews.llvm.org/D157173
2023-08-09 19:37:01 -07:00
Ivan Butygin
b13248f997 Revert "[mlir][gpu] Add DecomposeMemrefsPass"
Broke some bots

This reverts commit 2b5b2bfef1.
2023-08-10 03:07:28 +02:00