Commit Graph

222 Commits

Author SHA1 Message Date
Jakub Kuderski
971b852546 [mlir][NFC] Simplify type checks with isa predicates (#87183)
For more context on isa predicates, see:
https://github.com/llvm/llvm-project/pull/83753.
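
As a hedged illustration (the helper and types below are hypothetical, not
taken from the patch), such a predicate collapses a chain of single-type
checks into one call:

```
#include "mlir/IR/BuiltinTypes.h"

// `isa` accepts several types at once, so the usual chain of
// `isa<A>(t) || isa<B>(t) || ...` becomes a single predicate call.
static bool isScalarLike(mlir::Type type) {
  return mlir::isa<mlir::IntegerType, mlir::FloatType, mlir::IndexType>(type);
}
```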
2024-04-01 11:40:09 -04:00
Andrei Golubev
89cd345667 [mlir][LLVM] Use int32_t to indirectly construct GEPArg (#79562)
GEPArg can only be constructed from int32_t and mlir::Value. Explicitly
cast other types (e.g. unsigned, size_t) to int32_t to avoid narrowing
conversion warnings on MSVC. Some recent examples of such warnings are:

```
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'size_t' to 'T' requires a narrowing
conversion
    with
    [
        T=mlir::LLVM::GEPArg
    ]

mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'unsigned int' to 'T' requires a narrowing
conversion
    with
    [
        T=mlir::LLVM::GEPArg
    ]
```
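
A minimal sketch of the fix pattern (function and variable names hypothetical):

```
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"

// Narrow explicitly to int32_t before constructing a GEPArg, so MSVC does
// not flag a narrowing conversion from size_t or unsigned.
static void addFieldIndex(size_t fieldIndex,
                          llvm::SmallVectorImpl<mlir::LLVM::GEPArg> &args) {
  args.push_back(mlir::LLVM::GEPArg(static_cast<int32_t>(fieldIndex)));
}
```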

Co-authored-by: Nikita Kudriavtsev <nikita.kudriavtsev@intel.com>
2024-01-26 14:27:51 +01:00
Matthias Springer
5fcf907b34 [mlir][IR] Rename "update root" to "modify op" in rewriter API (#78260)
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`

The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.

Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
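
For reference, a minimal sketch of the renamed entry point (the attribute
mutation is a hypothetical example); the usage pattern itself is unchanged:

```
// Wrap any in-place mutation so the rewriter is notified; this was
// previously spelled rewriter.updateRootInPlace(op, ...).
rewriter.modifyOpInPlace(op, [&] {
  op->setAttr("my_attr", rewriter.getUnitAttr()); // hypothetical mutation
});
```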
2024-01-17 11:08:59 +01:00
Guray Ozen
2aec7083ad [mlir][gpu] Use DenseI32Array for NVVM's maxntid and reqntid (NFC) (#77466) 2024-01-09 16:44:25 +01:00
Guray Ozen
763109e346 [mlir][gpu] Use known_block_size to set maxntid for NVVM target (#77301)
Setting the thread block size with `maxntid` on the kernel has great
performance benefits: it lets the downstream PTX compiler do better
register allocation.

MLIR's `gpu.launch` and `gpu.launch_func` already have an attribute
(`known_block_size`) that records the thread block size when it is known.
This PR simply uses this attribute to set `maxntid`.
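
A hedged sketch of the idea (attribute names as used around this time; not
the verbatim lowering code):

```
// If the block size is known, forward it to the LLVM function as NVVM's
// maxntid so the downstream PTX compiler can allocate registers accordingly.
if (auto knownBlockSize =
        gpuFuncOp->getAttrOfType<mlir::DenseI32ArrayAttr>("known_block_size"))
  llvmFuncOp->setAttr("nvvm.maxntid", knownBlockSize);
```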
2024-01-08 14:49:19 +01:00
Krzysztof Drewniak
ddd6acd7a8 [mlir][GPU] Expand LLVM function attribute copies (#76755)
Expand the copying of attributes on GPU kernel arguments during LLVM
lowering.

Support copying attributes from values that are already LLVM pointers.

Support copying attributes, like `noundef`, that aren't specific to (the
pointer parts of) arguments.
2024-01-03 14:28:15 -06:00
Paul C Fuqua
11141bc68a Fix what seems to be a silly bug in gpu.set_default_device rewriting. Smoke test included. (#75756) 2023-12-20 09:35:42 -06:00
Mehdi Amini
6ac80a7677 Apply clang-tidy fixes for readability-identifier-naming in GPUToLLVMConversion.cpp (NFC) 2023-12-07 21:39:25 -08:00
Mehdi Amini
9415fca848 [mlir] Fix build with shared libs (missing cmake link dependency) (NFC) 2023-11-29 12:17:52 -08:00
Mehdi Amini
9e7b6f46ba [mlir] Adopt ConvertToLLVMPatternInterface GpuToLLVMConversionPass to align with convert-to-llvm (#73761)
This is a follow-up to the introduction of `convert-to-llvm`: it is
supposed to be a unifying pass built on the
`ConvertToLLVMPatternInterface`, but some specific conversions (like the
GPU target) aren't vanilla LLVM targets. Instead they need extra
customizations that are specific to LLVM-on-GPUs and our custom runtime
wrappers.
This change makes the GpuToLLVMConversionPass just as pluggable as
`convert-to-llvm` by using the same mechanism.
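
A rough sketch of the plug-in mechanism (the method signature is assumed,
not verified against the tree):

```
// A dialect-attached interface contributes its lowering patterns whenever a
// pass built on ConvertToLLVMPatternInterface runs, including this GPU pass.
struct MyDialectToLLVMInterface : public mlir::ConvertToLLVMPatternInterface {
  using ConvertToLLVMPatternInterface::ConvertToLLVMPatternInterface;
  void populateConvertToLLVMConversionPatterns(
      mlir::ConversionTarget &target, mlir::LLVMTypeConverter &typeConverter,
      mlir::RewritePatternSet &patterns) const final {
    // Register dialect-specific patterns into `patterns` here.
  }
};
```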
2023-11-29 11:37:53 -08:00
Guray Ozen
edf5cae739 [mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clustering of Cooperative
Thread Arrays (CTAs) so they can synchronize and communicate through shared
memory while running concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms
that we already have in the GPU dialect, such as outlining and kernel
launching, making it a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimension Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in the `gpu.launch` Op in an upcoming PR.

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Mehdi Amini
e204b9198a Apply clang-tidy fixes for llvm-else-after-return in GPUToLLVMConversion.cpp (NFC) 2023-11-20 01:40:49 -08:00
Guray Ozen
ea84897ba3 [mlir][gpu] Introduce gpu.dynamic_shared_memory Op (#71546)
While the `gpu.launch` Op allows setting the size via the
`dynamic_shared_memory_size` argument, accessing the dynamic shared
memory is very convoluted. This PR implements the proposed Op,
`gpu.dynamic_shared_memory`, which aims to simplify the use of
dynamic shared memory.

RFC:
https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/

**Proposal from RFC**
This PR introduces the `gpu.dynamic_shared_memory` Op to use the dynamic
shared memory feature efficiently. It is a powerful feature that enables
the allocation of shared memory at runtime with the kernel launch on the
host. Afterwards, the memory can be accessed directly from the device. I
believe a similar story exists for AMDGPU.

**Current way Using Dynamic Shared Memory with MLIR**

Let me illustrate the challenges of using dynamic shared memory in MLIR
with the example below. The process involves several steps:
- `memref.global`: create the 0-sized array that LLVM's NVPTX backend expects
- `dynamic_shared_memory_size`: set the size of dynamic shared memory
- `memref.get_global`: access the global symbol
- `reinterpret_cast` and `subview`: many Ops for pointer arithmetic

```
// Step 1. Create 0-sized global symbol. Manually set the alignment
memref.global "private" @dynamicShmem  : memref<0xf16, 3> { alignment = 16 }
func.func @main() {
  // Step 2. Allocate shared memory
  gpu.launch blocks(...) threads(...)
    dynamic_shared_memory_size %c10000 {
    // Step 3. Access the global object
    %shmem = memref.get_global @dynamicShmem : memref<0xf16, 3>
    // Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations.
    %4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128],  strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3>
    %5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3>
    %6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3>
    %7 = memref.subview %6[0, 0][64, 64][1,1]  : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>
    %8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>
    // Step 5. Use
    "test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index)
    "test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index)
    gpu.terminator
  }
```

Let’s rewrite the program above with the new Op:

```
func.func @main() {
    gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 {
    	%i = arith.constant 18 : index
        // Step 1: Obtain shared memory directly
        %shmem = gpu.dynamic_shared_memory : memref<?xi8, 3>
        %c147456 = arith.constant 147456 : index
        %c155648 = arith.constant 155648 : index
        %7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3>
        %8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3>

        // Step 2: Utilize the shared memory
        "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index)
        "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index)
    }
}
```

This PR resolves #72513
2023-11-16 14:42:17 +01:00
Mehdi Amini
d9dadfda85 Refactor ModuleToObject to offer more flexibility to subclass (NFC)
Some specific implementations of the offload may want more customization, and
even avoid using LLVM in-tree to dispatch the ISA translation to a custom
solution. This refactoring makes it possible for such implementations to work
without even configuring the target backend in LLVM.
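
As a hedged sketch of what this enables (class and hook names assumed, not
verified against the tree):

```
// A subclass can override the object-emission step and hand the LLVM module
// to an out-of-tree toolchain instead of an in-tree LLVM backend.
class CustomSerializer : public mlir::ModuleToObject {
public:
  using ModuleToObject::ModuleToObject;
  std::optional<llvm::SmallVector<char, 0>>
  moduleToObject(llvm::Module &module) override {
    // compileWithExternalToolchain is a hypothetical external compiler hook.
    return compileWithExternalToolchain(module);
  }
};
```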

Reviewers: fabianmcg

Reviewed By: fabianmcg

Pull Request: https://github.com/llvm/llvm-project/pull/71165
2023-11-03 13:41:45 -07:00
Christian Ulmann
97a238e863 [MLIR][LLVM] Remove typed pointer conversion utils (#71169)
This commit removes the no longer required typed pointer helpers from the
LLVM dialect conversion utils. Typed pointers have been deprecated for a
while now, and they are planned to be removed from the LLVM dialect soon.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-11-03 13:02:35 +01:00
Christian Ulmann
dbd4a0dd38 [MLIR][GPUCommon] Remove typed pointer support (#70735)
This commit removes GPUCommon's lowering support for typed pointers.
Typed pointers have been deprecated for a while now, and they are planned
to be removed from the LLVM dialect soon.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-10-31 09:22:44 +01:00
Nishant Patel
ced9f4f0e8 [MLIR] Modify lowering of gpu.alloc op to llvm (#69969)
If gpu.alloc has no async dependency (in case gpu.alloc is a
hostShared allocation), create a new stream and synchronize. This PR is a
follow-up to #66401
2023-10-25 22:00:47 +03:00
Christian Ulmann
484668c759 Reland "[MLIR][LLVM] Change addressof builders to use opaque pointers" (#69292)
This relands fbde19a664, which was broken due to incorrect GEP element type creation.

This commit changes the builders of the `llvm.mlir.addressof` operations
to no longer produce typed pointers.

As a consequence, a GPU to NVVM pattern had to be updated, that still
relied on typed pointers.
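
For context, a hedged sketch of GEP creation with opaque pointers (builder
arguments assumed): the element type must now be supplied explicitly because
it can no longer be read off the pointer type:

```
// With opaque pointers, the GEP op carries its element type explicitly.
mlir::Value base = rewriter.create<mlir::LLVM::AddressOfOp>(loc, global);
mlir::Value gep = rewriter.create<mlir::LLVM::GEPOp>(
    loc, mlir::LLVM::LLVMPointerType::get(rewriter.getContext()),
    global.getType(), base, llvm::ArrayRef<mlir::LLVM::GEPArg>{0, 0});
```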
2023-10-17 11:33:45 +02:00
Christian Ulmann
9397e5f581 Revert "[MLIR][LLVM] Change addressof builders to use opaque pointers (#69215)"
This reverts commit fbde19a664 due to
breaking integration tests.
2023-10-17 06:31:48 +00:00
Christian Ulmann
fbde19a664 [MLIR][LLVM] Change addressof builders to use opaque pointers (#69215)
This commit changes the builders of the `llvm.mlir.addressof` operations
to no longer produce typed pointers.

As a consequence, a GPU to NVVM pattern and the toy example LLVM lowerings had to be updated, as they still relied on typed pointers.
2023-10-17 07:55:00 +02:00
Aart Bik
39038177ee [mlir][sparse][gpu] add CSC and BSR format to cuSparse GPU ops (#67509)
This adds two cuSparse formats to the GPU dialect support, together with
proper lowering and runtime CUDA support. Also fixes a few minor
omissions.
2023-09-27 09:32:25 -07:00
Nishant Patel
1002a1d058 [MLIR] Pass hostShared flag in gpu.alloc op to runtime wrappers (#66401)
This PR is a breakdown of the big PR
https://github.com/llvm/llvm-project/pull/65539, which enables Intel GPU
integration. In this PR we pass the hostShared flag to the runtime wrappers
(required by the SyclRuntimeWrappers, which will come in a subsequent PR) to
indicate whether the allocation is done on host-shared GPU memory or
device-only memory.
2023-09-26 15:32:11 -07:00
Nishant Patel
ebfea261e6 [MLIR] Pass count of parameters & gpu binary size to runtime wrappers (#66154)
This PR is a breakdown of the big PR #65539, which enables Intel GPU
integration. In this PR we pass the parameter count and the GPU binary size
to the runtime wrappers, since the SyclRuntimeWrappers (which will come in a
subsequent PR) require the SPIR-V size for compilation as well as the number
of parameters to iterate over the params.
2023-09-26 11:27:07 -07:00
Tobias Gysi
85175edd4e [mlir][llvm] Replace NullOp by ZeroOp (#67183)
This revision replaces the LLVM dialect NullOp with the recently
introduced ZeroOp. The ZeroOp is more generic in the sense that it
represents zero values of any LLVM type rather than null pointers only.

This is a follow-up to https://github.com/llvm/llvm-project/pull/65508
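
A minimal sketch of the replacement at a use site (`ptrTy` hypothetical):

```
// Previously: rewriter.create<LLVM::NullOp>(loc, ptrTy). ZeroOp produces the
// same null pointer, and also zero values of non-pointer LLVM types.
mlir::Value zero = rewriter.create<mlir::LLVM::ZeroOp>(loc, ptrTy);
```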
2023-09-25 11:11:52 +02:00
stefankoncarevic
fbf67bfaf0 [mlir][GPU] Handle LLVM pointer attributes on memref arguments.
Handle pointer attributes (noalias, nonnull, readonly, writeonly,
dereferenceable, dereferenceable_or_null). The "noalias" attribute is
ignored for non-bare pointers.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D157082
2023-09-11 15:10:55 +00:00
Adrian Kuegel
583e78b372 [mlir] Apply ClangTidy fixes (NFC)
Prefer to use .empty() instead of checking size().
2023-08-23 17:51:11 +02:00
Matthias Springer
7f4dbd83dc [mlir][GPU][NFC] Remove type converter hack
Remove `dangerousSetOptions` and call `promoteOperands` with the correct arguments directly.

Differential Revision: https://reviews.llvm.org/D158175
2023-08-18 15:28:47 +02:00
Aart Bik
289f7231f9 [mlir][sparse][gpu] minor code cleanup for sparse gpu ops
Consistent order of ops and related methods.
Also, renamed SpGEMMGetSizeOp to SpMatGetSizeOp
since this is a general utility for sparse matrices,
not specific to GEMM ops only.

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D157922
2023-08-14 15:08:57 -07:00
Matthias Springer
ce254598b7 [mlir][Conversion] Store const type converter in ConversionPattern
ConversionPatterns do not (and should not) modify the type converter that they are using.

* Make `ConversionPattern::typeConverter` const.
* Make member functions of the `LLVMTypeConverter` const.
* Conversion patterns take a const type converter.
* Various helper functions (that are called from patterns) now also take a const type converter.

Differential Revision: https://reviews.llvm.org/D157601
2023-08-14 09:03:11 +02:00
Fabian Mora
fcfeb1e5b3 [mlir][gpu] Add GPU target support to gpu-to-llvm.
**For an explanation of these patches see D154153.**

This patch modifies the lowering of `gpu.module` & `gpu.launch_func` in the `gpu-to-llvm` pass,
allowing the usage of the new GPU compilation mechanism in the patch series ending in D154153.

Instead of removing modules, this patch preserves the module if it has
target attributes so that the `gpu-module-to-binary` pass can later
serialize it.

Instead of lowering the kernel calls to the LLVM dialect, this patch primarily updates the operation's
arguments, leaving the job of converting the operation into LLVM instructions to the translation stage.
The reason for not lowering the operation to LLVM at this stage is that kernel launches do not have a
single one-to-one representation in LLVM. For example, a kernel launch can be represented by a call
to a kernel stub, like in CUDA or HIP.
Kernel launches are also intrinsically linked to the binary associated with the call, and the binaries are
converted during translation.

Depends on D154149

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154152
2023-08-12 00:27:28 +00:00
Aart Bik
95a6c509c9 [mlir][sparse][gpu] add set csr pointers, remove estimate op, fix bugs
Rationale:
Since we only support default algorithm for SpGEMM, we can remove the
estimate op (for now at least). This also introduces the set csr pointers
op, and fixes a few bugs in the existing lowering for the SpGEMM breakdown.
This revision paves the way for actual recognition of SpGEMM in the sparsifier.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157645
2023-08-10 13:52:47 -07:00
Nicolas Vasilache
888717e853 [mlir][transform] Enable gpu-to-nvvm via conversion patterns driven by TD
This revision untangles a few more conversion pieces and allows rewriting
the relatively intricate (and somewhat inconsistent) LowerGpuOpsToNVVMOpsPass
in a declarative fashion that provides much better understanding and control.

Differential Revision: https://reviews.llvm.org/D157617
2023-08-10 15:30:48 +00:00
Aart Bik
e7e4ed0d7a [mlir][sparse][gpu] only support default algorithm for SpGEMM
Rationale:
This is the approach taken for all the others too (SpMV, SpMM, SDDMM),
so it is more consistent to follow the same path (until we have a need
for more algorithms). Also, in a follow up revision, this will allow
us to remove some unused GEMM ops.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157542
2023-08-09 12:49:47 -07:00
Aart Bik
9dfd3c3247 [mlir][sparse][gpu] reduce boilerplate class declarations
A macro is used to avoid repeating the same pattern many times.
Also fixed the ordering of ops to be consistent.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157419
2023-08-08 10:42:57 -07:00
Kun Wu
dfe2942909 [mlir][sparse][gpu] add spgemm operator
Differential Revision: https://reviews.llvm.org/D152981
2023-08-08 00:29:23 +00:00
Alex Zinenko
e98e59955e Revert "Foo"
This reverts commit 3c9aa10c57.

No proper description of the commit.
2023-08-04 13:30:12 +00:00
Nicolas Vasilache
3c9aa10c57 Foo 2023-08-04 11:06:17 +00:00
Nicolas Vasilache
620e2bb20c [mlir][LLVM] NFC - Remove createIndexConstant method
This revision removes the createIndexConstant method, which implicitly created
constants of the getIndexType type, and updates all uses to the more explicit
createIndexAttrConstant, which requires an explicit Type parameter.
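
A hedged sketch of the two spellings (signature as in ConvertToLLVMPattern,
assumed):

```
// Before (index type implied by the converter's global setting):
//   Value c = createIndexConstant(rewriter, loc, /*value=*/0);
// After (the index type is chosen explicitly at each use):
mlir::Value c = createIndexAttrConstant(
    rewriter, loc, getTypeConverter()->getIndexType(), /*value=*/0);
```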

This is an NFC step towards untangling index type conversion in LLVM lowering.

The selection of which index type to use requires finer granularity than the
existing implementations, which all rely on pass-level flags and end up with
mismatches, especially on GPUs with multiple address spaces of different
capacities.

This revision also includes an NFC fix to MemRefToLLVM.cpp that prevents a crash in cases where
an integer memory space cannot be derived for a MemRef.

Differential Revision: https://reviews.llvm.org/D156854
2023-08-02 07:24:29 +00:00
Kun Wu
1e491c425b [mlir][sparse][gpu] add 2:4 spmm prune_and_check flag
Differential Revision: https://reviews.llvm.org/D155909
2023-08-01 18:24:18 +00:00
Nicolas Vasilache
67754a9dc4 [mlir][gpu] NFC - Fail gracefully when type conversion fails instead of crashing 2023-07-28 21:03:52 +00:00
Guray Ozen
e56d6745f7 [mlir][nvgpu] Add tma.create.descriptor to create tensor map descriptor
The Op creates a tensor map descriptor object representing the tiled memory
region. The descriptor is used by the Tensor Memory Accelerator (TMA). The
`tensor` is the source tensor to be tiled. The `boxDimensions` is the size of
the tiled memory region in each dimension.

The pattern here lowers `tma.create.descriptor` to a runtime function call
that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more
information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

Depends on D155453

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155680
2023-07-21 11:33:04 +02:00
Aart Bik
86eff489e7 [mlir][sparse][gpu] force 16-byte alignment on data structs for cuSparseLt
Also makes some minor consistency edits in the cuSparseLt wrapper lib.

Reviewed By: Peiming, K-Wu

Differential Revision: https://reviews.llvm.org/D155139
2023-07-13 10:45:15 -07:00
Christopher Bate
14858cf05d [mlir][Conversion/GPUCommon] Fix bug in conversion of math ops
The common GPU operation transformation that lowers `math` operations
to function calls in the `gpu-to-nvvm` and `gpu-to-rocdl` passes handles
`vector` types by applying the function to each scalar and returning a
new vector. However, there was a typo that resulted in incorrectly
accumulating the result vector, so the rewrite returned an `llvm.mlir.undef`
result instead of the correct vector. A fix is applied and tests are
strengthened.
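
A hedged sketch of the scalarize-and-rebuild loop this fix targets
(simplified; names assumed): the key is that each `insertelement` result must
feed the next iteration instead of being dropped.

```
mlir::Value result = rewriter.create<mlir::LLVM::UndefOp>(loc, vecTy);
for (int64_t i = 0, e = vecTy.getNumElements(); i < e; ++i) {
  mlir::Value pos = rewriter.create<mlir::LLVM::ConstantOp>(
      loc, rewriter.getI64Type(), rewriter.getI64IntegerAttr(i));
  mlir::Value elem =
      rewriter.create<mlir::LLVM::ExtractElementOp>(loc, src, pos);
  mlir::Value scalar =
      rewriter.create<mlir::LLVM::CallOp>(loc, funcOp, mlir::ValueRange{elem})
          .getResult();
  // The bug: accumulating into a stale vector made the pass return the
  // initial undef; the insert's result must be carried forward.
  result =
      rewriter.create<mlir::LLVM::InsertElementOp>(loc, result, scalar, pos);
}
```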

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D154269
2023-07-03 13:26:51 -06:00
Kun Wu
be2dd22b8f [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime
Differential Revision: https://reviews.llvm.org/D153173
2023-06-30 21:52:34 +00:00
Tobias Gysi
b126ee65fc [mlir][llvm] Add comdat attribute to functions
This revision adds comdat support to functions. Additionally,
it ensures only comdats that have uses are imported/exported and
only non-empty global comdat operations are created.

Reviewed By: Dinistro

Differential Revision: https://reviews.llvm.org/D153739
2023-06-27 07:26:59 +00:00
Kun Wu
632ccc538c [mlir][sparse][gpu] remove tuple as one of the spmm_buffer_size output type
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D153188
2023-06-19 15:57:50 +00:00
Uday Bondhugula
597f04fe97 [MLIR] Add support for bare pointer calling convention in gpu-to-llvm
Add support for the bare pointer calling convention in the gpu-to-llvm
pass. This wasn't being exposed and is needed when GPU-compiled MLIR is
to be called with this convention.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D152477
2023-06-17 23:27:13 +05:30
Kun Wu
ac30f48e37 [mlir][sparse][gpu] fix various cusparseLt bugs
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D152489
2023-06-12 23:48:49 +00:00
Kun Wu
97f4c22b3a [mlir][sparse][gpu] unify dnmat and dnvec handle and ops
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D152465
2023-06-09 17:16:48 +00:00
Navdeep Katel
18cc07aa07 [MLIR][GPU] Add 16-bit version of cudaMemset in cudaRuntimeWrappers
Add 16-bit version of cudaMemset in cudaRuntimeWrappers and update the GPU to LLVM lowering.

Reviewed By: bondhugula

Differential Revision: https://reviews.llvm.org/D151642
2023-06-08 17:33:26 +05:30