Commit Graph

490 Commits

Author SHA1 Message Date
Jakub Kuderski
72003adf6b [mlir][gpu] Allow subgroup reductions over 1-d vector types (#76015)
Each vector element is reduced independently, which is a form of
multi-reduction.

The plan is to allow for gradual lowering of multi-reduction that
results in fewer `gpu.shuffle` ops at the end:
1d `vector.multi_reduction` --> 1d `gpu.subgroup_reduce` --> smaller 1d
`gpu.subgroup_reduce` --> packed `gpu.shuffle` over i32

For example, we can perform 2 independent f16 reductions with a series of
`gpu.shuffle` ops over i32, reducing the final number of `gpu.shuffle` ops by 2x.
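A hedged sketch of the new 1-d form (assembly syntax approximated from the GPU dialect docs, not taken verbatim from this patch):

```mlir
// Each f16 element of the vector is reduced across the subgroup lanes
// independently; a later lowering can pack pairs of f16 into i32 shuffles.
%red = gpu.subgroup_reduce add %v : (vector<2xf16>) -> (vector<2xf16>)
```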
2023-12-21 11:55:43 -05:00
Krzysztof Parzyszek
8b231d73bd [mlir] Fix build break with shared libraries
When project components are built as separate shared libraries, a lot
of errors appear about undefined symbols, e.g.

```
/usr/bin/ld: CMakeFiles/obj.MLIRGPUPipelines.dir/GPUToNVVMPipeline.cpp.o
: in function `(anonymous namespace)::buildCommonPassPipeline(mlir::OpPa
ssManager&, (anonymous namespace)::GPUToNVVMPipelineOptions const&)':
GPUToNVVMPipeline.cpp:(.text._ZN12_GLOBAL__N_123buildCommonPassPipelineE
RN4mlir13OpPassManagerERKNS_24GPUToNVVMPipelineOptionsE+0xa5): undefined
 reference to `mlir::createConvertLinalgToLoopsPass()'
```

Add the necessary dependencies to Dialect/GPU/Pipelines/CMakeLists.txt.
2023-12-20 12:49:25 -06:00
Jakub Kuderski
560564f51c [mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
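A hedged before/after sketch of the rename (assuming `maxf` became `maxnumf`, mirroring `arith.maxnumf`):

```mlir
// Old: unclear whether <maxf> means maxnumf or maximumf semantics.
%old = vector.reduction <maxf>, %v : vector<4xf32> into f32
// New: the kind name maps 1:1 to an arith op.
%new = vector.reduction <maxnumf>, %v : vector<4xf32> into f32
```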

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-12-20 00:14:43 -05:00
Jakub Kuderski
9f74e6e615 [mlir][vector][gpu] Use makeArithReduction in lowering patterns. NFC. (#75952)
Use the `vector::makeArithReduction` helper as the source-of-truth of
reduction to arith ops lowering.
2023-12-19 19:04:27 -05:00
Guray Ozen
5caae72d1a [mlir][gpu] Productize test-lower-to-nvvm as gpu-lower-to-nvvm (#75775)
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.

This PR renames the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it into `InitAllPasses.h`. The aim is to call it from Python
while also having a standardized compilation process for nvvm.
2023-12-19 08:40:46 +01:00
Fabian Mora
419c45a325 [mlir][gpu] Fix crash in gpu-module-to-binary (#75477)
This patch fixes the error in issue #75434. The crash was caused by not
checking whether a GPU module lacks target attributes. It is now
considered an error to invoke the pass on a GPU module with no
target attributes.
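A hedged sketch of a module the pass accepts (target attribute illustrative); a GPU module with no targets now produces a diagnostic instead of crashing:

```mlir
// A GPU module carrying at least one target attribute.
gpu.module @kernels [#nvvm.target<chip = "sm_70">] {
}
```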
2023-12-14 14:03:10 -05:00
Kazu Hirata
88d319a29f [mlir] Use StringRef::{starts,ends}_with (NFC)
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.

I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
2023-12-13 22:58:30 -08:00
Adrian Kuegel
8a5b448fa0 [mlir][GPU] Apply ClangTidy fixes
Use const reference in loops if possible.
2023-12-12 07:34:03 +00:00
Mehdi Amini
6402706a00 [mlir] Fix the link of libcuda.so in MLIRGPUTransforms to not use fully qualified path (#74018)
At the moment we find libcuda.so in a path like:

  /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcuda.so

and directly add this to `target_link_libraries`. The problem is that
our installed MLIR package will include the full path to the library,
and a user downstream when including our cmake installed package will
inherit this full path.

We're changing this to instead use:

 -L /usr/local/cuda/targets/x86_64-linux/lib/stubs/ -lcuda
2023-11-30 19:30:05 -08:00
Jakub Kuderski
7eccd52842 Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit dd09221a29 and relands
https://github.com/llvm/llvm-project/pull/73423.

* Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.
2023-11-27 11:38:18 -05:00
Jakub Kuderski
dd09221a29 Revert "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit e0aac8c88d.

I'm seeing some nvidia integration test failures:
https://lab.llvm.org/buildbot/#/builders/61/builds/52334.
2023-11-27 11:29:23 -05:00
Jakub Kuderski
e0aac8c88d [mlir][gpu] Align reduction operations with vector combining kinds (#73423)
The motivation for this change is explained in
https://github.com/llvm/llvm-project/issues/72354.

Before this change, we could not distinguish between signed and unsigned
minimum/maximum, nor specify the NaN treatment for floating-point values.

The mapping of old reduction operations to the new ones is as follows:
*  `min` --> `minsi` for ints, `minf` for floats
*  `max` --> `maxsi` for ints, `maxf` for floats

New reduction kinds not represented in the old enum: `minui`, `maxui`,
`minimumf`, `maximumf`.
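A hedged sketch of the new kinds on `gpu.all_reduce` (syntax approximated from the dialect docs; exact assembly may differ):

```mlir
// The kind now spells out signedness (ints) and NaN semantics (floats).
%imin = gpu.all_reduce minsi %a uniform {} : (i32) -> (i32)
%fmin = gpu.all_reduce minf %b uniform {} : (f32) -> (f32)
```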

As a next step, I would like to have a common definition of combining
kinds used by the `vector` and `gpu` dialects. Separately, the GPU to
SPIR-V lowering does not yet properly handle zero and NaN values -- the
behavior of floating point min/max group reductions is not specified by
the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459. 

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-11-27 11:19:20 -05:00
Guray Ozen
edf5cae739 [mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clusters of Cooperative
Thread Arrays (CTAs) to synchronize and communicate through shared memory
while running concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added the CGA
functionality as optional parameters. We want to leverage the mechanisms
we already have in the GPU dialect, such as outlining and kernel
launching, making it a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Guray Ozen
ea84897ba3 [mlir][gpu] Introduce gpu.dynamic_shared_memory Op (#71546)
While the `gpu.launch` Op allows setting the size via the
`dynamic_shared_memory_size` argument, accessing the dynamic shared
memory is very convoluted. This PR implements the proposed Op,
`gpu.dynamic_shared_memory`, which aims to simplify the utilization of
dynamic shared memory.

RFC:
https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/

**Proposal from RFC**
This PR adds the `gpu.dynamic.shared.memory` Op to use the dynamic shared
memory feature efficiently. It is a powerful feature that enables the
allocation of shared memory at runtime with the kernel launch on the
host. Afterwards, the memory can be accessed directly from the device. I
believe a similar story exists for AMDGPU.

**Current way Using Dynamic Shared Memory with MLIR**

Let me illustrate the challenges of using dynamic shared memory in MLIR
with an example below. The process involves several steps:
- `memref.global`: declare the 0-sized array that LLVM's NVPTX backend expects
- `dynamic_shared_memory_size`: set the size of the dynamic shared memory
- `memref.get_global`: access the global symbol
- `reinterpret_cast` and `subview`: many ops for pointer arithmetic

```
// Step 1. Create 0-sized global symbol. Manually set the alignment
memref.global "private" @dynamicShmem  : memref<0xf16, 3> { alignment = 16 }
func.func @main() {
  // Step 2. Allocate shared memory
  gpu.launch blocks(...) threads(...)
    dynamic_shared_memory_size %c10000 {
    // Step 3. Access the global object
    %shmem = memref.get_global @dynamicShmem : memref<0xf16, 3>
    // Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations.
    %4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128],  strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3>
    %5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3>
    %6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3>
    %7 = memref.subview %6[0, 0][64, 64][1,1]  : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>
    %8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>
    // Step.5 Use
    "test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index)
    "test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index)
    gpu.terminator
  }
```

Let’s rewrite the program above using the new Op:

```
func.func @main() {
  gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 {
    // Step 1: Obtain the shared memory directly.
    %shmem = gpu.dynamic_shared_memory : memref<?xi8, 3>
    %c147456 = arith.constant 147456 : index
    %c155648 = arith.constant 155648 : index
    %7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3>
    %8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3>

    // Step 2: Utilize the shared memory.
    "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index)
    "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index)
    gpu.terminator
  }
}
```

This PR resolves #72513
2023-11-16 14:42:17 +01:00
long.chen
1609f1c2a5 [mlir][affine][nfc] cleanup deprecated T.cast style functions (#71269)
For details, see the documentation: https://mlir.llvm.org/deprecation/

Not all changes were made manually; most were made with a clang
tool I wrote: https://github.com/lipracer/cpp-refactor.
2023-11-14 13:01:19 +08:00
drazi
9a3d3c7093 generalize pass gpu-kernel-outlining for symbol op (#72074)
This PR generalizes the gpu-kernel-outlining pass to handle ops
implementing `SymbolOpInterface` instead of just `func::FuncOp`.

Before this change, the gpu-kernel-outlining pass would skip `llvm.func`.
```mlir
module {
  llvm.func @main() {
    %c1 = arith.constant 1 : index
    gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
      gpu.terminator
    }
    llvm.return
  }
}
```

After this change, the gpu-kernel-outlining pass can handle `llvm.func` as well.
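A hedged sketch of the outlined result (symbol names illustrative):

```mlir
gpu.module @main_kernel {
  // The launch body is moved into a proper GPU kernel function.
  gpu.func @main_kernel() kernel {
    gpu.return
  }
}
```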
2023-11-12 21:48:49 -08:00
spaceotter
45f669252e [mlir][gpu] Fix build error after barrier elimination code moved (#72019)
Should fix
https://lab.llvm.org/buildbot/#/builders/61/builds/51692/steps/5/logs/stdio
2023-11-11 00:57:30 -08:00
spaceotter
00c3c73189 [mlir][gpu] Separate the barrier elimination code from transform ops (#71762)
Allows the barrier elimination code to be run from C++ as well. The code
from the transform dialect is copied as-is; the pass and populate functions
have been added at the end.

Co-authored-by: Eric Eaton <eric@nod-labs.com>
2023-11-10 17:59:09 -08:00
spaceotter
51af040b22 [mlir][gpu] Eliminate redundant gpu.barrier ops (#71575)
Adds a canonicalizer for gpu.barrier that gets rid of duplicates.
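A minimal sketch of the redundancy being removed (assuming no side-effecting ops in between):

```mlir
// The second barrier synchronizes nothing new; canonicalization
// folds the pair into a single gpu.barrier.
gpu.barrier
gpu.barrier
```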

Co-authored-by: Eric Eaton <eric@nod-labs.com>
2023-11-09 18:06:20 -05:00
Krzysztof Drewniak
05fa923a9b Fix SmallVector usage in SerializeToHsaco (#71702)
Enable merging #71439 by replacing a definitely-wrong use of
std::unique_ptr<SmallVectorImpl<char>> as a return value with passing in
a SmallVectorImpl<char>&.

Also change the following function to take ArrayRef<char> instead of
const SmallVectorImpl<char>&.
2023-11-08 13:57:41 -06:00
Fabian Mora
42630689e2 [mlir][gpu] Clean GPU Passes.h from external SPIRV includes (#71331)
Removes the `SPIRVAttributes.h` header from `GPU/Transforms/Passes.h`
2023-11-05 17:06:04 -08:00
Sang Ik Lee
2dace04521 [mlir][spirv] Implement gpu::TargetAttrInterface (#69949)
This commit implements gpu::TargetAttrInterface for the SPIR-V target
attribute. The plan is to use this to enable the GPU compilation pipeline
for OpenCL kernels later.

The changes do not impact Vulkan shaders using mlir-vulkan-runner.
A new GPU dialect transform pass, spirv-attach-target, is implemented for
attaching the attribute from the CLI.

The gpu-module-to-binary pass now works with a GPU module that has a
SPIR-V module with OpenCL kernel functions inside.
2023-11-05 08:11:53 -08:00
Mehdi Amini
6883343843 [mlir] Guard NVPTX backend initialization on it being configured (NFC)
This is just helping with some build failures in some new configurations.
2023-11-03 22:23:01 -07:00
Mehdi Amini
d9dadfda85 Refactor ModuleToObject to offer more flexibility to subclass (NFC)
Some specific implementations of the offload may want more customization, and
even avoid using in-tree LLVM to dispatch the ISA translation to a custom
solution. This refactoring makes it possible for such implementations to work
without even configuring the target backend in LLVM.

Reviewers: fabianmcg

Reviewed By: fabianmcg

Pull Request: https://github.com/llvm/llvm-project/pull/71165
2023-11-03 13:41:45 -07:00
Rohan Yadav
71bdd2c238 mlir/lib/Dialect/GPU/Transforms: improve context management in SerializeToCubin (#65779)
This commit adjusts the CUDA context management in the SerializeToCubin
pass. In particular, it uses the device 0 primary context instead of
creating a new CUDA context on each invocation of SerializeToCubin. This
yields very large improvements in compile time, especially if an
application (like a JIT compiler) is calling SerializeToCubin
repeatedly.

Differential Revision: https://reviews.llvm.org/D159487

Co-authored-by: Rohan Yadav <rohany@cs.stanford.edu>
2023-10-20 23:05:10 +05:30
Krzysztof Drewniak
0463e00ac6 [mlir][ROCDL] Fix file leak in SerializeToHsaco and its newer form (#67711)
SerializeToHsaco, as currently implemented, leaks the file descriptor
of the .hsaco temporary file, which causes issues in long-running
parallel compilation setups.

See also https://github.com/ROCmSoftwarePlatform/rocMLIR/pull/1257
2023-09-29 17:24:40 -05:00
Martin Erhart
522c1d0eea [mlir][gpu][bufferization] Implement BufferDeallocationOpInterface for gpu.terminator (#66880)
This is necessary to support deallocation of IR with gpu.launch
operations, because gpu.launch does not implement the RegionBranchOpInterface.
Implementing the interface would require it to support regions with
unstructured control flow and produced arguments/results.
2023-09-20 12:28:28 +02:00
Fabian Mora
5093413a50 [mlir][gpu][NVPTX] Enable NVIDIA GPU JIT compilation path (#66220)
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to add any possible
extra options.

2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.

3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.

4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
2023-09-14 18:00:27 -04:00
Arthur Eubanks
0a1aa6cda2 [NFC][CodeGen] Change CodeGenOpt::Level/CodeGenFileType into enum classes (#66295)
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.

This matches other nearby enums.

For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
2023-09-14 14:10:14 -07:00
Fabian Mora
444abb396c [mlir][gpu] Add a symbol table field to TargetOptions and adjust GpuModuleToBinary (#65797)
This patch adds the option of building an optional symbol table for the
top operation in the `gpu-module-to-binary` pass. The table is not
created by default as most targets don't need it; instead, it is lazily
built. The table is passed through a callback in `TargetOptions`.

This patch is required to integrate #65539.
2023-09-09 19:59:20 -04:00
Fabian Mora
ec9f218173 [mlir][gpu][target] Use promises to verify TargetAttrs IR correctness. (#65787)
This patch employs the updated promise mechanism to enforce Target
Attribute IR constraints. Due to this patch, TargetAttributes
implementations no longer have to be registered before executing
translation to LLVM IR in cases where they are not needed, like when
translating `gpu.binary` operations.
2023-09-08 17:21:45 -04:00
Fabian Mora
c16adb0dcb [mlir][Target][NVPTX] Add fatbin support to NVPTX compilation. (#65398)
Currently, the NVPTX tool compilation path only calls `ptxas`; thus, the
GPU running the binary must be an exact match of the arch of the target,
or else the runtime throws an error due to the arch mismatch.

This patch adds a call to `fatbinary`, creating a fat binary with the
cubin object and the PTX code, allowing the driver to JIT the PTX at
runtime if there's an arch mismatch.
2023-09-07 07:44:41 -04:00
Lukas Sommer
06918a969c [MLIR][NFC] Mark barrier elimination helper static (#65303)
Make local helper functions static to avoid symbol name collision.
2023-09-05 09:59:22 +02:00
Fabian Mora
e22f04b597 [mlir][gpu] Fix option parsing in TargetOptions
`TargetOptions` includes a field for passing additional command-line options to
the GPU compilation process. This field is typically used with the
`gpu-module-to-binary` pass:
```
--gpu-module-to-binary=opts="-v -c"
```

The problem is that `tokenizeCmdOptions` receives the quoted string, which produces an
incorrect tokenization of `"-v -c"`. This patch removes the quotes, fixing the issue.

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D159434
2023-09-04 20:54:29 -04:00
Nicolas Vasilache
92f088d335 [mlir][gpu][transform] Provide better error messages and avoid crashing in MapForallToBlocks.
This revision addresses issues surfaced in https://reviews.llvm.org/D159093
2023-09-04 14:11:38 +00:00
Martin Erhart
34a35a8b24 [mlir] Move FunctionInterfaces to Interfaces directory and inherit from CallableOpInterface
Functions are always callable operations and thus every operation
implementing the `FunctionOpInterface` also implements the
`CallableOpInterface`. The only exception was the FuncOp in the toy
example. To make implementation of the `FunctionOpInterface` easier,
this commit lets `FunctionOpInterface` inherit from
`CallableOpInterface` and merges some of their methods. More precisely,
the `CallableOpInterface` has methods to get the argument and result
attributes and a method to get the result types of the callable region.
These methods are always implemented the same way as their analogues in
`FunctionOpInterface` and thus this commit moves all the argument and
result attribute handling methods to the callable interface as well as
the methods to get the argument and result types. The
`FunctionOpInterface` then does not have to declare them as well, but
just inherits them from the `CallableOpInterface`.
Adding the inheritance relation also required moving the
`FunctionOpInterface` from the IR directory to the Interfaces directory,
since IR should not depend on Interfaces.

Reviewed By: jpienaar, springerm

Differential Revision: https://reviews.llvm.org/D157988
2023-08-31 11:28:23 +00:00
Adrian Kuegel
b454ecc84f [mlir] Apply ClangTidy fix (NFC)
Prefer to use .empty() instead of checking for size() > 0.
2023-08-28 10:53:15 +02:00
Adrian Kuegel
bf92a7655c [mlir] Apply ClangTidy fixes (NFC)
Prefer to use .empty() instead of checking size().
2023-08-23 17:18:59 +02:00
Adrian Kuegel
93228cff8f [mlir] Apply ClangTidy fix (NFC)
Use .empty() instead of checking for size().
2023-08-22 13:55:09 +02:00
Nicolas Vasilache
7c4e8c6a27 [mlir] Disentangle dialect and extension registrations.
This revision avoids the registration of dialect extensions in Pass::getDependentDialects.

Such registration of extensions can be dangerous because `DialectRegistry::isSubsetOf` is
always guaranteed to return false for extensions (i.e. there is no mechanism to track
whether a lambda is already in the list of already registered extensions).
When the context is already in a multi-threaded mode, this is guaranteed to assert.

Arguably a more structured registration mechanism for extensions with a unique ExtensionID
could be envisioned in the future.

In the process of cleaning this up, multiple usage inconsistencies surfaced around the
registration of translation extensions that this revision also cleans up.

Reviewed By: springerm

Differential Revision: https://reviews.llvm.org/D157703
2023-08-22 00:40:09 +00:00
Fabian Mora
fbbb8adef1 [mlir][gpu] Add passes to attach (NVVM|ROCDL) target attributes to GPU Modules
Adds the passes `nvvm-attach-target` & `rocdl-attach-target` for attaching `nvvm.target` & `rocdl.target` attributes to GPU modules.

These passes search for GPU modules in the immediate region of the Op being acted on, attaching the target attribute to each matching module.
Modules can be selected using a regex string, allowing fine-grained attachment of targets; see the test `attach-target.mlir` for an example.
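A hedged before/after sketch (chip string illustrative):

```mlir
// Before: a plain GPU module.
gpu.module @kernels_0 {
}
// After nvvm-attach-target: the module carries an NVVM target attribute.
gpu.module @kernels_1 [#nvvm.target<chip = "sm_70">] {
}
```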

Depends on D154153

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D157351
2023-08-12 00:45:26 +00:00
Fabian Mora
43752a2aa3 [mlir][gpu] Add the gpu-module-to-binary pass.
**For an explanation of these patches see D154153.**

Commit message:
This pass converts GPU modules into GPU binaries, serializing all targets present
in a GPU module by invoking the `serializeToObject` target attribute method.

Depends on D154147

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154149
2023-08-12 00:24:53 +00:00
Fabian Mora
8ae074b195 [mlir][gpu] Add the Select Object compilation attribute.
**For an explanation of these patches see D154153.**

Commit message:
This patch adds the default offloading handler for GPU binary ops, `#gpu.select_object`:
it selects the object to embed based on an index or a target attribute, embeds
the object as a global string, and launches the kernel using the scheme used in the
GPU to LLVM pass.
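A hedged sketch of index-based selection (payloads elided):

```mlir
// The offloading handler embeds and launches the second object (index 1).
gpu.binary @kernels <#gpu.select_object<1>>
    [#gpu.object<#nvvm.target<chip = "sm_70">, "...">,
     #gpu.object<#nvvm.target<chip = "sm_80">, "...">]
```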

Depends on D154137

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154147
2023-08-11 22:00:35 +00:00
Fabian Mora
a63db3f5f5 [mlir][gpu] Modifies gpu.launch_func to allow lowering it after gpu-to-llvm.
**For an explanation of these patches see D154153.**

Commit message:
In order to lower `gpu.launch_func` after running `gpu-to-llvm`, it must be
able to handle lowered types, e.g. `index` -> `i64`. This patch also allows
the op to refer to GPU binaries and not only GPU modules.

Depends on D154132.

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154137
2023-08-11 21:56:37 +00:00
Fabian Mora
bf24fb81ac [mlir][gpu] Add gpu.binary op and #gpu.object attribute.
**For an explanation of these patches see D154153.**

Commit message:
Adds the `#gpu.object` attribute for holding a binary object and the target
attribute used to create it. Also adds the `gpu.binary` operation used to
store GPU objects.
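A minimal hedged example of the op and attribute together (object string elided):

```mlir
// A binary holding a single object plus the target attribute that produced it.
gpu.binary @kernel_module [#gpu.object<#rocdl.target, "...">]
```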

Depends on D154108

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D154132
2023-08-11 19:48:18 +00:00
Ingo Müller
616eb0b2c4 [mlir][gpu] Fix error message on unknown CUDA error code.
This patch fixes the output of the error message that is printed when
the CUDA library cannot identify the error code. In that case, no error
message is provided by the library, and the previous implementation just
printed the content of a randomly initialized pointer. This patch
initializes the pointer to nullptr and only prints the content if that
has changed.

Reviewed By: Mogball

Differential Revision: https://reviews.llvm.org/D156791
2023-08-11 08:04:58 +00:00
Ivan Butygin
793ee2bf08 [mlir][gpu] Add DecomposeMemrefsPass
Some GPU backends (SPIR-V) lower memrefs to bare pointers, which fails for dynamically sized/strided memrefs.
This pass extracts sizes and strides via `memref.extract_strided_metadata` outside the `gpu.launch` body, does the index/offset calculation explicitly, and then reconstructs the memrefs via `memref.reinterpret_cast`.

`memref.reinterpret_cast` is then lowered via https://reviews.llvm.org/D155011
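A hedged sketch of the decomposition for a 1-d memref (the actual pass also rewrites the uses inside the body):

```mlir
// Outside the launch: split the memref into base buffer, offset, size, stride.
%base, %off, %sz, %str = memref.extract_strided_metadata %m
    : memref<?xf32> -> memref<f32>, index, index, index
gpu.launch blocks(...) threads(...) {
  // Inside the launch: rebuild the memref from the bare components.
  %m2 = memref.reinterpret_cast %base to offset: [%off], sizes: [%sz],
      strides: [%str] : memref<f32> to memref<?xf32, strided<[?], offset: ?>>
  gpu.terminator
}
```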

Differential Revision: https://reviews.llvm.org/D155247
2023-08-10 22:28:05 +02:00
Nicolas Vasilache
888717e853 [mlir][transform] Enable gpu-to-nvvm via conversion patterns driven by TD
This revision untangles a few more conversion pieces and allows rewriting
the relatively intricate (and somewhat inconsistent) LowerGpuOpsToNVVMOpsPass
in a declarative fashion that provides a much better understanding and control.

Differential Revision: https://reviews.llvm.org/D157617
2023-08-10 15:30:48 +00:00
Ivan Butygin
b13248f997 Revert "[mlir][gpu] Add DecomposeMemrefsPass"
Broke some bots

This reverts commit 2b5b2bfef1.
2023-08-10 03:07:28 +02:00
Ivan Butygin
2b5b2bfef1 [mlir][gpu] Add DecomposeMemrefsPass
Some GPU backends (SPIR-V) lower memrefs to bare pointers, which fails for dynamically sized/strided memrefs.
This pass extracts sizes and strides via `memref.extract_strided_metadata` outside the `gpu.launch` body, does the index/offset calculation explicitly, and then reconstructs the memrefs via `memref.reinterpret_cast`.

`memref.reinterpret_cast` is then lowered via https://reviews.llvm.org/D155011

Differential Revision: https://reviews.llvm.org/D155247
2023-08-10 02:28:03 +02:00