The `test-lower-to-nvvm` pipeline serves as the common pipeline for NVVM+host
compilation and is used across our CUDA integration tests.
This PR renames the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it into `InitAllPasses.h`. The aim is to make it callable from
Python and to standardize the compilation process for NVVM.
GPU dialect lowering to the SYCL runtime is driven by the `spirv.target_env`
attribute attached to `gpu.module`. As a result, `spirv.target_env` remains
an input to LLVM IR translation.
A `SPIRVToLLVMIRTranslation` that performs no actual translation is added to
avoid an unregistered-dialect error in `mlir-cpu-runner`.
`SelectObjectAttr.cpp` is updated to:
1) pass the binary size argument to `getModuleLoadFn`;
2) pass the parameter count to `getKernelLaunchFn`.
This change does not impact CUDA and ROCm usage, since both
`mlir_cuda_runtime` and `mlir_rocm_runtime` have already been updated to
accept and ignore the extra arguments.
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clusters of Cooperative
Thread Arrays (CTAs) to synchronize and communicate through shared memory
while running concurrently.
This PR enables support for CGA within `gpu.launch_func` in the GPU
dialect, extending `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so the CGA functionality is
added as optional parameters. We want to leverage the mechanisms that we
already have in the GPU dialect, such as outlining and kernel launching,
making this a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces index and dimension Ops specific to clusters,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in the `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
PR #69913 added a GEMM test (128x128x128 F32 += F16 * F16) with an
if-statement. This PR adds the same test using predicates in PTX.
Predicate support is enabled using the _BasicPtxBuilderInterface_
(`nvgpu.opcode ..., predicate = %pred`).
The predicate condition is computed in `Step 2. [GPU] Elect fastest
thread in CTA`, inspired by CUTLASS. It is as follows:
```
lane_predicate = nvvm.elect.sync
warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0)
warp_idx_in_warp_group = warp_idx % 4
predicate = (lane_predicate & warp_idx_in_warp_group)
```
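A minimal MLIR rendering of this scheme might look as follows (a sketch, not
code lifted from the patch; op syntax from the `gpu`, `arith`, and `nvvm`
dialects, with the warp-0 comparison made explicit):
```
// Compute which warp this thread belongs to.
%tid     = gpu.thread_id x
%tid_i32 = arith.index_cast %tid : index to i32
%c0  = arith.constant 0 : i32
%c4  = arith.constant 4 : i32
%c32 = arith.constant 32 : i32
%warp = arith.divui %tid_i32, %c32 : i32
// Broadcast lane 0's warp id so every lane in the warp agrees on it.
%warp_idx, %valid = gpu.shuffle idx %warp, %c0, %c32 : i32
%warp_in_group = arith.remui %warp_idx, %c4 : i32
%is_warp0 = arith.cmpi eq, %warp_in_group, %c0 : i32
// Elect a single lane per warp, then keep only warp 0's elected lane.
%lane_pred = nvvm.elect.sync -> i1
%pred = arith.andi %lane_pred, %is_warp0 : i1
```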
Depends on #70027, #69934, #69935, #69584
#70923 improved the verifier. The verifier caught that the tensor map type in the TMA descriptor in this test was incorrect. The program was nevertheless working correctly, since the offset was calculated correctly.
This work fixes the test.
This commit removes the last remnants of `use-opaque-pointers` from the
MLIR tests. Two of the tests seem to be disabled, while the CUDA one is
an integration test that didn't trigger a buildbot failure.
The test was meant to check `64x128xf16`, as the contiguous dimension
exceeds the cache line (128B). TMA requires cache-line-aligned loads, so
loading 64x128 can be done with two 64x64 loads, as documented in the
test.
However, there was a typo in the type: `memref<128x64xf16>` instead of
the correct `memref<64x128xf16>`. This PR corrects the issue and updates
the verification.
#69934 broke integration tests that rely on the
`kernel-bare-ptr-calling-convention` and `host-bare-ptr-calling-convention`
flags. This PR brings these flags back.
The `kernel-index-bitwidth` flag is also removed, as the kernel pointer
size depends on the host; separating a 64-bit host from a 32-bit kernel is
not viable.
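For context, a rough sketch of what the bare-pointer convention changes on
the kernel side (types simplified; illustrative, not lifted from the tests):
```
// Default convention: a memref kernel argument lowers to the full memref
// descriptor (allocated/aligned pointers, offset, sizes, strides).
// With the bare-pointer convention, this kernel:
gpu.func @kernel(%arg0: memref<8xf32>) kernel {
  gpu.return
}
// lowers, roughly, to a single pointer argument:
llvm.func @kernel(%arg0: !llvm.ptr) {
  llvm.return
}
```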
Update most test passes to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical, rewriting an unnamed sequence into
a named one and wrapping the transform IR into a module when necessary.
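For illustration, the mechanical rewrite looks roughly like this (a minimal
sketch; `@__transform_main` is the entry point the interpreter pass looks up):
```
// Before: an unnamed top-level sequence.
transform.sequence failures(propagate) {
^bb0(%root: !transform.any_op):
  transform.yield
}

// After: a named entry point, wrapped in a module when necessary.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    transform.yield
  }
}
```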
Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.
Only the tests in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of the testing coverage of the transform interpreter logic.
This PR enables the `test-lower-to-nvvm` pass pipeline for the integration
tests for the NVIDIA sm_90 architecture.
This PR adjusts the `test-lower-to-nvvm` pass in two ways:
1) It calls `createConvertNVGPUToNVVMPass` before the outlining process.
This particular pass is responsible for generating both device and host
code; on the host, it calls the CUDA driver to build the TMA descriptor
(`cuTensorMap`).
2) It integrates `createConvertNVVMToLLVMPass` to generate PTX for NVVM
Ops (see the sketch below).
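As a reminder of what `createConvertNVVMToLLVMPass` does: NVVM Ops
implementing `BasicPtxBuilderInterface` are rewritten into inline PTX.
A rough sketch (the exact asm string and attributes are approximate):
```
// Before convert-nvvm-to-llvm:
nvvm.cp.async.commit.group
// After, roughly: the op becomes inline PTX.
llvm.inline_asm has_side_effects "cp.async.commit_group;", "" : () -> ()
```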
MLIR has begun supporting many features of NVIDIA's sm_90 architecture,
and new tests have been added for them. Although the tests worked well,
there were redundancies in the pipeline. This PR cleans up the unnecessary
passes.
#65953 added a `128x64xf16` test that does a single TMA load. This PR adds
a more complex test that does two additional TMA loads with 128B
swizzling:
```
TMA Load: Matrix-A[0:128][0:64]
TMA Load: Matrix-B[0:64][0:64]
TMA Load: Matrix-B[64:128][0:64]
```
The program tests the loaded data for Matrix-B.
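For reference, the 128B-swizzled descriptor type involved looks roughly like
this (a sketch of the `nvgpu` tensormap descriptor; attribute spellings
approximate):
```
// TMA descriptor for a 128x64 f16 tile in workgroup (shared) memory,
// loaded with 128B swizzling.
!tmaDesc = !nvgpu.tensormap.descriptor<
    tensor = memref<128x64xf16, 3>, swizzle = swizzle_128b,
    l2promo = none, oob = zero, interleave = none>
```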
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module; see the sketch after this list. Likewise, a dictionary attribute
was added to hold any possible extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
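A rough sketch of the resulting attribute (syntax approximate; the format
keyword and the properties dictionary are the new pieces):
```
// A GPU binary whose object carries PTX assembly plus extra options,
// instead of the default binary format ("..." elides the object blob).
gpu.binary @kernels [
  #gpu.object<#nvvm.target<chip = "sm_70">,
              properties = {O = 2 : i32},
              assembly = "...">
]
```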
The 'TargetAttr' workflow was recently introduced to serialization for
'MLIR->LLVM->PTX'. #65857 removes the previous passes (the
gpu::Serialization* passes) because they are duplicates.
This PR removes the use of the gpu::Serialization* passes in the SM_90
integration tests and enables the 'TargetAttr' workflow.
It also moves the transform-dialect-specific test to a new folder.
The revert happened due to a buildbot failure that threw 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'.
The failure's root cause was a pass using "+ptx76" for compilation and an old CUDA driver
on the bot. This commit relands the patch with "+ptx60".
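In terms of the target attribute, the fix amounts to something like this
(a sketch; the chip name is illustrative):
```
// Old: required a driver supporting PTX ISA 7.6 ("+ptx76").
// Relanded with a PTX version that old drivers accept:
gpu.module @kernels [#nvvm.target<chip = "sm_70", features = "+ptx60">] {
}
```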
Original GitHub PR: #65768
Original commit message:
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.
The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
TMA support was introduced to MLIR; however, it required the `ptxas` compiler. Recent work in D154117 introduced that!
This work runs the existing integration test.
Reviewed By: fmorac
Differential Revision: https://reviews.llvm.org/D159347
Reland of the original patch after updating the Python binding tests,
a few CUDA/GPU MLIR tests, and ensuring the assembly format is
round-trippable.
This patch splits the lowering of vector.print into first converting
an n-D print into a loop of scalar prints of the elements, then a second
pass that converts those scalar prints into the runtime calls. The
former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example:
```
vector.print punctuation <comma>
```
lowers to
```
llvm.call @printComma() : () -> ()
```
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
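Putting the two stages together, the lowering looks roughly like this (a
hedged sketch; the loop structure and punctuation cases are approximate):
```
// Stage 1 (VectorToSCF): a vector print becomes a loop of scalar prints
// bracketed by punctuation ops.
%v  = arith.constant dense<1.0> : vector<4xf32>
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
vector.print punctuation <open>
scf.for %i = %c0 to %c4 step %c1 {
  %elt = vector.extractelement %v[%i : index] : vector<4xf32>
  vector.print %elt : f32 punctuation <no_punctuation>
}
vector.print punctuation <close>
// Stage 2 (VectorToLLVM): each scalar print and punctuation op becomes a
// runtime call, e.g. llvm.call @printF32(...) or llvm.call @printOpen().
```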
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519
Reland of the original patch after updating the Python binding tests and
a few CUDA/GPU MLIR tests.
This patch splits the lowering of vector.print into first converting
an n-D print into a loop of scalar prints of the elements, then a second
pass that converts those scalar prints into the runtime calls. The
former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example:
```
vector.print <comma>
```
lowers to
```
llvm.call @printComma() : () -> ()
```
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519
This revision adds support for directly lowering a `linalg.copy` on buffers between global and shared memory to a TMA async load plus synchronization operations.
This uses the recently introduced Hopper NVVM and NVGPU abstractions to connect things end to end.
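The kind of source-level op this handles is simply the following (a sketch;
shapes and address spaces illustrative):
```
// Copy a tile from global to workgroup (shared) memory; this now lowers
// to a TMA async load plus mbarrier-based synchronization.
linalg.copy ins(%global : memref<128x64xf16>)
            outs(%shared : memref<128x64xf16, #gpu.address_space<workgroup>>)
```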
Differential Revision: https://reviews.llvm.org/D157087
This work introduces sm90 integration testing and adds a single test.
Depends on : D155825 D155680 D155563 D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155838
This mirrors the `test-lower-to-llvm` pass pipeline, which provides some sanity when running e2e examples.
One peculiarity of the GPU pipeline is that we want to allow 32-bit indexing in kernels.
This is currently not straightforward, as there are dependencies between passes.
This new test pass orders the passes in a way that connects end to end.
Differential Revision: https://reviews.llvm.org/D155463
This PR adds support for the m16n8k16 f16 case.
At this point, the support is mostly mechanical and could be TableGen'd for all cases.
Until then, it can be populated as needed on a case-by-case basis.
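For reference, the m16n8k16 f16 case maps onto an `nvgpu.mma.sync` of
roughly this shape (a sketch; fragment types follow the NVGPU convention of
small per-thread vectors):
```
// One m16n8k16 mma.sync: each thread holds fragments of A (4x2),
// B (2x2), and the accumulator C (2x2) in f16.
%d = nvgpu.mma.sync(%a, %b, %c) {mmaShape = [16, 8, 16]}
  : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
```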
Depends on: D153420
Differential Revision: https://reviews.llvm.org/D153428
Mapping to NVGPU operations such as `mma.sync` with mixed precision, and
`ldmatrix` with transposes and various data types, involves complex matching
from low-level IR.
This is akin to raising complex patterns after having unnecessarily lost
structural information.
To avoid such unnecessary complexity, introduce a direct mapping step from a
matmul on memrefs to distributed NVGPU vector abstractions.
In this context, mapping to specific `mma.sync` operations is trivial and
consists simply of translating the documentation into indexing expressions.
Correctness is demonstrated with an end-to-end integration test.
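As an example of the abstraction gap being bridged, shared-memory loads map
onto `nvgpu.ldmatrix` (a sketch; attribute spellings approximate):
```
// Load 4 8x8 tiles from shared memory into per-thread fragments,
// optionally transposed, ready to feed nvgpu.mma.sync.
%frag = nvgpu.ldmatrix %shmem[%c0, %c0] {numTiles = 4 : i32, transpose = false}
  : memref<16x16xf16, 3> -> vector<4x2xf16>
```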
Differential Revision: https://reviews.llvm.org/D153420
Provide the bare-pointer memref lowering option on the `gpu-to-nvvm` pass.
This is needed whenever we lower memrefs on the host function side and
the kernel calls on the host side (`gpu-to-llvm`) with the bare-pointer
convention. The GPU module side of the lowering should then also align and
use the bare-pointer convention.
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D152480
This is an ongoing series of commits that are reformatting our
Python code.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a Python file, the best way to handle that
is to run `git checkout --ours <yourfile>` and then reformat it
with `black`.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Differential Revision: https://reviews.llvm.org/D150782
This reverts commit 5561e17411
The logic was moved from CMake into lit, fixing the issue that led to the revert, and potentially others with multi-config CMake generators
Differential Revision: https://reviews.llvm.org/D143925
This patch contains the changes required to make the vast majority of integration and runner tests run on Windows.
Historically, JIT support on Windows has lagged behind, but recent versions of ORC JIT have caught up and work for basically all examples in the repo.
Sadly, because these tests previously did not work on Windows, basically all of them make Unix-like assumptions about things like filenames, paths, shell syntax, etc.
This patch fixes all these issues in one big swoop and enables Windows support for the vast majority of integration tests.
More specifically, the following changes had to be made:
* The various JIT runners used paths to the runtime libraries that assumed a Unix toolchain layout and filenames. I abstracted the specific path and filename of these runtime libraries away by passing the paths to the runtime libraries from CMake into lit. This now also allows a much more convenient syntax: `--shared-libs=%mlir_c_runner_utils` instead of `--shared-libs=%mlir_lib_dir/lib/libmlir_c_runner_utils%shlibext`
* Some tests using Python set environment variables using the `ENV=VALUE cmd` format. This works on Unix, but on Windows it has to be prefixed as `env ENV=VALUE cmd`
* Some tests used C functions that are simply not available or exported on Windows (`fabsf`, `aligned_alloc`). These tests have either been adjusted or explicitly marked as `UNSUPPORTED`
Some tests remain disabled on Windows as before:
* In SparseTensor, some tests have non-trivial logic for finding the runtime libraries, which seems to be required for the use of emulators. I do not have the time to port these, so I simply kept them disabled
* Some tests require special hardware that I simply cannot test; these remain disabled on Windows. They include usage of AVX512 or AMX
The tests for `mlir-vulkan-runner` and `mlir-spirv-runner` all work now as well, and so do the vast majority of the `mlir-cpu-runner` tests.
Differential Revision: https://reviews.llvm.org/D143925
When converting to NVVM, lowering `gpu.printf` to `vprintf` allows us to
support printing when running on CUDA.
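A minimal example of what this enables (a sketch; exact `gpu.printf` syntax
approximate):
```
// Inside a kernel: gpu.printf lowers to a device-side vprintf call.
%tid = gpu.thread_id x
%tid_i32 = arith.index_cast %tid : index to i32
gpu.printf "Hello from thread %d\n" %tid_i32 : i32
```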
Differential Revision: https://reviews.llvm.org/D141049
Add support for loading, computing, and storing `gpu.subgroup` WMMA ops
in transpose mode as well. Update the GPU-to-NVVM lowerings to support
`transpose` mode and update the integration tests accordingly.
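For example, loading an operand in transpose mode looks roughly like this
(a sketch; the `leadDimension` value is illustrative):
```
// The transpose unit attribute selects the transposed WMMA load.
%b = gpu.subgroup_mma_load_matrix %shmem[%c0, %c0]
    {leadDimension = 32 : index, transpose}
    : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
```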
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D139021
In D134622, the printed form of a pass manager was changed to include the
name of the op that the pass manager is anchored on. This updates the
`-pass-pipeline` argument format to include the anchor op as well, so
that the printed form of a pipeline can be passed directly to
`-pass-pipeline`. In most cases this requires updating
`-pass-pipeline='pipeline'` to
`-pass-pipeline='builtin.module(pipeline)'`.
This also fixes an outdated assert that prevented running a
`PassManager` anchored on `'any'`.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D134900