This patch updates the syntax for the nvgpu_arrive Op
in matmulBuilder.py, fixing the compilation
error for this test.
For the warp-specialized matmul_kernel implementation,
removing the WaitGroupSyncOp (after the mma-main-loop)
fixes the observed hang.
With these two fixes, the test compiles and
executes successfully on an sm90a machine.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This patch fixes the sm90 cluster test by:
* Fixing a typo in LowerGpuOpsToNVVMOps where one of the ClusterDim Op
conversion patterns should actually target the
ClusterDimBlocks Op. This addresses the compilation error for this test.
* Changing the grid-size from (2,2,1) to (4,4,1). This passes the
scf.if check against the threshold of 3 below and actually
generates the required prints from the GPU.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The memref.expand_shape Op now explicitly takes an output_shape.
This patch adds it to the Op and fixes the failing test.
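For reference, a minimal sketch of the updated syntax (the shapes are illustrative, not taken from the test):
```
// The result shape is now stated explicitly via output_shape.
%e = memref.expand_shape %m [[0, 1]] output_shape [2, 4]
    : memref<8xf32> into memref<2x4xf32>
```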
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This commit adds support for the `gpu.cluster_dim_blocks` and
`gpu.cluster_block_id` Ops, which represent the number of blocks per cluster and
the block id within a cluster, respectively. It also fixes the description of
the `gpu.cluster_dim` Op and updates the `cga_cluster.mlir` test file to use
`gpu.cluster_dim_blocks`.
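For illustration, a minimal hand-written sketch of the new Ops (not taken from the test file), mirroring the existing cluster Ops:
```
// Number of blocks per cluster along x.
%cdimb_x = gpu.cluster_dim_blocks x
// Id of this block within its cluster along x.
%cbid_x = gpu.cluster_block_id x
```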
Co-authored-by: pradeepku <pradeepku@nvidia.com>
Co-authored-by: Guray Ozen <guray.ozen@gmail.com>
This PR adds a `TmaDescriptorBuilder` class that simplifies TMA generation:
- Makes the code ready to support various TMA configurations.
- Removes hard-coded strings and uses the enums from `mlir.nvgpu` instead,
e.g. mapping "swizzle = swizzle_128b, l2promo=none, oob=zero,
interleave=none" to enums in the `mlir.nvgpu` dialect.
- The enums have string equivalents that are used during IR writing and
generation (see `TmaDescriptorBuilder::tensormap_descriptor_ty`); an
illustrative descriptor type is sketched below.
- Improves readability and abstracts TMA descriptor building into a
reusable component.
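As a rough illustration (the element type, shape, and memory space are assumptions, not taken from the PR), the string configuration above maps to a tensor map descriptor type along these lines:
```
// Swizzle/L2-promotion/OOB/interleave settings carried as enum values in the
// descriptor type instead of free-form strings.
!tma_desc = !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>,
    swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
```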
---------
Co-authored-by: Manish Gupta <manigupta@google.com>
Currently, the `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` has
`index` type. This can cause problems if any value other than
0 or 1 is passed, because the PTX instruction only accepts an even or odd phase.
This PR makes the `phaseParity` argument `i1` to avoid misuse.
Here is the relevant information from the PTX documentation:
```
The .parity variant of the instructions test for the completion of the phase indicated
by the operand phaseParity, which is the integer parity of either the current phase or
the immediately preceding phase of the mbarrier object. An even phase has integer
parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity
operand are 0 and 1.
```
See for more information:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait
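For illustration, a minimal hand-written sketch of the op with an `i1` parity (not taken from the patch; the barrier and tick values are assumed to be defined earlier):
```
// false selects the even phase, true the odd phase.
%parity = arith.constant false
nvgpu.mbarrier.try_wait.parity %barrier[%c0], %parity, %ticks
    : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
```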
This PR improves the documentation for the `gpu-lower-to-nvvm-pipeline`
(the remaining item from #75775):
- Renames the pipeline `gpu-lower-to-nvvm` -> `gpu-lower-to-nvvm-pipeline`.
- Adds a section to the GPU Dialect page on the website, clarifying the
pipeline's functionality in lowering primary dialects to NVVM targets.
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it is used across our CUDA
integration tests.
This PR renames the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it into `InitAllPasses.h`. The aim is to make it callable from
Python and to standardize the compilation process for NVVM.
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing Cooperative Thread Arrays (CTAs)
to be clustered so that they can synchronize and communicate through shared memory
while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so the CGA
functionality is added as optional parameters. This lets us leverage the
mechanisms already in the GPU dialect, such as outlining and kernel launching,
making it a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces cluster-specific index and dimension Ops,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in the `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
PR #69913 added a GEMM test (128x128x128 F32 += F16 * F16) with an
if-statement. This PR adds the same test using predicates in PTX.
Predicate support is enabled via the _BasicPtxBuilderInterface_
(`nvgpu.opcode ..., predicate = %pred`).
The predicate condition is computed in `Step 2. [GPU] Elect fastest
thread in CTA`, inspired by cutlass. It is as follows:
```
lane_predicate = nvvm.elect.sync
warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0)
warp_idx_in_warp_group = warp_idx % 4
predicate = (lane_predicate & warp_idx_in_warp_group)
```
Depends on #70027, #69934, #69935, #69584
#70923 improved the verifier. The verifier caught that the tensor map type in the TMA descriptor in this test is incorrect. The program was nevertheless working correctly, since the offset is calculated correctly.
This work fixes the test.
This commit removes the last remnants of `use-opaque-pointers` from the
mlir tests. Two of the tests seem to be disabled, while the CUDA one is
an integration test that didn't trigger a buildbot failure.
The test was meant to check `64x128xf16`, as the contiguous dimension
exceeds the cache line (128b). TMA requires cache-line-aligned loads, so
loading 64x128 can be done with two 64x64 loads, as documented in the
test.
However, there was a typo in the type, which was `memref<128x64xf16>`
instead of the correct `memref<64x128xf16>`. This PR corrects the issue
and updates the verification.
#69934 broke integration tests that rely on the
kernel-bare-ptr-calling-convention and host-bare-ptr-calling-convention
flags. This PR brings these flags back.
The kernel-index-bitwidth flag is also removed, since the kernel pointer
size depends on the host; separating the host (64-bit) from the kernel
(32-bit) is not viable.
Update most tests to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical: rewriting an unnamed sequence into
a named one and wrapping the transform IR into a module when necessary;
see the sketch below.
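As a rough sketch of that rewrite (the entry-point name is the interpreter's default; the body is illustrative):
```
// Before: an unnamed top-level sequence found via PossibleTopLevelOpTrait.
//   transform.sequence failures(propagate) {
//   ^bb0(%root: !transform.any_op):
//     ...
//   }
// After: a named entry point wrapped in a transform module.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    transform.yield
  }
}
```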
Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.
Only the tests in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of testing coverage of the transform interpreter logic.
This PR enables the `test-lower-to-nvvm` pass pipeline for the integration
tests targeting the NVIDIA sm_90 architecture.
It adjusts the `test-lower-to-nvvm` pipeline in two ways:
1) It calls `createConvertNVGPUToNVVMPass` before the outlining process.
This particular pass is responsible for generating both device and host
code. On the host, it calls the CUDA driver to build the TMA descriptor
(`cuTensorMap`).
2) It integrates `createConvertNVVMToLLVMPass` to generate PTX for the
NVVM Ops.
MLIR has begun supporting many features of Nvidia's sm_90 architecture,
and new tests have been added for it. Although the tests worked well,
there were redundancies in the pipeline. This PR cleans up unnecessary
passes.
#65953 added a `128x64xf16` test that does a single TMA load. This
PR adds a more complex test that does two additional TMA loads with 128B
swizzling:
```
TMA Load: Matrix-A[0:128][0:64]
TMA Load: Matrix-B[0:64][0:64]
TMA Load: Matrix-B[64:128][0:64]
```
The program tests the loaded data for Matrix-B.
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to carry any possible
extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
The 'TargetAttr' workflow was recently introduced for
'MLIR->LLVM->PTX' serialization. #65857 removes the previous passes
(the gpu::Serialization* passes) because they are duplicates.
This PR removes the use of the gpu::Serialization* passes in the SM_90
integration tests and enables the 'TargetAttr' workflow.
It also moves the transform-dialect-specific test to a new folder.
The revert happened due to a buildbot failure that threw 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'.
The failure's root cause was a pass using "+ptx76" for compilation combined with an old CUDA driver
on the bot. This commit relands the patch with "+ptx60".
Original Gh PR: #65768
Original commit message:
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.
The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
TMA was introduced to MLIR; however, it needed the `ptxas` compiler. Recent work in D154117 introduced that.
This work runs the existing integration test.
Reviewed By: fmorac
Differential Revision: https://reviews.llvm.org/D159347
Reland of the original patch after updating the Python binding tests,
a few CUDA/GPU MLIR tests, and ensuring the assembly format is
round-trippable.
This patch splits the lowering of vector.print into two stages: first
an n-D print is converted into a loop of scalar prints of the elements,
then a second pass converts those scalar prints into the runtime calls.
The former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example, `vector.print punctuation <comma>` lowers to
`llvm.call @printComma() : () -> ()`.
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519
Reland of the original patch after updating the Python binding tests and
a few CUDA/GPU MLIR tests.
This patch splits the lowering of vector.print into two stages: first
an n-D print is converted into a loop of scalar prints of the elements,
then a second pass converts those scalar prints into the runtime calls.
The former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example, `vector.print <comma>` lowers to
`llvm.call @printComma() : () -> ()`.
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519
This revision adds support for directly lowering a `linalg.copy` on buffers between global and shared memory to a TMA async load plus synchronization operations.
It uses the recently introduced Hopper NVVM and NVGPU abstractions to connect things end to end.
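As a sketch of the kind of copy targeted (the shapes and address spaces are illustrative assumptions, not taken from the revision):
```
// Copy from a global-memory buffer into a shared-memory (workgroup) buffer;
// this pattern lowers to a TMA async load plus synchronization operations.
linalg.copy ins(%global : memref<128x64xf16>)
            outs(%shared : memref<128x64xf16, #gpu.address_space<workgroup>>)
```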
Differential Revision: https://reviews.llvm.org/D157087