I have a tutorial at EuroLLVM 2024 ([Zero to Hero: Programming Nvidia Hopper Tensor Core with MLIR's NVGPU Dialect](https://llvm.swoogo.com/2024eurollvm/session/2086997/zero-to-hero-programming-nvidia-hopper-tensor-core-with-mlir's-nvgpu-dialect)). For that, I implemented the tutorial code in Python. The focus is the nvgpu dialect and how to use its advanced features. I thought it might be useful to upstream this.

The tutorial codes are as follows:
- **Ch0.py:** Hello World
- **Ch1.py:** 2D Saxpy
- **Ch2.py:** 2D Saxpy using TMA
- **Ch3.py:** GEMM 128x128x64 using Tensor Core and TMA
- **Ch4.py:** Multistage performant GEMM using Tensor Core and TMA
- **Ch5.py:** Warp Specialized GEMM using Tensor Core and TMA

I might implement one more chapter:
- **Ch6.py:** Warp Specialized Persistent ping-pong GEMM

This PR also introduces the `nvdsl` class, making IR building in the tutorial easier. Ch0.py is reproduced below.
# RUN: env SUPPORT_LIB=%mlir_cuda_runtime \
# RUN:   %PYTHON %s | FileCheck %s

# ===----------------------------------------------------------------------===//
#  Chapter 0 : Hello World
# ===----------------------------------------------------------------------===//
#
# This program demonstrates Hello World:
#   1. Build MLIR function with arguments
#   2. Build MLIR GPU kernel
#   3. Print from a GPU thread
#   4. Pass arguments, JIT compile and run the MLIR function
#
# ===----------------------------------------------------------------------===//

from mlir.dialects import gpu
from tools.nvdsl import *


# 1. The decorator generates an MLIR func.func.
# Everything inside the Python function becomes the body of the func.
# The decorator also translates `alpha` to an `index` type.
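# A sketch of the IR this produces (illustrative, not the exact output):
#
#   func.func @main(%alpha: index) {
#     ...
#   }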
@NVDSL.mlir_func
def main(alpha):
    # 2. The decorator generates an MLIR gpu.launch.
    # Everything inside the Python function becomes the body of the gpu.launch.
    # This allows for late outlining of the GPU kernel, enabling optimizations
    # like constant folding from host to device.
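    # Roughly (illustrative), the op built for grid=(1,1,1), block=(4,1,1) is:
    #
    #   gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
    #              threads(%tx, %ty, %tz) in (%sx = %c4, %sy = %c1, %sz = %c1) {
    #     ...
    #     gpu.terminator
    #   }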
    @NVDSL.mlir_gpu_launch(grid=(1, 1, 1), block=(4, 1, 1))
    def kernel():
        tidx = gpu.thread_id(gpu.Dimension.x)
        # The `+` operator generates arith.addi.
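        # e.g. (illustrative): %myValue = arith.addi %alpha, %tidx : index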
        myValue = alpha + tidx
        # Print from a GPU thread
        gpu.printf("GPU thread %llu has %llu\n", [tidx, myValue])

    # 3. Call the GPU kernel
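    # Calling the decorated function is what emits the gpu.launch into the
    # enclosing IR at this point (assumed nvdsl behavior).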
    kernel()


alpha = 100
# 4. The `mlir_func` decorator JIT compiles the IR and executes the MLIR function.
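# (Assumed flow, for orientation: the decorator builds the module, lowers it
# through the GPU compilation pipeline, and runs it with the MLIR
# ExecutionEngine; SUPPORT_LIB above points at the CUDA runtime wrappers.)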
main(alpha)

# CHECK: GPU thread 0 has 100
# CHECK: GPU thread 1 has 101
# CHECK: GPU thread 2 has 102
# CHECK: GPU thread 3 has 103