This PR introduces the `nvvm.barrier` Op to the NVVM dialect. Currently, NVVM only supports `nvvm.barrier0`, which synchronizes
all threads using barrier resource 0.
The new `nvvm.barrier` has two essential arguments: the barrier resource
and the number of threads. This added flexibility allows for selective
synchronization of threads within a CTA, aligning with the capabilities
provided by LLVM intrinsics or the PTX model.
I think we can deprecate `nvvm.barrier0` in favor of the more generic
`nvvm.barrier`.
```
// Equivalent to nvvm.barrier0 (or __syncthreads() in CUDA)
nvvm.barrier
// Synchronize all threads using the 3rd barrier resource.
nvvm.barrier id = 3
// Synchronize %numberOfThreads threads using the 3rd barrier resource.
nvvm.barrier id = 3 number_of_threads = %numberOfThreads
```
Add support for the `nvvm.grid_constant` attribute on LLVM function arguments. The attribute can be attached only to arguments of type `llvm.ptr` that have the `llvm.byval` attribute.
Generate LLVM metadata for functions with `nvvm.grid_constant` arguments. The metadata node is a list of integers, where each integer n denotes that the nth parameter has the `grid_constant` annotation (numbering from 1). The generated metadata node is handled by the NVVM compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#supported-properties
for documentation on the `grid_constant` property.
This patch also adds `convertParameterAttr` to `LLVMTranslationDialectInterface` to support the translation of derived dialect attributes on function parameters.
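For illustration, a hedged sketch of what such a function might look like (the byval struct type and the `nvvm.kernel` marker here are illustrative assumptions):
```
// Hypothetical kernel: %arg0 is a byval pointer marked as a grid constant.
llvm.func @kernel(%arg0: !llvm.ptr {llvm.byval = !llvm.struct<(i32)>, nvvm.grid_constant},
                  %arg1: f32) attributes {nvvm.kernel} {
  llvm.return
}
```
On translation to LLVM IR, this should yield an `nvvm.annotations` entry roughly of the form `!{ptr @kernel, !"grid_constant", !{i32 1}}`, marking parameter 1.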
The current implementation of the `nvvm.wgmma.mma_async` Op deduces the data type of the output matrix from the data type of the result struct's members, which can be unintuitive, especially in cases where types like `2xf16` are packed into `i32`.
This PR addresses the issue by extending the Op with an explicit data type for the output matrix.
The modified Op now includes an explicit data type for Matrix-D (`<f16>`) and looks as follows:
```
%result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ...
nvvm.wgmma.mma_async
%descA, %descB, %result,
#nvvm.shape<m = 64, n = 32, k = 16>,
D [<f16>, #nvvm.wgmma_scale_out<zero>],
A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>],
B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>]
```
This PR adds support for `im2col` and `l2cache` to `cp.async.bulk.tensor.shared.cluster.global`. The Op now supports all the traits of the corresponding PTX instruction.
The updated structure of this operation is shown below. The PR also simplifies the types so we no longer need to write obvious types after `:`.
```
nvvm.cp.async.bulk.tensor.shared.cluster.global
%dest, %tmaDescriptor, %barrier,
box[%crd0,%crd1,%crd2,%crd3,%crd4]
im2col[%off0,%off1,%off2] <-- PR introduces
multicast_mask = %ctamask
l2_cache_hint = %cacheHint <-- PR introduces
: !llvm.ptr<3>, !llvm.ptr
```
This PR adds the `nvvm.stmatrix` Op to the NVVM dialect. The Op collectively stores one or more matrices across all threads in a warp to the given address location in shared memory.
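A minimal sketch of a single-matrix store, assuming the layout attribute and operand types follow the dialect's existing conventions:
```
// Store one 8x8 matrix fragment: each thread in the warp contributes one
// i32 register; %addr points into shared memory (address space 3).
nvvm.stmatrix %addr, %r {layout = #nvvm.mma_layout<row>} : !llvm.ptr<3>, i32
```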
While looking into reducing needless interdependencies between upstream MLIR dialects and passes, I discovered that the ROCDL Dialect redundantly links against the `VectorToLLVM` conversion pass when it actually requires just the LLVM Dialect. Furthermore, after a build failure, I ran `ninja -t missingdeps`, which revealed that the NVVM Dialect depends on headers of the GPU dialect
(211c9752c8/mlir/include/mlir/Dialect/LLVMIR/NVVMDialect.h (L18))
without stating so in CMake.
This causes flaky builds, as it is not guaranteed that the header exists before the dialect is compiled.
This patch pairs a promised interface with the object (Op/Attr/Type/Dialect) requesting the promise, for example:
```
declarePromisedInterface<MyAttr, MyInterface>();
```
This allows making fine-grained promises. It also adds a mechanism to query whether an `Op/Attr/Type` has a specific promise, returning true if the promise is present or if an implementation has been added. Finally, it adds a couple of `Attr|TypeConstraints` that can be used in ODS to query whether the promise or an implementation is present.
This patch tries to solve 2 issues:
1. Different entities cannot use the same promise.
```
declarePromisedInterface<MyInterface>();
// Resolves a promise.
MyAttr1::attachInterface<MyInterface>(ctx);
// Doesn't resolve a promise, as the previous attachment removed it.
MyAttr2::attachInterface<MyInterface>(ctx);
```
2. It is not possible to query whether a promise has been declared.
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D158464
The WgmmaMmaAsync Op generates the `wgmma.mma_async` PTX instruction, which uses the same registers for both reads and writes via a tied-register mapping. Therefore, each such register is counted twice when numbering the registers that follow.
This work changes this:
```
llvm.inline_asm has_side_effects asm_dialect = att "{wgmma.mma_async... {$0, $1, $2, $3, $4}, $5, $6, p", "=f,=f,=f,=f,0,1,2,3,l,l"
```
Into the one below. The only difference is the numbers of the registers ($8 and $9) that come after the tied read/write registers.
```
llvm.inline_asm has_side_effects asm_dialect = att "{wgmma.mma_async... {$0, $1, $2, $3, $4}, $8, $9, p", "=f,=f,=f,=f,0,1,2,3,l,l"
```
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D157843
This work introduces `WGMMATypes` attributes for the `WgmmaMmaSyncOp`. This op, recently added to MLIR, previously used `MMATypes`. However, the sets of types supported by `MmaOp` and `WgmmaMmaSyncOp` differ, so a new set of attributes is introduced to address the discrepancy.
Furthermore, this patch refines and optimizes the verification mechanisms of `WgmmaMmaSyncOp`.
It also adds support for f8 types, including `e4m3` and `e5m2`, within the `WgmmaMmaSyncOp`.
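For instance, the A/B fragments of the op's assembly might then be spelled as follows (a hedged sketch reusing the attribute syntax shown earlier in this log; the exact spelling of the f8 type attributes is an assumption):
```
// Hypothetical A/B fragments using the new f8 WGMMATypes:
A [<e4m3>, #nvvm.wgmma_scale_in<neg>, <row>],
B [<e5m2>, #nvvm.wgmma_scale_in<neg>, <col>]
```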
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D157695
WgmmaMmaSyncOp is an asynchronous operation, but its name did not reflect that. This work fixes the misnamed op.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D157697
This work introduces the `wgmma.mma_async` Op along with PTX generation using the `BasicPtxBuilderOpInterface`. The Op is designed to execute the matrix multiply-and-accumulate operation across a warpgroup (128 threads). It is important to note that this operation works only on devices with the sm_90a capability.
The matrix multiply-and-accumulate operation can take one of the following forms. In both cases, matrix D is referred to as the accumulator:
D = A * B + D : the result is added to the accumulator matrix D.
D = A * B : the input from the accumulator matrix D is not used.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D157370
**For an explanation of these patches see D154153.**
Commit message:
This patch adds the NVVM target attribute for serializing GPU modules into
strings containing cubin.
Depends on D154113 and D154100 and D154097
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D154117
The multiple -convert-XXX-to-llvm passes are really nice testing tools for individual dialects, but the expectation is that a proper conversion should assemble the conversion patterns using the `populateXXXToLLVMConversionPatterns()` APIs. However, most users just chain the conversion passes out of convenience.
This pass makes it more transparently composable to assemble the required patterns for conversion to the LLVM dialect by using an interface.
The pass scans the input and collects all the dialects present; for those that implement the `ConvertToLLVMPatternInterface`, it uses the interface to populate the conversion patterns, and possibly the conversion target.
Since these conversions can involve intermediate dialects, or target dialects other than LLVM (for example AVX or NVVM), this pass can't statically declare the required `getDependentDialects()` before the pass pipeline begins. This is worked around by using an extension in the dialectRegistry that is invoked for every newly loaded dialect in the context. This makes it possible to look up the interface ahead of time and use it to query the dependent dialects.
Differential Revision: https://reviews.llvm.org/D157183
This work introduces `cp.async.bulk.tensor.shared.cluster.global` in the NVVM dialect, which executes a load using TMA.
Depends on D155056
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155060
`nvgpu.device_async_copy` is lowered into the `cp.async` PTX instruction. However, the NVPTX backend does not support all of its modes, especially when zero padding is needed. Therefore, the current MLIR implementation generates inline assembly for it.
This work simplifies PTX generation for `nvgpu.device_async_copy` and implements it via the `NVVMToLLVM` pass.
Depends on D154060
Reviewed By: nicolasvasilache, manishucsd
Differential Revision: https://reviews.llvm.org/D154345
The MLIR classes Type/Attribute/Operation/Op/Value support cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast functionality in addition to defining methods with the same name.
This change begins the migration from uses of the methods to the corresponding free-function calls, which has been decided to be more consistent.
Note that there still exist classes that only define the methods directly, such as AffineExpr, and this does not currently include work to support the functional cast/isa calls for them.
Context:
* https://mlir.llvm.org/deprecation/ at "Use the free function variants for dyn_cast/cast/isa/…"
* Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This follows a previous patch that updated calls from `op.cast<T>()` to `cast<T>(op)`. However, some cases could not handle an unprefixed `cast` call, due to occurrences of variables named `cast` or uses inside class definitions that would resolve to the method.
All C++ files that did not work automatically with `cast<T>()` are updated here to `llvm::cast` and similar, with the intention that they can be easily updated via find-and-replace after the methods are removed.
See https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
for the clang-tidy check that was used; it updates the printed occurrences of the function to include the `llvm::` prefix.
One can then run the following:
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-export-fixes /tmp/cast/casts.yaml mlir/*\
-header-filter=mlir/ -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
```
Differential Revision: https://reviews.llvm.org/D150348
This is part of an effort to migrate from llvm::Optional to
std::optional. This patch changes the way mlir-tblgen generates .inc
files, and modifies tests and documentation appropriately. It is a "no
compromises" patch, and doesn't leave the user with an unpleasant mix of
llvm::Optional and std::optional.
A non-trivial change has been made to ControlFlowInterfaces to split one
constructor into two, relating to a build failure on Windows.
See also: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
Signed-off-by: Ramkumar Ramachandra <r@artagnon.com>
Differential Revision: https://reviews.llvm.org/D138934
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
The PTX programming model provides some performance-tuning directives; see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives
The downstream compiler, namely `ptxas`, leverages this information for better register allocation and for other resource-management decisions that improve performance.
This revision introduces all the kernel-based directives to MLIR's NVVM dialect. The list is below:
```
maxnreg -> max register per thread in CTA
maxntid -> max threads per CTA
reqntid -> exact number of threads per CTA
minnctapersm -> min CTA per SM
```
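As a hedged sketch, these could surface as function-level attributes on an `llvm.func` (the attribute names mirror the directives; the exact value syntax here is an assumption):
```
// Hypothetical kernel tuned to at most 256 threads per CTA
// and at least 2 CTAs per SM.
llvm.func @tuned_kernel() attributes {nvvm.kernel,
    nvvm.maxntid = array<i32: 256, 1, 1>, nvvm.minctasm = 2 : i32} {
  llvm.return
}
```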
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D136931
This reland includes changes to the Python bindings.
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131801
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D131803
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131738
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D131702
Follow-up from flipping dialects to `_Both`: flip the accessors used to the prefixed variant ahead of flipping from `_Both` to `_Prefixed`. This just flips to the accessors introduced in the preceding change, which are simply prefixed forms of the existing accessors.
Mechanical change using the helper script
https://github.com/jpienaar/llvm-project/blob/main/clang-tools-extra/clang-tidy/misc/AddGetterCheck.cpp and clang-format.
There are a lot of cases where we accidentally ignored the result of some parsing hook. Mark ParseResult as LLVM_NODISCARD, just like LogicalResult is.
This exposed some things to clean up, so this patch does so.
Differential Revision: https://reviews.llvm.org/D125549
Add an attribute to enable generating the intrinsic version of async copy, which produces a copy with L1 bypass. This corresponds to `cp.async.cg.shared.global` in PTX.
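A hedged sketch of the two variants in the dialect's present-day syntax (the `cache` modifier spelling is an assumption):
```
// Stage the copy through L1 (cp.async.ca.shared.global).
nvvm.cp.async.shared.global %dst, %src, 16, cache = ca : !llvm.ptr<3>, !llvm.ptr<1>
// Bypass L1 (cp.async.cg.shared.global); PTX allows this only for 16-byte copies.
nvvm.cp.async.shared.global %dst, %src, 16, cache = cg : !llvm.ptr<3>, !llvm.ptr<1>
```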
Differential Revision: https://reviews.llvm.org/D125241
The NVVM dialect test coverage for all possible type/shape combinations
in the `nvvm.mma.sync` op is mostly complete. However, there were tests
missing for TF32 datatype support. This change adds tests for the one
relevant shape/type combination. This uncovered a small bug in the op
verifier, which this change also fixes.
Differential Revision: https://reviews.llvm.org/D124975
This patch adds MLIR NVVM support for the various NVPTX `mma.sync` operations. There are a number of possible data type, shape, and other attribute combinations supported by the operation, so a custom assembly format is added and attributes are inferred where possible.
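For illustration, a sketch of one f16 combination in the custom format, adapted from the dialect documentation (the particular shape and types shown are just one of the many supported combinations):
```
// m16n8k16 f16 MMA: A uses 4 registers, B uses 2, C/D use 2.
%d = nvvm.mma.sync A[%a0, %a1, %a2, %a3] B[%b0, %b1] C[%c0, %c1]
       {layoutA = #nvvm.mma_layout<row>, layoutB = #nvvm.mma_layout<col>,
        shape = #nvvm.shape<m = 16, n = 8, k = 16>}
     : (vector<2xf16>, vector<2xf16>, vector<2xf16>)
       -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>
```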
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D122410