This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
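To illustrate the ambiguity, here is a minimal standalone C++ sketch of the
two NaN treatments. The helper names `maxNumLike` and `maximumLike` are made
up for this example and are not MLIR APIs; signed-zero ordering is
deliberately left out.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Sketch of `arith.maxnumf`-style NaN handling: a NaN operand is ignored
// when the other operand is a number. std::fmax has this NaN behavior.
inline float maxNumLike(float a, float b) { return std::fmax(a, b); }

// Sketch of `arith.maximumf`-style NaN handling: any NaN operand makes the
// result NaN. Signed-zero ordering is not modeled here.
inline float maximumLike(float a, float b) {
  if (std::isnan(a) || std::isnan(b))
    return std::numeric_limits<float>::quiet_NaN();
  return a > b ? a : b;
}
```

With one NaN operand, the first helper returns the other operand while the
second returns NaN, which is the distinction a bare `<maxf>` did not convey.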
Issue: https://github.com/llvm/llvm-project/issues/72354
This patch fixes the error in issue #75434. The crash was being caused
by not checking for a lack of target attributes in a GPU module. It's
now considered an error to invoke the pass with a GPU module with no
target attributes.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
The motivation for this change is explained in
https://github.com/llvm/llvm-project/issues/72354.
Before this change, we could not tell between signed/unsigned
minimum/maximum and NaN treatment for floating point values.
The mapping of old reduction operations to the new ones is as follows:
* `min` --> `minsi` for ints, `minf` for floats
* `max` --> `maxsi` for ints, `maxf` for floats
New reduction kinds not represented in the old enum: `minui`, `maxui`,
`minimumf`, `maximumf`.
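The mapping above can be sketched as a small standalone C++ function. The
helper `renameReductionKind` is hypothetical and only meant for downstream
users migrating textual IR; returning other kind names unchanged is an
assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of MLIR: returns the new reduction kind
// name for an old `min`/`max` kind, given whether the element type is a
// float. Other kind names are returned unchanged (an assumption here).
inline std::string renameReductionKind(const std::string &oldKind,
                                       bool isFloat) {
  if (oldKind == "min")
    return isFloat ? "minf" : "minsi";
  if (oldKind == "max")
    return isFloat ? "maxf" : "maxsi";
  return oldKind;
}
```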
As a next step, I would like to have a common definition of combining
kinds used by the `vector` and `gpu` dialects. Separately, the GPU to
SPIR-V lowering does not yet properly handle zero and NaN values -- the
behavior of floating point min/max group reductions is not specified by
the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459.
Issue: https://github.com/llvm/llvm-project/issues/72354
This PR generalizes the GPU kernel outlining pass to handle any op
implementing `SymbolOpInterface` instead of just `func::FuncOp`.
Before this change, the GPU kernel outlining pass skipped `llvm.func`.
```mlir
module {
  llvm.func @main() {
    %c1 = arith.constant 1 : index
    gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
      gpu.terminator
    }
    llvm.return
  }
}
```
After this change, the GPU kernel outlining pass can handle `llvm.func` as well.
Allows the barrier elimination code to be run from C++ as well. The code
from the transform dialect is copied as-is; the pass and populate functions
have been added at the end.
Co-authored-by: Eric Eaton <eric@nod-labs.com>
Enable merging #71439 by replacing a definitely-wrong usage of
std::unique_ptr<SmallVectorImpl<char>> as a return value with passing in
a SmallVectorImpl<char>&.
Also change the following function to take ArrayRef<char> instead of
const SmallVectorImpl<char>&.
This commit implements gpu::TargetAttrInterface for the SPIR-V target
attribute. The plan is to use this to enable a GPU compilation pipeline
for OpenCL kernels later.
The changes do not impact Vulkan shaders using mlir-vulkan-runner.
A new GPU dialect transform pass, spirv-attach-target, is implemented for
attaching the attribute from the CLI.
The gpu-module-to-binary pass now works with a GPU module that contains a
SPIR-V module with OpenCL kernel functions inside.
This commit adjusts the CUDA context management in the SerializeToCubin
pass. In particular, it uses the device 0 primary context instead of
creating a new CUDA context on each invocation of SerializeToCubin. This
yields very large improvements in compile time, especially if an
application (like a JIT compiler) is calling SerializeToCubin
repeatedly.
Differential Revision: https://reviews.llvm.org/D159487
Co-authored-by: Rohan Yadav <rohany@cs.stanford.edu>
SerializeToHsaco, as currently implemented, leaks the file descriptor
of the .hsaco temporary file, which causes issues in long-running
parallel compilation setups.
See also https://github.com/ROCmSoftwarePlatform/rocMLIR/pull/1257
This is necessary to support deallocation of IR with gpu.launch
operations because it does not implement the RegionBranchOpInterface.
Implementing the interface would require it to support regions with
unstructured control flow and produced arguments/results.
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to add any possible
extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
This patch adds the option of building an optional symbol table for the
top operation in the `gpu-module-to-binary` pass. The table is not
created by default as most targets don't need it; instead, it is lazily
built. The table is passed through a callback in `TargetOptions`.
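As a rough illustration of the lazy-callback idea, here is a standalone C++
sketch. `SymbolTable` is a plain `std::map` stand-in and `makeLazyTable` is a
hypothetical helper; neither mirrors the actual `TargetOptions` API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stand-in for a symbol table: maps symbol names to some payload.
using SymbolTable = std::map<std::string, int>;

// Hypothetical helper: returns a callback that builds the table on the
// first call and returns the cached table on every later call.
inline std::function<SymbolTable *()>
makeLazyTable(std::function<SymbolTable()> build) {
  auto cache = std::make_shared<std::unique_ptr<SymbolTable>>();
  return [cache, build]() {
    if (!*cache)
      *cache = std::make_unique<SymbolTable>(build());
    return cache->get();
  };
}
```

Targets that never invoke the callback never pay for building the table.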
This patch is required to integrate #65539 .
Currently, the NVPTX tool compilation path only calls `ptxas`; thus, the
GPU running the binary must be an exact match of the arch of the target,
or else the runtime throws an error due to the arch mismatch.
This patch adds a call to `fatbinary`, creating a fat binary with the
cubin object and the PTX code, allowing the driver to JIT the PTX at
runtime if there's an arch mismatch.
This revision avoids the registration of dialect extensions in Pass::getDependentDialects.
Such registration of extensions can be dangerous because `DialectRegistry::isSubsetOf` is
always guaranteed to return false for extensions (i.e. there is no mechanism to track
whether a lambda is already in the list of already registered extensions).
When the context is already in a multi-threaded mode, this is guaranteed to assert.
Arguably a more structured registration mechanism for extensions with a unique ExtensionID
could be envisioned in the future.
In the process of cleaning this up, multiple usage inconsistencies surfaced around the
registration of translation extensions that this revision also cleans up.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D157703
Adds the passes `nvvm-attach-target` & `rocdl-attach-target` for attaching `nvvm.target` & `rocdl.target` attributes to GPU Modules.
These passes search for GPU Modules in the immediate region of the Op being acted on and attach the target attribute to each matching module.
Modules can be selected using a regex string, allowing fine-grained attachment of targets; see the test `attach-target.mlir` for an example.
Depends on D154153
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D157351
**For an explanation of these patches see D154153.**
Commit message:
This pass converts GPU modules into GPU binaries, serializing all targets present
in a GPU module by invoking the `serializeToObject` target attribute method.
Depends on D154147
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D154149
This patch fixes the output of the error message that is printed when
the CUDA library cannot identify the error code. In that case, no error
message is provided by the library, and the previous implementation just
printed the content of an uninitialized pointer. This patch
initializes the pointer to nullptr and only prints the content if the
pointer has changed.
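The pattern can be sketched in standalone C++ as follows. `lookupErrorName`
is a hypothetical stand-in for the library call, not the CUDA API, and
`describeError` is made up for this example.

```cpp
#include <cassert>
#include <string>

// Stand-in for a library lookup (hypothetical, not the CUDA API): writes a
// name into *out for known codes and leaves *out untouched otherwise.
inline void lookupErrorName(int code, const char **out) {
  if (code == 1)
    *out = "INVALID_VALUE";
}

// Formats an error message. The pointer starts as nullptr so that an
// unrecognized code is detected instead of reading an indeterminate pointer.
inline std::string describeError(int code) {
  const char *name = nullptr;
  lookupErrorName(code, &name);
  if (!name)
    return "unknown error code " + std::to_string(code);
  return std::string(name);
}
```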
Reviewed By: Mogball
Differential Revision: https://reviews.llvm.org/D156791
Some GPU backends (SPIR-V) lower memrefs to bare pointers, so lowering fails for dynamically sized/strided memrefs.
This pass extracts sizes and strides via `memref.extract_strided_metadata` outside the `gpu.launch` body, does the index/offset calculation explicitly, and then reconstructs memrefs via `memref.reinterpret_cast`.
`memref.reinterpret_cast` is then lowered via https://reviews.llvm.org/D155011
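For reference, the explicit index/offset calculation follows the usual
strided-layout formula. The standalone C++ sketch below shows that
arithmetic only; `linearizeIndex` is a hypothetical name and this is not the
pass's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes the linear element position of `indices` in a strided buffer:
// offset + sum(indices[k] * strides[k]).
inline int64_t linearizeIndex(int64_t offset,
                              const std::vector<int64_t> &strides,
                              const std::vector<int64_t> &indices) {
  assert(strides.size() == indices.size() && "rank mismatch");
  int64_t linear = offset;
  for (std::size_t k = 0; k < strides.size(); ++k)
    linear += indices[k] * strides[k];
  return linear;
}
```

For example, with offset 2, strides (8, 1), and indices (3, 5), the element
sits at position 2 + 3 * 8 + 5 * 1 = 31.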
Differential Revision: https://reviews.llvm.org/D155247
This revision untangles a few more conversion pieces and allows rewriting
the relatively intricate (and somewhat inconsistent) LowerGpuOpsToNVVMOpsPass
in a declarative fashion that provides a much better understanding and control.
Differential Revision: https://reviews.llvm.org/D157617
This reverts commit 2e0e00ed84
and reverts commit a6eb40692c
and reverts commit 585cbe3f63.
15 tests are broken on the mlir-nvidia buildbot:
'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_INVALID_SOURCE'
'cuModuleGetFunction(&function, module, name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, smem, stream, params, extra)' failed with 'CUDA_ERROR_INVALID_HANDLE'
'cuModuleUnload(module)' failed with 'CUDA_ERROR_INVALID_HANDLE'
The current default SM version is 35, which was deprecated a long time ago. D155563 introduced ptxas compilation, and using sm_35 causes failures on the buildbot. This change increases the default SM version to 50.
Differential Revision: https://reviews.llvm.org/D156098
A recent change introduced compilation with the ptxas compiler. That change is important because it makes it possible to use different versions of ptxas without changing the compiler.
It causes some failures on the buildbot, so this change adds a fallback to JIT compilation, which is the original path.
Differential Revision: https://reviews.llvm.org/D156096
This work improves how we compile the generated PTX code by using the `ptxas` compiler. Currently, we rely on the driver's JIT API to compile the PTX code. However, this approach has some limitations. It doesn't always produce the same binary output as the ptxas compiler, leading to potential inconsistencies in the generated cubin files.
This work uses the ptxas compiler directly for PTX compilation, which gives more consistent and reliable results when generating cubin files. Key benefits:
- Using the ptxas compiler directly ensures that the cubin files generated during the build process remain consistent with CUDA compilation using `nvcc` or `clang`.
- It allows developers to experiment with different ptxas versions without changing the compiler. Performance varies among ptxas versions, so being able to swap them easily is useful.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155563
No matter how one constructs their SerializeTo* pass, we want to
ensure that the LLVM initialization code runs once and only once. This
commit adds a static once_flag to ensure that.
I've run into mysterious segfaults when calling MLIR GPU compiles from
multiple threads, and this commit is a potential fix for the issue.
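A minimal standalone C++ sketch of the once-only pattern is shown below.
`initializeBackendOnce` and `initCount` are hypothetical names, and the
counter stands in for the real LLVM initialization calls.

```cpp
#include <cassert>
#include <mutex>

// Counts how often the stand-in initialization ran (for illustration only).
inline int &initCount() {
  static int count = 0;
  return count;
}

// Hypothetical stand-in for the LLVM initialization code. The static
// once_flag guarantees the body runs a single time, even when this is
// called from several pass constructors or several threads.
inline void initializeBackendOnce() {
  static std::once_flag flag;
  std::call_once(flag, [] { ++initCount(); });
}
```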
Reviewed By: fmorac
Differential Revision: https://reviews.llvm.org/D155226
When targeting NVIDIA GPUs, seeing the generated PTX is important. Currently, we don't have a simple way to do it.
This work adds a dump-ptx option to the gpu-to-cubin pass. One can use it like `gpu-to-cubin{chip=sm_90 features=+ptx80 dump-ptx}`.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155166
* Use `create` instead of `createOrFold` for constant ops. Constants cannot be folded any further.
* Use `create` instead of `createOrFold` for ops that do not have a folder.
* Use C++ op builders that take an `int` instead of creating a `ConstantIndexOp`.
* Create `tensor::DimOp` instead of `linalg::createOrFoldDimOp` when it is certain that the operand is a tensor.
Differential Revision: https://reviews.llvm.org/D154196
Before serializing, LLVM optimizations were only run on the path to
hsaco, and not on the path to cubin. Define opt-level for the `gpu-to-cubin`
pass as well, and move the call that optimizes the LLVM IR to a common place.
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D151554
This patch adds support for i64 and f64 values in `gpu.shuffle`, rewriting 64-bit shuffles into two 32-bit shuffles.
The reason behind this change is that both CUDA & HIP support this kind of shuffling.
The implementation provided by this patch is based on the LLVM IR emitted by clang for 64-bit shuffles when using `-O3`.
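The split-and-recombine idea can be sketched in standalone C++ by modeling
the lanes as a vector. `shuffle32` and `shuffle64` are hypothetical helpers
that model a "read from lane `src`" shuffle on the host; they are not the
generated IR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models a 32-bit "read the value held by lane `src`" shuffle.
inline uint32_t shuffle32(const std::vector<uint32_t> &lanes,
                          std::size_t src) {
  return lanes[src];
}

// Models a 64-bit shuffle as two 32-bit shuffles: split every lane value
// into low and high halves, shuffle each half, then recombine.
inline uint64_t shuffle64(const std::vector<uint64_t> &lanes,
                          std::size_t src) {
  std::vector<uint32_t> lo, hi;
  for (uint64_t value : lanes) {
    lo.push_back(static_cast<uint32_t>(value));
    hi.push_back(static_cast<uint32_t>(value >> 32));
  }
  uint64_t newLo = shuffle32(lo, src);
  uint64_t newHi = shuffle32(hi, src);
  return (newHi << 32) | newLo;
}
```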
Reviewed By: makslevental
Differential Revision: https://reviews.llvm.org/D148974
This patch implements a rewrite pattern for transforming gpu.global_id x
to gpu.thread_id + gpu.block_id * gpu.block_dim.
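The arithmetic behind the rewrite, as a standalone C++ sketch for a single
dimension (`globalId` is a hypothetical helper name):

```cpp
#include <cassert>
#include <cstdint>

// Global id along one dimension: thread id within the block plus the number
// of threads in all preceding blocks.
inline int64_t globalId(int64_t threadId, int64_t blockId, int64_t blockDim) {
  return threadId + blockId * blockDim;
}
```

For example, thread 3 of block 2 with 128 threads per block has global id
3 + 2 * 128 = 259.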
Reviewed By: makslevental
Differential Revision: https://reviews.llvm.org/D148978
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.
Caveats include:
- This clang-tidy script probably has more problems.
- This only touches C++ code, so nothing that is being generated.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This first patch was created with the following steps. The intention is
to only do automated changes at first, so I waste less time if it's
reverted, and so the first mass change is more clear as an example to
other teams that will need to follow similar steps.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
4. Some changes have been deleted for the following reasons:
- Some files had a variable also named cast
- Some files had not included a header file that defines the cast
functions
- Some files are definitions of the classes that have the casting
methods, so the code still refers to the method instead of the
function without adding a prefix or removing the method declaration
at the same time.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\
mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\
mlir/lib/**/IR/\
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\
mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\
mlir/test/lib/Dialect/Test/TestTypes.cpp\
mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\
mlir/test/lib/Dialect/Test/TestAttributes.cpp\
mlir/unittests/TableGen/EnumsGenTest.cpp\
mlir/test/python/lib/PythonTestCAPI.cpp\
mlir/include/mlir/IR/
```
Differential Revision: https://reviews.llvm.org/D150123
This new feature makes it possible to dedicate custom storage inline within operations.
This storage can be used as an alternative to attributes to store data that is
specific to an operation. Attributes can also be stored inside the properties
storage if desired, but any kind of data can be present as well. This offers
a way to store and mutate data without uniquing in the Context, unlike Attributes.
See the OpPropertiesTest.cpp for an example where a struct with a
std::vector<> is attached to an operation and mutated in-place:
struct TestProperties {
int a = -1;
float b = -1.;
std::vector<int64_t> array = {-33};
};
More complex schemes (including reference-counting) are also possible.
The only constraint to enable storing a C++ object as "properties" on an
operation is to implement three functions:
- convert from the candidate object to an Attribute
- convert from the Attribute to the candidate object
- hash the object
Optionally, the parsing and printing can also be customized with 2 extra
functions.
A new option is introduced to ODS to allow dialects to specify:
let usePropertiesForAttributes = 1;
When set to true, the inherent attributes for all the ops in this dialect
will be using properties instead of being stored alongside discardable
attributes.
The TestDialect showcases this feature.
Another change is that we introduce new APIs on the Operation class
to access separately the inherent attributes from the discardable ones.
We envision deprecating and removing the `getAttr()`, `getAttrsDictionary()`,
and other similar methods which don't make the distinction explicit, leading
to an entirely separate namespace for discardable attributes.
Recommit d572cd1b06 after fixing python bindings build.
Differential Revision: https://reviews.llvm.org/D141742
Currently, memory attributions are not supported for gpu::LaunchOp. This patch implements memory attributions for gpu::LaunchOp and modifies the KernelOutlining pass to make the attributions available in GPUFuncOp.
Reviewed By: makslevental
Differential Revision: https://reviews.llvm.org/D147809
This patch supports the processing of dialect attributes attached to top-level
module-type operations during MLIR-to-LLVMIR lowering.
This approach modifies the `mlir::translateModuleToLLVMIR()` function to call
`ModuleTranslation::convertOperation()` on the top-level operation, after its
body has been lowered. This, in turn, will get the
`LLVMTranslationDialectInterface` object associated to that operation's dialect
before trying to use it for lowering prior to processing dialect attributes
attached to the operation.
Since there are no `LLVMTranslationDialectInterface`s for the builtin and GPU
dialects, which define their own module-type operations, this patch also adds
and registers them. The requirement for always calling
`mlir::registerBuiltinDialectTranslation()` before any translation of MLIR to
LLVM IR where builtin module operations are present is introduced. The purpose
of these new translation interfaces is to succeed when processing module-type
operations, allowing the lowering process to continue and to prevent the
introduction of failures related to not finding such interfaces.
Differential Revision: https://reviews.llvm.org/D145932