The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) how ops are matched, 2)
folding wherever it can.
These are independent forms of greedy and leads to inefficiency. E.g.,
cases where one need to create different phases in lowering and is
required to applying patterns in specific order split across different
passes. Using the driver one ends up needlessly retrying folding/having
multiple rounds of folding attempts, where one final run would have
sufficed.
Of course folks can locally avoid this behavior by just building their
own, but this is also a common requested feature that folks keep on
working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated should just be a find and replace (e.g., `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments hasn't changed between the two.
This change allows to expose through an interface attributes wrapping
content as external resources, and the usage inside the ModuleToObject
show how we will be able to provide runtime libraries without relying on
the filesystem.
This is a follow-up of #117246.
I thought then it would be easy to edit a DictionaryAttr but it turns
out that these attributes are immutable and need to be passed during the
construction of the gpu.binary Op.
The first commit was using the NVVMTargetAttr to pass the information.
After feedback from @fabianmcg, this PR now passes the information
through a new option of the gpu-module-to-binary pass.
Please add reviewers, as you see fit.
Continue the move of `warp_execute_on_lane_0` op to the gpu dialect
(#116994). This patch creates a utils library in GPU and moves generic
helper functions there.
Existing implementation may trigger infinite cycles when collecting
effects above or below the current block after wrapping around a
loop-like construct. Limit this case to only looking at the immediate
block (loop body). This is correct because wrap around is intended to
consider effects of different iterations of the same loop and shouldn't
be existing the loop block.
Reported-by: Fabian Mora <fmora.dev@gmail.com>
Co-authored-by: Fabian Mora <fmora.dev@gmail.com>
Operation memref.reinterpret_cast was accept input like:
%out = memref.reinterpret_cast %in to offset: [%offset], sizes: [10],
strides: [1]
: memref<?xf32> to memref<10xf32>
A problem arises: while lowering, the true offset of %out is %offset,
but its data type indicates an offset of 0. Permitting this
inconsistency can result in incorrect outcomes, as certain pass might
erroneously extract the offset from the data type of %out.
This patch fixes this by enforcing that the return value's data type
aligns
with the input parameter.
Making the existing populateGpuLowerSubgroupReduceToShufflePatterns()
function also cover the new "clustered" subgroup reductions is proving
to be inconvenient, because certain backends may have more specific
lowerings that only cover the non-clustered type, and this creates pass
ordering constraints. This commit removes coverage of clustered
reductions from this function in favour of a new separate function,
which makes controlling the lowering much more straightforward.
This patch adds an argument to `gpu::TargetAttrInterface::createObject`
to pass the GPU module. This is useful as `gpu::ObjectAttr` contains a
property dict for metadata, hence the module can be used for extracting
things like the symbol table and adding it to the property dict.
---------
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
This enables performing several reductions in parallel, each smaller
than the size of the subgroup.
One potential application is flash attention with subgroup-wide matrix
multiplication and reduction combined in one kernel. The multiplication
operation requires a 2D matrix to be distributed over the lanes of the
subgroup, which then constrains the shape the following reduction can
have if we want to keep data in registers.
This PR renames the `MemRef` integration test directory for and the
`DecomposeMemref.s.cpp` so that they can be found when doing a
case-sensitive search on file paths.
Most of the code here does not depend on the NVPTX target. In particular
the simple offload can just emit LLVM IR and we can use this without the
NVVM backend being built, which can be useful for a frontend that just
need to serialize the IR and leave it up to the runtime to JIT further.
This patch removes the last vestiges of the old gpu serialization
pipeline. To compile GPU code use target attributes instead.
See [Compilation overview | 'gpu' Dialect - MLIR
docs](https://mlir.llvm.org/docs/Dialects/GPU/#compilation-overview) for
additional information on the target attributes compilation pipeline
that replaced the old serialization pipeline.
This change reworks how range information for GPU dispatch IDs (block
IDs, thread IDs, and so on) is handled.
1. `known_block_size` and `known_grid_size` become inherent attributes
of GPU functions. This makes them less clunky to work with. As a
consequence, the `gpu.func` lowering patterns now only look at the
inherent attributes when setting target-specific attributes on the
`llvm.func` that they lower to.
2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size`
are made official dialect-level discardable attributes which can be
placed on arbitrary functions. This allows for progressive lowerings
(without this, a lowering for `gpu.thread_id` couldn't know about the
bounds if it had already been moved from a `gpu.func` to an `llvm.func`)
and allows for range information to be provided even when
`gpu.*_{id,dim}` are being used outside of a `gpu.func` context.
3. All of these index operations have gained an optional `upper_bound`
attribute, allowing for an alternate mode of operation where the bounds
are specified locally and not inherited from the operation's context.
These also allow handling of cases where the precise launch sizes aren't
known, but can be bounded more precisely than the maximum of what any
platform's API allows. (I'd like to thank @benvanik for pointing out
that this could be useful.)
When inferring bounds (either for range inference or for setting `range`
during lowering) these sources of information are consulted in order of
specificity (`upper_bound` > inherent attribute > discardable attribute,
except that dimension sizes check for `known_*_bounds` to see if they
can be constant-folded before checking their `upper_bound`).
This patch also updates the documentation about the bounds and inference
behavior to clarify what these attributes do when set and the
consequences of setting them up incorrectly.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
LLVM_HAS_NVPTX_TARGET is automatically set depending on whether NVPTX
was enabled when building LLVM. Use this instead of manually defining
MLIR_ENABLE_CUDA_CONVERSIONS (whose name is a bit misleading btw).
This patch removes the `offloadingHandler` option from the
`ModuleToBinary` pass. The option is removed as it cannot be parsed from
textual form.
This fixes issue #90344.
This is another follow-up of #83004. The PR replaces the macro
`MLIR_GPU_TO_HSACO_PASS_ENABLE` with the more generic macro
`MLIR_ENABLE_ROCM_CONVERSIONS`. Until now, the former has been defined
if and only if the latter evaluated to true in CMake. However, the
former was not defined when the latter evaluated to false, in which case
a warning was raised if compiled with `-Wundef`. Using a single macro
relies on the `#cmakedefine01` mechanism that ensures the macro is
always set to either 0 or 1.
This is a follow up of #83004, which made the same change for
`MLIR_CUDA_CONVERSIONS_ENABLED`. As the previous PR, this PR commit
exposes mentioned CMake variable through `mlir-config.h` and uses the
macro that is introduced with the same name. This replaces the macro
`MLIR_ROCM_CONVERSIONS_ENABLED`, which the CMake files previously
defined manually.
That macro was not defined in some cases and thus yielded warnings if
compiled with `-Wundef`. In particular, they were not defined in the
BUILD files, so the GPU targets were broken when built with Bazel. This
commit exposes mentioned CMake variable through mlir-config.h and uses
the macro that is introduced with the same name. This replaces the macro
MLIR_CUDA_CONVERSIONS_ENABLED, which the CMake files previously defined
manually.
The `SerializeToCubin` pass was deprecated in September 2023 in favor of
GPU compilation attributes; see the [GPU
compilation](https://mlir.llvm.org/docs/Dialects/GPU/#gpu-compilation)
section in the `gpu` dialect MLIR docs.
This patch removes `SerializeToCubin` from the repo.
This patch adds an optional offloading handler attribute to
the`gpu.module` op. This attribute will be used during
`gpu-module-to-binary` pass to override the offloading handler used in
the `gpu.binary` op.
The new patterns break down subgroup reduce ops with vector values into
a sequence of subgroup reductions that fit the native shuffle size. The
maximum/native shuffle size is parametrized.
The overall goal is to be able to perform multi-element reductions with
a sequence of `gpu.shuffle` ops.
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
Issue: https://github.com/llvm/llvm-project/issues/72354
This patch fixes the error in issue #75434. The crash was being caused
by not checking for a lack of target attributes in a GPU module. It's
now considered an error to invoke the pass with a GPU module with no
target attributes.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
The motivation for this change is explained in
https://github.com/llvm/llvm-project/issues/72354.
Before this change, we could not tell between signed/unsigned
minimum/maximum and NaN treatment for floating point values.
The mapping of old reduction operations to the new ones is as follows:
* `min` --> `minsi` for ints, `minf` for floats
* `max` --> `maxsi` for ints, `maxf` for floats
New reduction kinds not represented in the old enum: `minui`, `maxui`,
`minimumf`, `maximumf`.
As a next step, I would like to have a common definition of combining
kinds used by the `vector` and `gpu` dialects. Separately, the GPU to
SPIR-V lowering does not yet properly handle zero and NaN values -- the
behavior of floating point min/max group reductions is not specified by
the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459.
Issue: https://github.com/llvm/llvm-project/issues/72354
This PR generalize gpu-out-lining pass to take care of ops
`SymbolOpInterface` instead of just `func::FuncOp`.
Before this change, gpu-out-lining pass will skip `llvm.func`.
```mlir
module {
llvm.func @main() {
%c1 = arith.constant 1 : index
gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
gpu.terminator
}
llvm.return
}
}
```
After this change, gpu-out-lining pass can handle llvm.func as well.
Allows the barrier elimination code to be run from C++ as well. The code
from transforms dialect is copied as-is, the pass and populate functions
have beed added at the end.
Co-authored-by: Eric Eaton <eric@nod-labs.com>
Enable merging #71439 by removing a definitely-wrong usage of
std::unique_ptr<SmallVectorImpl<char>> as a return value with passing in
a SmallVectorImpl<char>&
Also change the following function to take ArrayRef<char> instead of
const SmalVectorImpl<char>& .
This commit implements gpu::TargetAttrInterface for SPIR-V target
attribute. The plan is to use this to enable GPU compilation pipeline
for OpenCL kernels later.
The changes do not impact Vulkan shaders using milr-vulkan-runner.
New GPU Dialect transform pass spirv-attach-target is implemented for
attaching attribute from CLI.
gpu-module-to-binary pass now works with GPU module that has SPIR-V
module with OpenCL kernel functions inside.