Add support for argument attributes on workgroup and private
attributions for GPU functions. These arguments are outside the range
of getNumArguments() and get printed separately, so the default
mechanism for function argument attributes can't be used on them.
Having done this, check for the `llvm.align` attribute on workgroup or
private attributions in a `gpu.func` and pass it through to the
relevant allocation op (creating a global or alloca). This allows
people creating kernels that use multiple workgroup buffers to set an
alignment.
(This could, in the future, be a GPU dialect `alignment` attribute, but
I've taken the route of using the LLVM attribute instead, both for
simplicity and because I don't know how a GPU-level attribute might
impact backends like Vulkan.)
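A hedged sketch of what this enables, using MLIR's C++ API; the per-attribution attribute setter name below is my assumption based on the description above, not a verified signature:
```
// Hypothetical sketch: request 16-byte alignment for the first workgroup
// attribution of a gpu.func. The accessor name is an assumption.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

void alignFirstWorkgroupBuffer(mlir::gpu::GPUFuncOp func) {
  mlir::Builder b(func.getContext());
  // The alignment is picked up when the attribution is lowered to a
  // global (workgroup) or alloca (private) allocation.
  func.setWorkgroupAttributionAttr(/*index=*/0, "llvm.align",
                                   b.getI64IntegerAttr(16));
}
```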
Reviewed By: nirvedhmeshram
Differential Revision: https://reviews.llvm.org/D148965
Memory attributions are currently not supported for gpu::LaunchOp. This patch implements them for gpu::LaunchOp and modifies the KernelOutlining pass to make the attributions available on the outlined GPUFuncOp.
Reviewed By: makslevental
Differential Revision: https://reviews.llvm.org/D147809
This revision refactors the implementation of mapping to threads to additionally allow warps and linear ids to be specified.
`warp_dims` is currently specified along with `block_dims` as a transform attribute.
Linear ids, on the other hand, use the flattened block_dims to predicate on the first (linearized) k threads.
An additional GPULinearIdMappingAttr is added to the GPU dialect to allow specifying loops mapped to this new scheme.
Various implementation and transform op semantics cleanups are also applied.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D146130
This patch adds support for copying a `memref.subview` to shared or private memory on the GPU. The global-to-shared-memory copy is adapted from code in IREE (https://github.com/iree-org/iree), but the private-memory copy part has not been implemented in IREE. This patch enables transferring a subview from `global->shared`, `global->private`, and `shared->private`.
Our final aim is to provide a copy layout as an affine map to the `transform.promote` op to support transposed memory copies. This map is a permutation of the original affine index map. Although this has been implemented and users can copy data to an arbitrary layout, that attempt is not included in this patch because we still have problems getting `linalg.generic` operations to change their index map to the transformed one. You can find more in the following links ([[ 4fd5f93355 | Initial attempt to support layout map in promote op in transform dialect ]]) ([[ 9062b5849f | Fix data transpose in shared memory ]])
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D144666
Change map_nested_foreach_to_threads to ignore foreach_thread ops that
do not map to threads; this will allow us to call
mapNestedForeachToThreadsImpl with different sets of ids to lower
multiple levels. Also add warpIds attributes.
Differential Revision: https://reviews.llvm.org/D143298
The parser of gpu.launch_func was incorrectly rejecting SSA values with
result numbers (`%0#0`) in the list of function arguments by using the
`parseArgument` function intended for region argument declarations, not
operands. Fix this by directly parsing comma-separated operands and
types.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D141851
This is a purely mechanical change that introduces an enum attribute in the GPU
dialect to represent the various memref memory spaces as opposed to the
hard-coded integer attributes that are currently used.
The following steps were taken to make the transition across the codebase:
1. Introduce a pass "gpu-lower-memory-space-attributes":
The pass updates all memref types that have a memory space attribute that is a
`gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a
mapping that is given by the caller. This pass is based on the
"map-memref-spirv-storage-class" pass and the common functions can probably
be refactored into a set of utilities under the MemRef dialect.
2. Update the verifiers of GPU/NVGPU dialect operations.
If a verifier currently checks the address space of an operand using,
e.g., `getWorkspaceAddressSpace`, then it can continue to do so. However, the
checks are changed to only fail if the memory space is either missing or a wrong
value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address
space is correct because it was specifically lowered to something other than a
`gpu::AddressSpaceAttr`.
3. Update existing gpu-to-llvm conversion infrastructure.
In the existing gpu-to-X passes, we add a full conversion equivalent to
`gpu-lower-memory-space-attributes` just before doing the conversion to the
LLVMDialect. This is done because currently both the gpu-to-llvm passes
(rocdl, nvvm) run gpu-to-gpu rewrites within the pass, which introduce
`AddressSpaceAttr` memory space annotations. Therefore, I inserted the
memory space conversion between the gpu-to-gpu rewrites and the LLVM
conversion.
For more context see the below discourse discussion:
https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/
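For illustration, a minimal sketch (assuming the `gpu::AddressSpace` enum exposes `Global`/`Workgroup`/`Private` values) of building a memref type that carries the new attribute as its memory space instead of a hard-coded integer:
```
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/BuiltinTypes.h"

// Create a 64x64xf32 memref in GPU workgroup (shared) memory.
mlir::MemRefType makeWorkgroupMemRef(mlir::MLIRContext *ctx) {
  auto space = mlir::gpu::AddressSpaceAttr::get(
      ctx, mlir::gpu::AddressSpace::Workgroup);
  return mlir::MemRefType::get({64, 64}, mlir::Float32Type::get(ctx),
                               mlir::MemRefLayoutAttrInterface(),
                               /*memorySpace=*/space);
}
```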
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D140644
The patch adds support for mapping operations to `BlockAndValueMapping` and renames it to `IRMapping`. When operations are cloned, the original operations are mapped to the cloned operations, which allows looking up the clone of a given operation. Example:
```
Operation *opWithRegion = ...
Operation *opInsideRegion = &opWithRegion->front().front();
IRMapping map;
Operation *newOpWithRegion = opWithRegion->clone(map);
Operation *newOpInsideRegion = map.lookupOrNull(opInsideRegion);
```
Migration instructions:
All includes of `mlir/IR/BlockAndValueMapping.h` should be replaced with `mlir/IR/IRMapping.h`, and all uses of `BlockAndValueMapping` need to be renamed to `IRMapping`.
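A minimal before/after sketch of the rename (the surrounding helper is illustrative only):
```
// Before: #include "mlir/IR/BlockAndValueMapping.h"
//         BlockAndValueMapping mapping;
#include "mlir/IR/IRMapping.h"

mlir::Operation *cloneWithMapping(mlir::Operation *op) {
  mlir::IRMapping mapping; // renamed from BlockAndValueMapping
  return op->clone(mapping);
}
```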
Reviewed By: rriddle, mehdi_amini
Differential Revision: https://reviews.llvm.org/D139665
This new option is set to `false` by default. It should be set only in Canonicalizer tests to detect faulty canonicalization patterns, i.e., patterns that prevent the canonicalizer from converging. The canonicalizer should always converge on small unit tests such as the ones we have in `canonicalize.mlir`.
Two faulty canonicalization patterns were detected and fixed with this change.
Differential Revision: https://reviews.llvm.org/D140873
In many cases, the number of workgroups (the grid size) and the
number of workitems within each group (the block size) that a GPU
kernel will be launched with are known. For example, if gpu.launch is
called with constant block and grid sizes, we know that those are the
only possible sizes that will be used to launch that kernel. In other
cases, a custom code-generation pipeline that eventually produces GPU
kernels may know the launch dimensions of those kernels, or at least
may be able to provide an upper bound on them.
Other GPU programming systems, such as OpenCL, allow capturing such
information to enable compiler optimizations (see
reqd_work_group_size), but MLIR currently has no mechanism for doing so.
This set of attributes is the first step in enabling optimizations
based on the known launch dimensions of kernels. It extends the kernel
outline pass to set these bounds on kernels with constant launch
dimensions and extends integer range inference for GPU index
operations to account for the bounds when they are known.
Subsequent revisions will use this data when lowering GPU operations
to the ROCDL dialect.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D139865
This is part of an effort to migrate from llvm::Optional to
std::optional. This patch changes the way mlir-tblgen generates .inc
files, and modifies tests and documentation appropriately. It is a "no
compromises" patch, and doesn't leave the user with an unpleasant mix of
llvm::Optional and std::optional.
A non-trivial change has been made to ControlFlowInterfaces to split one
constructor into two, relating to a build failure on Windows.
See also: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
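A small, self-contained sketch of the user-facing migration; the accessor below is a stand-in, not any specific op's generated API:
```
#include <cstdint>
#include <iostream>
#include <optional>

// Stand-in for an ODS-generated accessor that used to return llvm::Optional.
std::optional<uint64_t> getAlignment(bool hasAttr) {
  return hasAttr ? std::optional<uint64_t>(16) : std::nullopt;
}

int main() {
  // Before: if (align.hasValue()) use(align.getValue());
  // After, with the std::optional spelling:
  std::optional<uint64_t> align = getAlignment(true);
  if (align.has_value())
    std::cout << "alignment = " << align.value() << "\n";
}
```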
Signed-off-by: Ramkumar Ramachandra <r@artagnon.com>
Differential Revision: https://reviews.llvm.org/D138934
Reland D139447, D139471 with flang actually working.
- FunctionOpInterface: make get/setFunctionType interface methods
This patch removes the concept of a `function_type`-named type attribute
as a requirement for implementors of FunctionOpInterface. Instead, this
type should be provided through two interface methods, `getFunctionType`
and `setFunctionTypeAttr` (*Attr because functions may use different
concrete function types), which should be automatically implemented by
ODS for ops that define a `$function_type` attribute.
This also allows FunctionOpInterface to materialize function types if
they don't carry them in an attribute, for example.
Importantly, all the function "helpers" still accept an attribute name to
use when parsing and printing functions.
- FunctionOpInterface: arg and result attrs dispatch to interface
This patch removes the `arg_attrs` and `res_attrs` named attributes as a
requirement for FunctionOpInterface and replaces them with interface
methods for the getters, setters, and removers of the relevant
attributes. This allows operations to use their own storage for the
argument and result attributes.
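As a rough illustration (the header path may differ across MLIR versions), generic code can now query a function's signature purely through the interface methods rather than reading the raw attributes:
```
#include "mlir/IR/FunctionInterfaces.h" // location may vary by MLIR version
#include "llvm/Support/raw_ostream.h"

void dumpSignature(mlir::FunctionOpInterface fn) {
  // getFunctionType() is an interface method; the op no longer needs a
  // `function_type`-named attribute for this to work.
  mlir::Type fnType = fn.getFunctionType();
  llvm::outs() << "type: " << fnType
               << ", num args: " << fn.getNumArguments() << "\n";
}
```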
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D139736
This patch removes the `arg_attrs` and `res_attrs` named attributes as a
requirement for FunctionOpInterface and replaces them with interface
methods for the getters, setters, and removers of the relevant
attributes. This allows operations to use their own storage for the
argument and result attributes.
Depends on D139471
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D139472
This patch removes the concept of a `function_type`-named type attribute
as a requirement for implementors of FunctionOpInterface. Instead, this
type should be provided through two interface methods, `getFunctionType`
and `setFunctionTypeAttr` (*Attr because functions may use different
concrete function types), which should be automatically implemented by
ODS for ops that define a `$function_type` attribute.
This also allows FunctionOpInterface to materialize function types if
they don't carry them in an attribute, for example.
Importantly, all the function "helpers" still accept an attribute name to
use when parsing and printing functions.
Reviewed By: rriddle, lattner
Differential Revision: https://reviews.llvm.org/D139447
`DeviceMappingAttrInterface` is implemented as a unifying mechanism for thread mapping. A code generator could use any attribute that implements this interface to lower `scf.foreach_thread` to device-specific code, and is free to choose its own mapping and interpretation.
Currently, the GPU transform dialect supports only `GPUThreadMapping` and `GPUBlockMapping`; however, other mappings should be supported as well. This change addresses that issue: it decouples the GPU transform dialect from `GPUThreadMapping` and `GPUBlockMapping`, so they can now work with any other mapping.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D138020
The foldMemRefCast method is defined in the memref namespace; the
foldTensorCast method is defined in the tensor namespace. This revision
deletes the duplicated code and uses the unified methods.
Reviewed By: dcaballe
Differential Revision: https://reviews.llvm.org/D136379
Introduce `subgroup_reduce` operation, similar to `all_reduce`, but operating on subgroup scope instead of workgroup.
It is intended as a low-level building block for higher-level abstractions (e.g. for workgroup-wide `all_reduce` ops).
For simplicity's sake, only the version taking a reduction operation enum is introduced.
Differential Revision: https://reviews.llvm.org/D135323
This allows for incrementally updating the old API usages without
needing to update everything at once. These will be left on Both
for a little bit and then flipped to prefixed when all APIs have been
updated.
Differential Revision: https://reviews.llvm.org/D134386
GPUFuncOpLowering moves the body out of a gpu.func op and erases it. An empty gpu.func may fail verification but should not crash the verifier. Verification of the erased op is triggered e.g. when debug printing is on.
Reviewed By: akuegel
Differential Revision: https://reviews.llvm.org/D132446
This reland includes changes to the Python bindings.
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
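For reference, a small sketch of the new encoding; the helper name and the segment sizes are made up for illustration:
```
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/MLIRContext.h"

// Segment sizes as a dense i32 array, e.g. for an op whose three variadic
// operand groups have sizes 1, 2, and 0.
mlir::DenseI32ArrayAttr makeOperandSegmentSizes(mlir::MLIRContext *ctx) {
  return mlir::DenseI32ArrayAttr::get(ctx, {1, 2, 0});
}
```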
Depends on D131801
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D131803
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131738
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D131702
Erase the gpu.memcpy op when the only uses of its destination are
the memcpy op in question and its allocation and deallocation
ops.
Reviewed By: bondhugula, csigg
Differential Revision: https://reviews.llvm.org/D124257
Move async copy operations to NVGPU as they only exist on NV targets and are
designed to match PTX semantics. This allows us to also add more fine-grained
caching hint attributes to the op.
Add a hint to bypass L1 and hook it up to the NVVM op.
Differential Revision: https://reviews.llvm.org/D125244
MLIR has a common pattern for "arguments" that uses syntax
like `%x : i32 {attrs} loc("sourceloc")` which is implemented
in ad-hoc ways throughout the codebase. The existing approach
is verbose (because it is implemented with parallel arrays) and
inconsistent (e.g. lots of things drop source location info).
Solve this by introducing OpAsmParser::Argument and make addRegion
(which sets up BlockArguments for the region) take it. Convert the
world to propagating this down. This means that we correctly
capture and propagate source location information in a lot more
cases (e.g. see the affine.for testcase example), and it also
simplifies much code.
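A minimal sketch of the new structure from a custom parser's perspective (`parseMyOpArgument` is a made-up helper):
```
#include "mlir/IR/OpImplementation.h"

// One OpAsmParser::Argument carries the SSA name, type, attributes, and
// source location, i.e. `%x : i32 {attrs} loc("sourceloc")` in one call.
mlir::ParseResult parseMyOpArgument(mlir::OpAsmParser &parser,
                                    mlir::OpAsmParser::Argument &arg) {
  return parser.parseArgument(arg, /*allowType=*/true, /*allowAttrs=*/true);
}
```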
Differential Revision: https://reviews.llvm.org/D124649
The asm parser had a notional distinction between parsing an
operand (like "%foo" or "%4#3") and parsing a region argument
(which isn't supposed to allow a result number like #3).
Unfortunately the implementation has two problems:
1) It didn't actually check for the result number and reject
it. parseRegionArgument and parseOperand were identical.
2) It had a lot of machinery built up around it that paralleled
operand parsing. This also was functionally identical, but
also had some subtle differences (e.g. the parseOptional
stuff had a different result type).
I thought about just removing all of this, but decided that the
missing error checking was important, so I reimplemented it with
an `allowResultNumber` flag on parseOperand. This keeps the
codepaths unified and adds the missing error checks.
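A short sketch of how a custom parser would use the flag (the wrapper function is illustrative):
```
#include "mlir/IR/OpImplementation.h"

// Region entry-block arguments must not carry a result number, so pass
// allowResultNumber=false; plain operand parsing keeps the default (true)
// and still accepts forms like `%0#1`.
mlir::ParseResult parseEntryArgOperand(mlir::OpAsmParser &parser) {
  mlir::OpAsmParser::UnresolvedOperand operand;
  return parser.parseOperand(operand, /*allowResultNumber=*/false);
}
```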
Differential Revision: https://reviews.llvm.org/D124470
When Location tracking support for block arguments was added, we
discussed various approaches to threading support for this through
function-like argument parsing. At the time, we added a parallel array
of locations that could hold this. It turns out that that approach was
verbose and error prone, and roughly no one adopted it.
This patch takes a different approach, adding an optional source
locator to the UnresolvedOperand class. This fits much more naturally
into the standard structure we use for representing locators, and gives
all the function-like dialects locator support for free (e.g. see the
test adding an example for the LLVM dialect).
Differential Revision: https://reviews.llvm.org/D124188
Add async dependencies support for gpu.launch op: this allows specifying
a list of async tokens ("streams") as dependencies for the launch.
Update the lowering in the GPU kernel outlining pass to propagate async
dependencies from gpu.launch to the gpu.launch_func op. Previously, a new
stream was being created and destroyed for a kernel launch. The async
deps support allows the kernel launch to be serialized on an existing
stream.
Differential Revision: https://reviews.llvm.org/D123499
NFC. Drop trailing end-of-line whitespace in the GPU async ops' printer
whenever the list of async deps is empty.
Reviewed By: mehdi_amini, rriddle
Differential Revision: https://reviews.llvm.org/D123754
Fold away the gpu.memcpy op when the only uses of its destination are
the memcpy op in question and its allocation and deallocation
ops.
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D121279