Expose loop results, which correspond to the region iter_arg values that
are returned from the loop when there are no more iterations. Exposing
loop results is optional because some loops (e.g., `scf.while`) do not
have a 1-to-1 mapping between region iter_args and op results.
Also add additional helper functions to query tied
results/iter_args/inits.
The `LoopLikeOpInterface` already has interface methods to query inits
and iter_args. This commit adds helper functions to query tied
init/iter_arg pairs and removes the corresponding functions for
`scf::ForOp`.
Add a new interface method that returns the yielded values.
Also add a verifier that checks the number of inits/iter_args/yielded
values. Most of the checked invariants (but not all of them) are already
covered by the `RegionBranchOpInterface`, but the `LoopLikeOpInterface`
now provides (additional) error messages that are easier to read.
The `BufferizableOpInterface` implementation of `scf.for` currently
assumes that an OpResult does not alias with any tensor apart from the
corresponding init OpOperand. Newly allocated buffers (inside of the
loop) are also allowed. The current implementation checks whether the
respective init_arg and yielded value are equivalent. This is overly
strict and causes extra buffer allocations/copies when yielding a new
buffer allocation from a loop.
This commit implements `LoopLikeOpInterface` on `scf.while`. This
enables LICM (and potentially other transforms) on `scf.while`.
`LoopLikeOpInterface::getLoopBody()` is renamed to `getLoopRegions` and
can now return multiple regions.
Also fix a bug in the default implementation of
`LoopLikeOpInterface::isDefinedOutsideOfLoop()`, which returned "false"
for some values that are defined outside of the loop (in a nested op, in
such a way that the value does not dominate the loop). This interface is
currently only used for LICM and there is no way to trigger this bug, so
no test is added.
In line with #66515, change `MutableArrayRange::begin`/`end` to
enumerate `OpOperand &` instead of `Value`. Also remove
`ForOp::getIterOpOperands`/`setIterArg`, which are now redundant.
Note: `MutableOperandRange` cannot be made a derived class of
`indexed_accessor_range_base` (like `OperandRange`), because
`MutableOperandRange::assign` can change the number of operands in the
range.
This commit removes the deallocation capabilities of
one-shot-bufferization. One-shot-bufferization should never deallocate
any memrefs as this should be entirely handled by the
ownership-based-buffer-deallocation pass going forward. This means the
`allow-return-allocs` pass option will default to true now,
`create-deallocs` defaults to false and they, as well as the escape
attribute indicating whether a memref escapes the current region, will
be removed. A new `allow-return-allocs-from-loops` option is added as a
temporary workaround for some bufferization limitations.
This is the first commit in a series with the goal to rework the
BufferDeallocation pass. Currently, this pass heavily relies on copies
to perform correct deallocations, which leads to very slow code and
potentially high memory usage. Additionally, there are unsupported cases
such as returning memrefs which this series of commits aims to add
support for as well.
This first commit removes the deallocation capabilities of
one-shot-bufferization.One-shot-bufferization should never deallocate any
memrefs as this should be entirely handled by the buffer-deallocation pass
going forward. This means the allow-return-allocs pass option will
default to true now, create-deallocs defaults to false and they, as well
as the escape attribute indicating whether a memref escapes the current region,
will be removed.
The documentation should w.r.t. these pass option changes should also be
updated in this commit.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D156662
One-Shot Bufferize correctly handles RaW conflicts around repetitive regions (loops). Specical handling is needed for parallel regions. These are a special kind of repetitive regions that can have additional RaW conflicts that would not be present if the regions would be executed sequentially.
Example:
```
%0 = bufferization.alloc_tensor()
scf.forall ... {
%1 = linalg.fill ins(...) outs(%0)
...
scf.forall.in_parallel {
tensor.parallel_insert_slice %1 into ...
}
}
```
A separate (private) buffer must be allocated for each iteration of the `scf.forall` loop.
This change adds a new interface method to `BufferizableOpInterface` to detect parallel regions. By default, regions are assumed to be sequential.
A buffer is privatized if an OpOperand bufferizes to a memory read inside a parallel region that is different from the parallel region where operand's value is defined.
Differential Revision: https://reviews.llvm.org/D159286
This revision adds support for unstructured control flow to the bufferization infrastructure. In particular: regions with multiple blocks, `cf.br`, `cf.cond_br`.
Two helper templates are added to `BufferizableOpInterface.h`, which can be implemented by ops that supported unstructured control flow in their regions (e.g., `func.func`) and ops that branch to another block (e.g., `cf.br`).
A block signature is always bufferized together with the op that owns the block.
Differential Revision: https://reviews.llvm.org/D158094
`getBufferType` computes the bufferized type of an SSA value without bufferizing any IR. This is useful for predicting the bufferized type of iter_args of a loop.
To avoid endless recursion (e.g., in the case of "scf.for", the type of the iter_arg depends on the type of init_arg and the type of the yielded value; the type of the yielded value depends on the type of the iter_arg again), `fixedTypes` was used to fall back to "fixed" type. A simpler way is to maintain an "invocation stack". `getBufferType` implementations can then inspect the invocation stack to detect repetitive computations (typically when computing the bufferized type of a block argument).
Also improve error messages in case of inconsistent memory spaces inside of a loop.
Differential Revision: https://reviews.llvm.org/D158060
This revision is needed to support bufferization of `cf.br`/`cf.cond_br`. It will also be useful for better analysis of loop ops.
This revision generalizes `getAliasingOpResults` to `getAliasingValues`. An OpOperand can now not only alias with OpResults but also with BlockArguments. In the case of `cf.br` (will be added in a later revision): a `cf.br` operand will alias with the corresponding argument of the destination block.
If an op does not implement the `BufferizableOpInterface`, the analysis in conservative. It previously assumed that an OpOperand may alias with each OpResult. It now assumes that an OpOperand may alias with each OpResult and each BlockArgument of the entry block.
Differential Revision: https://reviews.llvm.org/D157957
Add two new helper functions `getBeforeBody` and `getAfterBody` to be consistent with "scf.for" (`getBody`) and to show in the API that both regions have exactly one block. Also simplify some code that assumed that there can be more than one block in a region.
Differential Revision: https://reviews.llvm.org/D157860
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This patch updates all remaining uses of the deprecated functionality in
mlir/. This was done with clang-tidy as described below and further
modifications to GPUBase.td and OpenMPOpsInterfaces.td.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
```
Differential Revision: https://reviews.llvm.org/D151542
Block arguments and yielded values are not equivalent if there are not enough block arguments. This fixes#59442.
Differential Revision: https://reviews.llvm.org/D145575
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.
Caveats include:
- This clang-tidy script probably has more problems.
- This only touches C++ code, so nothing that is being generated.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This first patch was created with the following steps. The intention is
to only do automated changes at first, so I waste less time if it's
reverted, and so the first mass change is more clear as an example to
other teams that will need to follow similar steps.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
4. Some changes have been deleted for the following reasons:
- Some files had a variable also named cast
- Some files had not included a header file that defines the cast
functions
- Some files are definitions of the classes that have the casting
methods, so the code still refers to the method instead of the
function without adding a prefix or removing the method declaration
at the same time.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\
mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\
mlir/lib/**/IR/\
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\
mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\
mlir/test/lib/Dialect/Test/TestTypes.cpp\
mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\
mlir/test/lib/Dialect/Test/TestAttributes.cpp\
mlir/unittests/TableGen/EnumsGenTest.cpp\
mlir/test/python/lib/PythonTestCAPI.cpp\
mlir/include/mlir/IR/
```
Differential Revision: https://reviews.llvm.org/D150123
Without this bufferization cannot track operations removed during bufferization.
Unfortunately there is currently no way to enforce that ops need to be erased through
the rewriter and this causes sporadic errors when tracking pointers in Bufferization pass.
Therefore there is no easy way to test that the pattern is doing the right thing.
Reviewed By: mravishankar
Differential Revision: https://reviews.llvm.org/D147095
`getAliasingOpOperands`/`getAliasingOpResults` now encodes OpOperand/OpResult, buffer relation and a degree of certainty. E.g.:
```
// aliasingOpOperands(%r) = {(%t, EQUIV, DEFINITE)}
// aliasingOpResults(%t) = {(%r, EQUIV, DEFINITE)}
%r = tensor.insert %f into %t[%idx] : tensor<?xf32>
// aliasingOpOperands(%r) = {(%t0, EQUIV, MAYBE), (%t1, EQUIV, MAYBE)}
// aliasingOpResults(%t0) = {(%r, EQUIV, MAYBE)}
// aliasingOpResults(%t1) = {(%r, EQUIV, MAYBE)}
%r = arith.select %c, %t0, %t1 : tensor<?xf32>
```
`BufferizableOpInterface::bufferRelation` is removed, as it is now part of `getAliasingOpOperands`/`getAliasingOpResults`.
This change allows for better analysis, in particular wrt. equivalence. This allows additional optimizations and better error checking (which is sometimes overly conservative). Examples:
* EmptyTensorElimination can eliminate `tensor.empty` inside `scf.if` blocks. This requires a modeling of equivalence: It is not a per-OpResult property anymore. Instead, it can be specified for each OpOperand and OpResult. This is important because `tensor.empty` may be eliminated only if all values on the SSA use-def chain to the final consumer (`tensor.insert_slice`) are equivalent.
* The detection of "returning allocs from a block" can be improved. (Addresses a TODO in `assertNoAllocsReturned`.) This allows us to bufferize IR such as "yielding a `tensor.extract_slice` result from an `scf.if` branch", which currently fails to bufferize because the alloc detection is too conservative.
* Better bufferization of loops. Aliases of the iter_arg can be yielded (even if they are not equivalent) without having to realloc and copy the entire buffer on each iteration.
The above-mentioned examples are not yet implemented with this change. This change just improves the BufferizableOpInterface, its implementations and related helper functions, so that better aliasing information is available for each op.
Differential Revision: https://reviews.llvm.org/D142129
* `getAliasingOpOperand` => `getAliasingOpOperands`
* `getAliasingOpResult` => `getAliasingOpResults`
Also a few minor code cleanups and better documentation.
Differential Revision: https://reviews.llvm.org/D142979
The previous name was incorrect. `None` does not mean that there is no buffer relation between two buffers (seems to imply that they do not alias for sure); instead it means that there is no further information available.
Differential Revision: https://reviews.llvm.org/D142870
The name of the method was confusing. It is bufferizesToMemoryWrite, but from the perspective of OpResults.
`bufferizesToMemoryWrite(OpResult)` now supports ops with regions that do not have aliasing OpOperands (such as `scf.if`). These ops no longer need to implement `isMemoryWrite`.
Differential Revision: https://reviews.llvm.org/D141684
This is part of an effort to migrate from llvm::Optional to
std::optional. This patch changes the way mlir-tblgen generates .inc
files, and modifies tests and documentation appropriately. It is a "no
compromises" patch, and doesn't leave the user with an unpleasant mix of
llvm::Optional and std::optional.
A non-trivial change has been made to ControlFlowInterfaces to split one
constructor into two, relating to a build failure on Windows.
See also: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
Signed-off-by: Ramkumar Ramachandra <r@artagnon.com>
Differential Revision: https://reviews.llvm.org/D138934
MemRef has been accepting a general Attribute as memory space for
a long time. This commits updates bufferization side to catch up,
which allows downstream users to plugin customized symbolic memory
space. This also eliminates quite a few `getMemorySpaceAsInt`
calls, which is deprecated.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D138330
`scf.foreach_thread` defines mapping its loops to processors via an integer array, see an example below. A lowering can use this mapping. However, expressing mapping as an integer array is very confusing, especially when there are multiple levels of parallelism. In addition, the op does not verify the integer array. This change introduces device mapping attribute to make mapping descriptive and verifiable. Then it makes GPU transform dialect use it.
```
scf.foreach_thread (%i, %j) in (%c1, %c2) {
scf.foreach_thread (%i2, %j2) in (%c1, %c2)
{...} { thread_dim_mapping = [0, 1]}
} { thread_dim_mapping = [0, 1]}
```
It first introduces a `DeviceMappingInterface` which is an attribute interface. `scf.foreach_thread` defines its mapping via this interface. A lowering must define its attributes and implement this interface as well. This way gives us a clear validation.
The change also introduces two new attributes (`#gpu.thread<x/y/z>` and `#gpu.block<x,y,z>` ). After this change, the above code prints as below, as seen here, this way clarifies the loop mappings. The change also implements consuming of these two new attribute by the transform dialect. Transform dialect binds the outermost loops to the thread blocks and innermost loops to threads.
```
scf.foreach_thread (%i, %j) in (%c1, %c2) {
scf.foreach_thread (%i2, %j2) in (%c1, %c2)
{...} { thread_dim_mapping = [#gpu.thread<x>, #gpu.thread<y>]}
} { thread_dim_mapping = [#gpu.block<x>, #gpu.block<y>]}
```
Reviewed By: ftynse, nicolasvasilache
Differential Revision: https://reviews.llvm.org/D137413
There was a bug in scf.for loop bufferization that could lead to a missing buffer copy (alloc was there, but not the copy).
Differential Revision: https://reviews.llvm.org/D135053
Currently, if the `before` and `after` regions of a while op have
tensor args in different indices, this leads to a crash.
This moves the pass-through check for args to the handling of the
condition block, since that is where the results are produced, so
it's also where copies must be made.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D133477
Currently, one-shot-bufferize crashes as soon as there's
a mixture of tensor and non-tensor arguments. This seems
to happen for no good reason.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D133419
This change refines the semantics of scf.foreach_thread. Tensors that are inserted into in the terminator must now be passed to the region explicitly via `shared_outs`. Inside of the body of the op, those tensors are then accessed via block arguments.
The body of a scf.foreach_thread is now treated as a repetitive region. I.e., op dominance can no longer be used in conflict detection when using a value that is defined outside of the body. Such uses may now be considered as conflicts (if there is at least one read and one write in the body), effectively privatizing the tensor. Shared outputs are not privatized when they are used via their corresponding block arguments.
As part of this change, it was also necessary to update the "tiling to scf.foreach_thread", such that the generated tensor.extract_slice ops use the scf.foreach_thread's block arguments. This is implemented by cloning the TilingInterface op inside the scf.foreach_thread, rewriting all of its outputs with block arguments and then calling the tiling implementation. Afterwards, the cloned op is deleted again.
Differential Revision: https://reviews.llvm.org/D133114
A part of the functionality of `bufferize` is extracted into `getBufferType`. Also, bufferized scf.yields inside scf.if are now created with the correct bufferized type from the get-to.
Differential Revision: https://reviews.llvm.org/D132862
Even though iter_arg and init_arg of an scf.for loop may have the same tensor type, their bufferized memref types are not necessarily equal. It is sometimes necessary to insert a cast in case of differing layout maps.
Differential Revision: https://reviews.llvm.org/D132860
This change generalizes getBufferType. This function can be used to predict the buffer type of any tensor value (not just BlockArguments) without changing any IR. It also subsumes getMemorySpace. This is useful for loop bufferization, where the precise buffer type of an iter_arg cannot be known without examining the loop body.
Differential Revision: https://reviews.llvm.org/D132859
This is moslty NFC and will allow tensor.parallel_insert_slice to gain
rank-reducing semantics by reusing the vast majority of the tensor.insert_slice impl.
Depends on D128857
Differential Revision: https://reviews.llvm.org/D128920
This allows purging references of scf.ForeachThreadOp and scf.PerformConcurrentlyOp from
ParallelInsertSliceOp.
This will allowmoving the op closer to tensor::InsertSliceOp with which it should share much more
code.
In the future, the decoupling will also allow extending the type of ops that can be used in the
parallel combinator as well as semantics related to multiple concurrent inserts to the same
result.
Differential Revision: https://reviews.llvm.org/D128857
This was previous implemented as part of the BufferizableOpInterface of ForEachThreadOp. Moving the implementation to ParallelInsertSliceOp to be consistent with the remaining ops and to have a nice example op that can serve as a blueprint for other ops.
Differential Revision: https://reviews.llvm.org/D128666
This change updates all remaining bufferization patterns (except for scf.while) and the remaining bufferization infrastructure to infer the memory space whenever possible instead of falling back to "0". (If a default memory space is set in the bufferization options, we still fall back to that value if the memory space could not be inferred.)
Differential Revision: https://reviews.llvm.org/D128423
Add a failure return value and bufferization options argument. This is to keep a subsequent change smaller.
Differential Revision: https://reviews.llvm.org/D128278
An optional thread_dim_mapping index array attribute specifies for each
virtual thread dimension, how it remaps 1-1 to a set of concrete processing
element resources (e.g. a CUDA grid dimension or a level of concrete nested
async parallelism). At this time, the specification is backend-dependent and
is not verified by the op, beyond being an index array attribute.
It is the reponsibility of the lowering to interpret the index array in the
context of the concrete target the op is lowered to, or to ignore it when
the specification is ill-formed or unsupported for a particular target.
Differential Revision: https://reviews.llvm.org/D128633