This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
Support distribution of `vector.transfer_read` ops when operands are
defined inside of the region of `warp_execute_on_lane_0` (except for the
buffer from which the op is reading).
Such IR was previously not supported. This commit changes the
implementation such that indices and the padding value are also
distributed.
This commit simplifies the implementation considerably: the original
implementation created a new `transfer_read` op and then checked if this
new op is valid. If not, the rewrite pattern failed. This was a bit
hacky. It was also a violation of the rewrite pattern API (detected by
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`) because the IR was modified,
but the pattern returned "failure".
Add a configuration option to allow vector distribution with multiple
elements written by a single lane.
This is so that we can perform vector multi-reduction with multiple
results per workgroup.
The primary difficulty with distribution of masked transfers is when the
permutation map permutes the vector, in which case the distribution
logic needs to make sure the correct mask elements end up with the
distributed transfer. This is only tricky when the permutation map has a
permutation in it, so we can relax the condition for distribution.
A number of the warp distribution patterns work by rewriting a warp op
in place by moving a contained op outside. This notifies the rewriter
that the warp op is changing in this case.
This is the last step needed for basic support for distributing masked
vector code. The lane id gets delinearized based on the distributed mask
shape and then compared against the original mask sizes to compute the
bounds for the distributed mask. Note that the distribution of masks is
implicit on the shape specified by the warp op. As a result, it is the
responsibility of the consumer of the mask to ensure the distributed
mask will match its own distribution semantics.
Currently when there is a mix of transfer read ops and transfer write
ops that need to be distributed, because the pattern for write
distribution is rooted on the transfer write, it is hard to guarantee
that the write gets distributed after the read when the two aren't
directly connected by SSA. This is likely still relatively unsafe when
there are undistributable ops, but structurally these patterns are a bit
difficult to work with. For now pattern benefits give fairly good
guarantees for happy paths.
This fixes two bugs:
1) When deciding whether a transfer read could be propagated out of
a warp op, it looked for the first yield operand that was produced by
a transfer read. If this transfer read wasn't ready to be
distributed, the pattern would not re-check for any other transfer
reads that could have been propagated.
2) When dropping dead warp results, we do so by updating the warp op
signature and splicing in the old region. This does not add the ops
in the body of the warp op back to the pattern applicator's worklist,
and thus those operations won't be DCE'd. This is a problem for
patterns like the one for transfer reads that will still see the dead
operation as a user.
Because the distribution is based on types, supporting general masked
reads requires first materializing the permutation map in IR to align
the elements of the mask with the elements read by the transfer op. For
now just support cases with the trivial permutation map.
General distribution of masked writes requires materializing the permutation on the vector of the write in IR to ensure the vector lines up with the mask. For now just support cases with trivial permutation maps.
After propagation of `vector.warp_execute_on_lane_0` through `scf.for`,
uniform operations like those on the loop iterators can now be hoisted
out of the inner warp op.
The vector.extract assembly format currently only contains the source
type, for example:
%1 = vector.extract %0[1] : vector<3x7x8xf32>
it's not immediately obvious if this is the source or result type. This
patch improves the assembly format to make this clearer, so the above
becomes:
%1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>
* Always use the auto-generated `getInitArgs` function. Remove the
hand-written `getInitOperands` duplicate.
* Remove `hasIterOperands` and `getNumIterOperands`. The names were
inconsistent because the "arg" is called `initArgs` in TableGen. Use
`getInitArgs().size()` instead.
* Fix verification around ops with no results.
If the original shape and the distributed shape is the same,
we don't distribute at all--every thread is handling the whole.
Reviewed By: hanchung
Differential Revision: https://reviews.llvm.org/D158235
This commit starts enabling vector distruction over multiple
dimensions. It requires delinearize the lane ID to match the
expected rank. shape_cast and transfer_read now can properly
handle multiple dimensions.
Reviewed By: hanchung
Differential Revision: https://reviews.llvm.org/D157931
`DenseI64ArrayAttr` provides a better API than `I64ArrayAttr`. E.g., accessors returning `ArrayRef<int64_t>` (instead of `ArrayAttr`) are generated.
Differential Revision: https://reviews.llvm.org/D156684
In the vector distribute patterns, we used to move
`vector.broadcast`s out of `vector.warp_execute_on_lane0`s
irrespectively of how they were defined.
This could create broadcast operations with invalid semantic.
E.g.,
```
%r = warop ...[32] ... -> vector<1x2xf32> {
%val = broadcast %in : vector<64xf32> to vetor<1x64xf32>
vector.yield %val : vector<1x64xf32>
}
```
=>
```
%r = warop ...[32] ... -> vector<64xf32> {
vector.yield %in : vector<64xf32>
}
// Broadcasting to a narrower type!
broadcast %r : vector<64xf32> to vector<1x2xf32>
```
The root issue is we are trying to broadcast something that is not the same
for each thread, so there is actually nothing to propagate here.
The fix checks that the broadcast we want to create actually makes sense.
Differential Revision: https://reviews.llvm.org/D152154
In the vector distribute patterns, we used to move
`vector.transfer_read`s out of `vector.warp_execute_on_lane0`s
irrespectively of how they were defined.
This could create transfer_read operations that would read values from
within the warpOp's body from outside of the body.
E.g.,
```
warpop {
%defined_in_body
%read = transfer_read %defined_in_body
vector.yield %read
}
```
=>
```
warpop {
%defined_in_body
vector.yield ...
}
// %defined_in_body is referenced outside of its scope.
%read = transfer_read %defined_in_body
```
The fix consists in checking that all the values feeding the new
`transfer_read` are defined outside of warpOp's body.
Note: We could do this check before creating any operation, but that would
mean knowing what `affine::makeComposedAffineApply` actually do. So the
current fix is a trade off of coupling the implementations of this
propagation and `makeComposedAffineApply` versus compile time.
Differential Revision: https://reviews.llvm.org/D152149
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.
Caveats include:
- This clang-tidy script probably has more problems.
- This only touches C++ code, so nothing that is being generated.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This first patch was created with the following steps. The intention is
to only do automated changes at first, so I waste less time if it's
reverted, and so the first mass change is more clear as an example to
other teams that will need to follow similar steps.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
4. Some changes have been deleted for the following reasons:
- Some files had a variable also named cast
- Some files had not included a header file that defines the cast
functions
- Some files are definitions of the classes that have the casting
methods, so the code still refers to the method instead of the
function without adding a prefix or removing the method declaration
at the same time.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\
mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\
mlir/lib/**/IR/\
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\
mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\
mlir/test/lib/Dialect/Test/TestTypes.cpp\
mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\
mlir/test/lib/Dialect/Test/TestAttributes.cpp\
mlir/unittests/TableGen/EnumsGenTest.cpp\
mlir/test/python/lib/PythonTestCAPI.cpp\
mlir/include/mlir/IR/
```
Differential Revision: https://reviews.llvm.org/D150123
Currently conversions to interfaces may happen implicitly (e.g.
`Attribute -> TypedAttr`), failing a runtime assert if the interface
isn't actually implemented. This change marks the `Interface(ValueT)`
constructor as explicit so that a cast is required.
Where it was straightforward to I adjusted code to not require casts,
otherwise I just made them explicit.
Depends on D148491, D148492
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D148493
Replace references to enumerate results with either result_pairs
(reference wrapper type) or structured bindings. I did not use
structured bindings everywhere as it wasn't clear to me it would
improve readability.
This is in preparation to the switch to zip semantics which won't
support non-const lvalue reference to elements:
https://reviews.llvm.org/D144503.
I chose to use values instead of const lvalue-refs because MLIR is
biased towards avoiding `const` local variables. This won't degrade
performance because currently `result_pair` is cheap to copy (size_t
+ iterator), and in the future, the enumerator iterator dereference
will return temporaries anyway.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D146006
Plain `getVectorType()` can be quite confusing and error-prone
given that, well, vector ops always work on vector types, and
it can commonly involve both source and result vectors. So this
commit makes various such accessor methods to be explicit w.r.t.
source or result vectors.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D144159
We should distribute ops that have other uses than the yield op as this
would duplicate those ops.
Differential Revision: https://reviews.llvm.org/D143629
Instead, use the builder and infer the return type based on the inner `yield` ops.
Also, fix uses that do not create the terminator as required for the callback builders.
Differential Revision: https://reviews.llvm.org/D142056
Prevent creating a vector of size 0 that would fail verifier.
Vector 1d with a single element should be treated like 0d vectors.
Differential Revision: https://reviews.llvm.org/D141452
This patch fixes:
mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp:947:13:
error: variable 'distributedDim' set but not used
[-Werror,-Wunused-but-set-variable]
In case the distributed dim of the dest vector is also a dim of the src vector, each lane inserts a smaller part of the source vector. Otherwise, one lane inserts the entire src vector and the other lanes do nothing.
Differential Revision: https://reviews.llvm.org/D137953
In case of a distribution, only one lane inserts the scalar value. In case of a broadcast, every lane inserts the scalar.
Differential Revision: https://reviews.llvm.org/D137929
Ops such as `%1 = vector.extract %0[2] : vector<5x96xf32>`.
Distribute the source vector, then extract. In case of a 1d extract, rewrite to vector.extractelement.
Differential Revision: https://reviews.llvm.org/D137646
Relax unnecessary restriction when distribution a vector.reduce op.
All the float and integer types can be supported by user's lambda.
Differential Revision: https://reviews.llvm.org/D141094
The methods in `SideEffectUtils.h` (and their implementations in
`SideEffectUtils.cpp`) seem to have similar intent to methods already
existing in `SideEffectInterfaces.h`. Move the decleration (and
implementation) from `SideEffectUtils.h` (and `SideEffectUtils.cpp`)
into `SideEffectInterfaces.h` (and `SideEffectInterface.cpp`).
Also drop the `SideEffectInterface::hasNoEffect` method in favor of
`mlir::isMemoryEffectFree` which actually recurses into the operation
instead of just relying on the `hasRecursiveMemoryEffectTrait`
exclusively.
Differential Revision: https://reviews.llvm.org/D137857
Ops such as `%1 = vector.extractelement %0[%pos : index] : vector<96xf32>`.
In case of an extract from a 1D vector, the source vector is distributed. The lane into which the requested position falls, extracts the element and shuffles it to all other lanes.
Differential Revision: https://reviews.llvm.org/D137336
Quantization method is crucial and ubiqutous in accelerating machine
learning workloads. Most of these methods uses f16 and i8 types.
This patch relaxes the type contraints on warp reduce distribution to
allow these types. Furthermore, this patch also changed the interface
and moved the initial reduction of data to a single thread into the
distributedReductionFn, this gives flexibility for developers to control
how they are obtaining the initial lane value, which might differ based
on the input types. (i.e to shuffle 32-width type, we need to reduce f16
to 2xf16 types rather than a single element).
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D137691
When a value used in the forOp is defined outside the region but within
the parent warpOp we need to return and distribute the value to pass it
to new operations created within the loop.
Also simplify the lambda interface.
Differential Revision: https://reviews.llvm.org/D137146
This patch takes the first step towards a more principled modeling of undefined behavior in MLIR as discussed in the following discourse threads:
1. https://discourse.llvm.org/t/semantics-modeling-undefined-behavior-and-side-effects/4812
2. https://discourse.llvm.org/t/rfc-mark-tensor-dim-and-memref-dim-as-side-effecting/65729
This patch in particular does the following:
1. Introduces a ConditionallySpeculatable OpInterface that dynamically determines whether an Operation can be speculated.
2. Re-defines `NoSideEffect` to allow undefined behavior, making it necessary but not sufficient for speculation. Also renames it to `NoMemoryEffect`.
3. Makes LICM respect the above semantics.
4. Changes all ops tagged with `NoSideEffect` today to additionally implement ConditionallySpeculatable and mark themselves as always speculatable. This combined trait is named `Pure`. This makes this change NFC.
For out of tree dialects:
1. Replace `NoSideEffect` with `Pure` if the operation does not have any memory effects, undefined behavior or infinite loops.
2. Replace `NoSideEffect` with `NoSideEffect` otherwise.
The next steps in this process are (I'm proposing to do these in upcoming patches):
1. Update operations like `tensor.dim`, `memref.dim`, `scf.for`, `affine.for` to implement a correct hook for `ConditionallySpeculatable`. I'm also happy to update ops in other dialects if the respective dialect owners would like to and can give me some pointers.
2. Update other passes that speculate operations to consult `ConditionallySpeculatable` in addition to `NoMemoryEffect`. I could not find any other than LICM on a quick skim, but I could have missed some.
3. Add some documentation / FAQs detailing the differences between side effects, undefined behavior, speculatabilty.
Reviewed By: rriddle, mehdi_amini
Differential Revision: https://reviews.llvm.org/D135505
Simplify the lowering of warp_execute_on_lane0 of scf.if by making the
logic more generic. Also remove the assumption that the most inner
dimension is the dimension distributed.
Differential Revision: https://reviews.llvm.org/D133826
This revision significantly improves and tests the broadcast behavior of vector.warp_execute_on_lane_0.
Previously, the implementation of the broadcast behavior of vector.warp_execute_on_lane_0
assumed that the broadcasted value was always of scalar type.
This is not necessarily the case.
Differential Revision: https://reviews.llvm.org/D133767