As part of this extension this change also does some general cleanup
1) Make all the methods take `RewriterBase` as arguments instead of
creating their own builders that tend to crash when used within
pattern rewrites
2) Split `coalesePerfectlyNestedLoops` into two separate methods, one
for `scf.for` and other for `affine.for`. The templatization didnt
seem to be buying much there.
Also general clean up of tests.
Composite pass allows to run sequence of passes in the loop until fixed
point or maximum number of iterations is reached. The usual candidates
are canonicalize+CSE as canonicalize can open more opportunities for CSE
and vice-versa.
Adds a new pass option `add-result-attr` that will make the pass add the
attribute `{bufferize.result}` to each argument that was converted from
a result.
This is important e.g. when later using the python bindings / execution
engine to understand which arguments are actually results.
To be able to test this, the pass option was added to the tablegen. To
avoid collisions with the existing, manually defined option struct
`BufferResultsToOutParamsOptions`, that one was renamed to
`BufferResultsToOutParamsOpts`.
Discussion at https://discourse.llvm.org/t/inliner-cost-model/2992
This change adds a callback that reports whether inlining
of the particular call site (communicated via ResolvedCall argument)
is profitable or not. The default MLIR inliner pass behavior
is unchanged, i.e. the callback always returns true.
This callback may be used to customize the inliner behavior
based on the target specifics (like target instructions costs),
profitability of the inlining for further optimizations
(e.g. if inlining may enable loop optimizations or scalar optimizations
due to object shape propagation), optimization levels (e.g. -Os inlining
may be quite different from -Ofast inlining), etc.
One of the questions is whether the ResolvedCall entity represents
enough of the context for the custom inlining models to come up with
the profitability decision. I think we can start with this and
extend it as necessary.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
This commit adds listener support to the dialect conversion. Similarly
to the greedy pattern rewrite driver, an optional listener can be
specified in the configuration object.
Listeners are notified only if the dialect conversion succeeds. In case
of a failure, where some IR changes are first performed and then rolled
back, no notifications are sent.
Due to the fact that some kinds of rewrite are reflected in the IR
immediately and some in a delayed fashion, there are certain limitations
when attaching a listener; these are documented in `ConversionConfig`.
To summarize, users are always notified about all rewrites that
happened, but the notifications are sent all at once at the very end,
and not interleaved with the actual IR changes.
This change is in preparation improvements to
`transform.apply_conversion_patterns`, which currently invalidates all
handles. In the future, it can use a listener to update handles
accordingly, similar to `transform.apply_patterns`.
#66771 introduce `llvm::post_order(&r.front())` which is equal to
`r.front().getSuccessor(...)`.
It will visit the succ block of current block. But actually here need to
visit all block of region in reverse order.
Fixes: #77420.
This commit fixes a bug in a dialect conversion. Currently, when a block
is replaced via a signature conversion, the block is erased during the
"commit" phase. This is problematic because the block arguments may
still be referenced internal data structures of the dialect conversion
(`mapping`). Blocks should be treated same as ops: they should be erased
during the "cleanup" phase.
Note: The test case fails without this fix when running with ASAN, but
may pass when running without ASAN.
I believe the semantics should be the same, but this saves 1 op and simplifies the code.
For example, the following two instructions:
```
%2 = cmp sgt %0, %1
%3 = select %2, %0, %1
```
Are equivalent to:
```
%2 = maxsi %0 %1
```
The dialect conversion rolls back in-place op modifications upon
failure. Rolling back modifications of attributes is already supported,
but there was no support for properties until now.
Rename listener callback names:
* `notifyOperationRemoved` -> `notifyOperationErased`
* `notifyBlockRemoved` -> `notifyBlockErased`
The current callback names are misnomers. The callbacks are triggered
when an operation/block is erased, not when it is removed (unlinked).
E.g.:
```c++
/// Notify the listener that the specified operation is about to be erased.
/// At this point, the operation has zero uses.
///
/// Note: This notification is not triggered when unlinking an operation.
virtual void notifyOperationErased(Operation *op) {}
```
This change is in preparation of adding listener support to the dialect
conversion. The dialect conversion internally unlinks IR before erasing
it at a later point of time. There is an important difference between
"remove" and "erase". Lister callback names should be accurate to avoid
confusion.
Add a new rewrite class for "operation movements". This rewrite class
can roll back `moveOpBefore` and `moveOpAfter`.
`RewriterBase::moveOpBefore` and `RewriterBase::moveOpAfter` is no
longer virtual. (The dialect conversion can gather all required
information for rollbacks from listener notifications.)
Following the discussion in
https://discourse.llvm.org/t/symboltable-and-symbol-parent-child-relationship/75446,
we should enforce that a symbol's immediate parent is a symbol table.
I changed some tests to pass the verification. In most cases, we can
wrap the func with a module, change the func to another op with regions
i.e. scf.if, or change the expected error message.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Similar to `OpBuilder::clone`, operation/block insertion notifications
should be sent when cloning the contents of a region. E.g., this is to
ensure that the newly created operations are put on the worklist of the
greedy pattern rewriter driver.
Also move `cloneRegionBefore` from `RewriterBase` to `OpBuilder`. It
only creates new IR, so it should be part of the builder API (like
`clone(Operation &)`). The function does not have to be virtual. Now
that notifications are properly sent, the override in the dialect
conversion is no longer needed.
When a block is split with `RewriterBase::splitBlock`, a
`notifyBlockInserted` notification, followed by
`notifyOperationInserted` notifications (for moving over the operations
into the new block) should be sent. This commit adds those
notifications.
When a block is inlined into another block, the nested operations are
moved into another block and the `notifyOperationInserted` callback
should be triggered. This commit adds the missing notifications for:
* `RewriterBase::inlineBlockBefore`
* `RewriterBase::mergeBlocks`
This commit adds a new method to the rewriter API: `moveBlockBefore`.
This op is utilized by `inlineRegionBefore` and covered by dialect
conversion test cases.
Also fixes a bug in `moveOpBefore`, where the previous op location was
not passed correctly. Adds a test case to
`test-strict-pattern-driver.mlir`.
Make GreedyPatternRewriteDriver handle failures of `materializeConstant`
gracefully. Previously it was not checking whether the returned op was
null and crashing. This PR handles it similarly to how OperationFolder
does it.
Follow up to the discussion from #75258, and serves as an alternate
solution for #74670.
Set the location to Unknown for deduplicated / moved / materialized
constants by OperationFolder. This makes sure that the folded constants
don't end up with an arbitrary location of one of the original ops that
became it, and that hoisted ops don't confuse the stepping order.
This commit adds extra assertions to `OperationFolder` and `OpBuilder`
to ensure that the types of the folded SSA values match with the result
types of the op. There used to be checks that discard the folded results
if the types do not match. This commit makes these checks stricter and
turns them into assertions.
Discarding folded results with the wrong type (without failing
explicitly) can hide bugs in op folders. Two such bugs became apparent
in MLIR (and some more in downstream projects) and are fixed with this
change.
Note: The existing type checks were introduced in
https://reviews.llvm.org/D95991.
Migration guide: If you see failing assertions (`folder produced value
of incorrect type`; make sure to run with assertions enabled!), run with
`-debug` or dump the operation right before the failing assertion. This
will point you to the op that has the broken folder. A common mistake is
a mismatch between static/dynamic dimensions (e.g., input has a static
dimension but folded result has a dynamic dimension).
This commit makes reductions part of the terminator. Instead of
`scf.yield`, `scf.reduce` now terminates the body of `scf.parallel` ops.
`scf.reduce` may contain an arbitrary number of reductions, with one
region per reduction.
Example:
```mlir
%init = arith.constant 0.0 : f32
%r:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init, %init)
-> f32, f32 {
%elem_to_reduce1 = load %buffer1[%iv] : memref<100xf32>
%elem_to_reduce2 = load %buffer2[%iv] : memref<100xf32>
scf.reduce(%elem_to_reduce1, %elem_to_reduce2 : f32, f32) {
^bb0(%lhs : f32, %rhs: f32):
%res = arith.addf %lhs, %rhs : f32
scf.reduce.return %res : f32
}, {
^bb0(%lhs : f32, %rhs: f32):
%res = arith.mulf %lhs, %rhs : f32
scf.reduce.return %res : f32
}
}
```
`scf.reduce` operations can no longer be interleaved with other ops in
the body of `scf.parallel`. This simplifies the op and makes it possible
to assign the `RecursiveMemoryEffects` trait to `scf.reduce`. (This was
not possible before because the op was not a terminator, causing the op
to be DCE'd.)
This reverts commit 87e2e89019.
and its follow-ups 0d1490f09f (#75218)
and 6fe3cd5467 (#75312).
We observed significant OOM/timeout issues due to #74670 to quite a few
services including google-research/swirl-lm. The follow-up #75218 and
#75312 do not address the issue. Perhaps this is worth more
investigation.
[PR 74670](https://github.com/llvm/llvm-project/pull/74670) added
support for merging locations at constant folding time. We have
discovered that in some cases, the number of locations grows so big as
to cause a compilation process to OOM. In that case, many of the
locations end up appearing several times in nested fused locations.
We add here a helper that always flattens fused locations in order to
eliminate duplicates in the case of nested fused locations.
When merging constants by the operation folder, the location of the op
that remains should be updated to track the new meaning of this op. This
way we do not lose track of all possible source locations that the
constant op came from, and the final location of the op is less reliant
on the order of folding. This will also help debuggers understand how to
step these instructions.
This PR introduces a helper for operation folder to fuse another
location into the location of an op. When an op is deduplicated, fuse
the location of the op to be removed into the op that is retained. The
retained op now represents both original ops.
The FusedLoc will have a string metadata to help understand the reason
for the location fusion (motivated by the
[example](71be8f3c23/mlir/include/mlir/IR/BuiltinLocationAttributes.td (L130))
in the docstring of FusedLoc).
- Implement `SubsetOpInterface`, `SubsetExtractionOpInterface`,
`SubsetInsertionOpInterface` for `vector.transfer_read` and
`vector.transfer_write`.
- Move all tensor subset hoisting test cases from `Linalg` to
`loop-invariant-subset-hoisting.mlir`. (Removing 1 duplicate test case.)
The majority of subset ops operate on hyperrectangular subsets. This
commit adds a new optional interface method
(`getAccessedHyperrectangularSlice`) that can be implemented by such
subset ops. If implemented, the other `operatesOn...` interface methods
of the `SubsetOpInterface` do not have to be implemented anymore.
The comparison logic for hyperrectangular subsets (is
disjoint/equivalent) is implemented with `ValueBoundsOpInterface`. This
makes the subset hoisting more powerful: simple cases where two
different SSA values always have the same runtime value can now be
supported.
Improve the bypass analysis for loop-like ops. Until now, loop-like ops
were treated like any other non-subset ops: they prevent hoisting of any
sort because the analysis does not know which parts of a tensor init
operand are accessed by the loop-like op. With this change, the analysis
can look into loop-like ops and analyze which subset they are operating
on.
Add a loop-invariant subset hoisting pass to `mlir/Interfaces`. This
pass hoist loop-invariant tensor subsets (subset extraction and subset
insertion ops) from loop-like ops. Extraction ops are moved before the
loop. Insertion ops are moved after the loop. The loop body operates on
newly added region iter_args (one per extraction-insertion pair).
This new pass will be improved in subsequent commits (to support more
cases/ops) and will eventually replace
`Linalg/Transforms/SubsetHoisting.cpp`. In contrast to the existing
Linalg subset hoisting, the new pass is op interface-based
(`SubsetOpInterface` and `LoopLikeOpInterface`).
`affine::replaceForOpWithNewYields` and `replaceLoopWithNewYields` (for
"scf.for") are now interface methods and additional loop-carried
variables can now be added to "scf.for"/"affine.for" uniformly. (No more
`TypeSwitch` needed.)
Note: `scf.while` and other loops with loop-carried variables can
implement `replaceWithAdditionalYields`, but to keep this commit small,
that is not done in this commit.
This revision replaces the LLVM dialect NullOp by the recently
introduced ZeroOp. The ZeroOp is more generic in the sense that it
represents zero values of any LLVM type rather than null pointers only.
This is a follow to https://github.com/llvm/llvm-project/pull/65508
When cloning an op, the `notifyOperationInserted` callback is triggered
for all nested ops. Similarly, the `notifyOperationRemoved` callback
should be triggered for all nested ops when removing an op.
Listeners may inspect the IR during a `notifyOperationRemoved` callback.
Therefore, when multiple ops are removed in a single
`RewriterBase::eraseOp` call, the notifications must be triggered in an
order in which the ops could have been removed one-by-one:
* Op removals must be interleaved with `notifyOperationRemoved`
callbacks. A callback is triggered right before the respective op is
removed.
* Ops are removed post-order and in reverse order. Other traversal
orders could delete an op that still has uses. (This is not avoidable in
graph regions and with cyclic block graphs.)
Differential Revision: Imported from https://reviews.llvm.org/D144193.
This commit implements `LoopLikeOpInterface` on `scf.while`. This
enables LICM (and potentially other transforms) on `scf.while`.
`LoopLikeOpInterface::getLoopBody()` is renamed to `getLoopRegions` and
can now return multiple regions.
Also fix a bug in the default implementation of
`LoopLikeOpInterface::isDefinedOutsideOfLoop()`, which returned "false"
for some values that are defined outside of the loop (in a nested op, in
such a way that the value does not dominate the loop). This interface is
currently only used for LICM and there is no way to trigger this bug, so
no test is added.
This trait is needed so that unstructured control flow is not inlined into "scf.while" ops.
Note: The two regions of "scf.while" are already defined as `SizedRegion<1>`. `SingleBlock` can be queried from C++, `SizedRegion<n>` not.
Fixes#64976.
Differential Revision: https://reviews.llvm.org/D159199
Do not inline IR with multiple blocks into ops that may not support unstructured control flow.
This fixes#64978.
Differential Revision: https://reviews.llvm.org/D159072
This commit fixes an error in the `RemoveDeadValues` pass that is
associated with its incorrect usage of the `cloneInto()` function.
The `setOperands()` function that is used by the `cloneInto()` function
requires all operands to not be null. But, that is not possible in this
pass because we drop uses of dead values, thus making them null. It is
only at the end of the pass that we are assured that such null values
won't exist but during the execution of the pass, there could be null
values.
To fix this, we replace the usage of the `cloneInto()` function to copy
a region with `moveBlock()` to move each block of the region one by one.
This function does not require the presence of non-null values and is
thus the right choice here. This implementation is also more opttimized
because we are moving things instead of copying them. The goal was
always moving.
Signed-off-by: Srishti Srivastava <srishtisrivastava.ai@gmail.com>
Reviewed By: srishti-pm
Differential Revision: https://reviews.llvm.org/D158941
Large deep learning models rely on heavy computations. However, not
every computation is necessary. And, even when a computation is
necessary, it helps if the values needed for the computation are
available in registers (which have low-latency) rather than being in
memory (which has high-latency).
Compilers can use liveness analysis to:-
(1) Remove extraneous computations from a program before it executes on
hardware, and,
(2) Optimize register allocation.
Both these tasks help achieve one very important goal: reducing runtime.
Recently, liveness analysis was added to MLIR. Thus, this commit uses
the recently added liveness analysis utility to try to accomplish task
(1).
It adds a pass called `remove-dead-values` whose goal is
optimization (reducing runtime) by removing unnecessary instructions.
Unlike other passes that rely on local information gathered from
patterns to accomplish optimization, this pass uses a full analysis of
the IR, specifically, liveness analysis, and is thus more powerful.
Currently, this pass performs the following optimizations:
(A) Removes function arguments that are not live,
(B) Removes function return values that are not live across all callers of
the function,
(C) Removes unneccesary operands, results, region arguments, region
terminator operands of region branch ops, and,
(D) Removes simple and region branch ops that have all non-live results and
don't affect memory in any way,
iff
the IR doesn't have any non-function symbol ops, non-call symbol user ops
and branch ops.
Here, a "simple op" refers to an op that isn't a symbol op, symbol-user op,
region branch op, branch op, region branch terminator op, or return-like.
It is noteworthy that we do not refer to non-live values as "dead" in this
file to avoid confusing it with dead code analysis's "dead", which refers to
unreachable code (code that never executes on hardware) while "non-live"
refers to code that executes on hardware but is unnecessary. Thus, while the
removal of dead code helps little in reducing runtime, removing non-live
values should theoretically have significant impact (depending on the amount
removed).
It is also important to note that unlike other passes (like `canonicalize`)
that apply op-specific optimizations through patterns, this pass uses
different interfaces to handle various types of ops and tries to cover all
existing ops through these interfaces.
It is because of its reliance on (a) liveness analysis and (b) interfaces
that makes it so powerful that it can optimize ops that don't have a
canonicalizer and even when an op does have a canonicalizer, it can perform
more aggressive optimizations, as observed in the test files associated with
this pass.
Example of optimization (A):-
```
int add_2_to_y(int x, int y) {
return 2 + y
}
print(add_2_to_y(3, 4))
print(add_2_to_y(5, 6))
```
becomes
```
int add_2_to_y(int y) {
return 2 + y
}
print(add_2_to_y(4))
print(add_2_to_y(6))
```
Example of optimization (B):-
```
int, int get_incremented_values(int y) {
store y somewhere in memory
return y + 1, y + 2
}
y1, y2 = get_incremented_values(4)
y3, y4 = get_incremented_values(6)
print(y2)
```
becomes
```
int get_incremented_values(int y) {
store y somewhere in memory
return y + 2
}
y2 = get_incremented_values(4)
y4 = get_incremented_values(6)
print(y2)
```
Example of optimization (C):-
Assume only `%result1` is live here. Then,
```
%result1, %result2, %result3 = scf.while (%arg1 = %operand1, %arg2 = %operand2) {
%terminator_operand2 = add %arg2, %arg2
%terminator_operand3 = mul %arg2, %arg2
%terminator_operand4 = add %arg1, %arg1
scf.condition(%terminator_operand1) %terminator_operand2, %terminator_operand3, %terminator_operand4
} do {
^bb0(%arg3, %arg4, %arg5):
%terminator_operand6 = add %arg4, %arg4
%terminator_operand5 = add %arg5, %arg5
scf.yield %terminator_operand5, %terminator_operand6
}
```
becomes
```
%result1, %result2 = scf.while (%arg2 = %operand2) {
%terminator_operand2 = add %arg2, %arg2
%terminator_operand3 = mul %arg2, %arg2
scf.condition(%terminator_operand1) %terminator_operand2, %terminator_operand3
} do {
^bb0(%arg3, %arg4):
%terminator_operand6 = add %arg4, %arg4
scf.yield %terminator_operand6
}
```
It is interesting to see that `%result2` won't be removed even though it is
not live because `%terminator_operand3` forwards to it and cannot be
removed. And, that is because it also forwards to `%arg4`, which is live.
Example of optimization (D):-
```
int square_and_double_of_y(int y) {
square = y ^ 2
double = y * 2
return square, double
}
sq, do = square_and_double_of_y(5)
print(do)
```
becomes
```
int square_and_double_of_y(int y) {
double = y * 2
return double
}
do = square_and_double_of_y(5)
print(do)
```
Signed-off-by: Srishti Srivastava <srishtisrivastava.ai@gmail.com>
Reviewed By: matthiaskramm, Mogball, jcai19
Differential Revision: https://reviews.llvm.org/D157049
Add support for reasoning about operations with recursive memory effects
to CSE. The recursive effects are gathered by a helper function. I
decided to allow returning duplicates from the helper function because
there's no benefit to spending the computation time to remove them in
the existing use case.
Differential Revision: https://reviews.llvm.org/D156805
This renaming started with the native ODS support for properties, this is completing it.
A mass automated textual rename seems safe for most codebases.
Drop also the ods prefix to keep the accessors the same as they were before
this change:
properties.odsOperandSegmentSizes
reverts back to:
properties.operandSegementSizes
The ODS prefix was creating divergence between all the places and make it harder to
be consistent.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D157173
Move a bunch of lingering test cases from test/Transforms/ into
test/Dialect/Affine and MemRef.
Differential Revision: https://reviews.llvm.org/D155855
This feature was introduced in `D123492`.
Doing equivalence on pointers to sort operands of commutative operations is incorrect when checking equivalence of ops in separate regions (where the lhs and rhs operands are marked as equivalent but are not the same value).
It was also discussed in `D123492` and `D129480` that the correct solution would be to stable sort the operands in canonicalization (based on some numbering in the region maybe), but until that lands, reverting this change will unblock us and other users.
An example of a pass that might not work properly because of this is `DuplicateFunctionEliminationPass`.
Reviewed By: mehdi_amini, jpienaar
Differential Revision: https://reviews.llvm.org/D154699