This commit adds support for `scf.if` to `ValueBoundsConstraintSet`.
Example:
```
%0 = scf.if ... -> index {
  scf.yield %a : index
} else {
  scf.yield %b : index
}
```
The following constraints hold for %0:
* %0 >= min(%a, %b)
* %0 <= max(%a, %b)
Such constraints cannot be added to the constraint set; min/max is not
supported by `IntegerRelation`. However, if we know which one of %a and
%b is larger, we can add constraints for %0. E.g., if %a <= %b:
* %0 >= %a
* %0 <= %b
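For instance (a hypothetical sketch; %cond and %a are assumed to be defined elsewhere), if %b is defined as %a + 1, the comparison %a <= %b is decidable and both bounds can be added:
```mlir
%c1 = arith.constant 1 : index
%b = arith.addi %a, %c1 : index
// %a <= %b is provable, so %0 >= %a and %0 <= %b can be added.
%0 = scf.if %cond -> (index) {
  scf.yield %a : index
} else {
  scf.yield %b : index
}
```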
This commit required a few minor changes to the
`ValueBoundsConstraintSet` infrastructure, so that values can be
compared while we are still in the process of traversing the IR/adding
constraints.
Note: This is a re-upload of #85895, which was reverted. The bug that
caused the failure was fixed in #87859.
As part of this extension, this change also does some general cleanup:
1) Make all the methods take `RewriterBase` as an argument instead of
creating their own builders, which tend to crash when used within
pattern rewrites.
2) Split `coalescePerfectlyNestedLoops` into two separate methods, one
for `scf.for` and the other for `affine.for`. The templatization didn't
seem to be buying much there.
Also general cleanup of tests.
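For context, this is the kind of rewrite the coalescing helpers perform on a perfectly nested `scf.for` pair (a hypothetical sketch; the constants and the div/rem recovery are illustrative):
```mlir
// Before: a perfectly nested pair with trip counts 4 and 8.
scf.for %i = %c0 to %c4 step %c1 {
  scf.for %j = %c0 to %c8 step %c1 {
    "test.use"(%i, %j) : (index, index) -> ()
  }
}

// After coalescing: one loop over 4 * 8 iterations; the original
// induction variables are recovered with div/rem.
scf.for %k = %c0 to %c32 step %c1 {
  %i = arith.divui %k, %c8 : index
  %j = arith.remui %k, %c8 : index
  "test.use"(%i, %j) : (index, index) -> ()
}
```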
If `before` block args are directly forwarded to `scf.condition`, make
sure they are passed in the same order. This is needed for `scf.while`
uplifting:
https://github.com/llvm/llvm-project/pull/76108
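A minimal sketch of the forwarding case in question (hypothetical values and ops); the `before` block args %arg0 and %arg1 are passed to `scf.condition` in the order they were received:
```mlir
%res:2 = scf.while (%arg0 = %init0, %arg1 = %init1) : (i32, i32) -> (i32, i32) {
  %cond = "test.make_condition"() : () -> i1
  // Direct forwarding, in the same order as the block args.
  scf.condition(%cond) %arg0, %arg1 : i32, i32
} do {
^bb0(%a: i32, %b: i32):
  %next = "test.step"(%a) : (i32) -> i32
  scf.yield %next, %b : i32, i32
}
```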
Adds support for fusing two scf.for loops occurring in the same block.
Uses the rudimentary checks already in place for scf.forall (like the
target loop's operands being dominated by the source loop).
- Fixes a bug in the dominance check whereby it was checked that the
values used in the target loop themselves dominated the source loop,
rather than the ops that define these values.
- Renames the LoopFuseSibling op to LoopFuseSiblingOp.
- Updates LoopFuseSiblingOp's description.
- Adds tests for using LoopFuseSiblingOp on scf.for loops, including one
which fails without the fix for the dominance check.
- Adds tests checking the different failure modes of the dominance
checker.
- Adds a test for the case whereby scf.yield is automatically generated
when there are no loop-carried variables.
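For illustration, a sketch of the rewrite (hypothetical values; %A, %B, %v1, %v2, and the bounds are assumed to be defined and the loops to be independent):
```mlir
// Before: two sibling loops with identical lower bound, upper bound, and step.
scf.for %i = %c0 to %ub step %c1 {
  memref.store %v1, %A[%i] : memref<128xf32>
}
scf.for %i = %c0 to %ub step %c1 {
  memref.store %v2, %B[%i] : memref<128xf32>
}

// After fusion: one loop containing both bodies.
scf.for %i = %c0 to %ub step %c1 {
  memref.store %v1, %A[%i] : memref<128xf32>
  memref.store %v2, %B[%i] : memref<128xf32>
}
```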
This PR fixes the condition used when peeling the first iteration out
of a loop: the loop count is now calculated with ceilDiv instead of
floorDiv, so that the first iteration gets peeled as needed.
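As a worked example (illustrative numbers; assuming the trip count is computed as ceilDiv(ub - lb, step)):
```
lb = 0, ub = 5, step = 4
ceilDiv(5 - 0, 4)  = 2   // iterations iv = 0 and iv = 4: there is a first iteration to peel
floorDiv(5 - 0, 4) = 1   // would undercount and skip peeling the first iteration
```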
One-Shot Bufferize currently does not support loops where a yielded
value bufferizes to a buffer that is different from the buffer of the
region iter_arg. In such a case, the bufferization fails with an error
such as:
```
Yield operand #0 is not equivalent to the corresponding iter bbArg
scf.yield %0 : tensor<5xf32>
```
One common reason for non-equivalent buffers is that an op on the path
from the region iter_arg to the terminator bufferizes out-of-place. Ops
that are analyzed earlier are more likely to bufferize in-place.
This commit adds a new heuristic that gives preference to ops that are
reachable on the reverse SSA use-def chain from a region terminator and
are within the parent region of the terminator. This is expected to work
better than the existing heuristics for loops where an iter_arg is
written to multiple times within a loop, but only one write is fed into
the terminator.
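A sketch of the motivating pattern (hypothetical; assumes %x, %y : f32 and %i, %j : index are defined above). With the new heuristic, %b is analyzed before %a because it is reachable from `scf.yield` on the reverse use-def chain:
```mlir
%r = scf.for %iv = %lb to %ub step %step iter_args(%t = %init) -> (tensor<5xf32>) {
  // Write that does not feed the terminator.
  %a = tensor.insert %x into %t[%i] : tensor<5xf32>
  "test.use"(%a) : (tensor<5xf32>) -> ()
  // Write on the reverse use-def chain of the terminator.
  %b = tensor.insert %y into %t[%j] : tensor<5xf32>
  scf.yield %b : tensor<5xf32>
}
```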
Current users of One-Shot Bufferize are not affected by this change.
"Bottom-up" is still the default heuristic. Users can switch to the new
heuristic manually.
This commit also turns the "fuzzer" pass option into a heuristic,
cleaning up the code a bit.
Properly handle fusion of loops with reductions:
* Check that there are no users of the first loop's results between the
two loops.
* Create a new loop op with the merged reduction init values.
* Update the `scf.reduce` op to contain the reductions from both loops.
* Update the loops' users with the new loop results.
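For example, fusing a loop that reduces with addf and a loop that reduces with mulf might produce a merged loop like this (hypothetical sketch; values assumed to be defined):
```mlir
%res:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%s)
    init (%init1, %init2) -> f32, f32 {
  %e1 = memref.load %A[%iv] : memref<100xf32>
  %e2 = memref.load %B[%iv] : memref<100xf32>
  // One scf.reduce carrying the reduction regions of both source loops.
  scf.reduce(%e1, %e2 : f32, f32) {
  ^bb0(%lhs: f32, %rhs: f32):
    %0 = arith.addf %lhs, %rhs : f32
    scf.reduce.return %0 : f32
  }, {
  ^bb0(%lhs: f32, %rhs: f32):
    %0 = arith.mulf %lhs, %rhs : f32
    scf.reduce.return %0 : f32
  }
}
```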
When checking whether the load indices of the second loop coincide with
the store indices of the first loop, the check only considers whether
the index values are the same SSA values. However, in some cases the
index values are defined by other operations. In those cases, the
indices are treated as different even when the results of the defining
operations are the same.
We already check that the iteration space is the same in
`isFusionLegal()`. When checking the operands of the defining
operations, we only need to verify that the operands come from the same
induction variables; if so, we know the results of the defining
operations are the same.
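A sketch of the case this enables (hypothetical values; %idx and %idx2 are distinct SSA values but provably equal because the same operation is applied to the same induction variable):
```mlir
scf.parallel (%i) = (%c0) to (%n) step (%c1) {
  %idx = arith.addi %i, %c1 : index
  memref.store %v, %buf[%idx] : memref<100xf32>
}
scf.parallel (%i) = (%c0) to (%n) step (%c1) {
  %idx2 = arith.addi %i, %c1 : index
  %x = memref.load %buf[%idx2] : memref<100xf32>
  "test.use"(%x) : (f32) -> ()
}
```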
Enable the fusion of parallel loops also when the first loop contains
multiple write accesses to the same buffer, provided the accesses are
always on the same indices.
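A sketch of the newly supported case (hypothetical values; both writes in the first loop target the same index %i):
```mlir
scf.parallel (%i) = (%c0) to (%n) step (%c1) {
  memref.store %a, %buf[%i] : memref<100xf32>
  memref.store %b, %buf[%i] : memref<100xf32>
}
scf.parallel (%i) = (%c0) to (%n) step (%c1) {
  %x = memref.load %buf[%i] : memref<100xf32>
  "test.use"(%x) : (f32) -> ()
}
```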
Fix LIT test cases whose loops were not being fused.
Signed-off-by: Fabrizio Indirli <Fabrizio.Indirli@arm.com>
An op verifier should verify only local properties. This commit removes
the verification of `scf.for` step sizes. (Verifiers can check
attributes but should not follow SSA values.) This verification could
reject IR that is actually valid, e.g.:
```mlir
scf.if %always_false {
// Branch is never entered.
scf.for ... step %c0 { ... }
}
```
This commit fixes `for-loop-peeling.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`:
```
within split at llvm-project/mlir/test/Dialect/SCF/for-loop-peeling.mlir:293 offset :9:3: note: see current operation:
"scf.for"(%0, %3, %2) ({
^bb0(%arg1: index):
%4 = "arith.index_cast"(%arg1) : (index) -> i64
"memref.store"(%4, %arg0) : (i64, memref<i64>) -> ()
"scf.yield"() : () -> ()
}) {__peeled_loop__} : (index, index, index) -> ()
LLVM ERROR: IR failed to verify after folding
```
Note: `%2` is `arith.constant 0 : index`.
Before applying the peeling patterns, it can happen that the `ForOp`
gets a step of zero during folding. This leads to a division-by-zero
down the line.
This patch adds an additional check for a constant-zero step and a
test.
Fix https://github.com/llvm/llvm-project/issues/75758
Introduce a new extension for simple print-debugging of transform
dialect scripts. The initial version of this extension consists of two
ops that print the payload objects associated with transform dialect
values. Similar ops were already available in the test extension and
several downstream projects, and were extensively used for testing.
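A hedged sketch of how such a print op might be used from a transform script (the op name `transform.debug.emit_remark_at` and the exact syntax are assumptions based on the extension as introduced):
```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    %f = transform.structured.match ops{["func.func"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // Emits a remark at every payload op mapped to %f.
    transform.debug.emit_remark_at %f, "matched func" : !transform.any_op
    transform.yield
  }
}
```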
Add a check to validate that the schedule passed to the pipeliner
transformation is valid and won't cause the pipeliner to break SSA. The
check verifies that, for each operation in the loop, the operation is
scheduled after its operands.
The `GreedyPatternRewriteDriver` tries to iteratively fold ops and apply
rewrite patterns to ops. It has special handling for constants: they are
CSE'd and sometimes moved to parent regions to allow for additional
CSE'ing. This happens in `OperationFolder`.
To allow for efficient CSE'ing, `OperationFolder` maintains an internal
lookup data structure to find the existing constant ops with the same
value for each `IsolatedFromAbove` region:
```c++
/// A mapping between an insertion region and the constants that have been
/// created within it.
DenseMap<Region *, ConstantMap> foldScopes;
```
Rewrite patterns are allowed to modify operations. In particular, they
may move operations (including constants) from one region to another
one. Such an IR rewrite can make the above lookup data structure
inconsistent.
We encountered such a bug in a downstream project. This bug materialized
in the form of an op that uses the result of a constant op from a
different `IsolatedFromAbove` region (that is not accessible).
This commit changes the behavior of the `GreedyPatternRewriteDriver`
such that `OperationFolder` is used to CSE constants at the beginning of
each iteration (as the worklist is populated), but no longer during an
iteration. `OperationFolder` is no longer used after populating the
worklist, so we do not have to care about inconsistent state in the
`OperationFolder` due to IR rewrites. The `GreedyPatternRewriteDriver`
now performs the op folding by itself instead of calling
`OperationFolder::tryToFold`.
This commit changes the order of constant ops in test cases, but not the
region in which they appear. All broken test cases were fixed by turning
`CHECK` into `CHECK-DAG`.
Alternatives considered: The state of `OperationFolder` could be
partially invalidated with every `notifyOperationModified` notification.
That is more fragile than the solution in this commit because incorrect
rewriter API usage can lead to missing notifications and hard-to-debug
`IsolatedFromAbove` violations. (It did not fix the above-mentioned bug in
a downstream project, which could be due to incorrect rewriter API usage
or due to another conceptual problem that I missed.) Moreover, ops are
frequently getting modified during a greedy pattern rewrite, so we would
likely keep invalidating large parts of the state of `OperationFolder`
over and over.
Migration guide: Turn `CHECK` into `CHECK-DAG` in test cases. Constant
ops are no longer folded during a greedy pattern rewrite. If you rely on
folding (and rematerialization) of constant ops during a greedy pattern
rewrite, turn the folder into a pattern.
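For example, a typical test migration looks like this:
```mlir
// Before: relies on a fixed order of materialized constants.
// CHECK: %[[C0:.*]] = arith.constant 0 : index
// CHECK: %[[C1:.*]] = arith.constant 1 : index

// After: accepts the constants in either order.
// CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
// CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
```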
This commit makes reductions part of the terminator. Instead of
`scf.yield`, `scf.reduce` now terminates the body of `scf.parallel` ops.
`scf.reduce` may contain an arbitrary number of reductions, with one
region per reduction.
Example:
```mlir
%init = arith.constant 0.0 : f32
%r:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init, %init)
    -> f32, f32 {
  %elem_to_reduce1 = load %buffer1[%iv] : memref<100xf32>
  %elem_to_reduce2 = load %buffer2[%iv] : memref<100xf32>
  scf.reduce(%elem_to_reduce1, %elem_to_reduce2 : f32, f32) {
  ^bb0(%lhs : f32, %rhs: f32):
    %res = arith.addf %lhs, %rhs : f32
    scf.reduce.return %res : f32
  }, {
  ^bb0(%lhs : f32, %rhs: f32):
    %res = arith.mulf %lhs, %rhs : f32
    scf.reduce.return %res : f32
  }
}
```
`scf.reduce` operations can no longer be interleaved with other ops in
the body of `scf.parallel`. This simplifies the op and makes it possible
to assign the `RecursiveMemoryEffects` trait to `scf.reduce`. (This was
not possible before because the op was not a terminator, causing the op
to be DCE'd.)
Abort fusion if a memref load may alias a write but is not an exact
alias. Add an alias check hook to `naivelyFuseParallelOps` so users can
customize alias checking. Use the builtin alias analysis in the
`ParallelLoopFusion` pass.
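A sketch of a fusion that is now aborted (hypothetical function; %A and %B are unrelated arguments that may alias):
```mlir
func.func @may_alias(%A: memref<100xf32>, %B: memref<100xf32>,
                     %c0: index, %c1: index, %n: index, %v: f32) {
  scf.parallel (%i) = (%c0) to (%n) step (%c1) {
    memref.store %v, %A[%i] : memref<100xf32>
  }
  // The load below may read the store above if %A and %B alias, so the
  // loops must not be fused unless alias analysis proves otherwise.
  scf.parallel (%i) = (%c0) to (%n) step (%c1) {
    %x = memref.load %B[%i] : memref<100xf32>
    "test.use"(%x) : (f32) -> ()
  }
  return
}
```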
There is a use case where we need to peel the first iteration out of
the for loop so that the peeled forOp can be canonicalized away and the
fillOp can be fused into the inner forall loop. For example, we have
nested loops as below:
```
linalg.fill ins(...) outs(...)
scf.for %arg = %lb to %ub step %step
  scf.forall ...
```
After the peeling transform, it is expected to be
```
scf.forall ...
  linalg.fill ins(...) outs(...)
scf.for %arg = %(lb + step) to %ub step %step
  scf.forall ...
```
This patch reuses the existing peeling functions as much as possible
and adds support for peeling the first iteration out of the loop.
Support loops without static bounds. Since the number of iterations is
not known, we need to predicate the prologue and epilogue in case the
number of iterations is smaller than the number of stages.
This patch includes work from @chengjunlu
The partial bufferization framework has been replaced with One-Shot
Bufferize. SCF-specific canonicalization patterns for
`to_memref`/`to_tensor` are no longer needed.
- Fix a case where an op is scheduled in stage 0 and used with a
distance of 1.
- Fix a case where we don't peel the epilogue and a value not part of
the last stage is used outside the loop.
Loop peeling is not beneficial if the step size already divides "ub -
lb". There are currently some simple checks to prevent peeling in such
cases when lb, ub, step are constants. This commit adds support for IR
that is the result of loop peeling in the general case; i.e., lb, ub,
step do not necessarily have to be constants.
This change adds a new affine_map simplification rule for semi-affine
maps that appear during loop peeling and are guaranteed to evaluate to a
constant zero. Affine maps such as:
```
(1) affine_map<()[ub, step] -> ((ub - ub mod step) mod step)>
(2) affine_map<()[ub, lb, step] -> ((ub - (ub - lb) mod step - lb) mod step)>
(3) ^ may contain additional summands
```
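For instance, map (1) is guaranteed to be zero because subtracting `ub mod step` from `ub` leaves an exact multiple of `step`; the same argument applies to (2) with `ub - lb` in place of `ub`:
```
ub - (ub mod step) = step * floor(ub / step)
=> ((ub - ub mod step) mod step) = 0
```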
Other affine maps with modulo expressions are not supported by the new
simplification rule.
This fixes #71469.
Expose loop results, which correspond to the region iter_arg values that
are returned from the loop when there are no more iterations. Exposing
loop results is optional because some loops (e.g., `scf.while`) do not
have a 1-to-1 mapping between region iter_args and op results.
Also add additional helper functions to query tied
results/iter_args/inits.
Update most test passes to use the transform-interpreter pass instead
of the test-transform-dialect-interpreter pass. The new "main"
interpreter pass has a named entry point instead of looking up the
top-level op with `PossibleTopLevelOpTrait`, which is arguably a more
understandable interface. The change is mechanical: rewriting an
unnamed sequence into a named one and wrapping the transform IR into a
module when necessary.
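The rewrite looks roughly like this (sketch; `transform.print` stands in for arbitrary transform ops):
```mlir
// Before (test-transform-dialect-interpreter):
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  transform.print %arg0 : !transform.any_op
}

// After (transform-interpreter):
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    transform.print %arg0 : !transform.any_op
    transform.yield
  }
}
```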
Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.
Only the tests in the transform dialect proper and the examples have
not been updated yet. These will be updated separately after a more
careful consideration of the testing coverage of the transform
interpreter logic.
Fix a crash reported in #64331. The crash is described in the following
comment:
> It looks like the bug is being caused by the command line argument
> --scf-parallel-loop-tiling=parallel-loop-tile-sizes=0. More
> specifically, --scf-parallel-loop-tiling=parallel-loop-tile-sizes sets
> the tileSize variable to 0 on [this
> line](7cc1bfaf37/mlir/lib/Dialect/SCF/Transforms/ParallelLoopTiling.cpp (L67)).
> tileSize is then used on [this
> line](7cc1bfaf37/mlir/lib/Dialect/SCF/Transforms/ParallelLoopTiling.cpp (L117))
> causing a divide by zero exception.
This PR will:
1. Call `signalPassFailure()` when 0 is passed as a tile size.
2. Avoid the divide by zero that causes the crash.
Note: This is my first PR for MLIR, so please liberally critique it.
In particular, `upperBoundUnrolledCst` may be larger than `ubCst` when:
1. the step size is greater than 1;
2. `ub - lb` is not evenly divisible by the step size; and
3. the loop's trip count is evenly divisible by the unroll factor.
This is okay since the non-unit step size ensures that the unrolled loop
maintains the same trip count as the original loop. Added a test case
for this.
Fixes #61832.
Co-authored-by: Stephen Chou <stephenchou@google.com>
Add a new interface method that returns the yielded values.
Also add a verifier that checks the number of inits/iter_args/yielded
values. Most of the checked invariants (but not all of them) are already
covered by the `RegionBranchOpInterface`, but the `LoopLikeOpInterface`
now provides (additional) error messages that are easier to read.
The `BufferizableOpInterface` implementation of `scf.for` currently
assumes that an OpResult does not alias with any tensor apart from the
corresponding init OpOperand. Newly allocated buffers (inside of the
loop) are also allowed. The current implementation checks whether the
respective init_arg and yielded value are equivalent. This is overly
strict and causes extra buffer allocations/copies when yielding a new
buffer allocation from a loop.
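A sketch of the pattern that used to be rejected (hypothetical shapes; assumes %cst, %lb, %ub, %s, and %init are defined):
```mlir
%r = scf.for %iv = %lb to %ub step %s iter_args(%t = %init) -> (tensor<5xf32>) {
  // The yielded value is a new allocation, not an in-place update of %t.
  %empty = tensor.empty() : tensor<5xf32>
  %filled = linalg.fill ins(%cst : f32) outs(%empty : tensor<5xf32>) -> tensor<5xf32>
  scf.yield %filled : tensor<5xf32>
}
```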
This patch updates `transform.loop.peel` so that this Op returns two
rather than one handle:
* one for the peeled loop, and
* one for the remainder loop.
Also, following this change this Op will fail if peeling fails. This is
consistent with other similar Ops that also fail if no transformation
takes place.
Relands #67482 with an extra fix for transform_loop_ext.py
Rename and restructure tiling-related transform ops from the structured
extension to be more homogeneous. In particular, all ops now follow a
consistent naming scheme:
- `transform.structured.tile_using_for`;
- `transform.structured.tile_using_forall`;
- `transform.structured.tile_reduction_using_for`;
- `transform.structured.tile_reduction_using_forall`.
This drops the "_op" naming artifact from `tile_to_forall_op` that
shouldn't have been included in the first place, consistently specifies
the name of the control flow op to be produced for loops (instead of
`tile_reduction_using_scf` since `scf.forall` also belongs to `scf`),
and opts for the `using` connector to avoid ambiguity.
The loops produced by tiling are now systematically placed as *trailing*
results of the transform op. While this required changing 3 out of 4 ops
(except for `tile_using_for`), this is the only choice that makes sense
when producing multiple `scf.for` ops that can be associated with a
variadic number of handles. This choice is also most consistent with
*other* transform ops from the structured extension, in particular with
fusion ops, that produce the structured op as the leading result and the
loop as the trailing result.
Add a straightforward sequentialization transform from `scf.forall` to a
nest of `scf.for` in absence of results and expose it as a transform op.
This is helpful in combination with other transform ops, particularly
fusion, that work best on parallel-by-construction `scf.forall` but
later need to target sequential `for` loops.
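The transformation in a nutshell (sketch; `test.payload` is a stand-in for the loop body and the constants are assumed):
```mlir
// Before: parallel-by-construction, no results.
scf.forall (%i, %j) in (4, 8) {
  "test.payload"(%i, %j) : (index, index) -> ()
}

// After sequentialization: a nest of scf.for.
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%c8 = arith.constant 8 : index
scf.for %i = %c0 to %c4 step %c1 {
  scf.for %j = %c0 to %c8 step %c1 {
    "test.payload"(%i, %j) : (index, index) -> ()
  }
}
```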
This commit removes the deallocation capabilities of
one-shot-bufferization. One-shot-bufferization should never deallocate
any memrefs as this should be entirely handled by the
ownership-based-buffer-deallocation pass going forward. This means the
`allow-return-allocs` pass option now defaults to true and
`create-deallocs` defaults to false; both options, as well as the
escape attribute indicating whether a memref escapes the current
region, will be removed. A new `allow-return-allocs-from-loops` option
is added as a
temporary workaround for some bufferization limitations.
Use the new ownership based deallocation pass pipeline in the regression
and integration tests. Some one-shot bufferization tests tested one-shot
bufferize and deallocation at the same time. I removed the deallocation
pass there because the deallocation pass is already thoroughly tested by
itself.
Fixed version of #66471
* Always use the auto-generated `getInitArgs` function. Remove the
hand-written `getInitOperands` duplicate.
* Remove `hasIterOperands` and `getNumIterOperands`. The names were
inconsistent because the "arg" is called `initArgs` in TableGen. Use
`getInitArgs().size()` instead.
* Fix verification around ops with no results.
This change adds a method to modify the ConversionTarget used during
`transform.apply_conversion_patterns` to the
`ConversionPatternDescriptorOpInterface`. This is needed when the TypeConverter
is used to dictate the dynamic legality of operations, as in "structural"
conversion patterns present in, for example, the SCF and func dialects.
As a first use case/test, this change also adds a
`transform.apply_patterns.scf.structural_conversions` operation to the SCF
dialect.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158672
The `scf.forall.in_parallel` terminator operation has a nested graph
region with the `NoTerminator` trait. Such regions are not supported by
the default implementations. Therefore, this commit adds a specialized
implementation for this operation that only covers the case where the
nested region is empty. This is because, after bufferization, ops like
`tensor.parallel_insert_slice` have already been converted to memref
operations residing in the `scf.forall` body only, and the nested
region of `scf.forall.in_parallel` ends up empty.