Edit the `findValueInReverseUseDefChain` method to accept an `OpOperand`
instead of a `Value`. This ensures that the populated
`visitedOpOperands` argument is fully accurate and contains the
`OpOperand` from which the reverse use-def chain traversal started.
This PR adds a `ControlBuildSubsetExtractionFn` to the tensor empty
elimination util. It controls how the subset extraction of the
`SubsetInsertionOpInterface` is built.
The control function returns the subset extraction value that will
replace the `emptyTensorOp` use consumed by a specific user (the use
the util is expected to eliminate).
The default control function preserves today's behavior without any
additional changes.
In `buffer-results-to-out-params`, when the `hoist-static-allocs` option
is enabled, the pass looked only for `memref.alloc` ops when attempting
to avoid copies, which made it impossible to extend to external ops with
allocation-like semantics. This patch simply changes `memref::AllocOp`
to `AllocationOpInterface` in the check, enabling the logic for any
allocation op.
Moreover, for function call updates, we enable setting an allocation
function callback in `BufferResultsToOutParamsOpts` to allow users to
emit their own alloc-like op.
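As a minimal sketch (the `mydialect.alloc` op below is hypothetical and
assumed to implement `AllocationOpInterface`), such an op is now treated
the same way as `memref.alloc` by the copy-avoidance logic:
```mlir
func.func @f() -> memref<16xf32> {
  // Hypothetical downstream allocation op; with the interface-based check
  // it is now eligible for the same copy-avoidance path as memref.alloc.
  %buf = "mydialect.alloc"() : () -> memref<16xf32>
  return %buf : memref<16xf32>
}
```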
The greedy rewriter is used in many different flows and has a lot of
convenience (worklist management, debugging actions, tracing, etc.). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.
These are independent forms of greediness, and combining them leads to
inefficiency, e.g., in cases where one needs to create different phases
in a lowering and apply patterns in a specific order split across
different passes. Using the driver, one ends up needlessly retrying
folding or running multiple rounds of folding attempts where one final
run would have sufficed.
Of course, one can avoid this behavior locally by building a custom
driver, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated API should just be a find-and-replace (e.g., of the
`find ./ -type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety),
as the API arguments haven't changed between the two.
In many cases, `emptyTensorElimination` cannot transform or eliminate
the empty tensor that is being inserted into the
`SubsetInsertionOpInterface`.
There are two major reasons for that:
1. Failure to find a legal/suitable insertion point for the
`subsetExtract` that is about to replace the empty tensor. We may be
able to handle this by moving the values responsible for building the
`subsetExtract` close to the empty tensor (which is about to be
eliminated), thus increasing the probability of finding a legal
insertion point.
2. The `EmptyTensorElimination` transform replaces all of the
`tensor.empty`'s uses at once in one application, rather than replacing
only the specific use that was visited in the use-def chain (when
traversing from the `tensor.insert_slice`). Replacing all uses of the
`tensor.empty` may lead to additional read effects after bufferization
of the specific subset extract/subview, which should not be the case.
Both cases may result in many copies during the subsequent
bufferization that cannot be canonicalized away.
The first case can be observed with a `tensor.empty` followed by a
`SubsetInsertionOpInterface` op (in simple terms, a
`tensor.insert_slice`), as produced when lowering `tensor/tosa.concat`.
The second case can be observed with a `tensor.empty` that has many
uses: the transformation is applied only once, since all of the uses are
replaced at once. A sketch of the first case is shown below.
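For illustration, a minimal made-up sketch of the first case (names,
shapes, and offsets are invented; the exact IR produced by a concat
lowering may differ):
```mlir
func.func @f(%a: tensor<4xf32>, %b: tensor<4xf32>,
             %dest: tensor<16xf32>) -> tensor<16xf32> {
  // Lowered concat: the result is assembled in a new empty tensor.
  %empty = tensor.empty() : tensor<8xf32>
  %0 = tensor.insert_slice %a into %empty[0] [4] [1]
      : tensor<4xf32> into tensor<8xf32>
  %1 = tensor.insert_slice %b into %0[4] [4] [1]
      : tensor<4xf32> into tensor<8xf32>
  // The concat result is inserted into %dest. The elimination would like
  // to replace %empty with an extract_slice of %dest, but when %dest (or
  // the values needed to build the extract_slice) is defined after
  // %empty, a legal insertion point may not be found.
  %2 = tensor.insert_slice %1 into %dest[8] [8] [1]
      : tensor<8xf32> into tensor<16xf32>
  return %2 : tensor<16xf32>
}
```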
The first commit in the PR only adds lit tests for the cases shown
above (NFC) to illustrate how the transform currently works; upcoming
MRs will contain small changes to handle these cases.
The second commit in this PR replaces only the specific use that was
visited in the use-def chain (when traversing from the
`tensor.insert_slice`'s source).
This commit removes the last remaining components of the dialect
conversion-based bufferization passes.
Note for LLVM integration: If you depend on these components, migrate to
One-Shot Bufferize or copy them to your codebase.
As described in issue llvm/llvm-project#91518, a previous PR
llvm/llvm-project#78484 introduced the `defaultMemorySpaceFn` into
bufferization options, allowing one to inform OneShotBufferize that it
should use a specified function to derive the memory space attribute
from the encoding attribute attached to tensor types.
However, introducing this feature exposed unhandled edge cases,
examples of which are introduced by this change in the new test under
`test/Dialect/Bufferization/Transforms/one-shot-bufferize-encodings.mlir`.
Fixing the inconsistencies introduced by `defaultMemorySpaceFn` is
pretty simple. This change:
- Updates the `bufferization.to_memref` and `bufferization.to_tensor`
operations to explicitly include operand and destination types,
whereas previously they relied on type inference to deduce the
tensor types. Since the type inference cannot recover the correct
tensor encoding/memory space, the operand and result types must be
explicitly included. This is a small assembly format change, but it
touches a large number of test files.
- Makes minor updates to other bufferization functions to handle the
changes in building the above ops.
- Updates bufferization of `tensor.from_elements` to handle memory
space.
Integration/upgrade guide:
In downstream projects, if you have tests or MLIR files that explicitly
use
`bufferization.to_tensor` or `bufferization.to_memref`, then update
them to the new assembly format as follows:
```
%1 = bufferization.to_memref %0 : memref<10xf32>
%2 = bufferization.to_tensor %1 : memref<10xf32>
```
becomes
```
%1 = bufferization.to_memref %0 : tensor<10xf32> to memref<10xf32>
%2 = bufferization.to_tensor %1 : memref<10xf32> to tensor<10xf32>
```
The dialect conversion-based bufferization passes were migrated to
One-Shot Bufferize about two years ago. To clean up the code base, this
commit removes the `finalizing-bufferize` pass, one of the few remaining
parts of the old infrastructure. Most bufferization passes have already
been removed.
Note for LLVM integration: If you depend on this pass, migrate to
One-Shot Bufferize or copy the pass to your codebase.
Depends on #114152.
Add a new helper function `isReachable` to `Block`. This function
traverses all successors of a block to determine if another block is
reachable from the current block.
This functionality has been reimplemented in multiple places in MLIR,
and there are possibly additional copies in downstream projects.
Therefore, it is moved to a common place.
Multiple `func.return` ops inside of a `func.func` op are now supported
during bufferization. This PR extends the code base in 3 places:
- When inferring function return types, `memref.cast` ops are folded
away only if all `func.return` ops have matching buffer types. (E.g., we
don't fold if two `return` ops have operands with different layout
maps.)
- The alias sets of all `func.return` ops are merged. That's because
aliasing is a "may be" property.
- The equivalence sets of all `func.return` ops are taken only if they
match. If different `func.return` ops have different equivalence sets
for their operands, the equivalence information is dropped. That's
because equivalence is a "must be" property.
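For illustration, a minimal made-up sketch of a function with two
`func.return` ops that can now be bufferized; the alias sets of both
returns are merged, and equivalence information is kept only if both
returns agree:
```mlir
func.func @two_returns(%cond: i1, %t: tensor<8xf32>) -> tensor<8xf32> {
  cf.cond_br %cond, ^bb1, ^bb2
^bb1:
  // First return: returns the argument itself.
  return %t : tensor<8xf32>
^bb2:
  // Second return: returns a newly created tensor.
  %0 = tensor.empty() : tensor<8xf32>
  return %0 : tensor<8xf32>
}
```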
This commit is in preparation of removing the deprecated
`func-bufferize` pass. That pass can bufferize functions with multiple
`return` ops.
This commit adds support for recursive function calls to One-Shot
Bufferize.
The analysis does not support recursive function calls: the function
body itself can be analyzed, but we cannot make any assumptions about
the aliasing relation between function results and function arguments.
Similarly, when looking at a `call` op, we do not know whether the
operands will bufferize to a memory read/write. In the absence of such
information, we have to conservatively assume that they do.
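For illustration, a minimal made-up sketch of a recursive function: the
body can be analyzed, but the aliasing between `%arg0` and the call
result is unknown, so the call operand is conservatively treated as read
and written:
```mlir
func.func @recursive(%arg0: tensor<16xf32>) -> tensor<16xf32> {
  // Recursive call: no assumptions can be made about aliasing between
  // the operand and the result, nor about effects on the operand.
  %0 = call @recursive(%arg0) : (tensor<16xf32>) -> tensor<16xf32>
  return %0 : tensor<16xf32>
}
```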
This commit is in preparation of removing the deprecated
`func-bufferize` pass. That pass can bufferize recursive functions.
This commit adds support for bufferizing external functions that have no
body. Such functions were previously rejected by One-Shot Bufferize if
they returned a tensor value.
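For illustration, a minimal made-up sketch of an external function
returning a tensor, which One-Shot Bufferize previously rejected:
```mlir
// External function with no body that returns a tensor; calls to it are
// bufferized with conservative assumptions.
func.func private @external_fn(tensor<16xf32>) -> tensor<16xf32>

func.func @caller(%t: tensor<16xf32>) -> tensor<16xf32> {
  %0 = call @external_fn(%t) : (tensor<16xf32>) -> tensor<16xf32>
  return %0 : tensor<16xf32>
}
```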
This commit is in preparation of removing the deprecated
`func-bufferize` pass. That pass can bufferize external functions.
Also update a few comments.
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.
Patterns already take a `const` type converter. However, some
`populate...` functions do not only add new patterns, but also add
additional type conversion rules. That makes it difficult to find the
place where a type conversion was added in the code base. With this
change, all `populate...` functions that only populate patterns now have
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.
Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
**Description:**
`OneShotModuleBufferize` deals with the bufferization of `FuncOp`,
`CallOp` and `ReturnOp` but they are hard-coded. Any custom
function-like operations will not be handled. The PR replaces a part of
`FuncOp` and `CallOp` with `FunctionOpInterface` and `CallOpInterface`
in `OneShotModuleBufferize` so that custom function ops and call ops can
be bufferized.
**Related Discord Discussion:**
[Link](https://discord.com/channels/636084430946959380/642426447167881246/1280556809911799900)
Co-authored-by: erick-xanadu <110487834+erick-xanadu@users.noreply.github.com>
As specified in the docs,
1) `raw_string_ostream` is always unbuffered and
2) the underlying buffer may be used directly
(see 65b13610a5 for further reference).
* Don't call `raw_string_ostream::flush()`, which is essentially a no-op.
* Avoid unneeded calls to `raw_string_ostream::str()`, to avoid excess indirection.
Allow customization of the `resolveCallable` method in the
`CallOpInterface`. This change allows operations implementing this
interface to provide their own logic for resolving callables.
- Introduce the `resolveCallable` method, which does not include the
optional symbol table parameter. This method replaces the previously
existing extra class declaration `resolveCallable`.
- Introduce the `resolveCallableInTable` method, which incorporates the
symbol table parameter. This method replaces the previous extra class
declaration `resolveCallable` that used the optional symbol table
parameter.
The `buffer-results-to-out-params` pass had a null-pointer dereference
when the `hoist-static-allocs` option is on and the return value of a
function is a parameter of the function. This PR fixes the issue.
Handle the caller/callee type mismatch using `castOrReallocMemRefValue`
instead of just a `CastOp`. The method inserts a reallocation + copy if
it cannot be statically guaranteed that a direct cast would be valid.
Fixes #105916.
Add a pass that is expected to run after the deallocation pipeline and
moves buffer deallocations right after their last user or dependency,
thus optimizing allocation liveness.
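A minimal sketch of the intended effect (the `test.*` ops are
hypothetical placeholders for the last user and for unrelated work):
```mlir
// Before the pass: the deallocation sits at the end of the block,
// keeping the buffer live longer than necessary.
func.func @before() {
  %buf = memref.alloc() : memref<16xf32>
  "test.use"(%buf) : (memref<16xf32>) -> ()
  "test.unrelated_work"() : () -> ()
  memref.dealloc %buf : memref<16xf32>
  return
}

// After the pass: the deallocation is moved right after the last user.
func.func @after() {
  %buf = memref.alloc() : memref<16xf32>
  "test.use"(%buf) : (memref<16xf32>) -> ()
  memref.dealloc %buf : memref<16xf32>
  "test.unrelated_work"() : () -> ()
  return
}
```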
With this PR I am trying to address:
https://github.com/llvm/llvm-project/issues/63230.
What changed:
- While merging identical blocks, don't add a block argument if it is
"identical" to another block argument, i.e., if the two block arguments
refer to the same `Value`. The operations' operands in the block will
point to the argument we already inserted. This needs to happen for all
the arguments we pass to the different successors of the parent block.
- After merging the blocks, get rid of "unnecessary" arguments, i.e., if
all the predecessors pass the same block argument, there is no need to
pass it as an argument.
- This last simplification clashed with
`BufferDeallocationSimplification`. The reason, I think, is that
`BufferDeallocationSimplification` contains an analysis based on the
block structure. If we simplify the block structure (by merging and/or
dropping block arguments), the analysis becomes invalid. The solution I
found is to do a more prudent simplification when running that pass.
**Note-1**: I ran all the integration tests
(`-DMLIR_INCLUDE_INTEGRATION_TESTS=ON`) and they passed.
**Note-2**: I fixed a bug found by @Dinistro in #97697. The issue was
that, when looking for redundant arguments, I was not considering that
the block might already have some arguments. So the index (in the block
args list) of the i-th `newArgument` is `i + numOfOldArguments`.
With this PR I am trying to address:
https://github.com/llvm/llvm-project/issues/63230.
What changed:
- While merging identical blocks, don't add a block argument if it is
"identical" to another block argument, i.e., if the two block arguments
refer to the same `Value`. The operations' operands in the block will
point to the argument we already inserted. This needs to happen for all
the arguments we pass to the different successors of the parent block.
- After merging the blocks, get rid of "unnecessary" arguments, i.e., if
all the predecessors pass the same block argument, there is no need to
pass it as an argument.
- This last simplification clashed with
`BufferDeallocationSimplification`. The reason, I think, is that
`BufferDeallocationSimplification` contains an analysis based on the
block structure. If we simplify the block structure (by merging and/or
dropping block arguments), the analysis becomes invalid. The solution I
found is to do a more prudent simplification when running that pass.
**Note**: this is a rework of #96871. I ran all the integration tests
(`-DMLIR_INCLUDE_INTEGRATION_TESTS=ON`) and they passed.
Within nested symbol tables, the `dealloc_helper` function generated by
the lower-deallocations pass was incorrectly positioned, causing calls
to fail. This patch fixes the issue.
The `std::optional` returned by `buildPromotedAlloc` was directly
dereferenced and assumed to be non-null, even though the documentation
for `AllocationOpInterface` indicates that `std::nullopt` is a legal
value if buffer stack promotion is not supported (and is the default
value supplied by the TableGen interface file). This patch removes the
direct dereference so that the optional can be null-checked prior to use.
Co-authored-by: Nikhil Kalra <nkalra@apple.com>
With this PR I am trying to address:
https://github.com/llvm/llvm-project/issues/63230.
What changed:
- While merging identical blocks, don't add a block argument if it is
"identical" to another block argument, i.e., if the two block arguments
refer to the same `Value`. The operations' operands in the block will
point to the argument we already inserted.
- After merging the blocks, get rid of "unnecessary" arguments, i.e., if
all the predecessors pass the same block argument, there is no need to
pass it as an argument.
- This last simplification clashed with
`BufferDeallocationSimplification`. The reason, I think, is that
`BufferDeallocationSimplification` contains an analysis based on the
block structure. If we simplify the block structure (by merging and/or
dropping block arguments), the analysis becomes invalid. The solution I
found is to do a more prudent simplification when running that pass.
**Note**: many tests are still not passing. But I wanted to submit the
code before changing all the tests (and probably adding a couple), so
that we can agree in principle on the algorithm/design.
There is an optimization in One-Shot Bufferize wrt. ops that bufferize
to elementwise access. A copy can sometimes be avoided. E.g.:
```
%0 = tensor.empty()
%1 = linalg.fill ...
%2 = linalg.map ins(%1, ...) outs(%1)
```
In the above example, a buffer copy is not needed for %1, even though
the same buffer is read/written by two different operands (of the same
op). That's because the op bufferizes to elementwise access.
```c++
// Two equivalent operands of the same op are not conflicting if the op
// bufferizes to element-wise access. I.e., all loads at a position
// happen before all stores to the same position.
```
This optimization cannot be applied when op dominance cannot be used to
rule out conflicts. E.g., when the `linalg.map` is inside of a loop. In
such a case, the reads/writes happen multiple times and it is not
guaranteed that "all loads at a position happen before all stores to the
same position."
Fixes #90019.
In the original implementation, empty tensor elimination would add a
`tensor.cast` and eliminate the tensor even if the element types differ
(e.g., f32 vs. bf16). This change adds a check for the element type and
skips the elimination if the types differ.
Handling parallel-region RaW conflicts should usually be the
responsibility of the source program rather than the bufferization
analysis. However, to preserve current functionality, the checks on
parallel regions are put behind a bufferization option in this PR, which
is on by default. The default behavior does not change, but this PR
enables the option to leave the parallelism checks out of the
bufferization analysis.
This commit fixes a crash in the ownership-based buffer deallocation
pass when indirectly calling a function via SSA value. Such functions
must be conservatively assumed to be public.
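A minimal made-up sketch of the pattern: the callee is reached through
an SSA value rather than a symbol, so the deallocation pass cannot
resolve it and must treat it like a public function:
```mlir
func.func private @callee(%arg0: memref<f32>) {
  return
}

func.func @caller(%m: memref<f32>) {
  // The function is materialized as an SSA value and called indirectly.
  %f = func.constant @callee : (memref<f32>) -> ()
  func.call_indirect %f(%m) : (memref<f32>) -> ()
  return
}
```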
Fixes #94780.
These passes have been deprecated for a long time and replaced by
one-shot bufferization. These passes are also unsafe because they do not
check for read-after-write conflicts.
Relands https://github.com/llvm/llvm-project/pull/93488 which failed on
buildbot. Fixes the failure by updating integration tests to use
one-shot-bufferize instead.
These passes have been deprecated for a long time and replaced by
one-shot bufferization. These passes are also unsafe because they do not
check for read-after-write conflicts.
Add an option `hoist-static-allocs` that removes the unnecessary
`memref.alloc` and `memref.copy` after this pass when the memref in the
`ReturnOp` is allocated by `memref.alloc` and is statically shaped.
Instead, the uses of the allocated memref are replaced with the memref
in the out argument.
By default, `BufferResultsToOutParams` produces a memcpy operation that
copies the originally returned memref to the output argument memref.
This is inefficient when the source of the memcpy (the returned memref
in the original `ReturnOp`) comes from a local `AllocOp`. The pass can
instead use the output argument memref to replace the locally allocated
memref for better performance. `hoist-static-allocs` avoids the dynamic
allocation and the memory movement.
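A minimal made-up sketch of the effect (the `test.source` op is a
placeholder for whatever fills the buffer):
```mlir
// Input: a statically shaped buffer is allocated locally and returned.
func.func @f() -> memref<16xf32> {
  %alloc = memref.alloc() : memref<16xf32>
  "test.source"(%alloc) : (memref<16xf32>) -> ()
  return %alloc : memref<16xf32>
}

// With hoist-static-allocs, the out argument replaces the local
// allocation, so no memref.alloc and no memref.copy are needed.
func.func @f_out_param(%out: memref<16xf32>) {
  "test.source"(%out) : (memref<16xf32>) -> ()
  return
}
```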
This option will be critical for performance-sensitive applications,
which require the `BufferResultsToOutParams` pass for a caller-owned
output-buffer calling convention.
This change corrects invalid behavior in the `--buffer-loop-hoisting`
pass. The pass is in charge of extracting buffer allocations (e.g.,
`memref.alloca`) from loop regions (e.g., `scf.for`) when possible. This
works fine for loops with sequential execution semantics. However, a
buffer allocated in the body of a parallel loop may be concurrently
accessed by multiple threads to store their local data. Extracting such
a buffer out of the loop causes all threads to wrongly share the same
memory region.
In the following example, dimension 1 of the input tensor is reversed.
Dimension 0 is traversed with a parallel loop.
```
func.func @f(%input: memref<2x3xf32>) -> memref<2x3xf32> {
%c0 = index.constant 0
%c1 = index.constant 1
%c2 = index.constant 2
%c3 = index.constant 3
%output = memref.alloc() : memref<2x3xf32>
scf.parallel (%index) = (%c0) to (%c2) step (%c1) {
// Create subviews for working input and output slices
%input_slice = memref.subview %input[%index, 2][1, 3][1, -1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, -1], offset: ?>>
%output_slice = memref.subview %output[%index, 0][1, 3][1, 1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>
// Copy the input slice into this temporary buffer. This intermediate
// copy is unnecessary, but is used for illustration purposes.
%temp = memref.alloc() : memref<1x3xf32>
memref.copy %input_slice, %temp : memref<1x3xf32, strided<[3, -1], offset: ?>> to memref<1x3xf32>
// Copy temporary buffer into output slice
memref.copy %temp, %output_slice : memref<1x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>
scf.reduce
}
return %output : memref<2x3xf32>
}
```
The patch submitted here prevents `%temp = memref.alloc() :
memref<1x3xf32>` from being hoisted when the containing op is
`scf.parallel` or `scf.forall`. A new op trait called
`HasParallelRegion` is introduced and assigned to these two ops to
indicate that their regions have parallel execution semantics.