Patterns in `LowerContractionToSMMLAPattern` are designed to handle
vector-to-matrix multiplication but not matrix-to-vector. This leads to
the following error when processing `rhs` with rank < 2:
```
iree-compile: /usr/local/google/home/kooljblack/code/iree-build/llvm-project/tools/mlir/include/mlir/IR/BuiltinTypeInterfaces.h.inc:268: int64_t mlir::detail::ShapedTypeTrait<mlir::VectorType>::getDimSize(unsigned int) const [ConcreteType = mlir::VectorType]: Assertion `idx < getRank() && "invalid index for shaped type"' failed.
```
Updates the patterns to explicitly check the `rhs` rank and fail on cases
they cannot process.
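For illustration, a contract along these lines (shapes and maps are hypothetical, not taken from the original reproducer) has a rank-1 `rhs` and would previously trip the assertion:
```mlir
// Matrix-times-vector contract; the rank-1 rhs is now rejected by the
// pattern instead of crashing.
%res = vector.contract {
  indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                   affine_map<(d0, d1) -> (d1)>,
                   affine_map<(d0, d1) -> (d0)>],
  iterator_types = ["parallel", "reduction"],
  kind = #vector.kind<add>
} %lhs, %rhs, %acc : vector<8x8xi8>, vector<8xi8> into vector<8xi32>
```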
Added lowering support for IS_DEVICE_PTR and HAS_DEVICE_ADDR clauses for
OMP TARGET directive and added related tests for these changes.
The IS_DEVICE_PTR and HAS_DEVICE_ADDR clauses apply to the OMP TARGET
directive. The OpenMP spec states:
`The **is_device_ptr** clause indicates that its list items are device
pointers.`
`The **has_device_addr** clause indicates that its list items already
have device addresses and therefore they may be directly accessed from a
target device.`
In contrast, the USE_DEVICE_PTR and USE_DEVICE_ADDR clauses apply to the
OMP TARGET DATA directive, for which the OpenMP spec states:
`Each list item in the **use_device_ptr** clause results in a new list
item that is a device pointer that refers to a device address`
`Each list item in a **use_device_addr** clause that is present in the
device data environment is treated as if it is implicitly mapped by a
map clause on the construct with a map-type of alloc`
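A minimal sketch of the lowered form (the clause assembly syntax and operand names here are assumptions, not verbatim output):
```mlir
// Sketch: an omp.target region carrying the two new clauses.
omp.target is_device_ptr(%ptr : !llvm.ptr) has_device_addr(%addr : !llvm.ptr) {
  // ... body may access %ptr / %addr directly on the device ...
  omp.terminator
}
```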
This MR adds the `lower-vector-multi-reduction` pass to lower the
vector.multi_reduction operation.
While the Transform Dialect includes an operation,
`transform.apply_patterns.vector.lower_multi_reduction`, intended for a
similar purpose, its utility is limited to projects that have adopted
the Transform Dialect. Since not all projects integrate that dialect, the
proposed pass serves as a standalone alternative, ensuring that projects
that rely solely on the traditional pass infrastructure can also benefit
from the optimized lowering of the `multi_reduction` operation.
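For instance, a typical op handled by the pass (a sketch; the flag is derived from the pass name above):
```mlir
// Reduces dim 1 of a 2-D vector; lowered by, e.g.:
//   mlir-opt --lower-vector-multi-reduction input.mlir
%0 = vector.multi_reduction <add>, %src, %acc [1] : vector<4x8xf32> to vector<4xf32>
```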
---------
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
Disallows initialization of scalable vectors with an attribute of
arbitrary values, e.g.:
```mlir
%c = arith.constant dense<[0, 1]> : vector<[2] x i32>
```
Initialization using vector splats remains allowed (i.e. when all the
init values are identical):
```mlir
%c = arith.constant dense<[1, 1]> : vector<[2] x i32>
```
Note: This is a re-upload of #86178
This commit adds a `ValueBoundsOpInterface` implementation for
`arith.select`. The implementation is almost identical to `scf.if`
(#85895), but there is one special case: if the condition is a shaped
value, the selection is applied element-wise and the result shape can be
inferred from either operand.
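A sketch of the special case (types here are illustrative):
```mlir
// %cond is shaped, so the select is applied element-wise and the result
// shape can be inferred from either %a or %b.
%r = arith.select %cond, %a, %b : tensor<?xi1>, tensor<?xf32>
```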
Note: This is a re-upload of #86383.
This commit adds support for `scf.if` to `ValueBoundsConstraintSet`.
Example:
```
%0 = scf.if ... -> index {
  scf.yield %a : index
} else {
  scf.yield %b : index
}
```
The following constraints hold for %0:
* %0 >= min(%a, %b)
* %0 <= max(%a, %b)
Such constraints cannot be added to the constraint set; min/max is not
supported by `IntegerRelation`. However, if we know which one of %a and
%b is larger, we can add constraints for %0. E.g., if %a <= %b:
* %0 >= %a
* %0 <= %b
This commit required a few minor changes to the
`ValueBoundsConstraintSet` infrastructure, so that values can be
compared while we are still in the process of traversing the IR/adding
constraints.
Note: This is a re-upload of #85895, which was reverted. The bug that
caused the failure was fixed in #87859.
This commit removes the no-longer-required bitcast-inserting pattern from
the LLVM dialect's type consistency patterns. It was previously required to
enable Mem2Reg and SROA to promote accesses that had different types.
Recent changes to both passes added direct support for this feature to
them, so the pattern has no further use.
In scalable code it is very common to have constant multiples of vscale,
e.g. `4 * vscale`. This updates `arith.muli` to pretty-print the result
name in such cases, so `4 * vscale` becomes `%c4_vscale`.
This makes reading IR dumps of scalable code a little nicer.
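For example (a sketch of the intended printing):
```mlir
%c4 = arith.constant 4 : index
%vscale = vector.vscale
// The result name now reflects the constant multiple of vscale:
%c4_vscale = arith.muli %vscale, %c4 : index
```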
This commit extends the folders of chainable casts (bitcast and
addrspacecast) to ensure that they fold a chain of the same casts into a
single cast.
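For example, a chain of identical casts now folds as sketched below (address spaces chosen for illustration):
```mlir
%0 = llvm.addrspacecast %p : !llvm.ptr to !llvm.ptr<1>
%1 = llvm.addrspacecast %0 : !llvm.ptr<1> to !llvm.ptr<2>
// ... folds into a single cast:
%1 = llvm.addrspacecast %p : !llvm.ptr to !llvm.ptr<2>
```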
Additionally cleans up the canonicalization test file, as it used some
outdated constructs.
This commit relaxes Mem2Reg's type equality requirement for the LLVM
dialect's load and store operations. For now, we only allow loads to be
promoted if the reaching definition can be cast to a value of the
target type.
For stores, the same conversion casting check is applied, and we ensure
that the stored value is properly cast to the type of the memory slot.
This is necessary to satisfy assumptions of the general mem2reg pass, as
it creates block arguments with the types of the memory slot.
This relands https://github.com/llvm/llvm-project/pull/87504
As part of this extension this change also does some general cleanup:
1) Make all the methods take `RewriterBase` as an argument instead of
   creating their own builders, which tend to crash when used within
   pattern rewrites.
2) Split `coalescePerfectlyNestedLoops` into two separate methods, one
   for `scf.for` and the other for `affine.for`. The templatization didn't
   seem to be buying much there.
Also general clean up of tests.
Add `requiresReplacedValues` and `visitReplacedValues` methods to
`PromotableOpInterface`. These methods allow `PromotableOpInterface` ops
to transform definitions mutated by a `store`.
This change is necessary to correctly handle the promotion of
`LLVM_DbgDeclareOp`.
---------
Co-authored-by: Théo Degioanni <30992420+Moxinilian@users.noreply.github.com>
This reverts commit d6e4582198 as it
violates an assumption of Mem2Reg's block argument creation. Mem2Reg
strongly assumes that all involved values have the same type as the
alloca, an assumption the reverted commit relaxed. As a result, branches
were created that jumped to basic blocks with differently typed block
arguments.
Before the change `test-loop-fusion` and `affine-super-vectorizer-test`
options were in their own category. This was because they used the
standard LLVM command-line parsing with `llvm::cl::opt`. This PR moves
them over to the MLIR `Pass::Option` class.
Before the change
```
$ mlir-opt --help
...
General options:
...
Compiler passes to run
Passes:
...
Pass Pipelines:
...
Generic Options:
....
affine-super-vectorizer-test options:
--backward-slicing
...
--vectorize-affine-loop-nest
test-loop-fusion options:
--test-loop-fusion-dependence-check
...
--test-loop-fusion-transformation
```
After the change
```
$ mlir-opt --help
...
General options:
...
Compiler passes to run
Passes:
...
--affine-super-vectorizer-test
--backward-slicing
...
--vectorize-affine-loop-nest
...
--test-loop-fusion
--test-loop-fusion-dependence-check
...
--test-loop-fusion-transformation
...
Pass Pipelines:
...
Generic Options:
...
```
---------
Signed-off-by: philass <plassen@groq.com>
Currently, by-ref reductions will allocate the per-thread reduction
variable in the initialization region. Adding a cleanup region allows
that allocation to be undone. This will allow flang to support reduction
of arrays stored on the heap.
This conflation of allocation and initialization in the initialization
region should be fixed in the future to better match the OpenMP standard,
but that is beyond the scope of this patch.
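A rough sketch of a by-ref reduction using the new region (region and yield syntax assumed from the OpenMP dialect; the heap allocation made in `init` is undone in `cleanup`):
```mlir
llvm.func @malloc(i64) -> !llvm.ptr
llvm.func @free(!llvm.ptr)

omp.declare_reduction @add_heap_f32 : !llvm.ptr
init {
^bb0(%orig: !llvm.ptr):
  // Allocate and zero-initialize the per-thread copy. Note: as mentioned
  // above, the allocation still happens in the init region for now.
  %size = llvm.mlir.constant(4 : i64) : i64
  %buf = llvm.call @malloc(%size) : (i64) -> !llvm.ptr
  %zero = llvm.mlir.constant(0.0 : f32) : f32
  llvm.store %zero, %buf : f32, !llvm.ptr
  omp.yield(%buf : !llvm.ptr)
}
combiner {
^bb0(%lhs: !llvm.ptr, %rhs: !llvm.ptr):
  %a = llvm.load %lhs : !llvm.ptr -> f32
  %b = llvm.load %rhs : !llvm.ptr -> f32
  %sum = llvm.fadd %a, %b : f32
  llvm.store %sum, %lhs : f32, !llvm.ptr
  omp.yield(%lhs : !llvm.ptr)
}
cleanup {
^bb0(%arg: !llvm.ptr):
  // Undo the allocation made in the init region.
  llvm.call @free(%arg) : (!llvm.ptr) -> ()
  omp.yield
}
```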
This commit relaxes Mem2Reg's type equality requirement for the LLVM
dialect's load and store operations. For now, we only allow loads to be
promoted if the reaching definition can be cast to a value of the
target type.
For stores, all type checks are removed, as a non-volatile store that
does not write out the alloca's pointer can always be deleted.
Updates `castAwayContractionLeadingOneDim` to check for leading unit
dimensions before inserting `vector.transpose` ops.
Currently `castAwayContractionLeadingOneDim` removes all leading unit
dims based on the accumulator and transposes any subsequent operands to
match the accumulator indexing. This does not take into account whether
the transpose is strictly necessary, for instance when given this
vector-matrix contract:
```mlir
%result = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %lhs, %rhs, %acc : vector<1x1x8xi32>, vector<1x8x8xi32> into vector<1x8xi32>
```
Passing this through `castAwayContractionLeadingOneDim` pattern produces
the following:
```mlir
%0 = vector.transpose %arg0, [1, 0, 2] : vector<1x1x8xi32> to vector<1x1x8xi32>
%1 = vector.extract %0[0] : vector<1x8xi32> from vector<1x1x8xi32>
%2 = vector.extract %arg2[0] : vector<8xi32> from vector<1x8xi32>
%3 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %1, %arg1, %2 : vector<1x8xi32>, vector<1x8x8xi32> into vector<8xi32>
%4 = vector.broadcast %3 : vector<8xi32> to vector<1x8xi32>
```
The `vector.transpose` introduced does not affect the underlying data
layout (it is effectively a no-op), but it cannot be folded automatically.
This change avoids inserting transposes when only leading unit
dimensions are involved.
Fixes #85691
Updates smmla unrolling patterns to handle vecmat contracts where `dimM=1`. This includes explicit vecmats in the form: `<1x8xi8> x <8x8xi8> --> <1x8xi32>` or implied with the leading dim folded: `<8xi8> x <8x8xi8> --> <8xi32>`
Since the smmla operates on two `<2x8xi8>` input vectors to produce a `<2x2xi32>` accumulator, half of each 2x2 accumulator tile is dummy data not pertinent to the computation, resulting in half throughput.
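One plausible form of such a vecmat contract (indexing maps follow the (m, k) × (n, k) convention these patterns match; illustrative only):
```mlir
%res = vector.contract {
  indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                   affine_map<(d0, d1, d2) -> (d1, d2)>,
                   affine_map<(d0, d1, d2) -> (d0, d1)>],
  iterator_types = ["parallel", "parallel", "reduction"],
  kind = #vector.kind<add>
} %lhs, %rhs, %acc : vector<1x8xi8>, vector<8x8xi8> into vector<1x8xi32>
```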
This addition catches common cases of malformed `tosa.reshape` ops. This
prevents the `--tosa-to-tensor` pass from asserting when fed invalid
operations, as these will be caught ahead of time by the verifier.
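For instance, a reshape like the following (a hypothetical example of one malformed case) is now rejected by the verifier instead of crashing the pass:
```mlir
// new_shape describes 6 elements, but the input only has 4.
%0 = tosa.reshape %arg0 {new_shape = array<i64: 2, 3>} : (tensor<4xf32>) -> tensor<2x3xf32>
```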
Closes #87396
This commit extends the data layout subsystem with accessors for the
endianness. The implementation follows the structure implemented for
alloca, global, and program memory spaces.
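A sketch of how the endianness can be specified in a data layout spec (the `"dlti.endianness"` key is an assumption based on the existing entries):
```mlir
module attributes { dlti.dl_spec = #dlti.dl_spec<
  #dlti.dl_entry<"dlti.endianness", "little">
>} {
}
```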
`linalg.generic` ops with sparse tensors do not necessarily bufferize to
element-wise access, because insertions into a sparse tensor may change
the layout of (or reallocate) the underlying sparse data structures.
The tosa-infer-shapes pass inserts tensor.cast operations to reconcile
refined result types with consumers whose types cannot be refined. This
process interferes with how types are refined in tosa.while_loop body
regions, where types are propagated speculatively (to determine the
types of the tosa.yield terminator) and then reverted.
The new tensor.cast operations result in a crash due to not having types
associated with them for the reversion process.
This change modifies the shape propagation behavior so that the
introduction of tensor.cast operations behaves better with this type
reversion process. The new behavior is to only introduce tensor.cast
operations once we wish to commit the newly computed types to the IR.
This is an example causing the crash:
```mlir
func.func @while_dont_crash(%arg0 : tensor<i32>) -> (tensor<*xi32>) {
  %0 = tosa.add %arg0, %arg0 : (tensor<i32>, tensor<i32>) -> tensor<*xi32>
  %1 = tosa.while_loop (%arg1 = %0) : (tensor<*xi32>) -> tensor<*xi32> {
    %2 = "tosa.const"() <{value = dense<3> : tensor<i32>}> : () -> tensor<i32>
    %3 = tosa.greater_equal %2, %arg1 : (tensor<i32>, tensor<*xi32>) -> tensor<*xi1>
    tosa.yield %3 : tensor<*xi1>
  } do {
  ^bb0(%arg1: tensor<*xi32>):
    // Inferrable operation whose type will refine to tensor<i32>
    %3 = tosa.add %arg1, %arg1 : (tensor<*xi32>, tensor<*xi32>) -> tensor<*xi32>
    // Non-inferrable use site, will require the cast:
    // tensor.cast %3 : tensor<i32> to tensor<*xi32>
    //
    // The new cast operation will result in accessing undefined memory through
    // originalTypeMap in the C++ code.
    "use"(%3) : (tensor<*xi32>) -> ()
    tosa.yield %3 : tensor<*xi32>
  }
  return %1 : tensor<*xi32>
}
```
The `tensor.cast` operation inserted in the loop body causes a failure
in the code which resets the types after propagation through the loop
body:
```c++
// The types inferred in the block assume the operand types specified for
// this iteration. We need to restore the original types to ensure that
// future iterations only use the already specified types, not possible
// types from previous iterations.
for (auto &block : bodyRegion) {
  for (auto arg : block.getArguments())
    arg.setType(originalTypeMap[arg]);
  for (auto &op : block)
    for (auto result : op.getResults())
      result.setType(originalTypeMap[result]); // problematic access
}
```
---------
Co-authored-by: Spenser Bauman <sabauma@fastmail>
If `before` block args are directly forwarded to `scf.condition`, make
sure they are passed in the same order.
This is needed for `scf.while` uplifting
https://github.com/llvm/llvm-project/pull/76108
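For example (a sketch; `"test.condition"` stands in for an arbitrary predicate):
```mlir
%res:2 = scf.while (%a = %init0, %b = %init1) : (i32, i32) -> (i32, i32) {
  %cond = "test.condition"() : () -> i1
  // Forwarded directly, in the same order as the block args:
  scf.condition(%cond) %a, %b : i32, i32
} do {
^bb0(%a: i32, %b: i32):
  scf.yield %a, %b : i32, i32
}
```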
This commit removes SROA's type consistency constraints from LLVM
dialect's GEPOp. The checks for valid indexing are now purely done by
computing the GEP's offset with the aid of the data layout.
To simplify handling of "nested subslots", we trick SROA by handing it
memory slots that hold byte array types. This ensures that subsequent
accesses only need to check whether their access will be in-bounds. This
lifts the requirement of determining the sub-types for all but the first
level of subslots.
There is no good way to report detailed errors from inside the
`Pass::initializeOptions` function, as context may not be available at
that point and writing directly to `llvm::errs()` is not composable.
See
https://github.com/llvm/llvm-project/pull/87166#discussion_r1546426763
* Add error handler callback to `Pass::initializeOptions`
* Update `PassOptions::parseFromString` to support custom error stream
instead of using `llvm::errs()` directly.
* Update default `Pass::initializeOptions` implementation to propagate
error string from `parseFromString` to new error handler.
* Update `MapMemRefStorageClassPass` to report error details using new
API.
Introduce 3 new optional attributes to the `transform.print` op:
* `assume_verified`
* `use_local_scope`
* `skip_regions`
The primary motivation is to allow printing on large inputs that
otherwise take forever to print and verify. For the full context, see
this IREE issue: https://github.com/openxla/iree/issues/16901.
Also add some tests and fix the op description.
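A sketch of the new attributes in use (unit-attribute syntax assumed):
```mlir
transform.print %target {assume_verified, use_local_scope, skip_regions} : !transform.any_op
```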
Add a rounding mode attribute to `arith`. This attribute can be used in
different FP `arith` operations to control the rounding mode. The rounding
modes correspond to IEEE 754-specified rounding modes. Use it in
`arith.truncf` folding. As this is not supported in dialects other than
LLVM, conversion should fail for now if this attribute is present.
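A sketch of the attribute on `arith.truncf` (keyword placement and the rounding-mode name are assumptions):
```mlir
// Truncate with an explicit IEEE 754 rounding mode.
%0 = arith.truncf %x to_nearest_even : f64 to f32
```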
---------
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
These non-finite math ops are supported by SPIR-V but not by WGSL.
Assume finite floating point values and expand these ops into `false`.
Previously, this worked by adding fast math flags during conversion from
arith to spirv, but this got removed in
https://github.com/llvm/llvm-project/pull/86578.
Also do some misc cleanups in the surrounding code.
The original implementation was eagerly reporting silenceable failures
from actions as definite failures. Since silenceable failures are
intended for cases when the IR has not been irreversibly modified, it's
okay to propagate them as silenceable failures of the parent op.
Fixes #86834.
-- Adds folding of a producer linalg transpose op with a consumer unpack
op, and folding of a producer unpack op with a consumer transpose op.
-- Minor bug fixes w.r.t. the test cases.
Fold the `tensor.insert_slice` of a `tensor.extract_slice` into a single
`tensor.extract_slice` when the `insert_slice` simply re-expands unit
dims dropped by the `extract_slice`.
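A sketch of the folded pair (shapes illustrative):
```mlir
// %e drops the leading unit dim; %i merely re-expands it, so the pair
// can fold into a single tensor.extract_slice producing tensor<1x16xf32>.
%e = tensor.extract_slice %src[0, 0] [1, 16] [1, 1] : tensor<8x16xf32> to tensor<16xf32>
%i = tensor.insert_slice %e into %dst[0, 0] [1, 16] [1, 1] : tensor<16xf32> into tensor<1x16xf32>
```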
Adds support for scalable vectors to the patterns defined in
VectorLinearize.cpp.
Linearization is disabled in two notable cases:
* vectors with more than one scalable dimension (we cannot represent
vscale^2),
* vectors initialized with an arith.constant that is not a vector splat
(such arith.constant ops cannot be flattened).
Adds support for fusing two scf.for loops occurring in the same block.
Uses the rudimentary checks already in place for scf.forall (like the
target loop's operands being dominated by the source loop).
- Fixes a bug in the dominance check whereby it checked that the values
used in the target loop themselves dominated the source loop, rather than
the ops that define those values.
- Renames the LoopFuseSibling op to LoopFuseSiblingOp.
- Updates LoopFuseSiblingOp's description.
- Adds tests for using LoopFuseSiblingOp on scf.for loops, including one
which fails without the fix for the dominance check.
- Adds tests checking the different failure modes of the dominance
checker.
- Adds a test for the case where an scf.yield is automatically generated
when there are no loop-carried variables.
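A sketch of driving the op from a transform script (the `transform.loop.fuse_sibling` syntax is assumed from the op's name; handle plumbing is illustrative):
```mlir
transform.sequence failures(propagate) {
^bb0(%root: !transform.any_op):
  %loops = transform.structured.match ops{["scf.for"]} in %root
      : (!transform.any_op) -> !transform.any_op
  %target, %source = transform.split_handle %loops
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
  %fused = transform.loop.fuse_sibling %target into %source
      : (!transform.any_op, !transform.any_op) -> !transform.any_op
  transform.yield
}
```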