This MR adds the `lower-vector-multi-reduction` pass to lower the
`vector.multi_reduction` operation.
While the Transform Dialect includes an operation,
`transform.apply_patterns.vector.lower_multi_reduction`, intended for a
similar purpose, it is only usable by projects that have adopted the
Transform Dialect. Since not every project integrates that dialect, the
proposed pass serves as a standalone alternative: it ensures that
projects relying solely on the traditional pass infrastructure can also
benefit from the optimized lowering of the `vector.multi_reduction`
operation.
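For reference, this is the kind of op the pass lowers (a minimal
sketch; the shapes and reduction kind are illustrative):
```mlir
// Reduce dimension 1 with <add>, accumulating into %acc.
%0 = vector.multi_reduction <add>, %source, %acc [1]
    : vector<4x8xf32> to vector<4xf32>
```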
---------
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
Disallows initialization of scalable vectors with an attribute of
arbitrary values, e.g.:
```mlir
%c = arith.constant dense<[0, 1]> : vector<[2]xi32>
```
Initialization using vector splats remains allowed (i.e. when all the
init values are identical):
```mlir
%c = arith.constant dense<[1, 1]> : vector<[2]xi32>
```
Note: This is a re-upload of #86178
Updates `castAwayContractionLeadingOneDim` to check for leading unit
dimensions before inserting `vector.transpose` ops.
Currently `castAwayContractionLeadingOneDim` removes all leading unit
dims based on the accumulator and transposes any subsequent operands to
match the accumulator indexing. This does not take into account whether
the transpose is strictly necessary, for instance when given this
vector-matrix contraction:
```mlir
%result = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %lhs, %rhs, %acc : vector<1x1x8xi32>, vector<1x8x8xi32> into vector<1x8xi32>
```
Passing this through `castAwayContractionLeadingOneDim` pattern produces
the following:
```mlir
%0 = vector.transpose %arg0, [1, 0, 2] : vector<1x1x8xi32> to vector<1x1x8xi32>
%1 = vector.extract %0[0] : vector<1x8xi32> from vector<1x1x8xi32>
%2 = vector.extract %arg2[0] : vector<8xi32> from vector<1x8xi32>
%3 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %1, %arg1, %2 : vector<1x8xi32>, vector<1x8x8xi32> into vector<8xi32>
%4 = vector.broadcast %3 : vector<8xi32> to vector<1x8xi32>
```
The introduced `vector.transpose` does not affect the underlying data
layout (it is effectively a no-op), but it cannot be folded automatically.
This change avoids inserting transposes when only leading unit
dimensions are involved.
Fixes #85691
Adds support for scalable vectors to the patterns defined in
VectorLinearize.cpp.
Linearization is disabled in two notable cases:
* vectors with more than one scalable dimension (we cannot represent
  vscale^2),
* vectors initialised with an `arith.constant` that is not a vector
  splat (such `arith.constant` ops cannot be flattened).
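For the supported cases, linearization looks roughly like this (a
sketch with assumed types, not the exact test output). A splat constant
with a single scalable dimension, e.g.:
```mlir
%c = arith.constant dense<1> : vector<2x[2]xi32>
```
becomes a 1-D constant plus a `vector.shape_cast` back to the original
type:
```mlir
%flat = arith.constant dense<1> : vector<[4]xi32>
%c    = vector.shape_cast %flat : vector<[4]xi32> to vector<2x[2]xi32>
```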
This patch refactors the `linearize.mlir` test - currently it contains
some duplication and can be tricky to follow.
Summary of changes:
* reduce duplication by introducing a shared check prefix (`ALL`) and
by introducing `-check-prefixes`,
* make sure that every "check" line is directly above the
corresponding line of input MLIR,
* group check lines corresponding to a particular prefix together (so
that it's easier to see the expected output for a particular
prefix),
* remove `CHECK` from prefix names (with multiple prefixes that's just
noise that can be avoided) and use a bit more descriptive prefixes
instead (`CHECK0` -> `BW-0`, where `BW` stands for bitwidth),
* unify indentation,
* `nonvec_result` -> `test_tensor_no_linearize` (for consistency with
`test_index_no_linearize`).
NOTE: This change only updates the format of the "CHECK" lines and
doesn't affect what's being tested.
This change is intended as preparation for adding support for scalable
vectors to `LinearizeConstant` and `LinearizeVectorizable` - i.e. the
patterns that `linearize.mlir` is meant to test.
Updates `castAwayContractionLeadingOneDim` to inherit from
`MaskableOpRewritePattern` so that this pattern can support masking.
Builds on top of #83827
This adds a new API built with the `ValueBoundsConstraintSet` to compute
the bounds of possibly scalable quantities. It uses knowledge of the
range of vscale (which is defined by the target architecture) to solve
for the bound as either a constant or an expression in terms of vscale.
The result is an `AffineMap` that always takes at most one parameter,
vscale, and returns a single result, which is the bound of `value`.
The API is defined as follows:
```c++
FailureOr<ConstantOrScalableBound>
vector::ScalableValueBoundsConstraintSet::computeScalableBound(
    Value value, std::optional<int64_t> dim,
    unsigned vscaleMin, unsigned vscaleMax,
    presburger::BoundType boundType,
    bool closedUB = true,
    StopConditionFn stopCondition = nullptr);
```
Note: `ConstantOrScalableBound` is a thin wrapper over the `AffineMap`
with a utility for converting the bound to a single quantity (i.e. a
size and scalable flag).
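For illustration (an assumed example, not taken from the patch), a
scalable upper bound of 4 x vscale could be returned as the map:
```mlir
affine_map<()[s0] -> (s0 * 4)>
```
where the single symbol `s0` stands for vscale.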
We believe this API could prove useful downstream in IREE, which uses a
similar analysis to hoist allocas; that analysis currently fails for
scalable vectors.
Transform interfaces are implemented, directly or via extensions, in
libraries belonging to multiple other dialects. Those dialects don't
need to depend on the non-interface part of the transform dialect, which
includes a growing number of ops and a transitive dependency footprint.
Split out the interfaces into a separate library. This in turn requires
flipping the dependency of the interface on the dialect, which had crept
in because both co-existed in one library. The interface shouldn't
depend on the transform dialect either.
As a consequence of splitting, the capability of the interpreter to
automatically walk the payload IR to identify payload ops of a certain
kind based on the type used for the entry point symbol argument is
disabled. This is a good move by itself as it simplifies the interpreter
logic. This functionality can be trivially replaced by a
`transform.structured.match` operation.
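For example, a schedule can match the payload ops explicitly (a sketch;
`%root` and the matched op name are placeholders):
```mlir
%matmuls = transform.structured.match ops{["linalg.matmul"]} in %root
    : (!transform.any_op) -> !transform.any_op
```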
Buffers are no longer deallocated by One-Shot Bufferize. This is now
done by a separate buffer deallocation pass.
Also fix a bug in the `vector.mask` folding, which was triggered by
`-buffer-deallocation-pipeline`, which runs the canonicalizer.
This PR adds support for `vector.insert` to the patterns that bubble up and down `vector.bitcast` ops across `vector.extract`/`vector.extract_strided_slice`/`vector.insert_strided_slice` ops.
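For example (a sketch with assumed shapes), a `vector.bitcast` that
follows a `vector.insert`:
```mlir
%0 = vector.insert %src, %dst[0] : vector<16xi8> into vector<2x16xi8>
%1 = vector.bitcast %0 : vector<2x16xi8> to vector<2x4xi32>
```
can be bubbled up so that the bitcast applies to the operands instead:
```mlir
%src_cast = vector.bitcast %src : vector<16xi8> to vector<4xi32>
%dst_cast = vector.bitcast %dst : vector<2x16xi8> to vector<2x4xi32>
%1 = vector.insert %src_cast, %dst_cast[0]
    : vector<4xi32> into vector<2x4xi32>
```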
This reverts commit 5cdb8c0c88.
This pattern is producing incorrect IR. For example,
```mlir
func.func @extract_subvector_from_constant_mask() -> vector<16xi1> {
%mask = vector.constant_mask [2, 3] : vector<16x16xi1>
%extract = vector.extract %mask[8] : vector<16xi1> from vector<16x16xi1>
return %extract : vector<16xi1>
}
```
Canonicalizes to
```mlir
func.func @extract_subvector_from_constant_mask() -> vector<16xi1> {
%0 = vector.constant_mask [3] : vector<16xi1>
return %0 : vector<16xi1>
}
```
It should instead be a zero mask because the extraction index (8) is
greater than the constant mask size along that dim (2).
Currently n-d transfer write distribution can be inconsistent with
distribution of reductions if a value has multiple users, one of which
is a transfer_write with a non-standard distribution map, and the other
of which is a vector.reduction.
We may want to consider removing the distribution map functionality in
the future for this reason.
This PR adds support for `arith.trunci` to vector narrow type emulation for iX -> i4 truncations, for X >= 8. For now, the pattern only works for 1D vectors and is based on `vector.shuffle` ops. We would need `vector.deinterleave` to add n-D vector support.
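For instance (illustrative types), the emulation now covers 1-D
truncations such as:
```mlir
%0 = arith.trunci %in : vector<8xi32> to vector<8xi4>
```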
This test failed after landing #81964 due to a bad merge. I provided a quick fix and this PR adds the rest of the CHECK rules that were not merged properly.
It looks like the affine map generated to compute the indices of the
collapsed dimensions used the wrong dim size. For indices `[idx0][idx1]`
we computed the collapsed index as `idx0*size0 + idx1` instead of
`idx0*size1 + idx1` (the stride of the outer index is the size of the
inner dimension; e.g. for a 4x8 shape, `[2][3]` should collapse to
`2*8 + 3 = 19`, not `2*4 + 3 = 11`). This led to correctness issues in
convolution tests when enabling this transformation internally.
This PR replaces the generation of `vector.shuffle` with
`vector.interleave` in the i4 conversions in vector narrow type
emulation. The multi-dimensional semantics of `vector.interleave` allow
us to enable these conversion emulations for multi-dimensional vectors
as well.
This PR adds an optional bitwidth parameter to the vector xfer op
flattening transformation so that the flattening doesn't happen if the
trailing dimension of the read/written vector is larger than this
bitwidth (i.e., we are already able to fill at least one vector register
with that size).
Common backends (LLVM, SPIR-V) only support 1-D vectors; the LLVM
conversion handles n-D vectors (n >= 2) as `array<array<... vector>>`
and the SPIR-V conversion doesn't handle them at all at the moment.
Sometimes it's preferable to treat multi-dimensional vectors as
linearized 1-D vectors. Add a pass to do this. Only constants and simple
elementwise ops are supported for now.
@krzysz00 I've extracted your result type conversion code from
LegalizeToF32 and moved it to a common place.
Also, add a `ConversionPattern` class operating on traits.
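As a rough sketch of what the pass does (assumed shapes, not taken from
the actual tests), an elementwise op such as:
```mlir
%0 = arith.addf %a, %b : vector<2x4xf32>
```
is rewritten to operate on the flattened 1-D type, with
`vector.shape_cast` ops at the boundaries:
```mlir
%a_1d = vector.shape_cast %a : vector<2x4xf32> to vector<8xf32>
%b_1d = vector.shape_cast %b : vector<2x4xf32> to vector<8xf32>
%sum  = arith.addf %a_1d, %b_1d : vector<8xf32>
%0    = vector.shape_cast %sum : vector<8xf32> to vector<2x4xf32>
```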
The interleave operation constructs a new vector by interleaving the
elements from the trailing (or final) dimension of two input vectors,
returning a new vector where the trailing dimension is twice the size.
Note that for the n-D case this differs from the interleaving possible
with `vector.shuffle`, which would only operate on the leading
dimension.
Another key difference is this operation supports scalable vectors,
though currently a general LLVM lowering is limited to the case where
only the trailing dimension is scalable.
Example:
```mlir
%0 = vector.interleave %a, %b
: vector<[4]xi32> ; yields vector<[8]xi32>
%1 = vector.interleave %c, %d
: vector<8xi8> ; yields vector<16xi8>
%2 = vector.interleave %e, %f
: vector<f16> ; yields vector<2xf16>
%3 = vector.interleave %g, %h
: vector<2x4x[2]xf64> ; yields vector<2x4x[4]xf64>
%4 = vector.interleave %i, %j
: vector<6x3xf32> ; yields vector<6x6xf32>
```
Note: This change alone does not add any lowerings.
This is part of
66347e516e
The regression in downstream projects is related to the transfer_read
patterns and needs more investigation. Add support for transfer_write
for now.
This PR adds patterns to convert a sub-byte vector transpose into a
sequence of instructions that perform the transpose on i8 vector
elements. While this rewrite may not lead to peak performance, it
should ensure correctness when dealing with sub-byte transposes.
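Roughly (a sketch with assumed shapes and signedness), a sub-byte
transpose such as:
```mlir
%0 = vector.transpose %v, [1, 0] : vector<4x8xi4> to vector<8x4xi4>
```
is rewritten to extend to i8, transpose on i8 elements, and truncate
back to i4:
```mlir
%ext = arith.extsi %v : vector<4x8xi4> to vector<4x8xi8>
%t   = vector.transpose %ext, [1, 0] : vector<4x8xi8> to vector<8x4xi8>
%0   = arith.trunci %t : vector<8x4xi8> to vector<8x4xi4>
```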
This PR adds new patterns to improve the generated vector code for the emulation of any conversion that has to go through an i4 -> i8 type extension (only signed extensions are supported for now). This will impact any i4 -> i8/i16/i32/i64 signed extension as well as sitofp i4 -> f8/f16/f32/f64.
The asm code generated for the supported cases is significantly better after this PR for both x86 and aarch64.
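Examples of conversions that benefit (illustrative 1-D types):
```mlir
%0 = arith.extsi %a : vector<8xi4> to vector<8xi32>
%1 = arith.sitofp %b : vector<8xi4> to vector<8xf32>
```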
Extends `vector.insert_strided_slice` and `vector.extract_strided_slice`
to allow scalable input and output vectors. For scalable sizes, the
corresponding slice size has to match the corresponding dimension in the
output/input vector (insert/extract, respectively).
This is supported:
```mlir
vector.extract_strided_slice %1 {
    offsets = [0, 3, 0],
    sizes = [1, 1, 4],
    strides = [1, 1, 1] } : vector<1x4x[4]xi32> to vector<1x1x[4]xi32>
```
This is not supported:
```mlir
vector.extract_strided_slice %1 {
    offsets = [0, 3, 0],
    sizes = [1, 1, 2],
    strides = [1, 1, 1] } : vector<1x4x[4]xi32> to vector<1x1x[2]xi32>
```
If a rewrite pattern returns "failure", it must not have modified the
IR. This commit fixes
`Dialect/Vector/vector-contract-to-outerproduct-transforms-unsupported.mlir`
when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
```
* Pattern (anonymous namespace)::ContractionOpToOuterProductOpLowering : 'vector.contract -> ()' {
Trying to match "(anonymous namespace)::ContractionOpToOuterProductOpLowering"
** Insert : 'vector.transpose'(0x5625b3a8cb30)
** Insert : 'vector.transpose'(0x5625b3a8cbc0)
"(anonymous namespace)::ContractionOpToOuterProductOpLowering" result 0
} -> failure : pattern failed to match
} -> failure : pattern failed to match
LLVM ERROR: pattern returned failure but IR did change
```
Note: `vector-contract-to-outerproduct-transforms-unsupported.mlir` is
merged into `vector-contract-to-outerproduct-matvec-transforms.mlir`.
The `greedy pattern application failed` error is no longer produced.
This error indicates that the greedy pattern rewrite did not
converge; it does not mean that a pattern could not be applied.
Support distribution of `vector.transfer_read` ops when operands are
defined inside of the region of `warp_execute_on_lane_0` (except for the
buffer from which the op is reading).
Such IR was previously not supported. This commit changes the
implementation such that indices and the padding value are also
distributed.
This commit simplifies the implementation considerably: the original
implementation created a new `transfer_read` op and then checked if this
new op is valid. If not, the rewrite pattern failed. This was a bit
hacky. It was also a violation of the rewrite pattern API (detected by
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`) because the IR was modified,
but the pattern returned "failure".
As per the docs [1]:
```
In absence of an explicit layout, a memref is considered to have a
multi-dimensional identity affine map layout.
```
This patch makes sure that MemRefs with no strides (i.e. no explicit
layout) are treated as contiguous when checking whether a particular
vector is a contiguous slice of the given MemRef.
[1] https://mlir.llvm.org/docs/Dialects/Builtin/#layout
Follow-up for #76428.
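For example, a memref with the default identity layout is equivalent to
its explicit strided form and is now treated as contiguous (assumed
shapes):
```mlir
// Implicit identity layout ...
memref<4x8xf32>
// ... equivalent to the explicit strided form:
memref<4x8xf32, strided<[8, 1]>>
```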
The `GreedyPatternRewriteDriver` tries to iteratively fold ops and apply
rewrite patterns to ops. It has special handling for constants: they are
CSE'd and sometimes moved to parent regions to allow for additional
CSE'ing. This happens in `OperationFolder`.
To allow for efficient CSE'ing, `OperationFolder` maintains an internal
lookup data structure to find the existing constant ops with the same
value for each `IsolatedFromAbove` region:
```c++
/// A mapping between an insertion region and the constants that have been
/// created within it.
DenseMap<Region *, ConstantMap> foldScopes;
```
Rewrite patterns are allowed to modify operations. In particular, they
may move operations (including constants) from one region to another
one. Such an IR rewrite can make the above lookup data structure
inconsistent.
We encountered such a bug in a downstream project. This bug materialized
in the form of an op that uses the result of a constant op from a
different `IsolatedFromAbove` region (that is not accessible).
This commit changes the behavior of the `GreedyPatternRewriteDriver`
such that `OperationFolder` is used to CSE constants at the beginning of
each iteration (as the worklist is populated), but no longer during an
iteration. `OperationFolder` is no longer used after populating the
worklist, so we do not have to care about inconsistent state in the
`OperationFolder` due to IR rewrites. The `GreedyPatternRewriteDriver`
now performs the op folding by itself instead of calling
`OperationFolder::tryToFold`.
This change alters the order of constant ops in test cases, but not the
region in which they appear. All broken test cases were fixed by turning
`CHECK` into `CHECK-DAG`.
Alternatives considered: The state of `OperationFolder` could be
partially invalidated with every `notifyOperationModified` notification.
That is more fragile than the solution in this commit because incorrect
rewriter API usage can lead to missing notifications and hard-to-debug
`IsolatedFromAbove` violations. (It did not fix the above-mentioned bug
in
a downstream project, which could be due to incorrect rewriter API usage
or due to another conceptual problem that I missed.) Moreover, ops are
frequently getting modified during a greedy pattern rewrite, so we would
likely keep invalidating large parts of the state of `OperationFolder`
over and over.
Migration guide: Turn `CHECK` into `CHECK-DAG` in test cases. Constant
ops are no longer folded during a greedy pattern rewrite. If you rely on
folding (and rematerialization) of constant ops during a greedy pattern
rewrite, turn the folder into a pattern.
Similar to `vector.transfer_read`/`vector.transfer_write`, allow 0-D
vectors in `vector.load` and `vector.store`.
This commit fixes
`mlir/test/Dialect/Vector/vector-transfer-to-vector-load-store.mlir`
when verifying the IR after each pattern (#74270). That test produces a
temporary 0-D load/store op.
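For example (illustrative types), 0-D result vectors like the following
are now accepted:
```mlir
%v = vector.load %m[%i, %j] : memref<4x8xf32>, vector<f32>
vector.store %v, %m[%i, %j] : memref<4x8xf32>, vector<f32>
```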