This MR adds the `lower-vector-multi-reduction` pass to lower the
`vector.multi_reduction` operation.
While the Transform Dialect includes an operation,
`transform.apply_patterns.vector.lower_multi_reduction`, intended for a
similar purpose, its utility is limited to projects that have adopted
the Transform Dialect. Since not all projects are set up to integrate
that dialect, the proposed pass serves as a standalone alternative. It
ensures that projects relying solely on the traditional pass
infrastructure can also benefit from the lowering of the
`vector.multi_reduction` operation.
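For reference, a minimal example of the kind of op the new pass lowers (shapes and value names are illustrative); the pass can be driven standalone, e.g. with something like `mlir-opt --lower-vector-multi-reduction`:
```mlir
// Reduce dimension 1 of a 2-D vector into a 1-D vector.
%red = vector.multi_reduction <add>, %src, %acc [1] : vector<4x8xf32> to vector<4xf32>
```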
---------
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
Updates `castAwayContractionLeadingOneDim` to check for leading unit
dimensions before inserting `vector.transpose` ops.
Currently `castAwayContractionLeadingOneDim` removes all leading unit
dims based on the accumulator and transposes any subsequent operands to
match the accumulator indexing. It does not take into account whether the
transpose is strictly necessary, for instance when given this
vector-matrix contract:
```mlir
%result = vector.contract {
    indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>,
                     affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>,
                     affine_map<(d0, d1, d2, d3) -> (d1, d2)>],
    iterator_types = ["parallel", "parallel", "parallel", "reduction"],
    kind = #vector.kind<add>}
  %lhs, %rhs, %acc : vector<1x1x8xi32>, vector<1x8x8xi32> into vector<1x8xi32>
```
Passing this through `castAwayContractionLeadingOneDim` pattern produces
the following:
```mlir
%0 = vector.transpose %arg0, [1, 0, 2] : vector<1x1x8xi32> to vector<1x1x8xi32>
%1 = vector.extract %0[0] : vector<1x8xi32> from vector<1x1x8xi32>
%2 = vector.extract %arg2[0] : vector<8xi32> from vector<1x8xi32>
%3 = vector.contract {
       indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                        affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                        affine_map<(d0, d1, d2) -> (d1)>],
       iterator_types = ["parallel", "parallel", "reduction"],
       kind = #vector.kind<add>}
     %1, %arg1, %2 : vector<1x8xi32>, vector<1x8x8xi32> into vector<8xi32>
%4 = vector.broadcast %3 : vector<8xi32> to vector<1x8xi32>
```
The `vector.transpose` introduced does not affect the underlying data
layout (it is effectively a no-op), but it cannot be folded automatically.
This change avoids inserting transposes when only leading unit
dimensions are involved.
Fixes #85691
Adds support for scalable vectors to patterns defined in
VectorLinearize.cpp.
Linearization is disabled in two notable cases (see the sketch below):
* vectors with more than 1 scalable dimension (we cannot represent
vscale^2),
* vectors initialised with arith.constant that's not a vector splat
(such arith.constant Ops cannot be flattened).
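For illustration, a minimal sketch of the two excluded cases (the IR below is hypothetical):
```mlir
// Case 1: more than one scalable dimension -- the linearized size would be
// vscale * vscale, which cannot be represented, so this op is left alone.
%sum = arith.addf %a, %b : vector<[4]x[4]xf32>

// Case 2: a non-splat arith.constant -- such constants are not flattened by
// these patterns.
%cst = arith.constant dense<[[1.0, 2.0], [3.0, 4.0]]> : vector<2x2xf32>
```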
Updates `castAwayContractionLeadingOneDim` to inherit from
`MaskableOpRewritePattern` so that this pattern can support masking.
Builds on top of #83827
Adds a generic rewrite pattern for maskable Ops, `MaskableOpRewritePattern`,
that works for both masked and un-masked cases (sketched below), e.g. for both:
* `vector.mask {vector.contract}` (masked), and
* `vector.contract` (not masked).
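For reference, a sketch of the two forms on a simple 1-D dot-product contraction (all value names and shapes here are illustrative):
```mlir
// Un-masked form:
%0 = vector.contract {indexing_maps = [affine_map<(d0) -> (d0)>,
                                       affine_map<(d0) -> (d0)>,
                                       affine_map<(d0) -> ()>],
                      iterator_types = ["reduction"], kind = #vector.kind<add>}
       %a, %b, %acc : vector<8xf32>, vector<8xf32> into f32

// Masked form:
%1 = vector.mask %m {
  vector.contract {indexing_maps = [affine_map<(d0) -> (d0)>,
                                    affine_map<(d0) -> (d0)>,
                                    affine_map<(d0) -> ()>],
                   iterator_types = ["reduction"], kind = #vector.kind<add>}
    %a, %b, %acc : vector<8xf32>, vector<8xf32> into f32
} : vector<8xi1> -> f32
```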
This helps to reduce code-duplication and standardise how we implement such
patterns.
Fixes #78787
This PR adds support for `vector.insert` to the patterns that bubble up and down `vector.bitcast` ops across `vector.extract`/`vector.extract_strided_slice`/`vector.insert_strided_slice` ops (see the sketch below).
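A sketch of the insert case (shapes and names are illustrative): a `vector.bitcast` below a `vector.insert` can be bubbled up by bitcasting the insert's operands instead:
```mlir
// Before:
%ins = vector.insert %src, %dst [0] : vector<8xi4> into vector<2x8xi4>
%res = vector.bitcast %ins : vector<2x8xi4> to vector<2x4xi8>

// After (sketch):
%src_cast = vector.bitcast %src : vector<8xi4> to vector<4xi8>
%dst_cast = vector.bitcast %dst : vector<2x8xi4> to vector<2x4xi8>
%res      = vector.insert %src_cast, %dst_cast [0] : vector<4xi8> into vector<2x4xi8>
```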
Currently n-d transfer write distribution can be inconsistent with
distribution of reductions if a value has multiple users, one of which
is a transfer_write with a non-standard distribution map, and the other
of which is a vector.reduction.
We may want to consider removing the distribution map functionality in
the future for this reason.
This PR adds support for `arith.trunci` to vector narrow type emulation for iX -> i4 truncations, for X >= 8. For now, the pattern only works for 1-D vectors and is based on `vector.shuffle` ops. We would need `vector.deinterleave` to add n-D vector support.
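For reference, the kind of op now supported (shapes here are illustrative; 1-D only):
```mlir
// i8 -> i4 truncation on a 1-D vector:
%0 = arith.trunci %a : vector<16xi8> to vector<16xi4>
// Wider sources (X >= 8) are also handled, e.g. i32 -> i4:
%1 = arith.trunci %b : vector<16xi32> to vector<16xi4>
```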
It looks like the affine map generated to compute the indices of the
collapsed dimensions used the wrong dim size. For indices `[idx0][idx1]`
we computed the collapsed index as `idx0*size0 + idx1` instead of
`idx0*size1 + idx1`. This led to correctness issues in convolution tests
when enabling this transformation internally.
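As a concrete example (the shape is hypothetical), collapsing indices `[idx0][idx1]` of a 4x8 shape must use the inner dimension size as the stride:
```mlir
// Correct: the stride of idx0 is the inner dimension size (8).
affine_map<(d0, d1) -> (d0 * 8 + d1)>
// Previously generated (wrong): used the outer dimension size (4).
affine_map<(d0, d1) -> (d0 * 4 + d1)>
```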
This PR replaces the generation of `vector.shuffle` with
`vector.interleave` in the i4 conversions in vector narrow type
emulation. The multi dimensional semantics of `vector.interleave` allow
us to enable these conversion emulations also for multi dimensional
vectors.
This PR adds an optional bitwidth parameter to the vector xfer op
flattening transformation so that the flattening doesn't happen if the
trailing dimension of the read/written vector is wider (in bits) than this
bitwidth (i.e., we are already able to fill at least one vector register
with that size).
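For example, assuming a target bitwidth of 128 (the value and names below are illustrative), a transfer like the following would be left unflattened because its trailing dimension already spans 8 x 32 = 256 bits:
```mlir
// Trailing dimension provides 256 bits (> 128), so flattening is skipped.
%v = vector.transfer_read %mem[%c0, %c0], %pad {in_bounds = [true, true]}
       : memref<16x8xf32>, vector<16x8xf32>
```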
Common backends (LLVM, SPIR-V) only support 1D vectors; the LLVM conversion
handles ND vectors (N >= 2) as `array<array<... vector>>` and the SPIR-V
conversion doesn't handle them at all at the moment. Sometimes it's
preferable to treat multidim vectors as linearized 1D vectors. Add a pass
to do this. Only constants and simple elementwise ops are supported for now.
@krzysz00 I've extracted your result type conversion code from
LegalizeToF32 and moved it to a common place.
Also, add a ConversionPattern class operating on traits.
This is part of
66347e516e
The regression in downstream projects is about the transfer_read patterns,
which need more investigation. Add support for transfer_write for
now.
This PR adds patterns to convert a sub-byte vector transpose into a
sequence of instructions that perform the transpose on i8 vector
elements. While this rewrite may not lead to the absolute peak
performance, it should ensure correctness when dealing with sub-byte
transposes.
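A sketch of the rewrite (the shapes and the use of sign extension here are illustrative):
```mlir
// Before: transpose directly on sub-byte (i4) elements.
%t = vector.transpose %v, [1, 0] : vector<8x16xi4> to vector<16x8xi4>

// After: extend to i8, transpose, and truncate back.
%ext = arith.extsi %v : vector<8x16xi4> to vector<8x16xi8>
%tr  = vector.transpose %ext, [1, 0] : vector<8x16xi8> to vector<16x8xi8>
%t   = arith.trunci %tr : vector<16x8xi8> to vector<16x8xi4>
```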
This PR adds new patterns to improve the generated vector code for the emulation of any conversion that has to go through an i4 -> i8 type extension (only signed extensions are supported for now). This will impact any i4 -> i8/i16/i32/i64 signed extensions as well as sitofp i4 -> f8/f16/f32/f64.
The asm code generated for the supported cases is significantly better after this PR for both x86 and aarch64.
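The kinds of conversions affected, for reference (shapes are illustrative):
```mlir
// Signed extensions that internally go through i4 -> i8:
%0 = arith.extsi %a : vector<16xi4> to vector<16xi32>
// ... as well as signed int-to-float conversions:
%1 = arith.sitofp %b : vector<16xi4> to vector<16xf32>
```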
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
This commit fixes `Dialect/Vector/vector-rewrite-narrow-types.mlir` when
running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
```
within split at llvm-project/mlir/test/Dialect/Vector/vector-rewrite-narrow-types.mlir:1 offset :118:8: error: 'arith.trunci' op operand type 'vector<3xi16>' and result type 'vector<3xi16>' are cast incompatible
%1 = vector.bitcast %0 : vector<16xi3> to vector<3xi16>
^
within split at llvm-project/mlir/test/Dialect/Vector/vector-rewrite-narrow-types.mlir:1 offset :118:8: note: see current operation: %48 = "arith.trunci"(%47) : (vector<3xi16>) -> vector<3xi16>
LLVM ERROR: IR failed to verify after pattern application
```
If a rewrite pattern returns "failure", it must not have modified the
IR. This commit fixes
`Dialect/Vector/vector-contract-to-outerproduct-transforms-unsupported.mlir`
when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
```
* Pattern (anonymous namespace)::ContractionOpToOuterProductOpLowering : 'vector.contract -> ()' {
Trying to match "(anonymous namespace)::ContractionOpToOuterProductOpLowering"
** Insert : 'vector.transpose'(0x5625b3a8cb30)
** Insert : 'vector.transpose'(0x5625b3a8cbc0)
"(anonymous namespace)::ContractionOpToOuterProductOpLowering" result 0
} -> failure : pattern failed to match
} -> failure : pattern failed to match
LLVM ERROR: pattern returned failure but IR did change
```
Note: `vector-contract-to-outerproduct-transforms-unsupported.mlir` is
merged into `vector-contract-to-outerproduct-matvec-transforms.mlir`.
The `greedy pattern application failed` error is no longer produced.
This error indicates that the greedy pattern rewrite did not
converge; it does not mean that a pattern could not be applied.
Support distribution of `vector.transfer_read` ops when operands are
defined inside of the region of `warp_execute_on_lane_0` (except for the
buffer from which the op is reading).
Such IR was previously not supported. This commit changes the
implementation such that indices and the padding value are also
distributed.
This commit simplifies the implementation considerably: the original
implementation created a new `transfer_read` op and then checked if this
new op is valid. If not, the rewrite pattern failed. This was a bit
hacky. It was also a violation of the rewrite pattern API (detected by
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`) because the IR was modified,
but the pattern returned "failure".
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, the reduction kind names were picked to mostly mirror the
names of the llvm vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear whether `<maxf>` had `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
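For example, with the new names the semantics are unambiguous at a glance:
```mlir
// Maps to arith.maximumf (NaN-propagating) semantics:
%0 = vector.reduction <maximumf>, %v : vector<8xf32> into f32
// Maps to arith.maxnumf (NaN-ignoring) semantics:
%1 = vector.reduction <maxnumf>, %w : vector<8xf32> into f32
```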
Issue: https://github.com/llvm/llvm-project/issues/72354
The number of vector elements considered 'small' enough to extract is
parameterized.
This is to avoid going into the specialized reduction lowering when one
or two arith ops can do the job. Targets without dedicated reduction
intrinsics can use that as an emulation path too.
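A sketch of what the extract-based lowering looks like for a tiny reduction (the threshold and shapes here are illustrative):
```mlir
// Before:
%r = vector.reduction <add>, %v : vector<2xf32> into f32

// After: two extracts and a single arith op instead of a reduction intrinsic.
%e0 = vector.extract %v[0] : f32 from vector<2xf32>
%e1 = vector.extract %v[1] : f32 from vector<2xf32>
%r  = arith.addf %e0, %e1 : f32
```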
Depends on https://github.com/llvm/llvm-project/pull/75846.
For vectors with either a leading or a trailing unit dim, this replaces:
elementwise(a, b)
with:
sc_a = shape_cast(a)
sc_b = shape_cast(b)
res = elementwise(sc_a, sc_b)
return shape_cast(res)
The newly inserted shape_cast Ops drop the unit dim before the elementwise
Op and restore it afterwards. Vectors `a` and `b` are required to have
rank > 1.
Example:
```mlir
%mul = arith.mulf %B_row, %A_row : vector<1x[4]xf32>
%cast = vector.shape_cast %mul : vector<1x[4]xf32> to vector<[4]xf32>
```
gets converted to:
```mlir
%B_row_sc = vector.shape_cast %B_row : vector<1x[4]xf32> to vector<[4]xf32>
%A_row_sc = vector.shape_cast %A_row : vector<1x[4]xf32> to vector<[4]xf32>
%mul = arith.mulf %B_row_sc, %A_row_sc : vector<[4]xf32>
%mul_sc = vector.shape_cast %mul : vector<[4]xf32> to vector<1x[4]xf32>
%cast = vector.shape_cast %mul_sc : vector<1x[4]xf32> to vector<[4]xf32>
```
In practice, the bottom two shape_casts will be folded away.
Add a configuration option to allow vector distribution with multiple
elements written by a single lane.
This is so that we can perform vector multi-reduction with multiple
results per workgroup.