This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
Issue: https://github.com/llvm/llvm-project/issues/72354
The number of vector elements considered 'small' enough to extract is
parameterized.
This is to avoid going into specialized reduction lowering when a
single/couple of arith ops can do. Targets without dedicated reduction
intrinsics can use that as an emulation path too.
Depends on https://github.com/llvm/llvm-project/pull/75846.
For vectors with either leading or trailing unit dim, replaces:
elementwise(a, b)
with:
sc_a = shape_cast(a)
sc_b = shape_cast(b)
res = elementwise(sc_a, sc_b)
return shape_cast(res)
The newly inserted shape_cast Ops fold (before elementwise Op) and then
restore (after elementwise Op) the unit dim. Vectors `a` and `b` are
required to be rank > 1.
Example:
```mlir
%mul = arith.mulf %B_row, %A_row : vector<1x[4]xf32>
%cast = vector.shape_cast %mul : vector<1x[4]xf32> to vector<[4]xf32>
```
gets converted to:
```mlir
%B_row_sc = vector.shape_cast %B_row : vector<1x[4]xf32> to vector<[4]xf32>
%A_row_sc = vector.shape_cast %A_row : vector<1x[4]xf32> to vector<[4]xf32>
%mul = arith.mulf %B_row_sc, %A_row_sc : vector<[4]xf32>
%mul_sc = vector.shape_cast %mul : vector<[4]xf32> to vector<1x[4]xf32>
%cast = vector.shape_cast %mul_sc : vector<1x[4]xf32> to vector<[4]xf32>
```
In practice, the bottom 2 shape_cast(s) will be folded away.
Add a configuration option to allow vector distribution with multiple
elements written by a single lane.
This is so that we can perform vector multi-reduction with multiple
results per workgroup.
Updates patterns for flattening `vector.transfer_read` by relaxing the
requirement that the "collapsed" indices are all zero. This enables
collapsing cases like this one:
```mlir
%2 = vector.transfer_read %arg4[%c0, %arg0, %arg1, %c0] ... :
memref<1x43x4x6xi32>, vector<1x2x6xi32>
```
Previously only the following case would be consider for collapsing
(all indices are 0):
```mlir
%2 = vector.transfer_read %arg4[%c0, %c0, %c0, %c0] ... :
memref<1x43x4x6xi32>, vector<1x2x6xi32>
```
Also adds some new comments and renames the `firstContiguousInnerDim`
parameter as `firstDimToCollapse` (the latter better matches the actual
meaning).
Similar updates for `vector.transfer_write` will be implemented in a
follow-up patch.
Following the discussion here:
* https://github.com/llvm/llvm-project/pull/72105
this patch makes the `TransposeOpLowering` configurable so that one can select
whether to favour `vector.shape_cast` over `vector.transpose`.
As per the discussion in #72105, using `vector.shape_cast` is very beneficial
and desirable when targeting `LLVM IR` (CPU lowering), but won't work when
targeting `SPIR-V` today (GPU lowering). Hence the need for a mechanism to be
able to disable/enable the pattern introduced in #72105. This patch proposes one
such mechanism.
While this should solve the problem that we are facing today, it's understood to
be a temporary workaround. It should be removed once support for lowering
`vector.shape_cast` to SPIR-V is added. Also, (once implemented) the following
proposal might make this workaround redundant:
* https://discourse.llvm.org/t/improving-handling-of-unit-dimensions-in-the-vector-dialect/
Updates "flatten vector" patterns to support more cases, namely Ops that
read/write vectors with leading unit dims. For example:
```mlir
%0 = vector.transfer_read %arg0[%c0, %c0, %c0, %c0] ... :
memref<5x4x3x2xi8, strided<[24, 6, 2, 1], offset: ?>>, vector<1x1x2x2xi8>
```
Currently, the `vector.transfer_read` above would not be flattened. With
this
change, it will be rewritten as follows:
```mlir
%collapse_shape = memref.collapse_shape %arg0 [[0, 1, 2, 3]] :
memref<5x4x3x2xi8, strided<[24, 6, 2, 1], offset: ?>>
into memref<120xi8, strided<[1], offset: ?>>
%0 = vector.transfer_read %collapse_shape[%c0] ... :
memref<120xi8, strided<[1], offset: ?>>, vector<4xi8>
%1 = vector.shape_cast %0 : vector<4xi8> to vector<1x1x2x2xi8>
```
`hasMatchingInnerContigousShape` is generalised and renamed as
`isContiguousSlice` to better match the updated functionality. A few
test names are updated to better highlight what case is being exercised.
The idea is similar to vector.maskedload + vector.store emulation. What
the emulation does is:
1. Get a compressed mask and load the data from destination.
2. Bitcast the data to original vector type.
3. Select values between `op.valueToStore` and the data from load using
original mask.
4. Bitcast the new value and store it to destination using compressed
masked.
Apart from the test itself, this patch also updates a few patterns to
fix how new VectorType(s) are created. Namely, it makes sure that
"scalability" is correctly propagated.
Regression tests will be updated seperately while auditing Vector
dialect tests in the context of scalable vectors:
* https://github.com/orgs/llvm/projects/23
The primary difficulty with distribution of masked transfers is when the
permutation map permutes the vector, in which case the distribution
logic needs to make sure the correct mask elements end up with the
distributed transfer. This is only tricky when the permutation map has a
permutation in it, so we can relax the condition for distribution.
Previously the pattern only worked when the permutation map was a minor
identity. Infer the new mask type from the new transfer map after
dropping leading unit dims.
Chained reductions get created during vector unrolling. These patterns
simplify them into a series of adds followed by a final reductions.
This is preferred on GPU targets like SPIR-V/Vulkan where vector
reduction gets lowered into subgroup operations that are generally more
expensive than simple vector additions.
For now, only the `add` combining kind is handled.
* Declare arguments/results with `let` statements.
* Rename `transp` to `permutation`.
* Change type of `transp` from `I64ArrayAttr` to `DenseI64ArrayAttr`
(provides direct access to `ArrayRef<int64_t>` instead of `ArrayAttr`).
This patch extends TransferReadDropUnitDimsPattern to support dropping
unit dims from partially-static memrefs, for example:
%v = vector.transfer_read %base[%c0, %c0], %pad {in_bounds = [true, true]} :
memref<?x1xi8, strided<[?, ?], offset: ?>>, vector<[16]x1xi8>
Is rewritten as:
%dim0 = memref.dim %base, %c0 : memref<?x1xi8, strided<[?, ?], offset: ?>>
%subview = memref.subview %base[0, 0] [%dim0, 1] [1, 1] :
memref<?x1xi8, strided<[?, ?], offset: ?>> to memref<?xi8, #map1>
%v = vector.transfer_read %subview[%c0], %pad {in_bounds = [true]}
: memref<?xi8, #map1>, vector<[16]xi8>
Scalable vectors are now also supported, the scalable dims were being
dropped when creating the rank-reduced vector type. The xfer op can also
have a mask of type 'vector.create_mask', which gets rewritten as long
as the mask of the unit dim is a constant of 1.
- Better documentation.
- Rename interface methods: `source` -> `getSource`, `indices` ->
`getIndices`, etc. to conform with MLIR naming conventions. A default
implementation is not needed.
- Turn many interface methods into helper functions. Most of the
previous interface methods were not meant to be overridden, and if some
were overridden without others, the op would be have been broken.
This patch extends the vector.transpose lowering to replace:
vector.transpose %0, [1, 0] : vector<nx1x<eltty>> to vector<1xnx<eltty>>
with:
vector.shape_cast %0 : vector<nx1x<eltty>> to vector<1xnx<eltty>>
Source with leading unit-dim (inverse) is also replaced. Unit dim must
be fixed. Non-unit dim can be scalable.
A check is also added to bail out for scalable vectors before unrolling.
This patch adds handling of an empty `MaskOp` to `MaskOpRewritePattern`
and thereby fixes a crash.
It also pulls the `MaskOp` canonicalization patterns into
`LowerVectorMask` so that empty `MaskOp`s are folded away in the Pass.
Fix https://github.com/llvm/llvm-project/issues/71036
A number of the warp distribution patterns work by rewriting a warp op
in place by moving a contained op outside. This notifies the rewriter
that the warp op is changing in this case.
The `stride == 1` does not imply that we can drop it. Because it could
load more than 1 elements. We should also take source sizes and vector
sizes into account. Otherwise it generates invalid IRs. E.g.,
```mlir
func.func @foo(%arg0: memref<1x1xf32>) -> vector<4x8xf32> {
%c0 = arith.constant 0 : index
%cst = arith.constant 0.000000e+00 : f32
%0 = vector.transfer_read %arg0[%c0, %c0], %cst : memref<1x1xf32>, vector<4x8xf32>
return %0 : vector<4x8xf32>
}
```
Fixes https://github.com/openxla/iree/issues/15493
This is the last step needed for basic support for distributing masked
vector code. The lane id gets delinearized based on the distributed mask
shape and then compared against the original mask sizes to compute the
bounds for the distributed mask. Note that the distribution of masks is
implicit on the shape specified by the warp op. As a result, it is the
responsibility of the consumer of the mask to ensure the distributed
mask will match its own distribution semantics.
Currently when there is a mix of transfer read ops and transfer write
ops that need to be distributed, because the pattern for write
distribution is rooted on the transfer write, it is hard to guarantee
that the write gets distributed after the read when the two aren't
directly connected by SSA. This is likely still relatively unsafe when
there are undistributable ops, but structurally these patterns are a bit
difficult to work with. For now pattern benefits give fairly good
guarantees for happy paths.
This fixes two bugs:
1) When deciding whether a transfer read could be propagated out of
a warp op, it looked for the first yield operand that was produced by
a transfer read. If this transfer read wasn't ready to be
distributed, the pattern would not re-check for any other transfer
reads that could have been propagated.
2) When dropping dead warp results, we do so by updating the warp op
signature and splicing in the old region. This does not add the ops
in the body of the warp op back to the pattern applicator's worklist,
and thus those operations won't be DCE'd. This is a problem for
patterns like the one for transfer reads that will still see the dead
operation as a user.
Because the distribution is based on types, supporting general masked
reads requires first materializing the permutation map in IR to align
the elements of the mask with the elements read by the transfer op. For
now just support cases with the trivial permutation map.
General distribution of masked writes requires materializing the permutation on the vector of the write in IR to ensure the vector lines up with the mask. For now just support cases with trivial permutation maps.
This handles `vector.transfer_read`, `vector.transfer_write`, and
`vector.constant_mask`. The unit dims are only relevant for masks
created by `create_mask` and `constant_mask` if the mask size for the
unit dim is non-one, in which case all subsequent sizes must also be
zero. From the perspective of the vector transfers, however, these unit
dims can just be dropped directly.
After propagation of `vector.warp_execute_on_lane_0` through `scf.for`,
uniform operations like those on the loop iterators can now be hoisted
out of the inner warp op.
- Implement `SubsetOpInterface`, `SubsetExtractionOpInterface`,
`SubsetInsertionOpInterface` for `vector.transfer_read` and
`vector.transfer_write`.
- Move all tensor subset hoisting test cases from `Linalg` to
`loop-invariant-subset-hoisting.mlir`. (Removing 1 duplicate test case.)
Update the remaining tests for matrix multiplication (_matmul_) in:
* vector-contract-to-outerproduct-transforms.mlir
with cases for scalable vectors.
Note that in order for the "vector.contract -> vector.outerproduct"
patterns to work, only the non-reduction dimension can be scalable (*).
For Matmul operations that is set to be the N dimension (i.e. rows of
the output matrix), which matches how matrix multiplication are normally
implemented for e.g. Arm's SVE. However, making the M dimension scalable
(i.e. columns of the output matrix) should work as well.
Making both parellel dimensions scalable is left as a TODO for when
support for 2-D scalable vectors is more established (this is
work-in-progress as part of the effort to support Arm's SME in MLIR).
The change in:
* `UnrolledOuterProductGenerator`
is a "bug fix" to make sure that the conversion pattern correctly
propagates scalability when creating `arith.extf` operations.
(*) The conversion tested in this file unrolls along the reduction
dimension, which is not supported for scalable vectors.
Recent changes (https://github.com/llvm/llvm-project/pull/66930)
disabled vector transfer ops hoisting with view-like intermediate ops.
The recommended way is to fold subview ops into transfer op indices
before invoking hoisting. That would mean now we see transfer op indices
involving dynamic values, instead of static constant values before with
subview ops. Therefore hoisting won't kick in anymore. This breaks
downstream users.
To fix it, this commit enables hoisting transfer ops with dynamic
indices by using `ValueBoundsConstraintSet` to prove ranges are disjoint
in `isDisjointTransferIndices`. Given that utility is used in many
places including op folders, right now we introduce a flag to it and
only set as true for "heavy" transforms in hoisting and load-store
forwarding.
This patch constrains the patterns for converting `vector.contract` to
`vector.outerproduct` so that
* the reduction dimension is _not unrolled_ if the corresponding
dimension is scalable.
This is necessary as the current lowering is incorrect for scalable
dims. Indeed, the following unrolling for `vector.contract` would be
invalid if the corresponding dimension was scalable (K is the size of
the reduction dimension):
```
// K times. This is valid if K _is not_ scalable.
%lhs = vector.extract %LHS[0]
%rhs = vector.extract %RHS[0]
vector.outerproduct %lhs, %rhs
%lhs = vector.extract %LHS[1]
%rhs = vector.extract %RHS[1]
vector.outerproduct %lhs, %rhs
// ...
```
Instead, a `for` loop should be generated:
```
// This would be valid regardless of whether K is scalable or not
scf.for %k = 0 to K step 1
%lhs = vector.extract LHS[%k]
%rhs = vector.extract RHS[%k]
vector.outerproduct %lhs, %rhs
```
However, the lowering of:
* `vector.extract` of vector slices with dynamic indices
is incomplete and hence the implementation proposed above (with
`scf.for`) wouldn't work just yet, i.e. it wouldn't be possible to lower
it further. Instead, this patch disables unrolling in cases when the
reduction dimension is scalable, i.e. where the generated code would be
functionally incorrect.
In order to document unsupported cases, a dedicated test file is added:
* "vector-contract-to-outerproduct-transforms-unsupported.mlir"
This is the first patch in a series of patches that strives to update
these patterns (and to test them) for scalable vectors.
Resolves#68400
Also fix an issue with sink broadcast across elementwise where
`arith.cmpf` is elementwise, but result type is different. The result
type is not same as the operand type, creating illegal IR.
Similar issue with `vector.fma` which only accepts vector operand types,
while broadcasts can have scalar sources. Sinking broadcast across would
result in an illegal `vector.fma` (with scalar operands).
The following patterns
- TransferReadToVectorLoadLowering
- TransferWriteToVectorStoreLowering
attempt to generate invalid vector.maskedload and vector.maskedstore ops
for non rank-1 vector types. These ops operate on 1-D vectors. This
patch adds a check to prevent this.
The vector.extract assembly format currently only contains the source
type, for example:
%1 = vector.extract %0[1] : vector<3x7x8xf32>
it's not immediately obvious if this is the source or result type. This
patch improves the assembly format to make this clearer, so the above
becomes:
%1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>