Currently, `GeneralizePadOpPattern` is grouped under
`populatePadOpVectorizationPatterns`. However, as noted in #111349, this
transformation "decomposes" rather than "vectorizes" `tensor.pad`. As
such, it functions as:
* a vectorization _pre-processing_ transformation, not
* a vectorization transformation itself.
To clarify its purpose, this PR turns `GeneralizePadOpPattern` into a
standalone transformation by:
* introducing a dedicated `populateDecomposePadPatterns` method,
* adding an `apply_patterns.linalg.decompose_pad` Transform Dialect Op,
* removing it from `populatePadOpVectorizationPatterns`.
In addition, to better reflect its role, it is renamed as "decomposition"
rather than "generalization". This is in line with the recent renaming of
similar Ops, i.e. the tensor.pack/tensor.unpack Ops in #116439.
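For illustration, the new Transform Dialect Op could be applied roughly as
follows (a sketch; the surrounding transform sequence and the `%module`/`%func`
handles are assumptions):
```mlir
// Match the payload function and apply the pad-decomposition patterns to it.
%func = transform.structured.match ops{["func.func"]} in %module
  : (!transform.any_op) -> !transform.any_op
transform.apply_patterns to %func {
  transform.apply_patterns.linalg.decompose_pad
} : !transform.any_op
```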
The earlier PR (https://github.com/llvm/llvm-project/pull/104783), which
introduced transpose and broadcast semantics to linalg.matmul, was reverted
due to two failing OpDSL tests for linalg.matmul.
Since linalg.matmul is now defined using TableGen ODS instead of Python-based
OpDSL, these tests started failing and need to be removed/updated.
This commit removes/updates the failing obsolete tests in the files below. All
other files were part of the earlier PR and are simply cherry-picked.
"mlir/test/python/integration/dialects/linalg/opsrun.py"
"mlir/test/python/integration/dialects/transform.py"
---------
Co-authored-by: Renato Golin <rengolin@systemcall.eu>
At the moment, `GenericPadOpVectorizationPattern` implements two
orthogonal transformations:
1. Rewrites `tensor::PadOp` into a sequence of `tensor::EmptyOp`,
`linalg::FillOp` and `tensor::InsertSliceOp`.
2. Vectorizes (where possible) `tensor::InsertSliceOp` (see
`tryVectorizeCopy`).
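For reference, transformation (1) rewrites a pad roughly as follows (an
illustrative sketch; the shapes and the padding value `%pad_val` are made up):
```mlir
// Input: pad a 6x5 tensor to 8x8 with a constant value.
%padded = tensor.pad %src low[0, 0] high[2, 3] {
^bb0(%i: index, %j: index):
  tensor.yield %pad_val : f32
} : tensor<6x5xf32> to tensor<8x8xf32>

// After the rewrite: create the destination, fill it with the padding
// value, then insert the original data.
%empty = tensor.empty() : tensor<8x8xf32>
%fill = linalg.fill ins(%pad_val : f32) outs(%empty : tensor<8x8xf32>) -> tensor<8x8xf32>
%res = tensor.insert_slice %src into %fill[0, 0] [6, 5] [1, 1]
  : tensor<6x5xf32> into tensor<8x8xf32>
```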
This patch splits `GenericPadOpVectorizationPattern` into two separate
patterns:
1. `GeneralizePadOpPattern` for the first transformation (note that
currently `GenericPadOpVectorizationPattern` inherits from
`GeneralizePadOpPattern`).
2. `InsertSliceVectorizePattern` to vectorize `tensor::InsertSliceOp`.
With this change, we gain the following:
* a clear separation between pre-processing and vectorization
transformations/stages,
* a path to support masked vectorisation for `tensor.insert_slice`
(with a dedicated pattern for vectorization, it is much easier to
specify the input vector sizes used in masking),
* more opportunities to vectorize `tensor.insert_slice`.
Note for downstream users:
--------------------------
If you were using `populatePadOpVectorizationPatterns`, following this
change you will also have to add
`populateInsertSliceVectorizationPatterns`.
Finer implementation details:
-----------------------------
1. The majority of changes in this patch are copy & paste + some edits.
1.1. The only functional change is that the vectorization of
`tensor.insert_slice` is now broadly available (as opposed to being
constrained to the pad vectorization pattern:
`GenericPadOpVectorizationPattern`).
1.2. Following on from the above, `@pad_and_insert_slice_dest` is
updated. As expected, the input `tensor.insert_slice` Op is no
longer "preserved" and instead gets vectorized successfully.
2. The `linalg.fill` case in `getConstantPadVal` works under the
assumption that only _scalar_ source values can be used. That's
consistent with the definition of the Op, but it's not tested at the
moment. Hence a test case in Linalg/invalid.mlir is added.
3. The behaviour of the two TD vectorization Ops,
`transform.structured.vectorize_children_and_apply_patterns` and
`transform.structured.vectorize` is preserved.
This PR simply wraps `populatePadOpVectorizationPatterns` into a new
Transform Dialect Op: `apply_patterns.linalg.pad_vectorization`.
This change makes it possible to run (and test) the corresponding
patterns _without_:
`transform.structured.vectorize_children_and_apply_patterns`.
Note that the Op above only supports non-masked vectorisation (i.e. when
the inputs are static), so, effectively, only fixed-width vectorisation
(as opposed to scalable vectorisation). As such, this change is required
to construct vectorization pipelines for tensor.pad targeting scalable
vectors.
To test the new Op and the corresponding patterns, I added
"vectorization-pad-patterns.mlir" - most tests have been extracted from
"vectorization-with-patterns.mlir".
Fixes a loop comparison condition in the vectorizer.
As that logic is used specifically for vectorising `tensor.extract`, I
also added a test that violates the assumptions made inside
`getTrailingNonUnitLoopDimIdx`, namely that Linalg loops are non-empty.
The vectorizer pre-conditions will capture that much earlier, making sure
that `getTrailingNonUnitLoopDimIdx` is only run when all the assumptions
are actually met.
Thank you for pointing this out, @pfusik !
We use the term "masking map" throughout the Linalg vectorization logic,
but we don't really define what it is and how it differs from Linalg
indexing maps. This PR clarifies the differences, makes sure that the new
terminology is used consistently, and improves code re-use.
I noticed that several assertions in the MLIR codebase have issues with
operator precedence.
The problem is due to the way logical operators are evaluated. The `&&`
operator has higher precedence than the `||` operator, which means the
assertion is currently evaluated like this:
```
assert((resType.getNumDynamicDims() == dynOutDims.size()) ||
       (dynOutDims.empty() && "Either none or all output dynamic dims must be specified!"));
```
We should add parentheses around the entire expression involving
`dynOutDims.empty()` to ensure that the logical conditions are grouped
correctly. Here’s the corrected version:
```
assert(((resType.getNumDynamicDims() == dynOutDims.size()) || dynOutDims.empty()) &&
       "Either none or all output dynamic dims must be specified!");
```
The main goal of this patch is to extend the semantics of the 'linalg.matmul'
named op to include per-operand transpose semantics, while also laying out a
way to move op definitions from OpDSL to TableGen. Hence, it is implemented in
TableGen. The transpose semantics are as follows.
By default, the 'linalg.matmul' behaviour remains as is. Transpose semantics
can be applied per input operand by specifying the optional permutation
attributes (namely 'permutationA' for the 1st input and 'permutationB' for the
2nd input) for each operand explicitly as needed. By default, no transpose is
mandated for any of the input operands.
Example:
```
%val = linalg.matmul ins(%arg0, %arg1 : memref<5x3xf32>, memref<5x7xf32>)
                     outs(%arg2 : memref<3x7xf32>)
                     permutationA = [1, 0]
                     permutationB = [0, 1]
```
The newly added hook simply returns `false` for Ops for which there's no
"vectorization logic" in the Linalg Vectorizer (i.e. the `vectorize()`
method). It's added so that the following two TD Ops expose an identical
level of functionality (that's not the case ATM):
* `transform.structured.vectorize_children_and_apply_patterns`
* `transform.structured.vectorize`
Specifically, ATM, the former works only for Linalg Ops, while the
latter works for all Ops that the vectorizer supports (*). With this
change, I am making sure that both TD Ops behave consistently.
Note, this shouldn't affect any of the current uses of the vectorizer.
(*) This is implemented via the `vectorize()` method in
Vectorization.cpp.
NOTE: This is a follow-up for #97049 in which the `in_bounds` attribute
was made mandatory.
This PR updates the semantics of the `in_bounds` attribute so that
broadcast dimensions are no longer required to be "in bounds".
Specifically, these xfer_read/xfer_write Ops become valid after this
change:
```mlir
%read = vector.transfer_read %A[%base1, %base2], %pad
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : memref<?x?xf32>, vector<9xf32>

vector.transfer_write %vec, %A[%base1, %base2]
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : vector<9xf32>, memref<?x?xf32>
```
Note that the value `false` merely means "may run out-of-bounds", i.e.,
the corresponding access can still be "in bounds". In fact, the folder
for xfer Ops is also updated (*) and will update the attribute value
corresponding to broadcast dims to `true` if all non-broadcast dims
are marked as "in bounds".
Note that this PR doesn't change any of the lowerings. The changes in
"SuperVectorize.cpp", "Vectorization.cpp" and "AffineMap.cpp" are simple
reverts of recent changes in #97049. Those were only meant to facilitate
making `in_bounds` mandatory and to work around the extra requirements
for broadcast dims (those requirements were removed in this PR). All
changes in tests are also reverts of changes from #97049.
For context, here's a PR in which "broadcast" dims were forced to
always be "in-bounds":
* https://reviews.llvm.org/D102566
(*) See `foldTransferInBoundsAttribute`.
This PR fixes how broadcast dims (identified as "zero" results in
permutation maps) corresponding to a reduction iterator are vectorised
in the case of generic Ops. Here's an example:
```mlir
#map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>
func.func @generic_with_reduction_and_broadcast(%arg0: tensor<1x12x197x197xf32>) -> (tensor<1x12x197x1xf32>) {
%0 = tensor.empty() : tensor<1x12x197x1xf32>
%1 = linalg.generic {indexing_maps = [#map, #map1],
iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
ins(%arg0 : tensor<1x12x197x197xf32>)
outs(%0 : tensor<1x12x197x1xf32>) {
^bb0(%in: f32, %out: f32):
%818 = arith.addf %in, %out : f32
linalg.yield %818 : f32
} -> tensor<1x12x197x1xf32>
return %1 : tensor<1x12x197x1xf32>
}
```
This is a perfectly valid Generic Op, but currently triggers two issues
in the vectoriser. The root cause is this map:
```mlir
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>
```
This map triggers an assert in `reindexIndexingMap` - this hook
incorrectly assumes that every result in the input map is a `dim`
expression and that there are no constants. That's not the case in this
example. `reindexIndexingMap` is extended to allow maps like the one
above. For now, only constant "zero" results are allowed. This can be
extended in the future once a good motivating example is available.
Separately, the permutation map highlighted above "breaks" mask
calculation (ATM masks are always computed, even in the presence of
static shapes). When applying the following permutation:
```mlir
(d0, d1, d2, d3) -> (d0, d1, d2, 0)
```
to these canonical shapes (corresponding to the example above):
```
(1, 12, 197, 197)
```
we end up with the following error:
```bash
error: vector types must have positive constant sizes but got 1, 12, 197, 0
```
The error makes sense and indicates that we should update the
permutation map above to:
```
(d0, d1, d2, d3) -> (d0, d1, d2)
```
This would correctly give the following vector type:
```
vector<1x12x197xi1>
```
Fixes #97247
Normally, convolutions present with the following linalg op region:
```
^bb0(%arg14: i4, %arg15: i4, %arg16: i4):
%17 = arith.muli %arg14, %arg15 : i4
%18 = arith.addi %arg16, %17 : i4
linalg.yield %18 : i4
```
However, for i1, due to strength reduction, we get something like:
```
^bb0(%arg14: i1, %arg15: i1, %arg16: i1):
%17 = arith.andi %arg14, %arg15 : i1
%18 = arith.ori %arg16, %17 : i1
linalg.yield %18 : i1
```
This PR updates the logic to support this region for i1 types.
This PR fixes a bug in `isLoopInvariantIdx`. It makes sure that the
following case is vectorised as `vector.gather` (as opposed to
attempting a contiguous load):
```mlir
func.func @index_from_output_column_vector_gather_load(%src: tensor<8x128xf32>) -> tensor<8x1xf32> {
%c0 = arith.constant 0 : index
%0 = tensor.empty() : tensor<8x1xf32>
%res = linalg.generic {
indexing_maps = [#map],
iterator_types = ["parallel", "parallel"]
} outs(%0 : tensor<8x1xf32>) {
^bb0(%arg1: f32):
%1 = linalg.index 0 : index
%extracted = tensor.extract %src[%1, %c0] : tensor<8x128xf32>
linalg.yield %extracted : f32
} -> tensor<8x1xf32>
return %res : tensor<8x1xf32>
}
```
Specifically, when looking for loop-invariant indices in
`tensor.extract` Ops, any `linalg.index` Op that's used in address
calculation should only access loop dims that are equal to 1. In the example
above, the following does not meet that criterion:
```mlir
%1 = linalg.index 0 : index
```
Note that this PR also effectively addresses the issue fixed in #107922,
i.e. exercised by:
* `@vectorize_nd_tensor_extract_load_1d_column_vector_using_gather_load`
`getNonUnitLoopDim` introduced in #107922 is still valid though. In
fact, it is required to identify that the following case is a contiguous
load:
```mlir
func.func @index_from_output_column_vector_contiguous_load(%src: tensor<8x128xf32>) -> tensor<8x1xf32> {
%c0 = arith.constant 0 : index
%0 = tensor.empty() : tensor<8x1xf32>
%res = linalg.generic {
indexing_maps = [#map],
iterator_types = ["parallel", "parallel"]
} outs(%0 : tensor<8x1xf32>) {
^bb0(%arg1: f32):
%1 = linalg.index 0 : index
%extracted = tensor.extract %src[%c0, %1] : tensor<8x128xf32>
linalg.yield %extracted : f32
} -> tensor<8x1xf32>
return %res : tensor<8x1xf32>
}
```
Some logic is still missing to lower the above to
`vector.transfer_read`, so it is conservatively lowered to
`vector.gather` instead (see TODO in
`getTensorExtractMemoryAccessPattern`).
There are a few additional changes:
* `getNonUnitLoopDim` is simplified and renamed as
`getTrailingNonUnitLoopDimIdx`; additional comments are added (note
that the functionality didn't change);
* extra comments in a few places; variable names in comments are updated to
use Markdown (which is the preferred approach in MLIR).
This is a follow-on for:
* https://github.com/llvm/llvm-project/pull/107922
* https://github.com/llvm/llvm-project/pull/102321
This patch fixes:
```
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:821:12: error:
variable 'countNonUnitDim' set but not used [-Werror,-Wunused-but-set-variable]
```
In https://github.com/llvm/llvm-project/pull/102321 we relaxed the
vectorizer so that, when checking for contiguous loads, we don't always
require a trailing non-unit dim. For example, in the test case added we have
`tensor<8x1xf32>`, which is now a valid candidate for a contiguous load.
However, the logic that checks for contiguous loads assumed that only the
trailing dim can be non-unit, so this PR updates that logic to find the
actual non-unit dim.
This PR removes the assumption that reading from a dynamic tensor is
always a gather load:
```mlir
%extracted = tensor.extract %src[%c79, %3] : tensor<?x?xf32>
```
That assumption was originally introduced to simplify the implementation
and to reduce the number of cases to consider. Now that the
vectorisation of `tensor.extract` has been around for > 1 year and has
been quite stable, we can safely relax it.
This is a relatively small change - rather than using the parent linalg
Op to infer the target output shape (not possible with dynamic shapes),
the vectorizer will use the (previously constructed) output vector
shape instead.
As expected, the following test required updating (`vector.gather` ->
`vector.transfer_read`):
* @masked_dynamic_vectorize_nd_tensor_extract_with_affine_apply_contiguous
Similar test for scalable vectors is also added.
[This reverts commit 6662523d6b2ca0198141c94ee80ebbb41601df9f]
Simplifies the vectorization of tensor.extract so that:
* all cases that read into a genuinely multi-dim vector (*) are
considered a gather load,
* all other cases are considered as potential contiguous loads.
This change means that the following extraction from a "column" tensor
is correctly identified as a scalar load followed by a broadcast (rather
than a gather load).
```mlir
func.func @vectorize_scalar_broadcast_column_tensor(%in: tensor<1x1x4xi32>) -> tensor<1x1x4xi32> {
%c4 = arith.constant 4 : index
%c0 = arith.constant 0 : index
%cst = arith.constant dense<[...]> : tensor<15x1xi32>
%out = linalg.generic {
indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
iterator_types = ["parallel", "parallel", "parallel"]}
outs(%in : tensor<1x1x4xi32>) {
^bb0(%out: i32):
%8 = linalg.index 0 : index
%idx_0 = linalg.index 0 : index
%extracted = tensor.extract %cst[%idx_0, %c0] : tensor<15x1xi32>
linalg.yield %extracted : i32
} -> tensor<1x1x4xi32>
return %out : tensor<1x1x4xi32>
}
```
Overview of the delta compared to the original submission (#99299):
* removed an assert representing a condition that is being relaxed
here,
* added a test (reading from a column tensor) based on a repro from
@hanhanW.
(*) `vector<1x4x1xf32>` is considered as 1D vector in this context.
Simplifies the vectorization of tensor.extract so that:
* all cases that read into a genuinely multi-dim vector (*) are
considered a gather load,
* all other cases are considered as potential contiguous loads.
This change means that the following extraction from a "column" tensor
will be correctly identified as a scalar load followed by a broadcast (rather
than a gather load).
```mlir
func.func @vectorize_scalar_broadcast_column_tensor(%in: tensor<1x1x4xi32>) -> tensor<1x1x4xi32> {
%c4 = arith.constant 4 : index
%c0 = arith.constant 0 : index
%cst = arith.constant dense<[...]> : tensor<15x1xi32>
%out = linalg.generic {
indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
iterator_types = ["parallel", "parallel", "parallel"]}
outs(%in : tensor<1x1x4xi32>) {
^bb0(%out: i32):
%idx_0 = linalg.index 0 : index
%extracted = tensor.extract %cst[%idx_0, %c0] : tensor<15x1xi32>
linalg.yield %extracted : i32
} -> tensor<1x1x4xi32>
return %out : tensor<1x1x4xi32>
}
```
(*) `vector<1x4x1xf32>` is considered as 1D vector in this context.
This PR fixes one very specific aspect of vectorising `tensor.extract`
Ops when targeting scalable vectors. Namely, it makes sure that the
scalable flag is correctly propagated when creating
`vector::ShapeCastOp`.
BEFORE:
```mlir
vector.shape_cast %idx_vec : vector<1x1x[4]xindex> to vector<4xindex>
```
AFTER:
```mlir
vector.shape_cast %idx_vec : vector<1x1x[4]xindex> to vector<[4]xindex>
```
This particular ShapeCastOp is created when generating an index for
`vector.transfer_read` operations. Strictly speaking, casting is not
really required. However, it makes the subsequent address calculation
much simpler (*).
The following test is updated to demonstrate the use of
`vector.shape_cast` by the vectoriser:
* @masked_static_vectorize_nd_tensor_extract_with_affine_apply_contiguous
Similar test with scalable vectors is also added.
(*) At this point in the vectoriser it is known that all leading dims in
the index vector are "1".
Allow scalable vectorization of linalg::reduce and of linalg::generic Ops
that have reduction iterator(s), with two restrictions:
1. The reduction dim is the last (innermost) dim of the op; and
2. Only the reduction dim is requested for scalable vectorization.
One exception: scalable vectorization of the reduction dim in Matmul-like
ops is not supported even when the above restrictions are met.
Allowed combinations of scalable flags and iterator types:
Matmul:
  Iterators:      ["parallel", "parallel", "reduction"]
  Scalable Flags: ["true", "true", "false"]
                  ["false", "true", "false"]
Matvec:
  Iterators:      ["parallel", "reduction"]
  Scalable Flags: ["false", "true"]
                  ["true", "false"]
Updates `vectorizeScalableVectorPrecondition` so that scalable
vectorisation is only applied in well-understood and tested scenarios.
It's unlikely that we would ever want an arbitrary dimension to be
scalable. While the Linalg vectoriser should be flexible enough to
handle all possibilities:
* in more "exotic" cases, we are likely to struggle with lowerings
further down the compilation stack,
* it would be impractical given the limitations of LLVM (which usually
reflect the limitations of actual hardware) - e.g. no support for
"scalable" arrays of scalable or fixed width vectors (*).
Ultimately, the goal of this patch is to better document what's
currently supported. While this PR adds some new restrictions, no
existing tests are affected.
(*) At MLIR vector level that would correspond to e.g.
`vector<[4]x8xf32>`.
At the moment, the in_bounds attribute has two confusing/contradicting
properties:
1. It is both optional _and_ has an effective default-value.
2. The default value is "out-of-bounds" for non-broadcast dims, and
"in-bounds" for broadcast dims.
(see the `isDimInBounds` vector interface method for an example of this
"default" behaviour [1]).
This PR aims to clarify the logic surrounding the `in_bounds` attribute
by:
* making the attribute mandatory (i.e. it is always present),
* always setting the default value to "out of bounds" (that's
consistent with the current behaviour for the most common cases).
#### Broadcast dimensions in tests
As per [2], broadcast dimensions require the corresponding
`in_bounds` attribute to be `true`:
```
vector.transfer_read op requires broadcast dimensions to be in-bounds
```
The changes in this PR mean that we can no longer rely on the
default value in cases like the following (dim 0 is a broadcast dim):
```mlir
%read = vector.transfer_read %A[%base1, %base2], %f, %mask
{permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
memref<?x?xf32>, vector<4x9xf32>
```
Instead, the broadcast dimension has to be explicitly marked as "in
bounds":
```mlir
%read = vector.transfer_read %A[%base1, %base2], %f, %mask
{in_bounds = [true, false], permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
memref<?x?xf32>, vector<4x9xf32>
```
All tests with broadcast dims are updated accordingly.
#### Changes in "SuperVectorize.cpp" and "Vectorization.cpp"
The following patterns in "Vectorization.cpp" are updated to explicitly
set the `in_bounds` attribute to `false`:
* `LinalgCopyVTRForwardingPattern` and `LinalgCopyVTWForwardingPattern`
Also, `vectorizeAffineLoad` (from "SuperVectorize.cpp") and
`vectorizeAsLinalgGeneric` (from "Vectorization.cpp") are updated to
make sure that xfer Ops created by these hooks set the dimension
corresponding to broadcast dims as "in bounds". Otherwise, the Op
verifier would complain.
Note that there is no mechanism to verify whether the corresponding
memory accesses are indeed in bounds. Still, this is consistent with the
current behaviour where the broadcast dim would be implicitly assumed
to be "in bounds".
[1]
4145ad2bac/mlir/include/mlir/Interfaces/VectorInterfaces.td (L243-L246)
[2]
https://mlir.llvm.org/docs/Dialects/Vector/#vectortransfer_read-vectortransferreadop
The vectorization of linalg.index operations doesn't support scalable
vectors when computing the index vector. This patch fixes this with the
vector.step operation.
Depends on #96776
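A minimal sketch of the difference (assuming the fixed-width path materialises
the index vector as a constant, which is not possible for scalable vectors):
```mlir
// Fixed-width index vector - can be materialised as a constant.
%idx = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>
// Scalable index vector - a constant is not possible, so vector.step is used.
%idx_scalable = vector.step : vector<[4]xindex>
```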
If this is not set, the fact that the dynamic channel dim is in-bounds
cannot be inferred automatically (as it can be for static sizes), which
eventually leads to it being marked as out-of-bounds (and that prevents
some rewrites).
Made the `createReadOrMaskedRead` and `isValidMaskedInputVector` utility
functions accessible outside of their compilation unit (CU). This is needed
by the new IREE TopK implementation.
…se of tensor pack
When the vector sizes are not passed as inputs to the vectorization
transform operation, they are queried from the static result shape in the
case of the tensor.pack op.
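For illustration, a pack with a fully static result shape can then be
vectorised without specifying `vector_sizes` (a sketch; the shapes and the
`%pack` handle are assumptions):
```mlir
// The result shape is fully static, so the vector sizes can be inferred from it.
%packed = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 32]
  into %dest : tensor<128x256xf32> -> tensor<16x8x8x32xf32>

// No vector_sizes are passed - they are queried from the static result shape.
transform.structured.vectorize %pack : !transform.any_op
```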
This patch adds support for masked vectorisation of depthwise 1D WC
convolutions, i.e. `linalg.depthwise_conv_1d_nwc_wc`. This is implemented
by adding masking support for this op.
Two major assumptions are made:
* only the channel dimension can be dynamic/scalable (i.e. the
trailing dim),
* when specifying vector sizes to use in the vectoriser, only the size
corresponding to the channel dim is effectively used (other dims are
inferred from the context).
In terms of scalable vectorisation, this should be sufficient to cover
all practical cases (i.e. making arbitrary dim scalable wouldn't make
much sense). As for more generic cases with dynamic shapes (e.g. W or N
dims being dynamic), more work would be needed. In particular, one would
have to consider the filter and input/output tensors separately.
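An example of the kind of op this covers, with only the channel dim dynamic
(an illustrative sketch; the remaining sizes are made up):
```mlir
// N and W are static; only the trailing (channel) dim is dynamic.
%res = linalg.depthwise_conv_1d_nwc_wc
  {dilations = dense<1> : vector<1xi64>, strides = dense<1> : vector<1xi64>}
  ins(%input, %filter : tensor<1x8x?xf32>, tensor<1x?xf32>)
  outs(%output : tensor<1x8x?xf32>) -> tensor<1x8x?xf32>
```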
The reverse op was treated as a VectorMemoryAccessKind::Contiguous load.
While it is a contiguous slice, we would need to compute the indices
differently and apply a reverse at the vector level, which takes
non-trivial effort. This revision flips the case to use vector.gather
instead; otherwise there are functional issues. E.g., the example below
loaded `2, 3, 4` (which is a bug), whereas what we want is `2, 1, 0`.
Before vectorization:
```mlir
func.func @vectorize_reverse_like_tensor_extract(%arg0: tensor<1x2x3xf32>, %arg1: tensor<1x1x3xf32>, %arg2: index) -> tensor<1x1x3xf32> {
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel", "parallel", "parallel"]} outs(%arg1 : tensor<1x1x3xf32>) {
^bb0(%out: f32):
%1 = linalg.index 1 : index
%2 = linalg.index 0 : index
%3 = affine.apply #map1(%1, %2, %arg2)
%4 = linalg.index 2 : index
%5 = arith.subi %c2, %4 : index
%extracted = tensor.extract %arg0[%c0, %3, %5] : tensor<1x2x3xf32>
linalg.yield %extracted : f32
} -> tensor<1x1x3xf32>
return %0 : tensor<1x1x3xf32>
}
```
Partial IR after vectorization:
```
%5 = vector.constant_mask [1, 1, 3] : vector<1x1x4xi1>
%6 = vector.broadcast %arg0 : index to vector<1x1x4xindex>
%7 = vector.shape_cast %6 : vector<1x1x4xindex> to vector<4xindex>
%8 = vector.extractelement %7[%c0_i32 : i32] : vector<4xindex>
%9 = vector.transfer_read %3[%c0, %8, %c2], %cst, %5 {in_bounds = [true, true, true]} : tensor<1x2x3xf32>, vector<1x1x4xf32>
```
Added support for vectorizing tensor.unpack. The unpack Op is split into a
`vector.transfer_read`, `vector.transpose`, `vector.shape_cast` and a
`vector.transfer_write`.
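Conceptually, the lowering produces a sequence along these lines (a sketch
with made-up shapes and indices; masking and padding details are omitted):
```mlir
// For an unpack from tensor<2x4x16x2xf32> (inner_dims_pos = [0, 1],
// inner_tiles = [16, 2]) into tensor<32x8xf32>:
%read = vector.transfer_read %src[%c0, %c0, %c0, %c0], %pad
  {in_bounds = [true, true, true, true]}
  : tensor<2x4x16x2xf32>, vector<2x4x16x2xf32>
%tr = vector.transpose %read, [0, 2, 1, 3]
  : vector<2x4x16x2xf32> to vector<2x16x4x2xf32>
%sc = vector.shape_cast %tr : vector<2x16x4x2xf32> to vector<32x8xf32>
%wr = vector.transfer_write %sc, %dest[%c0, %c0]
  {in_bounds = [true, true]} : vector<32x8xf32>, tensor<32x8xf32>
```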
This PR adds a direct vectorization lowering of `tensor.pack` into
`mask(vector.transfer_read)`->`vector.shape_cast`->`vector.transpose`->`vector.transfer_write`.
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
Rename interface functions as follows:
* `hasTensorSemantics` -> `hasPureTensorSemantics`
* `hasBufferSemantics` -> `hasPureBufferSemantics`
These two functions return "true" if the op has tensor/buffer operands
but not buffer/tensor operands.
Also drop the "ranked" part from the interface, i.e., do not distinguish
between ranked/unranked types.
The new function names describe the functions more accurately. They also
align their semantics with the notion of "tensor semantics" with the
bufferization framework. (An op is supposed to be bufferized if it has
tensor operands, and we don't care if it also has memref operands.)
This change is in preparation of #75273, which adds
`BufferizableOpInterface::hasTensorSemantics`. By renaming the functions
in the `DestinationStyleOpInterface`, we can avoid name clashes between
the two interfaces.
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
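For example (illustrative only; the kind spelling follows the corresponding
arith op name):
```mlir
// Old: %m = vector.reduction <maxf>, %v : vector<4xf32> into f32
// New: the kind name maps 1:1 to the corresponding arith op (arith.maxnumf here).
%m = vector.reduction <maxnumf>, %v : vector<4xf32> into f32
```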
Issue: https://github.com/llvm/llvm-project/issues/72354
Updates the vectorisation of 1D depthwise convolution when flattening
the channel dimension (introduced in #71918). In particular - how the
convolution filter is "flattened". ATM, the vectoriser will use
`vector.shape_cast`:
```mlir
%b_filter = vector.broadcast %filter : vector<4xf32> to vector<3x2x4xf32>
%sc_filter = vector.shape_cast %b_filter : vector<3x2x4xf32> to vector<3x8xf32>
```
This lowering is not ideal - `vector.shape_cast` can be convenient when
it's folded away, but that's not happening in this case. Instead, this
patch updates the vectoriser to use `vector.shuffle` (the overall result
is identical):
```mlir
%sh_filter = vector.shuffle %filter, %filter
[0, 1, 2, 3, 0, 1, 2, 3] : vector<4xf32>, vector<4xf32>
%b_filter = vector.broadcast %sh_filter : vector<8xf32> to vector<3x8xf32>
```
The current vectorization of 1D depthwise convolutions in Linalg is
_sub-optimal_ for tensors with a low number of channels, e.g.:
```mlir
linalg.depthwise_conv_1d_nwc_wc
{dilations = dense<1> : vector<1xi64>,
strides = dense<1> : vector<1xi64>}
ins(%input, %filter : tensor<1x8x3xi8>, tensor<1x3xi8>)
outs(%output : tensor<1x8x3xi8>) -> tensor<1x8x3xi8>
```
That's due to the fact that ultimately (i.e. at LLVM level),
vectorization happens along the trailing dimension (i.e. the channel
dimension). In this case it leads to vectors with 3 elements (or worse,
if there's e.g. only 1 channel). For comparison, a 128-bit wide vector
register can hold 16 x i8.
Instead, this patch adds an option to flatten/collapse the channel
dimension into the width dimension of the input/filter/output using
`vector.shape_cast` operation:
```mlir
%sc_input = vector.shape_cast %input : vector<1x8x3xi8> to vector<1x24xi8>
%sc_output = vector.shape_cast %output : vector<1x8x3xi8> to vector<1x24xi8>
%b_filter = vector.broadcast %filter : vector<3xi8> to vector<1x8x3xi8>
%sc_filter = vector.shape_cast %b_filter : vector<1x8x3xi8> to vector<1x24xi8>
```
This new vectorization mode is implemented in `depthwiseConv` by
inserting `vector.shape_cast` Ops before and after
`depthwiseConv1dSliceAsMulAcc` is invoked. It can be selected through
e.g. a transform dialect attribute:
```mlir
transform.structured.vectorize_children_and_apply_patterns %conv {flatten_1d_depthwise_conv}
```
A forthcoming patch will implement a strategy to automatically switch
between the two implementations, depending on the shape of the input
tensors.
Co-authored by: Bradley Smith <bradley.smith@arm.com>