Commit Graph

1618 Commits

Andrzej Warzyński
39ad84e4d1 [mlir][linalg] Split GenericPadOpVectorizationPattern into two patterns (#111349)
At the moment, `GenericPadOpVectorizationPattern` implements two
orthogonal transformations:
  1. Rewrites `tensor::PadOp` into a sequence of `tensor::EmptyOp`,
    `linalg::FillOp` and `tensor::InsertSliceOp`.
  2. Vectorizes (where possible) `tensor::InsertSliceOp` (see
    `tryVectorizeCopy`).
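
For illustration, here's a minimal sketch of the first rewrite (shapes
and the pad value are made up for this example):

```mlir
// Input: pad a 2x3 tensor to 4x5 with a constant value %pad_val.
%padded = tensor.pad %src low[0, 0] high[2, 2] {
^bb0(%i: index, %j: index):
  tensor.yield %pad_val : f32
} : tensor<2x3xf32> to tensor<4x5xf32>

// After the rewrite: create the destination, fill it with the pad
// value, then insert the original tensor.
%empty = tensor.empty() : tensor<4x5xf32>
%filled = linalg.fill ins(%pad_val : f32)
    outs(%empty : tensor<4x5xf32>) -> tensor<4x5xf32>
%res = tensor.insert_slice %src into %filled[0, 0] [2, 3] [1, 1]
    : tensor<2x3xf32> into tensor<4x5xf32>
```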

This patch splits `GenericPadOpVectorizationPattern` into two separate
patterns:
  1. `GeneralizePadOpPattern` for the first transformation (note that
    currently `GenericPadOpVectorizationPattern` inherits from
    `GeneralizePadOpPattern`).
  2. `InsertSliceVectorizePattern` to vectorize `tensor::InsertSliceOp`.

With this change, we gain the following:
  * a clear separation between pre-processing and vectorization
    transformations/stages,
  * a path to support masked vectorisation for `tensor.insert_slice`
    (with a dedicated pattern for vectorization, it is much easier to
    specify the input vector sizes used in masking),
  * more opportunities to vectorize `tensor.insert_slice` (see the sketch
    below).
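
For instance, a standalone `tensor.insert_slice` (static shapes; offsets
and values illustrative) can now be vectorized roughly as:

```mlir
// Input:
%r = tensor.insert_slice %src into %dest[0, 0] [2, 3] [1, 1]
    : tensor<2x3xf32> into tensor<4x5xf32>

// Vectorized form: read the source as a vector, then write it into the
// destination at the slice offsets.
%v = vector.transfer_read %src[%c0, %c0], %pad {in_bounds = [true, true]}
    : tensor<2x3xf32>, vector<2x3xf32>
%r2 = vector.transfer_write %v, %dest[%c0, %c0] {in_bounds = [true, true]}
    : vector<2x3xf32>, tensor<4x5xf32>
```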

Note for downstream users:
--------------------------

If you were using `populatePadOpVectorizationPatterns`, following this
change you will also have to add
`populateInsertSliceVectorizationPatterns`.

Finer implementation details:
-----------------------------

1.  The majority of changes in this patch are copy & paste + some edits.
  1.1. The only functional change is that the vectorization of
    `tensor.insert_slice` is now broadly available (as opposed to being
    constrained to the pad vectorization pattern:
    `GenericPadOpVectorizationPattern`).
  1.2. Following-on from the above, `@pad_and_insert_slice_dest` is
    updated. As expected, the input `tensor.insert_slice` Op is no
    longer "preserved" and instead gets vectorized successfully.

2. The `linalg.fill` case in `getConstantPadVal` works under the
   assumption that only _scalar_ source values can be used. That's
   consistent with the definition of the Op, but it's not tested at the
   moment. Hence a test case in Linalg/invalid.mlir is added.

3. The behaviour of the two TD vectorization Ops,
   `transform.structured.vectorize_children_and_apply_patterns` and
   `transform.structured.vectorize` is preserved.
2024-10-29 16:57:23 +00:00
Andrzej Warzyński
ac4bd74190 [mlir] Add apply_patterns.linalg.pad_vectorization TD Op (#112504)
This PR simply wraps `populatePadOpVectorizationPatterns` into a new
Transform Dialect Op: `apply_patterns.linalg.pad_vectorization`.

This change makes it possible to run (and test) the corresponding
patterns _without_:

  `transform.structured.vectorize_children_and_apply_patterns`.

Note that the Op above only supports non-masked vectorisation (i.e. when
the inputs are static), so, effectively, it only supports fixed-width
vectorisation (as opposed to scalable vectorisation). As such, this
change is required to construct vectorization pipelines for tensor.pad
targeting scalable vectors.
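
For reference, a minimal sketch of how the new Op slots into a transform
script (matching on `func.func` is illustrative):

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    %f = transform.structured.match ops{["func.func"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // Run the tensor.pad vectorization patterns in isolation.
    transform.apply_patterns to %f {
      transform.apply_patterns.linalg.pad_vectorization
    } : !transform.any_op
    transform.yield
  }
}
```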

To test the new Op and the corresponding patterns, I added
"vectorization-pad-patterns.mlir" - most tests have been extracted from
"vectorization-with-patterns.mlir".
2024-10-25 10:39:26 -07:00
Max191
2bff9d9ffe [mlir] Don't hoist transfers from potentially zero trip loops (#112752)
The hoistRedundantVectorTransfers function does no verification of loop
bounds when hoisting vector transfers. This is not safe in general,
since the loop may have a zero trip count. This PR uses ValueBounds to
verify that the lower bound is less than the upper bound of the loop
before hoisting. Trip count verification is currently behind an option
`verifyNonZeroTrip`, which is false by default.

Zero trip count loops can arise in GPU code generation, where a loop
bound can be dependent on a thread id. If not all threads execute the
loop body, then hoisting out of the loop can cause these threads to
execute the transfers when they are not supposed to.
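
A minimal sketch of the hazard (the thread-id-dependent lower bound is
illustrative):

```mlir
// If %tid may be >= %ub, the loop body may never execute.
scf.for %i = %tid to %ub step %c1 {
  %v = vector.transfer_read %mem[%c0], %pad {in_bounds = [true]}
      : memref<8xf32>, vector<8xf32>
  %v2 = arith.addf %v, %v : vector<8xf32>
  vector.transfer_write %v2, %mem[%c0] {in_bounds = [true]}
      : vector<8xf32>, memref<8xf32>
}
// Hoisting the read/write pair out of the loop would make a zero-trip
// thread execute them anyway; with `verifyNonZeroTrip` the hoist only
// fires when %tid < %ub can be proven.
```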

---------

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
2024-10-18 16:11:21 -04:00
Andrzej Warzyński
0a3347dc63 [mlir][linalg] Fix idx comparison in the vectorizer (#112900)
Fixes loop comparison condition in the vectorizer.

As that logic is used specifically for vectorising `tensor.extract`, I
also added a test that violates the assumptions made inside
`getTrailingNonUnitLoopDimIdx`, namely that Linalg loops are non-empty.
Vectorizer pre-conditions will capture that much earlier, making sure
that `getTrailingNonUnitLoopDimIdx` is only run when all the assumptions
are actually met.

Thank you for pointing this out, @pfusik !
2024-10-18 15:27:43 +01:00
Andrzej Warzyński
f7f51f2afb [mlir][vector] Clarify the semantics of masking maps (nfc) (#111383)
We use the term "masking map" throughout the Linalg vectorization logic,
but we don't really define what it is and how it differs from Linalg
indexing maps. This PR clarifies the differences, makes sure that the new
terminology is used consistently, and improves code re-use.
2024-10-18 08:58:58 +01:00
Alexander Pivovarov
a24c468782 [MLIR] Fix assert expressions (#112474)
I noticed that several assertions in the MLIR codebase have issues with
operator precedence.

The issue with operator precedence in these assertions is due to the way
logical operators are evaluated. The `&&` operator has higher precedence
than the `||` operator, which means the assertion is currently
evaluated incorrectly, like this:
```
assert((resType.getNumDynamicDims() == dynOutDims.size()) ||
       (dynOutDims.empty() && "Either none or all output dynamic dims must be specified!"));
```

We should add parentheses around the entire expression involving
`dynOutDims.empty()` to ensure that the logical conditions are grouped
correctly. Here’s the corrected version:
```
assert(((resType.getNumDynamicDims() == dynOutDims.size()) || dynOutDims.empty()) &&
       "Either none or all output dynamic dims must be specified!");
```
2024-10-16 15:22:29 -07:00
Andrzej Warzyński
a758bcdbd9 [mlir][td] Rename pack_paddings in structured.pad (#111036)
The `pack_paddings` attribute in the `structured.pad` TD Op is used to set
the `nofold` attribute in the generated `tensor.pad` Op. The current name
is confusing and suggests that there's a relation with the `tensor.pack`
Op. This patch renames it to `nofold_flags` to better match the actual
usage.
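
A hypothetical usage sketch after the rename (attribute values and
result handles are illustrative):

```mlir
%padded, %pad, %copy = transform.structured.pad %matmul {
  padding_values = [0.0 : f32, 0.0 : f32, 0.0 : f32],
  padding_dimensions = [0, 1, 2],
  nofold_flags = [1, 1, 1]  // previously: pack_paddings = [1, 1, 1]
} : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
```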
2024-10-15 19:24:43 +01:00
Longsheng Mou
4b31568e02 [mlir][linalg] Bugfix for InlineScalarOperands (#111534)
This PR fixes a bug where `scalarOperand` is a simple scalar and should
be used directly, rather than accessed via `tensor.extract`. Fixes
#111243.
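
A sketch of the kind of input involved (types illustrative): the `ins`
operand below is a plain scalar with an empty indexing map, so inlining
should use `%s` directly instead of generating a `tensor.extract`:

```mlir
#scalar = affine_map<(d0) -> ()>
#id = affine_map<(d0) -> (d0)>
%0 = linalg.generic {indexing_maps = [#scalar, #id],
                     iterator_types = ["parallel"]}
    ins(%s : f32) outs(%init : tensor<4xf32>) {
^bb0(%in: f32, %out: f32):
  %r = arith.addf %in, %out : f32
  linalg.yield %r : f32
} -> tensor<4xf32>
```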
2024-10-14 15:38:35 +08:00
Javed Absar
c13f806f17 [mlir][linalg] raise generic to named ops. (#110421)
Add support for specializing `linalg.broadcast` and `linalg.transpose` from generic. Also does some refactoring to reuse specialization checks, migrating some common uses to op interface methods.
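
For example, a sketch of the `linalg.broadcast` raising (shapes
illustrative):

```mlir
// A generic that broadcasts %src along d1 ...
%0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%src : tensor<8xf32>) outs(%init : tensor<8x16xf32>) {
^bb0(%in: f32, %out: f32):
  linalg.yield %in : f32
} -> tensor<8x16xf32>

// ... is raised to the named op:
%1 = linalg.broadcast ins(%src : tensor<8xf32>)
    outs(%init : tensor<8x16xf32>) dimensions = [1]
```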
2024-10-11 15:27:27 +01:00
Emilio Cota
1276ce9e97 Revert "[mlir][linalg] Introduce transpose semantic to 'linalg.matmul' ops. (#104783)"
This reverts commit 03483737a7 and
99c8557, which is a fix-up on top of the former.

I'm reverting because this commit broke two tests:
  mlir/test/python/integration/dialects/linalg/opsrun.py
  mlir/test/python/integration/dialects/transform.py
See https://lab.llvm.org/buildbot/#/builders/138/builds/4872

I'm not familiar with the tests, so I'm leaving it to the original author
to either remove or adapt the broken tests, as discussed here:
  https://github.com/llvm/llvm-project/pull/104783#issuecomment-2406390905
2024-10-11 05:22:56 -04:00
Dmitriy Smirnov
bb4696ce30 [mlir][linalg] Fix for bias handling for Winograd (#110331)
This PR makes the winograd.output_transform op a destination-style op and
fixes the handling of pre-existing data in its output argument (i.e. data
possibly pre-initialized with a bias, which was previously discarded).

---------

Signed-off-by: Dmitriy Smirnov <dmitriy.smirnov@arm.com>
2024-10-11 09:39:19 +01:00
Md Asghar Ahmad Shahid
03483737a7 [mlir][linalg] Introduce transpose semantic to 'linalg.matmul' ops. (#104783)
The main goal of this patch is to extend the semantics of the
'linalg.matmul' named op to include per-operand transpose semantics,
while also laying out a way to move op definitions from OpDSL to
tablegen. Hence, it is implemented in tablegen. The transpose semantics
are as follows.

By default, 'linalg.matmul' behavior will remain as is. Transpose
semantics can be applied per input operand by specifying the optional
permutation attributes (namely 'permutationA' for the 1st input and
'permutationB' for the 2nd input) explicitly as needed. By default, no
transpose is mandated for either input operand.

Example:
```mlir
%val = linalg.matmul ins(%arg0, %arg1 : memref<5x3xf32>, memref<5x7xf32>)
                     outs(%arg2 : memref<3x7xf32>)
                     permutationA = [1, 0]
                     permutationB = [0, 1]
```
2024-10-10 17:00:58 +01:00
Andrzej Warzyński
f59b0c7603 [mlir][linalg][nfc] Delete references to args_in/args_out (#111517)
After the refactor in:
  * ed229132f1,

the `args_in` and `args_out` attributes are no longer used by
`linalg.generic`. This patch removes most of the remaining references.
I've left out BufferDeallocationInternals.md, which doesn't seem
maintained anymore and is quite out of sync with other bits of MLIR
(e.g. `test.generic` instead of `linalg.generic`).
2024-10-10 15:45:52 +01:00
BARRET
1666d13078 [CMake]: Remove unnecessary dependencies on LLVM/MLIR (#111255)
The previous PR, https://github.com/llvm/llvm-project/pull/110362
(reverted), caused breakage. This is the PR with the fix.

My build cmdline:

```
cmake ../llvm \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=install \
    -DCMAKE_C_COMPILER=gcc-9 \
    -DCMAKE_CXX_COMPILER=g++-9 \
    -DCMAKE_CUDA_COMPILER=$(which nvcc) \
    -DLLVM_ENABLE_LLD=OFF \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_BUILD_EXAMPLES=ON \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DLLVM_CCACHE_BUILD=ON \
    -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
    -DBUILD_SHARED_LIBS=ON \
    -DLLVM_ENABLE_PROJECTS='llvm;mlir'
```
2024-10-07 15:52:43 +02:00
Andrzej Warzyński
d9d623310d [mlir][linalg] Add a new helper hook: hasVectorizationImpl (#110708)
The newly added hook simply returns `false` for Ops for which there's no
"vectorization logic" in the Linalg Vectorizer (i.e. the `vectorize()`
method). It's added so that the following two TD ops expose an identical
level of functionality (that's not the case ATM):

  * `transform.structured.vectorize_children_and_apply_patterns`
  * `transform.structured.vectorize`

Specifically, ATM, the former works only for Linalg Ops, while the
latter works for all Ops that the vectorizer supports (*). With this
change, I am making sure that both TD Ops behave consistently.

Note, this shouldn't affect any of the current uses of the vectorizer.

(*) This is implemented via the `vectorize()` method in
Vectorization.cpp.
2024-10-04 10:06:33 +01:00
Andrzej Warzyński
56d6b56739 [mlir][vector] Relax the requirements on broadcast dims (#99341)
NOTE: This is a follow-up for #97049 in which the `in_bounds` attribute
was made mandatory.

This PR updates the semantics of the `in_bounds` attribute so that
broadcast dimensions are no longer required to be "in bounds".
Specifically, these xfer_read/xfer_write Ops become valid after this
change:

```mlir
  %read = vector.transfer_read %A[%base1, %base2], %pad
      {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
      : memref<?x?xf32>, vector<9xf32>

  vector.transfer_write %vec, %A[%base1, %base2]
      {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
      : vector<9xf32>, memref<?x?xf32>
```

Note that the value `false` merely means "may run out-of-bounds", i.e.,
the corresponding access can still be "in bounds". In fact, the folder
for xfer Ops is also updated (*) and will update the attribute value
corresponding to broadcast dims to `true` if all non-broadcast dims
are marked as "in bounds". 
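
For example (a sketch; the read below has no non-broadcast dims in its
permutation map, so the condition holds vacuously):

```mlir
// Broadcast dim marked "may be out-of-bounds" ...
%r = vector.transfer_read %A[%i, %j], %pad
    {in_bounds = [false], permutation_map = affine_map<(d0, d1) -> (0)>}
    : memref<?x?xf32>, vector<9xf32>
// ... folds to in_bounds = [true]: all non-broadcast dims (here, none)
// are already marked "in bounds".
```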

Note that this PR doesn't change any of the lowerings. The changes in
"SuperVectorize.cpp", "Vectorization.cpp" and "AffineMap.cpp" are simple
reverts of recent changes in #97049. Those were only meant to facilitate
making `in_bounds` mandatory and to work around the extra requirements
for broadcast dims (those requirements were removed in this PR). All
changes in tests are also reverts of changes from #97049.

For context, here's a PR in which "broadcast" dims were forced to
always be "in-bounds":
  * https://reviews.llvm.org/D102566

(*) See `foldTransferInBoundsAttribute`.
2024-10-04 07:41:20 +01:00
Rolf Morel
94cf80d6fd [MLIR][Linalg] Pattern to fold AddOp to accumulation via contraction op's dest (#110514)
Replaces a linalg.add whose one operand is the single user of a
contraction (one with a zero-filled, "identity-mapped" destination,
dominated by the `other` operand) with that contraction, using `other`
as its dest.

Benefits include elision of an elementwise op, namely the linalg.add,
and removing a tensor.empty as a destination which is likely to require
an allocation upon bufferization.
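
A rough before/after sketch (shapes and the zero-fill spelled out for
clarity; all names illustrative):

```mlir
// Before: the contraction accumulates into a zero-filled tensor.empty,
// then linalg.add combines the result with %other.
%empty = tensor.empty() : tensor<4x4xf32>
%zeroed = linalg.fill ins(%zero : f32)
    outs(%empty : tensor<4x4xf32>) -> tensor<4x4xf32>
%mm = linalg.matmul ins(%a, %b : tensor<4x2xf32>, tensor<2x4xf32>)
    outs(%zeroed : tensor<4x4xf32>) -> tensor<4x4xf32>
%sum = linalg.add ins(%mm, %other : tensor<4x4xf32>, tensor<4x4xf32>)
    outs(%empty2 : tensor<4x4xf32>) -> tensor<4x4xf32>

// After: the contraction accumulates directly into %other; the add, the
// fill and a tensor.empty are elided.
%sum = linalg.matmul ins(%a, %b : tensor<4x2xf32>, tensor<2x4xf32>)
    outs(%other : tensor<4x4xf32>) -> tensor<4x4xf32>
```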
2024-10-03 12:22:57 +02:00
Andrzej Warzyński
1c01bcb350 [mlir][tensor] Relax the logic to generalise tensor.pack (#110807)
Make sure that the logic to generalize tensor.pack (into e.g. tensor.pad
+ tensor.transpose) does indeed allow multiple dynamic tile sizes. This
was effectively already implemented in #109815 - in this PR I am merely
removing one `if` condition and adding a test.

I also took the liberty of renaming a few test functions - just to
better highlight the differences between the old and the new tests.

Follow-on for #109815.
2024-10-02 16:43:29 +01:00
Andrzej Warzyński
66f84c8b8a [mlir][tensor] Extend the logic to generalise tensor.pack (#109815)
Extends the logic to generalise tensor.pack (into e.g. tensor.pad +
tensor.transpose) so that it also works when one of the inner tile sizes
is scalable (i.e. a multiple of `vector.vscale`). For example:
```mlir
  %c8 = arith.constant 8 : index
  %vscale = vector.vscale
  %c8_vscale = arith.muli %vscale, %c8 : index
  %0 = tensor.pack %input
      padding_value(%pad : f32)
      inner_dims_pos = [0, 1]
      inner_tiles = [%c8_vscale, 2]
      into %output : tensor<5x1xf32> -> tensor<1x1x?x2xf32>
```
is generalised as:
```mlir
  %c8 = arith.constant 8 : index
  %vscale = vector.vscale
  %c8_vscale = arith.muli %vscale, %c8 : index
  %0 = affine.apply #map()[%c8_vscale, %c5]
  %padded = tensor.pad %arg0 low[0, 0] high[%0, 1] {
  ^bb0(%arg3: index, %arg4: index):
    tensor.yield %arg2 : f32
  } : tensor<5x1xf32> to tensor<?x2xf32>
```

At the Tensor level, we model scalability using dynamic shapes and this
change basically extends the relevant logic so that it also works for
dynamic shapes.
2024-10-02 09:44:13 +01:00
Mehdi Amini
8b47711e84 Revert "CMake: Remove unnecessary dependencies on LLVM/MLIR" (#110594)
Reverts llvm/llvm-project#110362

Multiple bots are broken.
2024-10-01 00:44:21 +02:00
BARRET
4980f2177e CMake: Remove unnecessary dependencies on LLVM/MLIR (#110362)
There are some spurious libraries which can be removed.

I'm trying to bundle MLIR/LLVM library dependencies for our own
libraries. We're utilizing a CMake function to recursively collect
MLIR/LLVM-related dependencies. However, we identified certain library
dependencies as redundant and safe to remove.
2024-09-30 23:57:13 +02:00
Andrzej Warzyński
6d11494414 [mlir][Linalg] Refine how broadcast dims are treated (#99015)
This PR fixes how broadcast dims (identified as "zero" results in
permutation maps) corresponding to a reduction iterator are vectorised
in the case of generic Ops. Here's an example:

```mlir
  #map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
  #map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>

  func.func @generic_with_reduction_and_broadcast(%arg0: tensor<1x12x197x197xf32>) -> (tensor<1x12x197x1xf32>) {
    %0 = tensor.empty() : tensor<1x12x197x1xf32>

    %1 = linalg.generic {indexing_maps = [#map, #map1],
                        iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%arg0 : tensor<1x12x197x197xf32>)
      outs(%0 : tensor<1x12x197x1xf32>) {

    ^bb0(%in: f32, %out: f32):
      %818 = arith.addf %in, %out : f32
      linalg.yield %818 : f32
    } -> tensor<1x12x197x1xf32>
    return %1 : tensor<1x12x197x1xf32>
  }
```

This is a perfectly valid Generic Op, but currently triggers two issues
in the vectoriser. The root cause is this map:

```mlir
  #map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>
```

This map triggers an assert in `reindexIndexingMap` -  this hook
incorrectly assumes that every result in the input map is a `dim`
expression and that there are no constants. That's not the case in this
example. `reindexIndexingMap` is extended to allow maps like the one
above. For now, only constant "zero" results are allowed. This can be
extended in the future once a good motivating example is available.

Separately, the permutation map highlighted above "breaks" mask
calculation (ATM masks are always computed, even in the presence of
static shapes). When applying the following permutation:
```mlir
  (d0, d1, d2, d3) -> (d0, d1, d2, 0)
```

to these canonical shapes (corresponding to the example above):
```
  (1, 12, 197, 197)
```
we end up with the following error:
```bash
error: vector types must have positive constant sizes but got 1, 12, 197, 0
```

The error makes sense and indicates that we should update the
permutation map above to:
```
  (d0, d1, d2, d3) -> (d0, d1, d2)
```

This would correctly give the following vector type:
```
  vector<1x12x197xi1>
```

Fixes #97247
2024-09-26 16:17:15 +01:00
Hugo Trachino
28039055e5 [MLIR][Transform] Hoist Pad generates linalg.transpose (#109669)
For readability purposes, generate linalg named ops when possible.
For maintainability purposes, get rid of duplicated code.
2024-09-26 09:33:47 +01:00
Nirvedh Meshram
234193bae6 [mlir][linalg] Vectorization support for convolution of i1 type (#109480)
Normally, convolutions present with the following linalg op region:
```mlir
^bb0(%arg14: i4, %arg15: i4, %arg16: i4):
  %17 = arith.muli %arg14, %arg15 : i4
  %18 = arith.addi %arg16, %17 : i4
  linalg.yield %18 : i4
```
However, for i1, due to strength reduction we get something like:
```mlir
^bb0(%arg14: i1, %arg15: i1, %arg16: i1):
  %17 = arith.andi %arg14, %arg15 : i1
  %18 = arith.ori %arg16, %17 : i1
  linalg.yield %18 : i1
```
This PR updates the logic to support this region for i1 types.
2024-09-24 12:24:59 -05:00
Andrzej Warzyński
b47d1787b5 [mlir][vector] Refine vectorisation of tensor.extract (#109580)
This PR fixes a bug in `isLoopInvariantIdx`. It makes sure that the
following case is vectorised as `vector.gather` (as opposed to
attempting a contiguous load):
```mlir
  func.func @index_from_output_column_vector_gather_load(%src: tensor<8x128xf32>) -> tensor<8x1xf32> {
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<8x1xf32>
    %res = linalg.generic {
      indexing_maps = [#map],
      iterator_types = ["parallel", "parallel"]
    } outs(%0 : tensor<8x1xf32>) {
    ^bb0(%arg1: f32):
        %1 = linalg.index 0 : index
      %extracted = tensor.extract %src[%1, %c0] : tensor<8x128xf32>
        linalg.yield %extracted : f32
    } -> tensor<8x1xf32>
    return %res : tensor<8x1xf32>
  }
```

Specifically, when looking for loop-invariant indices in
`tensor.extract` Ops, any `linalg.index` Op that's used in address
calculation should only access loop dims that are == 1. In the example
above, the following does not meet that criteria:
```mlir
  %1 = linalg.index 0 : index
```

Note that this PR also effectively addresses the issue fixed in #107922,
i.e. exercised by:
  * `@vectorize_nd_tensor_extract_load_1d_column_vector_using_gather_load`

`getNonUnitLoopDim` introduced in #107922 is still valid though. In
fact, it is required to identify that the following case is a contiguous
load:
```mlir
  func.func @index_from_output_column_vector_contiguous_load(%src: tensor<8x128xf32>) -> tensor<8x1xf32> {
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<8x1xf32>
    %res = linalg.generic {
      indexing_maps = [#map],
      iterator_types = ["parallel", "parallel"]
    } outs(%0 : tensor<8x1xf32>) {
    ^bb0(%arg1: f32):
        %1 = linalg.index 0 : index
      %extracted = tensor.extract %src[%c0, %1] : tensor<8x128xf32>
        linalg.yield %extracted : f32
    } -> tensor<8x1xf32>
    return %res : tensor<8x1xf32>
  }
```
Some logic is still missing to lower the above to
`vector.transfer_read`, so it is conservatively lowered to
`vector.gather` instead (see TODO in
`getTensorExtractMemoryAccessPattern`).

There are a few additional changes:
  * `getNonUnitLoopDim` is simplified and renamed as
    `getTrailingNonUnitLoopDimIdx`, and additional comments are added (note
    that the functionality didn't change);
  * extra comments in a few places; variable names in comments were updated
    to use Markdown (which is the preferred approach in MLIR).

This is a follow-on for:
  * https://github.com/llvm/llvm-project/pull/107922
  * https://github.com/llvm/llvm-project/pull/102321
2024-09-24 14:03:30 +01:00
Andrzej Warzyński
c1826aeef3 [mlir][tensor] Add new helper hooks for RelayoutOp (#109642)
Implements two helper hooks for PackOp and UnPackOp, `getAllOuterDims`
and `getTiledOuterDims`, and adds them to RelayoutOp (which both PackOp
and UnPackOp inherit from).

This improves code re-use and also clarifies the meaning of "outer dims"
and "tiled outer dims".
2024-09-24 13:14:49 +01:00
Kazu Hirata
f264d9a9d5 [Linalg] Fix a warning
This patch fixes:

  mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:821:12: error:
  variable 'countNonUnitDim' set but not used
  [-Werror,-Wunused-but-set-variable]
2024-09-21 13:59:36 -07:00
Nirvedh Meshram
e45fc5140d [Linalg][Vectorization] Add support for linalg vectorization of a tensor.extract case (#107922)
In https://github.com/llvm/llvm-project/pull/102321 we relaxed the
vectorizer so that, when checking for contiguous loads, we don't always
require a trailing non-unit dim. For example, in the test case added we
have `tensor<8x1xf32>`, which is now a valid candidate for a contiguous
load. However, the logic to check for contiguous loads assumed that only
the trailing dim would be non-unit, so this PR updates that logic to find
the actual non-unit dim.
2024-09-21 15:12:51 -05:00
Andrzej Warzyński
315ba77406 [mlir][linalg] Vectorisation of tensor.extract - dynamic shapes (#100582)
This PR removes the assumption that reading from a dynamic tensor is
always a gather load:

```mlir
%extracted = tensor.extract %src[%c79, %3] : tensor<?x?xf32>
```

That assumption was originally introduced to simplify the implementation
and to reduce the number of cases to consider. Now that the
vectorisation of `tensor.extract` has been around for > 1 year and has
been quite stable, we can safely relax it.

This is a relatively small change - rather than using the parent linalg
Op to infer the target output shape (not possible with dynamic shapes),
the vectorizer will use the (previously constructed) output vector
shape instead.

As expected, the following test required updating (`vector.gather` ->
`vector.transfer_read`):
  * @masked_dynamic_vectorize_nd_tensor_extract_with_affine_apply_contiguous

Similar test for scalable vectors is also added.
2024-09-19 19:53:11 +01:00
Max191
08efa23083 [mlir] Allow multi-result ops in reshape fusion (#108576)
Fusion of reshapes by collapsing patterns was restricted to single-result
operations, but the implementation supports multi-result ops. This PR
removes the restriction, since it is not necessary.
2024-09-16 13:06:38 -04:00
Thomas Preud'homme
326287fd5b Add missing FillOp to winograd lowering (#108181)
Winograd lowering involves a number of matmul and batch_matmul ops which
are currently passed a tensor.empty result as their out parameter, and are
thereby undefined behaviour. This commit adds the necessary linalg.fill.
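
A minimal sketch of the fix (shapes illustrative):

```mlir
// Before: the matmul reads its uninitialized destination (UB).
%empty = tensor.empty() : tensor<4x4xf32>
%mm = linalg.matmul ins(%a, %b : tensor<4x2xf32>, tensor<2x4xf32>)
    outs(%empty : tensor<4x4xf32>) -> tensor<4x4xf32>

// After: zero-fill the destination first.
%zero = arith.constant 0.000000e+00 : f32
%filled = linalg.fill ins(%zero : f32)
    outs(%empty : tensor<4x4xf32>) -> tensor<4x4xf32>
%mm2 = linalg.matmul ins(%a, %b : tensor<4x2xf32>, tensor<2x4xf32>)
    outs(%filled : tensor<4x4xf32>) -> tensor<4x4xf32>
```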

---------

Co-authored-by: Max191 <44243577+Max191@users.noreply.github.com>
2024-09-13 15:48:17 +01:00
MaheshRavishankar
d5f0969c96 [mlir][TilingInterface] Avoid looking at operands for getting slices to continue tile + fuse. (#107882)
Current implementation of `scf::tileConsumerAndFuseProducerUsingSCF`
looks at operands of tiled/tiled+fused operations to see if they are
produced by `extract_slice` operations to populate the worklist used to
continue fusion. This implicit assumption does not always work. Instead
make the implementations of `getTiledImplementation` return the slices
to use to continue fusion.

This is a breaking change

- To continue to get the same behavior of
`scf::tileConsumerAndFuseProducerUsingSCF`, change all out-of-tree
implementations of `TilingInterface::getTiledImplementation` to return
the slices to continue fusion on. All in-tree implementations have been
adapted to this.
- This change touches parts that required a simplification to the
`ControlFn` in `scf::SCFTileAndFuseOptions`. It now returns a
`std::optional<scf::SCFTileAndFuseOptions::ControlFnResult>` object that
should be `std::nullopt` if fusion is not to be performed.

Signed-off-by: MaheshRavishankar <mahesh.revishankar@gmail.com>
2024-09-11 22:15:43 -07:00
Max191
e982d7fd7c [mlir] Reuse pack dest in tensor.pack decomposition (#108025)
In the `lowerPack` transform, there is a special case for lowering into
a simple `tensor.pad` + `tensor.insert_slice`, but the destination
becomes a newly created `tensor.empty`. This PR fixes the transform to
reuse the original destination of the `tensor.pack`.
2024-09-10 10:45:08 -04:00
Longsheng Mou
f3b4e47b34 [mlir][linalg][NFC] Drop redundant rankReductionStrategy (#107875)
This patch drops the redundant rankReductionStrategy in
`populateFoldUnitExtentDimsViaSlicesPatterns` and fixes comment typos.
2024-09-10 09:19:22 +08:00
Adrian Kuegel
b7981a78f0 [mlir] Apply ClangTidyPerformance finding (NFC).
Use const reference for loop variable.
2024-08-29 08:14:55 +00:00
Christopher Bate
8bf69ceb00 Reapply "[mlir] NFC: fix dependence of (Tensor|Linalg|MemRef|Complex) dialects on LLVM Dialect and LLVM Core in CMake build (#104832)" (#105703)
Reapply the commit 43b5085667 with
additional fixes for building with
BUILD_SHARED_LIBS=ON.
2024-08-28 22:34:14 -06:00
MaheshRavishankar
4dbaef6d5e [mlir][Linalg] Avoid doing op replacement in linalg::dropUnitDims. (#105749)
It is better to do the replacement in the caller. This avoids the
footgun if the caller needs the original operation. Instead return the
produced operation and replacement values.

Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
2024-08-23 13:43:33 -07:00
Jie Fu
d8b6df2e8b [mlir] Fix -Wunused-result in ElementwiseOpFusion.cpp (NFC)
/llvm-project/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp:124:7:
error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
      opOperandsToIgnore.pop_back_val();
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
2024-08-21 09:04:24 +08:00
DanielLevi6
4a4b233f35 [mlir][linalg] Improve getPreservedProducerResults estimation in ElementwiseOpFusion (#104409)
This commit changes the getPreservedProducerResults function so that it
takes the consumer into account along with the producer, in order to
predict which of the producer’s outputs can be dropped during the fusion
process. It provides a more accurate prediction, considering that the
fusion process also depends on the consumer.
2024-08-20 16:14:01 -07:00
Christopher Bate
06fd808654 Revert "[mlir] NFC: fix dependence of (Tensor|Linalg|MemRef|Complex) dialects on LLVM Dialect and LLVM Core in CMake build (#104832)"
This reverts commit 43b5085667 since it
caused the build to break with BUILD_SHARED_LIBS=ON.
2024-08-20 03:46:29 +00:00
Christopher Bate
43b5085667 [mlir] NFC: fix dependence of (Tensor|Linalg|MemRef|Complex) dialects on LLVM Dialect and LLVM Core in CMake build (#104832)
This change removes dependencies declared as either 'LINK_LIBS' or
'LINK_COMPONENTS' across several MLIR libraries. The removed dependencies
appear to be incorrect and may have been required in older versions of
the project. These dependencies cause many high-level dialects to have a
transitive dependence on the LLVM dialect and the LLVM 'Core' library
('llvm/lib/IR').

Note that if using the 'Ninja' CMake generator, one can inspect the
dependencies (including all transitive libraries) of any given MLIR
target by using the command `ninja -C <build dir> -t browse` and
navigating to the library of interest in a web browser.
2024-08-19 18:49:22 -06:00
Hsiangkai Wang
c4bf949171 [mlir][linalg] Implement TilingInterface for winograd operators (#96184)
In order to support arbitrary-size input data for conv2d, implement
TilingInterface for winograd operations. Before converting winograd
operations into nested loops with matrix multiplies, first tile the
input of conv2d into a supported size.

Add a transform operation structured.decompose_winograd_op to decompose
winograd operations. Before applying the transform op, use
tile_using_for to tile the input data into a supported size. The test
case shows how to tile and decompose winograd operations.
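
A hypothetical transform-script sketch of that flow (the matched op name
and tile sizes are illustrative):

```mlir
// Tile a winograd op to a supported size, then decompose it.
%0 = transform.structured.match ops{["linalg.winograd_filter_transform"]}
    in %root : (!transform.any_op) -> !transform.any_op
%tiled, %loop = transform.structured.tile_using_for %0 tile_sizes [1]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
%decomposed = transform.structured.decompose_winograd_op %tiled
    : (!transform.any_op) -> (!transform.any_op)
```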
2024-08-16 16:22:02 +01:00
Ian Wood
a95ad2da36 [mlir] Add bubbling patterns for non intersecting reshapes (#103401)
Refactored @Max191's PR https://github.com/llvm/llvm-project/pull/94637
to move it to `Tensor`

From the original PR
>This PR adds fusion by expansion patterns to push a tensor.expand_shape
up through a tensor.collapse_shape with non-intersecting reassociations.
Sometimes parallel collapse_shape ops like this can block propagation of
expand_shape ops, so this allows them to pass through each other.

I'm not sure if I put the code/tests in the right places, so let me know
where those go if they aren't.

cc @MaheshRavishankar @hanhanW

---------

Co-authored-by: Max Dawkins <max.dawkins@gmail.com>
2024-08-14 13:58:35 -07:00
Frank Schlimbach
baabcb2898 [mlir][mesh] Shardingcontrol (#102598)
This is a fixed copy of #98145 (necessary after it got reverted).

@sogartar @yaochengji
This PR adds the following to #98145:
- `UpdateHaloOp` accepts a `memref` (instead of a tensor) and no longer
returns a result, to clarify its in-place semantics
- `UpdateHaloOp` accepts `split_axis` to allow multiple mesh-axes per
tensor/memref-axis (similar to `mesh.sharding`)
- The implementation of `ShardingInterface` for tensor operations
(`tensor.empty` for now) moved from the tensor library to the mesh
interface library. `spmdize` uses features from the `mesh` dialect.
@rengolin agreed that `tensor` should not depend on `mesh` so this
functionality cannot live in a `tensor`s lib. The unfulfilled dependency
caused the issues leading to reverting #98145. Such cases are generally
possible and might lead to re-considering the current structure (like
for tosa ops).
- rebased onto latest main
--------------------------
Replacing `#mesh.sharding` attribute with operation `mesh.sharding`
- extended semantics now allow providing optional `halo_sizes` and
`sharded_dims_sizes`
- internally a sharding is represented as a non-IR class
`mesh::MeshSharding`

What previously was
```mlir
%sharded0 = mesh.shard %arg0 <@mesh0, [[0]]> : tensor<4x8xf32>
%sharded1 = mesh.shard %arg1 <@mesh0, [[0]]> annotate_for_users : tensor<16x8xf32>
```
is now
```mlir
%sharding = mesh.sharding @mesh0, [[0]] : !mesh.sharding
%0 = mesh.shard %arg0 to %sharding : tensor<4x8xf32>
%1 = mesh.shard %arg1 to %sharding annotate_for_users : tensor<16x8xf32>
```
and allows additional annotations to control the shard sizes:
```mlir
mesh.mesh @mesh0 (shape = 4)
%sharding0 = mesh.sharding @mesh0, [[0]] halo_sizes = [1, 2] : !mesh.sharding
%0 = mesh.shard %arg0 to %sharding0 : tensor<4x8xf32>
%sharding1 = mesh.sharding @mesh0, [[0]] sharded_dims_sizes = [3, 5, 5, 3] : !mesh.sharding
%1 = mesh.shard %arg1 to %sharding1 annotate_for_users : tensor<16x8xf32>
```
- `mesh.shard` op accepts additional optional attribute `force`, useful
for halo updates
- Some initial spmdization support for the new semantics
- Support for `tensor.empty` reacting on `sharded_dims_sizes` and
`halo_sizes` in the sharding
- New collective operation `mesh.update_halo` as a spmdized target for
shardings with `halo_sizes`

---------

Co-authored-by: frank.schlimbach <fschlimb@smtp.igk.intel.com>
Co-authored-by: Jie Fu <jiefu@tencent.com>
2024-08-12 12:20:58 +01:00
Kazu Hirata
165f45354a [mlir] Use llvm::is_contained (NFC) (#102714) 2024-08-09 21:42:19 -07:00
Andrzej Warzyński
62e5032c9a Reapply "[mlir][linalg] Relax tensor.extract vectorization" (#102321)
[This reverts commit 6662523d6b2ca0198141c94ee80ebbb41601df9f]

Simplifies the vectorization of tensor.extract so that:
* all cases that read into a genuinely multi-dim vector (*) are
  considered a gather load,
* all other cases are considered as potential contiguous loads.

This change means that the following extraction from a "column" tensor
is correctly identified as a scalar load followed by a broadcast (rather
than a gather load).

```mlir
func.func @vectorize_scalar_broadcast_column_tensor(%in: tensor<1x1x4xi32>) -> tensor<1x1x4xi32> {
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<[...]> : tensor<15x1xi32>

  %out = linalg.generic {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
    iterator_types = ["parallel", "parallel", "parallel"]}
    outs(%in : tensor<1x1x4xi32>) {

  ^bb0(%out: i32):
    %8 = linalg.index 0 : index
    %idx_0 = linalg.index 0 : index
    %extracted = tensor.extract %cst[%idx_0, %c0] : tensor<15x1xi32>
    linalg.yield %extracted : i32
  } -> tensor<1x1x4xi32>

  return %out:tensor<1x1x4xi32>
}
```

Overview of the delta compared to the original submission (#99299):
  * removed an assert representing a condition that is being relaxed
    here,
  * added a test (reading from a column tensor) based on a repro from
    @hanhanW.

(*) `vector<1x4x1xf32>` is considered as 1D vector in this context.
2024-08-08 09:36:10 +01:00
Renato Golin
3968942f10 Revert "[mlir][mesh] adding shard-size control (#98145)"
This reverts commit fca69838ca.

Also reverts the fixup: "[mlir] Fix -Wunused-variable in MeshOps.cpp (NFC)"

This reverts commit fc737368fe.
2024-08-07 15:12:37 +01:00
Frank Schlimbach
fca69838ca [mlir][mesh] adding shard-size control (#98145)
- Replacing `#mesh.sharding` attribute with operation `mesh.sharding`
- extended semantics now allow providing optional `halo_sizes` and
`sharded_dims_sizes`
- internally a sharding is represented as a non-IR class
`mesh::MeshSharding`

What previously was
```mlir
%sharded0 = mesh.shard %arg0 <@mesh0, [[0]]> : tensor<4x8xf32>
%sharded1 = mesh.shard %arg1 <@mesh0, [[0]]> annotate_for_users : tensor<16x8xf32>
```
is now
```mlir
%sharding = mesh.sharding @mesh0, [[0]] : !mesh.sharding
%0 = mesh.shard %arg0 to %sharding : tensor<4x8xf32>
%1 = mesh.shard %arg1 to %sharding annotate_for_users : tensor<16x8xf32>
```
and allows additional annotations to control the shard sizes:
```mlir
mesh.mesh @mesh0 (shape = 4)
%sharding0 = mesh.sharding @mesh0, [[0]] halo_sizes = [1, 2] : !mesh.sharding
%0 = mesh.shard %arg0 to %sharding0 : tensor<4x8xf32>
%sharding1 = mesh.sharding @mesh0, [[0]] sharded_dims_sizes = [3, 5, 5, 3] : !mesh.sharding
%1 = mesh.shard %arg1 to %sharding1 annotate_for_users : tensor<16x8xf32>
```
- `mesh.shard` op accepts additional optional attribute `force`, useful
for halo updates
- Some initial spmdization support for the new semantics
- Support for `tensor.empty` reacting on `sharded_dims_sizes` and
`halo_sizes` in the sharding
- New collective operation `mesh.update_halo` as a spmdized target for
shardings with `halo_sizes`

@sogartar @yaochengji
2024-08-07 13:34:57 +01:00
Han-Chung Wang
28fa83f8d4 Revert "[mlir][linalg] Relax tensor.extract vectorization" (#102232)
Reverts llvm/llvm-project#99299 because it breaks the lowering. To
repro: `mlir-opt -transform-interpreter ~/repro.mlir`

```mlir
#map = affine_map<(d0, d1) -> (d0)>
#map1 = affine_map<(d0, d1) -> (d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0 + d1)>
module {
  func.func @foo(%arg0: index, %arg1: tensor<2xf32>, %arg2: tensor<4xf32>, %arg3: tensor<1xf32>) -> tensor<4x1xf32> {
    %c0 = arith.constant 0 : index
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<4x1xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel"]} ins(%arg2, %arg3 : tensor<4xf32>, tensor<1xf32>) outs(%0 : tensor<4x1xf32>) {
    ^bb0(%in: f32, %in_1: f32, %out: f32):
      %2 = linalg.index 0 : index
      %3 = linalg.index 1 : index
      %4 = affine.apply #map3(%3, %arg0)
      %extracted = tensor.extract %arg1[%c0] : tensor<2xf32>
      %5 = arith.cmpi eq, %2, %c0 : index
      %6 = arith.cmpi ult, %2, %c0 : index
      %7 = arith.select %5, %cst, %in : f32
      %8 = arith.select %6, %cst_0, %7 : f32
      %9 = arith.cmpi eq, %4, %c0 : index
      %10 = arith.cmpi ult, %4, %c0 : index
      %11 = arith.select %9, %cst, %in_1 : f32
      %12 = arith.select %10, %cst_0, %11 : f32
      %13 = arith.mulf %8, %12 : f32
      %14 = arith.mulf %13, %extracted : f32
      %15 = arith.cmpi eq, %2, %4 : index
      %16 = arith.select %15, %cst, %cst_0 : f32
      %17 = arith.subf %16, %14 : f32
      linalg.yield %17 : f32
    } -> tensor<4x1xf32>
    return %1 : tensor<4x1xf32>
  }
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    transform.structured.vectorize %0 : !transform.any_op
    transform.yield
  }
}
```
2024-08-06 14:35:27 -07:00
Andrzej Warzyński
8868c02cda [mlir][linalg] Relax tensor.extract vectorization (#99299)
Simplifies the vectorization of tensor.extract so that:
* all cases that read into a genuinely multi-dim vector (*) are
  considered a gather load,
* all other cases are considered as potential contiguous loads.

This change means that the following extraction from a "column" tensor
will be correctly identified as a scalar load followed by a broadcast (rather
than a gather load).

```mlir
func.func @vectorize_scalar_broadcast_column_tensor(%in: tensor<1x1x4xi32>) -> tensor<1x1x4xi32> {
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<[...]> : tensor<15x1xi32>

  %out = linalg.generic {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
    iterator_types = ["parallel", "parallel", "parallel"]}
    outs(%in : tensor<1x1x4xi32>) {

  ^bb0(%out: i32):
    %idx_0 = linalg.index 0 : index
    %extracted = tensor.extract %cst[%idx_0, %c0] : tensor<15x1xi32>
    linalg.yield %extracted : i32
  } -> tensor<1x1x4xi32>

  return %out:tensor<1x1x4xi32>
}
```

(*) `vector<1x4x1xf32>` is considered as 1D vector in this context.
2024-08-06 10:57:10 +01:00