Currently, only the values defined outside the ForOp but inside the original
WarpOp are considered "escaping values". However, this does not cover all
cases: if the ForOp has unused results, the corresponding IterArgs must
also be yielded by the original WarpOp. This PR adds the required code
changes to achieve this.
Start removing debug intrinsics support, beginning with the flag that
controls production of their replacement, debug records. This patch
removes the command-line flag and with it the ability to switch back to
intrinsics. The module / function / block level "IsNewDbgInfoFormat"
flags get hardcoded to true; I'll incrementally remove things that
depend on those flags.
This PR adds the `arith.scaling_truncf` and `arith.scaling_extf` operations,
which support block quantization following the OCP MXFP spec listed here:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
The OCP MXFP spec comes with a reference implementation here:
https://github.com/microsoft/microxcaling/tree/main
The most relevant piece of reference code is the `_quantize_mx` method at
7bc41952de/mx/mx_ops.py (L173).
Both `arith.scaling_truncf` and `arith.scaling_extf` are designed to be
elementwise operations. Please see their descriptions in the `ArithOps.td`
file for more details.
Internally, `arith.scaling_truncf` computes
`arith.truncf(arith.divf(input, 2^scale))`. Any necessary broadcast,
clamping, normalization, and NaN propagation of `scale` should be done
before calling into `arith.scaling_truncf`.
`arith.scaling_extf` computes `arith.mulf(2^scale, input)` after taking
care of the necessary data type conversions.
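As a rough illustration of the math above (not the actual op definitions or lowering), here is a minimal C++ sketch of the per-element computation, assuming `scale` has already been broadcast and normalized; the function names and the use of plain `float` are purely illustrative, since the real ops work on MLIR's narrow float types:
```cpp
#include <cmath>

// Per-element math of arith.scaling_truncf before the final truncation to
// the narrow type: divide the input by 2^scale.
float scalingTruncfValue(float input, int scale) {
  return input / std::ldexp(1.0f, scale); // then truncf to e.g. f8/f4
}

// Per-element math of arith.scaling_extf after extending the narrow input:
// multiply the extended value by 2^scale.
float scalingExtfValue(float extendedInput, int scale) {
  return std::ldexp(extendedInput, scale); // extendedInput * 2^scale
}
```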
CC: @krzysz00 @dhernandez0 @bjacob @pashu123 @MaheshRavishankar
@tgymnich
---------
Co-authored-by: Prashant Kumar <pk5561@gmail.com>
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
This is to enable `CooperativeMatrixType` to be used with
`DenseElementsAttr`, so that a `spirv.Constant` can be easily built from
`OpConstantComposite`. For example:
```mlir
%cst = spirv.Constant dense<0.000000e+00> : !spirv.coopmatrix<1x1xf32, Subgroup, MatrixAcc>
```
Constraints of arithmetic operations are changed, as
`SameOperandsAndResultType` can no longer fully verify CoopMatrices.
This is because for shaped types the verifier only checks element type
and shapes, whereas for any other arbitrary type it looks for an exact
match.
This patch does not enable the actual deserialization. This will be
done in a subsequent PR.
- try_emplace(Key) is shorter than insert({Key, nullptr}).
- try_emplace performs value initialization without value parameters.
- We overwrite values on successful insertion anyway.
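For illustration, a minimal sketch of the pattern being simplified (the map type and key are made up; only the `insert`/`try_emplace` calls reflect the change):
```cpp
#include "llvm/ADT/DenseMap.h"

void example(llvm::DenseMap<int, int *> &Map, int Key) {
  // Before: spell out the null mapped value explicitly.
  Map.insert({Key, nullptr});

  // After: try_emplace value-initializes the mapped value (nullptr for
  // pointers) and, like insert, does nothing if Key is already present.
  Map.try_emplace(Key);
}
```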
This PR makes `dump-pass-pipeline` pretty-print the dumped pipeline. For
large pipelines the current behavior produces a wall of text that is
hard to visually navigate.
For the command
```bash
mlir-opt --pass-pipeline="builtin.module(flatten-memref, expand-strided-metadata,func.func(arith-expand,func.func(affine-scalrep)))" --dump-pass-pipeline
```
Before:
```bash
Pass Manager with 3 passes:
builtin.module(flatten-memref,expand-strided-metadata,func.func(arith-expand{include-bf16=false include-f8e8m0=false},func.func(affine-scalrep)))
```
After:
```bash
Pass Manager with 3 passes:
builtin.module(
flatten-memref,
expand-strided-metadata,
func.func(
arith-expand{include-bf16=false include-f8e8m0=false},
func.func(
affine-scalrep
)
)
)
```
Another nice feature of this is that the pretty-printed string can still
be copy/pasted into `-pass-pipeline` using a quote:
```bash
$ bin/mlir-opt --dump-pass-pipeline test.mlir --pass-pipeline='
builtin.module(
flatten-memref,
expand-strided-metadata,
func.func(
arith-expand{include-bf16=false include-f8e8m0=false},
func.func(
affine-scalrep
)
)
)'
```
---------
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
This patch removes `inputVecSizesForLeadingDims` from the parameter list
of `createWriteOrMaskedWrite`. That argument is unnecessary - vector
sizes can be obtained from the `vecToStore` parameter. Since this doesn't
change behavior or test results, it's marked as NFC.
Additional cleanups:
* Renamed `vectorToStore` to `vecToStore` for consistency and brevity.
* Rewrote a conditional at the end of the function to use early exit,
improving readability:
```cpp
// BEFORE:
if (maskingRequired) {
  Value maskForWrite = ...;
  write = vector::maskOperation(builder, write, maskForWrite);
}
return write;

// AFTER:
if (!maskingRequired)
  return write;
Value maskForWrite = ...;
return vector::maskOperation(builder, write, maskForWrite);
```
This patch refactors two vectorization hooks in Vectorization.cpp:
* `createWriteOrMaskedWrite` gains a new parameter for write indices,
aligning it with its counterpart `createReadOrMaskedRead`.
* `vectorizeAsInsertSliceOp` is updated to reuse both of the above
hooks, rather than re-implementing similar logic.
CONTEXT
-------
This is effectively a refactoring of the logic for vectorizing
`tensor.insert_slice`. Recent updates added masking support:
* https://github.com/llvm/llvm-project/pull/122927
* https://github.com/llvm/llvm-project/pull/123031
At the time, reuse of the shared `create*` hooks wasn't feasible due to
missing parameters and overly rigid assumptions. This patch resolves
that and moves us closer to a more maintainable structure.
CHANGES IN `createWriteOrMaskedWrite`
-------------------------------------
* Introduces a clear distinction between the destination tensor and the
vector to store, via named variables like `destType`/`vecToStoreType`,
`destShape`/`vecToStoreShape`, etc.
* Ensures the correct rank and shape are used for attributes like
`in_bounds`. For example, the size of the `in_bounds` attr now matches
the source vector rank, not the tensor rank.
* Drops the assumption that `vecToStoreRank == destRank` - this doesn't
hold in many real examples.
* Deduces mask dimensions from `vecToStoreShape` (vector) instead of
`destShape` (tensor). (Eventually we should not require
`inputVecSizesForLeadingDims` at all - mask shape should be inferred.)
NEW HELPER: `isMaskTriviallyFoldable`
-------------------------------------
Adds a utility to detect when masking is unnecessary. This avoids
inserting redundant masks and reduces the burden on canonicalization to
clean them up later.
Example where masking is provably unnecessary:
```mlir
%2 = vector.mask %1 {
vector.transfer_write %0, %arg1[%c0, %c0, %c0, %c0, %c0, %c0]
{in_bounds = [true, true, true]}
: vector<1x2x3xf32>, tensor<9x8x7x1x2x3xf32>
} : vector<1x2x3xi1> -> tensor<9x8x7x1x2x3xf32>
```
Also, without this hook, the tests would be more complicated and require
more matching.
VECTORIZATION BEHAVIOUR
-----------------------
This patch preserves the current behaviour around masking and the use of
the `in_bounds` attribute. Specifically:
* `useInBoundsInsteadOfMasking` is set when no input vector sizes are
available.
* The vectorizer continues to infer vector sizes where needed.
Note: the computation of the `in_bounds` attribute is not always correct.
That issue is tracked here:
* https://github.com/llvm/llvm-project/issues/142107
This will be addressed separately.
TEST CHANGES
-----------
Only affects vectorization of:
* `tensor.insert_slice` (now refactored to use shared hooks)
Test diffs involve additional `arith.constant` Ops due to increased reuse
of shared helpers (which generate their own constants). This will be
cleaned up via constant caching (see #138265).
NOTE FOR REVIEWERS
------------------
This is a fairly substantial rewrite. You may find it easier to review
`createWriteOrMaskedWrite` as a new method rather than diffing
line-by-line.
TODOs (future PRs)
------------------
Further alignment of `createWriteOrMaskedWrite` and
`createReadOrMaskedRead`:
* Move `createWriteOrMaskedWrite` next to `createReadOrMaskedRead` (in
VectorUtils.cpp)
* Make `createReadOrMaskedRead` leverage `isMaskTriviallyFoldable`.
* Extend `isMaskTriviallyFoldable` with value-bounds-analysis. See the
updated test in transform-vector.mlir for an example that would
benefit from this.
* Address #142107
(*) This method will eventually be moved out of Vectorization.cpp, which
isn't the right long-term home for it.
Makes it possible to pass options to a pass inside a schedule.
The refactoring also ensures that the pass manager and pass are constructed
only once per `apply()` of the transform op, rather than once for each
target payload given to the op's `apply()`.
The current algorithm for detecting circular function calls scales quadratically, due to the linear scan of the functions vector performed for each element of the vector itself. This PR replaces that algorithm with an O(V + E) version based on Kahn's algorithm for topological sorting, where V is the number of functions and E is the number of function calls.
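For reference, a self-contained sketch of the Kahn-style approach applied to cycle detection in a call graph (not the actual pass code), assuming functions are numbered 0..V-1 and `calls[f]` lists the callees of `f`:
```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Returns true if the call graph contains a cycle, in O(V + E) time.
bool hasCircularCalls(std::size_t numFuncs,
                      const std::vector<std::vector<std::size_t>> &calls) {
  // Count how many times each function is called (its in-degree).
  std::vector<std::size_t> inDegree(numFuncs, 0);
  for (const auto &callees : calls)
    for (std::size_t callee : callees)
      ++inDegree[callee];

  // Seed the worklist with functions that are never called.
  std::queue<std::size_t> worklist;
  for (std::size_t f = 0; f < numFuncs; ++f)
    if (inDegree[f] == 0)
      worklist.push(f);

  // Peel off functions once all of their callers have been processed.
  std::size_t numProcessed = 0;
  while (!worklist.empty()) {
    std::size_t f = worklist.front();
    worklist.pop();
    ++numProcessed;
    for (std::size_t callee : calls[f])
      if (--inDegree[callee] == 0)
        worklist.push(callee);
  }

  // Any function left unprocessed is part of a cycle or called from one.
  return numProcessed != numFuncs;
}
```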
The `DeallocationState` class has been modified to keep a reference to an externally owned `SymbolTableCollection` instance, preserving the cached symbol tables across multiple insertions of deallocation instructions.
`RTDyldObjectLinkingLayer` currently creates a memory manager without any
parameters.
In this PR, the MemoryBuffer that is about to be emitted is passed to the
MemoryManager, so that the user can use it to configure the MemoryManager's
behaviour.
This PR adds `UnrollBroadcastPattern` to the `VectorUnroll` transform.
To support this, it also extends the `BroadcastOp` definition with
`VectorUnrollOpInterface`.
Use the spellings in the generated clause parser. The functions
`get<lang>ClauseKind` and `get<lang>ClauseName` are not yet updated.
The definitions of both clauses and directives now take a list of
"Spelling"s instead of a single string. For example
```
def ACCC_Copyin : Clause<[Spelling<"copyin">,
Spelling<"present_or_copyin">,
Spelling<"pcopyin">]> { ... }
```
A "Spelling" is a versioned string, defaulting to "all versions".
For background information see
https://discourse.llvm.org/t/rfc-alternative-spellings-of-openmp-directives/85507
Before this PR, a rank-m -> rank-n vector.shape_cast with m,n>1 was
lowered to extracts/inserts of single elements, so that a shape_cast on
a vector with N elements would always require N extracts/inserts. While
this is necessary in the worst-case scenario, it is sometimes possible to
use fewer, larger extracts/inserts. Specifically, the largest common
suffix of the source and result shapes can be extracted/inserted.
For example:
```mlir
%0 = vector.shape_cast %arg0 : vector<10x2x3xf32> to vector<2x5x2x3xf32>
```
has a common suffix of shape `2x3`. Before this PR, this would be lowered
to 60 extract/insert pairs with extracts of the form
`vector.extract %arg0 [a, b, c] : f32 from vector<10x2x3xf32>`. With
this PR it is 10 extract/insert pairs with extracts of the form
`vector.extract %arg0 [a] : vector<2x3xf32> from vector<10x2x3xf32>`.
This adds verifier checks for the scatter op to make sure the shapes of
the inputs and output are consistent with respect to the spec.
Signed-off-by: Tai Ly <tai.ly@arm.com>
The previous verifier checks did not correctly handle unranked operands.
For example, they could incorrectly assume the number of
`rankedOperandTypes` would be >= 2, which isn't the case when both a and
b are unranked.
This change simplifies these checks such that they only operate over the
intended a and b operands as opposed to the shift operand as well.
Bug description: Global variables and functions created during the
gpu.printf conversion to NVVM may contain debug info metadata from the
function containing the gpu.printf, which cannot be used outside of that
function.
Add `RuntimeVerifiableOpInterface` implementations for the following
ops. These were mostly copied from the respective memref
implementations. Only the part that deals with offsets and strides was
removed.
* `tensor.cast`: `memref.cast`
* `tensor.dim`: `memref.dim`
* `tensor.extract`: `memref.load`
* `tensor.insert`: `memref.store`
* `tensor.extract_slice`: `memref.subview`
This also applies a
[fix](33a26b9ca2)
for CMakeLists.txt. Below is the original commit description.
The issue occurs during a downstream pass which does dialect conversion,
where both
[`FuncOpConversion`](cde67b6663/mlir/lib/Conversion/FuncToLLVM/FuncToLLVM.cpp (L480))
and
[`SubviewFolder`](cde67b6663/mlir/lib/Dialect/MemRef/Transforms/ExpandStridedMetadata.cpp (L187))
are run together. The original starting IR is:
```mlir
module {
func.func @foo(%arg0: memref<100x100xf32>, %arg1: index, %arg2: index, %arg3: index, %arg4: index) -> memref<?x?xf32, strided<[100, 1], offset: ?>> {
%subview = memref.subview %arg0[%arg1, %arg2] [%arg3, %arg4] [1, 1] : memref<100x100xf32> to memref<?x?xf32, strided<[100, 1], offset: ?>>
return %subview : memref<?x?xf32, strided<[100, 1], offset: ?>>
}
}
```
After `FuncOpConversion` runs, the IR looks like:
```mlir
"builtin.module"() ({
"llvm.func"() <{CConv = #llvm.cconv<ccc>, function_type = !llvm.func<struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)> (ptr, ptr, i64, i64, i64, i64, i64, i64, i64, i64, i64)>, linkage = #llvm.linkage<external>, sym_name = "foo", visibility_ = 0 : i64}> ({
^bb0(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: i64, %arg3: i64, %arg4: i64, %arg5: i64, %arg6: i64, %arg7: i64, %arg8: i64, %arg9: i64, %arg10: i64):
%0 = "memref.subview"(<<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>>) <{operandSegmentSizes = array<i32: 1, 2, 2, 0>, static_offsets = array<i64: -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: -9223372036854775808, -9223372036854775808>, static_strides = array<i64: 1, 1>}> : (memref<100x100xf32>, index, index, index, index) -> memref<?x?xf32, strided<[100, 1], offset: ?>>
"func.return"(%0) : (memref<?x?xf32, strided<[100, 1], offset: ?>>) -> ()
}) : () -> ()
"func.func"() <{function_type = (memref<100x100xf32>, index, index, index, index) -> memref<?x?xf32, strided<[100, 1], offset: ?>>, sym_name = "foo"}> ({
}) : () -> ()
}) {llvm.data_layout = "", llvm.target_triple = ""} : () -> ()
```
The `<<UNKNOWN SSA VALUE>>`'s here are block arguments of a separate
unlinked block, which is disconnected from the rest of the IR (so not
only is the IR verifier-invalid, it can't even be parsed). This IR is
created by signature conversion in the dialect conversion infra.
Now `SubviewFolder` is applied, and the utility function here is called
on one of these disconnected block arguments, causing a crash.
The TestMemRefToLLVMWithTransforms pass is introduced to exercise the
bug, and it can be reused by other contributors in the future.
Co-authored-by: Rahul Kayaith <rkayaith@gmail.com>
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
The class "ClauseVal" actually represents a definition of an enumeration
value, and in itself it is not bound to any clause. Rename it to EnumVal
and add a comment clarifying how it's translated into an actual enum
definition in the generated source code.
There is no change in functionality.
This patch adds an assert to check that the result of `getConstantInt` is
non-null. Previously the code failed with a segmentation fault if
`getConstantInt` failed to look up the value. This primarily occurs when
the value is defined as an OpSpecConstant rather than an OpConstant.
This commit introduces `visitCallOperation` and `visitCallableOperation`
extension points in the sparse data flow analysis framework. This makes
it possible, for example, to make the analysis less conservative without
a lot of code duplication, propagating information even when not all the
call or return sites are known.
This patch fixes:
mlir/lib/Dialect/Tensor/IR/TensorOps.cpp:1680:37: error: comparison
of integers of different signs: 'int' and 'uint64_t' (aka 'unsigned
long') [-Werror,-Wsign-compare]
This fixes Tosa VariableOp to align with spec 1.0
- add var_shape attribute to store shape of variable type
- change type attribute to store element type of variable type
- add a builder so previous construction calls still work
- fix up level check of rank to be on variable type instead of initial
value which is optional
- add level check of size for variable type
- add lit tests for variable ops without initial values
- add lit test for variable op with fixed rank but unknown dimension
- add invalid lit test for variable op with unranked type
Signed-off-by: Tai Ly <tai.ly@arm.com>
Adds a few canonicalizers, folders, and rewrite patterns to tensor ops:
* tensor.insert folder: insert into a constant is replaced with a new
constant
* tensor.extract folder: extract from a parent tensor that was inserted
at the same indices is folded into the inserted value
* a rewrite pattern that replaces an extract of a collapse shape
with an extract of the source tensor (requires static source dimensions)
Signed-off-by: Asra Ali <asraa@google.com>
The main idea behind the change is to allow expand-of-collapse folds for
reshapes like `?x?xk` -> `?` (k>1). The rationale here is that the
expand op must have a coherent index/affine expression specified in its
`output_shape` argument (see example below), and if it doesn't, the IR
has already been invalidated at an earlier stage:
```
%c32 = arith.constant 32 : index
%div = arith.divsi %<some_index>, %c32 : index
%collapsed = tensor.collapse_shape %41#1 [[0], [1, 2], [3, 4]]
: tensor<9x?x32x?x32xf32> into tensor<9x?x?xf32>
%affine = affine.apply affine_map<()[s0] -> (s0 * 32)> ()[%div]
%expanded = tensor.expand_shape %collapsed [[0], [1, 2], [3]] output_shape [9, %div, 32, %affine]
: tensor<9x?x?xf32> into tensor<9x?x32x?xf32>
```
On the above assumption, adjust the routine in
`getReassociationIndicesForCollapse()` to allow dynamic reshapes beyond
just `?x..?x1x1x..x1` -> `?`. Dynamic subshapes introduce two kinds of
issues:
1. n>2 consecutive dynamic dimensions in the source shape cannot be
collapsed together into 1<k<n neighboring dynamic dimensions in the
target shape, since there'd be more than one suitable reassociation
(example: `?x?x10x? into ?x?`)
2. When figuring out static subshape reassociations based on products,
there are cases where a static dimension is collapsed with a dynamic
one, and should therefore be skipped when comparing products of source &
target dimensions (e.g. `?x2x3x4 into ?x12`)
To address 1, we should detect such sequences in the target shape before
assigning multiple dynamic dimensions into the same index set. For 2, we
take note that a static target dimension was preceded by a dynamic one
and allow an "offset" subshape of source static dimensions, as long as
there's an exact sequence for the target size later in the source shape.
This PR aims to address all reshapes that can be determined based purely
on shapes (and the original reassociation maps, as done in
`ComposeExpandOfCollapseOp::findCollapsingReassociation`). It doesn't
seem possible to fold all qualifying dynamic shape patterns in a
deterministic way without looking into affine expressions
simultaneously. That would be difficult to maintain in a single general
utility, so a path forward would be to provide dialect-specific
implementations for Linalg/Tensor.
Signed-off-by: Artem Gindinson <gindinson@roofline.ai>
---------
Signed-off-by: Artem Gindinson <gindinson@roofline.ai>
Co-authored-by: Ian Wood <ianwood2024@u.northwestern.edu>
Currently, some control flow patterns cannot be structurized into
existing SPIR-V MLIR constructs, e.g., conditional early exits (break).
Since support for early exits cannot currently be added
(https://github.com/llvm/llvm-project/pull/138688#pullrequestreview-2830791677),
this patch allows the structurizer to be disabled, keeping the control
flow unstructurized. By default, the control flow is structurized.
Address a TODO regarding the recomputation of symbol tables. The signature of the `getFuncOpsOrderedByCalls` function is modified to receive the collection of cached symbol tables.
This patch wraps `populateLowerContractionToSMMLAPatternPatterns` into a
new TD Op `apply_patterns.arm_neon.vector_contract_to_i8mm`.
It also removes the "test-lower-to-arm-neon" pass.