ArrayRef has a constructor that accepts std::nullopt. This
constructor dates back to the days when we still had llvm::Optional.
Since using std::nullopt outside the context of std::optional is
something of an abuse and not intuitive to newcomers, I would like to
move away from this constructor and eventually remove it.
This patch takes care of the mlir side of the migration, starting with
straightforward places where I see ArrayRef or ValueRange nearby.
Note that ValueRange has a constructor that forwards arguments to an
ArrayRef constructor.
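A minimal sketch of the kind of mechanical change involved at call sites (illustrative only; the actual call sites vary):
```
#include "llvm/ADT/ArrayRef.h"
#include <optional>

void takesArrayRef(llvm::ArrayRef<int> xs) { (void)xs.size(); }

void caller() {
  // Before: relies on the ArrayRef(std::nullopt_t) constructor.
  takesArrayRef(std::nullopt);
  // After: spell the empty range explicitly.
  takesArrayRef({});
  // or: takesArrayRef(llvm::ArrayRef<int>());
}
```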
Cf. https://discourse.llvm.org/t/mlir-dead-code-analysis/67568/10
Custom analysis passes will not work properly unless both
DeadCodeAnalysis and SparseConstantPropagation are loaded to the
DataFlowSolver. This is intended behavior, but surprising to many users
as shown in the thread. In lieu of a longer-term fix (which I am not
knowledgeable enough to implement myself, yet), this commit adds a
helper function that loads these two analyses and provides breadcrumbs
pointing to an explanation of the problem. The existing places in
the codebase where these two analyses are loaded in order to run other,
unrelated analyses are updated to use the helper.
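For reference, a minimal sketch of what callers currently do by hand and what the helper wraps (the helper's exact name is not spelled out here):
```
#include "mlir/Analysis/DataFlow/ConstantPropagationAnalysis.h"
#include "mlir/Analysis/DataFlow/DeadCodeAnalysis.h"
#include "mlir/Analysis/DataFlowFramework.h"

using namespace mlir;

// Both analyses must be loaded for sparse analyses to produce useful
// results; forgetting either one is the surprise discussed in the thread.
LogicalResult runWithBaseline(Operation *op) {
  DataFlowSolver solver;
  solver.load<dataflow::DeadCodeAnalysis>();
  solver.load<dataflow::SparseConstantPropagation>();
  // solver.load<MyCustomAnalysis>();  // hypothetical custom analysis
  return solver.initializeAndRun(op);
}
```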
---------
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
Co-authored-by: Oleksandr "Alex" Zinenko <azinenko@amd.com>
…ransferInterface out of LinalgStructuredInterface and use that for
PadTilingInterface
Along the way, a bug was found in the handling of scalar values; fix it
and add a test.
When no offload targets are specified, flang will ignore "target"
constructs, but not "target data" constructs. This patch makes the
behavior consistent across all offload-related operations.
While ignoring "target" may produce semantically incorrect code, it may
still be a useful debugging tool.
…t up to the transform dialect
This revision revisits the padding transformation from first principles
and prepares it to work more generally with TilingInterface.
Compared to transform.structured.pad it has the following differences:
- no support for nofold, copy-back, transpose and hoisting: these have
been carried by the padding op in the very early days of StructuredOps
and have since then been separated out as independent transformations
that compose.
- no conflated static bounding box derivation attempts:
pad_tiling_interface composes more naturally with or without tiling.
- properly derives padding sizes on outputs where multiple dimensions
contribute: this is not supported in transform.structured.pad
- geared towards supporting TilingInterface once the proper control
mechanisms are supported through a PadSizeComputationFunction (supports
LinalgOp by default)
This will gradually replace transform.structured.pad as it is fleshed
out and tested more comprehensively.
In the future this should be moved out of a specific Linalg
implementation file and into a more general "Structured" file.
New gfx950 MFMA allows bf16 operands.
c0cc81cdc0/llvm/include/llvm/IR/IntrinsicsAMDGPU.td (L3434)
When running `amdgpu-to-rocdl`, the current logic always converts bf16
to i16, which fails to compile for newer bf16 MFMAs such as
`v_mfma_f32_16x16x32bf16`.
The backend expects bf16 operands for those newer MFMAs. This patch
fixes that.
CC: @krzysz00 @dhernandez0 @giuseros @antiagainst @kuhar
This patch updates the following ops to use `result` (instead of `res`)
as the name for their result argument:
* `vector.scalable.insert`
* `vector.scalable.extract`
* `vector.insert_strided_slice`
This change ensures naming consistency with other ops in the `vector`
dialect. It addresses part of:
* https://github.com/llvm/llvm-project/issues/131602
The Arith dialect includes patterns that canonicalize a sequence of:
- trunci(shrui(mul(sext(x), sext(y)), c)) -> mulsi_extended(x, y)
- trunci(shrui(mul(zext(x), zext(y)), c)) -> mului_extended(x, y)
These patterns return the high word of an extended multiplication and are
only valid when the shift amount equals the bit width of the original
operands. This check was missing, leading to incorrect canonicalizations
when the shift amount was less than the bit width.
For example, the following code:
```
%x = arith.extui %a: i32 to i33
%y = arith.extui %b: i32 to i33
%m = arith.muli %x, %y: i33
%c1 = arith.constant 1: i33
%sh = arith.shrui %m, %c1 : i33
%hi = arith.trunci %sh: i33 to i32
```
would incorrectly be canonicalized to:
```
_, %hi = arith.mului_extended %a, %b : i32
```
This commit removes the faulty canonicalizations, since they are not
believed to be generally beneficial (cf. the discussion of the alternative
https://github.com/llvm/llvm-project/pull/144787, which fixes the
canonicalizations instead).
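For comparison, a sketch of the only shape where the rewrite is sound, i.e. when the operands are extended to twice their width and the shift amount equals the original bit width:
```
%x = arith.extui %a : i32 to i64
%y = arith.extui %b : i32 to i64
%m = arith.muli %x, %y : i64
%c32 = arith.constant 32 : i64
%sh = arith.shrui %m, %c32 : i64
%hi = arith.trunci %sh : i64 to i32
// may be rewritten as: _, %hi = arith.mului_extended %a, %b : i32
```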
Before this PR, users had to pass the "old" block argument when
replacing the uses of a block argument in a newly converted block. Users
can now pass the actual block argument that should be replaced.
Note for LLVM integration: Make sure to pass the current block argument
instead of the old one.
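A rough sketch of the affected call-site pattern (simplified; the helper and variable names here are placeholders):
```
#include "mlir/Transforms/DialectConversion.h"

using namespace mlir;

static void convertEntryBlock(ConversionPatternRewriter &rewriter,
                              Region &region,
                              TypeConverter::SignatureConversion &conv,
                              Value replacement) {
  Block *newBlock = rewriter.applySignatureConversion(&region.front(), conv);
  // Pass the argument of the newly converted block; previously the old
  // block's argument had to be passed instead.
  rewriter.replaceUsesOfBlockArgument(newBlock->getArgument(0), replacement);
}
```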
This patch makes the following changes:
- Add a `ValueRange typeDynDims` argument to
`linalg::makeComposedPadHighOp`, making it possible to pad a tensor with
dynamic dimensions using `tensor::createPadHighOp`.
- Add a `DenseMap<std::pair<unsigned, unsigned>, OpFoldResult>
sizeToPadTo;` option to `LinalgPaddingOptions`. This option sets the size
to use when padding a given dimension of a given operand, making it
possible to pad operands even when they do not have a constant upper
bounding box (see the sketch after this list). If no value is provided,
the constant upper bound is used by default.
- Add a `use_prescribed_tensor_shapes` option to
`transform.structured.pad`. If set to true, `tensor.dim` is used to
compute the size of the padded dimension instead of the constant upper
bound.
- This patch also changes how `linalg::rewriteAsPaddedOp` computes the
padded shape, using the newly added options in `LinalgPaddingOptions`.
- Finally it adds tests verifying the behavior.
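A rough sketch of how the new `sizeToPadTo` option might be populated programmatically (the key order, i.e. which element is the operand index and which is the dimension, is an assumption here, not the documented API):
```
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"

using namespace mlir;

// `operandIdx`/`dim` identify which operand dimension to pad;
// `prescribedSize` is an OpFoldResult (e.g. the result of a tensor.dim op)
// used instead of the constant upper bound.
static void setPrescribedPadSize(linalg::LinalgPaddingOptions &options,
                                 unsigned operandIdx, unsigned dim,
                                 OpFoldResult prescribedSize) {
  options.sizeToPadTo[{operandIdx, dim}] = prescribedSize;
}
```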
From the description of the gpu.shuffle operation, shuffle up/down rotates
values in the subgroup because it applies a modulo to the shifted value to
calculate the result lane ID. This is inconsistent with the definition of
SPIR-V shuffle up/down and with NVVM's data-movement definitions within a
subgroup.
In NVVM, it says
"If the computed source lane index j is in range, the returned i32 value
will be the value of %a from lane j; otherwise, it will be the value
of %a from the current thread."
It will keep the original value if the result lane ID is out of range.
In SPIR-V OpGroupNonUniformShuffleUp and OpGroupNonUniformShuffleDown,
it says
"The resulting value is undefined if Delta is greater than the current
invocation’s id within the scope or if the identified invocation is not
in scope restricted tangle."
It's an undefined value if the result lane ID is out of range.
In short, neither specification involves circular movement for shuffle
up/down. This patch removes the circular movement from gpu.shuffle
up/down and lowers gpu.shuffle up/down directly to SPIR-V
OpGroupNonUniformShuffleUp and OpGroupNonUniformShuffleDown.
References:
https://docs.nvidia.com/cuda/archive/12.2.1/nvvm-ir-spec/index.html#data-movement
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformShuffleUp
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformShuffleDown
This PR introduces the `vector.to_elements` op, which decomposes a
vector into its scalar elements. This operation is symmetrical to the
existing `vector.from_elements`.
Examples:
```
// Decompose a 0-D vector.
%0 = vector.to_elements %v0 : vector<f32>
// %0 = %v0[0]
// Decompose a 1-D vector.
%0:2 = vector.to_elements %v1 : vector<2xf32>
// %0#0 = %v1[0]
// %0#1 = %v1[1]
// Decompose a 2-D vector.
%0:6 = vector.to_elements %v2 : vector<2x3xf32>
// %0#0 = %v2[0, 0]
// %0#1 = %v2[0, 1]
// %0#2 = %v2[0, 2]
// %0#3 = %v2[1, 0]
// %0#4 = %v2[1, 1]
// %0#5 = %v2[1, 2]
```
This op is aimed at reducing code size when modeling "structured" vector
extractions and simplifying canonicalizations of large sequences of
`vector.extract` and `vector.insert` ops into `vector.shuffle` and other
sophisticated ops that can re-arrange vector elements.
### Description
This patch improves the folding efficiency of `vector.insert` and
`vector.extract` operations by not returning early after successfully
converting dynamic indices to static indices.
This PR also renames the test pass `TestConstantFold` to
`TestSingleFold` and adds comprehensive documentation explaining the
single-pass folding behavior.
### Motivation
Since the `OpBuilder::createOrFold` function only calls `fold` **once**,
the current `fold` methods of `vector.insert` and `vector.extract` may
leave the op in a state that can be folded further. For example,
consider the following un-folded IR:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%c0 = arith.constant 0 : index
%e2 = vector.extract %v1[%c0] : f32 from vector<128xf32>
```
If we use `createOrFold` to create the `vector.extract` op, then the
result will be:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%e2 = vector.extract %v1[0] : f32 from vector<128xf32>
```
But this is not the optimal result: `createOrFold` should have returned
`%e1`.
The reason is that `fold` returns immediately after
`extractInsertFoldConstantOp` succeeds, causing the subsequent folding
logic to be skipped.
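A schematic sketch of the structural change to the `fold` method (the helper names below are placeholders, not the exact upstream functions):
```
OpFoldResult ExtractOp::fold(FoldAdaptor adaptor) {
  // Fold dynamic indices (e.g. arith.constant operands) into static ones.
  // This modifies the op in place.
  bool foldedIndices = succeeded(foldDynamicIndicesToStatic(*this, adaptor));

  // Previously the method returned right here when the indices were folded,
  // so the folders below never ran within the same createOrFold call.
  if (OpFoldResult res = foldExtractFromInsert(*this))
    return res;

  // Report the in-place index update only if no stronger fold applied.
  if (foldedIndices)
    return getResult();
  return {};
}
```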
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
Following the addition of TensorLike and BufferLike type interfaces (see
00eaff3e9c), introduce minimal changes
required to bufferize a custom tensor operation into a custom buffer
operation.
To achieve this, new interface methods are added to the TensorLike type
interface that abstract away the differences between the existing
(tensor -> memref) and custom conversions.
The scope of the changes is intentionally limited (for example,
BufferizableOpInterface is untouched) in order to first understand the
basics and reach consensus design-wise.
---
Notable changes:
* mlir::bufferization::getBufferType() returns BufferLikeType (instead
of BaseMemRefType)
* ToTensorOp / ToBufferOp operate on TensorLikeType / BufferLikeType.
Operation argument "memref" renamed to "buffer"
* ToTensorOp's tensor type inferring builder is dropped (users now need
to provide the tensor type explicitly)
`unresolvedMaterializations` is a mapping from
`UnrealizedConversionCastOp` to `UnresolvedMaterializationRewrite`. This
mapping is needed to find the correct type converter for an unresolved
materialization.
With this commit, `unresolvedMaterializations` is updated immediately
when an op is being erased. This also cleans up the code base a bit:
`SingleEraseRewriter` is now used only during the "cleanup" phase and no
longer needed as a field of `ConversionRewriterImpl`.
This commit is in preparation for the One-Shot Dialect Conversion
refactoring: `allowPatternRollback = false` will in the future trigger
immediate materialization of all IR changes.
Generally, bufferization should be able to create a memref from a tensor
without needing to know more than just an mlir::Type. Thus, change
BufferizationOptions::UnknownTypeConverterFn to accept just a type
(mlir::TensorType for now) instead of an mlir::Value. Additionally, apply
the same rationale to the getMemRefType() helper function.
Both changes are prerequisites to enable custom types support in
one-shot bufferization.
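A sketch of the shape of the signature change (simplified; the exact parameter list and return type in the upstream header may differ):
```
#include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"
#include <functional>

using namespace mlir;
using namespace mlir::bufferization;

// Before (roughly): the callback received a Value and had to reach through
// it to get at the tensor type.
//   std::function<BaseMemRefType(Value value, Attribute memorySpace,
//                                const BufferizationOptions &options)>
// After (roughly): the callback is handed the tensor type directly, so it
// also works for types that are not tied to an SSA value.
using MyUnknownTypeConverterFn = std::function<BaseMemRefType(
    TensorType type, Attribute memorySpace,
    const BufferizationOptions &options)>;
```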
The expansion of `memref.atomic_rmw` into a `memref.generic_atomic_rmw`
for floating-point min/max operations is no longer necessary as those
are now supported by the LLVM dialect and LLVM IR.
Furthermore, combining this expansion with direct lowering of
`generic_atomic_rmw` could lead to invalid LLVM dialect IR with
`cmpxchg` operating on floating-point values, which it does not support.
The traversal order in sharding propagation was hard-coded. This PR
provides options to the pass to select a suitable order:
- forward-only
- backward-only
- forward-backward
- backward-forward
Default is the previous behavior (backward-forward).
By defining `CExpressionInterface`, we move the side-effect detection
logic from `emitc.expression` into the individual operations
implementing the interface, allowing operations to gradually tune their
side-effect reporting.
It also allows checking each operation individually for side effects.
Test cleanup:
1) Separate layout.mlir from ops.mlir for layout-related tests.
2) Remove lane layout for ops working at work-item scope.
3) Remove redundant tests in create_tdesc/update_tdesc/prefetch.
4) Remove "test_" from all test function names.
The new interfaces provide getters and setters for the weight
information about the branches of BranchOpInterface and
RegionBranchOpInterface operations.
These interfaces follow the same design as the LLVM dialect's
BranchWeightOpInterface.
The plan is to produce this information in Flang, e.g. mark code that is
most probably "cold" as such and allow LLVM to order basic blocks
accordingly. An example of such code is the copy loops generated for
array repacking: we can mark them as "cold" on the assumption that the
copy will not happen dynamically. If the copy does happen, its overhead
is probably high enough that we need not care about the small overhead
of jumping to the "cold" code and fetching it.
Add support for load/store with chunk_size, which requires special
consideration for the operand blocking since offsets and masks are
n-D while tensors are (n+1)-D. Supported operations include create_tdesc,
update_tdesc, load, store, and prefetch.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
At the MLIR level, unsigned integers and signless integers are different
types; when looking up the two types in the type definition cache, they
do not match.
Hence, translating a module that contains both unsigned and signless
integers produces the same type declaration twice (something like
OpTypeInt 32 0), which is not permitted in SPIR-V, and such generated
modules fail validation.
This patch solves the problem by mapping unsigned integer types to
signless integer types before looking them up in the type definition
cache.
---------
Signed-off-by: Davide Grohmann <davide.grohmann@arm.com>
Whereas backward-slice matching only supports limiting traversal by
specifying the desired depth level, this pull request introduces support
for limiting traversal with a nested matcher (and adds forward-slice
matching as well). It also adds support for variadic operators, including
`anyOf` and `allOf`. Rather than simply stopping traversal when an
operation named foo is encountered, one can now define a matcher that
specifies different exit conditions. The variadic support implementation
within mlir-query is very similar to clang-query's.
We have been confusingly, and arguably incorrectly, lowering `m**imumf`
atomic RMW operations in the MemRef dialect to `fm**` atomic RMW
operations in the LLVM dialect, which have different NaN-propagation
semantics: `m**imumf` propagates NaNs from either operand, whereas
`fm**`, which lowers to the `fm**num` intrinsic, returns the non-NaN
operand. This also contradicts the lowering of the `arith.m**imumf` and
`arith.m**numf` operations.
Change the lowering to match the terminology in arith.
Add tests for these lowerings.
Keep a debug message in case of surprising behavior downstream (the code
may be producing more NaNs now).
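To make the NaN-propagation difference concrete, a schematic max-case example (f32 element type, 1-D memref; names are arbitrary):
```
// memref.atomic_rmw with maximumf semantics: if either the stored value or
// %val is NaN, the stored result must be NaN (matching arith.maximumf).
%old = memref.atomic_rmw maximumf %val, %buf[%i] : (f32, memref<8xf32>) -> f32
// Lowering this to an LLVM-dialect fmax atomic (maxnum-style semantics,
// which returns the non-NaN operand) silently changes the NaN behavior,
// hence the change to use the maximum-style lowering instead.
```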