Patterns in `LowerContractionToSMMLAPattern` are designed to handle
vector-to-matrix multiplication but not matrix-to-vector. This leads to
the following error when processing `rhs` with rank < 2:
```
iree-compile: /usr/local/google/home/kooljblack/code/iree-build/llvm-project/tools/mlir/include/mlir/IR/BuiltinTypeInterfaces.h.inc:268: int64_t mlir::detail::ShapedTypeTrait<mlir::VectorType>::getDimSize(unsigned int) const [ConcreteType = mlir::VectorType]: Assertion `idx < getRank() && "invalid index for shaped type"' failed.
```
This patch explicitly checks the rhs rank and fails to match in cases
that cannot be processed.
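The guard described above can be modelled in a few lines. This is a minimal Python sketch, not the actual C++ pattern: `rhs_dims_or_fail` is a hypothetical helper, and returning `None` stands in for a pattern match failure.

```python
from typing import Optional, Sequence, Tuple


def rhs_dims_or_fail(rhs_shape: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Return the first two rhs dims, or None to signal a match failure.

    Hypothetical helper: it only illustrates checking the rank before
    indexing into the shape.
    """
    # Check the rank first. Asking for dimension 1 of a rank-1 shape is the
    # kind of access that tripped the `idx < getRank()` assertion.
    if len(rhs_shape) < 2:
        return None
    return rhs_shape[0], rhs_shape[1]
```

With this ordering, a rank-1 `rhs` is rejected cleanly instead of reaching the out-of-range dimension query.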
This ports https://github.com/openxla/xla/pull/10503 by @pearu. The new
implementation matches mpmath's results for most inputs; see caveats in
the linked pull request. In addition to the filecheck test here, the
accuracy was tested with XLA's complex_unary_op_test and its MLIR
emitters.
Summary:
These entries are generic for offloading with the new driver now. Having
the `omp` prefix was a historical artifact and is confusing when used
for CUDA. This patch just renames them for now, future patches will
rework the binary format to make it more common.
Added lowering support for the IS_DEVICE_PTR and HAS_DEVICE_ADDR clauses
of the OMP TARGET directive, along with related tests.
The IS_DEVICE_PTR and HAS_DEVICE_ADDR clauses apply to the OMP TARGET
directive. The OpenMP spec states:
`The **is_device_ptr** clause indicates that its list items are device
pointers.`
`The **has_device_addr** clause indicates that its list items already
have device addresses and therefore they may be directly accessed from a
target device.`
In contrast, the USE_DEVICE_PTR and USE_DEVICE_ADDR clauses apply to the
OMP TARGET DATA directive, for which the OpenMP spec states:
`Each list item in the **use_device_ptr** clause results in a new list
item that is a device pointer that refers to a device address`
`Each list item in a **use_device_addr** clause that is present in the
device data environment is treated as if it is implicitly mapped by a
map clause on the construct with a map-type of alloc`
This MR adds the `lower-vector-multi-reduction` pass to lower the
vector.multi_reduction operation.
While the Transform Dialect includes an operation,
`transform.apply_patterns.vector.lower_multi_reduction`, intended for a
similar purpose, its utility is limited to projects that have adopted
the Transform Dialect. Recognizing that not all projects are equipped to
integrate this dialect, the proposed pass serves as a vital standalone
alternative. It ensures that projects solely dependent on the
traditional pass infrastructure can also benefit from the optimized
lowering of the `multi_reduction` operation.
---------
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
Fix a typo bug in AffineExprVisitor for the WalkResult return case. This
didn't show up immediately because most walks in the tree didn't
use the walk result.
This PR adds support for lowering the following Math operations to
`libm` calls:
* `math.absf` -> `fabsf, fabs`
* `math.exp` -> `expf, exp`
* `math.exp2` -> `exp2f, exp2`
* `math.fma` -> `fmaf, fma`
* `math.log` -> `logf, log`
* `math.log2` -> `log2f, log2`
* `math.log10` -> `log10f, log10`
* `math.powf` -> `powf, pow`
* `math.sqrt` -> `sqrtf, sqrt`
These operations are direct members of `libm`, and do not seem to
require any special manipulations on their operands.
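The mapping above amounts to a lookup keyed on the op name and the float width. The following Python sketch models that lookup only; the table is copied from the list above, while `LIBM_CALLEES` and `libm_callee` are hypothetical names, and returning `None` for other widths is an assumption of the sketch rather than a statement about the real pass.

```python
from typing import Optional

# (f32 callee, f64 callee) per op, copied from the list above.
LIBM_CALLEES = {
    "math.absf": ("fabsf", "fabs"),
    "math.exp": ("expf", "exp"),
    "math.exp2": ("exp2f", "exp2"),
    "math.fma": ("fmaf", "fma"),
    "math.log": ("logf", "log"),
    "math.log2": ("log2f", "log2"),
    "math.log10": ("log10f", "log10"),
    "math.powf": ("powf", "pow"),
    "math.sqrt": ("sqrtf", "sqrt"),
}


def libm_callee(op_name: str, bit_width: int) -> Optional[str]:
    """Pick the libm function name for an op at a given float width."""
    callees = LIBM_CALLEES.get(op_name)
    if callees is None:
        return None  # op is not in the table
    if bit_width == 32:
        return callees[0]
    if bit_width == 64:
        return callees[1]
    return None  # assumption: other widths are not handled in this sketch
```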
Disallows initialization of scalable vectors with an attribute of
arbitrary values, e.g.:
```mlir
%c = arith.constant dense<[0, 1]> : vector<[2] x i32>
```
Initialization using vector splats remains allowed (i.e. when all the
init values are identical):
```mlir
%c = arith.constant dense<[1, 1]> : vector<[2] x i32>
```
Note: This is a re-upload of #86178
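The rule can be stated as a small predicate. This is a Python model of the check, not the verifier code; `is_allowed_scalable_init` is a hypothetical name, and the values are plain Python integers standing in for the dense attribute's elements.

```python
from typing import Sequence


def is_allowed_scalable_init(values: Sequence[int], is_scalable: bool) -> bool:
    """Model of the rule above: scalable vectors accept only splat inits."""
    if not is_scalable:
        # Fixed-size vectors may be initialized with arbitrary values.
        return True
    # A splat means every init value is identical.
    return len(set(values)) <= 1
```

For the two examples above, `dense<[0, 1]>` is rejected and `dense<[1, 1]>` is accepted when the vector type is scalable.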
Followup to this discussion:
https://github.com/llvm/llvm-project/pull/80251#discussion_r1535599920.
The previous debug importer was correct but inefficient. For cases with
mutual recursion that contain more than one back-edge, each back-edge
would result in a new translated instance. This is because the previous
implementation never caches any translated result with unbound
self-references. This means all translation inside a recursive context
is performed from scratch, which will incur repeated run-time cost as
well as repeated attribute sub-trees in the translated IR (differing
only in their `recId`s).
This PR refactors the importer to handle caching inside a recursive
context.
- In the presence of unbound self-refs, the translation result is cached
in a separate cache that keeps track of the set of dependent unbound
self-refs.
- A dependent cache entry is valid only when all the unbound self-refs
are in scope. Whenever a cached entry goes out of scope, it will be
removed the next time it is looked up.
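The two rules above can be sketched with a pair of dictionaries. This is a simplified Python model under stated assumptions: the class and method names are hypothetical, nodes and results are plain strings, and scope tracking is reduced to a set of `recId`s.

```python
class RecursiveTranslationCache:
    """Toy model of a cache with entries that depend on unbound self-refs."""

    def __init__(self):
        self.in_scope = set()   # unbound self-refs currently in scope
        self.plain = {}         # node -> result (no unbound self-refs)
        self.dependent = {}     # node -> (result, frozenset of self-refs)

    def enter(self, rec_id):
        self.in_scope.add(rec_id)

    def exit(self, rec_id):
        self.in_scope.discard(rec_id)

    def store(self, node, result, unbound_self_refs=()):
        if unbound_self_refs:
            # Keep track of which self-refs this result depends on.
            self.dependent[node] = (result, frozenset(unbound_self_refs))
        else:
            self.plain[node] = result

    def lookup(self, node):
        if node in self.plain:
            return self.plain[node]
        entry = self.dependent.get(node)
        if entry is None:
            return None
        result, deps = entry
        if deps <= self.in_scope:
            return result  # valid: every dependency is still in scope
        # Stale entry: drop it lazily, at lookup time.
        del self.dependent[node]
        return None
```

The lazy removal mirrors the second bullet: leaving a recursive context does not scan the cache, and stale entries are discarded only when they are next looked up.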
This patch adds the `convertInstruction` and `getSupportedInstructions`
methods to `LLVMImportInterface`, allowing any non-LLVM dialect to
specify how to import LLVM IR instructions and to override the default
import of LLVM instructions.
This commit adds a `ValueBoundsOpInterface` implementation for
`arith.select`. The implementation is almost identical to `scf.if`
(#85895), but there is one special case: if the condition is a shaped
value, the selection is applied element-wise and the result shape can be
inferred from either operand.
Note: This is a re-upload of #86383.
This commit adds support for `scf.if` to `ValueBoundsConstraintSet`.
Example:
```
%0 = scf.if ... -> index {
scf.yield %a : index
} else {
scf.yield %b : index
}
```
The following constraints hold for %0:
* %0 >= min(%a, %b)
* %0 <= max(%a, %b)
Such constraints cannot be added to the constraint set; min/max is not
supported by `IntegerRelation`. However, if we know which one of %a and
%b is larger, we can add constraints for %0. E.g., if %a <= %b:
* %0 >= %a
* %0 <= %b
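The case split above can be sketched as follows. This is a symbolic Python model, not the `ValueBoundsConstraintSet` API: `if_result_bounds` is a hypothetical helper, and `known_le` is an assumed oracle that answers whether one value is provably less than or equal to another.

```python
from typing import Callable, Optional, Tuple


def if_result_bounds(
    then_val: str,
    else_val: str,
    known_le: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (lower, upper) bounds for the scf.if result, if derivable."""
    if known_le(then_val, else_val):
        return then_val, else_val
    if known_le(else_val, then_val):
        return else_val, then_val
    # The ordering is unknown, and min/max cannot be expressed as
    # constraints, so no bound is added.
    return None
```

Usage: with `facts = {("%a", "%b")}` and `known_le = lambda x, y: (x, y) in facts`, calling `if_result_bounds("%a", "%b", known_le)` yields `("%a", "%b")`, matching the `%a <= %b` example above.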
This commit required a few minor changes to the
`ValueBoundsConstraintSet` infrastructure, so that values can be
compared while we are still in the process of traversing the IR/adding
constraints.
Note: This is a re-upload of #85895, which was reverted. The bug that
caused the failure was fixed in #87859.
Emitting trivial getters that amount to `(*this)->getOperand(1)` or
`getProperties().foo` out-of-line is a pretty significant performance
hit on these basic MLIR APIs for manipulating ops (3-4x). Emit them
inline (without adding additional dependencies to header files).
The lowering of n-D vector.extract/insert ops to LLVM is not supported,
but if one of these accidentally reaches the vector-to-llvm conversion
patterns, we end up with a rather puzzling crash. This PR fixes that
crash and gracefully bails out in those cases.
This commit removes the no longer required bitcast inserting pattern in
LLVM dialect's type consistency pattern. This was previously required to
enable Mem2Reg and SROA to promote accesses that had different types.
Recent changes to both passes added direct support for this feature to
them, so the pattern has no further use.
This patch separates the lowering dispatch for host and target devices.
For the target device, if the current operation is neither a top-level
operation (e.g. omp.target) nor inside a target device code region, it
is ignored, since it belongs to the host code.
This is an alternative approach to #84611, the new test in this PR was
taken from there.
In scalable code it is very common to have constant multiples of vscale,
e.g. `4 * vscale`. This updates `arith.muli` to pretty print the result
name in cases like this, so `4 * vscale` would be `%c4_vscale`.
This makes reading IR dumps of scalable code a little nicer.
Adds two new CMake functions to query the host system:
* `check_hwcap`,
* `check_emulator`.
Together, these functions are used to check whether a given set of MLIR
integration tests require an emulator. If yes, then the corresponding
CMake var that defines the required emulator executable is also checked.
`check_hwcap` relies on ELF_HWCAP for discovering CPU features from
userspace on Linux systems. This is the recommended approach for Arm
CPUs running on Linux as outlined in this blog post:
* https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/runtime-detection-of-cpu-features-on-an-armv8-a-cpu
Other operating systems (e.g. Android) and CPU architectures will
most likely require some other approach. Right now these new hooks are
only used for SVE and SME integration tests.
This relands #86489 with the following changes:
* Replaced:
`set(hwcap_test_file ${CMAKE_BINARY_DIR}/${CMAKE_FILES_DIRECTORY}/hwcap_check.c)`
with:
`set(hwcap_test_file ${CMAKE_BINARY_DIR}/temp/hwcap_check.c)`
The former would trigger an infinite loop when running `ninja`
(after the initial CMake configuration).
* Fixed commit msg. Previous one was taken from the initial GH PR
commit rather than the final re-worked solution (missed this when
merging via GH UI).
* A couple more NFCs/tweaks.
This commit extends the folders of chainable casts (bitcast and
addrspacecast) to ensure that they fold a chain of the same casts into a
single cast.
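The folding can be modelled on a list of casts applied in order. This is a simplified Python sketch, not the LLVM dialect folder: `fold_cast_chain` is a hypothetical name, type names are placeholder strings, and the sketch does not model removing a cast whose result type equals its source type.

```python
from typing import List, Tuple


def fold_cast_chain(casts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse runs of the same cast kind into a single cast.

    Each entry is (cast_kind, result_type), listed in application order.
    """
    folded: List[Tuple[str, str]] = []
    for kind, result_type in casts:
        if folded and folded[-1][0] == kind:
            # Same cast kind as the previous one: only the final
            # result type matters, so overwrite it.
            folded[-1] = (kind, result_type)
        else:
            folded.append((kind, result_type))
    return folded
```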
Additionally cleans up the canonicalization test file, as this used some
outdated constructs.
This commit relaxes Mem2Reg's type equality requirement for the LLVM
dialect's load and store operations. For now, we only allow loads to be
promoted if the reaching definition can be cast into a value of the
target type.
For stores, the same conversion casting check is applied and we ensure
that their result is properly cast to the type of the memory slot.
This is necessary to satisfy assumptions of the general mem2reg pass, as
it creates block arguments with the types of the memory slot.
This relands https://github.com/llvm/llvm-project/pull/87504
As part of this extension, this change also does some general cleanup:
1) Make all the methods take `RewriterBase` as arguments instead of
creating their own builders that tend to crash when used within
pattern rewrites.
2) Split `coalesePerfectlyNestedLoops` into two separate methods, one
for `scf.for` and the other for `affine.for`. The templatization didn't
seem to be buying much there.
Also includes a general cleanup of tests.
ODS was still generating the old `Operation::setAttr` hooks for ODS
methods for setting attributes, when the backing implementation of the
attributes was changed to properties. No idea how this wasn't noticed
until now.
Add `requiresReplacedValues` and `visitReplacedValues` methods to
`PromotableOpInterface`. These methods allow `PromotableOpInterface` ops
to transform definitions mutated by a `store`.
This change is necessary to correctly handle the promotion of
`LLVM_DbgDeclareOp`.
---------
Co-authored-by: Théo Degioanni <30992420+Moxinilian@users.noreply.github.com>
This reverts commit d6e4582198 as it
violates an assumption of Mem2Reg's block argument creation. Mem2Reg
strongly assumes that all involved values have the same type as the
alloca, which was relaxed by this PR. Therefore, branches got created
that jumped to basic blocks with differently typed block arguments.
Integration tests for ArmSME require an emulator (there's no hardware
available). Make sure that CMake complains if `MLIR_RUN_ARM_SME_TESTS`
is set while `ARM_EMULATOR_EXECUTABLE` is empty.
I'm also adding a note in the docs for future reference.
Before the change `test-loop-fusion` and `affine-super-vectorizer-test`
options were in their own category. This was because they used the
standard llvm command line parsing with `llvm::cl::opt`. This PR moves
them over to the mlir `Pass::Option` class.
Before the change
```
$ mlir-opt --help
...
General options:
...
Compiler passes to run
Passes:
...
Pass Pipelines:
...
Generic Options:
....
affine-super-vectorizer-test options:
--backward-slicing
...
--vectorize-affine-loop-nest
test-loop-fusion options:
--test-loop-fusion-dependence-check
...
--test-loop-fusion-transformation
```
After the change
```
$ mlir-opt --help
...
General options:
...
Compiler passes to run
Passes:
...
--affine-super-vectorizer-test
--backward-slicing
...
--vectorize-affine-loop-nest
...
--test-loop-fusion options:
--test-loop-fusion-dependence-check
...
--test-loop-fusion-transformation
...
Pass Pipelines:
...
Generic Options:
...
```
---------
Signed-off-by: philass <plassen@groq.com>
Currently, by-ref reductions will allocate the per-thread reduction
variable in the initialization region. Adding a cleanup region allows
that allocation to be undone. This will allow flang to support reduction
of arrays stored on the heap.
This conflation of allocation and initialization in the initialization
should be fixed in the future to better match the OpenMP standard, but
that is beyond the scope of this patch.
The argument to the initialization region of reduction declarations was
never mapped. This meant that if this argument was accessed inside the
initialization region, that mlir operation would be translated to an
llvm operation with a null argument (failing verification).
Adding the mapping ensures that the right LLVM value can be found when
inlining and converting the initialization region.
We have to separately establish and clean up these mappings for each use
of the reduction declaration because repeated usage of the same
declaration will inline it using a different concrete value for the
block argument.
This argument was never used previously because for most cases the
initialized value depends only upon the type of the reduction, not on
the original variable. It is needed now so that we can read the array
extents for the local copy from the mold.
Flang support for reductions on assumed shape arrays patch 2/3
This commit changes the API of `ValueBoundsConstraintSet`: the stop
condition is now passed to the constructor instead of `processWorklist`.
That makes it easier to add items to the worklist multiple times and
process them in a consistent manner. The current
`ValueBoundsConstraintSet` is passed as a reference to the stop
function, so that the stop function can be defined before the
`ValueBoundsConstraintSet` is constructed.
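The API shape described above can be sketched as follows. This is a toy Python model, not the C++ class: `ConstraintSetModel`, `add`, `process_worklist`, and the `operands_of` callback are hypothetical names chosen for illustration.

```python
from collections import deque


class ConstraintSetModel:
    """Toy worklist whose stop condition is fixed at construction time."""

    def __init__(self, stop_condition):
        # stop_condition(value, constraint_set) -> bool. The set itself is
        # passed in, so the function can be defined before the set exists.
        self.stop_condition = stop_condition
        self.worklist = deque()
        self.visited = []

    def add(self, value):
        self.worklist.append(value)

    def process_worklist(self, operands_of):
        while self.worklist:
            value = self.worklist.popleft()
            self.visited.append(value)
            if self.stop_condition(value, self):
                continue  # do not traverse past this value
            self.worklist.extend(operands_of(value))
```

Because the stop condition lives on the object, values can be added and the worklist processed several times, with every round using the same condition.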
This change is in preparation of adding support for branches.
This commit relaxes Mem2Reg's type equality requirement for the LLVM
dialect's load and store operations. For now, we only allow loads to be
promoted if the reaching definition can be cast into a value of the
target type.
For stores, all type checks are removed, as a non-volatile store that
does not write out the alloca's pointer can always be deleted.
Operations must be created with the supplied builder. Otherwise, the
dialect conversion / greedy pattern rewrite driver can break.
This commit fixes a crash in the dialect conversion:
```
within split at llvm-project/mlir/test/Conversion/TosaToLinalg/tosa-to-linalg-invalid.mlir:1 offset :8:8: error: failed to legalize operation 'tosa.add'
%0 = tosa.add %1, %arg2 : (tensor<10x10xf32>, tensor<*xf32>) -> tensor<*xf32>
^
within split at llvm-project/mlir/test/Conversion/TosaToLinalg/tosa-to-linalg-invalid.mlir:1 offset :8:8: note: see current operation: %9 = "tosa.add"(%8, %arg2) : (tensor<10x10xf32>, tensor<*xf32>) -> tensor<*xf32>
mlir-opt: llvm-project/mlir/include/mlir/IR/UseDefLists.h:198: mlir::IRObjectWithUseList<mlir::OpOperand>::~IRObjectWithUseList() [OperandType = mlir::OpOperand]: Assertion `use_empty() && "Cannot destroy a value that still has uses!"' failed.
```
This commit is the proper fix for #87297 (which was reverted).
Updates `castAwayContractionLeadingOneDim` to check for leading unit
dimensions before inserting `vector.transpose` ops.
Currently `castAwayContractionLeadingOneDim` removes all leading unit
dims based on the accumulator and transposes any subsequent operands to
match the accumulator indexing. This does not take into account whether
the transpose is strictly necessary, for instance when given this
vector-matrix contract:
```mlir
%result = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %lhs, %rhs, %acc : vector<1x1x8xi32>, vector<1x8x8xi32> into vector<1x8xi32>
```
Passing this through `castAwayContractionLeadingOneDim` pattern produces
the following:
```mlir
%0 = vector.transpose %arg0, [1, 0, 2] : vector<1x1x8xi32> to vector<1x1x8xi32>
%1 = vector.extract %0[0] : vector<1x8xi32> from vector<1x1x8xi32>
%2 = vector.extract %arg2[0] : vector<8xi32> from vector<1x8xi32>
%3 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %1, %arg1, %2 : vector<1x8xi32>, vector<1x8x8xi32> into vector<8xi32>
%4 = vector.broadcast %3 : vector<8xi32> to vector<1x8xi32>
```
The `vector.transpose` introduced does not affect the underlying data
layout (effectively a no op), but it cannot be folded automatically.
This change avoids inserting transposes when only leading unit
dimensions are involved.
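To see why the transpose in the example is effectively a no-op, the following Python sketch checks whether a permutation reorders any non-unit dimension. It illustrates the reasoning only and is not the check implemented in the patch; `transpose_only_moves_unit_dims` is a hypothetical helper, and it assumes result dimension `i` takes source dimension `perm[i]`.

```python
from typing import Sequence


def transpose_only_moves_unit_dims(shape: Sequence[int], perm: Sequence[int]) -> bool:
    """True if the permutation keeps all non-unit dims in their original order."""
    # Source dims with size != 1, in their original order.
    non_unit_before = [d for d in range(len(shape)) if shape[d] != 1]
    # The same dims, in the order they appear after the transpose.
    non_unit_after = [d for d in perm if shape[d] != 1]
    return non_unit_before == non_unit_after
```

For the `vector<1x1x8xi32>` operand with permutation `[1, 0, 2]`, only the two unit dimensions swap, so the function returns `True` and the element order in memory is unchanged.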
Fixes #85691