This ports https://github.com/openxla/xla/pull/10503 by @pearu. The new
implementation matches mpmath's results for most inputs, see caveats in
the linked pull request. In addition to the filecheck test here, the
accuracy was tested with XLA's complex_unary_op_test and its MLIR
emitters.
This PR adds support for lowering the following Math operations to
`libm` calls:
* `math.absf` -> `fabsf, fabs`
* `math.exp` -> `expf, exp`
* `math.exp2` -> `exp2f, exp2`
* `math.fma` -> `fmaf, fma`
* `math.log` -> `logf, log`
* `math.log2` -> `log2f, log2`
* `math.log10` -> `log10f, log10`
* `math.powf` -> `powf, pow`
* `math.sqrt` -> `sqrtf, sqrt`
These operations are direct members of `libm`, and do not seem to
require any special manipulations on their operands.
The lowering of n-D vector.extract/insert ops to LLVM is not supported
but if one of these accidentally reaches the vector-to-llvm conversion
patterns, we end up with a kind of puzzling crash. This PR fixes that
crash and gracefully bails out in those cases.
There is no good way to report detailed errors from inside
`Pass::initializeOptions` function as context may not be available at
this point and writing directly to `llvm::errs()` is not composable.
See
https://github.com/llvm/llvm-project/pull/87166#discussion_r1546426763
* Add error handler callback to `Pass::initializeOptions`
* Update `PassOptions::parseFromString` to support custom error stream
instead of using `llvm::errs()` directly.
* Update default `Pass::initializeOptions` implementation to propagate
error string from `parseFromString` to new error handler.
* Update `MapMemRefStorageClassPass` to report error details using new
API.
Add rounding mode attribute to `arith`. This attribute can be used in
different FP `arith` operations to control rounding mode. Rounding modes
correspond to IEEE 754-specified rounding modes. Use in `arith.truncf` folding.
As this is not supported in dialects other than LLVM, conversion should fail for
now in case this attribute is present.
---------
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
This patch fixes:
mlir/lib/Conversion/TosaToTensor/TosaToTensor.cpp:76:46: error:
'multiplies' may not intend to support class template argument
deduction [-Werror,-Wctad-maybe-unsupported]
- Revamped lowering conversion pattern for `tosa.reshape` to handle previously unsupported combinations of dynamic dimensions in input and output tensors. The lowering strategy continues to rely on pairs `tensor.collapse_shape` + `tensor.expand_shape`, which allow for downstream fusion with surrounding `linalg.generic` ops.
- Fixed bug in canonicalization pattern `ReshapeOp::fold()` in `TosaCanonicalizations.cpp`. The input and result types being equal is not a sufficient condition for folding. If there is more than 1 dynamic dimension in the input and result types, a productive reshape could still occur.
- This work exposed the fact that bufferization does not properly handle a `tensor.collapse_shape` op producing a 0D tensor from a dynamically shaped one due to a limitation in `memref.collapse_shape`. While the proper way to address this would involve releasing the `memref.collapse_shape` restriction and verifying correct bufferization, this is left as possible future work. For now, this scenario is avoided by casting the `tosa.reshape` input tensor to a static shape if necessary (see `inferReshapeInputType()`.
- An extended set of tests are intended to cover relevant conversion paths. Tests are named using pattern `test_reshape_<rank>_{up|down|same}_{s2s|s2d|d2s|d2d}_{explicit|auto}[_empty][_identity]`, where:
- `<rank>` is the input rank (e.g., 3d, 6d)
- `{up|down|same}` indicates whether the reshape increases, decreases, or retains the input rank.
- `{s2s|s2d|d2s|d2d}` indicates whether reshape converts a statically shaped input to a statically shaped result (`s2s`), a statically shaped input to a dynamically shaped result (`s2d`), etc.
- `{explicit|auto}` is used to indicate that all values in the `new_shape` attribute are >=0 (`explicit`) or that a -1 placeholder value is used (`auto`).
- `empty` is used to indicate that `new_shape` includes a component set to 0.
- `identity` is used when the input and result shapes are the same.
Important to consider that `arith` has wrap around semantics, and in C++
signed overflow is UB.
Unless the operation guarantees that no signed overflow happens, we will
perform the arithmetic in an equivalent unsigned type.
`bool` also doesn't wrap around in C++, and is not addressed here.
Investigate the lowering of MemRef Load/Store ops and implement
additional folding of created ops
Aims to improve readability of generated lowered SPIR-V code.
Part of work llvm#70704
We currently always lower shuffle to the struct-returning variant. I saw
some cases where this survived all the way through ptx, resulting in
increased register usage. The easiest fix is to simply lower to the
single-result version when the predicate is unused.
This fixes the following failure when doing a clean build (in particular
no .ninja* lying around) of lib/libMLIRLinalgToStandard.a only:
```
In file included from mlir/include/mlir/Dialect/Vector/Transforms/VectorTransforms.h:12,
from mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h:21,
from mlir/lib/Conversion/LinalgToStandard/LinalgToStandard.cpp:15:
mlir/include/mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h:20:10: fatal error: mlir/Dialect/Vector/Transforms/VectorTransformsEnums.h.inc: No such file or directory
```
This fixes the following failure when doing a clean build (in particular
no .ninja* lying around) of lib/libMLIRMemRefToLLVM.a only:
```
In file included from mlir/lib/Conversion/MemRefToLLVM/MemRefToLLVM.cpp:18:
mlir/include/mlir/Dialect/Func/IR/FuncOps.h:29:10: fatal error: mlir/Dialect/Func/IR/FuncOps.h.inc: No such file or directory
```
This fixes the following failure when doing a clean build (in particular
no .ninja* lying around) of lib/libMLIRNVVMToLLVM.a only:
```
In file included from mlir/lib/Conversion/NVVMToLLVM/NVVMToLLVM.cpp:18:
mlir/include/mlir/Dialect/Func/IR/FuncOps.h:29:10: fatal error: mlir/Dialect/Func/IR/FuncOps.h.inc: No such file or directory
```
On some architectures (currently gfx90a, gfx94*, and gfx10**), we can
implement an LDS barrier using compiler intrinsics instead of inline
assembly, improving optimization possibilities and decreasing the
fragility of the underlying code.
Other AMDGPU chipsets continue to require inline assembly to implement
this barrier, as, by the default, the LLVM backend will insert waits on
global memory (s_waintcnt vmcnt(0)) before barriers in order to ensure
memory watchpoints set by debuggers work correctly.
Use of amdgpu.lds_barrier, on these architectures, imposes a tradeoff
between debugability and performance. The documentation, as well as the
generated inline assembly, have been updated to explicitly call
attention to this fact.
For chipsets that did not require the inline assembly hack, we move to
the s.waitcnt and s.barrier intrinsics, which have been added to the
ROCDL dialect. The magic constants used as an argument to the waitcnt
intrinsic can be derived from
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
This patch adds a the `LowerVectorToArmNeonPattern` patterns to the
ArmNeon.
This pattern inspects `vector.contract` ops that can be 1-1 mapped to an
`arm.neon.smmla` intrinsic. The contract ops must be separated into
tiles who's inputs must fit that of a single smmla op (`2x8xi32` inputs
and `2x2xi32` output). The `vector.contract` inputs must be sign
extended from narrow types (<=i8) to be converted. If all conditions are
met, an smmla op is inserted with additional `vector.shape_casts` to
handle linearizing the input and output dimension.
This adds patterns and a pass to convert the Arith dialect to EmitC. For
now, this covers arithemtic binary ops operating on floating point
types.
It is not checked within the patterns whether the types, such as the
Tensor type, are supported in the respective EmitC operations. If
unsupported types should be converted, the conversion will fail anyway
because no legal EmitC operation can be created. This can clearly be
improved in a follow up, also resulting in better error messages.
Functions for such checks should not solely be used in the conversions
and should also be (re)used in the verifier.
From https://reviews.llvm.org/D153245
This adds support for native PDL (and PDLL) C++ constraints to return
results.
This is useful for situations where a pattern checks for certain
constraints of multiple interdependent attributes and computes a new
attribute value based on them. Currently, for such an example it is
required to escape to C++ during matching to perform the check and after
a successful match again escape to native C++ to perform the computation
during the rewriting part of the pattern. With this work we can do the
computation in C++ during matching and use the result in the rewriting
part of the pattern. Effectively this enables a choice in the trade-off
of memory consumption during matching vs recomputation of values.
This is an example of a situation where this is useful: We have two
operations with certain attributes that have interdependent constraints.
For instance `attr_foo: one_of [0, 2, 4, 8], attr_bar: one_of [0, 2, 4,
8]` and `attr_foo == attr_bar`. The pattern should only match if all
conditions are true. The new operation should be created with a new
attribute which is computed from the two matched attributes e.g.
`attr_baz = attr_foo * attr_bar`. For the check we already escape to
native C++ and have all values at hand so it makes sense to directly
compute the new attribute value as well:
```
Constraint checkAndCompute(attr0: Attr, attr1: Attr) -> Attr;
Pattern example with benefit(1) {
let foo = op<test.foo>() {attr = attr_foo : Attr};
let bar = op<test.bar>(foo) {attr = attr_bar : Attr};
let attr_baz = checkAndCompute(attr_foo, attr_bar);
rewrite bar with {
let baz = op<test.baz> {attr=attr_baz};
replace bar with baz;
};
}
```
To achieve this the following notable changes were necessary:
PDLL:
- Remove check in PDLL parser that prevented native constraints from
returning results
PDL:
- Change PDL definition of pdl.apply_native_constraint to allow variadic
results
PDL_interp:
- Change PDL_interp definition of pdl_interp.apply_constraint to allow
variadic results
PDLToPDLInterp Pass:
The input to the pass is an arbitrary number of PDL patterns. The pass
collects the predicates that are required to match all of the pdl
patterns and establishes an ordering that allows creation of a single
efficient matcher function to match all of them. Values that are matched
and possibly used in the rewriting part of a pattern are represented as
positions. This allows fusion and thus reusing a single position for
multiple matching patterns. Accordingly, we introduce
ConstraintPosition, which records the type and index of the result of
the constraint. The problem is for the corresponding value to be used in
the rewriting part of a pattern it has to be an input to the
pdl_interp.record_match operation, which is generated early during the
pass such that its surrounding block can be referred to by branching
operations. In consequence the value has to be materialized after the
original pdl.apply_native_constraint has been deleted but before we get
the chance to generate the corresponding pdl_interp.apply_constraint
operation. We solve this by emitting a placeholder value when a
ConstraintPosition is evaluated. These placeholder values (due to fusion
there may be multiple for one constraint result) are replaced later when
the actual pdl_interp.apply_constraint operation is created.
Changes since the phabricator review:
- Addressed all comments
- In particular, removed registerConstraintFunctionWithResults and
instead changed registerConstraintFunction so that contraint functions
always have results (empty by default)
- Thus we don't need to reuse `rewriteFunctions` to store constraint
functions with results anymore, and can instead use
`constraintFunctions`
- Perform a stable sort of ConstraintQuestion, so that
ConstraintQuestion appear before other ConstraintQuestion that use their
results.
- Don't create placeholders for pdl_interp::ApplyConstraintOp. Instead
generate the `pdl_interp::ApplyConstraintOp` before generating the
successor block.
- Fixed a test failure in the pdl python bindings
Original code by @martin-luecke
Co-authored-by: martin-luecke <martinpaul.luecke@amd.com>
Since the vector.print str provides no punctuation control, it is
slightly more flexible to let the client of this operation decide
whether there should be a trailing newline. This allows for printing
like
vector.print str "nse = "
vector.print %nse : index
as
nse = 42
In order to ensure operations lower correctly (especially
memref.addrspacecast, which relies on the data layout benig set
correctly then dealing with dynamic memrefs) and to prevent compilation
issues later down the line, set the `llvm.data_layout` attribute on GPU
modules when lowering their contents to a ROCDL / AMDGPU target.
If there's a good way to test the embedded string to prevent it from
going out of sync with the LLVM TargetMachine, I'd appreciate hearing
about it. (Or, alternatively, if there's a place I could farctor the
string out to).
When creating a new block in (conversion) rewrite patterns,
`OpBuilder::createBlock` must be used. Otherwise, no
`notifyBlockInserted` notification is sent to the listener.
Note: The dialect conversion relies on listener notifications to keep
track of IR modifications. Creating blocks without the builder API can
lead to memory leaks during rollback.
tosa.clamp takes `min`/`max` attributes as i64, so ensure that the
lowering to linalg works for the whole range.
Co-authored-by: Tiago Trevisan Jost <tiago.trevisanjost@amd.com>
I believe the semantics should be the same, but this saves 1 op and simplifies the code.
For example, the following two instructions:
```
%2 = cmp sgt %0, %1
%3 = select %2, %0, %1
```
Are equivalent to:
```
%2 = maxsi %0 %1
```
This is a new ODS feature that allows dialects to define a list of
key/value pair representing an attribute type and a name.
This will generate helper classes on the dialect to be able to
manage discardable attributes on operations in a type safe way.
For example the `test` dialect can define:
```
let discardableAttrs = (ins
"mlir::IntegerAttr":$discardable_attr_key,
);
```
And the following will be generated in the TestDialect class:
```
/// Helper to manage the discardable attribute `discardable_attr_key`.
class DiscardableAttrKeyAttrHelper {
::mlir::StringAttr name;
public:
static constexpr ::llvm::StringLiteral getNameStr() {
return "test.discardable_attr_key";
}
constexpr ::mlir::StringAttr getName() {
return name;
}
DiscardableAttrKeyAttrHelper(::mlir::MLIRContext *ctx)
: name(::mlir::StringAttr::get(ctx, getNameStr())) {}
mlir::IntegerAttr getAttr(::mlir::Operation *op) {
return op->getAttrOfType<mlir::IntegerAttr>(name);
}
void setAttr(::mlir::Operation *op, mlir::IntegerAttr val) {
op->setAttr(name, val);
}
bool isAttrPresent(::mlir::Operation *op) {
return op->hasAttrOfType<mlir::IntegerAttr>(name);
}
void removeAttr(::mlir::Operation *op) {
assert(op->hasAttrOfType<mlir::IntegerAttr>(name));
op->removeAttr(name);
}
};
DiscardableAttrKeyAttrHelper getDiscardableAttrKeyAttrHelper() {
return discardableAttrKeyAttrName;
}
```
User code having an instance of the TestDialect can then manipulate this
attribute on operation using:
```
auto helper = testDialect.getDiscardableAttrKeyAttrHelper();
helper.setAttr(op, value);
helper.isAttrPresent(op);
...
```
Extends the `omp.parallel` op by adding a `private` clause to model
[first]private variables. This uses the `omp.private` op to map
privatized variables to their corresponding privatizers.
Example `omp.private` op with `private` variable:
```
omp.parallel private(@x.privatizer %arg0 -> %arg1 : !llvm.ptr) {
^bb0(%arg1: !llvm.ptr):
// ... use %arg1 ...
omp.terminator
}
```
Whether the variable is private or firstprivate is determined by the
attributes of the corresponding `omp.private` op.
Introduces conversion of `omp.private`'s regions to the LLVM dialect.
This reuses the already existing conversion pattern for
`ReducetionDeclareOp` and repurposes it to be used for multi-region ops
as well.
This patch reworks the way that wsloop reduction operations function to
better match the expected semantics from the OpenMP specification,
following the rework of parallel reductions.
The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region. These block arguments follow
the loop control block arguments.
---------
Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>