Previously this was only populated later, in the create method. This
resolves some invalid builder paths. It may also mean that type
inference functions no longer have to consider whether property
conversion has happened (but I haven't verified that yet).
This also makes Attributes corresponding to Properties optional inside
the set-from-attributes method. Today that is in effect what happens
with Property value initialization, and folks use it to define custom
C++ types whose default initialization is what they want; it is the
behavior users get if they use Properties directly. Propagating
Attributes without allowing partial setting would require iterating over
the dictionary attribute while considering the properties of the op type
that will be created. This could also have been an additional generated
method or optional behavior on the set method, but doing it consistently
seems better. In terms of what's lost, it doesn't seem like anything
compared to the pure Property path, where the Property is default value
initialized and then partially overwritten (requiring all attributes
doesn't seem to buy anything extra verification-wise).
Default-valued Properties (as specified on the ODS side rather than the
C++ side) triggered an error because the containing class was not yet
complete when the nested class was referenced, so we couldn't have a
default initializer for them in the parent class. Added an additional
forwarding builder to avoid needing to update call sites. This could be
split out into a separate change.
Inlined a templated function in a unit test that was only used once.
Moved initialization earlier where seen.
This patch rewrites the ArmSME tile allocator to use liveness
information to make better tile allocation decisions and to improve the
correctness of the ArmSME dialect. The algorithm used here is a linear
scan over live ranges, where live ranges are assigned to tiles as they
appear in the program (chronologically), and release their assigned tile
ID once the current program point is past their end. This is a greedy
algorithm, mainly to keep the implementation relatively straightforward,
and because it seems to be sufficient for most kernels (e.g. matmuls)
that use ArmSME. The general steps roughly follow
https://link.springer.com/content/pdf/10.1007/3-540-45937-5_17.pdf,
though a few simplifications and assumptions have been made for our use
case.
Hopefully, the only changes needed for users of the ArmSME dialect are
that:
- `-allocate-arm-sme-tiles` will no longer be a standalone pass
- `-test-arm-sme-tile-allocation` is only for unit tests
- `-convert-arm-sme-to-llvm` must happen after `-convert-scf-to-cf`
- SME tile allocation is now part of the LLVM conversion
By integrating this into the `ArmSME -> LLVM` conversion we can allow
high-level (value-based) ArmSME operations to be side-effect-free, as we
can guarantee nothing will rearrange ArmSME operations before we emit
intrinsics (which could invalidate the tile allocation).
The hope is for ArmSME operations to have no hidden state/side effects,
allowing dialects such as `vector` and `arith` to be lowered to SME
easily, without making assumptions about how the input IR looks, since
the semantics of the operations stay the same: no (new) side effects,
and the IR follows the rules of SSA (a value never changes).
The aim is correctness, so we have a base for working on optimizations.
Adds the LLVM vector.deinterleave2 intrinsic to the MLIR LLVM dialect.
The deinterleave2 intrinsic takes a vector and returns two vectors, the
first containing the even-indexed elements and the second the
odd-indexed elements of the input vector. It is the inverse of
vector.interleave2.
Following #90236, this adds `select` to linalg as `arith.select`. No
implicit type casting.
OpDSL doesn't expose a type restriction for bool, but I saw no reason to
add one (instead, a separate symbolic type is used and the semantics are
checked in the builder).
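A minimal sketch of how the new named op might be used (the spelling
`linalg.select` and the shapes here are assumptions based on the
description above, not taken verbatim from the patch):
```
%0 = linalg.select ins(%cond, %lhs, %rhs : tensor<8xi1>, tensor<8xf32>, tensor<8xf32>)
                   outs(%init : tensor<8xf32>) -> tensor<8xf32>
```
Each result element is the corresponding element of %lhs where %cond is
true and of %rhs otherwise, matching `arith.select` semantics with no
implicit casts.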
---------
Co-authored-by: Renato Golin <rengolin@systemcall.eu>
Co-authored-by: Maksim Levental <maksim.levental@gmail.com>
This commit extends the SROA interfaces so that the interface
implementations can communicate newly created allocators to the
algorithm. This ensures that the SROA implementation no longer requires
re-walking the IR to find new allocators.
After #86509 allowed all integer types in TOSA ops, this PR allows TOSA
ops on all floating-point types.
This helps when experimenting with `f64` and 8-bit float types when spec
conformance is not required.
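For instance, something like the following (written in generic op form;
the op choice and shapes are illustrative assumptions) is now accepted
by the verifier:
```
// An element-wise TOSA op on f64 operands, previously rejected.
%0 = "tosa.add"(%a, %b) : (tensor<4xf64>, tensor<4xf64>) -> tensor<4xf64>
```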
This is a proof-of-concept recognition of the most basic forms of ReLU
operations, used to showcase sparsification of end-to-end PyTorch
models. In the long run, we must avoid lowering such constructs too
early (which creates this need to raise them back).
See discussion at
https://discourse.llvm.org/t/min-max-abs-relu-recognition-starter-project/78918
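For illustration, the kind of construct in question is an element-wise
max-with-zero, such as the sketch below (the exact patterns matched by
the recognition may differ):
```
#id = affine_map<(d0) -> (d0)>
// ReLU expressed as max(x, 0), applied element-wise.
%relu = linalg.generic
    {indexing_maps = [#id, #id], iterator_types = ["parallel"]}
    ins(%x : tensor<?xf32>) outs(%init : tensor<?xf32>) {
  ^bb0(%in: f32, %out: f32):
    %zero = arith.constant 0.0 : f32
    %max = arith.maximumf %in, %zero : f32
    linalg.yield %max : f32
} -> tensor<?xf32>
```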
In summary:
- `Monomial` -> `MonomialBase` with two inheriting `IntMonomial` and
`FloatMonomial` for the different coefficient types
- `Polynomial` -> `PolynomialBase` with `IntPolynomial` and
`FloatPolynomial` inheriting
- `PolynomialAttr` -> `IntPolynomialAttr`, and new `FloatPolynomialAttr`
attribute, both of which may be input to `polynomial.constant`
- Refactoring common parts of attribute parsers.
---------
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
This commit fixes Mem2Reg's multi-slot allocator handling and extends
the test dialect to test this.
Additionally, this modifies Mem2Reg's API to always attempt a full
promotion on all the passed-in "allocators". This ensures that the pass
does not require unnecessary walks over the regions and improves caching
benefits.
The pass runs a `DataFlowSolver` and collects state information on the
input IR. Then the rewrite driver and folding are applied. During
pattern application and folding it can happen that an op from the input
IR is deleted and a new op is created at the same address. When the
newly created op is looked up in the `DataFlowSolver` state memory, the
state of the original op is returned.
This patch adds a method to `DataFlowSolver` which removes all state
related to a `ProgramPoint`. It also adds a listener to the pass which
clears the state information of deleted ops from the `DataFlowSolver`.
Fixes https://github.com/llvm/llvm-project/issues/81228
Being able to add custom dialects is one of the big missing pieces of
the C API. This change should make it achievable via IRDL, hopefully
opening custom dialect definition to non-C++ users of MLIR.
This PR adds two new fields to omp.map_info: a BoolAttr and an
I64ArrayAttr.
The BoolAttr, named partial_map, is a flag indicating whether the record
type captured by the map_info operation is a partial map or is mapped in
its entirety; this currently helps the later lowering determine the type
of map entries that need to be generated.
The I64ArrayAttr, named members_index, is intended to track the
placement of each member map_info operation (and, by extension, each
mapped member variable) within the parent record type. This may need to
be extended to an N-D array for nested member mapping.
Pull Request: https://github.com/llvm/llvm-project/pull/82851
This fixes a TODO to use alias analysis to determine whether a read op
intervenes between two write operations to the same memref.
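For illustration, a sketch of the situation using plain memref ops (the
exact ops handled by the pass may differ):
```
memref.store %v0, %buf[%i] : memref<16xf32>    // first write
%r = memref.load %other[%i] : memref<16xf32>   // read; if alias analysis proves %other
                                               // does not alias %buf, it does not intervene
memref.store %v1, %buf[%i] : memref<16xf32>    // second write to the same location
```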
Signed-off-by: Asra <asraa@google.com>
Pack a matmul MxNxK operation into a 4D blocked layout. Any present
batch dimensions remain unchanged and the result is unpacked back to the
original layout.
Matmul block packing splits the operands into major blocks (outer
dimensions) and minor blocks (inner dimensions). The desired block
layout can be controlled through packing options.
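As a rough sketch (shapes are illustrative), the pass takes a plain
matmul like the one below, rearranges its operands into 4D blocked
tensors (outer block dims x inner block dims), performs the
multiplication in the blocked layout, and unpacks the result back to the
original 2D shape:
```
%0 = linalg.matmul ins(%A, %B : tensor<128x256xf32>, tensor<256x512xf32>)
                   outs(%C : tensor<128x512xf32>) -> tensor<128x512xf32>
```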
* adds comments in tile-spills-and-fills.mlir
* adds comments in ArmSMEIntrinsicOps.td
* updates test in tile-spills-and-fills.mlir not to return 2D scalable
vectors (e.g. vector<[4]x[4]xf32>) - that's not supported and not
needed for that test
Adds the [DiagnosticTag][diagtag] LSP construct to the LSP support
headers. I also added a unit test file to validate that the `tags` array
is omitted entirely if it's empty.
The LSP spec requires that `Diagnostic::tags` be an array; in order to
conform to that I used `std::vector`, as `SmallVector` doesn't have JSON
decoding support (you can encode it to JSON, but not decode it from
JSON).
[diagtag]:
https://microsoft.github.io/language-server-protocol/specifications/lsp/3.18/specification/#diagnosticTag
In some cases this pattern may ignore static information due to dynamic
values among the insert_slice size operands, e.g.:
```
%0 = tensor.cast %arg0 : tensor<1x?xf32> to tensor<?x?xf32>
%1 = tensor.insert_slice %0 into %arg1[...] [%s0, %s1] [...]
: tensor<?x?xf32> into tensor<?x?xf32>
```
This can be rewritten into:
```
%1 = tensor.insert_slice %arg0 into %arg1[...] [1, %s1] [...]
: tensor<1x?xf32> into tensor<?x?xf32>
```
This PR updates the matching in the pattern to allow rewrites like this.
This patch is a first pass at making the syntax consistent across the
`LinalgTransformOp`s that use dynamic index lists for size parameters.
Previously, there were two different forms: inline the types in the
list, or place them in the functional-style tuple. This patch goes for
the latter.
In order to do this, the `printPackedOrDynamicIndexList`,
`printDynamicIndexList` and their `parse` counterparts were modified so
that the types can be optionally provided to the corresponding custom
directives.
All affected ops now use tablegen `assemblyFormat`, so custom
`parse`/`print` functions have been removed. A couple of ops will likely
add dynamic size support, and once that happens, care should be taken
that the assembly remains consistent with the changes in this patch.
The affected ops are as follows: `pack`, `pack_greedily`,
`tile_using_forall`. The `tile_using_for` and `vectorize` ops already
used this syntax, but their custom assembly was removed.
---------
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
This commit ensures that Mem2Reg reuses the `DominanceInfo` as well as
block index maps to avoid expensive recomputation. Due to the recent
migration to `OpBuilder`, the promotion of a slot no longer replaces
blocks. Having stable blocks makes the `DominanceInfo` preservable and
additionally allows caching block index maps across promotions.
Performance measurements on very large functions show up to a 4x speedup
from these changes.
This commit changes the `MemorySlotInterface` back to using `OpBuilder`
instead of a rewriter. The rewriter was originally introduced in
https://reviews.llvm.org/D150432, but it was shown that patterns are a
bad fit for both Mem2Reg and SROA.
Mem2Reg suffers from the usage of a rewriter because it is forced to
create new basic blocks. This is an issue, as it leads to invalidation
of the dominance information, which can be expensive to recompute.
Add an option, hoist-static-allocs, to remove the unnecessary
memref.alloc and memref.copy after this pass when the memref in the
ReturnOp is allocated by memref.alloc and is statically shaped. Instead,
the pass replaces uses of the allocated memref with the memref in the
out argument.
By default, BufferResultsToOutParams results in a memcpy operation that
copies the originally returned memref to the output argument memref.
This is inefficient when the source of the memcpy (the returned memref
in the original ReturnOp) comes from a local AllocOp; the pass can use
the output argument memref in place of the locally allocated one for
better performance. hoist-static-allocs avoids the dynamic allocation
and the memory movement.
This option can be critical for performance-sensitive applications that
require the BufferResultsToOutParams pass for a caller-owned output
buffer calling convention.
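A minimal sketch of the effect (function name and shapes are
illustrative):
```
// Before the pass: the result is a statically shaped local allocation.
func.func @producer() -> memref<8xf32> {
  %alloc = memref.alloc() : memref<8xf32>
  // ... fill %alloc ...
  return %alloc : memref<8xf32>
}

// After buffer-results-to-out-params with hoist-static-allocs: the local
// allocation (and the memref.copy the default lowering would insert) is
// gone; the out argument is written directly instead.
func.func @producer(%out: memref<8xf32>) {
  // ... fill %out ...
  return
}
```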
1. Verify that the type of the explicit/implicit values is the same as
the tensor element type.
2. Verify that the implicit value can only be zero.
3. Verify that the explicit/implicit values are numeric.
4. Fix the type change issue caused by SparseTensorType(enc).
Add an option to uniquely number values, block arguments, and naming
conflicts when requested and/or when printing the generic op form. This
is helpful when debugging. For example, if you have:
scf.for
  %0 =
  %1 = opA %0
scf.for
  %0 =
  %1 = opB %0
And when you get a verifier error saying opB's "operand #0 does not
dominate this use", it looks like %0 does dominate the use, which is not
intuitive. If these were numbered uniquely, it would look like:
scf.for
  %0 =
  %1 = opA %0
scf.for
  %2 =
  %3 = opB %0
And thus it is much clearer why you are getting the error, since %0 is
out of scope. Since the generic op form should aim to give you as much
information as possible, it seems like a good idea to use unique numbers
in this situation. Adding an option also lets users enable unique
numbering outside of the generic op form.
Co-authored-by: Scott Manley <scmanley@nvidia.com>
This commit adds a canonicalization pattern to fold away an iter arg of
scf.forall if:
a. The corresponding tied result has no use.
b. It is not being modified within the loop.
Signed-off-by: Abhishek Varma <avarma094@gmail.com>
These two ops represent a number-theoretic transform of a polynomial to
a tensor of evaluations of the polynomial at a list of powers of a
primitive root of unity.
To support this, a new optional attribute is added to the ring attribute
to specify the primitive root of unity used for the NTT. A verifier for
the op is added to ensure the chosen root is a primitive nth root of
unity.
---------
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
This patch modifies the definition of `PadOp` to take transform params
and handles for the `pad_to_multiple_of` operand.
---------
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
Parsing support for floating-point types was missing a few features:
1. Parsing floating-point attributes from integer literals was supported
only for types with a bitwidth smaller than or equal to 64.
2. Downstream users could not use `AsmParser::parseFloat` to parse float
types that are printed as integer literals.
This commit addresses both points. It extends
`Parser::parseFloatFromIntegerLiteral` to support arbitrary bitwidths,
and exposes a new API to parse an arbitrary floating-point value given
an fltSemantics as input. The usage of this new API is demonstrated in
the Test Dialect.
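For context, the "integer literal" form in question is the hexadecimal
bit-pattern syntax MLIR uses for some float values, e.g. an f32 NaN
constant (a minimal illustration, not taken from the patch):
```
%nan = arith.constant 0x7FC00000 : f32
```
The change extends this parsing path to float types wider than 64 bits
and makes it reachable for downstream dialects via the new parse API.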
This change corrects invalid behavior in the `--buffer-loop-hoisting`
pass. The pass is in charge of extracting buffer allocations (e.g.,
`memref.alloca`) from loop regions (e.g., `scf.for`) when possible. This
works fine for loops with sequential execution semantics. However, a
buffer allocated in the body of a parallel loop may be concurrently
accessed by multiple threads to store their local data. Extracting such
a buffer from the loop causes all threads to wrongly share the same
memory region.
In the following example, dimension 1 of the input tensor is reversed.
Dimension 0 is traversed with a parallel loop.
```
func.func @f(%input: memref<2x3xf32>) -> memref<2x3xf32> {
  %c0 = index.constant 0
  %c1 = index.constant 1
  %c2 = index.constant 2
  %c3 = index.constant 3
  %output = memref.alloc() : memref<2x3xf32>
  scf.parallel (%index) = (%c0) to (%c2) step (%c1) {
    // Create subviews for working input and output slices
    %input_slice = memref.subview %input[%index, 2][1, 3][1, -1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, -1], offset: ?>>
    %output_slice = memref.subview %output[%index, 0][1, 3][1, 1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>
    // Copy the input slice into this temporary buffer. This intermediate
    // copy is unnecessary, but is used for illustration purposes.
    %temp = memref.alloc() : memref<1x3xf32>
    memref.copy %input_slice, %temp : memref<1x3xf32, strided<[3, -1], offset: ?>> to memref<1x3xf32>
    // Copy temporary buffer into output slice
    memref.copy %temp, %output_slice : memref<1x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>
    scf.reduce
  }
  return %output : memref<2x3xf32>
}
```
The patch submitted here prevents `%temp = memref.alloc() :
memref<1x3xf32>` from being hoisted when the containing op is
`scf.parallel` or `scf.forall`. A new op trait called
`HasParallelRegion` is introduced and assigned to these two ops to
indicate that their regions have parallel execution semantics.
This is a new take on #89111. Now that #90040 is merged, this has become
trivial to implement. The added test shows the kind of benefit that we
get from this: now dim-of-expand-shape naturally folds without us
needing to implement an ad-hoc folding rewrite.
It may be useful to have access to additional handles or parameters when
performing matches and actions in `foreach_match`, for example, to
parameterize the matcher by rank or restrict it in a non-trivial way.
Enable `foreach_match` to forward additional handles from operands to
matcher symbols and from action symbols to results.
Introduce a new Transform dialect extension that uses IRDL op
definitions as matcher descriptors. IRDL allows one to essentially
define additional op constraints to be verified and, unlike PDL, does not
assume rewriting will happen. Leverage IRDL verification capability to
filter out ops that match an IRDL definition without actually
registering the corresponding operation with the system.
Rather than force callbacks for outgoing requests to parse the result
JSON themselves (of type `llvm::Expected<llvm::json::Value>`), allow
users to specify the result type, which
`MessageHandler::outgoingRequest` will parse for them. This eliminates
boilerplate for users sending outgoing requests.
This patch generalizes tensor.expand_shape and memref.expand_shape to
consume the output shape as a list of SSA values. This enables us to
implement generic reshape operations with dynamic shapes using
collapse_shape/expand_shape pairs.
The output_shape input to expand_shape follows the static/dynamic
representation that's also used in `tensor.extract_slice`.
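A small sketch of the generalized form (shapes illustrative), with the
dynamic extent of the expanded dimension passed as an SSA value:
```
%expanded = tensor.expand_shape %t [[0, 1]] output_shape [%rows, 4]
    : tensor<?xf32> into tensor<?x4xf32>
```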
Differential Revision: https://reviews.llvm.org/D140821
---------
Signed-off-by: Gaurav Shukla <gaurav.shukla@amd.com>
Co-authored-by: Ramiro Leal-Cavazos <ramiroleal050@gmail.com>