This PR refines the XeGPU dialect as follows:
1. Separated the old `TensorDescAttr` into two independent attributes: `BlockTensorDescAttr` and `ScatterTensorDescAttr`.
2. Renamed `MemoryScopeAttr` to `MemorySpaceAttr` and updated the enumeration value for shared memory to follow the OpenCL standard.
3. Introduced a `transpose` UnitAttr on `StoreScatterOp` and `LoadGatherOp`.
4. Added a memory space check for the `CreateNdDesc` and `CreateDesc` ops, as well as valid and invalid test cases for them.
The `split_axes` attribute is defined as "array attribute of array
attributes". Following the definition, empty `split_axes` values should
not be allowed, since that would break the definition and would lead to
invalid IR. In such a scenario, passes leveraging the mesh dialect can
observe:
* crashes in sharding-propagation;
* creation of null MeshShardingAttrs in spmdization;
* non-roundtrippable IR.
The patch prevents `split_axes` from becoming empty by modifying
`removeTrailingEmptySubArray` such that a minimum size of one is
guaranteed when constructing the attribute, and adds a test that would
crash without the change.
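The trimming rule described above can be modelled in a few lines. This is a minimal Python sketch, not the C++ helper itself: the function name, the `min_size` parameter, and the choice to pad an empty input up to the minimum are assumptions made for illustration.

```python
from typing import List


def remove_trailing_empty_sub_arrays(
    split_axes: List[List[int]], min_size: int = 1
) -> List[List[int]]:
    """Drop trailing empty sub-arrays without shrinking below `min_size`."""
    result = [list(axes) for axes in split_axes]
    # Pop empty sub-arrays from the end, but stop once only `min_size` remain.
    while len(result) > min_size and not result[-1]:
        result.pop()
    # Assumption: an input shorter than `min_size` is padded with empty entries.
    while len(result) < min_size:
        result.append([])
    return result
```

With this rule, `[[0], [], []]` trims to `[[0]]`, while an all-empty input such as `[[], [], []]` keeps a single empty sub-array instead of collapsing to an empty outer array.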
(this is the part related to bolt, lld and mlir)
Without these explicit includes, removing other headers that implicitly
include llvm-config.h may have non-trivial side effects. For example,
`clangd` may report even `llvm-config.h` as "not used" when it defines
a macro that is only used explicitly with #ifdef. The problem is amplified
by different build configs, which use different sets of macros.
This PR adds the ability to pass a non-default value to the
`.amdhsa_code_object_version` metadata when serializing ROCDL GPU
modules.
It also fixes typos in two places.
---------
Co-authored-by: Fabian Mora <fmora.dev@gmail.com>
Add support for the -frecord-command-line option that will produce the
llvm.commandline metadata which will eventually be saved in the object
file. This behavior is also supported in clang. Some refactoring of the
code in flang to handle these command line options was carried out. The
corresponding -grecord-command-line option which saves the command line
in the debug information has not yet been enabled for flang.
Making the existing populateGpuLowerSubgroupReduceToShufflePatterns()
function also cover the new "clustered" subgroup reductions is proving
to be inconvenient, because certain backends may have more specific
lowerings that only cover the non-clustered type, and this creates pass
ordering constraints. This commit removes coverage of clustered
reductions from this function in favour of a new separate function,
which makes controlling the lowering much more straightforward.
When the `enable_ir_printing` API is used from Python, it invokes IR
printing with default args: the IR is printed before each pass, and
after a pass only if there have been changes. This PR attempts
to align the `enable_ir_printing` API with the documentation.
This PR adds the `f6E2M3FN` type to MLIR.
The `f6E2M3FN` type is proposed in the [OpenCompute MX
Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
It defines a 6-bit floating point number with bit layout S1E2M3. Unlike
IEEE-754 types, there are no infinity or NaN values.
```c
f6E2M3FN
- Exponent bias: 1
- Maximum stored exponent value: 3 (binary 11)
- Maximum unbiased exponent value: 3 - 1 = 2
- Minimum stored exponent value: 1 (binary 01)
- Minimum unbiased exponent value: 1 − 1 = 0
- Has Positive and Negative zero
- Doesn't have infinity
- Doesn't have NaNs
Additional details:
- Zeros (+/-): S.00.000
- Max normal number: S.11.111 = ±2^(2) x (1 + 0.875) = ±7.5
- Min normal number: S.01.000 = ±2^(0) = ±1.0
- Max subnormal number: S.00.111 = ±2^(0) x 0.875 = ±0.875
- Min subnormal number: S.00.001 = ±2^(0) x 0.125 = ±0.125
```
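The layout above is enough to decode any bit pattern by hand. This is a small Python sketch that follows the listed rules (bias 1, subnormals at a stored exponent of 0, no infinity or NaN); the function name is illustrative and it is not the APFloat implementation.

```python
def decode_f6e2m3fn(bits: int) -> float:
    """Decode a 6-bit S1E2M3 pattern (bias 1, no Inf/NaN) to a Python float."""
    if not 0 <= bits < 64:
        raise ValueError("expected a 6-bit value")
    sign = -1.0 if (bits >> 5) & 0x1 else 1.0
    exponent = (bits >> 3) & 0x3  # two stored exponent bits
    mantissa = bits & 0x7         # three mantissa bits
    if exponent == 0:
        # Zero or subnormal: no implicit leading one, exponent is 1 - bias = 0.
        return sign * (mantissa / 8.0)
    # Normal: implicit leading one, unbiased exponent is stored - bias.
    return sign * 2.0 ** (exponent - 1) * (1.0 + mantissa / 8.0)
```

Decoding `0b011111` yields the maximum normal value and `0b000001` the minimum subnormal, which agrees with the table.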
Related PRs:
- [PR-94735](https://github.com/llvm/llvm-project/pull/94735) [APFloat]
Add APFloat support for FP6 data types
- [PR-105573](https://github.com/llvm/llvm-project/pull/105573) [MLIR]
Add f6E3M2FN type - was used as a template for this PR
----------
Motivation:
----------
Some legalization pathways introduce redundant tosa.TRANSPOSE
operations that result in avoidable data movement. For example,
PyTorch -> TOSA contains a lot of unnecessary transposes due
to conversions between NCHW and NHWC.
We wish to remove all the ones that we can, since in general
it is possible to remove the overwhelming majority.
------------
Changes Made:
------------
- Add the --tosa-reduce-transposes pass
- Add TosaElementwiseOperator trait.
-------------------
High-Level Overview:
-------------------
The pass works through the transpose operators in the program. It begins
at some transpose operator with an associated permutations tensor. It
traverses upwards through the dependencies of this transpose and verifies
that we encounter only operators with the TosaElementwiseOperator trait
and terminate in either constants, reshapes, or transposes.
We then evaluate whether there are any additional restrictions (the
transposes it terminates in must invert the one we began at, and the
reshapes must be ones into which we can fold the transpose), and then we
hoist the transpose through the intervening operators, folding it at the
constants, reshapes, and transposes.
Finally, we ensure that we do not need both the transposed form (the form
that had the transpose hoisted through it) and the untransposed form
(which it was prior), by analyzing the usages of those dependent
operators of a given transpose we are attempting to hoist and replace.
If they are such that it would require both forms to be necessary, then
we do not replace the hoisted transpose, causing the new chain to be
dead. Otherwise, we do, and the old chain (untransposed form) becomes
dead. Only one chain will ever then be live, resulting in no duplication.
We then perform a simple one-pass DCE, so no canonicalization is
necessary.
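The requirement that an upstream transpose must invert the one being hoisted can be checked by composing the two permutations. The Python sketch below illustrates that check only, not the pass; the helper names are hypothetical, and it assumes the convention that output dimension `i` takes input dimension `perms[i]`.

```python
from typing import List, Sequence


def apply_perm(shape: Sequence[int], perms: Sequence[int]) -> List[int]:
    """Shape after a transpose: output dim i takes input dim perms[i]."""
    return [shape[p] for p in perms]


def compose(first: Sequence[int], second: Sequence[int]) -> List[int]:
    """Single permutation equivalent to transposing by `first`, then `second`."""
    return [first[s] for s in second]


def cancels(upstream: Sequence[int], downstream: Sequence[int]) -> bool:
    """True when the two transposes compose to the identity permutation."""
    if len(upstream) != len(downstream):
        return False
    return compose(upstream, downstream) == list(range(len(upstream)))
```

For example, an NCHW to NHWC transpose (`[0, 2, 3, 1]`) followed by NHWC to NCHW (`[0, 3, 1, 2]`) cancels, whereas applying `[0, 2, 3, 1]` twice does not.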
--------------
Impact of Pass:
--------------
Patching the dense_resource artifacts (from PyTorch) with dense
attributes to permit constant folding, we receive the following results.
Note that data movement represents total transpose data movement,
calculated by noting which dimensions moved during the transpose.
///////////
MobilenetV3:
///////////
BEFORE total data movement: 11798776 B (11.25 MiB)
AFTER total data movement: 2998016 B (2.86 MiB)
74.6% of data movement removed.
BEFORE transposes: 82
AFTER transposes: 20
75.6% of transposes removed.
////////
ResNet18:
////////
BEFORE total data movement: 20596556 B (19.64 MiB)
AFTER total data movement: 1003520 B (0.96 MiB)
95.2% of data movement removed.
BEFORE transposes: 56
AFTER transposes: 5
91.1% of transposes removed.
////////
ResNet50:
////////
BEFORE total data movement: 83236172 B (79.3 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
96.4% of data movement removed.
BEFORE transposes: 120
AFTER transposes: 7
94.2% of transposes removed.
/////////
ResNet101:
/////////
BEFORE total data movement: 124336460 B (118.58 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
97.6% of data movement removed.
BEFORE transposes: 239
AFTER transposes: 7
97.1% of transposes removed.
/////////
ResNet152:
/////////
BEFORE total data movement: 175052108 B (166.94 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
98.3% of data movement removed.
BEFORE transposes: 358
AFTER transposes: 7
98.0% of transposes removed.
////////
Overview:
////////
We see that we remove up to 98% of transposes and eliminate
up to 98.3% of redundant transpose data movement.
In the context of ResNet50, with 120 inferences per second,
we reduce dynamic transpose data bandwidth from 9.29 GiB/s
to 344.4 MiB/s.
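The percentages and bandwidth figures follow from simple arithmetic on the byte counts. The Python sketch below reproduces the ResNet50 numbers under the assumption that MiB and GiB are binary units; the helper names are illustrative. Computed from the exact byte counts, the bandwidths come out at roughly 9.30 GiB/s and 344.5 MiB/s, marginally above the quoted values, which appear to be derived from the rounded MiB figures.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def percent_removed(before: float, after: float) -> float:
    """Percentage of `before` that is no longer present in `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


def bandwidth(bytes_per_inference: int, inferences_per_second: int) -> int:
    """Transpose traffic in bytes per second."""
    return bytes_per_inference * inferences_per_second


# ResNet50 figures quoted in the results above.
RESNET50_BEFORE_BYTES = 83_236_172
RESNET50_AFTER_BYTES = 3_010_560
INFERENCES_PER_SECOND = 120

before_gib_s = bandwidth(RESNET50_BEFORE_BYTES, INFERENCES_PER_SECOND) / GIB
after_mib_s = bandwidth(RESNET50_AFTER_BYTES, INFERENCES_PER_SECOND) / MIB
```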
-----------
Future Work:
-----------
(1) Evaluate tradeoffs with permitting ConstOp to be duplicated across
hoisted transposes with different permutation tensors.
(2) Expand the class of foldable upstream ReshapeOp we permit beyond
N -> 1x1x...x1xNx1x...x1x1.
(3) Enhance the pass to permit folding arbitrary transpose pairs, beyond
those that form the identity.
(4) Add support for more instructions besides TosaElementwiseOperator as
the intervening ones (for example, the reduce_* operators).
(5) Support hoisting transposes up to an input parameter.
Signed-off-by: Arteen Abrishami <arteen.abrishami@arm.com>
Update the GPU to NVVM lowerings to correctly propagate range
information on IDs and dimension queries, either from
known_{block,grid}_size attributes or from `upperBound` annotations on
the operations themselves.
This patch modifies the representation of `OpenMP_Clause` to allow
definitions to incorporate both required and optional arguments while
still allowing operations including them and overriding the
`assemblyFormat` to take advantage of automatically-populated format
strings.
The proposed approach is to split the `assemblyFormat` clause property
into `reqAssemblyFormat` and `optAssemblyFormat`, and remove the
`isRequired` template and associated `required` property. The
`OpenMP_Op` class, in turn, populates the new `clausesReqAssemblyFormat`
and `clausesOptAssemblyFormat` properties in addition to
`clausesAssemblyFormat`. These properties can be used by clause-based
OpenMP operation definitions to reconstruct parts of the
clause-inherited format string in a more flexible way when overriding
it.
Clause definitions are updated to follow this new approach and some
operation definitions overriding the `assemblyFormat` are simplified by
taking advantage of the improved flexibility, reducing code duplication.
The `verify-openmp-ops` tablegen pass is updated for the new
`OpenMP_Clause` representation.
Some MLIR and Flang unit tests had to be updated due to changes to the
default printing order of clauses on updated operations.
* Fix a bug introduced by the Chipset refactoring in #107720 where
atomics emulation for adds was mistakenly applied to gfx11+
* Add the case needed for gfx11+ atomic emulation, namely that gfx11
doesn't support atomically adding a v2f16 or v2bf16, thus requiring
MLIR-level legalization for buffer intrinsics that attempt to do such an
addition
* Add tests, including tests for gfx11 atomic emulation
Co-authored-by: Manupa Karunaratne <manupa.karunaratne@amd.com>
This commit introduces a ConstantRange attribute to match the
ConstantRange attribute type present in LLVM IR.
It then refactors the LLVM_IntrOpBase so that the basic part of the
intrinsic builder code can be re-used without needing to copy it or
get rid of important context. This, along with adding code for
handling an optional `range` attribute to that same base, allows us to
make the support for range() annotations generic without adding
another bit to IntrOpBase.
This commit then updates the lowering of index intrinsic operations to
use the new ConstantRange attribute and fixes a bug (where we'd be
subtracting 1 from upper bounds instead of adding it on operations
like gpu.block_dim) along the way.
The point of these changes is to enable these range annotations to be
used for the corresponding NVVM operations in a future commit.
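The off-by-one matters because the range is half-open, as LLVM's ConstantRange is. The Python sketch below models the two cases; the helper names are hypothetical, and it assumes that an ID lies in `[0, bound)` while a dimension query lies in `[1, bound]`, which as a half-open range is `[1, bound + 1)`.

```python
from typing import Tuple

Range = Tuple[int, int]  # half-open: lower inclusive, upper exclusive


def id_range(upper_bound: int) -> Range:
    """Range of an ID such as a thread index: 0 <= id < upper_bound."""
    return (0, upper_bound)


def dim_range(upper_bound: int) -> Range:
    """Range of a dimension query: 1 <= dim <= upper_bound."""
    return (1, upper_bound + 1)


def contains(r: Range, value: int) -> bool:
    return r[0] <= value < r[1]
```

With a bound of 256, `dim_range(256)` admits the value 256 itself, whereas a range built by subtracting 1 from the bound would wrongly exclude it.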
Current implementation of `scf::tileConsumerAndFuseProducerUsingSCF`
looks at operands of tiled/tiled+fused operations to see if they are
produced by `extract_slice` operations to populate the worklist used to
continue fusion. This implicit assumption does not always work. Instead
make the implementations of `getTiledImplementation` return the slices
to use to continue fusion.
This is a breaking change:
- To continue to get the same behavior from
`scf::tileConsumerAndFuseProducerUsingSCF`, change all out-of-tree
implementations of `TilingInterface::getTiledImplementation` to return
the slices to continue fusion on. All in-tree implementations have been
adapted to this.
- This change touches parts that required a simplification to the
`ControlFn` in `scf::SCFTileAndFuseOptions`. It now returns a
`std::optional<scf::SCFTileAndFuseOptions::ControlFnResult>` object that
should be `std::nullopt` if fusion is not to be performed.
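The optional-returning control function can be pictured with a small model. This Python sketch only mirrors the shape of the contract (an empty result means "do not fuse"); the class, its field, and the helper functions are hypothetical stand-ins and not the C++ API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class ControlFnResult:
    # Hypothetical field, present only to show that a result can carry options.
    yield_producer_replacement: bool = False


# A control function maps a candidate producer to a result, or None to skip it.
ControlFn = Callable[[str], Optional[ControlFnResult]]


def fuse_only(allowed: Set[str]) -> ControlFn:
    """Build a control function that fuses only the named producers."""
    def control(producer: str) -> Optional[ControlFnResult]:
        if producer not in allowed:
            return None  # plays the role of std::nullopt
        return ControlFnResult()
    return control


def producers_to_fuse(candidates: Iterable[str], control: ControlFn) -> List[str]:
    """Keep the candidates for which the control function returns a result."""
    return [c for c in candidates if control(c) is not None]
```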
Signed-off-by: MaheshRavishankar <mahesh.revishankar@gmail.com>
/llvm-project/mlir/include/mlir/Analysis/Presburger/Utils.h:320:26:
error: result of comparison of constant 18446744073709551615 with expression of type 'unsigned int' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
preIndent = (preIndent != std::string::npos) ? preIndent + 1 : 0;
~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
/llvm-project/mlir/include/mlir/Analysis/Presburger/Utils.h:335:28:
error: result of comparison of constant 18446744073709551615 with expression of type 'unsigned int' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
preIndent = (preIndent != std::string::npos) ? preIndent + 1 : 0;
~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
2 errors generated.
Hello Arjun! Please allow me to contribute this patch, as it helps me
debug significantly! When the 1's and 0's don't line up while
debugging Farkas' lemma on numerous polyhedra using the simplex lexmin
solver, it is truly straining on the eyes. Hopefully this patch can help
others!
The unfortunate part is the lack of a test case, as I'm not sure how to add
a test case for debug dumps. :) However, you can add this test case to
SimplexTest.cpp to witness the nice printing!
```c++
TEST(SimplexTest, DumpTest) {
  int COLUMNS = 2;
  int ROWS = 2;
  LexSimplex simplex(COLUMNS * 2);
  IntMatrix m1(ROWS, COLUMNS * 2 + 1);
  // Adding LHS columns.
  for (int i = 0; i < ROWS; i++) {
    // an arbitrary formula to test all kinds of integers
    for (int j = 0; j < COLUMNS; j++)
      m1(i, j) = i + (2 << (i % 3)) * (-1 * ((i + j) % 2));
  }
  // Adding RHS columns.
  for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLUMNS; j++)
      m1(i, j + COLUMNS) = j - (3 << (j % 4)) * (-1 * ((i + j * 2) % 2));
  }
  for (int i = 0; i < m1.getNumRows(); i++) {
    ArrayRef<DynamicAPInt> curRow = m1.getRow(i);
    simplex.addInequality(curRow);
  }
  IntegerRelation rel = parseRelationFromSet(
      "(x, y, z)[] : (z - x - 17 * y == 0, x - 11 * z >= 1)", 2);
  simplex.dump();
  m1.dump();
  rel.dump();
}
```
```
rows = 2, columns = 7
var: c3, c4, c5, c6
con: r0 [>=0], r1 [>=0]
r0: -1, r1: -2
c0: denom, c1: const, c2: 2147483647, c3: 0, c4: 1, c5: 2, c6: 3
1 0 1 0 -2 0 1
1 0 -8 -3 1 3 7
0 -2 0 1 0
-3 1 3 7 0
Domain: 2, Range: 1, Symbols: 0, Locals: 0
2 constraints
-1 -17 1 0 = 0
1 0 -11 -1 >= 0
```
This patch fixes the attribute type of `out_shape`, which is an i64 dense
array attribute with exactly 4 elements.
- Fix description of DenseArrayMaxCt
- Add DenseArrayMinCt and move it to CommonAttrConstraints.td
- Change type of out_shape to Tosa_IntArrayAttr4
Fixes #107804.
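An "exactly 4 elements" rule can be read as a minimum-count and a maximum-count constraint applied together. The Python sketch below models that reading only; the function names are hypothetical, and it does not claim to reflect how the TableGen constraints are actually composed.

```python
from typing import Optional, Sequence


def check_count(
    values: Sequence[int],
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> bool:
    """True when len(values) satisfies the optional lower and upper bounds."""
    if min_count is not None and len(values) < min_count:
        return False
    if max_count is not None and len(values) > max_count:
        return False
    return True


def is_valid_out_shape(out_shape: Sequence[int]) -> bool:
    """Model of an i64 array attribute that must hold exactly 4 elements."""
    all_ints = all(isinstance(v, int) for v in out_shape)
    return all_ints and check_count(out_shape, min_count=4, max_count=4)
```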
Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.
Re-lands #108238 (and addresses review comments from there)
Co-authored-by: Giuseppe Rossini <giuseppe.rossini@amd.com>
Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.
Co-authored-by: Giuseppe Rossini <giuseppe.rossini@amd.com>
----------
Motivation:
----------
Spec conformance. Allows assumptions to be made in TOSA code.
------------
Changes Made:
------------
Add full permutation tensor verification to tosa.TRANSPOSE. Previously,
it did not verify that permutation values were between 0 and (rank - 1).
Update tosa.TRANSPOSE perms data type to be strictly i32.
Verify input/output shapes for tosa.TRANSPOSE.
Add verifier to tosa.CONST, with consideration for quantization.
Fix TOSA conformance of tensor type to disallow dimensions with size 0
for ranked tensors, per spec.
This is not the same as rank 0 tensors. Here is an example of a
disallowed tensor: tensor<3x0xi32>. Naturally, this means that the
number of elements in a TOSA tensor will always be greater than 0.
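The checks listed above can be summarised in one small function. This is a Python sketch for fully static shapes only, with hypothetical names; it also requires each permutation value to appear exactly once, which is an assumption about what "full permutation verification" covers, and it is not the dialect's verifier.

```python
from typing import List, Sequence


def verify_transpose(
    input_shape: Sequence[int], perms: Sequence[int], output_shape: Sequence[int]
) -> List[str]:
    """Return a list of violations; an empty list means the op looks valid."""
    errors: List[str] = []
    rank = len(input_shape)
    if len(perms) != rank:
        errors.append("perms must have one entry per input dimension")
    elif sorted(perms) != list(range(rank)):
        # Covers both out-of-range values and repeated values.
        errors.append("perms must be a permutation of 0..rank-1")
    elif list(output_shape) != [input_shape[p] for p in perms]:
        errors.append("output shape must equal the permuted input shape")
    if any(dim == 0 for dim in list(input_shape) + list(output_shape)):
        errors.append("ranked tensors may not have a dimension of size 0")
    return errors
```

For instance, an input of shape `3x0` is rejected even with a well-formed permutation, matching the `tensor<3x0xi32>` example.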
Signed-off-by: Arteen Abrishami <arteen.abrishami@arm.com>
Refactors the tblgen-to-irdl script slightly and adds support for
- Various integer types
- Various Float types
- Confined types
- Complex types (with fixed element type)
Also doesn't add the operand and result ops if they are empty.
I could potentially split this into smaller PRs if that'd be helpful
(refactor + integer/float/complex, confined type, optional
operand/result).
@math-fehr
This patch adds the "gen-openmp-clause-ops" `mlir-tblgen` generator to
produce the structure definitions previously in OpenMPClauseOperands.h
automatically from the information contained in OpenMPOps.td and
OpenMPClauses.td.
The original header is maintained to enable the definition of similar
structures that are not directly related to any single `OpenMP_Clause`
or `OpenMP_Op` tablegen definition.
This PR adds the `f6E3M2FN` type to MLIR.
The `f6E3M2FN` type is proposed in the [OpenCompute MX
Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
It defines a 6-bit floating point number with bit layout S1E3M2. Unlike
IEEE-754 types, there are no infinity or NaN values.
```c
f6E3M2FN
- Exponent bias: 3
- Maximum stored exponent value: 7 (binary 111)
- Maximum unbiased exponent value: 7 - 3 = 4
- Minimum stored exponent value: 1 (binary 001)
- Minimum unbiased exponent value: 1 − 3 = −2
- Has Positive and Negative zero
- Doesn't have infinity
- Doesn't have NaNs
Additional details:
- Zeros (+/-): S.000.00
- Max normal number: S.111.11 = ±2^(4) x (1 + 0.75) = ±28
- Min normal number: S.001.00 = ±2^(-2) = ±0.25
- Max subnormal number: S.000.11 = ±2^(-2) x 0.75 = ±0.1875
- Min subnormal number: S.000.01 = ±2^(-2) x 0.25 = ±0.0625
```
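Because the format has only 64 bit patterns, the whole value set can be enumerated and checked against the table. The Python sketch below uses a generic decoder parameterised by exponent width, mantissa width, and bias; the function names are illustrative, and it assumes the rules listed above (subnormals at a stored exponent of 0, no infinity or NaN).

```python
from typing import List


def decode_small_float(bits: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa pattern that has no Inf/NaN encodings."""
    width = 1 + exp_bits + man_bits
    if not 0 <= bits < (1 << width):
        raise ValueError("bit pattern out of range")
    sign = -1.0 if bits >> (width - 1) else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    fraction = (bits & ((1 << man_bits) - 1)) / (1 << man_bits)
    if exponent == 0:
        # Zero or subnormal: exponent fixed at 1 - bias, no implicit one.
        return sign * 2.0 ** (1 - bias) * fraction
    return sign * 2.0 ** (exponent - bias) * (1.0 + fraction)


def f6e3m2fn_values() -> List[float]:
    """All distinct f6E3M2FN values; +0 and -0 compare equal and merge."""
    return sorted({decode_small_float(b, 3, 2, 3) for b in range(64)})
```

The same decoder covers other small formats of this shape by changing the three parameters.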
Related PRs:
- [PR-94735](https://github.com/llvm/llvm-project/pull/94735) [APFloat]
Add APFloat support for FP6 data types
- [PR-97118](https://github.com/llvm/llvm-project/pull/97118) [MLIR] Add
f8E4M3 type - was used as a template for this PR
Allow customization of the `resolveCallable` method in the
`CallOpInterface`. This change allows for operations implementing this
interface to provide their own logic for resolving callables.
- Introduce the `resolveCallable` method, which does not include the
optional symbol table parameter. This method replaces the previously
existing extra class declaration `resolveCallable`.
- Introduce the `resolveCallableInTable` method, which incorporates the
symbol table parameter. This method replaces the previous extra class
declaration `resolveCallable` that used the optional symbol table
parameter.
Update the Chipset struct to follow the `IsaVersion` definition from
llvm's `TargetParser`. This is a follow up to
https://github.com/llvm/llvm-project/pull/106169#discussion_r1733955012.
* Add the stepping version. Note: This may break downstream code that
compares against the minor version directly.
* Use comparisons with full Chipset version where possible.
Note that we can't use the code in `TargetParser` directly because the
chipset utility is outside of `mlir/Target` that re-exports llvm's
target library.
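A three-field version makes full-version comparisons straightforward. The Python sketch below is an illustrative model, not the MLIR utility: the parsing rule (decimal major, then one hexadecimal digit each for minor and stepping) is an assumption about the `gfx` naming scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Chipset:
    # Field order defines comparison order: major, then minor, then stepping.
    major: int
    minor: int
    stepping: int = 0


def parse_chipset(name: str) -> Chipset:
    """Parse names such as 'gfx942', 'gfx90a', or 'gfx1100'."""
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unrecognised chipset name: {name!r}")
    digits = name[3:]
    # Assumption: the last two characters are hex digits for minor and stepping.
    return Chipset(int(digits[:-2]), int(digits[-2], 16), int(digits[-1], 16))
```

Under this model `gfx90a` parses to minor 0 and stepping 10, so code that expected the trailing `0a` to live in the minor field would no longer match, which is the kind of downstream breakage noted above.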
This PR enables `func::ConstantOp` creation and usage for device
functions inside GPU modules.
The current main branch returns an error when device functions are
referenced via `func::ConstantOp`, because `ConstantOp` verification only
checks symbols in the `ModuleOp` symbol table, which does not
contain device functions that are defined in a `GPUModuleOp`. This PR
proposes a more general solution.
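The description does not spell out the general solution. One plausible reading, shown here purely as an assumption, is to resolve the symbol starting from the nearest enclosing symbol table instead of only the top-level module. The Python model below uses hypothetical class and method names.

```python
from typing import Optional, Set


class SymbolScope:
    """Toy model of an op that owns a symbol table and may be nested."""

    def __init__(self, name: str, parent: Optional["SymbolScope"] = None) -> None:
        self.name = name
        self.parent = parent
        self.symbols: Set[str] = set()

    def define(self, symbol: str) -> None:
        self.symbols.add(symbol)

    def lookup_nearest(self, symbol: str) -> Optional["SymbolScope"]:
        """Walk outwards from this scope and return the first one defining `symbol`."""
        scope: Optional[SymbolScope] = self
        while scope is not None:
            if symbol in scope.symbols:
                return scope
            scope = scope.parent
        return None
```

In this model a lookup that starts inside the GPU module finds the device function there, while a lookup restricted to the outer module does not.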
Co-authored-by: Artem Kroviakov <artem.kroviakov@tum.de>
The `GetResultPtrElementType` interface is dead now that MLIR has fully
moved to opaque pointers, and can be removed.
Add namespace qualifiers to all argument types and return types of
interface methods for when they're used outside of LLVM dialect.