The PR fixes the datatype error for `nvvm.mma.sync` when the operand is
`bf16`. This operation originally requires the A/B type to be `f16x2`
for the `bf16` MMA. However, it violates the NVVM intrinsic
[[here](372044ee09/llvm/include/llvm/IR/IntrinsicsNVVM.td (L119))],
where the A/B operand type should be `i32`. This is a bug, and there are
no tests in MLIR that cover this datatype.
```
// mma bf16 -> s32 @ m16n8k16/m16n8k8
!eq(gft,"m16n8k16:a:bf16") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m16n8k16:b:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:a:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:b:bf16") : [llvm_i32_ty],
```
This PR addresses this bug and adds tests to guarantee correctness.
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
#115808 adds additional `custom<>` parser/printer variants. The overall
list of overloads/variants is getting larger.
This commit removes overloads that are not needed, to keep the
parser/printer simple.
The GPU ID operations already implement InferIntRangeInterface, which
gives constant lower and upper bounds on those IDs when appropriate
metadata is prentent on the operations or in the surrounding context.
This commit uses that existing code to implement the
ValueBoundsOpInterface, which is used when analyzing affine operations
(unlike the integer range interface, which is used for arithmetic
optimization).
It also implements the interface for gpu.launch, where we can use it to
express the constraint that block/grid sizes are equal to their value
from outside the launch op and that the corresponding IDs are bounded
above by that size.
As a consequence, the test pass for this inference is updated to work on
a FunctionOpInterface and not a func.func, creating minor churn in other
tests.
OpenACC data clause operations previously required that the variable
operand implemented PointerLikeType interface. This was a reasonable
constraint because the dialects currently mixed with `acc` do use
pointers to represent variables. However, this forces the "pointer"
abstraction to be exposed too early and some cases are not cleanly
representable through this approach (more specifically FIR's `fix.box`
abstraction).
Thus, relax this by allowing a variable to be a type which implements
either `PointerLikeType` interface or `MappableType` interface.
This commit is a further incremental step toward moving the whole
mlir-vulkan-runner MLIR pass pipeline into mlir-opt (see #73457). The
previous step was b225b3adf7b78387c9fcb97a3ff0e0a1e26eafe2, which moved
all device passes prior to SPIR-V serialization into a new mlir-opt test
pass, `-test-vulkan-runner-pipeline`.
This commit changes how SPIR-V serialization is accomplished for Vulkan
runner tests. Until now, this was done by the Vulkan-specific
ConvertGpuLaunchFuncToVulkanLaunchFunc pass. With this commit, this
responsibility is removed from that pass, and is instead done with the
existing generic GpuModuleToBinaryPass. In addition, the SPIR-V
serialization step is no longer done inside mlir-vulkan-runner, but
rather inside mlir-opt (in the `-test-vulkan-runner-pipeline` pass).
Both of these changes represent a greater alignment between
mlir-vulkan-runner and the other GPU integration tests. Notably, the IR
shapes produced by the mlir-opt pipelines for the Vulkan and SYCL
runners are now much more similar, with both using a gpu.binary op for
the serialized SPIR-V kernel.
In order to enable this, this commit includes these supporting changes:
- ConvertToSPIRVPass is enhanced to support producing the IR shape where
a spirv.module is nested inside a gpu.module, since this is what
GpuModuleToBinaryPass expects.
- ConvertGPULaunchFuncToVulkanLaunchFunc is changed to remove its SPIR-V
serialization functionality, and instead now extracts the SPIR-V from a
gpu.binary operation (as produced by ConvertToSPIRVPass).
- `-test-vulkan-runner-pipeline` now attaches SPIR-V target information
required by GpuModuleToBinaryPass.
- The WebGPU pass option, which had been removed from mlir-vulkan-runner
in the previous commit in this series, is restored as an option to
`-test-vulkan-runner-pipeline` instead, so that the WebGPU pass
continues being inserted into the pipeline just before SPIR-V
serialization.
This change:
- makes **scf.if** recursively speculatable like **affine.if** is.
- also introduces related LICM tests for both **scf.if** and
**affine.if**
Some functions of the deprecated 1:N dialect conversion were marked as
`LLVM_DEPRECATED`. This caused compilation warnings because there are
still test cases of the 1:N dialect conversion framework. (These test
cases will be deleted at the same time when the 1:N driver is deleted.)
Tosa v1.0 adds accumulator type attributes to the various convolution
operations defined in the spec. Update the dialect and any lit tests to
include these attributes.
Signed-off-by: Tai Ly <tai.ly@arm.com>
Co-authored-by: Tai Ly <tai.ly@arm.com>
This PR adds a missing verifier for `tosa.pad`, ensuring that the
padding shape matches [2*rank(shape1)] according to V1.0.0
Specification. Fixes#119840.
The current implementation of LocationSnapshotPass takes an
OpPrintingFlags argument and stores it as member, but does not use it
for printing.
Properly implement the printing flags, also supporting command line args.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Refactors `analysisStates` to use two nested maps . This prevents
`eraseState` from having to scan through every analysis state which can
be costly when there are many analysis states and/or `eraseState` is
called frequently.
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
This commit add an NVIDIA-specific lowering of `cf.assert` to to
`__assertfail`.
Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
The `properlyDominates` implementations for blocks and ops are very
similar. This commit replaces them with a single implementation that
operates on block iterators. That implementation can be used to
implement both `properlyDominates` variants.
Before:
```c++
template <bool IsPostDom>
bool DominanceInfoBase<IsPostDom>::properlyDominatesImpl(Block *a,
Block *b) const;
template <bool IsPostDom>
bool DominanceInfoBase<IsPostDom>::properlyDominatesImpl(
Operation *a, Operation *b, bool enclosingOpOk) const;
```
After:
```c++
template <bool IsPostDom>
bool DominanceInfoBase<IsPostDom>::properlyDominatesImpl(
Block *aBlock, Block::iterator aIt, Block *bBlock, Block::iterator bIt,
bool enclosingOk) const;
```
Note: A subsequent commit will add a new public `properlyDominates`
overload that accepts block iterators. That functionality can then be
used to find a valid insertion point at which a range of values is
defined (by utilizing post dominance).
The existing canonicalization patterns would only cancel out cases where
the entire result list of an affine.delineraize_index was passed to an
affine.lineraize_index and the basis elements matched exactly (except
possibly for the outer bounds).
This was correct, but limited, and left open many cases where a
delinearize_index would take a series of divisions and modulos only for
a subsequent linearize_index to use additions and multiplications to
undo all that work.
This sort of simplification is reasably easy to observe at the level of
splititng and merging indexes, but difficult to perform once the
underlying arithmetic operations have been created.
Therefore, this commit generalizes the existing simplification logic.
Now, any run of two or more delinearize_index results that appears
within the argument list of a linearize_index operation with the same
basis (or where they're both at the outermost position and so can be
unbonded, or when `linearize_index disjoint` implies a bound not present
on the `delinearize_index`) will be reduced to one signle
delinearize_index output, whose basis element (that is, size or length)
is equal to the product of the sizes that were simplified away.
That is, we can now simplify
%0:2 = affine.delinearize_index %n into (8, 8) : inde, index
%1 = affine.linearize_index [%x, %0#0, %0#1, %y] by (3, 8, 8, 5) : index
to the simpler
%1 = affine.linearize_index [%x, %n, %y] by (3, 64, 5) : index
This new pattern also works with dynamically-sized basis values.
While I'm here, I fixed a bunch of typos in existing tests, and added a
new getPaddedBasis() method to make processing
potentially-underspecified basis elements simpler in some cases.
This commit updates the internal `ConversionValueMapping` data structure
in the dialect conversion driver to support 1:N replacements. This is
the last major commit for adding 1:N support to the dialect conversion
driver.
Since #116470, the infrastructure already supports 1:N replacements. But
the `ConversionValueMapping` still stored 1:1 value mappings. To that
end, the driver inserted temporary argument materializations (converting
N SSA values into 1 value). This is no longer the case. Argument
materializations are now entirely gone. (They will be deleted from the
type converter after some time, when we delete the old 1:N dialect
conversion driver.)
Note for LLVM integration: Replace all occurrences of
`addArgumentMaterialization` (except for 1:N dialect conversion passes)
with `addSourceMaterialization`.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
Switch from rewrite patterns to conversion patterns. This allows to
perform type conversions together with other parts of the IR. For
example, this allows to convert from index to emit.size_t types.
Edit the `findValueInReverseUseDefChain` method to accept `OpOperand`
instead of the `Value` type, This change will make sure that the
populated `visitedOpOperands` argument is fully accurate and contains
the opOperand we have started the reverse chain from.
This PR Adds a `ControlBuildSubsetExtractionFn` to the tensor empty
elimination util, This will control the building of the subsets
extraction of the
`SubsetInsertionOpInterface`.
This control function returns the subsets extraction value that will
replace the `emptyTensorOp` use
which is being consumed by a specefic user (which the
util expects to eliminate it).
The default control function will stay like today's behavior without any
additional changes.
This PR adds a new interface method to PartialReductionOpInterface which
allows it to query the result tile position for the partial result.
Previously, tiling the reduction dimension with
SplitReductionOuterReduction when the result has transposed parallel
dimensions would produce wrong results.
Other fixes that were needed to make this PR work:
- Instead of ad-hoc logic to decide where to place the new reduction
dimensions in the partial result based on the iteration space, the
reduction dimensions are always appended to the partial result tensor.
- Remove usage of PartialReductionOpInterface in Mesh dialect. The
implementation was trying to just get a neutral element, but ended up
trying to use PartialReductionOpInterface for it, which is not right. It
was also passing the wrong sizes to it.
The API used internally expects PartialReductionOpInterface. This patch
allows any operation implementing this interface to use this transform
op (instead of just LinalgOp).
- do not refer to handles as `PDLOperation`, this is an outdated and
incorrect vision of what they are based on the type used in the early
days;
- use backticks around inline code.
In `buffer-results-to-out-params`, when `hoist-static-allocs` option is
enabled the pass was looking for `memref.alloc`s in order to attempt to
avoid copies when it can. Which makes it not extensible to external ops
that have allocation like properties. This patch simply changes
`memref::AllocOp` to `AllocationOpInterface` in the check to enable for
any allocation op.
Moreover, for function call updates, we enable setting an allocation
function callback in `BufferResultsToOutParamsOpts` to allow users to
emit their own alloc-like op.
Implementatoin details:
Both Mutexinoutset and Inoutset is recognized as flag=0x4
and 0x8 respectively, the flags is set to `kmp_depend_info` and
passed as argument to `__kmpc_omp_task_with_deps` runtime call
Since the property system isn't currently in heavy use, it's probably
the right time to fix a choice I made when expanding ODS property
support.
Specifically, most of the property subclasses, like OptionalProperty or
IntProperty, wrote out the word "Property" in full. The corresponding
classes in the Attribute hierarchy uses the short-form "Attr" in those
cases, as in OptionalAttr or DefaultValuedAttr.
This commit changes all those uses of "Property" to "Prop" in order to
prevent excessively verbose tablegen files that needlessly repeat the
full name of a core concept that can be abbreviated.
So, this commit renames all the FooProperty classes to FooProp, and
keeps the existing names as alias with a Deprecated<> on them to warn
people.
In addition, this commit updates the documentation around properties to
mention the constraint support.
In downstream (check https://github.com/google/heir/pull/1228,
especially [this
commit](fbf0b2733f);
also check https://github.com/google/heir/pull/1154) we often need to
re-run the analysis during the transformation pass as IR get changed
based on the analysis result and analysis continuously get invalidated.
There are solutions to it like `getOrCreateState` for newly created
`Value` (`AnchorT`), but warning is that the new state does not
propagate! This is quite unexpected as user of analysis would expect it
to propagate. We downstream used to use `solver->propagateIfChanged` but
that turned out to be not working, see detailed writeup in
https://github.com/google/heir/issues/1153.
Just call `initializeAndRun` repeatedly also does not solve the problem
as `analysisStates` is not cleared and the monotonicity of
`AnalysisState` will make the analysis invalid as `join` will not work
as expected (the first join is no longer `join(uninitialized, init
value)`, instead it becomes `join(higher value, init value)`.
To correctly re-run the analysis, either a new `DataFlowSolver` is
created, or we can just clear the `analysisState`.
Fix a use-after-return in #117513. Free-standing lambdas should not be
defined inside of the `LLVMTypeConverter` constructor because they go
out of scope.
This PR updates the WgmmaFenceAlignedOp, WgmmaGroupSyncAlignedOp, and
WgmmaWaitGroupSyncOp Ops in the NVVM Dialect to lower to the
corresponding intrinsics instead of inline-ptx.
The existing test under Conversion/NVVMToLLVM is updated to check for
the new patterns and separate tests are added under Target/LLVMIR to
verify the lowered intrinsics.