The PR fixes the datatype error for `nvvm.mma.sync` when the operand is
`bf16`. This operation originally requires the A/B type to be `f16x2`
for the `bf16` MMA. However, it violates the NVVM intrinsic
[[here](372044ee09/llvm/include/llvm/IR/IntrinsicsNVVM.td (L119))],
where the A/B operand type should be `i32`. This is a bug, and there are
no tests in MLIR that cover this datatype.
```
// mma bf16 -> s32 @ m16n8k16/m16n8k8
!eq(gft,"m16n8k16:a:bf16") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m16n8k16:b:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:a:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:b:bf16") : [llvm_i32_ty],
```
This PR addresses this bug and adds tests to guarantee correctness.
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
The decomposition of `linalg.softmax` uses `maxnumf`, but the identity
element that is used in the generated code is the one for `maximumf`.
They are not the same, as the identity for `maxnumf` is `NaN`, while the
one of `maximumf` is `-Infty`. This is wrong and prevents the maxnumf
from being folded.
Related to #114595, which fixed the folder for maxnumf.
In `Gather1DToConditionalLoads`, currently we will check if the stride
of the most minor dim of the input memref is 1. And if not, the
rewriting pattern will not be applied. However, according to the
verification of `vector.load` here:
4e32271e8b/mlir/lib/Dialect/Vector/IR/VectorOps.cpp (L4971-L4975)
.. if the output vector type of `vector.load` contains only one element,
we can ignore the requirement of the stride of the input memref, i.e.
the input memref can be with any stride layout attribute in such case.
So here we can allow more cases in lowering `vector.gather` by relaxing
such check.
As shown in the test case attached in this patch
[here](1933fbad58/mlir/test/Dialect/Vector/vector-gather-lowering.mlir (L151)),
now `vector.gather` of memref with non-trivial stride can be lowered
successfully if the result vector contains only one element.
---------
Signed-off-by: PragmaTwice <twice@apache.org>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
Certain non-standard float types were directly passed through in the
LLVM type converter, resulting in invalid IR or failed assertions:
```
mlir-opt: mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp:638: FailureOr<Type> mlir::LLVMTypeConverter::convertVectorType(VectorType) const: Assertion `LLVM::isCompatibleVectorType(vectorType) && "expected vector type compatible with the LLVM dialect"' failed.
```
The LLVM type converter should not define invalid type conversion rules
for such types. If there is no type conversion rule, conversion patterns
will not apply to ops with such operand types.
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Add the `convergent` attribute to builtin functions and builtin function
calls when lowering SPIR-V non-uniform group functions to LLVM dialect.
---------
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
The GPU ID operations already implement InferIntRangeInterface, which
gives constant lower and upper bounds on those IDs when appropriate
metadata is prentent on the operations or in the surrounding context.
This commit uses that existing code to implement the
ValueBoundsOpInterface, which is used when analyzing affine operations
(unlike the integer range interface, which is used for arithmetic
optimization).
It also implements the interface for gpu.launch, where we can use it to
express the constraint that block/grid sizes are equal to their value
from outside the launch op and that the corresponding IDs are bounded
above by that size.
As a consequence, the test pass for this inference is updated to work on
a FunctionOpInterface and not a func.func, creating minor churn in other
tests.
OpenACC data clause operations previously required that the variable
operand implemented PointerLikeType interface. This was a reasonable
constraint because the dialects currently mixed with `acc` do use
pointers to represent variables. However, this forces the "pointer"
abstraction to be exposed too early and some cases are not cleanly
representable through this approach (more specifically FIR's `fix.box`
abstraction).
Thus, relax this by allowing a variable to be a type which implements
either `PointerLikeType` interface or `MappableType` interface.
This commit is a further incremental step toward moving the whole
mlir-vulkan-runner MLIR pass pipeline into mlir-opt (see #73457). The
previous step was b225b3adf7b78387c9fcb97a3ff0e0a1e26eafe2, which moved
all device passes prior to SPIR-V serialization into a new mlir-opt test
pass, `-test-vulkan-runner-pipeline`.
This commit changes how SPIR-V serialization is accomplished for Vulkan
runner tests. Until now, this was done by the Vulkan-specific
ConvertGpuLaunchFuncToVulkanLaunchFunc pass. With this commit, this
responsibility is removed from that pass, and is instead done with the
existing generic GpuModuleToBinaryPass. In addition, the SPIR-V
serialization step is no longer done inside mlir-vulkan-runner, but
rather inside mlir-opt (in the `-test-vulkan-runner-pipeline` pass).
Both of these changes represent a greater alignment between
mlir-vulkan-runner and the other GPU integration tests. Notably, the IR
shapes produced by the mlir-opt pipelines for the Vulkan and SYCL
runners are now much more similar, with both using a gpu.binary op for
the serialized SPIR-V kernel.
In order to enable this, this commit includes these supporting changes:
- ConvertToSPIRVPass is enhanced to support producing the IR shape where
a spirv.module is nested inside a gpu.module, since this is what
GpuModuleToBinaryPass expects.
- ConvertGPULaunchFuncToVulkanLaunchFunc is changed to remove its SPIR-V
serialization functionality, and instead now extracts the SPIR-V from a
gpu.binary operation (as produced by ConvertToSPIRVPass).
- `-test-vulkan-runner-pipeline` now attaches SPIR-V target information
required by GpuModuleToBinaryPass.
- The WebGPU pass option, which had been removed from mlir-vulkan-runner
in the previous commit in this series, is restored as an option to
`-test-vulkan-runner-pipeline` instead, so that the WebGPU pass
continues being inserted into the pipeline just before SPIR-V
serialization.
Replaces https://github.com/llvm/llvm-project/pull/121886
Fixes https://github.com/llvm/llvm-project/issues/120254 (hopefully 🤞)
## Problem
Consider the following example:
```fortran
program test
real :: x(1)
integer :: i
!$omp parallel do reduction(+:x)
do i = 1,1
x = 1
end do
!$omp end parallel do
end program
```
The HLFIR+OMP IR for this example looks like this:
```mlir
func.func @_QQmain() {
...
omp.parallel {
%5 = fir.embox %4#0(%3) : (!fir.ref<!fir.array<1xf32>>, !fir.shape<1>) -> !fir.box<!fir.array<1xf32>>
%6 = fir.alloca !fir.box<!fir.array<1xf32>>
...
omp.wsloop private(@_QFEi_private_ref_i32 %1#0 -> %arg0 : !fir.ref<i32>) reduction(byref @add_reduction_byref_box_1xf32 %6 -> %arg1 : !fir.ref<!fir.box<!fir.array<1xf32>>>) {
omp.loop_nest (%arg2) : i32 = (%c1_i32) to (%c1_i32_0) inclusive step (%c1_i32_1) {
...
omp.yield
}
}
omp.terminator
}
return
}
```
The problem addressed by this PR is related to: the `alloca` in the
`omp.parallel` region + the related `reduction` clause on the
`omp.wsloop` op. When we try translate the reduction from MLIR to LLVM,
we have to choose an `alloca` insertion point. This happens in
`convertOmpWsloop` where at entry to that function, this is what the
LLVM module looks like:
```llvm
define void @_QQmain() {
%tid.addr = alloca i32, align 4
...
entry:
%omp_global_thread_num = call i32 @__kmpc_global_thread_num(ptr @1)
br label %omp.par.entry
omp.par.entry:
%tid.addr.local = alloca i32, align 4
...
br label %omp.par.region
omp.par.region:
br label %omp.par.region1
omp.par.region1:
...
%5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
```
Now, when we choose an `alloca` insertion point for the reduction, this
is the chosen block `omp.par.entry` (without the changes in this PR).
The problem is that the allocation needed for the reduction needs to
reference the `%5` SSA value. This results in inserting allocations in
`omp.par.entry` that reference allocations in a later block
`omp.par.region1` which causes the `Instruction does not dominate all
uses!` error.
## Possible solution - take 2:
This PR contains a more localized solution than
https://github.com/llvm/llvm-project/pull/121886. It makes sure that on
entry to `initReductionVars`, the IR builder is at a point where we can
starting inserting initialization region; to make things cleaner, we
still split the builder insertion point to a dedicated
`omp.reduction.init`. This way we avoid splitting after the latest
allocation block; which is what causing the issue.
This PR adds support to the `bf16` and `i1` data types when converting
`gpu::shuffle` to the `LLVMSPV` dialect, by inserting `bitcast` to/from
`i16` (for `bf16`) and extending/truncating to `i8` (for `i1`).
Tosa v1.0 adds accumulator type attributes to the various convolution
operations defined in the spec. Update the dialect and any lit tests to
include these attributes.
Signed-off-by: Tai Ly <tai.ly@arm.com>
Co-authored-by: Tai Ly <tai.ly@arm.com>
This PR adds a missing verifier for `tosa.pad`, ensuring that the
padding shape matches [2*rank(shape1)] according to V1.0.0
Specification. Fixes#119840.
the `ptx_kernel` calling convention is a more idiomatic and standard way
of specifying a NVPTX kernel than using the metadata which is not
supposed to change the meaning of the program. Further, checking the
calling convention is significantly faster than traversing the metadata,
improving compile time.
This change updates the clang and mlir frontends as well as the
NVPTXCtorDtorLowering pass to emit kernels using the calling convention.
In addition, this updates all NVPTX unit tests to use the calling
convention as well.
Since a need for it came up dowstream (in proving that loops run at
least once), this commit implements the ValueBoundsOpInterface for
affine.delinearize_index and affine.linearize_index, using affine map
representations of the operations they perform.
These implementations also use information from outer bounds to impose
additional constraints when those are available.
The current implementation of LocationSnapshotPass takes an
OpPrintingFlags argument and stores it as member, but does not use it
for printing.
Properly implement the printing flags, also supporting command line args.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Since index operations have no set bitwidth, it is ill-defined to use
signed/unsigned wrapping behavior. The corollary to which is that it is
always safe to add nsw/nuw to lowering of affine ops.
Also add a folder to fold `div(s|u)i (mul (a, v), v) -> a`
Signed-off-by: MaheshRavishankar <mravisha@amd.com>
This commit add an NVIDIA-specific lowering of `cf.assert` to to
`__assertfail`.
Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
Print operations are often used for debugging, immediately before the
compiler aborts. In such cases, it is sometimes possible that the output
isn't fully produced yet. Make sure it is by explicitly flushing the
output.
Fixes two minor issues in `findOrBuildReplacementValue`:
* Remove a redundant `mapping.map`.
* Map `repl` instead of `value`. We used to overwrite an existing
mapping, which could introduce extra materializations.
Note: We generally do not want to overwrite mappings, but create a chain
of mappings. There are still a few more places, where a mapping is
overwritten. Once those are fixed, I will put an assertion into
`ConversionValueMapping::map`.
Remove `func.call` and `func.return` patterns from
`populateArmSVELegalizeForLLVMExportPatterns`. This function is called
from `ConvertVectorToLLVMPass::runOnOperation`. That pass should lower
only `vector` dialect ops, not `func` dialect ops. These patterns also
seem to be unnecessary, as no test cases are failing without them. Also
note that there is no `func.func` pattern, so any application of the
above-mentioned patterns produces invalid IR.
The `buildUnresolvedMaterialization` implementation used to check if a
materialization is necessary. A materialization is not necessary if the
desired types already match the input. However, this situation can never
happen: we look for mapped values with the desired type at the call
sites before requesting a new unresolved materialization.
The previous implementation seemed incorrect because
`buildUnresolvedMaterialization` created a mapping that is never rolled
back. (When in reality that code was never executed, so it is
technically not incorrect.)
Also fix a comment that in `findOrBuildReplacementValue` that was
incorrect.
The `properlyDominates` implementations for blocks and ops are very
similar. This commit replaces them with a single implementation that
operates on block iterators. That implementation can be used to
implement both `properlyDominates` variants.
Before:
```c++
template <bool IsPostDom>
bool DominanceInfoBase<IsPostDom>::properlyDominatesImpl(Block *a,
Block *b) const;
template <bool IsPostDom>
bool DominanceInfoBase<IsPostDom>::properlyDominatesImpl(
Operation *a, Operation *b, bool enclosingOpOk) const;
```
After:
```c++
template <bool IsPostDom>
bool DominanceInfoBase<IsPostDom>::properlyDominatesImpl(
Block *aBlock, Block::iterator aIt, Block *bBlock, Block::iterator bIt,
bool enclosingOk) const;
```
Note: A subsequent commit will add a new public `properlyDominates`
overload that accepts block iterators. That functionality can then be
used to find a valid insertion point at which a range of values is
defined (by utilizing post dominance).
The existing canonicalization patterns would only cancel out cases where
the entire result list of an affine.delineraize_index was passed to an
affine.lineraize_index and the basis elements matched exactly (except
possibly for the outer bounds).
This was correct, but limited, and left open many cases where a
delinearize_index would take a series of divisions and modulos only for
a subsequent linearize_index to use additions and multiplications to
undo all that work.
This sort of simplification is reasably easy to observe at the level of
splititng and merging indexes, but difficult to perform once the
underlying arithmetic operations have been created.
Therefore, this commit generalizes the existing simplification logic.
Now, any run of two or more delinearize_index results that appears
within the argument list of a linearize_index operation with the same
basis (or where they're both at the outermost position and so can be
unbonded, or when `linearize_index disjoint` implies a bound not present
on the `delinearize_index`) will be reduced to one signle
delinearize_index output, whose basis element (that is, size or length)
is equal to the product of the sizes that were simplified away.
That is, we can now simplify
%0:2 = affine.delinearize_index %n into (8, 8) : inde, index
%1 = affine.linearize_index [%x, %0#0, %0#1, %y] by (3, 8, 8, 5) : index
to the simpler
%1 = affine.linearize_index [%x, %n, %y] by (3, 64, 5) : index
This new pattern also works with dynamically-sized basis values.
While I'm here, I fixed a bunch of typos in existing tests, and added a
new getPaddedBasis() method to make processing
potentially-underspecified basis elements simpler in some cases.
This alters the condition in ForOpIterArgsFolder to always remove iter
args when their initial value equals the yielded value, not just when
the arg has no use.
Currently the compiler will ICE in programs like the following on the
device lowering pass:
```
program main
implicit none
type i1_t
integer :: val(1000)
end type i1_t
integer :: i
type(i1_t), pointer :: newi1
type(i1_t), pointer :: tab=>null()
integer, dimension(:), pointer :: tabval
!$omp THREADPRIVATE(tab)
allocate(newi1)
tab=>newi1
tab%val(:)=1
tabval=>tab%val
!$omp target teams distribute parallel do
do i = 1, 1000
tabval(i) = i
end do
!$omp end target teams distribute parallel do
end program main
```
This is due to the fact that THREADPRIVATE returns a result operation,
and this operation can actually be used by other LLVM dialect (or other
dialect) operations. However, we currently skip the lowering of
threadprivate, so we effectively never generate and bind an LLVM-IR
result to the threadprivate operation result. So when we later go on to
lower dependent LLVM dialect operations, we are missing the required
LLVM-IR result, try to access and use it and then ICE. The fix in this
particular PR is to allow compilation of threadprivate for device as
well as host, and simply treat the device compilation as a no-op,
binding the LLVM-IR result of threadprivate with no alterations and
binding it, which will allow the rest of the compilation to proceed,
where we'll eventually discard the host segment in any case.
The other possible solution to this I can think of, is doing something
similar to Flang's passes that occur prior to CodeGen to the LLVM
dialect, where they erase/no-op certain unrequired operations or
transform them to lower level series of operations. And we would
erase/no-op threadprivate on device as we'd never have these in target
regions.
The main issues I can see with this are that we currently do not
specialise this stage based on wether we're compiling for device or
host, so it's setting a precedent and adding another point of having to
understand the separation between target and host compilation. I am also
not sure we'd necessarily want to enforce this at a dialect level incase
someone else wishes to add a different lowering flow or translation
flow. Another possible issue is that a target operation we have/utilise
would depend on the result of threadprivate, meaning we'd not be allowed
to entirely erase/no-op it, I am not sure of any situations where this
may be an issue currently though.