Commit Graph

2341 Commits

Author SHA1 Message Date
Benjamin Maxwell
a4e15416b4 [mlir][ArmSME] Move creation of load/store intrinsics to helpers (NFC) (#76168)
Also, for consistency make the ZeroOp lowering switch on the ArmSMETileType,
rather than the element bit width.
2023-12-21 17:46:12 +00:00
Jakub Kuderski
72003adf6b [mlir][gpu] Allow subgroup reductions over 1-d vector types (#76015)
Each vector element is reduced independently, which is a form of
multi-reduction.

The plan is to allow for gradual lowering of multi-reduction that
results in fewer `gpu.shuffle` ops at the end:
1d `vector.multi_reduction` --> 1d `gpu.subgroup_reduce` --> smaller 1d
`gpu.subgroup_reduce` --> packed `gpu.shuffle` over i32

For example we can perform 2 independent f16 reductions with a series of
`gpu.shuffles` over i32, reducing the final number of `gpu.shuffles` by 2x.
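
As a minimal sketch of the newly allowed form (operand name and types assumed, not taken from the patch):
```mlir
// Each lane of the 1-d vector is reduced independently across the subgroup.
%red = gpu.subgroup_reduce add %vec : (vector<4xf16>) -> (vector<4xf16>)
```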
2023-12-21 11:55:43 -05:00
Paul C Fuqua
11141bc68a Fix what seems to be a silly bug in gpu.set_default_device rewriting. Smoke test included. (#75756) 2023-12-20 09:35:42 -06:00
Cullen Rhodes
4db0bd28e8 [mlir][vector][nfc] remove unused template parameter (#75931) 2023-12-20 08:06:25 +00:00
Jakub Kuderski
560564f51c [mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
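
As an illustrative sketch of the renaming (values assumed), a reduction that was
previously spelled `<maxf>` now reads:
```mlir
// Combining kind now matches arith.maxnumf semantics.
%r = vector.reduction <maxnumf>, %v : vector<4xf32> into f32
```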

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-12-20 00:14:43 -05:00
Matthias Springer
10056c821a [mlir][SCF] scf.parallel: Make reductions part of the terminator (#75314)
This commit makes reductions part of the terminator. Instead of
`scf.yield`, `scf.reduce` now terminates the body of `scf.parallel` ops.
`scf.reduce` may contain an arbitrary number of reductions, with one
region per reduction.

Example:
```mlir
%init = arith.constant 0.0 : f32
%r:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init, %init)
    -> f32, f32 {
  %elem_to_reduce1 = load %buffer1[%iv] : memref<100xf32>
  %elem_to_reduce2 = load %buffer2[%iv] : memref<100xf32>
  scf.reduce(%elem_to_reduce1, %elem_to_reduce2 : f32, f32) {
    ^bb0(%lhs : f32, %rhs: f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }, {
    ^bb0(%lhs : f32, %rhs: f32):
      %res = arith.mulf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
}
```

`scf.reduce` operations can no longer be interleaved with other ops in
the body of `scf.parallel`. This simplifies the op and makes it possible
to assign the `RecursiveMemoryEffects` trait to `scf.reduce`. (This was
not possible before because the op was not a terminator, causing the op
to be DCE'd.)
2023-12-20 11:06:27 +09:00
long.chen
227bfa1fb1 [mlir] fix a crash when lowering parallel loop to gpu (#75811) (#75946) 2023-12-20 09:13:15 +08:00
Rik Huijzer
9f5afc3de9 Revert "[mlir][vector] Fix invalid LoadOp indices being created (#75519)"
This reverts commit 3a1ae2f46d.
2023-12-17 12:34:17 +01:00
Rik Huijzer
3a1ae2f46d [mlir][vector] Fix invalid LoadOp indices being created (#75519)
Fixes https://github.com/llvm/llvm-project/issues/71326.

The cause of the issue was that a new `LoadOp` was created which looked
something like:
```mlir
func.func @main(%arg1 : index, %arg2 : index) {
  %alloca_0 = memref.alloca() : memref<vector<1x32xi1>>
  %1 = vector.type_cast %alloca_0 : memref<vector<1x32xi1>> to memref<1xvector<32xi1>>
  %2 = memref.load %1[%arg1, %arg2] : memref<1xvector<32xi1>>
  return
}
```
which crashed inside the `LoadOp::verify`. Note here that `%alloca_0` is
0 dimensional, `%1` has one dimension, but `memref.load` tries to index
`%1` with two indices.

This is now fixed by using the fact that `unpackOneDim` always unpacks
one dim


1bce61e6b0/mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp (L897-L903)

and so the `loadOp` should just index only one dimension.

---------

Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
2023-12-17 11:42:35 +01:00
Rob Suderman
aa165edca8 [mlir][math] Added math.sinh with expansions to math.exp (#75517)
Includes end-to-end tests running on the CPU, folders using `libm`, and
lowerings to the corresponding `libm` operations.
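
A rough sketch of the exp-based expansion, sinh(x) = (e^x - e^(-x)) / 2 (value
names assumed):
```mlir
%exp = math.exp %x : f32
%negx = arith.negf %x : f32
%nexp = math.exp %negx : f32
%diff = arith.subf %exp, %nexp : f32
%half = arith.constant 0.5 : f32
%sinh = arith.mulf %diff, %half : f32
```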
2023-12-15 11:35:40 -08:00
Cullen Rhodes
e7432babaf [mlir][ArmSME] Fail instead of error in vector.outerproduct lowering (#75447)
The 'vector.outerproduct' -> 'arm_sme.outerproduct' conversion currently
errors on unsupported cases when it should return failure.
2023-12-15 07:30:32 +00:00
Pablo Antonio Martinez
7f4f75c144 [MLIR][SCFToOpenMP] Add num-threads option (#74854)
Add a `num-threads` option to the `-convert-scf-to-openmp` pass, allowing
the number of threads used in the `omp.parallel` to be set to a fixed
value.
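
Invocation sketch (input file name assumed): `mlir-opt "-convert-scf-to-openmp=num-threads=4" input.mlir`
should pin the generated `omp.parallel` regions to four threads.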
2023-12-14 09:07:17 +00:00
Sungsoon Cho
762964e97f Add cosh op to the math dialect. (#75153) 2023-12-13 12:25:37 +01:00
Benjamin Maxwell
01ac530a2e [mlir][ArmSME] Remove vector.print legality from ArmSMEToSCF (NFC) (#74875)
This was moved to VectorToArmSME in #74063, so this is no longer needed.

VectorToArmSME uses a greedy rewriter, so a similar legality rule is not
needed there.

See:
bbb8a0df73/mlir/lib/Conversion/VectorToArmSME/VectorToArmSMEPass.cpp (L35)
2023-12-11 11:25:43 +00:00
Frederik Harwath
f7250179e2 Implement acos operator in MLIR Math Dialect (#74584)
Required for torch-mlir.
Cf. llvm/torch-mlir#2604 "Implement torch.aten.acos".
2023-12-08 09:08:43 -08:00
Mehdi Amini
847d8457d1 Apply clang-tidy fixes for performance-unnecessary-value-param in VectorToGPU.cpp (NFC) 2023-12-07 21:39:25 -08:00
Mehdi Amini
b8a3f0fd3a Apply clang-tidy fixes for llvm-qualified-auto in VectorToGPU.cpp (NFC) 2023-12-07 21:39:25 -08:00
Mehdi Amini
1cef577b90 Apply clang-tidy fixes for llvm-qualified-auto in PredicateTree.cpp (NFC) 2023-12-07 21:39:25 -08:00
Mehdi Amini
345d574b65 Apply clang-tidy fixes for llvm-prefer-isa-or-dyn-cast-in-conditionals in MapMemRefStorageClassPass.cpp (NFC) 2023-12-07 21:39:25 -08:00
Mehdi Amini
6ac80a7677 Apply clang-tidy fixes for readability-identifier-naming in GPUToLLVMConversion.cpp (NFC) 2023-12-07 21:39:25 -08:00
harsh-nod
42bba97fc2 [mlir] Extend CombineTransferReadOpTranspose pattern to handle extf ops. (#74754)
This patch modifies the CombineTransferReadOpTranspose pattern to handle
extf ops. Also adds a test which shows the transpose getting folded into
the transfer_read.
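
A sketch of the kind of IR the extended pattern now matches (shapes and value
names assumed):
```mlir
%c0 = arith.constant 0 : index
%pad = arith.constant 0.0 : f16
%read = vector.transfer_read %src[%c0, %c0], %pad : memref<16x16xf16>, vector<16x16xf16>
%ext = arith.extf %read : vector<16x16xf16> to vector<16x16xf32>
%t = vector.transpose %ext, [1, 0] : vector<16x16xf32> to vector<16x16xf32>
```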
2023-12-07 15:01:55 -08:00
Benjamin Maxwell
b0b69fd879 [mlir][ArmSME] More precisely model dataflow in ArmSME to SCF lowerings (#73922)
Since #73253, loops over tiles in SSA form (i.e. loops that take
`iter_args` and yield a new tile) are supported, so this patch updates
ArmSME lowerings to this form. This is an NFC, as it still lowers to the
same intrinsics, but it makes the IR less 'surprising' at a higher level,
and it may be recognised by more transforms.


Example:

IR before:
```mlir
scf.for %tile_slice_index = %c0 to %num_tile_slices step %c1 
{
   arm_sme.move_vector_to_tile_slice 
     %broadcast_to_1d, %tile, %tile_slice_index : 
     vector<[4]xi32> into vector<[4]x[4]xi32>
}
// ... later use %tile
```
IR now:
```mlir
%broadcast_to_tile = scf.for %tile_slice_index = %c0 to %num_tile_slices
    step %c1 iter_args(%iter_tile = %init_tile) -> (vector<[4]x[4]xi32>)
{
   %tile_update = arm_sme.move_vector_to_tile_slice
      %broadcast_to_1d, %iter_tile, %tile_slice_index :
      vector<[4]xi32> into vector<[4]x[4]xi32>
  scf.yield %tile_update : vector<[4]x[4]xi32>
}
// ... later use %broadcast_to_tile
```
2023-12-06 14:31:05 +00:00
Georgios Pinitas
3a772c3bfe [mlir][tosa] Add fp16 support to tosa.resize (#73019) 2023-12-06 12:48:44 +00:00
Tom Eccles
fcd06d774d [mlir][flang] add fast math attribute to fcmp (#74315)
`llvm.fcmp` does support fast math attributes, therefore so should
`arith.cmpf`.

The heavy churn in the flang tests is because flang sets
`fastmath<contract>` by default on all operations that support the fast
math interface. Downstream users of MLIR should not be much affected.

This was requested in https://github.com/llvm/llvm-project/issues/74263
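
Illustrative sketch of the newly accepted form (operands assumed):
```mlir
%cmp = arith.cmpf olt, %a, %b fastmath<fast> : f32
```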
2023-12-06 10:19:48 +00:00
Guray Ozen
641e05decc [mlir][gpu] Support dynamic_shared_memory Op with vector dialect (#74475)
`gpu.dynamic_shared_memory` currently does not get lowered when it is
used with the vector dialect. The reason is that the vector-to-llvm
conversion is not included in gpu-to-nvvm. This PR includes it and adds a test.
2023-12-06 10:41:57 +01:00
Matthias Springer
8f9aac4427 [mlir][vector] Fix invalid IR in vector.print lowering (#74410)
`DecomposePrintOpConversion` used to generate invalid ops such as:
```
error: 'arith.extsi' op operand type 'vector<10xi32>' and result type 'vector<10xi32>' are cast incompatible
  vector.print %v9 : vector<10xi32>
```

This commit fixes tests such as
`mlir/test/Integration/Dialect/Vector/CPU/test-reductions-i32.mlir` when
verifying the IR after each pattern application (#74270).
2023-12-06 09:44:03 +09:00
Guray Ozen
391a7577e7 [mlir][gpu] Add lowering dynamic_shared_memory op for rocdl (#74473)
This PR adds lowering of `gpu.dynamic_shared_memory` to rocdl target.
2023-12-05 19:56:43 +01:00
Benjamin Maxwell
01e40a8a3d [mlir][ArmSME] Remove ArmSMETypeConverter (and configure LLVM one instead) (#73639)
This patch removes the ArmSMETypeConverter, and instead updates
`populateArmSMEToLLVMConversionPatterns()` to add an ArmSME vector type
conversion to the existing LLVMTypeConverter. This makes it easier to
add these patterns to an existing `-to-llvm` lowering pass.
2023-12-04 17:02:48 +00:00
Guray Ozen
3a03da37a3 [mlir][nvgpu] Add address space attribute converter in nvgpu-to-nvvm pass (#74075)
The GPU dialect has `#gpu.address_space<workgroup>` for the shared memory of
NVGPU (address space = 3). However, when IR combines the NVGPU and GPU
dialects, the `nvgpu-to-nvvm` pass fails due to the missing attribute conversion.

This PR adds `populateGpuMemorySpaceAttributeConversions` to the
nvgpu-to-nvvm lowering, so `#gpu.address_space<workgroup>` can be used with the
`nvgpu-to-nvvm` pass.
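
For reference, a sketch of the kind of type whose memory space now gets
converted (shape assumed):
```mlir
%shmem = memref.alloc() : memref<32x32xf32, #gpu.address_space<workgroup>>
```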
2023-12-04 16:48:39 +01:00
Benjamin Maxwell
10063c5a29 [mlir][ArmSME] Move vector.print -> ArmSME lowering to VectorToArmSME (#74063)
This moves the SME tile vector.print lowering from
`-convert-arm-sme-to-scf` to `-convert-vector-to-arm-sme`. This seems
like a more logical place, as this is lowering a vector op to ArmSME,
and it also prevents vector.print from blocking tile allocation.
2023-12-04 09:42:11 +00:00
Spenser Bauman
293c21db93 [mlir][tosa] Improve lowering of tosa.conv2d (#74143)
The existing lowering of tosa.conv2d emits a separate linalg.generic
operator to add the bias after computing the convolution.

This change eliminates that additional step by using the broadcast bias
value as the init (outs) operand of the generated linalg.conv_2d_*
operation.

Rather than:

    %init = tensor.empty()
    %conv = linalg.conv_2d ins(%A, %B) outs(%init)

    %init2 = tensor.empty()
    %biased = linalg.generic ins(%conv, %bias) outs(%init2) {
      // perform add operation
    }

The lowering now produces:

    %init = tensor.empty()
    %bias_expanded = linalg.broadcast ins(%bias) outs(%init)
    %conv = linalg.conv_2d ins(%A, %B) outs(%bias_expanded)

This is the same strategy as
https://github.com/llvm/llvm-project/pull/73049 applied to convolutions.
2023-12-02 12:29:10 +00:00
Spenser Bauman
f58fb8c209 [mlir][tosa] Fix lowering of tosa.conv2d (#73240)
The lowering of tosa.conv2d produces an illegal tensor.empty operation
where the number of inputs does not match the number of dynamic dimensions
in the output type.

The fix is to base the generation of tensor.dim operations off the
result type of the conv2d operation, rather than the input type. The
problem and fix are very similar to this fix

https://github.com/llvm/llvm-project/pull/72724

but for convolution.
2023-12-01 15:33:14 +00:00
Spenser Bauman
0d87e25779 [mlir][tosa] Improve lowering to tosa.fully_connected (#73049)
The current lowering of tosa.fully_connected produces a linalg.matmul
followed by a linalg.generic to add the bias. The IR looks like the
following:

    %init = tensor.empty()
    %zero = linalg.fill ins(0 : f32) outs(%init)
    %prod = linalg.matmul ins(%A, %B) outs(%zero)

    // Add the bias
    %initB = tensor.empty()
    %result = linalg.generic ins(%prod, %bias) outs(%initB) {
       // add bias and product
    }

This has two downsides:

1. The tensor.empty operations typically result in additional
allocations after bufferization
2. There is a redundant traversal of the data to add the bias to the
matrix product.

This extra work can be avoided by leveraging the out-param of
linalg.matmul. The new IR sequence is:

    %init = tensor.empty()
    %broadcast = linalg.broadcast ins(%bias) outs(%init)
    %prod = linalg.matmul ins(%A, %B) outs(%broadcast)

In my experiments, this eliminates one loop and one allocation (post
bufferization) from the generated code.
2023-12-01 15:16:51 +00:00
Rik Huijzer
c84061fd34 [mlir][vector] Fix a target-rank=0 unrolling (#73365)
Fixes https://github.com/llvm/llvm-project/issues/64269.

With this patch, calling `mlir-opt "-convert-vector-to-scf=full-unroll
target-rank=0"` on
```mlir
func.func @main(%vec : vector<2xi32>) {
  %alloc = memref.alloc() : memref<4xi32>
  %c0 = arith.constant 0 : index
  vector.transfer_write %vec, %alloc[%c0] : vector<2xi32>, memref<4xi32>
  return
}
```
will result in
```mlir
module {
  func.func @main(%arg0: vector<2xi32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %alloc = memref.alloc() : memref<4xi32>
    %0 = vector.extract %arg0[0] : i32 from vector<2xi32>
    %1 = vector.broadcast %0 : i32 to vector<i32>
    vector.transfer_write %1, %alloc[%c0] : vector<i32>, memref<4xi32>
    %2 = vector.extract %arg0[1] : i32 from vector<2xi32>
    %3 = vector.broadcast %2 : i32 to vector<i32>
    vector.transfer_write %3, %alloc[%c1] : vector<i32>, memref<4xi32>
    return
  }
}
```

I've also tried to proactively find other `target-rank=0` bugs, but
couldn't find any. `options.targetRank` is only used 8 times throughout
the `mlir` folder, all inside `VectorToSCF.cpp`. None of the other uses
look like they could cause a crash. I've also tried

```mlir
func.func @main(%vec : vector<2xi32>) -> vector<2xi32> {
  %alloc = memref.alloc() : memref<4xindex>
  %c0 = arith.constant 0 : index
  %out = vector.transfer_read %alloc[%c0], %c0 : memref<4xindex>, vector<2xi32>
  return %out : vector<2xi32>
}
```
with `"--convert-vector-to-scf=full-unroll target-rank=0"` and that also
didn't crash. (Maybe obvious. I have to admit that I'm not very familiar
with these ops.)
2023-11-30 13:29:09 +01:00
Benjamin Maxwell
eaff02f28e [mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
 * Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so is required to be
a compile-time constant), `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
2023-11-30 10:22:22 +00:00
Mehdi Amini
9415fca848 [mlir] Fix build with shared libs (missing cmake link dependency) (NFC) 2023-11-29 12:17:52 -08:00
Mehdi Amini
9e7b6f46ba [mlir] Adopt ConvertToLLVMPatternInterface GpuToLLVMConversionPass to align with convert-to-llvm (#73761)
This is a follow-up to the introduction of `convert-to-llvm`: it is
supposed to be a unifying pass through the
`ConvertToLLVMPatternInterface`, but some specific conversions (like the
GPU target) aren't vanilla LLVM targets. Instead they need extra
customizations that are specific to LLVM-on-GPUs and our custom runtime
wrappers.
This change makes the GpuToLLVMConversionPass just as pluggable as
`convert-to-llvm` by using the same mechanism.
2023-11-29 11:37:53 -08:00
Jakub Kuderski
1bdb2e8550 [mlir][spirv] Simplify gpu reduction to spirv logic (#73546)
Check the type only once and then specialize op handlers based on it.

Do not use boolean types in group arithmetic ops that expect integers.
2023-11-27 12:33:41 -05:00
Jakub Kuderski
7eccd52842 Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit dd09221a29 and relands
https://github.com/llvm/llvm-project/pull/73423.

* Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.
2023-11-27 11:38:18 -05:00
Jakub Kuderski
dd09221a29 Revert "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit e0aac8c88d.

I'm seeing some nvidia integration test failures:
https://lab.llvm.org/buildbot/#/builders/61/builds/52334.
2023-11-27 11:29:23 -05:00
Jakub Kuderski
e0aac8c88d [mlir][gpu] Align reduction operations with vector combining kinds (#73423)
The motivation for this change is explained in
https://github.com/llvm/llvm-project/issues/72354.

Before this change, we could not distinguish between signed/unsigned
minimum/maximum or specify the NaN treatment for floating-point values.

The mapping of old reduction operations to the new ones is as follows:
*  `min` --> `minsi` for ints, `minf` for floats
*  `max` --> `maxsi` for ints, `maxf` for floats

New reduction kinds not represented in the old enum: `minui`, `maxui`,
`minimumf`, `maximumf`.

As a next step, I would like to have a common definition of combining
kinds used by the `vector` and `gpu` dialects. Separately, the GPU to
SPIR-V lowering does not yet properly handle zero and NaN values -- the
behavior of floating point min/max group reductions is not specified by
the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459. 

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-11-27 11:19:20 -05:00
Jakub Kuderski
6b9c186b2d [mlir][spirv] Handle non-innerprod float vector add reductions (#73476)
Instead of extracting all individual vector components and performing a
scalar summation, use `spirv.Dot` with the original reduction operand
and a vector constant of all ones.
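
A sketch of the emitted pattern (value names assumed):
```mlir
%ones = spirv.Constant dense<1.0> : vector<4xf32>
%sum = spirv.Dot %operand, %ones : vector<4xf32> -> f32
```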
2023-11-27 11:10:25 -05:00
Guray Ozen
edf5cae739 [mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA),
a new level of parallelism that allows clusters of Cooperative
Thread Arrays (CTAs) to synchronize and communicate through shared memory
while running concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms
that we already have in the GPU dialect, such as outlining and kernel
launching, making it a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Hsiangkai Wang
477c0b67a3 [mlir][affine][gpu] Replace DivSIOp to CeilDivSIOp when lowering to GPU launch (#73328)
When converting affine.for to the GPU launch operation, we have to calculate
the block dimension and thread dimension for the launch operation.

The formula for the dimension size is

(upper_bound - lower_bound) / step_size

When the difference is not divisible by step_size, DivSIOp rounds the
result toward zero. However, the block dimension and thread
dimension are right-open ranges, i.e., [0, block_dim) and [0, thread_dim).
So we would get the wrong result if we used DivSIOp. In this patch, we
replace it with CeilDivSIOp to get the correct block and thread
dimension values.
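
A sketch of the corrected dimension computation (value names assumed):
```mlir
%range = arith.subi %ub, %lb : index
%dim = arith.ceildivsi %range, %step : index
```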
2023-11-27 08:05:54 +00:00
Jakub Kuderski
897141449e [mlir][spirv] Add floating point dot product (#73466)
Because `OpDot` does not require any extra capabilities or extensions,
enable it by default in the vector to spirv conversion.
2023-11-26 19:27:16 -05:00
Jakub Kuderski
d625ea12c7 [mlir][spirv] Split codegen for float min/max reductions and others v2. [NFC] (#73363)
This is https://github.com/llvm/llvm-project/pull/69023 but with
cleanups.
Reduced complexity by avoiding CRTP and preprocessor defines in favor of
free functions

Original description by @unterumarmung:

---

This patch is part of a larger initiative aimed at fixing floating-point
`max` and `min` operations in MLIR:
https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.

There are two types of min/max operations for floating-point numbers:
`minf`/`maxf` and `minimumf`/`maximumf`. The code generation for these
operations should differ from that of other vector reduction kinds. This
difference arises because CL and GL operations for floating-point min
and max do not have the same semantics when handling NaNs. Therefore, we
must enforce the desired semantics with additional ops.

~~However, since the code generation for floating-point min/max
operations shares the same functionality as extracting values for the
vector, we have decided to refactor the existing code using the CRTP
pattern.~~ This change does not alter the actual behavior of the code
and is necessary for future fixes to the codegen for floating-point
min/max operations.

---------

Co-authored-by: Daniil Dudkin <unterumarmung@yandex.ru>
2023-11-24 15:24:45 -05:00
Alexander Batashev
a4ee55fe6e [MLIR][NFC] Fix build on recent GCC with C++20 enabled (#73308)
The following pattern fails on recent GCC versions with the -std=c++20 flag
and succeeds with -std=c++17. Such behavior is not observed with
Clang 16.0.

```
template <typename T>
struct Foo {
    Foo<T>(int a) {}
};
```

This patch removes the template parameter from the constructor in two
occurrences to make the following command complete successfully:
bazel build -c fastbuild --cxxopt=-std=c++20 --host_cxxopt=-std=c++20
@llvm-project//mlir/...
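
For reference, a sketch of the spelling that builds under both standards:
```
template <typename T>
struct Foo {
    Foo(int a) {} // no template argument list on the constructor
};
```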

This patch is similar to https://reviews.llvm.org/D154782

Co-authored-by: Alexander Batashev <a.batashev@partner.samsung.com>
2023-11-24 15:13:39 +03:00
Kai Wang
3049c76e43 [mlir][vector][spirv] Lower vector.load and vector.store to SPIR-V (#71674)
Add patterns to lower vector.load to spirv.Load and vector.store to
spirv.Store.
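
Example of the source-side ops now handled (types and names assumed):
```mlir
%val = vector.load %buf[%idx] : memref<8xf32>, vector<4xf32>
vector.store %val, %buf[%idx] : memref<8xf32>, vector<4xf32>
```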
2023-11-24 10:29:43 +00:00
Oleksandr "Alex" Zinenko
43bc81d748 [mlir] fix LLVM type converter for structs (#73231)
The existing implementation of the LLVM type converter for LLVM structs
containing incompatible types attempted to change the identifier of
the struct in case of a name clash post-conversion (all identified structs
have different names post-conversion since one cannot change the body of
a struct once initialized). Beyond a trivial error of not updating the
renaming counter, this approach was broken for recursive structs, which
can't be made aware of the renaming and would use the pre-existing
struct with the clashing name instead.

For example, given

`!llvm.struct<"_Converted.foo", (struct<"_Converted.foo">, f32)>`

the following type

`!llvm.struct<"foo", (struct<"foo", index>)>`

would incorrectly convert to

```
!llvm.struct<"_Converted_1.foo",
             (struct<"_Converted.foo",
	             (struct<"_Converted.foo">, f32)>)>
```

Remove this incorrect renaming and simply refuse to convert types if it
would lead to identifier clashes for structs with different bodies.
Document the expectation that such generated names are reserved and must
not be present in the input IR of the converter. If we ever actually
need to handle such cases, this can be achieved by temporarily
renaming structs with reserved identifiers to an unreserved name and
back in a pre/post-processing pass that does _not_ use the type
conversion infra.
2023-11-23 22:21:39 +01:00
Benjamin Maxwell
dbb8643333 [mlir][LLVM] Support immargs in LLVM_IntrOpBase intrinsics (#73013)
This extends `LLVM_IntrOpBase` so that it can be passed a list of
`immArgPositions` and a list (of the same length) of `immArgAttrNames`.
`immArgPositions` contains the positions of `immargs` on the LLVM IR
intrinsic, and `immArgAttrNames` maps those to a corresponding MLIR
attribute.

This allows modeling LLVM `immargs` as MLIR attributes, which is the
closest match semantically (and had already been done manually for the
LLVM dialect intrinsics).

This has two upsides:
* It's slightly easier to implement intrinsics with immargs now
(especially if they make use of other features, such as overloads)
* It clearly defines that `immargs` should map to attributes; previously
there was no mention of `immargs` in LLVMOpBase.td, so implementing them
was unclear

This works with other features of the `LLVM_IntrOpBase`, so `immargs`
can be marked as overloaded too (which is used in some intrinsics).

As part of this patch (and to test correctness) existing intrinsics have
been updated to use these new parameters.

This also uncovered a few issues with the
`llvm.intr.vector.insert/extract` intrinsics. First, the argument order
for insert did not match the LLVM intrinsic, and second, both were
missing an mlirBuilder (so they failed to import from LLVM IR). This is
corrected with this patch (and a test case added).
2023-11-23 10:12:12 +00:00