clang-p2996

Author	SHA1	Message	Date
donald chen	889b67c9d3	[mlir] [memref] add more checks to the memref.reinterpret_cast (#112669 ) Operation memref.reinterpret_cast was accept input like: %out = memref.reinterpret_cast %in to offset: [%offset], sizes: [10], strides: [1] : memref<?xf32> to memref<10xf32> A problem arises: while lowering, the true offset of %out is %offset, but its data type indicates an offset of 0. Permitting this inconsistency can result in incorrect outcomes, as certain pass might erroneously extract the offset from the data type of %out. This patch fixes this by enforcing that the return value's data type aligns with the input parameter.	2024-10-26 08:07:51 +08:00
Artem Kroviakov	663e9cec9c	[Func][GPU] Use SymbolUserOpInterface in func::ConstantOp (#107748 ) This PR enables `func::ConstantOp` creation and usage for device functions inside GPU modules. The current main returns error for referencing device functions via `func::ConstantOp`, because during the `ConstantOp` verification it only checks symbols in `ModuleOp` symbol table, which, of course, does not contain device functions that are defined in `GPUModuleOp`. This PR proposes a more general solution. Co-authored-by: Artem Kroviakov <artem.kroviakov@tum.de>	2024-09-09 11:49:16 +02:00
Andrea Faulds	3d01f0a33b	[mlir][gpu] Add 'cluster_stride' attribute to gpu.subgroup_reduce (#107142 ) Follow-up to `7aa22f013e`, adding an additional attribute needed in some applications.	2024-09-05 09:03:22 -04:00
Fabian Mora	016e1eb9c8	[mlir][gpu] Add metadata attributes for storing kernel metadata in GPU objects (#95292 ) This patch adds the `#gpu.kernel_metadata` and `#gpu.kernel_table` attributes. The `#gpu.kernel_metadata` attribute allows storing metadata related to a compiled kernel, for example, the number of scalar registers used by the kernel. The attribute only has 2 required parameters, the name and function type. It also has 2 optional parameters, the arguments attributes and generic dictionary for storing all other metadata. The `#gpu.kernel_table` stores a table of `#gpu.kernel_metadata`, mapping the name of the kernel to the metadata. Finally, the function `ROCDL::getAMDHSAKernelsELFMetadata` was added to collect ELF metadata from a binary, and to test the class methods in both attributes. Example: ```mlir gpu.binary @binary [#gpu.object<#rocdl.target<chip = "gfx900">, kernels = #gpu.kernel_table<[ #gpu.kernel_metadata<"kernel0", (i32) -> (), metadata = {sgpr_count = 255}>, #gpu.kernel_metadata<"kernel1", (i32, f32) -> (), arg_attrs = [{llvm.read_only}, {}]> ]> , bin = "BLOB">] ``` The motivation behind these attributes is to provide useful information for things like tunning. --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>	2024-08-27 18:44:50 -04:00
Finlay	552d26e275	[mlir][gpu] Add extra value types for gpu::ShuffleOp (#104605 ) Expand the accepted types for gpu.shuffle to any integer, float or 1d vector of integers or floats. Also updated the gpu-to-llvm-spv pass to support those types.	2024-08-20 19:50:25 +01:00
Andrea Faulds	7aa22f013e	[mlir][gpu] Add 'cluster_size' attribute to gpu.subgroup_reduce (#104851 ) This enables performing several reductions in parallel, each smaller than the size of the subgroup. One potential application is flash attention with subgroup-wide matrix multiplication and reduction combined in one kernel. The multiplication operation requires a 2D matrix to be distributed over the lanes of the subgroup, which then constrains the shape the following reduction can have if we want to keep data in registers.	2024-08-20 13:37:03 -04:00
Andrea Faulds	569698443d	[mlir][gpu] Fix typo in test filename (#104053 ) The word "redule" doesn't appear anywhere else in the MLIR codebase and seems to be a typo of "reduce".	2024-08-15 08:11:31 -05:00
Krzysztof Drewniak	43fd4c49bd	[mlir][GPU] Improve handling of GPU bounds (#95166 ) This change reworks how range information for GPU dispatch IDs (block IDs, thread IDs, and so on) is handled. 1. `known_block_size` and `known_grid_size` become inherent attributes of GPU functions. This makes them less clunky to work with. As a consequence, the `gpu.func` lowering patterns now only look at the inherent attributes when setting target-specific attributes on the `llvm.func` that they lower to. 2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size` are made official dialect-level discardable attributes which can be placed on arbitrary functions. This allows for progressive lowerings (without this, a lowering for `gpu.thread_id` couldn't know about the bounds if it had already been moved from a `gpu.func` to an `llvm.func`) and allows for range information to be provided even when `gpu._{id,dim}` are being used outside of a `gpu.func` context. 3. All of these index operations have gained an optional `upper_bound` attribute, allowing for an alternate mode of operation where the bounds are specified locally and not inherited from the operation's context. These also allow handling of cases where the precise launch sizes aren't known, but can be bounded more precisely than the maximum of what any platform's API allows. (I'd like to thank @benvanik for pointing out that this could be useful.) When inferring bounds (either for range inference or for setting `range` during lowering) these sources of information are consulted in order of specificity (`upper_bound` > inherent attribute > discardable attribute, except that dimension sizes check for `known__bounds` to see if they can be constant-folded before checking their `upper_bound`). This patch also updates the documentation about the bounds and inference behavior to clarify what these attributes do when set and the consequences of setting them up incorrectly. --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>	2024-06-17 23:47:38 -05:00
Krzysztof Drewniak	472291111d	[mlir][Arith] Generalize and improve -int-range-optimizations (#94712 ) When the integer range analysis was first develop, a pass that did integer range-based constant folding was developed and used as a test pass. There was an intent to add such a folding to SCCP, but that hasn't happened. Meanwhile, -int-range-optimizations was added to the arith dialect's transformations. The cmpi simplification in that pass is a strict subset of the constant folding that lived in -test-int-range-inference. This commit moves the former test pass into -int-range-optimizaitons, subsuming its previous contents. It also adds an optimization from rocMLIR where `rem{s,u}i` operations that are noops are replaced by their left operands.	2024-06-10 09:56:33 -05:00
Guy David	4bce270157	[mlir][llvm] Implement ConstantLike for ZeroOp, UndefOp, PoisonOp (#93690 ) These act as constants and should be propagated whenever possible. It is safe to do so for mlir.undef and mlir.poison because they remain "dirty" through out their lifetime and can be duplicated, merged, etc. per the LangRef. Signed-off-by: Guy David <guy.david@nextsilicon.com>	2024-05-30 08:21:08 +02:00
tyb0807	8178a3ad1b	[mlir] Replace MLIR_ENABLE_CUDA_CONVERSIONS with LLVM_HAS_NVPTX_TARGET (#93008 ) LLVM_HAS_NVPTX_TARGET is automatically set depending on whether NVPTX was enabled when building LLVM. Use this instead of manually defining MLIR_ENABLE_CUDA_CONVERSIONS (whose name is a bit misleading btw).	2024-05-24 17:31:28 +02:00
klensy	f0b0c02504	[mlir][test] Fix filecheck annotation typos (#92897 ) Moved fixes for mlir from https://github.com/llvm/llvm-project/pull/91854, plus few additional in second commit. --------- Co-authored-by: klensy <nightouser@gmail.com>	2024-05-24 09:24:59 +02:00
Felix Schneider	acd100747f	[mlir][test] Extend `InferIntRangeInterface` test Ops to arbitrary ints (#91850 ) This PR is in preparation to some extensions to the `InferIntRangeInterface` around the `nsw` and `nuw` flags supported in the `arith` dialect and LLVM. We provide some common inference logic for `index` and `arith` in `InferIntRangeCommon.h` but our Test Ops are currently fixed to `Index` Types. As we test the range inference for arith Ops, especially around the overflow behaviour, it's handy to have native support for the typical integer types in the test Ops. This patch 1. Changes the Attributes of `test.with_bounds` ops from `Index` to `APInt` which matches the internal representation in `ConstantIntRanges`. 2. Allows the use of `AnyInteger` in addition to `Index` for the operands and results of the test Ops. This now requires explicit specification of the type in the IR, where before `Index` was implicit. 3. Requires bounds Attrs to be specified in the precision of the SSA value, eliminating any implicit truncation or extension. (Could this lead to problems?)	2024-05-14 20:33:16 +02:00
Mehdi Amini	d566a5cd22	[MLIR] Improve KernelOutlining to avoid introducing an extra block (#90359 ) This fixes a TODO in the code.	2024-04-29 19:30:38 +02:00
Alexander Belyaev	bfcd3fa825	[mlir] Add result name for gpu.block_id and gpu.thread_id ops. (#83393 ) expand-arith-ops.mlir fails on windows, but this is unrelated to this PR	2024-02-29 10:57:09 +01:00
Guray Ozen	d7f59c8fb8	[mlir] Lower math dialect later in gpu-lower-to-nvvm-pipeline (#81489 ) This PR moves lowering of math dialect later in the pipeline. Because math dialect is lowered correctly by createConvertGpuOpsToNVVMOps for GPU target, and it needs to run it first. Reland #78556	2024-02-13 08:31:42 +01:00
Benjamin Kramer	98dbc688de	Revert "[mlir] Lower math dialect later in gpu-lower-to-nvvm-pipeline (#78556 )" This reverts commit `74bf0b1cd9`. The test always fails. \| mlir/test/Dialect/GPU/test-nvvm-pipeline.mlir:23:16: error: CHECK-PTX: expected string not found in input \| // CHECK-PTX: __nv_expf https://lab.llvm.org/buildbot/#/builders/61/builds/53789	2024-01-31 17:41:21 +01:00
Guray Ozen	74bf0b1cd9	[mlir] Lower math dialect later in gpu-lower-to-nvvm-pipeline (#78556 ) This PR moves lowering of math dialect later in the pipeline. Because math dialect is lowered correctly by `createConvertGpuOpsToNVVMOps` for GPU target, and it needs to run it first.	2024-01-31 15:24:32 +01:00
Matthias Springer	ce7cc723b9	[mlir][memref] `memref.subview`: Verify result strides The `memref.subview` verifier currently checks result shape, element type, memory space and offset of the result type. However, the strides of the result type are currently not verified. This commit adds verification of result strides for non-rank reducing ops and fixes invalid IR in test cases. Verification of result strides for ops with rank reductions is more complex (and there could be multiple possible result types). That is left for a separate commit. Also refactor the implementation a bit: * If `computeMemRefRankReductionMask` could not compute the dropped dimensions, there must be something wrong with the op. Return `FailureOr` instead of `std::optional`. * `isRankReducedMemRefType` did much more than just checking whether the op has rank reductions or not. Inline the implementation into the verifier and add better comments. * `produceSubViewErrorMsg` does not have to be templatized. * Fix comment and add additional assert to `ExpandStridedMetadata.cpp`, to make sure that the memref.subview verifier is in sync with the memref.subview -> memref.reinterpret_cast lowering. Note: This change is identical to #79865, but with a fixed comment and an additional assert in `ExpandStridedMetadata.cpp`. (I reverted #79865 in #80116, but the implementation was actually correct, just the comment in `ExpandStridedMetadata.cpp` was confusing.)	2024-01-31 09:28:53 +00:00
Matthias Springer	96c907dbce	Revert "[mlir][memref] `memref.subview`: Verify result strides" (#80116 ) Reverts llvm/llvm-project#79865 I think there is a bug in the stride computation in `SubViewOp::inferResultType`. (Was already there before this change.) Reverting this commit for now and updating the original pull request with a fix and more test cases.	2024-01-31 09:35:13 +01:00
Matthias Springer	db49319264	[mlir][memref] `memref.subview`: Verify result strides (#79865 ) The `memref.subview` verifier currently checks result shape, element type, memory space and offset of the result type. However, the strides of the result type are currently not verified. This commit adds verification of result strides for non-rank reducing ops and fixes invalid IR in test cases. Verification of result strides for ops with rank reductions is more complex (and there could be multiple possible result types). That is left for a separate commit. Also refactor the implementation a bit: * If `computeMemRefRankReductionMask` could not compute the dropped dimensions, there must be something wrong with the op. Return `FailureOr` instead of `std::optional`. * `isRankReducedMemRefType` did much more than just checking whether the op has rank reductions or not. Inline the implementation into the verifier and add better comments. * `produceSubViewErrorMsg` does not have to be templatized.	2024-01-31 09:14:48 +01:00
Matthias Springer	fbb62d449c	[mlir][bufferization] Buffer deallocation: Make op preconditions stricter (#75127 ) The buffer deallocation pass checks the IR ("operation preconditions") to make sure that there is no IR that is unsupported. In such a case, the pass signals a failure. The pass now rejects all ops with unknown memory effects. We do not know whether such an op allocates memory or not. Therefore, the buffer deallocation pass does not know whether a deallocation op should be inserted or not. Memory effects are queried from the `MemoryEffectOpInterface` interface. Ops that do not implement this interface but have the `RecursiveMemoryEffects` trait do not have any side effects (apart from the ones that their nested ops may have). Unregistered ops are now rejected by the pass because they do not implement the `MemoryEffectOpInterface` and neither do we know if they have `RecursiveMemoryEffects` or not. All test cases that currently have unregistered ops are updated to use registered ops.	2024-01-21 11:10:09 +01:00
Fabian Mora	5b4f2b906b	[mlir][gpu] Add an offloading handler attribute to `gpu.module` (#78047 ) This patch adds an optional offloading handler attribute to the`gpu.module` op. This attribute will be used during `gpu-module-to-binary` pass to override the offloading handler used in the `gpu.binary` op.	2024-01-15 16:58:10 -05:00
Fabian Mora	a1eaed7a21	[mlir][gpu] Fix GPU YieldOP format and traits (#78006 ) This patch adds assembly format to `gpu::YieldOp`. It also adds the return like trait, to make it compatible with `RegionBranchOpInterface`.	2024-01-14 21:19:20 -05:00
Guray Ozen	5b33cff397	[mlir][gpu] Add Support for Cluster of Thread Blocks in `gpu.launch` (#76924 )	2024-01-06 11:17:01 +01:00
Andrzej Warzyński	ca5d34ec71	[mlir][TD] Fix the order of return handles (#76929 ) Replace (in tests and docs): %forall, %tiled = transform.structured.tile_using_forall with (updated order of return handles): %tiled, %forall = transform.structured.tile_using_forall Similar change is applied to (in the TD tutorial): transform.structured.fuse_into_containing_op This update makes sure that the tests/documentation are consistent with the Op specifications. Follow-up for #67320 which updated the order of the return handles for `tile_using_forall`.	2024-01-04 12:54:16 +00:00
Jakub Kuderski	c0345b4648	[mlir][gpu] Add subgroup_reduce to shuffle lowering (#76530 ) This supports both the scalar and the vector multi-reduction cases.	2024-01-02 16:14:22 -05:00
Jakub Kuderski	2af186f9bd	[mlir][gpu] Add patterns to break down subgroup reduce (#76271 ) The new patterns break down subgroup reduce ops with vector values into a sequence of subgroup reductions that fit the native shuffle size. The maximum/native shuffle size is parametrized. The overall goal is to be able to perform multi-element reductions with a sequence of `gpu.shuffle` ops.	2023-12-28 14:39:46 -05:00
Jakub Kuderski	72003adf6b	[mlir][gpu] Allow subgroup reductions over 1-d vector types (#76015 ) Each vector element is reduced independently, which is a form of multi-reduction. The plan is to allow for gradual lowering of multi-reduction that results in fewer `gpu.shuffle` ops at the end: 1d `vector.multi_reduction` --> 1d `gpu.subgroup_reduce` --> smaller 1d `gpu.subgroup_reduce` --> packed `gpu.shuffle` over i32 For example we can perform 2 independent f16 reductions with a series of `gpu.shuffles` over i32, reducing the final number of `gpu.shuffles` by 2x.	2023-12-21 11:55:43 -05:00
Matthias Springer	f7096428b4	[mlir][GPU] Add `RecursiveMemoryEffects` to `gpu.launch` (#75315 ) Infer the side effects of `gpu.launch` from its body.	2023-12-20 15:25:25 +09:00
Jakub Kuderski	560564f51c	[mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901 ) This is to avoid confusion when dealing with reduction/combining kinds. For example, see a recent PR comment: https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175. Previously, they were picked to mostly mirror the names of the llvm vector reduction intrinsics: https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In isolation, it was not clear if `<maxf>` has `arith.maxnumf` or `arith.maximumf` semantics. The new reduction kind names map 1:1 to arith ops, which makes it easier to tell/look up their semantics. Because both the vector and the gpu dialect depend on the arith dialect, it's more natural to align names with those in arith than with the lowering to llvm intrinsics. Issue: https://github.com/llvm/llvm-project/issues/72354	2023-12-20 00:14:43 -05:00
Fabian Mora	419c45a325	[mlir][gpu] Fix crash in `gpu-module-to-binary` (#75477 ) This patch fixes the error in issue #75434. The crash was being caused by not checking for a lack of target attributes in a GPU module. It's now considered an error to invoke the pass with a GPU module with no target attributes.	2023-12-14 14:03:10 -05:00
Jakub Kuderski	7eccd52842	Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423 )" This reverts commit `dd09221a29` and relands https://github.com/llvm/llvm-project/pull/73423. * Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.	2023-11-27 11:38:18 -05:00
Jakub Kuderski	dd09221a29	Revert "[mlir][gpu] Align reduction operations with vector combining kinds (#73423 )" This reverts commit `e0aac8c88d`. I'm seeing some nvidia integration test failures: https://lab.llvm.org/buildbot/#/builders/61/builds/52334.	2023-11-27 11:29:23 -05:00
Jakub Kuderski	e0aac8c88d	[mlir][gpu] Align reduction operations with vector combining kinds (#73423 ) The motivation for this change is explained in https://github.com/llvm/llvm-project/issues/72354. Before this change, we could not tell between signed/unsigned minimum/maximum and NaN treatment for floating point values. The mapping of old reduction operations to the new ones is as follows: * `min` --> `minsi` for ints, `minf` for floats * `max` --> `maxsi` for ints, `maxf` for floats New reduction kinds not represented in the old enum: `minui`, `maxui`, `minimumf`, `maximumf`. As a next step, I would like to have a common definition of combining kinds used by the `vector` and `gpu` dialects. Separately, the GPU to SPIR-V lowering does not yet properly handle zero and NaN values -- the behavior of floating point min/max group reductions is not specified by the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459. Issue: https://github.com/llvm/llvm-project/issues/72354	2023-11-27 11:19:20 -05:00
Guray Ozen	edf5cae739	[mlir][gpu] Support Cluster of Thread Blocks in `gpu.launch_func` (#72871 ) NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA). It is a new level of parallelism, allowing clustering of Cooperative Thread Arrays (CTA) to synchronize and communicate through shared memory while running concurrently. This PR enables support for CGA within the `gpu.launch_func` in the GPU dialect. It extends `gpu.launch_func` to accommodate this functionality. The GPU dialect remains architecture-agnostic, so we've added CGA functionality as optional parameters. We want to leverage mechanisms that we have in the GPU dialects such as outlining and kernel launching, making it a practical and convenient choice. An example of this implementation can be seen below: ``` gpu.launch_func @kernel_module::@kernel clusters in (%1, %0, %0) // <-- Optional blocks in (%0, %0, %0) threads in (%0, %0, %0) ``` The PR also introduces index and dimensions Ops specific to clusters, binding them to NVVM Ops: ``` %cidX = gpu.cluster_id x %cidY = gpu.cluster_id y %cidZ = gpu.cluster_id z %cdimX = gpu.cluster_dim x %cdimY = gpu.cluster_dim y %cdimZ = gpu.cluster_dim z ``` We will introduce cluster support in `gpu.launch` Op in an upcoming PR. See [the documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays) provided by NVIDIA for details.	2023-11-27 11:05:07 +01:00
Guray Ozen	ea84897ba3	[mlir][gpu] Introduce `gpu.dynamic_shared_memory` Op (#71546 ) While the `gpu.launch` Op allows setting the size via the `dynamic_shared_memory_size` argument, accessing the dynamic shared memory is very convoluted. This PR implements the proposed Op, `gpu.dynamic_shared_memory` that aims to simplify the utilization of dynamic shared memory. RFC: https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/ Proposal from RFC This PR `gpu.dynamic.shared.memory` Op to use dynamic shared memory feature efficiently. It is is a powerful feature that enables the allocation of shared memory at runtime with the kernel launch on the host. Afterwards, the memory can be accessed directly from the device. I believe similar story exists for AMDGPU. Current way Using Dynamic Shared Memory with MLIR Let me illustrate the challenges of using dynamic shared memory in MLIR with an example below. The process involves several steps: - memref.global 0-sized array LLVM's NVPTX backend expects - dynamic_shared_memory_size Set the size of dynamic shared memory - memref.get_global Access the global symbol - reinterpret_cast and subview Many OPs for pointer arithmetic ``` // Step 1. Create 0-sized global symbol. Manually set the alignment memref.global "private" @dynamicShmem : memref<0xf16, 3> { alignment = 16 } func.func @main() { // Step 2. Allocate shared memory gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 { // Step 3. Access the global object %shmem = memref.get_global @dynamicShmem : memref<0xf16, 3> // Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations. %4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128], strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3> %5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> %6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> %7 = memref.subview %6[0, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3> %8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3> // Step.5 Use "test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index) "test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index) gpu.terminator } ``` Let’s write the program above with that: ``` func.func @main() { gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 { %i = arith.constant 18 : index // Step 1: Obtain shared memory directly %shmem = gpu.dynamic_shared_memory : memref<?xi8, 3> %c147456 = arith.constant 147456 : index %c155648 = arith.constant 155648 : index %7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3> %8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3> // Step 2: Utilize the shared memory "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index) "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index) } } ``` This PR resolves #72513	2023-11-16 14:42:17 +01:00
drazi	9a3d3c7093	generalize pass gpu-kernel-outlining for symbol op (#72074 ) This PR generalize gpu-out-lining pass to take care of ops `SymbolOpInterface` instead of just `func::FuncOp`. Before this change, gpu-out-lining pass will skip `llvm.func`. ```mlir module { llvm.func @main() { %c1 = arith.constant 1 : index gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) { gpu.terminator } llvm.return } } ``` After this change, gpu-out-lining pass can handle llvm.func as well.	2023-11-12 21:48:49 -08:00
spaceotter	00c3c73189	[mlir][gpu] Separate the barrier elimination code from transform ops (#71762 ) Allows the barrier elimination code to be run from C++ as well. The code from transforms dialect is copied as-is, the pass and populate functions have beed added at the end. Co-authored-by: Eric Eaton <eric@nod-labs.com>	2023-11-10 17:59:09 -08:00
spaceotter	51af040b22	[mlir][gpu] Eliminate redundant gpu.barrier ops (#71575 ) Adds a canonicalizer for gpu.barrier that gets rid of duplicates. Co-authored-by: Eric Eaton <eric@nod-labs.com>	2023-11-09 18:06:20 -05:00
Fabian Mora	42630689e2	[mlir][gpu] Clean GPU `Passes.h` from external SPIRV includes (#71331 ) Removes the `SPIRVAttributes.h` header from `GPU/Transforms/Passes.h`	2023-11-05 17:06:04 -08:00
Sang Ik Lee	2dace04521	[mlir][spirv] Implement gpu::TargetAttrInterface (#69949 ) This commit implements gpu::TargetAttrInterface for SPIR-V target attribute. The plan is to use this to enable GPU compilation pipeline for OpenCL kernels later. The changes do not impact Vulkan shaders using milr-vulkan-runner. New GPU Dialect transform pass spirv-attach-target is implemented for attaching attribute from CLI. gpu-module-to-binary pass now works with GPU module that has SPIR-V module with OpenCL kernel functions inside.	2023-11-05 08:11:53 -08:00
Christian Ulmann	7ed96b1c0d	[MLIR][LLVM] Remove last typed pointer remnants from tests (#71232 ) This commit removes all LLVM dialect typed pointers from the lit tests. Typed pointers have been deprecated for a while now and it's planned to soon remove them from the LLVM dialect. Related PSA: https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502	2023-11-04 14:13:31 +01:00
Oleksandr "Alex" Zinenko	e4384149b5	[mlir] use transform-interpreter in test passes (#70040 ) Update most test passes to use the transform-interpreter pass instead of the test-transform-dialect-interpreter-pass. The new "main" interpreter pass has a named entry point instead of looking up the top-level op with `PossibleTopLevelOpTrait`, which is arguably a more understandable interface. The change is mechanical, rewriting an unnamed sequence into a named one and wrapping the transform IR in to a module when necessary. Add an option to the transform-interpreter pass to target a tagged payload op instead of the root anchor op, which is also useful for repro generation. Only the test in the transform dialect proper and the examples have not been updated yet. These will be updated separately after a more careful consideration of testing coverage of the transform interpreter logic.	2023-10-24 16:12:34 +02:00
Aviad Cohen	7060422265	[mlir][Linalg]: Optimize linalg generic in transform::PromoteOp to avoid unnecessary copies (#68555 ) If the operands are not used in the payload of linalg generic operations, there is no need to copy them before the operation.	2023-10-14 10:40:45 +03:00
Aart Bik	39038177ee	[mlir][sparse][gpu] add CSC and BSR format to cuSparse GPU ops (#67509 ) This adds two cuSparse formats to the GPU dialect support. Together with proper lowering and runtime cuda support. Also fixes a few minor omissions.	2023-09-27 09:32:25 -07:00
Oleksandr "Alex" Zinenko	96ff0255f2	[mlir] cleanup of structured.tile* transform ops (#67320 ) Rename and restructure tiling-related transform ops from the structured extension to be more homogeneous. In particular, all ops now follow a consistent naming scheme: - `transform.structured.tile_using_for`; - `transform.structured.tile_using_forall`; - `transform.structured.tile_reduction_using_for`; - `transform.structured.tile_reduction_using_forall`. This drops the "_op" naming artifact from `tile_to_forall_op` that shouldn't have been included in the first place, consistently specifies the name of the control flow op to be produced for loops (instead of `tile_reduction_using_scf` since `scf.forall` also belongs to `scf`), and opts for the `using` connector to avoid ambiguity. The loops produced by tiling are now systematically placed as trailing results of the transform op. While this required changing 3 out of 4 ops (except for `tile_using_for`), this is the only choice that makes sense when producing multiple `scf.for` ops that can be associated with a variadic number of handles. This choice is also most consistent with other transform ops from the structured extension, in particular with fusion ops, that produce the structured op as the leading result and the loop as the trailing result.	2023-09-26 09:14:29 +02:00
Tobias Gysi	85175edd4e	[mlir][llvm] Replace NullOp by ZeroOp (#67183 ) This revision replaces the LLVM dialect NullOp by the recently introduced ZeroOp. The ZeroOp is more generic in the sense that it represents zero values of any LLVM type rather than null pointers only. This is a follow to https://github.com/llvm/llvm-project/pull/65508	2023-09-25 11:11:52 +02:00
Martin Erhart	522c1d0eea	[mlir][gpu][bufferization] Implement BufferDeallocationOpInterface for gpu.terminator (#66880 ) This is necessary to support deallocation of IR with gpu.launch operations because it does not implement the RegionBranchOpInterface. Implementing the interface would require it to support regions with unstructured control flow and produced arguments/results.	2023-09-20 12:28:28 +02:00
Fabian Mora	5093413a50	[mlir][gpu][NVPTX] Enable NVIDIA GPU JIT compilation path (#66220 ) This patch adds an NVPTX compilation path that enables JIT compilation on NVIDIA targets. The following modifications were performed: 1. Adding a format field to the GPU object attribute, allowing the translation attribute to use the correct runtime function to load the module. Likewise, a dictionary attribute was added to add any possible extra options. 2. Adding the `createObject` method to `GPUTargetAttrInterface`; this method returns a GPU object from a binary string. 3. Adding the function `mgpuModuleLoadJIT`, which is only available for NVIDIA GPUs, as there is no equivalent for AMD. 4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify the format to use during testing.	2023-09-14 18:00:27 -04:00

1 2 3 4 5 ...

252 Commits