Currently, the `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` has
type `index`. This can cause problems if any value other than 0 or 1 is
passed, because the PTX instruction only accepts an even or odd phase parity.
This PR changes the `phaseParity` argument to `i1` to avoid misuse.
Here is the relevant information from the PTX documentation:
```
The .parity variant of the instructions test for the completion of the phase indicated
by the operand phaseParity, which is the integer parity of either the current phase or
the immediately preceding phase of the mbarrier object. An even phase has integer
parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity
operand are 0 and 1.
```
For more information, see:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait
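After this change, callers pass an `i1` value. A minimal sketch; the assembly format shown follows current NVGPU tests and the surrounding SSA values are illustrative:
```
%phase = arith.constant false  // parity 0 selects the even phase
%ticks = arith.constant 10000000 : index
nvgpu.mbarrier.try_wait.parity %barrier[%c0], %phase, %ticks
  : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
```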
Allow TMA's last dimension to be other than 128 bytes when the swizzling
mode is not set.
The test `tma_load_64x8_8x128_noswizzle.mlir` was failing due to the
verifier; this PR fixes that.
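For instance, a descriptor like the following (parameters illustrative) is now accepted; its innermost dimension is 8 x f32 = 32 bytes, with no swizzling:
```
!nvgpu.tensormap.descriptor<tensor = memref<64x8xf32, 3>,
  swizzle = none, l2promo = none, oob = zero, interleave = none>
```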
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
This PR adds the `nvgpu.tma.async.store` op for asynchronous stores using the
Tensor Memory Access (TMA) unit.
It also implements lowering of the op to the NVVM dialect. The op currently
performs asynchronous stores of a tiled memory region from shared to
global memory for a single CTA.
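A usage sketch; the assembly format and descriptor parameters are assumptions based on the op description:
```
nvgpu.tma.async.store %src to %tensorMap[%c0, %c0]
  : memref<128x128xf16, 3>
  -> !nvgpu.tensormap.descriptor<tensor = memref<128x128xf16, 3>,
       swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
```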
Rename interface functions as follows:
* `hasTensorSemantics` -> `hasPureTensorSemantics`
* `hasBufferSemantics` -> `hasPureBufferSemantics`
These two functions return "true" if the op has tensor (respectively, buffer)
operands and no buffer (respectively, tensor) operands.
Also drop the "ranked" part from the interface, i.e., do not distinguish
between ranked/unranked types.
The new function names describe the functions more accurately. They also
align their semantics with the notion of "tensor semantics" in the
bufferization framework. (An op is supposed to be bufferized if it has
tensor operands, and we don't care whether it also has memref operands.)
This change is in preparation of #75273, which adds
`BufferizableOpInterface::hasTensorSemantics`. By renaming the functions
in the `DestinationStyleOpInterface`, we can avoid name clashes between
the two interfaces.
This PR improves the functionality of the `nvgpu.tma.async.load` op by
adding support for multicast. While we already had this capability in
the lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM
op, this PR plumbs the multicast mask information through to the NVVM
operation.
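A sketch of the op with a multicast mask; the `multicast_mask` syntax and descriptor parameters are assumptions:
```
%mask = arith.constant 15 : i16
nvgpu.tma.async.load %tensorMap[%c0, %c0], %barriers[%c0] to %dst
  multicast_mask = %mask
  : !nvgpu.tensormap.descriptor<tensor = memref<64x128xf16, 3>,
      swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>,
    !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
    -> memref<64x128xf16, 3>
```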
The GPU dialect has `#gpu.address_space<workgroup>` for the shared memory of
NVGPU (address space 3). However, when IR combines the NVGPU and GPU
dialects, the `nvgpu-to-nvvm` pass fails due to a missing attribute
conversion. This PR adds `populateGpuMemorySpaceAttributeConversions` to the
nvgpu-to-nvvm lowering, so we can use `#gpu.address_space<workgroup>` with the
`nvgpu-to-nvvm` pass.
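For example, a function like the following (illustrative shapes and element count) mixes both dialects and now lowers through `nvgpu-to-nvvm`:
```
func.func @copy(%src: memref<128x128xf16>,
                %dst: memref<128x128xf16, #gpu.address_space<workgroup>>) {
  %c0 = arith.constant 0 : index
  %token = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0], 8
    : memref<128x128xf16> to memref<128x128xf16, #gpu.address_space<workgroup>>
  return
}
```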
This PR improves and cleans up the verifiers of the `TmaCreateDescriptor` and
`TmaAsyncLoad` ops and unifies them.
The verifiers now check the following, which they did not before:
- address space
- rank match between descriptor and memref
- element type match between descriptor and memref
- shape match between descriptor and memref
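For example, IR like the following sketch (descriptor parameters illustrative) is now rejected because the descriptor rank does not match the destination memref rank:
```
// expected error: rank mismatch between descriptor and memref
nvgpu.tma.async.load %desc[%c0, %c0], %barrier[%c0] to %dst
  : !nvgpu.tensormap.descriptor<tensor = memref<64x128xf16, 3>,
      swizzle = none, l2promo = none, oob = zero, interleave = none>,
    !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
    -> memref<128xf16, 3>
```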
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that holds the accumulator matrix used by warp-group level
matrix multiplication. It is handy to have a special type for that, as
the matrix is distributed among the threads of the warp-group. However,
the current transformations require creating and using multiple
`WarpgroupAccumulator`s if the GEMM shape is larger than the shape
supported by a single `wgmma.mma_async` instruction. This makes the IR
look dense.
This PR improves the handling of the `WarpgroupAccumulator` type in
every NVGPU op that uses it.
**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
into memref<128x128xf32,3>
```
**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store %r1 to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
into memref<128x128xf32,3>
```
This op generates and initializes the accumulator matrix for the
`nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate
(MMA).
Its associated transformation generates an `!llvm.struct<>` and fills it
with the initial values. The size of the struct equals the number of in/out
registers required by the `nvgpu.warpgroup.mma` op.
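A minimal sketch, mirroring the accumulator types used in the examples above:
```
%acc = nvgpu.warpgroup.mma.init.accumulator ->
  !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
```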
This PR introduces a new op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
the fragmented result(s) of `nvgpu.warpgroup.accumulator` type produced
by `warpgroup.mma` to the given memref.
An example of a fragmented matrix is given here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
The `warpgroup.mma.store` op does the following:
1) Takes one or more values of `nvgpu.warpgroup.accumulator` type
(fragmented result matrices).
2) Calculates the indices per thread in the warp-group and stores the data
into the given memref.
Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...
// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD :
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
to memref<128x128xf32,3>
```
The NVGPU dialect is gaining broad support for warpgroup-level operations,
and their names always start with `warpgroup...`.
This PR renames the op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for the sake of consistency.
A common practice involves the creation of multiple `mbarrier` objects;
see the example below. This is particularly valuable in scenarios like
software pipelining for GEMM, where we need to generate multiple
barriers dynamically, then use and wait on them in a loop.
This PR generalizes the `nvgpu.mbarrier.barrier` type into
`nvgpu.mbarrier.group`. All `mbarrier`-related ops now use this type.
Consequently, these ops are now capable of managing multiple barriers
seamlessly.
Having a static `num_barriers = 4` lets us place the mbarrier object(s) in
static shared memory. We could make the value dynamic, but that would
require dynamic shared memory and complicate the codegen.
```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
...
scf.for %i = %c0 to %n step %c1 {
nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ...
// ... Do work once mbarrier is ready
nvgpu.mbarrier.arrive.expect_tx %barriers[ ((i + 3) % 4) ] ...
}
```
This results in mbarrier usage like the following:
```
expect_tx[0]
expect_tx[1]
expect_tx[2]
Loop:
try_wait mbarrier[0], expect_tx[3]
try_wait mbarrier[1], expect_tx[0]
try_wait mbarrier[2], expect_tx[1]
try_wait mbarrier[3], expect_tx[2]
...
```
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.
Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in the NVVM dialect. Using it
requires multiple instances of `nvvm.wgmma.mma_async` to achieve the
desired shape. The new `nvgpu.warpgroup.mma` operation abstracts this
complexity and provides a higher-level interface for performing
warpgroup-level matrix operations.
The `nvgpu.warpgroup.mma` op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates over the input matrix descriptors to achieve the desired
computation shape.
3) Groups and runs `wgmma` instructions asynchronously, and eventually
waits on them. This is done by `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Produces fragmented result matrices.
Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}:
!nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
```
The op generates the following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA, %descB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+2, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+4, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+6, %descB+384, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+384, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```
The op keeps:
- the first 64 registers (`{%f1, %f2, 62 more registers}`) -> `%acc1`
- the second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`
This change removes the requirement that the row stride be statically known when
converting `vector.transfer_read` and `vector.transfer_write` to distributed
SIMT operations in the `nvgpu` lowering path. It also adds a check to verify
that the last dimension of the source memref is statically known to have stride
1 since this is assumed in the conversion logic. No other change should be
required since the generated `vector.load` operations are never created across
dimensions other than the last. The routines for checking preconditions on
`vector.transfer_read/write` are moved under the NVGPU utilities.
The change is NFC with respect to the GPU dialect lowering path.
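For example, a read like the following sketch, whose row stride is dynamic but whose innermost stride is statically 1, can now be distributed (types illustrative):
```
%v = vector.transfer_read %src[%i, %j], %pad {in_bounds = [true]}
  : memref<?x?xf16, strided<[?, 1]>>, vector<8xf16>
```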
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D155753
This work introduces a new Op, `wgmma.generate.descriptor`, designed to create a wgmma descriptor for inputs of matrix multiply and accumulate operations using `wgmma.mma_async` PTX instruction.
The descriptor format specifications can be found in the following link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor
It's important to note that this op is in its initial phase, and it does come with certain limitations. It only supports 128b swizzling and does not incorporate interleaving. In the future, different calculations will be addressed in separate works, expanding the capabilities of the op.
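A usage sketch; the operand list, assembly format, and descriptor parameters are assumptions:
```
%descA = nvgpu.wgmma.generate.descriptor %shmemBuffer, %tensorMap
  : memref<128x64xf16, 3>,
    !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>,
      swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
  -> !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>
```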
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D157382
Support IR that is generated by the vector-to-scf lowering of N-D vector transfers with a mask. (Until now only 1-D and 2-D transfers were supported.) Only transfers that were fully unrolled are supported.
Differential Revision: https://reviews.llvm.org/D157286
This revision adds support for directly lowering a `linalg.copy` on buffers between global and shared memory to a TMA async load plus synchronization operations.
This uses the recently introduced Hopper NVVM and NVGPU abstractions to connect things end to end.
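The kind of input the lowering targets, sketched with illustrative shapes and memory spaces:
```
linalg.copy ins(%global : memref<64x128xf16>)
           outs(%shared : memref<64x128xf16, 3>)
```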
Differential Revision: https://reviews.llvm.org/D157087
Support IR that is generated by the vector-to-scf lowering of 2D vector transfers with a mask. Only 2D transfers that were fully unrolled are supported at the moment.
Differential Revision: https://reviews.llvm.org/D156695
The op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by the Tensor Memory Access (TMA) unit. The `tensor` is the source tensor to be tiled. The `boxDimensions` argument is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information, see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
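A usage sketch; the `box` syntax and descriptor parameters are assumptions:
```
%c64 = arith.constant 64 : index
%c128 = arith.constant 128 : index
%desc = nvgpu.tma.create.descriptor %src box[%c128, %c64]
  : memref<*xf16>
  -> !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>,
       swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
```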
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
This work adds the `nvgpu.tma.async.load` op, which requests a TMA load asynchronously using an mbarrier object.
It also creates the `nvgpu.tma.descriptor` type. The type is supposed to be created by the `cuTensorMapEncodeTiled` CUDA Driver API.
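A usage sketch; the assembly format and type parameters are assumptions (at this stage the barrier is the single `mbarrier.barrier` type):
```
nvgpu.tma.async.load %desc[%c0, %c0], %barrier to %dst
  : !nvgpu.tensormap.descriptor<tensor = memref<64x128xf16, 3>,
      swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>,
    !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
    -> memref<64x128xf16, 3>
```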
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155453
This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies.
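A before/after sketch of the rewrite, with illustrative operands and shapes:
```
// before: a synchronous global -> shared transfer
%v = vector.transfer_read %global[%c0, %c0], %pad {in_bounds = [true]}
  : memref<128x128xf32>, vector<4xf32>
vector.transfer_write %v, %shared[%c0, %c0] {in_bounds = [true]}
  : vector<4xf32>, memref<128x128xf32, 3>

// after: an asynchronous device copy plus synchronization
%token = nvgpu.device_async_copy %global[%c0, %c0], %shared[%c0, %c0], 4
  : memref<128x128xf32> to memref<128x128xf32, 3>
%group = nvgpu.device_async_create_group %token
nvgpu.device_async_wait %group
```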
Differential Revision: https://reviews.llvm.org/D155569
These two headers both contained a strange mix of definitions related to
both patterns and non-pattern transforms. Put patterns and "populate"
functions into Patterns.h and standalone transforms into Transforms.h.
Depends On: D155223
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155454
This work improves the verifier for invalid cases. It is NFC.
Reviewed By: nicolasvasilache, springerm
Differential Revision: https://reviews.llvm.org/D155448
Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an `scf.for` loop with either
`vector.transfer` or `nvgpu.device_async_copy` operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.
This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.
This is loosely inspired by the version in IREE, but has fewer unsafe
assumptions and a more principled way of communicating decisions.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155223
* Move passes to `Transforms` directory.
* Add `Utils.h` (will be utilized in a subsequent change).
Differential Revision: https://reviews.llvm.org/D155427
`mbarrier` is a barrier created in shared memory that supports flavors of thread synchronization other than `__syncthreads`; for more information, see below.
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier
This work adds initial Ops wrt `mbarrier` to nvgpu dialect.
First, it introduces two types:
* `mbarrier.barrier`: a barrier object in shared memory
* `mbarrier.barrier.token`: the associated token
It introduces the following ops:
* `mbarrier.create` creates an `mbarrier.barrier`
* `mbarrier.init` initializes an `mbarrier.barrier`
* `mbarrier.arrive` performs arrive-on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`
* `mbarrier.arrive.nocomplete` performs arrive-on (non-blocking) an `mbarrier.barrier` and returns an `mbarrier.barrier.token`
* `mbarrier.test_wait` waits on an `mbarrier.barrier` and an `mbarrier.barrier.token`
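A lifecycle sketch using these ops; the assembly format and type parameters are assumptions based on the op list above:
```
%barrier = nvgpu.mbarrier.create
  -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.init %barrier, %num_threads
  : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
%token = nvgpu.mbarrier.arrive %barrier
  : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
  -> !nvgpu.mbarrier.barrier.token
%done = nvgpu.mbarrier.test_wait %barrier, %token
  : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>,
    !nvgpu.mbarrier.barrier.token
```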
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D154090
`nvgpu.device_async_copy` is lowered into the `cp.async` PTX instruction. However, the NVPTX backend does not support all of its modes, especially when zero padding is needed. Therefore, the current MLIR implementation generates inline assembly for that.
This work simplifies PTX generation for `nvgpu.device_async_copy` and implements it via the `NVVMToLLVM` pass.
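A sketch of the zero-padding mode mentioned above; the optional source-elements operand and assembly format are assumptions:
```
// copies %srcElements valid elements and zero-fills the remaining
// destination elements
%token = nvgpu.device_async_copy %src[%c0], %dst[%c0], 4, %srcElements
  : memref<128xf32> to memref<128xf32, 3>
```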
Depends on D154060
Reviewed By: nicolasvasilache, manishucsd
Differential Revision: https://reviews.llvm.org/D154345
In `TestTensorTransforms.cpp`, `replaced` was nullptr; I assumed the intent
was to emit the error for the `rootOp`.
In `TransformInterfaces.cpp` there were some uninitialized variables.
In `NVGPUTransformOps.cpp` `matmulOp` was never used.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D154439
This PR adds support for the m16n8k16 f16 case.
At this point, the support is mostly mechanical and could be TableGen'd for all cases.
Until then, this can be populated as needed on a case-by-case basis.
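The target op and its per-thread vector shapes for this case, sketched (shapes follow the PTX m16n8k16 f16 layout; assembly format from NVGPU tests):
```
%d = nvgpu.mma.sync (%a, %b, %c) {mmaShape = [16, 8, 16]}
  : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
```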
Depends on: D153420
Differential Revision: https://reviews.llvm.org/D153428
Mapping to NVGPU operations such as `mma.sync` with mixed precision, and to
`ldmatrix` with transposes and various data types, involves complex matching
from low-level IR.
This is akin to raising complex patterns after unnecessarily having lost structural information.
To avoid such unnecessary complexity, introduce a direct mapping step from a matmul on memrefs
to distributed NVGPU vector abstractions.
In this context, mapping to specific `mma.sync` operations is trivial and consists
simply of translating the documentation into indexing expressions.
Correctness is demonstrated with an end-to-end integration test.
Differential Revision: https://reviews.llvm.org/D153420