Commit Graph

68 Commits

Author SHA1 Message Date
Matthias Springer
206fad0e21 [mlir][NFC] Mark type converter in populate... functions as const (#111250)
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.

Patterns already take a `const` type converter. However, some
`populate...` functions not only add new patterns but also register
additional type conversion rules. That makes it difficult to find the
place in the code base where a type conversion was added. With this
change, all `populate...` functions that only populate patterns now take
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.

Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
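
As an illustration, a minimal sketch of the resulting signature style (the function and pattern names below are hypothetical, not taken from this commit):

```
#include "mlir/Conversion/LLVMCommon/TypeConverter.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Populates conversion patterns only. Because the converter is passed as
// const, readers can tell from the signature that no new type conversion
// rules are registered here.
void populateFooToLLVMConversionPatterns(const LLVMTypeConverter &converter,
                                         RewritePatternSet &patterns) {
  // patterns.add<FooOpLowering>(converter); // patterns accept a const converter
}
```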
2024-10-05 21:32:40 +02:00
Youngsuk Kim
123e8c735d [mlir] Don't call llvm::raw_string_ostream::flush() (NFC)
Don't call raw_string_ostream::flush(), which is essentially a no-op.
As specified in the docs, raw_string_ostream is always unbuffered.
(See 65b13610a5 for further reference.)
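
A minimal sketch of the resulting style (illustrative code, not taken from the patch):

```
#include <string>
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/raw_ostream.h"

std::string formatGreeting(llvm::StringRef name) {
  std::string buffer;
  llvm::raw_string_ostream os(buffer);
  os << "hello, " << name;
  // No os.flush() needed: raw_string_ostream is always unbuffered, so
  // `buffer` already contains everything written to the stream.
  return buffer;
}
```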
2024-09-22 15:37:34 -05:00
Observer007
2b23e6c8d6 [mlir][nvgpu] Add nvgpu.rcp OP (#100965)
This PR introduces a new Op for reciprocal calculation on `vector`
types using `nvvm.rcp` Ops. Currently, it supports only f32 types.

---------

Co-authored-by: jingzec <jingzec@nvidia.com>
2024-07-30 09:20:49 +02:00
Christian Sigg
a5757c5b65 Switch member calls to isa/dyn_cast/cast/... to free function calls. (#89356)
This change cleans up call sites. The next step is to mark the member
functions as deprecated.

See https://mlir.llvm.org/deprecation and
https://discourse.llvm.org/t/preferred-casting-style-going-forward.
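
A small before/after sketch of the call-site change (type and value names are illustrative):

```
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Value.h"

using namespace mlir;

bool hasStaticMemRefShape(Value value) {
  // Before: value.getType().dyn_cast<MemRefType>()
  // After: the free-function form.
  if (auto memrefType = dyn_cast<MemRefType>(value.getType()))
    return memrefType.hasStaticShape();
  return false;
}
```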
2024-04-19 15:58:27 +02:00
Guray Ozen
0a600c34c8 [mlir][nvgpu] Make phaseParity of mbarrier.try_wait i1 (#81460)
Currently, the `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` has
`index` type. This can cause problems if any value other than 0 or 1 is
passed, because the PTX instruction only accepts an even or odd phase. This
PR makes the `phaseParity` argument `i1` to avoid misuse.

Here is the relevant information from the PTX documentation:

```
The .parity variant of the instructions test for the completion of the phase indicated 
by the operand phaseParity, which is the integer parity of either the current phase or 
the immediately preceding phase of the mbarrier object. An even phase has integer 
parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity 
operand are 0 and 1.
```
For more information, see:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait
2024-02-13 09:50:34 +01:00
Guray Ozen
fa13c3eea7 [mlir][nvgpu] Fix transposeB in nvgpu.warpgroup.mma (#79271)
PR #76150 fixed the meaning of `transposeB` in the NVVM dialect, which was
initially implemented with the opposite meaning.

This PR fixes the lowering of `nvgpu.warpgroup.mma` to NVVM dialect.

This will fix two integration tests:
gemm_f32_f16_f16_128x128x128.mlir
gemm_pred_f32_f16_f16_128x128x128.mlir
2024-01-25 09:25:43 +01:00
Guray Ozen
12c241b365 [MLIR][NVVM] Explicit Data Type for Output in wgmma.mma_async (#78713)
The current implementation of the `nvvm.wgmma.mma_async` Op deduces the data
type of the output matrix from the data type of the struct member, which can
be unintuitive, especially in cases where types like `2xf16` are packed
into `i32`.

This PR addresses this issue by improving the Op to include an explicit
data type for the output matrix.

The modified Op now includes an explicit data type for Matrix-D (<f16>),
and looks as follows:

```
%result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ...
nvvm.wgmma.mma_async
    %descA, %descB, %result,
    #nvvm.shape<m = 64, n = 32, k = 16>,
    D [<f16>, #nvvm.wgmma_scale_out<zero>],
    A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>],
    B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>]
```
2024-01-22 08:37:20 +01:00
Guray Ozen
21830c9135 [mlir][nvgpu] Fix 'warpgroup.mma.store' index calculation (#78413)
This PR fixes the 'nvgpu.warpgroup.mma.store' index calculation. When
the destination memref and the current accumulator matrix were small, the
previous code accessed indices out of range.
2024-01-22 08:32:56 +01:00
Guray Ozen
8dd0d95c7c [mlir][nvgpu] Add nvgpu.tma.async.store (#77811)
This PR adds the `nvgpu.tma.async.store` Op for asynchronous stores using the
Tensor Memory Access (TMA) unit.

It also implements lowering of the Op to the NVVM dialect. The Op currently
performs asynchronous stores of a tiled memory region from shared to
global memory for a single CTA.
2024-01-15 11:44:51 +01:00
Guray Ozen
4319e1916d [mlir][nvgpu] Introduce Multicast Capability to nvgpu.tma.async.load (#76935)
This PR improves the functionality of the `nvgpu.tma.async.load` Op by
adding support for multicast. While we already had this capability in
the lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM
Op, this PR lowers mask information to the NVVM operation.
2024-01-05 10:48:55 +01:00
Guray Ozen
3a03da37a3 [mlir][nvgpu] Add address space attribute converter in nvgpu-to-nvvm pass (#74075)
The GPU dialect has `#gpu.address_space<workgroup>` for NVGPU shared memory
(address space 3). However, when IR combines the NVGPU and GPU dialects,
the `nvgpu-to-nvvm` pass fails due to the missing attribute conversion.

This PR adds `populateGpuMemorySpaceAttributeConversions` to the
nvgpu-to-nvvm lowering, so `#gpu.address_space<workgroup>` can be used with
the `nvgpu-to-nvvm` pass.
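
A minimal sketch of wiring this into the lowering (the header, exact signature, and address-space numbers should be treated as assumptions):

```
#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
#include "mlir/Conversion/LLVMCommon/TypeConverter.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "llvm/Support/ErrorHandling.h"

using namespace mlir;

// Maps #gpu.address_space<...> attributes to NVVM integer address spaces so
// that memrefs carrying GPU address-space attributes convert cleanly.
static void addGpuAddressSpaceConversions(LLVMTypeConverter &converter) {
  populateGpuMemorySpaceAttributeConversions(
      converter, [](gpu::AddressSpace space) -> unsigned {
        switch (space) {
        case gpu::AddressSpace::Global:
          return 1; // NVVM global memory
        case gpu::AddressSpace::Workgroup:
          return 3; // NVVM shared (workgroup) memory
        case gpu::AddressSpace::Private:
          return 5; // NVVM local memory
        }
        llvm_unreachable("unknown address space");
      });
}
```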
2023-12-04 16:48:39 +01:00
Guray Ozen
9ceea08859 [mlir] im2col & l2cache on cp.async.bulk.tensor.shared.cluster.global (#72967)
This PR adds support for `im2col` and `l2cache` to
`cp.async.bulk.tensor.shared.cluster.global`. The Op now supports all
the traits of the corresponding PTX instruction.

The current structure of this operation looks roughly as shown below. The
PR also simplifies the types so we no longer need to write obvious types
after `:`.
```
nvvm.cp.async.bulk.tensor.shared.cluster.global
		%dest, %tmaDescriptor, %barrier,
		box[%crd0,%crd1,%crd2,%crd3,%crd4]
		im2col[%off0,%off1,%off2] 			<-- PR introduces
		multicast_mask = %ctamask
		l2_cache_hint = %cacheHint			<-- PR introduces
		: !llvm.ptr<3>, !llvm.ptr
```
2023-11-22 16:08:09 +01:00
Guray Ozen
108380da35 [mlir][nvvm] Add cp.async.bulk.tensor.shared.cluster.global.multicast (#72429)
This PR introduces the `cp.async.bulk.tensor.shared.cluster.global.multicast`
Op in the NVVM dialect. It uses TMA to load data from global memory to the
shared memory of multiple CTAs in the cluster.

It resolves #72368
2023-11-16 14:34:56 +01:00
Christian Ulmann
2f17c9f65e [MLIR][NVGPUToNVVM] Remove typed pointer support (#70867)
This commit removes the support for lowering NVGPU to NVVM dialect with
typed pointers. Typed pointers have been deprecated for a while now and
it's planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
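
For context, a small sketch of how pointer types are built once only opaque pointers remain (treat the exact builder overloads as assumptions):

```
#include "mlir/Dialect/LLVMIR/LLVMTypes.h"

using namespace mlir;

// With typed pointers (deprecated), the element type was part of the type,
// e.g. LLVM::LLVMPointerType::get(f32Type). With opaque pointers only the
// address space is encoded in the type.
static LLVM::LLVMPointerType getSharedMemPtrType(MLIRContext *ctx) {
  return LLVM::LLVMPointerType::get(ctx, /*addressSpace=*/3);
}
```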
2023-11-02 07:35:21 +01:00
Guray Ozen
192d3320f0 [mlir][nvgpu] Add predicate argument to NVGPU Ops (#69322) 2023-10-18 19:41:51 +02:00
Guray Ozen
39cdefb5b5 [mlir][nvvm] Add prefetch.tensormap (#67564)
This PR adds the `prefetch.tensormap` Op. It brings the cache line
containing the given TMA descriptor into cache for subsequent use by the
cp.async.bulk.tensor instruction.


https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-prefetch-prefetchu
2023-10-17 13:03:37 +02:00
Guray Ozen
c4ba84d655 [mlir][nvgpu] Fix packing accumulator matrix (#69316)
PR #68728 significantly simplified the accumulator matrix type, making
it easier to work with the nvgpu dialect without worrying about the
number of required structs, as this information is abstracted away in
the nvgpu-to-nvvm transformation.

However, the structs were not packed after initialization, causing the
accumulator matrix to hold undefined values, which is wrong. This PR
addresses that.
2023-10-17 12:46:10 +02:00
Guray Ozen
63389326f5 [mlir][nvvm] Support predicates in BasicPtxBuilder (#67102)
This PR enhances `BasicPtxBuilder` to support predicates in PTX code
generation. The `BasicPtxBuilder` interface was initially introduced for
generating PTX code automatically for Ops that aren't supported by LLVM
core. Predicates, which are typically not supported in LLVM core, are
now supported using the same mechanism.

In PTX programming, instructions can be guarded by predicates as shown
below. Here `@p` is a predicate register that guards the execution of the
instruction.

```
@p ptx.code op1, op2, op3
```

This PR introduces the `getPredicate` function in the `BasicPtxBuilder`
interface to set an optional predicate. When a predicate is provided,
the instruction is generated with the predicate and guarded by it;
otherwise, no predicate is generated. Note that the predicate value must
always appear as the last argument in the Op definition.

Additionally, this PR implements predicate usage for the following ops:

- mbarrier.init
- mbarrier.init.shared
- mbarrier.arrive.expect_tx
- mbarrier.arrive.expect_tx.shared
- cp.async.bulk.tensor.shared.cluster.global
- cp.async.bulk.tensor.global.shared.cta

For more detail, see the PTX programming model:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
2023-10-17 12:42:36 +02:00
Guray Ozen
52db7e2745 [mlir][nvgpu] Improve WarpgroupAccumulator type to simplify IR (#68728)
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that keeps the accumulator matrix used by warp-group level
matrix multiplication. It is handy to have a special type for that, as
the matrix is distributed among the threads of the warp-group. However, the
current transformations require creating and using multiple
`WarpgroupAccumulator` values if the shape of the GEMM is larger than the
supported shape of the `wgmma.mma_async` instruction. This makes the IR dense.

This PR improves the transformation of `WarpgroupAccumulator` type in
every nvgpu Op that uses it.

**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->  
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>

// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>  


// Epilogue 
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```

**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->  
           !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>

// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>  

// Epilogue 
nvgpu.warpgroup.mma.store [%r1] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
    into memref<128x128xf32,3>
```
2023-10-17 11:46:47 +02:00
Guray Ozen
315ab3c44b [MLIR][NVGPU] Introduce warpgroup.init.accumulator Op (#67530)
This Op generates and initializes the accumulator matrix for the
`nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate
(mma).

Its associated transformation generates an `!llvm.struct<>` and fills it
with the initial values. The size of the struct is the number of required
inout registers for the `nvgpu.warpgroup.mma` op.
2023-10-11 08:28:26 -07:00
Guray Ozen
d20fbc9007 [MLIR][NVGPU] Introduce nvgpu.warpgroup.mma.store Op for Hopper GPUs (#65441)
This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
the fragmented result(s) of type `nvgpu.warpgroup.accumulator` produced by
`warpgroup.mma` into the given memref.

An example of a fragmented matrix is given here:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d

The `warpgroup.mma.store` does the following:
1) Takes one or more `nvgpu.warpgroup.accumulator` values (fragmented
result matrices).
2) Calculates indices per thread in the warp-group and stores the data into
the given memref.

Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...

// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD : 
    !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
    !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> 
    to memref<128x128xf32,3>
```
2023-10-05 10:54:13 +02:00
Guray Ozen
b74cfc139a [mlir][nvgpu] Improve nvgpu->nvvm transformation of warpgroup.mma Op (NFC) (#67325)
This PR introduces substantial improvements to the readability and
maintainability of the `nvgpu.warpgroup.mma` Op transformation from
nvgpu->nvvm. This transformation plays a crucial role in GEMM and
manages complex operations such as generating multiple wgmma ops and
iterating their descriptors. The prior code lacked clarity, but this PR
addresses that issue effectively.

**This PR does the following:**
**Introduces a helper class:** the `WarpgroupGemm` class encapsulates the
necessary functionality, making the code cleaner and more
understandable.

**Detailed Documentation:** Each function within the helper class is
thoroughly documented to provide clear insights into its purpose and
functionality.
2023-10-05 10:16:59 +02:00
Guray Ozen
7eb2b99f16 [mlir] Change the class name of the GenerateWarpgroupDescriptor (#68286) 2023-10-05 10:15:40 +02:00
Guray Ozen
6dc7717bca [MLIR][NVGPU] Change name wgmma.descriptor to warpgroup.descriptor (NFC) (#67526)
The NVGPU dialect is gaining substantial support for warpgroup-level
operations, and their names always start with `warpgroup....`.

This PR changes the name of the Op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for the sake of consistency.
2023-10-05 09:01:48 +02:00
Guray Ozen
ee49cda7d4 [mlir][nvgpu] Use ImplicitLocOpBuilder in nvgpu-to-nvvm pass (NFC) (#67993)
For the sake of better readability, this PR uses `ImplicitLocOpBuilder`
instead of rewriter + loc.
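
A small illustrative sketch of the readability win (the constant-building code is hypothetical, not taken from this pass):

```
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/ImplicitLocOpBuilder.h"

using namespace mlir;

// Before: every rewriter.create<...>(loc, ...) call repeats the location.
// After: ImplicitLocOpBuilder captures the location once.
static Value buildZeroIndex(Location loc, OpBuilder &rewriter) {
  ImplicitLocOpBuilder b(loc, rewriter);
  return b.create<arith::ConstantIndexOp>(0);
}
```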
2023-10-03 10:52:36 +02:00
Guray Ozen
17649a7726 [MLIR][NVGPU] Introduce nvgpu.mbarrier.group for multiple mbarrier use (#65951)
A common practice involves creating multiple `mbarrier` objects;
see the example below. This is particularly valuable in scenarios like
software pipelining for GEMM, where we need to generate multiple
barriers and dynamically use and wait on them in a loop.

This PR generalizes the `nvgpu.mbarrier.barrier` type into
`nvgpu.mbarrier.group`. All `mbarrier`-related Ops now use this type.
Consequently, these Ops are now capable of managing multiple barriers
seamlessly.

Having `num_barriers = 4` lets us place the mbarrier object(s) in
static shared memory. We could make the value dynamic, but that would
require dynamic shared memory and would complicate the codegen.

```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
...
scf.for %i = %c0 to %n step %c1 {
  nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ... 

  // ... Do work once mbarrier is ready 

  nvgpu.mbarrier.arrive.expect_tx %barriers[ ((i + 3) % 4) ] ... 
}
```
We will have mbarrier usages like below: 
```
expect_tx[0]
expect_tx[1]
expect_tx[2]
Loop:
 try_wait mbarrier[0], expect_tx[3]
 try_wait mbarrier[1], expect_tx[0]
 try_wait mbarrier[2], expect_tx[1]
 try_wait mbarrier[3], expect_tx[2]
...
```
2023-09-22 17:09:43 +02:00
Michael Liao
8eed7fbc77 [mlir][nvgpu] Fix shared build. NFC 2023-09-22 10:00:29 -04:00
Guray Ozen
2388222695 [MLIR][NVGPU] Adding nvgpu.warpgroup.mma Op for Hopper GPUs (#65440)
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.

Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in the NVVM dialect. Multiple
instances of `nvvm.wgmma.mma_async` must be used to achieve the desired
shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity
and provides a higher-level interface for performing warpgroup-level
matrix operations.

The `nvgpu.warpgroup.mma` Op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates over the input matrix descriptors to achieve the desired
computation shape.
3) Groups and runs `wgmma` instructions asynchronously, and eventually waits
for them. This is done via `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Results in fragmented matrices.

Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}: 
      !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>  
```

The Op will result in the following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA,     %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+2,   %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+4,   %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+8,   %descB+348, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+348, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```

The Op keeps:
- the first 64 registers (`{%f1, %f2,    62 more registers}`) -> `%acc1`
- the second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`
2023-09-22 11:46:29 +02:00
Guray Ozen
b96d069324 [NVGPU] Add debug in nvgpu (nfc)
Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D159343
2023-09-01 16:40:40 +02:00
Guray Ozen
52b93d2fe2 [mlir][nvgpu] remove duplicated pattern (nfc)
Differential Revision: https://reviews.llvm.org/D158847
2023-08-25 17:12:58 +02:00
Guray Ozen
50ab427a29 [MLIR][NVGPU] Introduce Warpgroup Matrix Descriptor Type
The Warpgroup Matrix Descriptor is a 64-bit integer that holds information about a matrix used by the wgmma instruction.

In this work, a new type is introduced for the descriptor. This enhances the readability of the IR and allows for easier verification using MLIR verification tools.

The type contains a 'memref' related to the descriptor, which is crucial for preserving and conveying information.

Depends on D157382

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D158403
2023-08-22 17:02:37 +02:00
Guray Ozen
cce3e8ed89 [MLIR][NVGPU] Introduction of wgmma.generate.descriptor Op
This work introduces a new Op, `wgmma.generate.descriptor`, designed to create a wgmma descriptor for the inputs of matrix multiply-and-accumulate operations using the `wgmma.mma_async` PTX instruction.

The descriptor format specifications can be found in the following link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor

It's important to note that this op is in its initial phase, and it does come with certain limitations. It only supports 128b swizzling and does not incorporate interleaving. In the future, different calculations will be addressed in separate works, expanding the capabilities of the op.

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D157382
2023-08-22 16:12:25 +02:00
Matthias Springer
ce254598b7 [mlir][Conversion] Store const type converter in ConversionPattern
ConversionPatterns do not (and should not) modify the type converter that they are using.

* Make `ConversionPattern::typeConverter` const.
* Make member functions of the `LLVMTypeConverter` const.
* Conversion patterns take a const type converter.
* Various helper functions (that are called from patterns) now also take a const type converter.

Differential Revision: https://reviews.llvm.org/D157601
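
A minimal sketch of a pattern written against the const-converter API (the op name and pattern are hypothetical, and exact constructor details should be treated as assumptions):

```
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Transforms/DialectConversion.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

// Rewrites a hypothetical "test.foo" op; the pattern only queries the
// converter and never mutates it.
struct FooOpLowering : public ConversionPattern {
  FooOpLowering(const TypeConverter &typeConverter, MLIRContext *ctx)
      : ConversionPattern(typeConverter, "test.foo", /*benefit=*/1, ctx) {}

  LogicalResult
  matchAndRewrite(Operation *op, ArrayRef<Value> operands,
                  ConversionPatternRewriter &rewriter) const override {
    const TypeConverter *converter = getTypeConverter();
    SmallVector<Type> newTypes;
    if (failed(converter->convertTypes(op->getResultTypes(), newTypes)))
      return rewriter.notifyMatchFailure(op, "could not convert result types");
    rewriter.replaceOpWithNewOp<UnrealizedConversionCastOp>(op, newTypes,
                                                            operands);
    return success();
  }
};
```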
2023-08-14 09:03:11 +02:00
Nicolas Vasilache
99475f5b4a [mlir][transform] Add NVGPU to NVVM conversion via transform.apply_conversion_patterns
Differential Revision: https://reviews.llvm.org/D157501
2023-08-09 14:09:57 +00:00
Guray Ozen
50a76a7d73 [MLIR][NVGPU] Handling Offset in nvgpu.tma.async.load
When using the `nvgpu.tma.async.load` Op to asynchronously load data into shared memory, it fails to account for the provided offsets, potentially leading to incorrect memory accesses. Using offsets is common practice, especially with dynamic shared memory. This work addresses the problem by ensuring that offsets are properly taken into account.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D157380
2023-08-08 13:25:00 +02:00
Matthias Springer
544f0e9161 [mlir] Fix build after D155680 2023-07-21 13:33:54 +02:00
Guray Ozen
e56d6745f7 [mlir][nvgpu] Add tma.create.descriptor to create tensor map descriptor
The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by the Tensor Memory Access (TMA) unit. The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.

The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

Depends on D155453

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155680
2023-07-21 11:33:04 +02:00
Guray Ozen
9dad32cb90 [mlir][nvgpu] Improve finding module Op for mbarrier.create
The current transformation expects the module op to be two levels higher; however, that is not always the case. This work searches for the module op in a while loop.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155825
2023-07-21 10:36:45 +02:00
Guray Ozen
70c2e0618a [mlir][nvgpu] Add nvgpu.tma.async.load and nvgpu.tma.descriptor
This work adds the `nvgpu.tma.async.load` Op, which requests a TMA load asynchronously using an mbarrier object.

It also creates the `nvgpu.tma.descriptor` type. The type is supposed to be created by the `cuTensorMapEncodeTiled` CUDA Driver API.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155453
2023-07-21 10:23:25 +02:00
Guray Ozen
836dbb8522 [mlir][nvgpu] Add mbarrier.arrive.expect_tx and mbarrier.try_wait.parity
This work adds two Ops:
`mbarrier.arrive.expect_tx` performs expect_tx on `mbarrier.barrier` and returns `mbarrier.barrier.token`
`mbarrier.try_wait.parity` waits on `mbarrier.barrier` and `mbarrier.barrier.token`

`mbarrier.arrive.expect_tx` is one of the requirements to enable H100 TMA support.

Depends on D154074 D154076 D154059 D154060

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D154094
2023-07-20 13:48:30 +02:00
Guray Ozen
affcfccd3c [mlir][nvgpu] Add initial support for mbarrier
`mbarrier` is a barrier created in shared memory that supports flavors of thread synchronization other than `__syncthreads`; for more information, see below:
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier

This work adds initial Ops wrt `mbarrier` to nvgpu dialect.

First, it introduces two types:
`mbarrier.barrier`, which is a barrier object in shared memory
`mbarrier.barrier.token`, which is its token

It introduces the following Ops:
`mbarrier.create` creates an `mbarrier.barrier`
`mbarrier.init` initializes an `mbarrier.barrier`
`mbarrier.arrive` performs an arrive-on operation on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`
`mbarrier.arrive.nocomplete` performs a non-blocking arrive-on operation on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`
`mbarrier.test_wait` waits on an `mbarrier.barrier` and an `mbarrier.barrier.token`

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D154090
2023-07-11 17:35:27 +02:00
Guray Ozen
2c5739675c [mlir][nvgpu] Implement nvgpu.device_async_copy by NVVMToLLVM Pass
`nvgpu.device_async_copy` is lowered into the `cp.async` PTX instruction. However, the NVPTX backend does not support all of its modes, especially when zero padding is needed. Therefore, the current MLIR implementation generates inline assembly for that.

This work simplifies PTX generation for `nvgpu.device_async_copy` and implements it via the `NVVMToLLVM` pass.

Depends on D154060

Reviewed By: nicolasvasilache, manishucsd

Differential Revision: https://reviews.llvm.org/D154345
2023-07-11 12:18:28 +02:00
Matthias Springer
61223c49dd [mlir][GPU] Rename MLIRGPUOps CMake target to MLIRGPUDialect
This is for consistency with other dialects.

Differential Revision: https://reviews.llvm.org/D150659
2023-05-16 14:25:08 +02:00
Tres Popp
5550c82189 [mlir] Move casting calls from methods to function calls
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call, as the latter has been decided to be more
consistent.

Note that there still exist classes that only define methods directly,
such as AffineExpr, and this change does not currently include work to
support a functional cast/isa call.

Caveats include:
- This clang-tidy script probably has more problems.
- This only touches C++ code, so nothing that is being generated.

Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
  for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443

Implementation:
This first patch was created with the following steps. The intention is
to only do automated changes at first, so I waste less time if it's
reverted, and so the first mass change is more clear as an example to
other teams that will need to follow similar steps.

Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
   additional check:
   https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
   and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
   them to a pure state.
4. Some changes have been deleted for the following reasons:
   - Some files had a variable also named cast
   - Some files had not included a header file that defines the cast
     functions
   - Some files are definitions of the classes that have the casting
     methods, so the code still refers to the method instead of the
     function without adding a prefix or removing the method declaration
     at the same time.

```
ninja -C $BUILD_DIR clang-tidy

run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
               -header-filter=mlir/ mlir/* -fix

rm -rf $BUILD_DIR/tools/mlir/**/*.inc

git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\
            mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\
            mlir/lib/**/IR/\
            mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\
            mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\
            mlir/test/lib/Dialect/Test/TestTypes.cpp\
            mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\
            mlir/test/lib/Dialect/Test/TestAttributes.cpp\
            mlir/unittests/TableGen/EnumsGenTest.cpp\
            mlir/test/python/lib/PythonTestCAPI.cpp\
            mlir/include/mlir/IR/
```

Differential Revision: https://reviews.llvm.org/D150123
2023-05-12 11:21:25 +02:00
Nicolas Vasilache
95cb9862a8 [mlir][NVGPU] Support cache all (.ca) in nvgpu.device_async_copy
This patch adds support for cache-all (.ca) in the nvgpu-to-nvvm conversion for the inline asm `cp.async`.

For sizes other than 16 bytes, `cp.async` with cache-global is not allowed, and cache-all is required to generate valid PTX.

Differential revision: https://reviews.llvm.org/D148604

Authored-by: Manish Gupta <manigupta@google.com>
2023-04-18 05:00:53 -07:00
Aart Bik
4e4af1338d [mlir][gpu][nvvm] fixed bug with literal for inline asm for mma instruction
The 'mma.sp.sync.aligned' family of instructions expects
the sparsity selector as a direct literal (0x0 or 0x1).
The current MLIR inline asm passed this as a value in a
register, which broke the downstream assemblers.

This is a small step towards supporting 2:4 sparsity on
NVIDIA GPUs in the sparse compiler of MLIR.

Reviewed By: ThomasRaoux, guraypp

Differential Revision: https://reviews.llvm.org/D146110
2023-03-17 09:22:15 -07:00
Markus Böck
53689fdfb2 [mlir][NVGPUToNVVM] Add option for emitting opaque pointers
Part of https://discourse.llvm.org/t/rfc-switching-the-llvm-dialect-and-dialect-lowerings-to-opaque-pointers/68179

The 'use-opaque-pointers' option, when enabled, changes the patterns to emit opaque pointers instead of typed pointers. As part of the migration effort, the tests have been converted to opaque pointers, with an extra file for testing just the typed-pointer-specific code paths.

Differential Revision: https://reviews.llvm.org/D144736
2023-02-24 17:43:38 +01:00
Krzysztof Drewniak
499abb243c Add generic type attribute mapping infrastructure, use it in GpuToX
Remapping memory spaces is a function often needed in type
conversions, most often when going to LLVM or to/from SPIR-V (a future
commit), and it is possible that such remappings may become more
common in the future as dialects take advantage of the more generic
memory space infrastructure.

Currently, memory space remappings are handled by running a
special-purpose conversion pass before the main conversion that
changes the address space attributes. In this commit, this approach is
replaced by adding a notion of type attribute conversions to the
TypeConverter, which is then used to convert memory space attributes.

Then, we use this infrastructure throughout the *ToLLVM conversions.
This has the advantage of loosening the requirements on the inputs to
those passes from "all address spaces must be integers" to "all
memory spaces must be convertible to integer spaces", a looser
requirement that reduces the coupling between portions of MLIR.

On top of that, this change leads to the removal of most of the calls
to getMemorySpaceAsInt(), bringing us closer to removing it.

(A rework of the SPIR-V conversions to use this new system will be in
a follow-up commit.)

As a note, one long-term motivation for this change is that I would
eventually like to add an allocaMemorySpace key to MLIR data layouts
and then call getMemRefAddressSpace(allocaMemorySpace) in the
relevant *ToLLVM in order to ensure all alloca()s, whether incoming or
produced during the LLVM lowering, have the correct address space for
a given target.

I expect that the type attribute conversion system may be useful in
other contexts.

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D142159
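
A minimal sketch of what registering such a type attribute conversion can look like (assumed API shape, with an illustrative mapping):

```
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/Transforms/DialectConversion.h"

using namespace mlir;

// Rewrites #gpu.address_space<...> memory-space attributes on memref types
// into integer address-space attributes.
static void addMemorySpaceAttributeConversion(TypeConverter &converter) {
  converter.addTypeAttributeConversion(
      [](BaseMemRefType type, gpu::AddressSpaceAttr memorySpace) -> Attribute {
        unsigned addressSpace =
            memorySpace.getValue() == gpu::AddressSpace::Workgroup
                ? 3u  // shared memory
                : 1u; // everything else mapped to global for this sketch
        return IntegerAttr::get(IntegerType::get(type.getContext(), 64),
                                addressSpace);
      });
}
```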
2023-02-09 18:00:46 +00:00
Ramkumar Ramachandra
22426110c5 mlir/tblgen: use std::optional in generation
This is part of an effort to migrate from llvm::Optional to
std::optional. This patch changes the way mlir-tblgen generates .inc
files, and modifies tests and documentation appropriately. It is a "no
compromises" patch, and doesn't leave the user with an unpleasant mix of
llvm::Optional and std::optional.

A non-trivial change has been made to ControlFlowInterfaces to split one
constructor into two, relating to a build failure on Windows.

See also: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716

Signed-off-by: Ramkumar Ramachandra <r@artagnon.com>

Differential Revision: https://reviews.llvm.org/D138934
2022-12-17 11:13:26 +01:00
Kazu Hirata
1a36588ec6 [mlir] Use std::nullopt instead of None (NFC)
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated.  The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
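
A tiny before/after sketch of the mechanical change (illustrative function, not from the patch):

```
#include <optional>

// Before: the function returned llvm::Optional<int> and used None.
std::optional<int> findEven(int x) {
  if (x % 2 != 0)
    return std::nullopt; // previously: return None;
  return x;
}
```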
2022-12-03 18:50:27 -08:00