The GPU dialect has `#gpu.address_space<workgroup>` for shared memory, which corresponds to NVGPU's address space 3. However, when the IR combines the NVGPU and GPU dialects, the `nvgpu-to-nvvm` pass fails due to a missing attribute conversion.
This PR adds `populateGpuMemorySpaceAttributeConversions` to the nvgpu-to-nvvm lowering, so `#gpu.address_space<workgroup>` can be used with the `nvgpu-to-nvvm` pass.
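A minimal sketch of the kind of IR this enables (not taken from the patch; the allocation below is illustrative):
```
// Shared memory typed with the GPU dialect attribute instead of a bare
// integer address space.
%shmem = memref.alloc() : memref<64x64xf16, #gpu.address_space<workgroup>>
// With the attribute conversion populated, nvgpu-to-nvvm maps
// #gpu.address_space<workgroup> to NVVM address space 3 (e.g. !llvm.ptr<3>).
```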
This PR adds support for `im2col` and `l2cache` to
`cp.async.bulk.tensor.shared.cluster.global`. The Op now supports all
the traits of the corresponding PTX instruction.
The current structure of this operation looks roughly like the example
below. The PR also simplifies the types so we no longer need to write
obvious types after `:`.
```
nvvm.cp.async.bulk.tensor.shared.cluster.global
%dest, %tmaDescriptor, %barrier,
box[%crd0,%crd1,%crd2,%crd3,%crd4]
im2col[%off0,%off1,%off2] <-- PR introduces
multicast_mask = %ctamask
l2_cache_hint = %cacheHint <-- PR introduces
: !llvm.ptr<3>, !llvm.ptr
```
This PR introduces the `cp.async.bulk.tensor.shared.cluster.global.multicast`
Op in the NVVM dialect. It uses TMA to load data from global memory to the
shared memory of multiple CTAs in the cluster.
It resolves #72368
#68728 significantly simplified the accumulator matrix type, making
it easier to work with the nvgpu dialect without worrying about the
number of required structs, as this information is abstracted away in
the nvgpu-to-nvvm transformation.
However, we forgot to pack the structs after initialization, causing the
accumulator matrix to hold undefined values, which is wrong. This PR
addresses that.
This PR enhances `BasicPtxBuilder` to support predicates in PTX code
generation. The `BasicPtxBuilder` interface was initially introduced for
generating PTX code automatically for Ops that aren't supported by LLVM
core. Predicates, which are typically not supported in LLVM core, are
now supported using the same mechanism.
In PTX programming, instructions can be guarded by predicates, as shown
below. Here `@p` is a predicate register that guards the execution of the
instruction.
```
@p ptx.code op1, op2, op3
```
This PR introduces the `getPredicate` function in the `BasicPtxBuilder`
interface to set an optional predicate. When a predicate is provided,
the instruction is generated with the predicate guard; otherwise, no
predicate is generated. Note that the predicate value must always
appear as the last argument in the Op definition.
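For illustration, a guarded Op could look roughly like the sketch below in NVVM IR (the printed `predicate = ...` syntax is an assumption of this sketch; the key point is that the `i1` predicate is the trailing operand):
```
// Hypothetical sketch: %p is an i1 value that guards the generated PTX.
nvvm.mbarrier.init.shared %barrier, %count, predicate = %p : !llvm.ptr<3>, i32, i1
```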
Additionally, this PR implements predicate usage for the following ops:
- mbarrier.init
- mbarrier.init.shared
- mbarrier.arrive.expect_tx
- mbarrier.arrive.expect_tx.shared
- cp.async.bulk.tensor.shared.cluster.global
- cp.async.bulk.tensor.global.shared.cta
For more detail, see the PTX programming model:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that keeps the accumulator matrix that is used by warp-group level
matrix multiplication. It is handy to have a special type for that as
the matrix is distributed among the threads of the warp-group. However,
the current transformations require creating and using multiple
`WarpgroupAccumulator` values when the GEMM shape is larger than the shape
supported by a single `wgmma.mma_async` instruction. This makes the IR look
dense.
This PR improves the transformation of the `WarpgroupAccumulator` type in
every nvgpu Op that uses it.
**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
into memref<128x128xf32,3>
```
**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store %r1 to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
into memref<128x128xf32,3>
```
This Op generates and initializes the accumulator matrix for the
`nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate
(mma).
Its associated transformation generates an `!llvm.struct<>` and fills it
with the initial values. The size of the struct is the number of inout
registers required by the `nvgpu.warpgroup.mma` op.
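Roughly, the generated IR has the following shape (a sketch; the struct arity here is illustrative, as the real size depends on the wgmma shape):
```
// Fill an LLVM struct with zeros, one insertvalue per accumulator register.
%cst = llvm.mlir.constant(0.000000e+00 : f32) : f32
%s0  = llvm.mlir.undef : !llvm.struct<(f32, f32, f32, f32)>
%s1  = llvm.insertvalue %cst, %s0[0] : !llvm.struct<(f32, f32, f32, f32)>
%s2  = llvm.insertvalue %cst, %s1[1] : !llvm.struct<(f32, f32, f32, f32)>
%s3  = llvm.insertvalue %cst, %s2[2] : !llvm.struct<(f32, f32, f32, f32)>
%s4  = llvm.insertvalue %cst, %s3[3] : !llvm.struct<(f32, f32, f32, f32)>
```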
This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
the fragmented result(s) of type `nvgpu.warpgroup.accumulator` produced by
`warpgroup.mma` into the given memref.
An example of a fragmented matrix is given here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
The `warpgroup.mma.store` does the following:
1) Takes one or more `nvgpu.warpgroup.accumulator` values (fragmented
result matrices)
2) Calculates the per-thread indexes within the warp-group and stores the
data into the given memref.
Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...
// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD :
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
to memref<128x128xf32,3>
```
This PR introduces substantial improvements to the readability and
maintainability of the `nvgpu.warpgroup.mma` Op transformation from
nvgpu->nvvm. This transformation plays a crucial role in GEMM and
manages complex operations such as generating multiple wgmma ops and
iterating their descriptors. The prior code lacked clarity, but this PR
addresses that issue effectively.
**This PR does the following:**
**Introduces a helper class:** `WarpgroupGemm` class encapsulates the
necessary functionality, making the code cleaner and more
understandable.
**Detailed Documentation:** Each function within the helper class is
thoroughly documented to provide clear insights into its purpose and
functionality.
The NVGPU dialect is gaining broad support for warpgroup-level operations,
whose names always start with `warpgroup...`.
This PR renames the Op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for the sake of consistency.
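Concretely, the rename changes the spelling of the descriptor type (both forms appear elsewhere in this series):
```
// Before: !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>
// After:  !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>
```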
A common practice involves creating multiple `mbarrier` objects; see the
example below. This is particularly valuable in scenarios like software
pipelining for GEMM, where we need to generate multiple barriers and then
dynamically use and wait on them in a loop.
This PR turns the `nvgpu.mbarrier.barrier` type into
`nvgpu.mbarrier.group`. All `mbarrier`-related Ops now use this type.
Consequently, these Ops are now capable of managing multiple barriers
seamlessly.
Having `num_barriers = 4` lets us place the mbarrier object(s) in static
shared memory. We could make the value dynamic, but that would require
dynamic shared memory and complicate the codegen.
```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
...
scf.for %i = %c0 to %n step %c1 {
nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ...
// ... Do work once mbarrier is ready
nvgpu.mbarrier.arrive.expect_tx %barriers[ ((i + 3) % 4) ] ...
}
```
We will have mbarrier usage like the following:
```
expect_tx[0]
expect_tx[1]
expect_tx[2]
Loop:
try_wait mbarrier[0], expect_tx[3]
try_wait mbarrier[1], expect_tx[0]
try_wait mbarrier[2], expect_tx[1]
try_wait mbarrier[3], expect_tx[2]
...
```
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.
Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in the NVVM dialect. Multiple
instances of `nvvm.wgmma.mma_async` must be used to achieve the desired
shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity
and provides a higher-level interface for performing warpgroup-level
matrix operations.
The `nvgpu.warpgroup.mma` op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates the input matrix descriptors to achieve the desired computation
shape.
3) Groups and runs the `wgmma` instructions asynchronously, and eventually
waits for them. This is done by `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Results in fragmented matrices.
Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}:
!nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
```
The op results in the following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA, %descB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+2, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+4, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+6, %descB+384, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+384, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```
The Op keeps:
- the first 64 registers (`{%f1, %f2, 62 more registers}`) -> `%acc1`
- the second 64 registers (`{%f500, %f501, 62 more registers}`) -> `%acc2`
The Warpgroup Matrix Descriptor is a 64-bit integer that holds information about a matrix used by the wgmma instruction.
In this work, a new type is introduced for the descriptor. This enhances the readability of the IR and allows for easier verification using MLIR verification tools.
The type contains a `memref` related to the descriptor, which is crucial for preserving and conveying that information.
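For example, as used elsewhere in this series, the descriptor type carries the memref of the shared-memory matrix it describes:
```
!nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>
```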
Depends on D157382
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D158403
This work introduces a new Op, `wgmma.generate.descriptor`, designed to create a wgmma descriptor for inputs of matrix multiply and accumulate operations using `wgmma.mma_async` PTX instruction.
The descriptor format specifications can be found in the following link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor
It's important to note that this op is in its initial phase and comes with certain limitations: it only supports 128b swizzling and does not incorporate interleaving. Different calculations will be addressed in separate patches in the future, expanding the capabilities of the op.
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D157382
ConversionPatterns do not (and should not) modify the type converter that they are using.
* Make `ConversionPattern::typeConverter` const.
* Make member functions of the `LLVMTypeConverter` const.
* Conversion patterns take a const type converter.
* Various helper functions (that are called from patterns) now also take a const type converter.
Differential Revision: https://reviews.llvm.org/D157601
When using the `nvgpu.tma.async.load` Op to asynchronously load data into shared memory, it fails to account for the provided offsets, potentially leading to incorrect memory access. Using offsets is common practice, especially with dynamic shared memory. This work addresses the problem by ensuring the offsets are properly taken into account.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D157380
The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information, see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
The current transformation expects the module op to be two levels higher; however, that is not always the case. This work searches for the module op in a while loop.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155825
This work adds the `nvgpu.tma.async.load` Op that requests a TMA load asynchronously using an mbarrier object.
It also creates the `nvgpu.tma.descriptor` type. The type is supposed to be created by the `cuTensorMapEncodeTiled` CUDA Driver API.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155453
This work adds two Ops:
`mbarrier.arrive.expect_tx` performs expect_tx on `mbarrier.barrier` and returns `mbarrier.barrier.token`
`mbarrier.try_wait.parity` waits on `mbarrier.barrier` and `mbarrier.barrier.token`
`mbarrier.arrive.expect_tx` is one of the requirements to enable H100 TMA support.
Depends on D154074 D154076 D154059 D154060
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D154094
`mbarrier` is a barrier created in shared memory that supports flavors of thread synchronization other than `__syncthreads`; for more information, see below:
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier
This work adds initial Ops wrt `mbarrier` to nvgpu dialect.
First, it introduces two types:
`mbarrier.barrier`, which is a barrier object in shared memory
`mbarrier.barrier.token`, which is a token
It introduces the following Ops:
`mbarrier.create` creates `mbarrier.barrier`
`mbarrier.init` initializes `mbarrier.barrier`
`mbarrier.arrive` performs arrive-on `mbarrier.barrier` and returns `mbarrier.barrier.token`
`mbarrier.arrive.nocomplete` performs arrive-on (non-blocking) `mbarrier.barrier` and returns `mbarrier.barrier.token`
`mbarrier.test_wait` waits on `mbarrier.barrier` and `mbarrier.barrier.token`
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D154090
`nvgpu.device_async_copy` is lowered to the `cp.async` PTX instruction. However, the NVPTX backend does not support all of its modes, especially when zero padding is needed. Therefore, the current MLIR implementation generates inline assembly for it.
This work simplifies the PTX generation for `nvgpu.device_async_copy` and implements it via the `NVVMToLLVM` pass.
Depends on D154060
Reviewed By: nicolasvasilache, manishucsd
Differential Revision: https://reviews.llvm.org/D154345
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the methods to the
corresponding function calls, as this has been decided to be more
consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this change does not currently include work to
support a functional cast/isa call for those.
Caveats include:
- This clang-tidy script probably has more problems.
- This only touches C++ code, so nothing that is being generated.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This first patch was created with the following steps. The intention is
to only do automated changes at first, so I waste less time if it's
reverted, and so the first mass change is more clear as an example to
other teams that will need to follow similar steps.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling the one relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
4. Some changes have been deleted for the following reasons:
- Some files had a variable also named cast
- Some files had not included a header file that defines the cast
functions
- Some files are definitions of the classes that have the casting
methods, so the code still refers to the method instead of the
function without adding a prefix or removing the method declaration
at the same time.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
git restore mlir/lib/IR mlir/lib/Dialect/DLTI/DLTI.cpp\
mlir/lib/Dialect/Complex/IR/ComplexDialect.cpp\
mlir/lib/**/IR/\
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp\
mlir/lib/Dialect/Vector/Transforms/LowerVectorMultiReduction.cpp\
mlir/test/lib/Dialect/Test/TestTypes.cpp\
mlir/test/lib/Dialect/Transform/TestTransformDialectExtension.cpp\
mlir/test/lib/Dialect/Test/TestAttributes.cpp\
mlir/unittests/TableGen/EnumsGenTest.cpp\
mlir/test/python/lib/PythonTestCAPI.cpp\
mlir/include/mlir/IR/
```
Differential Revision: https://reviews.llvm.org/D150123
This patch adds support for cache-all (`.ca`) in the nvgpu-to-nvvm conversion for the inline-asm `cp.async`.
For sizes other than 16 bytes, `cp.async` with cache-global (`.cg`) is not allowed, so cache-all is required to generate valid PTX.
Differential revision: https://reviews.llvm.org/D148604
Authored-by: Manish Gupta <manigupta@google.com>
The `mma.sp.sync.aligned` family of instructions expects
the sparsity selector as a direct literal (0x0 or 0x1).
The current MLIR inline asm passed this as a value in
a register, which broke the downstream assemblers.
This is a small step towards supporting 2:4 sparsity on
NVIDIA GPUs in the sparse compiler of MLIR.
Reviewed By: ThomasRaoux, guraypp
Differential Revision: https://reviews.llvm.org/D146110
Remapping memory spaces is a function often needed in type
conversions, most often when going to LLVM or to/from SPIR-V (a future
commit), and it is possible that such remappings may become more
common in the future as dialects take advantage of the more generic
memory space infrastructure.
Currently, memory space remappings are handled by running a
special-purpose conversion pass before the main conversion that
changes the address space attributes. In this commit, this approach is
replaced by adding a notion of type attribute conversions to
TypeConverter, which is then used to convert memory space attributes.
Then, we use this infrastructure throughout the *ToLLVM conversions.
This has the advantage of loosening the requirements on the inputs to
those passes from "all address spaces must be integers" to "all
memory spaces must be convertible to integer spaces", a looser
requirement that reduces the coupling between portions of MLIR.
On top of that, this change leads to the removal of most of the calls
to getMemorySpaceAsInt(), bringing us closer to removing it.
(A rework of the SPIR-V conversions to use this new system will be in
a follow-up commit.)
As a note, one long-term motivation for this change is that I would
eventually like to add an allocaMemorySpace key to MLIR data layouts
and then call getMemRefAddressSpace(allocaMemorySpace) in the
relevant *ToLLVM conversions in order to ensure all alloca()s, whether
incoming or produced during the LLVM lowering, have the correct address
space for a given target.
I expect that the type attribute conversion system may be useful in
other contexts.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D142159
This is part of an effort to migrate from llvm::Optional to
std::optional. This patch changes the way mlir-tblgen generates .inc
files, and modifies tests and documentation appropriately. It is a "no
compromises" patch, and doesn't leave the user with an unpleasant mix of
llvm::Optional and std::optional.
A non-trivial change has been made to ControlFlowInterfaces to split one
constructor into two, relating to a build failure on Windows.
See also: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
Signed-off-by: Ramkumar Ramachandra <r@artagnon.com>
Differential Revision: https://reviews.llvm.org/D138934
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
This change adds a new NVGPU operation that targets the PTX `mma.sp.sync`
instruction variants. A lowering to NVVM is provided using inline
assembly.
Reviewed By: ThomasRaoux, manishucsd
Differential Revision: https://reviews.llvm.org/D137202
`cp_async_zfill` is currently not present in the nvvm backend, this patch adds `cp_async_zfill` support by adding inline asm when lowering from `nvgpu` to `nvvm`.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D132269
The patch introduces the required changes to update the pass declarations and definitions to use the new autogenerated files and allow dropping the old infrastructure.
Reviewed By: mehdi_amini, rriddle
Differential Revision: https://reviews.llvm.org/D132838
This patch "modernizes" the LLVM `insertvalue` and `extractvalue`
operations to use DenseI64ArrayAttr, since they only require an array of
indices and previously there was confusion about whether to use i32 or
i64 arrays, and to use assembly format.
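For reference, a minimal illustration of the resulting assembly format (not taken from the patch):
```
// Positions are written as a plain index list after this change.
%elem = llvm.extractvalue %val[1, 0] : !llvm.struct<(f32, struct<(i32, f32)>)>
%new  = llvm.insertvalue %elem, %acc[0] : !llvm.struct<(i32, i32)>
```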
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D131537
Adds an optional attribute to support tensor cores on the F32 datatype by lowering to `mma.sync` with TF32 operands. Since TF32 is not a native datatype in LLVM, we add `tf32Enabled` as an attribute to let the IR be aware of the `MmaSyncOp` datatype. Additionally, this patch adds placeholders for nvgpu-to-nvgpu transformations targeting the higher-precision tf32x3.
For mma.sync on f32 input using tensor cores there are two possibilities:
(a) tf32 (1 `mma.sync` per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate)
Typically, tf32 tensor core acceleration comes at a cost in accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issuing three `mma.sync` tensor core operations.
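For intuition, the tf32x3 scheme is commonly described by the following decomposition (a sketch for context, not part of this patch's implementation):
```
a_hi = tf32(a),  a_lo = tf32(a - a_hi)
b_hi = tf32(b),  b_lo = tf32(b - b_hi)
a*b  ~= a_hi*b_hi + a_hi*b_lo + a_lo*b_hi    (three mma.sync operations)
```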
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D130294
Following up on flipping dialects to `_Both`, flip the accessors used to
the prefixed variant ahead of flipping from `_Both` to `_Prefixed`. This
just flips to the accessors introduced in the preceding change, which are
simply prefixed forms of the existing accessors.
Mechanical change using helper script
https://github.com/jpienaar/llvm-project/blob/main/clang-tools-extra/clang-tidy/misc/AddGetterCheck.cpp and clang-format.
This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by
`newColIdx = col % vecSize + perm[row](col/vecSize,row)`
where `perm` is a permutation function indexed by `row` and `vecSize`
is the vector access size in elements (currently assumes 128bit
vectorized accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized accesses
common to loading matrix multiplication operands to/from shared memory.
Differential Revision: https://reviews.llvm.org/D127457
AsyncCopyOp lowering converted "size in elements" to "size in bytes"
assuming the element type size is at least one byte. This removes
that restriction, allowing types such as i4 and b1 to be handled
correctly (for example, 32 i4 elements occupy 16 bytes, not 32).
Differential Revision: https://reviews.llvm.org/D125838