This patch updates the syntax for the nvgpu_arrive Op
in matmulBuilder.py, fixing the compilation
error for this test.
For the warp-specialized matmul_kernel implementation,
removing the WaitGroupSyncOp (after the mma-main-loop)
fixes the observed hang.
With these two fixes, the test compiles and
executes successfully on an sm90a machine.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This patch fixes the sm90 cluster test by:
* Fixing a typo in LowerGpuOpsToNVVMOps where one of the ClusterDim Op
conversion patterns should actually target the
ClusterDimBlocks Op. This addresses the compilation error for this test.
* Changing the grid-size from (2,2,1) to (4,4,1). This passes the
scf.if check against the threshold of 3 below and actually
generates the required prints from the GPU.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The memref.expand_shape Op now explicitly takes an output_shape.
This patch adds it to the Op and fixes the failing test.
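For reference, a minimal sketch of the updated syntax (the shapes are illustrative, not taken from the test):
```
// The result shape is now stated explicitly via output_shape.
%e = memref.expand_shape %m [[0, 1]] output_shape [2, 4]
    : memref<8xf32> into memref<2x4xf32>
```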
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This commit adds support for the `gpu.cluster_dim_blocks` and
`gpu.cluster_block_id` Ops, which represent the number of blocks per cluster and
the block id within a cluster, respectively. It also fixes the description of
the `gpu.cluster_dim` Op and updates the `cga_cluster.mlir` test file to use
`gpu.cluster_dim_blocks`.
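For illustration, a minimal hand-written sketch of the new Ops (not taken from the test file), mirroring the existing cluster Ops:
```
// Number of blocks per cluster along x.
%cdimb_x = gpu.cluster_dim_blocks x
// Id of this block within its cluster along x.
%cbid_x = gpu.cluster_block_id x
```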
Co-authored-by: pradeepku <pradeepku@nvidia.com>
Co-authored-by: Guray Ozen <guray.ozen@gmail.com>
This PR adds a `TmaDescriptorBuilder` class that simplifies TMA generation:
- Makes the code ready to support various TMA configurations.
- Removes hard-coded strings and uses the enums from `mlir.nvgpu` instead,
e.g. mapping "swizzle = swizzle_128b, l2promo=none, oob=zero,
interleave=none" to enums in the `mlir.nvgpu` dialect.
- The enums have string equivalents that are used during IR writing and
generation (see `TmaDescriptorBuilder::tensormap_descriptor_ty`); an
illustrative descriptor type is sketched below.
- Improves readability and abstracts TMA descriptor building into a
reusable component.
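As a rough illustration (the element type, shape, and memory space are assumptions, not taken from the PR), the string configuration above maps to a tensor map descriptor type along these lines:
```
// Swizzle/L2-promotion/OOB/interleave settings carried as enum values in the
// descriptor type instead of free-form strings.
!tma_desc = !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>,
    swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
```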
---------
Co-authored-by: Manish Gupta <manigupta@google.com>
Currently, the `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` has
`index` type. This can cause problems if any value other than
0 or 1 is passed, because the PTX instruction only accepts an even or odd phase.
This PR makes the `phaseParity` argument `i1` to avoid misuse.
Here is the relevant information from the PTX documentation:
```
The .parity variant of the instructions test for the completion of the phase indicated
by the operand phaseParity, which is the integer parity of either the current phase or
the immediately preceding phase of the mbarrier object. An even phase has integer
parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity
operand are 0 and 1.
```
See for more information:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait
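For illustration, a minimal hand-written sketch of the op with an `i1` parity (not taken from the patch; the barrier and tick values are assumed to be defined earlier):
```
// false selects the even phase, true the odd phase.
%parity = arith.constant false
nvgpu.mbarrier.try_wait.parity %barrier[%c0], %parity, %ticks
    : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
```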
This PR improves the documentation for the `gpu-lower-to-nvvm-pipeline`
(the remaining item from #75775):
- Renames the pipeline `gpu-lower-to-nvvm` -> `gpu-lower-to-nvvm-pipeline`.
- Adds a section to the GPU Dialect page on the website, clarifying the
pipeline's functionality in lowering primary dialects to NVVM targets.
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it is used across our CUDA
integration tests.
This PR renames the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it into `InitAllPasses.h`. The aim is to make it callable from
Python and to standardize the compilation process for NVVM.
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing Cooperative Thread Arrays (CTAs)
to be clustered so that they can synchronize and communicate through shared memory
while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so the CGA
functionality is added as optional parameters. This lets us leverage the
mechanisms already in the GPU dialect, such as outlining and kernel launching,
making it a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces cluster-specific index and dimension Ops,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in the `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
PR #69913 added a GEMM test (128x128x128 F32 += F16 * F16) with an
if-statement. This PR adds the same test using predicates in PTX.
Predicate support is enabled via the _BasicPtxBuilderInterface_
(`nvgpu.opcode ..., predicate = %pred`).
The predicate condition is computed in `Step 2. [GPU] Elect fastest
thread in CTA`, inspired by cutlass. It is as follows:
```
lane_predicate = nvvm.elect.sync
warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0)
warp_idx_in_warp_group = warp_idx % 4
predicate = (lane_predicate & warp_idx_in_warp_group)
```
Depends on #70027, #69934, #69935, #69584
#70923 improved the verifier. The verifier caught that the tensor map type in the TMA descriptor in this test is incorrect. The program was nevertheless working correctly, since the offset is calculated correctly.
This work fixes the test.
This commit removes the last remnants of `use-opaque-pointers` from the
mlir tests. Two of the tests seem to be disabled, while the CUDA one is
an integration test that didn't trigger a buildbot failure.
The test was meant to check `64x128xf16`, as the contiguous dimension
exceeds the cache line (128b). TMA requires cache-line-aligned loads, so
loading 64x128 can be done with two 64x64 loads, as documented in the
test.
However, there was a typo in the type, which was `memref<128x64xf16>`
instead of the correct `memref<64x128xf16>`. This PR corrects the issue
and updates the verification.
#69934 broke integration tests that rely on the
kernel-bare-ptr-calling-convention and host-bare-ptr-calling-convention
flags. This PR brings these flags back.
The kernel-index-bitwidth flag is also removed, since the kernel pointer
size depends on the host; separating the host (64-bit) from the kernel
(32-bit) is not viable.
Update most tests to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical: rewriting an unnamed sequence into
a named one and wrapping the transform IR into a module when necessary;
see the sketch below.
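As a rough sketch of that rewrite (the entry-point name is the interpreter's default; the body is illustrative):
```
// Before: an unnamed top-level sequence found via PossibleTopLevelOpTrait.
//   transform.sequence failures(propagate) {
//   ^bb0(%root: !transform.any_op):
//     ...
//   }
// After: a named entry point wrapped in a transform module.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    transform.yield
  }
}
```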
Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.
Only the tests in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of testing coverage of the transform interpreter logic.
This PR enables the `test-lower-to-nvvm` pass pipeline for the integration
tests targeting the NVIDIA sm_90 architecture.
It adjusts the `test-lower-to-nvvm` pipeline in two ways:
1) It calls `createConvertNVGPUToNVVMPass` before the outlining process.
This particular pass is responsible for generating both device and host
code. On the host, it calls the CUDA driver to build the TMA descriptor
(`cuTensorMap`).
2) It integrates `createConvertNVVMToLLVMPass` to generate PTX for the
NVVM Ops.
MLIR has begun supporting many features of Nvidia's sm_90 architecture,
and new tests have been added for it. Although the tests worked well,
there were redundancies in the pipeline. This PR cleans up unnecessary
passes.
#65953 added a `128x64xf16` test that does a single TMA load. This
PR adds a more complex test that does two additional TMA loads with 128B
swizzling:
```
TMA Load: Matrix-A[0:128][0:64]
TMA Load: Matrix-B[0:64][0:64]
TMA Load: Matrix-B[64:128][0:64]
```
The program tests the loaded data for Matrix-B.
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to carry any possible
extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
The 'TargetAttr' workflow was recently introduced for
'MLIR->LLVM->PTX' serialization. #65857 removes the previous passes
(the gpu::Serialization* passes) because they are duplicates.
This PR removes the use of the gpu::Serialization* passes in the SM_90
integration tests and enables the 'TargetAttr' workflow.
It also moves the transform-dialect-specific test to a new folder.
The revert happened due to a buildbot failure that threw 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'.
The failure's root cause was a pass using "+ptx76" for compilation combined with an old CUDA driver
on the bot. This commit relands the patch with "+ptx60".
Original Gh PR: #65768
Original commit message:
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.
The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
TMA was introduced to MLIR; however, it needed the `ptxas` compiler. Recent work in D154117 introduced that.
This work runs the existing integration test.
Reviewed By: fmorac
Differential Revision: https://reviews.llvm.org/D159347
Reland of the original patch after updating the Python binding tests,
a few CUDA/GPU MLIR tests, and ensuring the assembly format is
round-trippable.
This patch splits the lowering of vector.print into two stages: first
an n-D print is converted into a loop of scalar prints of the elements,
then a second pass converts those scalar prints into the runtime calls.
The former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example, `vector.print punctuation <comma>` lowers to
`llvm.call @printComma() : () -> ()`.
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519
Reland of the original patch after updating the Python binding tests and
a few CUDA/GPU MLIR tests.
This patch splits the lowering of vector.print into two stages: first
an n-D print is converted into a loop of scalar prints of the elements,
then a second pass converts those scalar prints into the runtime calls.
The former is done in VectorToSCF and the latter in VectorToLLVM.
The main reason for this is to allow printing scalable vector types,
which are not possible to fully unroll at compile time, though this
also avoids fully unrolling very large vectors.
To allow VectorToSCF to add the necessary punctuation between vectors
and elements, a "punctuation" attribute has been added to vector.print.
This abstracts calling the runtime functions such as printNewline(),
without leaking the LLVM details into the higher abstraction levels.
For example, `vector.print <comma>` lowers to
`llvm.call @printComma() : () -> ()`.
The output format and runtime functions remain the same, which avoids
the need to alter a large number of tests (aside from the pipelines).
Reviewed By: awarzynski, c-rhodes, aartbik
Differential Revision: https://reviews.llvm.org/D156519
This revision adds support for directly lowering a `linalg.copy` on buffers between global and shared memory to a TMA async load plus synchronization operations.
It uses the recently introduced Hopper NVVM and NVGPU abstractions to connect things end to end.
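As a sketch of the kind of copy targeted (the shapes and address spaces are illustrative assumptions, not taken from the revision):
```
// Copy from a global-memory buffer into a shared-memory (workgroup) buffer;
// this pattern lowers to a TMA async load plus synchronization operations.
linalg.copy ins(%global : memref<128x64xf16>)
            outs(%shared : memref<128x64xf16, #gpu.address_space<workgroup>>)
```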
Differential Revision: https://reviews.llvm.org/D157087