Commit Graph

764 Commits

Author SHA1 Message Date
Ryan Holt
847a6f8f0a [mlir][MemRef] Add runtime bounds checking (#75817)
This change adds (runtime) bounds checks for `memref` ops using the
existing `RuntimeVerifiableOpInterface`. For `memref.load` and
`memref.store`, we check that the indices are in-bounds of the memref's
index space. For `memref.reinterpret_cast` and `memref.subview` we check
that the resulting address space is in-bounds of the input memref's
address space.
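
A minimal sketch (not taken from the patch) of the kind of guard such runtime
verification could materialize in front of a `memref.load`; the exact ops,
bound computation, and message are assumptions:

```mlir
func.func @checked_load(%m: memref<4xf32>, %i: index) -> f32 {
  // Assumed expansion: verify 0 <= %i < dim before the actual load.
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %ge = arith.cmpi sge, %i, %c0 : index
  %lt = arith.cmpi slt, %i, %c4 : index
  %ok = arith.andi %ge, %lt : i1
  cf.assert %ok, "memref.load index out of bounds"
  %v = memref.load %m[%i] : memref<4xf32>
  return %v : f32
}
```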
2023-12-22 11:49:15 +09:00
Jakub Kuderski
560564f51c [mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.

Issue: https://github.com/llvm/llvm-project/issues/72354
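
A small sketch of the renamed kinds in use; the function and types are
illustrative only:

```mlir
func.func @reduce(%v: vector<4xf32>) -> (f32, f32) {
  // Was <maxf>; the new kind names map 1:1 to arith ops.
  %a = vector.reduction <maxnumf>, %v : vector<4xf32> into f32   // arith.maxnumf semantics
  %b = vector.reduction <maximumf>, %v : vector<4xf32> into f32  // arith.maximumf semantics
  return %a, %b : f32, f32
}
```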
2023-12-20 00:14:43 -05:00
Guray Ozen
5caae72d1a [mlir][gpu] Productize test-lower-to-nvvm as gpu-lower-to-nvvm (#75775)
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.

This PR updates the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it within `InitAllPasses.h`. The aim is to call it from
Python and to have a standardized compilation process for nvvm.
2023-12-19 08:40:46 +01:00
Yinying Li
7bc6c4abe8 [mlir][print]Add functions for printing memref f16/bf16/i16 (#75094)
1. Added functions for printMemrefI16/f16/bf16.
2. Added a new integration test for all the printMemref functions.
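
A hedged sketch of how one of the new helpers might be declared and called
from a test; the exact symbol name and signature are assumed from the
description above:

```mlir
// Declaration of the runner-utils helper (assumed spelling).
func.func private @printMemrefF16(memref<*xf16>)

func.func @dump(%m: memref<2x2xf16>) {
  // Erase the static shape so the generic print helper can be used.
  %u = memref.cast %m : memref<2x2xf16> to memref<*xf16>
  call @printMemrefF16(%u) : (memref<*xf16>) -> ()
  return
}
```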
2023-12-14 13:06:25 -05:00
Benjamin Maxwell
9505cf457f [mlir][ArmSME][test] Use only-if-required-by-ops rather than enable_arm_streaming_ignore (NFC) (#75209)
This moves the fix out of the IR and into the pass description, which
seems nicer. It also works as an integration test for the
`only-if-required-by-ops` flag :)
2023-12-13 10:29:28 +00:00
Matthias Springer
95d6aa21fb [mlir][SparseTensor][NFC] Use tensor.empty for dense tensors (#74804)
Use `tensor.empty` + initialization for dense tensors instead of
`bufferization.alloc_tensor`.
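
A minimal sketch of the replacement pattern, with illustrative shapes:

```mlir
func.func @init_dense() -> tensor<8x8xf64> {
  // Was: %t = bufferization.alloc_tensor() : tensor<8x8xf64>
  %zero = arith.constant 0.0 : f64
  %empty = tensor.empty() : tensor<8x8xf64>
  %init = linalg.fill ins(%zero : f64) outs(%empty : tensor<8x8xf64>) -> tensor<8x8xf64>
  return %init : tensor<8x8xf64>
}
```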
2023-12-12 08:56:47 +09:00
Aart Bik
21213f39e2 [mlir][sparse] fix uninitialized dense tensor out in conv2d test (#74884)
Note that tensor.empty may feed into a SPARSE output (meaning it truly has no
values yet), but for a DENSE output, it should always have an initial
value. We ran a verifier over all our tests and this is the only
remaining omission.
2023-12-08 12:44:57 -08:00
Aart Bik
ec9e49796d [mlir][sparse] add sparse convolution with 5x5 kernel (#74793)
Also unifies some of the test setup in other conv tests.
2023-12-07 18:11:04 -08:00
Aart Bik
7003e255d3 [mlir][sparse] code formatting (NFC) (#74779) 2023-12-07 15:46:24 -08:00
Peiming Liu
78e2b74f96 [mlir][sparse] fix bugs when generate sparse conv_3d kernels. (#74561) 2023-12-06 15:59:10 -08:00
Sang Ik Lee
7fc792cba7 [MLIR] Enable GPU Dialect to SYCL runtime integration (#71430)
GPU Dialect lowering to SYCL runtime is driven by spirv.target_env
attached to gpu.module. As a result of this, spirv.target_env remains as
an input to LLVMIR Translation.
A SPIRVToLLVMIRTranslation without any actual translation is added to
avoid an unregistered error in mlir-cpu-runner.
SelectObjectAttr.cpp is updated to
1) Pass binary size argument to getModuleLoadFn
2) Pass parameter count to getKernelLaunchFn
This change does not impact CUDA and ROCM usage since both
mlir_cuda_runtime and mlir_rocm_runtime are already updated to accept
and ignore the extra arguments.
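
A hedged sketch of a gpu.module carrying the spirv.target_env attribute that
drives this lowering path; the attribute values and the kernel are
illustrative only:

```mlir
gpu.module @kernels attributes {
  spirv.target_env = #spirv.target_env<
    #spirv.vce<v1.0, [Kernel, Addresses], []>, #spirv.resource_limits<>>
} {
  // Placeholder kernel; the real tests launch actual workloads.
  gpu.func @noop() kernel {
    gpu.return
  }
}
```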
2023-12-05 16:55:24 -05:00
Peiming Liu
8206b75a1e [mlir][sparse] fix crash when generate rotated convolution kernels. (#74146) 2023-12-01 14:13:57 -08:00
Andrzej Warzyński
bc802407d1 [mlir][sve][nfc] Merge the integration tests for linalg.matmul (#74059)
At the moment the logic to tile and vectorize `linalg.matmul` is
duplicated in multiple test files:
  * matmul.mlir
  * matmul_mixed_ty.mlir

Instead, this patch uses `transform.foreach` to apply the same sequence
to multiple functions within the same test file (e.g. `matmul_f32` and
`matmul_mixed_ty` as defined in the original files). This allows us to
merge relevant test files.
2023-12-01 17:39:48 +00:00
Spenser Bauman
0d87e25779 [mlir][tosa] Improve lowering to tosa.fully_connected (#73049)
The current lowering of tosa.fully_connected produces a linalg.matmul
followed by a linalg.generic to add the bias. The IR looks like the
following:

    %init = tensor.empty()
    %zero = linalg.fill ins(0 : f32) outs(%init)
    %prod = linalg.matmul ins(%A, %B) outs(%zero)

    // Add the bias
    %initB = tensor.empty()
    %result = linalg.generic ins(%prod, %bias) outs(%initB) {
       // add bias and product
    }

This has two downsides:

1. The tensor.empty operations typically result in additional
allocations after bufferization
2. There is a redundant traversal of the data to add the bias to the
matrix product.

This extra work can be avoided by leveraging the out-param of
linalg.matmul. The new IR sequence is:

    %init = tensor.empty()
    %broadcast = linalg.broadcast ins(%bias) outs(%init)
    %prod = linalg.matmul ins(%A, %B) outs(%broadcast)

In my experiments, this eliminates one loop and one allocation (post
bufferization) from the generated code.
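
A hedged, self-contained sketch of the new sequence with concrete
(illustrative) shapes and names:

```mlir
func.func @fully_connected(%A: tensor<4x8xf32>, %B: tensor<8x16xf32>,
                           %bias: tensor<16xf32>) -> tensor<4x16xf32> {
  %init = tensor.empty() : tensor<4x16xf32>
  // Broadcast the bias into the accumulator instead of adding it afterwards.
  %bcast = linalg.broadcast ins(%bias : tensor<16xf32>)
                            outs(%init : tensor<4x16xf32>) dimensions = [0]
  %prod = linalg.matmul ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
                        outs(%bcast : tensor<4x16xf32>) -> tensor<4x16xf32>
  return %prod : tensor<4x16xf32>
}
```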
2023-12-01 15:16:51 +00:00
Andrzej Warzyński
f42ce1621f [mlir][sve][nfc] Update a test to use transform-interpreter (#73771)
This is a follow-up of #70040 in which the test updated here was missed.

Includes a few additional NFC changes in preparation for extending this
test.
2023-12-01 10:08:00 +00:00
Benjamin Maxwell
eaff02f28e [mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g. `scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so it is required to be
a compile-time constant); `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
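
A minimal sketch (shapes illustrative) of the now-supported SSA form for
control flow, with an `scf.for` loop that takes and yields a tile:

```mlir
func.func @loop_over_tile(%n: index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  // Allocating op: becomes a root for the tile allocation pass.
  %init = arm_sme.get_tile : vector<[4]x[4]xi32>
  %res = scf.for %i = %c0 to %n step %c1
      iter_args(%tile = %init) -> (vector<[4]x[4]xi32>) {
    // Ops updating the tile would go here; all transitive uses of the root
    // currently receive the same tile ID.
    scf.yield %tile : vector<[4]x[4]xi32>
  }
  return
}
```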
2023-11-30 10:22:22 +00:00
Andrzej Warzyński
4b2ba5a61a [mlir][sve] Add an e2e for linalg.matmul with mixed types (#73773)
Apart from the test itself, this patch also updates a few patterns to
fix how new VectorType(s) are created. Namely, it makes sure that
"scalability" is correctly propagated.

Regression tests will be updated separately while auditing Vector
dialect tests in the context of scalable vectors:
  * https://github.com/orgs/llvm/projects/23
2023-11-29 21:21:10 +00:00
Aart Bik
1944c4f76b [mlir][sparse] rename DimLevelType to LevelType (#73561)
The "Dim" prefix is a legacy left-over that no longer makes sense, since
we have a very strict "Dimension" vs. "Level" definition for sparse
tensor types and their storage.
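
For illustration, a hedged sketch of an encoding where (semantic) dimensions
map onto (storage) levels, which is what the renamed LevelType describes; the
alias and shapes are assumptions:

```mlir
#CSR = #sparse_tensor.encoding<{
  // Two dimensions (d0, d1) mapped to two levels: dense, then compressed.
  map = (d0, d1) -> (d0 : dense, d1 : compressed)
}>

func.func @pass_through(%t: tensor<8x8xf64, #CSR>) -> tensor<8x8xf64, #CSR> {
  return %t : tensor<8x8xf64, #CSR>
}
```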
2023-11-27 14:27:52 -08:00
Jakub Kuderski
7eccd52842 Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit dd09221a29 and relands
https://github.com/llvm/llvm-project/pull/73423.

* Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.
2023-11-27 11:38:18 -05:00
Guray Ozen
edf5cae739 [mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA),
a new level of parallelism that allows clusters of Cooperative Thread Arrays
(CTAs) to synchronize and communicate through shared memory while running
concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms that we
already have in the GPU dialect, such as outlining and kernel launching, which
makes this a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Cullen Rhodes
fae3964cbc [mlir][linalg] Add an e2e test for linalg.matmul to ArmSME (#72144)
This patch adds an integration test lowering a linalg.matmul to SME via
vector.outerproduct.

It's similar to the linalg.matmul_transpose_a e2e test added recently, and
relies on TransferReadDropUnitDimsPattern as well as vector transpose
canonicalizations to lower the following sequence (taken from the inner loop):
```
  %subview = memref.subview %arg0[%arg3, %arg5] [%2, 1] [1, 1] :
    memref<?x?xf32, strided<[?, ?], offset: ?>> to memref<?x1xf32, strided<[?, ?], offset: ?>>
  %mask = vector.create_mask %2, %c1 : vector<[4]x1xi1>
  %0 = vector.transfer_read %subview[%c0, %c0], %pad, %mask {in_bounds = [true, true]} :
    memref<?x1xf32, strided<[?, ?], offset: ?>>, vector<[4]x1xf32>
  %1 = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
  %2 = vector.extract %1[0] : vector<[4]xf32> from vector<1x[4]xf32>
```
Rank-2 vectors with a leading scalable dim can't be type-converted to an
array. TransferReadDropUnitDimsPattern drops the unit dim on the
vector.transfer_read so it can be lowered via the generic path (to SVE).
The transpose canonicalizations lower the transpose to a shape_cast
which folds away.
2023-11-23 08:53:43 +00:00
Benjamin Maxwell
dff97c1e4c [mlir][ArmSME] Move ArmSME -> intrinsics lowerings to convert-arm-sme-to-llvm pass (#72890)
This gives more flexibility over when these lowerings are performed,
without also lowering unrelated vector ops.

This is an NFC (other than adding a new `-convert-arm-sme-to-llvm` pass)
2023-11-22 13:36:36 +00:00
Aart Bik
c97e4273e2 [mlir][sparse] test on read/convert permuted 3d sparse tensors (#72925)
3! = 6
2023-11-21 09:26:04 -08:00
Peiming Liu
b52eb7c2fe [mlir][sparse] add a csr x bsr matmul test case (#73012) 2023-11-21 09:14:45 -08:00
Aart Bik
6352a07ba6 [mlir][sparse] test four row/col major versions of BSR (#72898)
Note, this is a redo of https://github.com/llvm/llvm-project/pull/72712,
which was reverted due to timeouts on the bot. I have timed the tests
with various settings, and it does not even hit the top 20 of integration
tests. To be safe, I removed the SIMD version of the tests, keeping just
the libgen/directIR paths (which are the most important for us to test).

I will also keep an eye on
https://lab.llvm.org/buildbot/#/builders/264/builds after submitting to
make sure there is no repeat.
2023-11-20 12:28:16 -08:00
Mehdi Amini
2b71f91b06 Revert "[mlir][sparse] stress test BSR" (#72735)
Reverts llvm/llvm-project#72712

This causes timeouts on the bots.
2023-11-17 19:06:49 -08:00
Aart Bik
813aaf39f9 [mlir][sparse] stress test BSR (#72712)
I always enjoy a good stress test. This end-to-end integration test
ensures that the major ordering of both the blocks and the elements within
each block is handled correctly (giving row-row, row-col, col-row, and
col-col as options).
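
For illustration, a hedged sketch of one such blocked encoding (the row-row
variant); the alias, block size, and shapes are assumptions, and permuting the
outer (block) and inner (within-block) level pairs yields the other orderings:

```mlir
#BSR_row_row = #sparse_tensor.encoding<{
  map = (i, j) -> (i floordiv 2 : dense,
                   j floordiv 2 : compressed,
                   i mod 2     : dense,
                   j mod 2     : dense)
}>

func.func @pass_through(%a: tensor<4x4xf64, #BSR_row_row>)
    -> tensor<4x4xf64, #BSR_row_row> {
  return %a : tensor<4x4xf64, #BSR_row_row>
}
```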
2023-11-17 15:47:38 -08:00
Aart Bik
6b56dd6a93 [mlir][sparse] enable 2:4 test for both directIR/libgen path (#72593) 2023-11-17 09:40:32 -08:00
Aart Bik
83cf0dc982 [mlir][sparse] implement direct IR alloc/empty/new for non-permutations (#72585)
This change implements the correct *level* size setup for the direct
IR codegen fields in the sparse storage scheme. This brings libgen and
codegen together again.

This is step 3 out of 3 to make sparse_tensor.new work for BSR
2023-11-16 17:17:41 -08:00
Aart Bik
5535e48be2 [mlir][sparse] Capitalize class comment (#72436) 2023-11-15 13:04:27 -08:00
Aart Bik
58090617c6 [mlir][sparse] fix broken test (merge conflict marker was left) (#72438) 2023-11-15 13:01:43 -08:00
Tim Harvey
dce7a7cf69 Changed all code and comments that used the phrase "sparse compiler" to instead use "sparsifier" (#71875)
The changes in this PR mostly center around the tests that use the
flag sparse_compiler (also: sparse-compiler).
2023-11-15 20:12:35 +00:00
Aart Bik
a89c15aa2e [mlir][sparse] enable Python BSR test (#72325) 2023-11-14 15:35:03 -08:00
Aart Bik
a40900211a [mlir][sparse] set rwx permissions to consistent values (#72311)
Some files had the "x" permission set; others were missing "r".
2023-11-14 13:32:55 -08:00
Aart Bik
5f32bcfbae [mlir][sparse][gpu] re-enable all GPU libgen tests (#72185)
The previous change no longer properly used the GPU libgen pass (even though
most tests still passed by falling back to the CPU). This revision puts the
proper pass order into place, along with a bit of cleanup of the CPU codegen
vs. libgen setup.
2023-11-14 09:06:15 -08:00
Benjamin Maxwell
783ac3b6fb [mlir][ArmSME] Make use of backend function attributes for enabling ZA storage (#71044)
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).

Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.

Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.

A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
2023-11-14 12:50:38 +00:00
Peiming Liu
269685545e [mlir][sparse] remove filter-loop based algorithm support to handle affine subscript expressions (#71840)
2023-11-13 11:36:49 -08:00
Aart Bik
af8428c0d9 [mlir][sparse] unify support of (dis)assemble between direct IR/lib path (#71880)
Note that the (dis)assemble operations still make some simplifying
assumptions (e.g. trailing 2-D COO in AoS format), but now at least both
the direct IR and support library paths behave exactly the same.

Generalizing the ops is still TBD.
2023-11-13 10:05:00 -08:00
Peiming Liu
bfe08c094d [mlir][sparse] support sparsifying 2:4 block sparsity (#71749) 2023-11-10 12:25:53 -08:00
Guray Ozen
51916f0c92 [mlir] Add sm_90a GEMM test 128x128x128 (F32 += F16 * F16) (#69913)
This PR adds a test that performs GEMM 128x128x128 (F32 += F16 * F16).
It uses `sm_90a` features in NVGPU dialect.

Simplified algorithm is as follows:

**Prologue** 
```
mgroup = mbarriers.init x 2
tma.load ... shmem_buffer_lhs<0 x 128 x 64>
tma.load ... shmem_buffer_rhs<0 x 64 x 64>
tma.load ... shmem_buffer_rhs<0 x 64 x 64>
mbarrier.expect_tx 32768
tma.load ... shmem_buffer_lhs<1 x 128 x 64>
tma.load ... shmem_buffer_rhs<1 x 64 x 64>
tma.load ... shmem_buffer_rhs<1 x 64 x 64>
mbarrier.expect_tx 32768
```
**Mainloop**
```
matrixD = 
 for(i = 0;...2) {   
   mbarrier.try_wait [i]
   lhs = shmem_buffer_lhs<pipe x 128 x 64>
   rhs = shmem_buffer_rhs<pipe x 64 x 128>
   yield nvgpu.warpgroup.mma (lhs, rhs)

//   Expanded : nvgpu.warpgroup.mma [128][128]+=[128][64]*[64][128]
//                  wgmma.m64n128k16(A[0:64][0:16]  *  B[0:16][0:128])
//                  wgmma.m64n128k16(A[0:64][16:32] *  B[16:32][0:128])
//                  wgmma.m64n128k16(A[0:64][32:48] *  B[32:48][0:128])
//                  wgmma.m64n128k16(A[0:64][48:64] *  B[48:64][0:128])
//                  wgmma.m64n128k16(A[64:128][0:16]  *  B[0:16][0:128])
//                  wgmma.m64n128k16(A[64:128][16:32] *  B[16:32][0:128])
//                  wgmma.m64n128k16(A[64:128][32:48] *  B[32:48][0:128])
//                  wgmma.m64n128k16(A[64:128][48:64] *  B[48:64][0:128])
```

**Epilogue** 
```
//reg->shmem
warpgroup.mma.store matrixD, shmem
//shmem->glbmem
parallel-for(i=0;...128)
 parallel-for(j=0;...128)
   store shmem, globalmem
```
2023-11-10 16:53:43 +01:00
Guray Ozen
a00caad6bf [mlir] Add sm_90a GEMM test 128x128x128 (F32 =F16*F16) with predicate (#70028)
PR #69913 added a GEMM test (128x128x128 F32 += F16 * F16) with an
if-statement. This PR adds the same test using predicates in PTX.
Predicate support is enabled using _BasicPtxBuilderInterface_
`(nvgpu.opcode ..., predicate = %pred)`.

The predicate condition is computed in `Step 2. [GPU] Elect fastest
thread in CTA` inspired by cutlass. It is as follows:
```
lane_predicate = nvvm.elect.sync
warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0)
warp_idx_in_warp_group = warp_idx % 4
predicate = (lane_predicate & warp_idx_in_warp_group)
```

Depends on #70027 #69934 #69935 #69584
2023-11-10 16:52:00 +01:00
Guray Ozen
f4d59522cf [mlir] Fix sm90 test for new verifier
#70923 improved the verifier. The verifier caught that the tensor map type in the TMA descriptor in this test isn't correct. The program was working correctly anyway since the offset is calculated correctly.

This work fixes the test.
2023-11-10 16:50:01 +01:00
Cullen Rhodes
fe8c649d01 [mlir][linalg] Add an e2e test for linalg.matmul_transpose_a to ArmSME (#71644)
This patch adds an integration test demonstrating the first e2e example
lowering a linalg.matmul to SME via vector.outerproduct.

The test uses a 'linalg.matmul_transpose_a' rather than 'linalg.matmul'
since the latter emits a 'vector.transfer_read' with a vector type of
'vector<[4]x1xf32>' that currently can't be lowered via the generic (SVE)
path, since it has a leading scalable dim.
2023-11-10 07:52:39 +00:00
Cullen Rhodes
4240b1790f [mlir][ArmSME] Lower transfer_write + transpose to vertical store (#71181)
This patch extends the lowering of vector.transfer_write in
VectorToArmSME to support in-flight transpose via SME vertical store.
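
A hedged sketch (types illustrative) of the kind of transpose feeding a
transfer_write that can now be lowered to an SME vertical store:

```mlir
func.func @write_transposed(%vec: vector<[4]x[4]xf32>, %dest: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  // The transpose is folded into the store (vertical/column-wise form).
  %t = vector.transpose %vec, [1, 0] : vector<[4]x[4]xf32> to vector<[4]x[4]xf32>
  vector.transfer_write %t, %dest[%c0, %c0] {in_bounds = [true, true]}
    : vector<[4]x[4]xf32>, memref<?x?xf32>
  return
}
```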
2023-11-10 07:51:06 +00:00
Peiming Liu
5a6ffc5503 [mlir][sparse] temporarily disable BSR GPU libgen tests. (#71870) 2023-11-09 13:54:02 -08:00
Peiming Liu
a2d9d2e1d9 [mlir][sparse] re-enable aarch64 test. (#71855)
Should have been fixed by initializing output tensor to zeros in
https://github.com/llvm/llvm-project/pull/71845
2023-11-09 11:46:52 -08:00
Peiming Liu
30e4b09d49 [mlir][sparse] try fix flanky test. (#71845) 2023-11-09 11:10:59 -08:00
Peiming Liu
4eb01f7d5e [mlir][sparse] disable aarch64 test to fix buildbot error. (#71818)
To fix https://github.com/llvm/llvm-project/pull/71448
2023-11-09 10:50:58 -08:00
Peiming Liu
c99951d491 [mlir][sparse] end-to-end matmul between Dense and BSR tensors (#71448) 2023-11-08 11:28:00 -08:00
Aart Bik
5ef446790f [mlir][sparse][gpu] cleanup GPUDataTransferStrategy (#71615)
The flag seems to be doing practically the same thing for zero-cost and
pinned DMA. In addition, the register-host approach is not truly the right
zero-cost mechanism, according to Thomas. So we are simplifying the setup
for now, until we have a better definition of what to implement and test.
    
https://github.com/llvm/llvm-project/issues/64316
2023-11-08 09:45:11 -08:00