This commit introduces a shared library for the MLIR execution engine.
This library is only built when `LLVM_BUILD_LLVM_DYLIB` is set. Having
such a library allows downstream users to depend on the execution engine
without giving up dynamic linkage. This is especially important for CPU
runner-style tools, as they link against large parts of MLIR and LLVM.
Alternatively, it would be possible to modify the `MLIRExecutionEngine`
target itself when `LLVM_BUILD_LLVM_DYLIB` is set, to avoid duplicate
libraries.
Note that even though the sparse runtime support library always uses
SoA storage for COO (and provides correct codegen by means of views
into this storage), in some rare cases we need the true physical AoS
storage as a coordinate buffer. This PR provides that functionality by
means of a (costly) coordinate buffer call.
Since this is currently only used for testing/debugging by means of the
sparse_tensor.print method, this solution is acceptable. If we ever want
a performant version of this, we should truly support AoS storage of
COO in addition to the SoA used right now.
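For illustration, here is a minimal sketch (not the library's actual
API; names are hypothetical) of what such a coordinate buffer call has
to do: interleave the per-level SoA coordinate arrays into one AoS
buffer, which is why the call is costly.
```
#include <cstdint>
#include <vector>

// Hypothetical sketch: interleave per-level SoA coordinate vectors into
// a single AoS coordinate buffer, where the coordinates of each stored
// element are laid out consecutively. The full copy is what makes the
// call costly: O(nnz * rank) work plus a freshly allocated buffer.
std::vector<uint64_t>
buildAoSCoordinateBuffer(const std::vector<std::vector<uint64_t>> &soa) {
  const size_t rank = soa.size();
  const size_t nnz = rank == 0 ? 0 : soa[0].size();
  std::vector<uint64_t> aos;
  aos.reserve(nnz * rank);
  for (size_t i = 0; i < nnz; ++i)    // for each stored element
    for (size_t d = 0; d < rank; ++d) // emit all of its coordinates
      aos.push_back(soa[d][i]);
  return aos;
}
```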
The platform running on Apple Silicon does not seem to support negative
NaN. This causes a test failure where we explicitly specify the
negative NaN bit pattern and check the output printed by the
CRunnerUtils function.
We can make the print function in the utility platform-agnostic by
using standard library functions (i.e. `std::isnan` and
`std::signbit`), so that the test can run across platforms that do not
support the negative NaN bit pattern.
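A minimal sketch of the idea (the helper name here is illustrative, not
the exact CRunnerUtils signature): classify the value with
`std::isnan`/`std::signbit` instead of relying on how the platform
renders the NaN bit pattern.
```
#include <cmath>
#include <cstdio>

// Illustrative helper: print NaNs in a platform-agnostic way.
// std::signbit inspects the sign bit directly, so a negative NaN is
// reported consistently even on platforms whose printf would not emit
// the '-' for it.
void printFloat(float f) {
  if (std::isnan(f)) {
    std::fputs(std::signbit(f) ? "-nan" : "nan", stdout);
    return;
  }
  std::printf("%g", f);
}
```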
I have added two test cases that would fail on the Apple Silicon
platform without the print function changes.
```
$ uname -a
Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000 arm64
```
See:
https://discourse.llvm.org/t/test-failure-of-sparse-sign-test-in-apple-silicon/77876/3
The base class llvm::ThreadPoolInterface will be renamed
llvm::ThreadPool in a subsequent commit.
This is a breaking change: clients who used to create a ThreadPool must
now create a DefaultThreadPool instead.
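A sketch of the required client-side change (assuming the post-rename
API; the exact includes may differ):
```
#include "llvm/Support/ThreadPool.h"

void example() {
  // Before this change, clients constructed the concrete pool directly:
  //   llvm::ThreadPool Pool;
  // Now the concrete pool is DefaultThreadPool, while the abstract base
  // (ThreadPoolInterface, soon to be renamed ThreadPool) remains the
  // type used by APIs that merely consume a pool.
  llvm::DefaultThreadPool Pool;
  Pool.async([] { /* work */ });
  Pool.wait();
}
```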
This adds a new `mlir_arm_runner_utils` library that contains utils
specific to Arm/AArch64. This is for use in MLIR integration tests.
This initial patch adds `setArmVLBits()` and `setArmSVLBits()`. This
allows changing the vector length or streaming vector length at runtime
(or setting it to a known minimum, i.e. 128 bits).
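A sketch of how such utilities can be implemented on Linux
(illustrative; `prctl` takes the vector length in bytes, and
`PR_SME_SET_VL` requires recent kernel headers):
```
#include <cstdint>
#include <sys/prctl.h>

// Sketch: set the SVE vector length (VL) at runtime. prctl expects the
// length in bytes, so a 128-bit minimum VL is passed as 16.
extern "C" void setArmVLBits(uint32_t bits) {
  prctl(PR_SVE_SET_VL, static_cast<unsigned long>(bits / 8));
}

// Sketch: set the SME streaming vector length (SVL) analogously.
extern "C" void setArmSVLBits(uint32_t bits) {
  prctl(PR_SME_SET_VL, static_cast<unsigned long>(bits / 8));
}
```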
This is causing test failures on AArch64 Linux, hitting the following
assert:
```
# | mlir-cpu-runner: /home/culrho01/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:519: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const SectionEntry &, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.
```
Seeing the same in buildbot as well, e.g.
https://lab.llvm.org/buildbot/#/builders/179/builds/9094/steps/12/logs/FAIL__MLIR__sparse_codegen_dim_mlir
Reverts llvm/llvm-project#78070
1. Fix a bug in verifyMemref to pass in `data` instead of `baseptr`,
   which meant the data was not verified correctly.
2. Add `==` for f16 and bf16 (see the sketch after this list).
3. Add a comprehensive test of verifyMemref for all supported types.
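For point 2, a sketch of bit-pattern equality for the runner's 16-bit
float wrappers (the wrapper types here are illustrative stand-ins for
the ones in Float16bits.h):
```
#include <cstdint>

// Illustrative 16-bit float wrappers that store the raw bit pattern,
// similar in spirit to the runner utils' f16/bf16 types.
struct f16 { uint16_t bits; };
struct bf16 { uint16_t bits; };

// Comparing bit patterns is sufficient for verifyMemref-style checks,
// which compare an actual buffer element-wise against an expected one.
inline bool operator==(f16 a, f16 b) { return a.bits == b.bits; }
inline bool operator==(bf16 a, bf16 b) { return a.bits == b.bits; }
```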
This removes the temporary DENSE24 attribute and replaces it with proper
recognition of dense to 2:4 conversion. The compression will be
performed on the device prior to performing the matrix multiplication.
Note that we no longer need to start with the linalg version; we can
lift this to the proper named linalg op. Also renames some files to
more consistent names.
There were two issues with the previous computation:
* it never looked at dimensions past the second one
* the definition was recursive, making each dimension have an extra
`elementSize` power
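A sketch of the corrected computation (illustrative, not the exact
patch): walk all dimensions and apply `elementSize` exactly once.
```
#include <cstdint>

// The old computation (illustrative of its shape) stopped after the
// second dimension and folded elementSize into a recursive definition,
// multiplying it in once per dimension. The fix is to take the product
// of *all* dimension sizes and multiply by elementSize a single time.
int64_t memrefSizeInBytes(int64_t elementSize, int64_t rank,
                          const int64_t *sizes) {
  int64_t numElements = 1;
  for (int64_t d = 0; d < rank; ++d) // every dimension, not just two
    numElements *= sizes[d];
  return numElements * elementSize;  // elementSize applied exactly once
}
```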
The previous code was technically incorrect in that the type indicated
that the memref only has 1 dimension, while the code below was happily
dereferencing the size array out of bounds. Now, if the compiler doesn't
get too smart about optimizations, this code *might even work*. But if
the compiler realizes that the array has 1 element, it might start doing
silly things. This change generates a specialization for each supported
rank, making sure we don't invoke any UB.
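A sketch of the per-rank specialization approach (patterned after the
runner utils' `StridedMemRefType<T, N>`; the names below are
illustrative):
```
#include <cstdint>

// With the rank as a template parameter, the sizes/strides arrays have
// their true extent, so there is no out-of-bounds indexing for the
// optimizer to treat as UB.
template <typename T, int Rank>
struct RankedMemRef {
  T *allocated;
  T *aligned;
  int64_t offset;
  int64_t sizes[Rank];
  int64_t strides[Rank];
};

// One instantiation per supported rank; a runtime switch on the rank
// dispatches to the right one.
template <typename T>
int64_t numElements(const RankedMemRef<T, 2> &m) {
  return m.sizes[0] * m.sizes[1]; // in-bounds by construction
}
```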
This fixes a few issues present in the current version (a sketch of a
fixed macro follows the list):
1) The macro doesn't enforce default visibility on exported
   functions, causing compilation to fail when using
   `-fvisibility=hidden`
2) Not all functions are exported
3) Sometimes the macro ended up weirdly interleaved with `extern "C"`
declarations
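A sketch of an export macro addressing these points (illustrative; the
actual macro names in the runner utils differ):
```
// Illustrative export macro: on Windows, use dllexport/dllimport so an
// import library gets generated; elsewhere, force default visibility so
// the functions remain visible under -fvisibility=hidden.
#if defined(_WIN32)
  #ifdef BUILDING_RUNNER_UTILS
    #define RUNNER_UTILS_EXPORT __declspec(dllexport)
  #else
    #define RUNNER_UTILS_EXPORT __declspec(dllimport)
  #endif
#else
  #define RUNNER_UTILS_EXPORT __attribute__((visibility("default")))
#endif

// Attach the macro to each declaration inside the extern "C" block,
// rather than interleaving it with the block itself.
extern "C" {
RUNNER_UTILS_EXPORT void printNewline();
}
```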
The "Dim" prefix is a legacy left-over that no longer makes sense, since
we have a very strict "Dimension" vs. "Level" definition for sparse
tensor types and their storage.
NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clustering of Cooperative
Thread Arrays (CTA) to synchronize and communicate through shared memory
while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so we've added the CGA
functionality as optional parameters. We want to leverage the mechanisms
that we already have in the GPU dialect, such as outlining and kernel
launching, making this a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces cluster-specific index and dimension Ops, binding
them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
Using the MLIR CMake package in a downstream project fails with this
error:
```
CMake Error at D:/projs/llvm/llvm-install/lib/cmake/mlir/MLIRTargets.cmake:2537 (message):
The imported target "mlir_arm_sme_abi_stubs" references the file
"D:/projs/llvm/llvm-install/lib/mlir_arm_sme_abi_stubs.lib"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"D:/projs/llvm/llvm-install/lib/cmake/mlir/MLIRTargets.cmake"
but not all the files it references.
Call Stack (most recent call first):
D:/projs/llvm/llvm-install/lib/cmake/mlir/MLIRConfig.cmake:37 (include)
mlir/CMakeLists.txt:5 (find_package)
```
Windows CMake needs export libraries, but it seems they are only
generated if there is at least one exported symbol.
Add export attributes to the symbols.
Not sure what the best approach to fixing this is (probably we should
just disable this lib on Windows entirely), but this fixed things for me
locally.
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).
Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.
Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.
A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
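For reference, such a stub can be tiny; a sketch (routine names from the
AArch64 SME ABI, stub bodies and signatures here are illustrative):
```
#include <cstdint>

// Illustrative no-op stubs for the SME ABI support routines. Real
// implementations manage the TPIDR2 lazy-save scheme for ZA; for tests
// with no nested ZA-enabled calls, no-ops suffice.
extern "C" {
void __arm_tpidr2_save() {}
void __arm_tpidr2_restore(void *blk) { (void)blk; }
void __arm_za_disable() {}
}
```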
This PR guards the driver call with an if-statement, since the driver
calls are comparatively expensive.
As a future TODO, the if-statement could be generated by the compiler
and thus optimized away in some cases.
Printing strings within integration tests is currently quite annoyingly
verbose, and can't be tucked into shared helpers as the types depend on
the length of the string:
```
llvm.mlir.global internal constant @hello_world("Hello, World!\00")
func.func @entry() {
%0 = llvm.mlir.addressof @hello_world : !llvm.ptr<array<14 x i8>>
%1 = llvm.mlir.constant(0 : index) : i64
%2 = llvm.getelementptr %0[%1, %1]
: (!llvm.ptr<array<14 x i8>>, i64, i64) -> !llvm.ptr<i8>
llvm.call @printCString(%2) : (!llvm.ptr<i8>) -> ()
return
}
```
So this patch adds a simple extension to `vector.print` to simplify
this:
```
func.func @entry() {
// Print a vector of characters ;)
vector.print str "Hello, World!"
return
}
```
Most of the logic for this is now shared with `cf.assert` which already
does something similar.
Depends on #68694
This last revision completes the migration to non-permutation support in
the SparseTensor library. All mappings are now controlled by the MapRef
(forward and backward). Unused code has been removed, which simplifies
subsequent testing of block sparsity.
This cleans up all external entry points that will have to deal with
non-permutations, making any subsequent refactoring much more local to
the lib files.
Making the materialize-from-reader method part of the Swiss army knife
suite again removes a lot of redundant boilerplate code and unifies the
parameter setup into a single centralized utility. Furthermore, we have
now minimized the number of entry points into the library that need a
non-permutation map setup, simplifying what comes next.