Load/unload GPU modules in global ctors/dtors instead of each time when launching a kernel. Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect. The implementations of `embedBinary()` and `launchKernel()` use slightly different mechanics at the moment but I prefer to not change the latter more than necessary as part of this PR. I will prepare a follow-up NFC for `launchKernel()` to align them again.
45 KiB
45 KiB