clang-p2996

Author	SHA1	Message	Date
Joseph Huber	a8f9f85ab0	[Libomptarget][NFC] Fix unused variable warnings Summary: This patch fixes a few warnings that would show up while building.	2024-04-10 10:01:15 -05:00
dhruvachak	cc8c6b037c	[OpenMP] [amdgpu] Added a synchronous version of data exchange. (#87032 ) Similar to H2D and D2H, use synchronous mode for large data transfers beyond a certain size for D2D as well. As with H2D and D2H, this size is controlled by an env-var.	2024-03-29 13:33:43 -07:00
Joseph Huber	4dc3225248	[Libomptarget] Replace global `PluginTy::get` interface with references (#86595 ) Summary: We have a plugin singleton that implements the Plugin interface. This then spawns separate device and kernels. Previously when these needed to reach into the global singleton they would use the `PluginTy::get` routine to get access to it. In the future we will move away from this as the lifetime of the plugin will be handled by `libomptarget` directly. This patch removes uses of this inside of the plugin implementaion themselves by simply keeping a reference to the plugin inside of the device. The external `__tgt_rtl` functions still use the global method, but will be removed later.	2024-03-26 07:13:59 -05:00
Joseph Huber	2cad43c1ba	[Libomptarget] Factor functions out of 'Plugin' interface (#86528 ) Summary: This patch factors common functions out of the `Plugin` interface prior to its removal in a future patch. This simply temporarily renames it to `PluginTy` so that we could re-use `Plugin::check` internally as this needs to be defined statically per plugin now. We can refactor this later. The future patch will delete `PluginTy` and `PluginTy::get` entirely. This simply tries to minimize a few changes to make it easier to land.	2024-03-25 15:24:39 -05:00
Joseph Huber	dcbddc2525	[Libomptarget] Unify and simplify plugin CMake (#86191 ) Summary: This patch reworks the CMake handling for building plugins. All this does is pull a lot of shared and common logic into a single helper function. This also simplifies the OMPT libraries from being built separately instead of just added.	2024-03-22 16:13:58 -05:00
Gheorghe-Teodor Bercea	c25e77436e	Revert "[libomptarget][nextgen-plugin] Use SCRELEASE/SCACQUIRE in packet headers" (#85950 ) Reverts llvm/llvm-project#85678	2024-03-20 11:40:12 -04:00
Gheorghe-Teodor Bercea	927308a52b	[libomptarget][nextgen-plugin] Use SCRELEASE/SCACQUIRE in packet headers (#85678 ) This patch updates the construction of packet headers to replace the usage of ACQUIRE/RELEASE with SCACQUIRE/SCRELEASE which is now recommended. The patch also ensures consistency across kernel dispatches.	2024-03-20 11:22:01 -04:00
Gheorghe-Teodor Bercea	5c752df1e1	[libomptarget][nextgen-plugin][NFC] Clean-up InputSignal checks (#83458 ) Clean-up InputSignal checks.	2024-03-07 12:01:42 -05:00
Ye Luo	0fa04b6e2c	[libomptarget] Fix libomptarget.rtl.amdgpu.so installation If AMD GPUs don't exist when building libomptarget, the early return causes skipping the plugin installation.	2024-03-05 21:32:48 -06:00
Joseph Huber	2fb764d2da	[libomptarget] Fix 'libomptarget' libraries being installed twice (#83624 ) Summary: We use `add_llvm_library` as a shorthand for setting up all the dependencies and libraries we need for the OpenMP offloading runtime as they depend on a lot of the LLVM utilities. However, we always explicitly installed these manually. Behind the scenes the function would then install it again. This was unnoticed because until now the destinations matched. Now that we want it to optionally go into the other directory it is duplicating them. Fix this by stating that this is a build tree only library so we can handle it ourselves.	2024-03-01 15:52:20 -06:00
Joseph Huber	621bafd5c1	[Libomptarget] Move target table handling out of the plugins (#77150 ) Summary: This patch removes the bulk of the handling of the `__tgt_offload_entries` out of the plugins itself. The reason for this is because the plugins themselves should not be handling this implementation detail of the OpenMP runtime. Instead, we expose two new plugin API functions to get the points to a device pointer for a global as well as a kernel type. This required introducing a new type to represent a binary image that has been loaded on a device. We can then use this to load the addresses as needed. The creation of the mapping table is then handled just in `libomptarget` where we simply look up each address individually. This should allow us to expose these operations more generically when we provide a separate API.	2024-01-22 11:06:47 -06:00
carlobertolli	ae99966a27	[OpenMP] Enable automatic unified shared memory on MI300A. (#77512 ) This patch enables applications that did not request OpenMP unified_shared_memory to run with the same zero-copy behavior, where mapped memory does not result in extra memory allocations and memory copies, but CPU-allocated memory is accessed from the device. The name for this behavior is "automatic zero-copy" and it relies on detecting: that the runtime is running on a MI300A, that the user did not select unified_shared_memory in their program, and that XNACK (unified memory support) is enabled in the current GPU configuration. If all these conditions are met, then automatic zero-copy is triggered. This patch also introduces an environment variable OMPX_APU_MAPS that, if set, triggers automatic zero-copy also on non APU GPUs (e.g., on discrete GPUs). This patch is still missing support for global variables, which will be provided in a subsequent patch. Co-authored-by: Thorsten Blass <thorsten.blass@amd.com>	2024-01-22 10:30:22 -06:00
Joseph Huber	89cdd48a22	[Libomptarget] Remove temporary files in AMDGPU JIT impl (#77980 ) Summary: This patch cleans up some of the JIT handling for AMDGPU as well as removing its temporary files. Previously these would be left in the temporary directory after the program was run. This costs some extra time, but the correct solution to avoid that is to create a sufficient entrypoint into `ld.lld` that we can simply pass a memory buffer into.	2024-01-15 19:03:19 -06:00
Joseph Huber	37c1a5e3f5	[Libomptarget] Fix GPU Dtors referencing possibly deallocated image (#77828 ) Summary: The constructors and destructors look up a symbol in the ELF quickly to determine if they need to be run on the GPU. This allows us to avoid the very slow actions required to do the slower lookup using the vendor API. One problem occurs with how we handle the lifetime of these images. Right now there is no invariant to specify the lifetime of the underlying binary image that is loaded. In the typical case, this comes from the binary itself in the `.llvm.offloading` section, meaning that the lifetime of the binary should match the executable itself. This would work fine, if it weren't for the fact that the plugin is loaded via `dlopen` and can have a teardown order out of sync with the main executable. This was likely what was occuring when this failed on some systems but not others. A potential solution would be to simply copy images into memory so the runtime does not rely on external references. Another would be to manually zero these out after initialization as to prevent this mistake from happening accidentally. The former has the benefit of making some checks easier, and allowing for constant initialization be done on the ELF itself (normally we can't do this because writing to a constant section, e.g. .llvm.offloading is a segfault.). The downside would be the extra time required to copy the image in bulk (Although we are likely doing this in the vendor runtimes as well). This patch went with a quick solution to simply set a boolean value at initialization time if we need to call destructors. Fixes: https://github.com/llvm/llvm-project/issues/77798	2024-01-11 15:00:53 -06:00
Joseph Huber	d03b8c3a04	[Libomptarget][NFC] Format in-line comments consistently (#77530 ) Summary: The LLVM style uses /Foo=/ when indicating the name of a constant. See https://llvm.org/docs/CodingStandards.html#comment-formatting. This is useful for consistency, as well as because `clang-format` understands this syntax and formats it more cleanly. Do a bulk update of this syntax.	2024-01-10 10:10:08 -06:00
carlobertolli	ce4144406c	Revert "[OpenMP][libomptarget] Enable automatic unified shared memory executi…" (#77371 ) Reverts llvm/llvm-project#75999 lit test is failing.	2024-01-08 14:38:29 -06:00
carlobertolli	22a73e7c46	[OpenMP][libomptarget] Enable automatic unified shared memory executi… (#75999 ) …on (zero-copy) on MI300A. This patch enables applications that did not request OpenMP unified_shared_memory to run with the same zero-copy behavior, where mapped memory does not result in extra memory allocations and memory copies, but CPU-allocated memory is accessed from the device. The name for this behavior is "automatic zero-copy" and it relies on detecting: that the runtime is running on a MI300A, that the user did not select unified_shared_memory in their program, and that XNACK (unified memory support) is enabled in the current GPU configuration. If all these conditions are met, then automatic zero-copy is triggered. This patch is still missing support for global variables, which will be provided in a subsequent patch. Co-authored-by: Thorsten Blass <thorsten.blass@amd.com>	2024-01-08 14:17:28 -06:00
Joseph Huber	e7655ad605	[Libomptarget] Remove unnecessary CMake definition of endiannness (#77205 ) Summary: This is needed for some definition in `hsa.h` that requires this to be set for some architectures when it fails at autodetection. We only really build `libomptarget` with `gcc` and `clang` which already provide their own way of detecting this. Remove the unnecessary define and move it into the source.	2024-01-08 13:23:38 -06:00
Chaitanya	1637c07925	[openmp][amdgpu] Add DynamicLdsSize to AMDGPUImplicitArgsTy (#65325 ) #65273 "hidden_dynamic_lds_size" argument will be added in the reserved section at offset 120 of the implicit argument layout Add DynamicLdsSize to AMDGPUImplicitArgsTy struct at offset 120 and fill the dynamic LDS size before kernel launch.	2024-01-06 09:34:48 +05:30
Joseph Huber	fb32977ac7	[Libomptarget] Fix RPC-based malloc on NVPTX (#72440 ) Summary: The device allocator on NVPTX architectures is enqueued to a stream that the kernel is potentially executing on. This can lead to deadlocks as the kernel will not proceed until the allocation is complete and the allocation will not proceed until the kernel is complete. CUDA 11.2 introduced async allocations that we can manually place on separate streams to combat this. This patch makes a new allocation type that's guaranteed to be non-blocking so it will actually make progress, only Nvidia needs to care about this as the others are not blocking in this way by default. I had originally tried to make the `alloc` and `free` methods take a `__tgt_async_info`. However, I observed that with the large volume of streams being created by a parallel test it quickly locked up the system as presumably too many streams were being created. This implementation not just creates a new stream and immediately destroys it. This obviously isn't very fast, but it at least gets the cases to stop deadlocking for now.	2024-01-02 16:53:53 -06:00
Joseph Huber	ba192debb4	[Libomptarget][Obvious] Fix typo in attribute lookup Summary: These are keys into the AMDGPU target metadata. One of them had a typo which prevented it from being extracted.	2023-12-20 19:03:35 -06:00
Joseph Huber	e4f4022b70	[Libomptarget][NFC] Fix linting warnings in the plugins Summary: Fix some linting warnings present in the plugins.	2023-12-20 10:07:34 -06:00
Joseph Huber	ac029e02a9	[Libomptarget] Remove __tgt_image_info and use the ELF directly (#75720 ) Summary: This patch reorganizes a lot of the code used to check for compatibility with the current environment. The main bulk of this patch involves moving from using a separate `__tgt_image_info` struct (which just contains a string for the architecture) to instead simply checking this information from the ELF directly. Checking information in the ELF is very inexpensive as creating an ELF file is simply writing a base pointer. The main desire to do this was to reorganize everything into the ELF image. We can then do the majority of these checks without first initializing the plugin. A future patch will move the first ELF checks to happen without initializing the plugin so we no longer need to initialize and plugins that don't have needed images. This patch also adds a lot more sanity checks for whether or not the ELF is actually compatible. Such as if the images have a valid ABI, 64-bit width, executable, etc.	2023-12-19 20:01:31 -06:00
Joseph Huber	6f3bd3a2f6	[Libomptarget] Add a utility function for checking existence of symbols (#74550 ) Summary: There are now a few cases that check if a symbol is present before continuing, effectively making them optional features if present in the image. This was done in at least three locations and required an ugly operation to consume the error. This patch makes a utility function to handle that instead.	2023-12-06 07:41:27 -06:00
Jon Chesterfield	f184147706	[amdgpu] Default to 1.0, instead of unspecified, for dynamic hsa (#74098 ) The plugin checks the values of HSA_AMD_INTERFACE_VERSION_* so we now set them to something safe in the header.	2023-12-01 16:37:49 +00:00
Johannes Doerfert	e2299e8d9d	[OpenMP][NFC] Move OMPT headers into OpenMP/OMPT (#73718 )	2023-11-29 08:29:41 -08:00
Johannes Doerfert	db96a9c3b7	[OpenMP][NFC] Flatten plugin-nextgen/common folder sturcture (#73725 ) For historic reasons we had it setup that there was ` plugin-nextgen/common/PluginInterface/<sources + headers>` which is not what we do anywhere else. Now it looks like the rest: ``` plugin-nextgen/common/include/<headers> plugin-nextgen/common/src/<sources> ``` As part of this, `dlwrap.h` was moved into common/include (as `DLWrap.h`) since it is exclusively used by the plugins.	2023-11-29 07:57:01 -08:00
Jan Patrick Lehr	3930a0b57a	[OpenMP][libomptarget] Use two SDMA engines (#73633 ) Limit the use to two SDMA engines which are optimized for such transfers.	2023-11-29 14:21:44 +01:00
Johannes Doerfert	7233e42dff	[OpenMP][NFC] Move Environment.h and SourceInfo.h into "Shared" folder (#73703 )	2023-11-28 15:10:06 -08:00
Johannes Doerfert	8327f4a851	[OpenMP][NFC] Move Utils.h and Debug.h into a "Shared" include folder (#73701 ) Headers used throughout the different runtimes are different from the internal headers. This is a first step to bring structure in into the include folder.	2023-11-28 13:44:57 -08:00
Johannes Doerfert	0783bf1cb3	[OpenMP][NFC] Merge MemoryManager into PluginInterface (#73678 ) Similar to #73677, there is no benefit from keeping MemoryManager seperate; it's tied into the current design. Except the move I also replaced the getenv call with our Env handling.	2023-11-28 10:17:51 -08:00
Johannes Doerfert	4667dd62ee	[OpenMP][NFC] Merge elf_common into PluginInterface (#73677 ) The overhead of a library and 4 files seems high without benefit. This simply tries to consolidate our structure.	2023-11-28 10:03:25 -08:00
Johannes Doerfert	6663df30c0	[OpenMP][NFC] Remove std::move to silence warnings	2023-11-20 17:15:33 -08:00
Joseph Huber	47a3ad5be1	[Libomptarget] Handle dynamic stack sizes for AMD COV5 (#72606 ) Summary: One of the changes in the AMD code-object version five was that kernels that use an unknown amount of private stack memory now no longer default to 16 KBs. Instead it emits a flag that indicates the runtime must provide a value. This patch checks if we must provide such a stack, and uses the existing handling of the stack environment variable to configure it.	2023-11-20 12:48:42 -06:00
Jan Patrick Lehr	5c22b907dc	Reland [OpenMP][libomptarget] Enable parallel copies via multiple SDM… (#72307 ) …A engines (#71801) This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable `LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.	2023-11-14 21:30:04 +01:00
Joseph Huber	cc9e19ee59	Revert "[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801 )" This causes the tests to fail because the bots were not updated in time. Revert until we update the bots to a valid version. This reverts commit `e876250b63`.	2023-11-14 12:34:27 -06:00
Jan Patrick Lehr	e876250b63	[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801 ) This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable `LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.	2023-11-14 19:16:39 +01:00
Joseph Huber	237adfca4e	[OpenMP] Rework handling of global ctor/dtors in OpenMP (#71739 ) Summary: This patch reworks how we handle global constructors in OpenMP. Previously, we emitted individual kernels that were all registered and called individually. In order to provide more generic support, this patch moves all handling of this to the target backend and the runtime plugin. This has the benefit of supporting the GNU extensions for constructors an destructors, removing a class of failures related to shared library destruction order, and allows targets other than OpenMP to use the same support without needing to change the frontend. This is primarily done by calling kernels that the backend emits to iterate a list of ctor / dtor functions. For x64, this is automatic and we get it for free with the standard `dlopen` handling. For AMDGPU, we emit `amdgcn.device.init` and `amdgcn.device.fini` functions which handle everything atuomatically and simply need to be called. For NVPTX, a patch https://github.com/llvm/llvm-project/pull/71549 provides the kernels to call, but the runtime needs to set up the array manually by pulling out all the known constructor / destructor functions. One concession that this patch requires is the change that for GPU targets in OpenMP offloading we will use `llvm.global_dtors` instead of using `atexit`. This is because `atexit` is a separate runtime function that does not mesh well with the handling we're trying to do here. This should be equivalent in all cases except for cases where we would need to destruct manually such as: ``` struct S { ~S() { foo(); } }; void foo() { static S s; } ``` However this is broken in many other ways on the GPU, so it is not regressing any support, simply increasing the scope of what we can handle. This changes the handling of ctors / dtors. This patch now outputs a information message regarding the deprecation if the old format is used. This will be completely removed in a later release. Depends on: https://github.com/llvm/llvm-project/pull/71549	2023-11-10 14:53:53 -06:00
Saiyedul Islam	21861991e7	[OpenMP] Cleanup and fixes for ABI agnostic DeviceRTL (#71234 ) Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses already available global variable "oclc_ABI_version" instead of "llvm.amdgcn.abi.verion". It also adds some minor fields in ImplicitArg structure.	2023-11-09 10:34:35 +05:30
Jon Chesterfield	f0e100a05a	[amdgpu][openmp] Treat missing TIMESTAMP_FREQUENCY as non-fatal (#70987 ) If you build with dynamic_hsa, the symbol is known and compilation succeeds. If you then run with a slightly older libhsa, this argument is not recognised and an error returned. I'd rather the program runs with a misleading omp wtime than refuses to run at all.	2023-11-01 22:43:34 +00:00
Jon Chesterfield	896749aa0d	[amdgpu][openmp] Avoiding writing to packet header twice (#70695 ) I think it follows from the HSA spec that a write to the first byte is deemed significant to the GPU in which case writing to the second short and reading back from it later would be safe. However, the examples for this all involve an atomic write to the first 32 bits and it seems a credible risk that the occasional CI errors abound invalid packets have as their root cause that the firmware notices the early write to packet->setup and treats that as a sign that the packet is ready to go. That was overly-paranoid, however in passing noticed the code in libc is genuinely invalid. The memset writes a zero to the header byte, changing it from type_invalid (1) to type_vendor (0), at which point the GPU is free to read the 64 byte packet and interpret it as a vendor packet, which is probably why libc CI periodically errors about invalid packets. Also a drive by change to do the atomic store on a uint32_t consistently. I'm not sure offhand what __atomic_store_n on a uint16_t* and an int resolves to, seems better to be unambiguous there.	2023-10-30 18:35:52 +00:00
Konstantinos Parasyris	d6a3d6b96d	[openmp] Fixed Support for VA for record-replay. (#70396 ) The commit was discussed in phabricator (https://reviews.llvm.org/D157186). Record replay currently fails on AMD as it conflicts with the heap memory allocator introduced in #69806. The workaround is setting `LIBOMPTARGET_HEAP_SIZE=0` during both record and replay run.	2023-10-29 12:27:19 -07:00
Johannes Doerfert	d346c82435	[OpenMP] Associate the KernelEnvironment with the GenericKernelTy (#70383 ) By associating the kernel environment with the generic kernel we can access middle-end information easily, including the launch bounds ranges that are acceptable. By constraining the number of threads accordingly, we now obey the user-provided bounds that were passed via attributes.	2023-10-29 11:35:34 -07:00
Jon Chesterfield	840d0b7e03	[amdgpu] D2D memcpy via streams and HSA (#69977 ) hsa_amd_memory_async_copy can handle device to device copies if passed the corresponding parameters. No functional change - currently D2D copy goes through a fallback in libomptarget that stages through a host malloc, after this it goes directly through HSA. Works under exactly the situations that HSA works. Verified locally on a performance benchmark. Hoping to attract further testing from internal developers after it lands.	2023-10-24 00:05:04 +01:00
Johannes Doerfert	d3921e4670	[OpenMP] Basic BumpAllocator for (AMD)GPUs (#69806 ) The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to run more tests. The allocator implements `malloc`, both internally and externally, while we continue to default to the NVIDIA `malloc` when we target NVIDIA GPUs. Once we have smarter or customizable allocators we should consider this choice, for now, this allocator is better than none. It traps if it is out of memory, making it easy to debug. Heap size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB. It allows to track allocation statistics via `LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with `-fopenmp-target-debug=8`). Two tests were added, and one was enabled. This is the next step towards fixing https://github.com/llvm/llvm-project/issues/66708	2023-10-21 14:49:30 -07:00
Joseph Huber	34a3fb9f62	[Libomptarget][NFC] Remove use of VLA in the AMDGPU plugin (#69761 ) Summary: We should not rely on a VLA in C++ for the handling of this string. The size is a true runtime value so we cannot rely on constexpr handling. We simply use a small vector, whose default size is most likely large enough to handle whatever size gets output within the stack, but is safe in cases where it is not.	2023-10-20 16:02:51 -04:00
Ye Luo	8c2da6bb7f	[libomptarget] document ActionFunctions in the amdgpu plugin. (#66397 )	2023-09-14 12:18:49 -05:00
Ye Luo	6c8248e38b	[libomptarget] Rename AMDGPUSignalTy member Signal to HSASignal.	2023-09-11 22:42:34 -05:00
Ye Luo	08352b99a4	[libomptarget][NFC] update comments.	2023-09-11 22:30:25 -05:00
Joseph Huber	460840c09d	[OpenMP] Support 'omp_get_num_procs' on the device (#65501 ) Summary: The `omp_get_num_procs()` function should return the amount of parallelism availible. On the GPU, this was not defined. We have elected to define this function as the maximum amount of wavefronts / warps that can be simultaneously resident on the device. For AMDGPU this is the number of CUs multiplied byth CU's per wave. For NVPTX this is the maximum threads per SM divided by the warp size and multiplied by the number of SMs.	2023-09-06 13:45:05 -05:00

1 2 3

108 Commits