Commit Graph

88 Commits

Author SHA1 Message Date
Joseph Huber
ba192debb4 [Libomptarget][Obvious] Fix typo in attribute lookup
Summary:
These are keys into the AMDGPU target metadata. One of them had a typo
which prevented it from being extracted.
2023-12-20 19:03:35 -06:00
Joseph Huber
e4f4022b70 [Libomptarget][NFC] Fix linting warnings in the plugins
Summary:
Fix some linting warnings present in the plugins.
2023-12-20 10:07:34 -06:00
Joseph Huber
ac029e02a9 [Libomptarget] Remove __tgt_image_info and use the ELF directly (#75720)
Summary:
This patch reorganizes a lot of the code used to check for compatibility
with the current environment. The main bulk of this patch involves
moving from using a separate `__tgt_image_info` struct (which just
contains a string for the architecture) to instead simply checking this
information from the ELF directly. Checking information in the ELF is
very inexpensive as creating an ELF file is simply writing a base
pointer.

The main desire to do this was to reorganize everything into the ELF
image. We can then do the majority of these checks without first
initializing the plugin. A future patch will move the first ELF checks
to happen without initializing the plugin so we no longer need to
initialize and plugins that don't have needed images.

This patch also adds a lot more sanity checks for whether or not the ELF
is actually compatible. Such as if the images have a valid ABI, 64-bit
width, executable, etc.
2023-12-19 20:01:31 -06:00
Joseph Huber
6f3bd3a2f6 [Libomptarget] Add a utility function for checking existence of symbols (#74550)
Summary:
There are now a few cases that check if a symbol is present before
continuing, effectively making them optional features if present in the
image. This was done in at least three locations and required an ugly
operation to consume the error. This patch makes a utility function to
handle that instead.
2023-12-06 07:41:27 -06:00
Jon Chesterfield
f184147706 [amdgpu] Default to 1.0, instead of unspecified, for dynamic hsa (#74098)
The plugin checks the values of HSA_AMD_INTERFACE_VERSION_* so we now
set them to something safe in the header.
2023-12-01 16:37:49 +00:00
Johannes Doerfert
e2299e8d9d [OpenMP][NFC] Move OMPT headers into OpenMP/OMPT (#73718) 2023-11-29 08:29:41 -08:00
Johannes Doerfert
db96a9c3b7 [OpenMP][NFC] Flatten plugin-nextgen/common folder sturcture (#73725)
For historic reasons we had it setup that there was
`  plugin-nextgen/common/PluginInterface/<sources + headers>`
which is not what we do anywhere else.
Now it looks like the rest:
```
  plugin-nextgen/common/include/<headers>
  plugin-nextgen/common/src/<sources>
```
As part of this, `dlwrap.h` was moved into common/include (as
`DLWrap.h`)
since it is exclusively used by the plugins.
2023-11-29 07:57:01 -08:00
Jan Patrick Lehr
3930a0b57a [OpenMP][libomptarget] Use two SDMA engines (#73633)
Limit the use to two SDMA engines which are optimized for such transfers.
2023-11-29 14:21:44 +01:00
Johannes Doerfert
7233e42dff [OpenMP][NFC] Move Environment.h and SourceInfo.h into "Shared" folder (#73703) 2023-11-28 15:10:06 -08:00
Johannes Doerfert
8327f4a851 [OpenMP][NFC] Move Utils.h and Debug.h into a "Shared" include folder (#73701)
Headers used throughout the different runtimes are different from the
internal headers. This is a first step to bring structure in into the
include folder.
2023-11-28 13:44:57 -08:00
Johannes Doerfert
0783bf1cb3 [OpenMP][NFC] Merge MemoryManager into PluginInterface (#73678)
Similar to #73677, there is no benefit from keeping MemoryManager
seperate; it's tied into the current design. Except the move I also
replaced the getenv call with our Env handling.
2023-11-28 10:17:51 -08:00
Johannes Doerfert
4667dd62ee [OpenMP][NFC] Merge elf_common into PluginInterface (#73677)
The overhead of a library and 4 files seems high without benefit. This
simply tries to consolidate our structure.
2023-11-28 10:03:25 -08:00
Johannes Doerfert
6663df30c0 [OpenMP][NFC] Remove std::move to silence warnings 2023-11-20 17:15:33 -08:00
Joseph Huber
47a3ad5be1 [Libomptarget] Handle dynamic stack sizes for AMD COV5 (#72606)
Summary:
One of the changes in the AMD code-object version five was that kernels
that use an unknown amount of private stack memory now no longer default
to 16 KBs. Instead it emits a flag that indicates the runtime must
provide a value. This patch checks if we must provide such a stack, and
uses the existing handling of the stack environment variable to
configure it.
2023-11-20 12:48:42 -06:00
Jan Patrick Lehr
5c22b907dc Reland [OpenMP][libomptarget] Enable parallel copies via multiple SDM… (#72307)
…A engines (#71801)

This enables the AMDGPU plugin to use a new ROCm 5.7 interface to
dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data
transfers are enqueued into a H2D queue or an D2H queue, depending on
transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable
`LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget
is built against a recent ROCm version (5.7 and later). As of now,
requests are distributed in a round-robin fashion across available SDMA
engines.
2023-11-14 21:30:04 +01:00
Joseph Huber
cc9e19ee59 Revert "[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801)"
This causes the tests to fail because the bots were not updated in time.
Revert until we update the bots to a valid version.

This reverts commit e876250b63.
2023-11-14 12:34:27 -06:00
Jan Patrick Lehr
e876250b63 [OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801)
This enables the AMDGPU plugin to use a new ROCm 5.7 interface to
dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data
transfers are enqueued into a H2D queue or an D2H queue, depending on
transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable
`LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget
is built against a recent ROCm version (5.7 and later).
As of now, requests are distributed in a round-robin fashion across
available SDMA engines.
2023-11-14 19:16:39 +01:00
Joseph Huber
237adfca4e [OpenMP] Rework handling of global ctor/dtors in OpenMP (#71739)
Summary:
This patch reworks how we handle global constructors in OpenMP.
Previously, we emitted individual kernels that were all registered and
called individually. In order to provide more generic support, this
patch moves all handling of this to the target backend and the runtime
plugin. This has the benefit of supporting the GNU extensions for
constructors an destructors, removing a class of failures related to
shared library destruction order, and allows targets other than OpenMP
to use the same support without needing to change the frontend.

This is primarily done by calling kernels that the backend emits to
iterate a list of ctor / dtor functions. For x64, this is automatic and
we get it for free with the standard `dlopen` handling. For AMDGPU, we
emit `amdgcn.device.init` and `amdgcn.device.fini` functions which
handle everything atuomatically and simply need to be called. For NVPTX,
a patch https://github.com/llvm/llvm-project/pull/71549 provides the
kernels to call, but the runtime needs to set up the array manually by
pulling out all the known constructor / destructor functions.

One concession that this patch requires is the change that for GPU
targets in OpenMP offloading we will use `llvm.global_dtors` instead of
using `atexit`. This is because `atexit` is a separate runtime function
that does not mesh well with the handling we're trying to do here. This
should be equivalent in all cases except for cases where we would need
to destruct manually such as:

```
struct S { ~S() { foo(); } };
void foo() {
  static S s;
}
```

However this is broken in many other ways on the GPU, so it is not
regressing any support, simply increasing the scope of what we can
handle.

This changes the handling of ctors / dtors. This patch now outputs a
information message regarding the deprecation if the old format is used.
This will be completely removed in a later release.

Depends on: https://github.com/llvm/llvm-project/pull/71549
2023-11-10 14:53:53 -06:00
Saiyedul Islam
21861991e7 [OpenMP] Cleanup and fixes for ABI agnostic DeviceRTL (#71234)
Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses
already available global variable "oclc_ABI_version" instead of
"llvm.amdgcn.abi.verion".

It also adds some minor fields in ImplicitArg structure.
2023-11-09 10:34:35 +05:30
Jon Chesterfield
f0e100a05a [amdgpu][openmp] Treat missing TIMESTAMP_FREQUENCY as non-fatal (#70987)
If you build with dynamic_hsa, the symbol is known and compilation
succeeds. If you then run with a slightly older libhsa, this argument is
not recognised and an error returned. I'd rather the program runs with a
misleading omp wtime than refuses to run at all.
2023-11-01 22:43:34 +00:00
Jon Chesterfield
896749aa0d [amdgpu][openmp] Avoiding writing to packet header twice (#70695)
I think it follows from the HSA spec that a write to the first byte is
deemed significant to the GPU in which case writing to the second short
and reading back from it later would be safe. However, the examples for
this all involve an atomic write to the first 32 bits and it seems a
credible risk that the occasional CI errors abound invalid packets have
as their root cause that the firmware notices the early write to
packet->setup and treats that as a sign that the packet is ready to go.

That was overly-paranoid, however in passing noticed the code in libc is
genuinely invalid. The memset writes a zero to the header byte, changing
it from type_invalid (1) to type_vendor (0), at which point the GPU is
free to read the 64 byte packet and interpret it as a vendor packet,
which is probably why libc CI periodically errors about invalid packets.

Also a drive by change to do the atomic store on a uint32_t
consistently. I'm not sure offhand what __atomic_store_n on a uint16_t*
and an int resolves to, seems better to be unambiguous there.
2023-10-30 18:35:52 +00:00
Konstantinos Parasyris
d6a3d6b96d [openmp] Fixed Support for VA for record-replay. (#70396)
The commit was discussed in phabricator
(https://reviews.llvm.org/D157186).

Record replay currently fails on AMD as it conflicts with the heap
memory allocator introduced in #69806. The workaround is setting
`LIBOMPTARGET_HEAP_SIZE=0` during both record and replay run.
2023-10-29 12:27:19 -07:00
Johannes Doerfert
d346c82435 [OpenMP] Associate the KernelEnvironment with the GenericKernelTy (#70383)
By associating the kernel environment with the generic kernel we can
access middle-end information easily, including the launch bounds ranges
that are acceptable. By constraining the number of threads accordingly,
we now obey the user-provided bounds that were passed via attributes.
2023-10-29 11:35:34 -07:00
Jon Chesterfield
840d0b7e03 [amdgpu] D2D memcpy via streams and HSA (#69977)
hsa_amd_memory_async_copy can handle device to device copies if passed
the corresponding parameters.

No functional change - currently D2D copy goes through a fallback in
libomptarget that stages through a host malloc, after this it goes
directly through HSA.

Works under exactly the situations that HSA works. Verified locally on a
performance benchmark. Hoping to attract further testing from internal
developers after it lands.
2023-10-24 00:05:04 +01:00
Johannes Doerfert
d3921e4670 [OpenMP] Basic BumpAllocator for (AMD)GPUs (#69806)
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should consider this choice, for now, this allocator is better than
none. It traps if it is out of memory, making it easy to debug. Heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB.
It allows to track allocation statistics via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.

This is the next step towards fixing
 https://github.com/llvm/llvm-project/issues/66708
2023-10-21 14:49:30 -07:00
Joseph Huber
34a3fb9f62 [Libomptarget][NFC] Remove use of VLA in the AMDGPU plugin (#69761)
Summary:
We should not rely on a VLA in C++ for the handling of this string. The
size is a true runtime value so we cannot rely on constexpr handling. We
simply use a small vector, whose default size is most likely large
enough to handle whatever size gets output within the stack, but is safe
in cases where it is not.
2023-10-20 16:02:51 -04:00
Ye Luo
8c2da6bb7f [libomptarget] document ActionFunctions in the amdgpu plugin. (#66397) 2023-09-14 12:18:49 -05:00
Ye Luo
6c8248e38b [libomptarget] Rename AMDGPUSignalTy member Signal to HSASignal. 2023-09-11 22:42:34 -05:00
Ye Luo
08352b99a4 [libomptarget][NFC] update comments. 2023-09-11 22:30:25 -05:00
Joseph Huber
460840c09d [OpenMP] Support 'omp_get_num_procs' on the device (#65501)
Summary:
The `omp_get_num_procs()` function should return the amount of
parallelism availible. On the GPU, this was not defined. We have elected
to define this function as the maximum amount of wavefronts / warps that
can be simultaneously resident on the device. For AMDGPU this is the
number of CUs multiplied byth CU's per wave. For NVPTX this is the
maximum threads per SM divided by the warp size and multiplied by the
number of SMs.
2023-09-06 13:45:05 -05:00
Saiyedul Islam
f616c3eeb4 [OpenMP][DeviceRTL][AMDGPU] Support code object version 5
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.

CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.

Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Differential Revision: https://reviews.llvm.org/D139730

Reviewed By: jhuber6, yaxunl
2023-08-29 06:35:44 -05:00
Joseph Huber
06adac8c4e [Libomptarget] Configure the RPC port count from the plugin
This patch allows us to configure the port count to what the specific
card would desire for parallelism. For AMDGPU we need to use the maximum
number of hardware parallelism to avoid deadlocks. For NVPTX we don't
have this problem due to the friendlier scheduler, so we use the number
of warps active on an SM times the number of SMs as a good guess.

Note that the max ports currently is going to be smaller than these
numbers. That will be improved in the future.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D155903
2023-08-11 12:54:47 -05:00
Michael Halkenhaeuser
7eba3e58d5 [OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking
If the Envar is set to true (default), busy HSA queues will be
actively avoided when assigning a queue to a Stream.

Otherwise, we will initialize a new HSA queue for each requested
Stream, then default to round robin once the set maximum has been
reached.

Reviewed By: jdoerfert, kevinsala

Differential Revision: https://reviews.llvm.org/D156996
2023-08-07 10:48:02 -04:00
Kevin Sala
b8e297d1af [OpenMP][libomptarget] Improve kernel initialization in plugins
This patch modifies the plugins so that the initialization of KernelTy objects
is done in the init method. Part of the initialization was done in the
constructKernelEntry method. Now this method is called constructKernel
and only allocates and constructs a KernelTy object.

This patch prepares the kernel class for the new implementation of device
reductions.

Differential Revision: https://reviews.llvm.org/D156917
2023-08-06 11:53:58 +02:00
Kevin Sala
4f46a48aaf [OpenMP][libomptarget] Remove unused virtual functions in GenericKernelTy
The virtual functions getDefaultNumBlocks and getDefaultNumThreads from the kernels are
only forwarding the call to the generic device's ones. This patch removes those two
functions from the kernels (and their derived ones). Now calls are made to the device's
functions directly.

Differential Revision: https://reviews.llvm.org/D156905
2023-08-02 17:18:50 +02:00
Michael Halkenhaeuser
5b19f42b63 [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking
This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them.

To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.

When a Stream is requested from the resource manager, it will search for and
try to assign an idle queue. During the search for an idle queue the manager
may initialize more queues, up to the set maximum (default: 4).
When no idle queue could be found: resort to round robin selection.

With contributions from Johannes Doerfert <johannes@jdoerfert.de>

Depends on D156245

Reviewed By: kevinsala

Differential Revision: https://reviews.llvm.org/D154523
2023-08-02 08:22:26 -04:00
Kevin Sala
523ac0fcdf [OpenMP][libomptarget] Retrieve multiple resources from resource managers
This patch extends the plugin resource managers to return more than one resource
per call. The return function is not extended since we do not return more than
one resource anywhere.

Differential Revision: https://reviews.llvm.org/D155629
2023-07-28 00:37:08 +02:00
Kevin Sala
53e4c7c309 [OpenMP][libomptarget] Improving plugin resource managers
This patch improves the resource managers in the plugins by properly handling
the errors. Until now, errors when creating and destroying resources were not
propagated and were directly handled inside the resource managers. Now, all
errors are propagated as in the rest of the plugin infrastructure.

The code is now ready to implement the request/return of multiple resources in
a single getResource/returnResource call.

Differential Revision: https://reviews.llvm.org/D155621
2023-07-28 00:37:08 +02:00
Shilei Tian
10068cd654 [OpenMP] Introduce kernel environment
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.

This is a combination and refinement of patch series D116908, D116909, and D116910.

Depend on D155886.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D142569
2023-07-26 13:35:14 -04:00
Shilei Tian
6bd74fd65f Revert commits for kernel environment
This reverts commits for kernel environments as they causes issues in AMD BB.
2023-07-23 23:32:31 -04:00
Shilei Tian
c5c8040390 [OpenMP] Introduce kernel environment
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.

This is a combination and refinement of patch series D116908, D116909, and D116910.

Depend on D155886.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D142569
2023-07-23 18:36:01 -04:00
Shilei Tian
763fdb1ffa [OpenMP][Plugin] Update the global address calculation
Current global address caculation doesn't work for AMDGPU in some cases (https://reviews.llvm.org/D142569#4506212).
The root cause is the `sh_addr` is not substracted when caculating the address.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D155886
2023-07-23 17:41:06 -04:00
Joseph Huber
8a0763f19c [Libomptarget] Remove RPCHandleTy indirection
The 'RPCHandleTy' was intended to capture the intention that a specific
device owns its slot in the RPC server. However, this required creating
a temporary store to hold these pointers. This was causing really weird
spurious failure due to undefined behaviour in the order of library
teardown. For example, the x64 plugin would be torn down, set this to
some invalid memory, and then the CUDA plugin would crash. Rather than
spend the time to fully diagnose this problem I found it pertinent to
simply remove the failure mode.

This patch removes this indirection so now the usage of the RPC server
must always be done with the intended device. This just requires some
extra handling for the AMDGPU indirection where we need to store a
reference to the device.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154971
2023-07-11 10:54:40 -05:00
Michael Halkenhaeuser
142faf56f5 [OpenMP] [OMPT] [amdgpu] [5/8] Implemented device init/fini/load callbacks
Added support in the generic plugin to invoke registered callbacks.

Depends on D124070

Patch from John Mellor-Crummey <johnmc@rice.edu>
(With contributions from Dhruva Chakrabarti <Dhruva.Chakrabarti@amd.com>)

Differential Revision: https://reviews.llvm.org/D124652
2023-07-11 07:13:22 -04:00
Joseph Huber
691dc2d10d [Libomptarget] Begin implementing support for RPC services
This patch adds the intial support for running an RPC server in
libomptarget to handle host services. We interface with the library
provided by the `libc` project to stand up a basic server. We introduce
a new type that is controlled by the plugin and has each device
intialize its interface. We then run a basic server to check the RPC
buffer.

This patch does not fully implement the interface. In the future each
plugin will want to define special handlers via the interface to support
things like malloc or H2D copies coming from RPC. We will also want to
allow the plugin to specify t he number of ports. This is currently
capped in the implementation but will be adjusted soon.

Right now running the server is handled by whatever thread ends up doing
the waiting. This is probably not a completely sound solution but I am
not overly familiar with the behaviour of OpenMP tasks and what would be
required here. This works okay with synchrnous regions, and somewhat
fine with `nowait` regions, but I've observed some weird behavior when
one of those regions calls `exit`.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D154312
2023-07-07 12:36:46 -05:00
Joseph Huber
071c8a41cc [Libomptarget] Fix tests after deleting the next-gen plugins
The next-gen plugins didn't correctly configure tests and were never
actually being run. Since deleting the old plugin we stopped getting
`libomptarget` tests. This patch fixes the issue and allows the targets
to be built

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154619
2023-07-06 10:44:50 -05:00
Joseph Huber
e90ab9148b [OpenMP] Delete old plugins
It's time to remove the old plugins as the next-gen has already been set
to default in LLVM 16.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D142820
2023-07-05 17:39:47 -05:00
Joseph Huber
6764301a6b [Libomptarget] Correctly implement getWTime on AMDGPU
AMDGPU provides a fixed frequency clock since some generations back.
However, the frequency is variable by card and must be looked up at
runtime. This patch adds a new device environment line for the clock
frequency so that we can use it in the same way as NVPTX. This is the
correct implementation and the version in ASO should be replaced.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D154456
2023-07-04 21:50:43 -05:00
Johannes Doerfert
2e3c6c3f80 [OpenMP][NFC] Eliminate warning 2023-05-18 13:27:43 -07:00
Joseph Huber
b09953a4a3 [Libomptarget] Fix AMDGPU Note handling after D150022
Summary:
The changes in https://reviews.llvm.org/D150022 changed the API for this
function that we query. Simply pass in the alignment from the associated
header to fix.
2023-05-10 14:12:39 -05:00