Summary:
This patch removes most of the special handling from the
`PluginAdaptorTy` in preparation for changing this to be the
`GenericPluginTy`. Doing this requires that the OpenMP specific handling
of stuff like device offsets be contained within the OpenMP plugin
manager. Generally this was uninvasive expect for the change to tracking
the offset and size of the used devices. The eaiest way I could think to
do this was to use some maps, which double as indicators for which
plugins have devices active. This should not affect the logic.
This adds an API call ompx_dump_mapping_tables.
This allows users to debug the mapping tables and can be especially
useful for unified shared memory applications to check if the code
behaves in the way it should. The implementation reuses code already
present to dump mapping tables (in a debug setting).
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Since there's no `ompt_target_data_transfer_tofrom_device` (within
ompt_target_data_op_t enum) or something other that conveys the meaning
of inter-device data exchange we decided to indicate a Device-to-Device
transfer by using: optype == ompt_target_data_transfer_from_device (=3)
Hence, a device transfer may be identified e.g. by checking for: (optype
== 3) &&
(src_device_num < omp_get_num_devices()) &&
(dest_device_num < omp_get_num_devices())
Fixes: #66478
Summary:
Ever since the introduction of the new plugins we haven't exercised the
concept of "optional" plugin functions. This is done in perparation for
making the plugins use a static interface as it will greatly simplify
the implementation if we assert that every function has the entrypoints.
Currently some unsupported functions will just return failure or some
other default value, so this shouldn't change anything.
Summary:
Currently we rely on global constructors to initialize and shut down the
OpenMP runtime library and plugin manager. This causes some issues
because we do not have a defined lifetime that we can rely on to release
and allocate resources. This patch instead adds some simple reference
counted initialization and deinitialization function.
A future patch will use the `deinit` interface to more intelligently
handle plugin deinitilization. Right now we do nothing and rely on
`atexit` inside of the plugins to tear them down. This isn't great
because it limits our ability to control these things.
Note that I made the `__tgt_register_lib` functions do the
initialization instead of adding calls to the new runtime functions in
the linker wrapper. The reason for this is because in the past it's been
easier to not introduce a new function call, since sometimes the user's
compiler will link against an older `libomptarget`. Maybe if we change
the name with offloading in the future we can simplify this.
Depends on https://github.com/llvm/llvm-project/pull/80460
Summary:
Currently, OpenMP handles the `omp requires` clause by emitting a global
constructor into the runtime for every translation unit that requires
it. However, this is not a great solution because it prevents us from
having a defined order in which the runtime is accessed and used.
This patch changes the approach to no longer use global constructors,
but to instead group the flag with the other offloading entires that we
already handle. This has the effect of still registering each flag per
requires TU, but now we have a single constructor that handles
everything.
This function removes support for the old `__tgt_register_requires` and
replaces it with a warning message. We just had a recent release, and
the OpenMP policy for the past four releases since we switched to LLVM
is that we do not provide strict backwards compatibility between major
LLVM releases now that the library is versioned. This means that a user
will need to recompile if they have an old binary that relied on
`register_requires` having the old behavior. It is important that we
actively deprecate this, as otherwise it would not solve the problem of
having no defined init and shutdown order for `libomptarget`. The
problem of `libomptarget` not having a define init and shutdown order
cascades into a lot of other issues so I have a strong incentive to be
rid of it.
It is worth noting that the current `__tgt_offload_entry` only has space
for a 32-bit integer here. I am planning to overhaul these at some point
as well.
Currently, __builtin_return_address is used to generate the return
address when the callback invoker is created. However, this may result
in the return address pointing to an internal runtime function. This is
not what a tool would typically want. A tool would want to know the
corresponding user code from where the runtime entry point is invoked.
This change adds a thread local variable that is assigned the return
address at the OpenMP runtime entry points. An RAII is used to manage
the modifications to the thread local variable. Whenever the return
address is required for OMPT events, it is read from the thread local
variable.
Summary:
This function is always false in the current implementation and is not
even considered required. Just remove it and if someone needs it in the
future they can add it back in. This is done to simplify the interface
prior to other changes
Summary:
The current logic tries to map target mapping tables to the current
device. Right now it assumes that data is only mapped a single time per
device. This is only true if we have a single instance of the runtime
running on a single program. However, in the case of dynamic library
loads or shared libraries, this may happen multiple times.
Given a case of a simple dynamic library load which has its own target
kernel instruction, the current logic had only the first call to
`__tgt_target_kernel` to the data mapping for that device. Then, when
the next dynamic library load got called, it would see that the global
were already mapped for that device and skip registering its own
entires, even though they were distinct. This resulted in none of the
mappings being done and hitting an assertion.
This patch simply gets rid of this per-device check. The check should
instead be on the host offloading entries. We already have logic that
calls `continue` if we already have entries for that pointer, so we can
simply rely on that instead.
Summary:
A previous patch removed creating these entries in clang in favor of the
backend emitting a callable kernel and having the runtime call that if
present. The support for the old style was kept around in LLVM 18.0 but
now that we have forked to 19.0 we should remove the support.
The effect of this would be that an application linking against a newer
libomptarget that still had the old constructors will no longer be
called. In that case, they can either recompile or use the
`libomptarget.so.18` that comes with the previous release.
Summary:
This patch removes the bulk of the handling of the
`__tgt_offload_entries` out of the plugins itself. The reason for this
is because the plugins themselves should not be handling this
implementation detail of the OpenMP runtime. Instead, we expose two new
plugin API functions to get the points to a device pointer for a global
as well as a kernel type.
This required introducing a new type to represent a binary image that
has been loaded on a device. We can then use this to load the addresses
as needed. The creation of the mapping table is then handled just in
`libomptarget` where we simply look up each address individually. This
should allow us to expose these operations more generically when we
provide a separate API.
This patch enables applications that did not request OpenMP
unified_shared_memory to run with the same zero-copy behavior, where
mapped memory does not result in extra memory allocations and memory
copies, but CPU-allocated memory is accessed from the device. The name
for this behavior is "automatic zero-copy" and it relies on detecting:
that the runtime is running on a MI300A, that the user did not select
unified_shared_memory in their program, and that XNACK (unified memory
support) is enabled in the current GPU configuration. If all these
conditions are met, then automatic zero-copy is triggered.
This patch also introduces an environment variable OMPX_APU_MAPS that,
if set, triggers automatic zero-copy also on non APU GPUs (e.g., on
discrete GPUs).
This patch is still missing support for global variables, which will be
provided in a subsequent patch.
Co-authored-by: Thorsten Blass <thorsten.blass@amd.com>
Summary:
The LLVM style uses /*Foo=*/ when indicating the name of a constant. See
https://llvm.org/docs/CodingStandards.html#comment-formatting. This is
useful for consistency, as well as because `clang-format` understands
this syntax and formats it more cleanly. Do a bulk update of this
syntax.
Summary:
The offloading entries right now are assumed to be baked into the binary
itself, and thus always valid whenever the library is executing. This
means that we don't need to copy them to additional storage and can
instead simply pass around references to it.
This is not likely to change in the expected operation of the OpenMP
library. Additionally, the indirection for the offload entry struct is
simply two pointers, so moving it by value is trivial.
…on (zero-copy) on MI300A.
This patch enables applications that did not request OpenMP
unified_shared_memory to run with the same zero-copy behavior, where
mapped memory does not result in extra memory allocations and memory
copies, but CPU-allocated memory is accessed from the device. The name
for this behavior is "automatic zero-copy" and it relies on detecting:
that the runtime is running on a MI300A, that the user did not select
unified_shared_memory in their program, and that XNACK (unified memory
support) is enabled in the current GPU configuration. If all these
conditions are met, then automatic zero-copy is triggered.
This patch is still missing support for global variables, which will be
provided in a subsequent patch.
Co-authored-by: Thorsten Blass <thorsten.blass@amd.com>
Summary:
The device allocator on NVPTX architectures is enqueued to a stream that
the kernel is potentially executing on. This can lead to deadlocks as
the kernel will not proceed until the allocation is complete and the
allocation will not proceed until the kernel is complete. CUDA 11.2
introduced async allocations that we can manually place on separate
streams to combat this. This patch makes a new allocation type that's
guaranteed to be non-blocking so it will actually make progress, only
Nvidia needs to care about this as the others are not blocking in this
way by default.
I had originally tried to make the `alloc` and `free` methods take a
`__tgt_async_info`. However, I observed that with the large volume of
streams being created by a parallel test it quickly locked up the system
as presumably too many streams were being created. This implementation
not just creates a new stream and immediately destroys it. This
obviously isn't very fast, but it at least gets the cases to stop
deadlocking for now.
Summary:
Adding information to the LIBOMPTARGET profiler runtime kernel and API
calls.
Key changes:
* Adding information to runtime calls for better understanding of how
the application
is executing. For example teams requested by the user, size of memory
transfers.
* Profile timer was changed from 'us' to 'ns', since 'us' was too
coarse-grain
to register some important details like key kernel duration
* Removed non API or Runtime calls, to reduce complexity of profile for
application
developers.
---------
Co-authored-by: Felipe Cabarcas <cabarcas@leia.crpl.cis.udel.edu>
Co-authored-by: fel-cab <fel-cab@github.com>
Summary:
This patch reorganizes a lot of the code used to check for compatibility
with the current environment. The main bulk of this patch involves
moving from using a separate `__tgt_image_info` struct (which just
contains a string for the architecture) to instead simply checking this
information from the ELF directly. Checking information in the ELF is
very inexpensive as creating an ELF file is simply writing a base
pointer.
The main desire to do this was to reorganize everything into the ELF
image. We can then do the majority of these checks without first
initializing the plugin. A future patch will move the first ELF checks
to happen without initializing the plugin so we no longer need to
initialize and plugins that don't have needed images.
This patch also adds a lot more sanity checks for whether or not the ELF
is actually compatible. Such as if the images have a valid ABI, 64-bit
width, executable, etc.
DeviceTy provides an abstraction for "middle-level" operations that can
be done with a offload device. Mapping was tied into it but is not
strictly necessary. Other languages do not track mapping, and even
OpenMP can be used completely without mapping. This simply moves the
relevant code into the OpenMP/Mapping.cpp as part of a new class
MappingInfoTy. Each device still has one, but it does not clutter the
device.cpp anymore.
Reverts llvm/llvm-project#74360
As I wrote in the analysis of #74360:
Since
bc4e0c048a
we will not add PluginAdaptors into the container of all plugin adaptors
before the plugin is not ready. The error is thereby gone. When and old
HSA loads other libraries they can call register_image but that will
simply not register the image with the plugin we are currently
initializing. That seems like reasonable behavior, thought it is good to
keep in mind if we ever want a kernel library (@jhuber6 @mjklemm). We
can still have a standalone kernel library though or load it late after
all plugins are setup (which seems reasonable).
I did not expect one our tests actually doing exactly what this will not
allow anymore, at least when you use rocm <5.5.0. Need to figure out if
we want this behavior (for rocm <5.5.0).
This introduces checked errors into the creation and initialization of
`PluginAdaptorTy`. We also allow the adaptor to "hide" devices from the
user if the initialization failed. The new organization avoids the
"initOnce" stuff but we still do not eagerly initialize the plugin
devices (I think we should merge `PluginAdaptorTy::initDevices` into
`PluginAdaptorTy::init`)
This fixes two bugs and adds a test for them:
- A shared library with declare target functions but without kernels
should not error out due to missing globals.
- Enabling LIBOMPTARGET_INFO=32 should not deadlock in the presence of
indirect declare targets.
We accessed the `Devices` container most of the time while holding the
RTLsMtx, but not always. Sometimes we used the mutex for the size query,
but then accessed Devices again unguarded. From now we properly
encapsulate the container in a ProtectedObj which ensures exclusive
accesses. We also hide the "isReady" part in the `getDevice` accessor
and use an `llvm::Expected` to allow to return errors.
This moves the offload entry logic into classes and provides convenient
accessors. No functional change intended but we can now print all
offload entries (and later look them up), tested via
`OMPTARGET_DUMP_OFFLOAD_ENTRIES=<device_no>`.
This basically moves code around again, but this time to provide cleaner
interfaces and remove duplication. PluginAdaptorManagerTy is almost all
gone after this.
For historic reasons we had it setup that there was
` plugin-nextgen/common/PluginInterface/<sources + headers>`
which is not what we do anywhere else.
Now it looks like the rest:
```
plugin-nextgen/common/include/<headers>
plugin-nextgen/common/src/<sources>
```
As part of this, `dlwrap.h` was moved into common/include (as
`DLWrap.h`)
since it is exclusively used by the plugins.