Summary:
This patch removes the bulk of the handling of `__tgt_offload_entries`
from the plugins themselves. The reason for this is that the plugins
should not be handling this implementation detail of the OpenMP
runtime. Instead, we expose two new plugin API functions that look up
the device pointer for a global as well as the handle for a kernel.
This required introducing a new type to represent a binary image that
has been loaded on a device. We can then use this to load the addresses
as needed. The creation of the mapping table is then handled just in
`libomptarget` where we simply look up each address individually. This
should allow us to expose these operations more generically when we
provide a separate API.
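For illustration, a rough sketch of the shape these two entry points
might take (names and signatures here are approximate, not the
authoritative declarations): given a handle to an image that has
already been loaded on the device, `libomptarget` can query the device
address of a global or the launchable handle of a kernel by name.
```
#include <cstdint>

// Opaque handle for a binary image that has been loaded on a device.
struct __tgt_device_binary {
  uintptr_t handle;
};

// Look up the device pointer backing a global of the given size.
int32_t __tgt_rtl_get_global(__tgt_device_binary Binary, uint64_t Size,
                             const char *Name, void **DevicePtr);

// Look up the handle used to launch the kernel with the given name.
int32_t __tgt_rtl_get_function(__tgt_device_binary Binary, const char *Name,
                               void **KernelPtr);
```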
Summary:
The LLVM style uses /*Foo=*/ when indicating the name of a constant. See
https://llvm.org/docs/CodingStandards.html#comment-formatting. This is
useful for consistency, as well as because `clang-format` understands
this syntax and formats it more cleanly. Do a bulk update of this
syntax.
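For example (the function here is made up purely to illustrate the
convention):
```
#include <cstddef>

void copyMemory(void *Dst, const void *Src, size_t Size, bool Blocking);

void caller(void *Dst, const void *Src, size_t Size) {
  // Without the comment, the bare `true` would be meaningless at the call
  // site; clang-format keeps /*Blocking=*/ attached to the argument.
  copyMemory(Dst, Src, Size, /*Blocking=*/true);
}
```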
Summary:
The offloading entries right now are assumed to be baked into the binary
itself, and thus always valid whenever the library is executing. This
means that we don't need to copy them to additional storage and can
instead simply pass around references to them.
This is not likely to change in the expected operation of the OpenMP
library. Additionally, the indirection for the offload entry struct is
simply two pointers, so moving it by value is trivial.
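For reference, a sketch of roughly what these structures look like
(abbreviated; the table wrapper name below is hypothetical):
```
#include <cstddef>
#include <cstdint>

// One entry describing a kernel or global emitted into the binary.
struct __tgt_offload_entry {
  void *addr;       // Address of the symbol on the host.
  char *name;       // Name of the symbol.
  size_t size;      // Size of the global in bytes (0 for functions).
  int32_t flags;    // Entry flags.
  int32_t reserved; // Reserved, must be zero.
};

// The entries are baked into the binary as a contiguous [Begin, End) table,
// so passing the table by value only copies this pair of pointers.
struct OffloadEntryTableTy {
  __tgt_offload_entry *Begin;
  __tgt_offload_entry *End;
};
```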
…on (zero-copy) on MI300A.
This patch enables applications that did not request OpenMP
unified_shared_memory to run with the same zero-copy behavior, where
mapped memory does not result in extra memory allocations and memory
copies, but CPU-allocated memory is accessed from the device. The name
for this behavior is "automatic zero-copy" and it relies on detecting
that the runtime is running on an MI300A, that the user did not select
unified_shared_memory in their program, and that XNACK (unified memory
support) is enabled in the current GPU configuration. If all these
conditions are met, then automatic zero-copy is triggered.
This patch also introduces an environment variable OMPX_APU_MAPS that,
if set, triggers automatic zero-copy also on non-APU GPUs (e.g., on
discrete GPUs).
This patch is still missing support for global variables, which will be
provided in a subsequent patch.
Co-authored-by: Thorsten Blass <thorsten.blass@amd.com>
Summary:
The device allocator on NVPTX architectures is enqueued to a stream that
the kernel is potentially executing on. This can lead to deadlocks as
the kernel will not proceed until the allocation is complete and the
allocation will not proceed until the kernel is complete. CUDA 11.2
introduced async allocations that we can manually place on separate
streams to combat this. This patch adds a new allocation type that is
guaranteed to be non-blocking so it will actually make progress. Only
the NVIDIA plugin needs to care about this, as the others do not block
in this way by default.
I had originally tried to make the `alloc` and `free` methods take a
`__tgt_async_info`. However, I observed that with the large volume of
streams being created by a parallel test it quickly locked up the system
as presumably too many streams were being created. This implementation
now just creates a new stream and immediately destroys it. This
obviously isn't very fast, but it at least gets the cases to stop
deadlocking for now.
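A minimal sketch of the workaround using the CUDA driver API (error
handling elided, not the actual plugin code): allocate on a freshly
created non-blocking stream so the allocation cannot be queued behind a
running kernel, then tear the stream down again.
```
#include <cstddef>
#include <cuda.h>

static void *allocNonBlocking(size_t Size) {
  CUstream Stream;
  CUdeviceptr DevicePtr = 0;
  cuStreamCreate(&Stream, CU_STREAM_NON_BLOCKING);
  cuMemAllocAsync(&DevicePtr, Size, Stream); // Requires CUDA 11.2+.
  cuStreamSynchronize(Stream);
  cuStreamDestroy(Stream);
  return reinterpret_cast<void *>(DevicePtr);
}

static void freeNonBlocking(void *Ptr) {
  CUstream Stream;
  cuStreamCreate(&Stream, CU_STREAM_NON_BLOCKING);
  cuMemFreeAsync(reinterpret_cast<CUdeviceptr>(Ptr), Stream);
  cuStreamSynchronize(Stream);
  cuStreamDestroy(Stream);
}
```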
Summary:
Adding information to the LIBOMPTARGET profiler for runtime, kernel,
and API calls.
Key changes:
* Adding information to runtime calls for a better understanding of how
  the application is executing, for example the number of teams
  requested by the user and the size of memory transfers (see the
  sketch below).
* The profile timer was changed from 'us' to 'ns', since 'us' was too
  coarse-grained to register some important details like key kernel
  durations.
* Removed non-API and non-runtime calls, to reduce the complexity of
  the profile for application developers.
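For illustration, a hypothetical example (not the actual libomptarget
code) of how extra detail can be attached to a profiler scope via
LLVM's time-trace support, which is what the LIBOMPTARGET_PROFILE
output is built on:
```
#include "llvm/Support/TimeProfiler.h"
#include <string>

static void launchKernelProfiled(const char *KernelName, int NumTeams,
                                 int ThreadLimit) {
  // The second argument shows up as the "detail" of the event in the trace.
  llvm::TimeTraceScope Scope("Runtime: Kernel Launch",
                             std::string(KernelName) +
                                 ";NumTeams=" + std::to_string(NumTeams) +
                                 ";ThreadLimit=" + std::to_string(ThreadLimit));
  // ... the actual kernel launch would happen here ...
}
```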
---------
Co-authored-by: Felipe Cabarcas <cabarcas@leia.crpl.cis.udel.edu>
Co-authored-by: fel-cab <fel-cab@github.com>
Summary:
This patch reorganizes a lot of the code used to check for compatibility
with the current environment. The main bulk of this patch involves
moving from using a separate `__tgt_image_info` struct (which just
contains a string for the architecture) to instead simply checking this
information from the ELF directly. Checking information in the ELF is
very inexpensive, as creating an ELF view is simply wrapping a base
pointer.
The main desire to do this was to reorganize everything into the ELF
image. We can then do the majority of these checks without first
initializing the plugin. A future patch will move the first ELF checks
to happen without initializing the plugin so we no longer need to
initialize plugins that don't have any needed images.
This patch also adds a lot more sanity checks for whether or not the
ELF is actually compatible, such as whether the image has a valid ABI,
is 64-bit wide, is an executable, etc.
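A minimal sketch (not the actual plugin code) of what such a check can
look like with LLVM's Object library, assuming a 64-bit little-endian
image:
```
#include "llvm/Object/ELF.h"

static bool isCompatibleElf(llvm::StringRef Buffer, uint16_t ExpectedMachine) {
  // Creating the ELF view is cheap: it only wraps the existing buffer.
  auto ElfOrErr = llvm::object::ELF64LEFile::create(Buffer);
  if (!ElfOrErr) {
    llvm::consumeError(ElfOrErr.takeError());
    return false;
  }
  const auto &Header = ElfOrErr->getHeader();
  // Reject images that are not 64-bit executables for the expected machine.
  return Header.e_ident[llvm::ELF::EI_CLASS] == llvm::ELF::ELFCLASS64 &&
         Header.e_type == llvm::ELF::ET_EXEC &&
         Header.e_machine == ExpectedMachine;
}
```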
DeviceTy provides an abstraction for "middle-level" operations that can
be done with an offload device. Mapping was tied into it but is not
strictly necessary. Other languages do not track mapping, and even
OpenMP can be used completely without mapping. This simply moves the
relevant code into OpenMP/Mapping.cpp as part of a new class,
MappingInfoTy. Each device still has one, but it no longer clutters
device.cpp.
Reverts llvm/llvm-project#74360
As I wrote in the analysis of #74360:
Since bc4e0c048a we will not add PluginAdaptors into the container of
all plugin adaptors before the plugin is ready. The error is thereby
gone. When an old HSA loads other libraries, they can call
register_image, but that will simply not register the image with the
plugin we are currently initializing. That seems like reasonable
behavior, though it is good to keep in mind if we ever want a kernel
library (@jhuber6 @mjklemm). We can still have a standalone kernel
library though, or load it late after all plugins are set up (which
seems reasonable).
I did not expect one of our tests to actually be doing exactly what
this no longer allows, at least when you use rocm <5.5.0. We need to
figure out if we want this behavior (for rocm <5.5.0).
This introduces checked errors into the creation and initialization of
`PluginAdaptorTy`. We also allow the adaptor to "hide" devices from the
user if the initialization failed. The new organization avoids the
"initOnce" stuff but we still do not eagerly initialize the plugin
devices (I think we should merge `PluginAdaptorTy::initDevices` into
`PluginAdaptorTy::init`).
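A rough sketch (names and shape hypothetical) of the checked-error
pattern this introduces: construction returns an `llvm::Expected`, so a
failing plugin can be reported to the caller instead of aborting or
being silently used.
```
#include "llvm/Support/Error.h"
#include <memory>
#include <string>

struct PluginAdaptorTy {
  static llvm::Expected<std::unique_ptr<PluginAdaptorTy>>
  create(const std::string &LibraryName) {
    auto Adaptor = std::make_unique<PluginAdaptorTy>();
    if (llvm::Error Err = Adaptor->init(LibraryName))
      return std::move(Err); // Propagate the failure to the caller.
    return std::move(Adaptor);
  }

  llvm::Error init(const std::string &LibraryName) {
    // The real code would dlopen the plugin and resolve its symbols here.
    if (LibraryName.empty())
      return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                     "no plugin library given");
    return llvm::Error::success();
  }
};
```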
This fixes two bugs and adds a test for them:
- A shared library with declare target functions but without kernels
should not error out due to missing globals.
- Enabling LIBOMPTARGET_INFO=32 should not deadlock in the presence of
indirect declare targets.
We accessed the `Devices` container most of the time while holding the
RTLsMtx, but not always. Sometimes we used the mutex for the size query,
but then accessed Devices again unguarded. From now on we properly
encapsulate the container in a ProtectedObj which ensures exclusive
accesses. We also hide the "isReady" part in the `getDevice` accessor
and use an `llvm::Expected` to allow returning errors.
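A minimal sketch of the idea behind such a ProtectedObj wrapper
(simplified, not the real implementation): the contained object can
only be reached through an accessor that holds the lock for its entire
lifetime.
```
#include <mutex>

template <typename Ty> class ProtectedObj {
  Ty Obj;
  std::mutex Mtx;

public:
  // RAII accessor that keeps the mutex held for as long as it is alive.
  class AccessorTy {
    std::lock_guard<std::mutex> Guard;
    Ty &ObjRef;

  public:
    AccessorTy(Ty &Obj, std::mutex &Mtx) : Guard(Mtx), ObjRef(Obj) {}
    Ty &operator*() { return ObjRef; }
    Ty *operator->() { return &ObjRef; }
  };

  AccessorTy getExclusiveAccessor() { return AccessorTy(Obj, Mtx); }
};
```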
This moves the offload entry logic into classes and provides convenient
accessors. No functional change intended but we can now print all
offload entries (and later look them up), tested via
`OMPTARGET_DUMP_OFFLOAD_ENTRIES=<device_no>`.
This basically moves code around again, but this time to provide cleaner
interfaces and remove duplication. PluginAdaptorManagerTy is almost all
gone after this.
For historic reasons we had it set up such that there was
`plugin-nextgen/common/PluginInterface/<sources + headers>`
which is not what we do anywhere else.
Now it looks like the rest:
```
plugin-nextgen/common/include/<headers>
plugin-nextgen/common/src/<sources>
```
As part of this, `dlwrap.h` was moved into common/include (as
`DLWrap.h`)
since it is exclusively used by the plugins.
Not everything in libomptarget (include) is "OpenMP", but some things
most certainly are. This commit moves some code around to start making
this distinction without the intention to change functionality.
Headers used throughout the different runtimes are different from the
internal headers. This is a first step to bring structure into the
include folder.
There is no point in keeping GenericDeviceTy objects alive longer than
the associated GenericPluginTy. Instead of the old API we now tear them
down with the plugin, avoiding ordering issues.
The order in which we deinit things, especially when shared libraries
are involved, is complicated. To simplify our lives the nextgen plugin
deinitializes the GenericPluginTy and subclasses automatically. The old
__tgt_rtl_deinit_plugin is not needed anymore.
When we record and replay kernels we should not error out early if there
is a chance the program might still run fine. This patch will:
1) Fall back to the allocation heuristic if the VAMap doesn't work.
2) Adjust the memory start to match the required address if possible.
3) Adjust the (guessed) pointer arguments if the memory start adjustment
is impossible. This will allow kernels without indirect accesses to
work while indirect accesses will most likely fail.
Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us
to allocate less if we need less, e.g., if we have fewer teams than the
maximal length. It also allows us to move code from clang's codegen
into the runtime, as we now know how large the reduction data is.
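As an illustration of what the split enables (variable names here are
hypothetical), the launch-time allocation can now be computed from both
parts instead of a single opaque byte count:
```
#include <algorithm>
#include <cstdint>

static uint64_t getReductionBufferBytes(uint64_t ReductionDataSize,
                                        uint32_t MaxBufferLength,
                                        uint32_t NumTeams) {
  // Only as many slots as teams that actually run, capped by the maximal
  // length the buffer was sized for.
  return ReductionDataSize * std::min(MaxBufferLength, NumTeams);
}
```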
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.
Fixes: https://github.com/llvm/llvm-project/issues/70249
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per-launch memory pools.
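A rough sketch of the distinction (field names illustrative, not the
exact runtime definitions): the kernel environment is produced by the
compiler once per kernel, while the launch environment is produced by
the runtime for every single launch.
```
#include <cstdint>

// Compile-time information about a kernel, emitted by the compiler.
struct KernelEnvironmentSketchTy {
  uint8_t ExecMode;           // E.g., SPMD vs. generic execution.
  int32_t MinThreads;         // Static bounds known to the compiler.
  int32_t MaxThreads;
  uint64_t ReductionDataSize; // Size of all reduction variables.
};

// Per-launch information, allocated and filled in by the runtime so that
// concurrent launches of the same kernel do not share state.
struct KernelLaunchEnvironmentSketchTy {
  void *ReductionBuffer; // Replaces the formerly global reduction storage.
};
```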
Fixes: https://github.com/llvm/llvm-project/issues/70249