Summary:
Adding information to the LIBOMPTARGET profiler runtime kernel and API
calls.
Key changes:
* Adding information to runtime calls for better understanding of how
the application
is executing. For example teams requested by the user, size of memory
transfers.
* Profile timer was changed from 'us' to 'ns', since 'us' was too
coarse-grain
to register some important details like key kernel duration
* Removed non API or Runtime calls, to reduce complexity of profile for
application
developers.
---------
Co-authored-by: Felipe Cabarcas <cabarcas@leia.crpl.cis.udel.edu>
Co-authored-by: fel-cab <fel-cab@github.com>
Summary:
This patch reorganizes a lot of the code used to check for compatibility
with the current environment. The main bulk of this patch involves
moving from using a separate `__tgt_image_info` struct (which just
contains a string for the architecture) to instead simply checking this
information from the ELF directly. Checking information in the ELF is
very inexpensive as creating an ELF file is simply writing a base
pointer.
The main desire to do this was to reorganize everything into the ELF
image. We can then do the majority of these checks without first
initializing the plugin. A future patch will move the first ELF checks
to happen without initializing the plugin so we no longer need to
initialize and plugins that don't have needed images.
This patch also adds a lot more sanity checks for whether or not the ELF
is actually compatible. Such as if the images have a valid ABI, 64-bit
width, executable, etc.
DeviceTy provides an abstraction for "middle-level" operations that can
be done with a offload device. Mapping was tied into it but is not
strictly necessary. Other languages do not track mapping, and even
OpenMP can be used completely without mapping. This simply moves the
relevant code into the OpenMP/Mapping.cpp as part of a new class
MappingInfoTy. Each device still has one, but it does not clutter the
device.cpp anymore.
Reverts llvm/llvm-project#74360
As I wrote in the analysis of #74360:
Since
bc4e0c048a
we will not add PluginAdaptors into the container of all plugin adaptors
before the plugin is not ready. The error is thereby gone. When and old
HSA loads other libraries they can call register_image but that will
simply not register the image with the plugin we are currently
initializing. That seems like reasonable behavior, thought it is good to
keep in mind if we ever want a kernel library (@jhuber6 @mjklemm). We
can still have a standalone kernel library though or load it late after
all plugins are setup (which seems reasonable).
I did not expect one our tests actually doing exactly what this will not
allow anymore, at least when you use rocm <5.5.0. Need to figure out if
we want this behavior (for rocm <5.5.0).
This introduces checked errors into the creation and initialization of
`PluginAdaptorTy`. We also allow the adaptor to "hide" devices from the
user if the initialization failed. The new organization avoids the
"initOnce" stuff but we still do not eagerly initialize the plugin
devices (I think we should merge `PluginAdaptorTy::initDevices` into
`PluginAdaptorTy::init`)
This fixes two bugs and adds a test for them:
- A shared library with declare target functions but without kernels
should not error out due to missing globals.
- Enabling LIBOMPTARGET_INFO=32 should not deadlock in the presence of
indirect declare targets.
We accessed the `Devices` container most of the time while holding the
RTLsMtx, but not always. Sometimes we used the mutex for the size query,
but then accessed Devices again unguarded. From now we properly
encapsulate the container in a ProtectedObj which ensures exclusive
accesses. We also hide the "isReady" part in the `getDevice` accessor
and use an `llvm::Expected` to allow to return errors.
This moves the offload entry logic into classes and provides convenient
accessors. No functional change intended but we can now print all
offload entries (and later look them up), tested via
`OMPTARGET_DUMP_OFFLOAD_ENTRIES=<device_no>`.
This basically moves code around again, but this time to provide cleaner
interfaces and remove duplication. PluginAdaptorManagerTy is almost all
gone after this.
For historic reasons we had it setup that there was
` plugin-nextgen/common/PluginInterface/<sources + headers>`
which is not what we do anywhere else.
Now it looks like the rest:
```
plugin-nextgen/common/include/<headers>
plugin-nextgen/common/src/<sources>
```
As part of this, `dlwrap.h` was moved into common/include (as
`DLWrap.h`)
since it is exclusively used by the plugins.
Not everything in libomptarget (include) is "OpenMP", but some things
most certainly are. This commit moves some code around to start making
this distinction without the intention to change functionality.
Headers used throughout the different runtimes are different from the
internal headers. This is a first step to bring structure in into the
include folder.
There is no point in keeping GenericDeviceTy objects alive longer than
the associated GenericPluginTy. Instead of the old API we now tear them
down with the plugin, avoiding ordering issues.
The order in which we deinit things, especially when shared libraries
are involved, is complicated. To simplify our lives the nextgen plugin
deinitializes the GenericPluginTy and subclasses automatically. The old
__tgt_rtl_deinit_plugin is not needed anymore.
When we record and replay kernels we should not error out early if there
is a chance the program might still run fine. This patch will:
1) Fallback to the allocation heuristic if the VAMap doesn't work.
2) Adjust the memory start to match the required address if possible.
3) Adjust the (guessed) pointer arguments if the memory start adjustment
is impossible. This will allow kernels without indirect accesses to
work while indirect accesses will most likely fail.
Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have less teams than the
maximal length. It also allows us to move code from clangs codegen into
the runtime as we now know how large the reduction data is.
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.
Fixes: https://github.com/llvm/llvm-project/issues/70249
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.
Fixes: https://github.com/llvm/llvm-project/issues/70249
The commit was discussed in phabricator
(https://reviews.llvm.org/D157186).
Record replay currently fails on AMD as it conflicts with the heap
memory allocator introduced in #69806. The workaround is setting
`LIBOMPTARGET_HEAP_SIZE=0` during both record and replay run.
The runtime needs to know about the acceptable launch bounds, especially
if the compiler (middle- or backend) assumed those bounds. While this
patch does not yet inform the runtime, it stores the bounds in a place
that can/will be accessed and is associated with the kernel.
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should consider this choice, for now, this allocator is better than
none. It traps if it is out of memory, making it easy to debug. Heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB.
It allows to track allocation statistics via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.
This is the next step towards fixing
https://github.com/llvm/llvm-project/issues/66708
Implement a slow-path version of omp_target_memset*()
There is a TODO to implement a fast path that uses an on-device
kernel instead of the host-based memory fill operation. This may
require some additional plumbing to have kernels in libomptarget.so