A couple of liboffload entry points were missed out from the tests, and
unsurprisingly a crash in one of them made it in. Add the tests and fix
the unchecked error in `olDestroyEvent`.
Adds a `check-offload-unit` target for running the liboffload unit test
suite. This unit test binary runs the tests for every available device.
This can optionally filtered to devices from a single platform, but the
check target runs on everything.
The target is not part of `check-offload` and does not get propagated to
the top level build. I'm not sure if either of these things are
desirable, but I'm happy to look into it if we want.
Also remove the `offload/unittests/Plugins` test as it's dead code and
doesn't build.
Summary:
This was accidentally kept in the old location when we moved to the
new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the
delta with https://github.com/llvm/llvm-project/pull/136729.
Summary:
We treated the missing kernel environment as a unique mode, but it was
kind of this random bool that was doing the same thing and it explicitly
expects the kernel environment to be zero. It broke after the previous
change since it used to default to SPMD and didn't handle zero in any of
the other cases despite being used. This fixes that and queries for it
without needing to consume an error.
Implement the complete initial version of the Offload API, to the extent
that is usable for simple offloading programs. Tested with a basic SYCL
program.
As far as possible, these are simple wrappers over existing
functionality in the plugins.
* Allocating and freeing memory (host, device, shared).
* Creating a program
* Creating a queue (wrapper over asynchronous stream resource)
* Enqueuing memcpy operations
* Enqueuing kernel executions
* Waiting on (optional) output events from the enqueue operations
* Waiting on a queue to finish
Objects created with the API have reference counting semantics to handle
their lifetime. They are created with an initial reference count of 1,
which can be incremented and decremented with retain and release
functions. They are freed when their reference count reaches 0. Platform
and device objects are not reference counted, as they are expected to
persist as long as the library is in use, and it's not meaningful for
users to create or destroy them.
Tests have been added to `offload.unittests`, including device code for
testing program and kernel related functionality.
The API should still be considered unstable and it's very likely we will
need to change the existing entry points.
Summary:
Currently we depend on a single LLVM include directory. This is actually
only required to define one enum, which is highly unlikely to change.
THis patch makes the `Environment.h` include directory more hermetic so
we no long depend on other libraries. In exchange, we get a simpler
dependency list for the price of hard-coding `1` somewhere. I think it's
a valid trade considering that this flag is highly unlikely to change at
this point.
@ronlieb AMD version
https://gist.github.com/jhuber6/3313e6f957be14dc79fe85e5126d2cb3
Unset `-march` when invoking the compiler and linker to build the GPU
libraries. These libraries use GPU targets rather than the CPU targets,
and an incidental `-march=native` causes Clang to be able to determine
the GPU used — which causes the build to fail when there is no GPU
available. Resetting `-march=` should suffice to revert to building
generic code for the time being.
See the discussion in:
https://github.com/llvm/llvm-project/pull/126143#issuecomment-2816718492
Summary:
Currently, we build a single `libomptarget.devicertl.a` which is a
fatbinary. It is a host object file that contains the embedded archive
files for both the NVIDIA and AMDGPU targets. This was done primarily as
a convenience due to naming conflicts. Now that the clang driver for the
GPU targets can appropriate link via the per-target runtime-dir, we can
just make two separate static libraries and remove the indirection.
This patch creates two new static libraries that get installed into
```
/lib/amdgcn-amd-amdhsa/libomp.a
/lib/nvptx64-nvidia-cuda/libomp.a
```
for AMDGPU and NVPTX respectively. The link job created by the linker
wrapper now simply needs to do `-lomp` and it will search those
directories and link those static libraries. This requires far less
special handling.
This patch is a precursor to changing the build system entirely to be a
runtimes based one. Soon this target will be a standard `add_library`
and done through the GPU runtime targets.
NOTE that this actually does remove an additional optimization step.
Previously we merged all of the files into a single bitcode object and
forcibly internalized some definitions. This, instead, just treats them
like a normal static library. This may possibly affect performance for
some files, but I think it's better overall to use static library
semantics because it allows us to have an 'include-what-you-use'
relationship with the library.
Performance testing will be required. If we really need the merged blob
then we can simply pack that into a new static library.
Currently we don't check for the presence of descriptor/BoxTypes before
emitting stores which lower to memcpys, the issue with this is that
users can have optional arguments, where they don't provide an input,
making the argument effectively null. This can still be mapped and this
causes issues at the moment as we'll emit a memcpy for function
arguments to store to a local variable for certain edge cases, when we
perform this memcpy on a null input, we cause a segfault at runtime.
The fix to this is to simply create a branch around the store that
checks if the data we're copying from is actually present. If it is, we
proceed with the store, if it isn't we skip it.
Summary:
Previously, we removed the special handling for the code object version
global. I erroneously thought that this meant we cold get rid of this
weird `-Xclang` option. However, this also emits an LLVM IR module flag,
which will then cause linking issues.
This reverts commit 3483740289 because #128619 doesn't handle the case when we have an empty frame from `getInliningInfoForAddress` because line num is 0 which makes it non-differentiable from missing debug info. So, we end up using the base filename from symtab again. Reverting for now until that issus is solved.
PR #134713, which landed as 79cb6f05da, causes this on my test
systems:
```
-- Building AMDGPU plugin for dlopened libhsa
-- Not generating AMDGPU tests, no supported devices detected. Use 'LIBOMPTARGET_FORCE_AMDGPU_TESTS' to override.
-- Building CUDA plugin for dlopened libcuda
-- Not generating NVIDIA tests, no supported devices detected. Use 'LIBOMPTARGET_FORCE_NVIDIA_TESTS' to override.
```
The problem is it cannot locate amdgpu-arch and nvptx-arch. This patch
enables it to.
I suspect there is more cleanup to do here. amdgpu-arch and nvptx-arch
do not appear to exist as cmake targets anymore, but there is still
cmake code here that looks for those targets.
This patch removes the addition of 1 to the number of iterations when
calling the following DeviceRTL functions:
- `__kmpc_distribute_for_static_loop*`
- `__kmpc_distribute_static_loop*`
- `__kmpc_for_static_loop*`
Calls to these functions are currently only produced by the OMPIRBuilder
from flang, which already passes the correct number of iterations to
these functions. By adding 1 to the received `num_iters` variable,
worksharing can produce incorrect results. This impacts flang OpenMP
offloading of `do`, `distribute` and `distribute parallel do`
constructs.
Expecting the application to pass `tripcount - 1` as the argument seems
unexpected as well, so rather than updating flang I think it makes more
sense to update the runtime.
If user specifies offload is disabled (e.g.,
OMP_TARGET_OFFLOAD=disable), disable library almost completely. This
reduces resources spent to a minimum and ensures all APIs behave as if
the only available device is the host device.
Currently some of the APIs behave as if there were devices avaible for
offload even when under OMP_TARGET_OFFLOAD=disable.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Summary:
When we were first porting to COV5, this lead to some ABI issues due to
a change in how we looked up the work group size. Bitcode libraries
relied on the builtins to emit code, but this was changed between
versions. This prevented the bitcode libraries, like OpenMP or libc,
from being used for both COV4 and COV5. The solution was to have this
'none' functionality which effectively emitted code that branched off of
a global to resolve to either version.
This isn't a great solution because it forced every TU to have this
variable in it. The patch in
https://github.com/llvm/llvm-project/pull/131033 removed support for
COV4 from OpenMP, which was the only consumer of this functionality.
Other users like HIP and OpenCL did not use this because they linked the
ROCm Device Library directly which has its own handling (The name was
borrowed from it after all).
So, now that we don't need to worry about backward compatibility with
COV4, we can remove this special handling. Users can still emit COV4
code, this simply removes the special handling used to make the OpenMP
device runtime bitcode version agnostic.
When building with asserts enabled, this can actually cause strange
miscompilations because an incorrect llvm.assume is generated at the
point of the assertion.
Summary:
We conditionally allocate the implicit arguments, so they possibly are
null. The flang compiler seems to hit this case, even though it
shouldn't when it's supposed to conform to the HSA code object. For now
guard this to fix the regression and cover a case in the future where
someone rolls a fully custom implementatation.
Fixes: https://github.com/llvm/llvm-project/issues/132982
This pull request is the third part of an ongoing effort to extends PGO
instrumentation to GPU device code and depends on
https://github.com/llvm/llvm-project/pull/93365. This PR makes the
following changes:
- Allows PGO flags to be supplied to GPU targets
- Pulls version global from device
- Modifies `__llvm_write_custom_profile` and `lprofWriteDataImpl` to
allow the PGO version to be overridden
Spirv doesn't have implicit conversions between address spaces (at least
at present, we might need to change that) and address space qualified
*this pointers are not handled well by clang. This commit changes the
single instance of the smartstack to be explicitly a singleton, for
fractionally simpler IR generation (no this pointer) and to sidestep the
work in progress spirv64-- openmp target not being able to compile the
original version.
Summary:
This patch moves the RPC server handling to be a header only utility
stored in the `shared/` directory. This is intended to be shared within
LLVM for the loaders and `offload/` handling.
Generally, this makes it easier to share code without weird
cross-project binaries being plucked out of the build system. It also
allows us to soon move the loader interface out of the `libc` project so
that we don't need to bootstrap those and can build them in LLVM.
Enable the LLVM_ENABLE_RUNTIMES=flang-rt build of the Fortran runtime
for the amdgpu-offload-* buildbots. This pre-population cmake cache
files is referred to by the llvm-zorg annotated builder factory
[script](872f477610/zorg/buildbot/builders/annotated/amdgpu-offload-cmake.py (L26)).
The corresponding change in llvm-zorg is
llvm/llvm-zorg#402
This reverts commit e296fb8ff6.
The worker of amdgpu-offload-rhel-8-cmake-build-only has been updated
with a newer version of Ninja that supports Fortran.
Improve the check for whether a type can be passed by copy. Currently,
passing by copy is done via the OMP_MAP_LITERAL mapping, which can only
transfer as much data as can be contained in a pointer representation.
The HAS_DEVICE_ADDR indicates that the object(s) listed exists at an
address that is a valid device address. Specifically,
`has_device_addr(x)` means that (in C/C++ terms) `&x` is a device
address.
When entering a target region, `x` does not need to be allocated on the
device, or have its contents copied over (in the absence of additional
mapping clauses). Passing its address verbatim to the region for use is
sufficient, and is the intended goal of the clause.
Some Fortran objects use descriptors in their in-memory representation.
If `x` had a descriptor, both the descriptor and the contents of `x`
would be located in the device memory. However, the descriptors are
managed by the compiler, and can be regenerated at various points as
needed. The address of the effective descriptor may change, hence it's
not safe to pass the address of the descriptor to the target region.
Instead, the descriptor itself is always copied, but for objects like
`x`, no further mapping takes place (as this keeps the storage pointer
in the descriptor unchanged).
---------
Co-authored-by: Sergio Afonso <safonsof@amd.com>
This PR adds an initial implementation for the map modifiers close,
present and ompx_hold, primarily just required adding the appropriate
map type flags to the map type bits. In the case of ompx_hold it
required adding the map type to the OpenMP dialect. Close has a bit of a
problem when utilised with the ALWAYS map type on descriptors, so it is
likely we'll have to make sure close and always are not applied to the
descriptor simultaneously in the future when we apply always to the
descriptors to facilitate movement of descriptor information to device
for consistency, however, we may find an alternative to this with
further investigation. For the moment, it is a TODO/Note to keep track
of it.
The module currently stores the target triple as a string. This means
that any code that wants to actually use the triple first has to
instantiate a Triple, which is somewhat expensive. The change in #121652
caused a moderate compile-time regression due to this. While it would be
easy enough to work around, I think that architecturally, it makes more
sense to store the parsed Triple in the module, so that it can always be
directly queried.
For this change, I've opted not to add any magic conversions between
std::string and Triple for backwards-compatibilty purses, and instead
write out needed Triple()s or str()s explicitly. This is because I think
a decent number of them should be changed to work on Triple as well, to
avoid unnecessary conversions back and forth.
The only interesting part in this patch is that the default triple is
Triple("") instead of Triple() to preserve existing behavior. The former
defaults to using the ELF object format instead of unknown object
format. We should fix that as well.
The ffi_cif structure defined in the wrapper header is smaller than the
actual structure in libffi which results in other structures being
overwritten when libffi is called, and finally in a segfault.
The patch updates the structure to the correct layout as specified in
ffi.h
This patch adds OpenMPToLLVMIRTranslation support for the OpenMP Declare
Mapper directive.
Since both MLIR and Clang now support custom mappers, I've changed the
respective function params to no longer be optional as well.
Depends on #121005