Commit Graph

70 Commits

Author SHA1 Message Date
Ross Brunton
4785832144 [Offload] Fix cmake warning (#145488)
Cmake was unhappy that there was no space between arguments, now it
is.
2025-06-24 13:42:03 +01:00
Johannes Doerfert
57a90edacd [OpenMP][GPU][FIX] Enable generic barriers in single threaded contexts (#140786)
The generic GPU barrier implementation checked if it was the main thread
in generic mode to identify single threaded regions. This doesn't work
since inside of a non-active (=sequential) parallel, that thread becomes
the main thread of a team, and is not the main thread in generic mode.
At least that is the implementation of the APIs today.

To identify single threaded regions we now check the team size
explicitly.

This exposed three other issues; one is, for now, expected and not a
bug, the second one is a bug and has a FIXME in the
single_threaded_for_barrier_hang_1.c file, and the final one is also
benign as described in the end.

The non-bug issue comes up if we ever initialize a thread state.
Afterwards we will never run any region in parallel. This is a little
conservative, but I guess thread states are really bad for performance
anyway.

The bug comes up if we optimize single_threaded_for_barrier_hang_1 and
execute it in Generic-SPMD mode. For some reason we loose all the
updates to b. This looks very much like a compiler bug, but could also
be another logic issue in the runtime. Needs to be investigated.

Issue number 3 comes up if we have nested parallels inside of a target
region. The clang SPMD-check logic gets confused, determines SPMD (which
is fine) but picks an unreasonable thread count. This is all benign, I
think, just weird:

```
  #pragma omp target teams
  #pragma omp parallel num_threads(64)
  #pragma omp parallel num_threads(10)
  {}
```
Was launched with 10 threads, not 64.
2025-05-20 19:33:54 -07:00
Joseph Huber
dbe070eb3e [Offload] Fix PowerPC builds that pass -mcpu (#138327)
Summary:
Another hacky fix done until
https://github.com/llvm/llvm-project/pull/136729 lands. This time for
`-mcpu`.
2025-05-06 14:14:16 -05:00
Joseph Huber
dfcb8cb2a9 [OpenMP] Add pre sm_70 load hack back in (#138589)
Summary:
Different ordering modes aren't supported for an atomic load, so we just
do an add of zero as the same thing. It's less efficient, but it works.

Fixes https://github.com/llvm/llvm-project/issues/138560
2025-05-05 16:33:41 -05:00
Ye Luo
dcb43307ce [Offload] Fix dependency issue #126143 in CMake 2025-05-05 00:38:48 -05:00
Joseph Huber
346792aafb [Offload] Override linker for device build (#137246)
Summary:
Override the default linker in case the user is passing it separately.
This requires `lld` but it always did. This will be fixed *properly*
when https://github.com/llvm/llvm-project/pull/136729 lands.

Fixes https://github.com/llvm/llvm-project/issues/136822
2025-04-25 17:22:07 +02:00
Joseph Huber
6d0d50f0ac [OpenMP] Update the bitcode library install and search path (#136754)
Summary:
This was accidentally kept in the old location when we moved to the
new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the
delta with https://github.com/llvm/llvm-project/pull/136729.
2025-04-23 08:20:15 -05:00
Joseph Huber
92bba68634 [Offload] Fix handling of 'bare' mode when environment missing (#136794)
Summary:
We treated the missing kernel environment as a unique mode, but it was
kind of this random bool that was doing the same thing and it explicitly
expects the kernel environment to be zero. It broke after the previous
change since it used to default to SPMD and didn't handle zero in any of
the other cases despite being used. This fixes that and queries for it
without needing to consume an error.
2025-04-23 08:16:39 -05:00
Joseph Huber
56bf0e7202 [OpenMP] Remove dependency on LLVM include directory from DeviceRTL (#136359)
Summary:
Currently we depend on a single LLVM include directory. This is actually
only required to define one enum, which is highly unlikely to change.
THis patch makes the `Environment.h` include directory more hermetic so
we no long depend on other libraries. In exchange, we get a simpler
dependency list for the price of hard-coding `1` somewhere. I think it's
a valid trade considering that this flag is highly unlikely to change at
this point.

@ronlieb AMD version
https://gist.github.com/jhuber6/3313e6f957be14dc79fe85e5126d2cb3
2025-04-21 15:21:47 -05:00
Michał Górny
ac8fc09688 [offload] Unset -march when building GPU libraries (#136442)
Unset `-march` when invoking the compiler and linker to build the GPU
libraries. These libraries use GPU targets rather than the CPU targets,
and an incidental `-march=native` causes Clang to be able to determine
the GPU used — which causes the build to fail when there is no GPU
available. Resetting `-march=` should suffice to revert to building
generic code for the time being.

See the discussion in:
https://github.com/llvm/llvm-project/pull/126143#issuecomment-2816718492
2025-04-20 04:16:19 +00:00
Joseph Huber
db0f754c5a [OpenMP] Remove 'libomptarget.devicertl.a' fatbinary and use static library (#126143)
Summary:
Currently, we build a single `libomptarget.devicertl.a` which is a
fatbinary. It is a host object file that contains the embedded archive
files for both the NVIDIA and AMDGPU targets. This was done primarily as
a convenience due to naming conflicts. Now that the clang driver for the
GPU targets can appropriate link via the per-target runtime-dir, we can
just make two separate static libraries and remove the indirection.

This patch creates two new static libraries that get installed into
```
/lib/amdgcn-amd-amdhsa/libomp.a
/lib/nvptx64-nvidia-cuda/libomp.a
```
for AMDGPU and NVPTX respectively. The link job created by the linker
wrapper now simply needs to do `-lomp` and it will search those
directories and link those static libraries. This requires far less
special handling.

This patch is a precursor to changing the build system entirely to be a
runtimes based one. Soon this target will be a standard `add_library`
and done through the GPU runtime targets.

NOTE that this actually does remove an additional optimization step.
Previously we merged all of the files into a single bitcode object and
forcibly internalized some definitions. This, instead, just treats them
like a normal static library. This may possibly affect performance for
some files, but I think it's better overall to use static library
semantics because it allows us to have an 'include-what-you-use'
relationship with the library.

Performance testing will be required. If we really need the merged blob
then we can simply pack that into a new static library.
2025-04-18 07:43:31 -05:00
Joseph Huber
2f41fa387d [AMDGPU] Fix code object version not being set to 'none' (#135036)
Summary:
Previously, we removed the special handling for the code object version
global. I erroneously thought that this meant we cold get rid of this
weird `-Xclang` option. However, this also emits an LLVM IR module flag,
which will then cause linking issues.
2025-04-10 11:31:21 -05:00
Sergio Afonso
66fca0674d [OpenMP] Fix num_iters in __kmpc_*_loop DeviceRTL functions (#133435)
This patch removes the addition of 1 to the number of iterations when
calling the following DeviceRTL functions:
- `__kmpc_distribute_for_static_loop*`
- `__kmpc_distribute_static_loop*`
- `__kmpc_for_static_loop*`

Calls to these functions are currently only produced by the OMPIRBuilder
from flang, which already passes the correct number of iterations to
these functions. By adding 1 to the received `num_iters` variable,
worksharing can produce incorrect results. This impacts flang OpenMP
offloading of `do`, `distribute` and `distribute parallel do`
constructs.

Expecting the application to pass `tripcount - 1` as the argument seems
unexpected as well, so rather than updating flang I think it makes more
sense to update the runtime.
2025-04-01 10:29:08 +01:00
Joseph Huber
772173f548 [Clang][AMDGPU] Remove special handling for COV4 libraries (#132870)
Summary:
When we were first porting to COV5, this lead to some ABI issues due to
a change in how we looked up the work group size. Bitcode libraries
relied on the builtins to emit code, but this was changed between
versions. This prevented the bitcode libraries, like OpenMP or libc,
from being used for both COV4 and COV5. The solution was to have this
'none' functionality which effectively emitted code that branched off of
a global to resolve to either version.

This isn't a great solution because it forced every TU to have this
variable in it. The patch in
https://github.com/llvm/llvm-project/pull/131033 removed support for
COV4 from OpenMP, which was the only consumer of this functionality.
Other users like HIP and OpenCL did not use this because they linked the
ROCm Device Library directly which has its own handling (The name was
borrowed from it after all).

So, now that we don't need to worry about backward compatibility with
COV4, we can remove this special handling. Users can still emit COV4
code, this simply removes the special handling used to make the OpenMP
device runtime bitcode version agnostic.
2025-03-28 07:35:16 -05:00
macurtis-amd
21a8c63cdc [offload] Remove bad assert in StaticLoopChunker::Distribute (#132705)
When building with asserts enabled, this can actually cause strange
miscompilations because an incorrect llvm.assume is generated at the
point of the assertion.
2025-03-28 04:53:00 -05:00
Joseph Huber
cb493d2bab [OpenMP] Replace utilities with 'gpuintrin.h' definitions (#131644)
Summary:
Port more instructions. AMD version is at
https://gist.github.com/jhuber6/235d7ee95f747c75f9a3cfd8eedac6aa
2025-03-19 10:47:21 -05:00
Jon Chesterfield
deb0f3c09b [openmp][nfc] Use builtin align in the devicertl (#131918)
Noticed while extracting the smartstack as a test case
2025-03-18 21:31:49 +00:00
Jon Chesterfield
395bdebebd Revert "[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)"
This reverts commit c02b935a9b.
Failed a check-offload test under CI
2025-03-18 20:43:05 +00:00
Joseph Huber
206f78dfec [OpenMP] Use 'gpuintrin.h' definitions for simple block identifiers (#131631)
Summary:
This patch ports the runtime to use `gpuintrin.h` instead of calling the
builtins for most things. The `lanemask_gt` stuff was left for now with
a fallback.

AMD version for Ron
https://gist.github.com/jhuber6/42014d635b9a8158727640876bf47226.
2025-03-18 15:38:46 -05:00
Jon Chesterfield
c02b935a9b [openmp][nfc] Refactor shared/lds smartstack for spirv (#131905)
Spirv doesn't have implicit conversions between address spaces (at least
at present, we might need to change that) and address space qualified
*this pointers are not handled well by clang. This commit changes the
single instance of the smartstack to be explicitly a singleton, for
fractionally simpler IR generation (no this pointer) and to sidestep the
work in progress spirv64-- openmp target not being able to compile the
original version.
2025-03-18 20:33:24 +00:00
Joseph Huber
ed9107f2d7 [OpenMP] Replace use of target address space with <gpuintrin.h> local (#126119)
Summary:
This definition is more portable since it defines the correct value for
the target. I got rid of the helper mostly because I think it's easy
enough to use now that it's a type and being explicit about what's
`undef` or `poison` is good.
2025-02-09 10:25:25 -06:00
Joseph Huber
bb7ab2557c [OpenMP] Port the OpenMP device runtime to direct C++ compilation (#123673)
Summary:
This removes the use of OpenMP offloading to build the device runtime.
The main benefit here is that we no longer need to rely on offloading
semantics to build a device only runtime. Things like variants are now
no longer needed and can just be simple if-defs. In the future, I will
remove most of the special handling here and fold it into calls to the
`<gpuintrin.h>` functions instead. Additionally I will rework the
compilation to make this a separate runtime.

The current plan is to have this, but make including OpenMP and
offloading either automatically add it, or print a warning if it's
missing. This will allow us to use a normal CMake workflow and delete
all the weird 'lets pull the clang binary out of the build' business.
```
-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload
-DLLVM_RUNTIME_TARGETS=amdgcn-amd-amdhsa
```

After that, linking the OpenMP device runtime will be `-Xoffload-linker
-lomp`. I.e. no more fat binary business.

Only look at the most recent commit since this includes the two
dependencies
(fix to AMDGPUEmitPrintfBinding and the PointerToMember bug).
2025-02-05 08:18:52 -06:00
Christian Clauss
1f56bb3137 [Offload][NFC] Fix typos discovered by codespell (#125119)
https://github.com/codespell-project/codespell

% `codespell
--ignore-words-list=archtype,hsa,identty,inout,iself,nd,te,ths,vertexes
--write-changes`
2025-01-31 09:35:29 -06:00
Joseph Huber
760a786d15 [Clang] Prevent mlink-builtin-bitcode from internalizing the RPC client (#118661)
Summary:
Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA
compiliations. This has the problem that it will internalize the RPC
client symbol which needs to be visible to the host. To counteract that,
I put `retain` on it, but this also prevents optimizations on the global
itself, so the passes we have that remove the symbol don't work on
OpenMP anymore. This patch does the dumbest solution, adding a special
string check for it in clang. Not the best solution, the runner up would
be to have a clang attribute for `externally_initialized` because those
can't be internalized, but that might have some unfortunate
side-effects. Alternatively we could make NVIDIA compilations do LTO all
the time, but that would affect some users and it's harder than I
thought.
2025-01-27 19:30:59 -06:00
Joseph Huber
f233a54ae8 [OpenMP] Remove usage of pointer-to-member in lookup (#123671)
Summary:
This is buggy and is currently being tracked in
https://github.com/llvm/llvm-project/issues/123241. For now, replace it
with a macro so that we can use address spaces directly.
2025-01-21 07:50:40 -06:00
Joseph Huber
3274bf6b42 [OpenMP] Make each atomic helper take an atomic scope argument (#122786)
Summary:
Right now we just default to device for each type, and mix an ad-hoc
scope with the one used by the compiler's builtins. Unify this can make
each version take the scope optionally.

For @ronlieb, this will remove the need for `add_system` in the fork as
well as the extra `cas` with system scope, just pass `system`.
2025-01-20 21:58:27 -06:00
Joseph Huber
2d9f406943 [OpenMP] Adjust 'printf' handling in the OpenMP runtime (#123670)
Summary:
We used to avoid a lot of this stuff because we didn't properly handle
variadics in device code. That's been solved for now, so we can just
make an internal printf handler that forwards to the external `vprintf`
function. This is either provided by NVIDIA's SDK or by the GPU libc
implementation.

The main reason for doing this is because it prevents the stupid AMDGPU
printf pass from mangling our beautiful printfs!
2025-01-20 21:56:46 -06:00
Joseph Huber
723a3e746a [OpenMP] Fix mispelled attribute and warning
Summary:
This is spelled `ompx_aligned_barrier` when used directly, but wasn't
included in the list of known assumptions. Fix that so now th test
works.
2025-01-20 08:40:19 -06:00
Joseph Huber
58af82b462 [OpenMP] Remove 'omp assumes' scopes now that we have no inline ASM (#123611)
Summary:
We used this globally scoped `ext_no_call_asm` as a sort of hack around
the compiler that allowed the attributor to optimize out inline assembly
calls to PTX instructions. Quite some time ago I got rid of every inline
assembly call and replaced it with a builitin, so this can just be
deleted.

Furthermore, I use the `[[omp::assume]]` attribute directly for the
aligned barrier usage. This prints an unknown assumption warning (even
though it isn't) so I'm just silencing that for now until I fix it
later.

---------

Co-authored-by: Michael Kruse <github@meinersbur.de>
2025-01-20 08:11:06 -06:00
Joseph Huber
1c00d0d776 [OpenMP] Remove hack around missing atomic load (#122781)
Summary:
We used to do a fetch add of zero to approximate a load. This is because
the NVPTX backend didn't handle this properly. It's not an issue anymore
so simply use the proper atomic builtin.
2025-01-16 15:17:15 -06:00
Joseph Huber
74d5373f49 [OpenMP] Fix missing type getter for SFINAE helper
Summary:
This didn't get the type, which made using this always return false.
2025-01-10 19:35:41 -06:00
Joseph Huber
f53cb84df6 [OpenMP] Use __builtin_bit_cast instead of UB type punning (#122325)
Summary:
Use a normal bitcast, remove from the shared utils since it's not
available in
GCC 7.4
2025-01-09 13:59:21 -06:00
Joseph Huber
b57c0bac81 [OpenMP] Update atomic helpers to just use headers (#122185)
Summary:
Previously we had some indirection here, this patch updates these
utilities to just be normal template functions. We use SFINAE to manage
the special case handling for floats. Also this strips address spaces so
it can be used more generally.
2025-01-09 13:57:39 -06:00
Joseph Huber
34f8573a51 [OpenMP] Use generic IR for the OpenMP DeviceRTL (#119091)
Summary:
We previously built this for every single architecture to deal with
incompatibility. This patch updates it to use the 'generic' IR that
`libc` and other projects use. Who knows if this will have any
side-effects, probably worth testing more but it passes the tests I
expect to pass on my side.
2024-12-24 18:05:28 -06:00
Joseph Huber
b0fbddde38 [OpenMP] Only put retain for NVPTX so it can be optimized out for AMD
Summary:
This is a hack that only NVPTX needs.
2024-12-17 15:16:51 -06:00
Joseph Huber
f4ee5a673f [OpenMP] Replace AMDGPU fences with generic scoped fences (#119619)
Summary:
This is simpler and more common. I would've replaced the CUDA uses and
made this the same but currently it doesn't codegen these fences fully
and just emits a full system wide barrier as a fallback.
2024-12-12 07:54:51 -06:00
hidekisaito
f2bceb2311 [Offload][AMDGPU] accept generic target (#118919)
Enables generic ISA, e.g., "--offload-arch=gfx11-generic" device code to
run on gfx11-generic ISA capable device.

Executable may contain one ELF that has specific target ISA and another
ELF that has compatible generic ISA.
Under that circumstance, this code should say both ELFs are compatible,
leaving the rest to PluginManager to handle.
Suggestions on how best to address that is welcome.
2024-12-09 19:11:38 -05:00
Michał Górny
69227a11fe [offload] Support LIBOMPTARGET_DEVICE_ARCHITECTURES={amdgpu|nvptx} (#119070)
Add two more special values for LIBOMPTARGET_DEVICE_ARCHITECTURES:
`amdgpu` and `nvptx`, to support building for all AMDGPU and NVPTX
targets respectively. This can be used in place of `all` when offload is
built with one of the GPU plugins only.
2024-12-07 15:37:28 +00:00
Michał Górny
b54ba5361e [offload] Add gfx1012 (Navi 14) to AMDGPU models list (#118857)
Fixes #118824
2024-12-06 03:24:55 +00:00
Jan Patrick Lehr
c7babfa6a3 [Offload] Find libc relative to DeviceRTL path (#118497)
This was discussed as a potential solution in
https://github.com/llvm/llvm-project/pull/118173
2024-12-03 16:37:57 +01:00
Joseph Huber
91f5f974cb [OpenMP] Unconditionally provide an RPC client interface for OpenMP (#117933)
Summary:
This patch adds an RPC interface that lives directly in the OpenMP
device runtime. This allows OpenMP to implement custom opcodes.
Currently this is only providing the host call interface, which is the
raw version of reverse offloading. Previously this lived in `libc/` as
an extension which is not the correct place.

The interface here uses a weak symbol for the RPC client by the same
name that the `libc` interface uses. This means that it will defer to
the libc one if both are present so we don't need to set up multiple
instances.

The presense of this symbol is what controls whether or not we set up
the RPC server. Because this is an external symbol it normally won't be
optimized out, so there's a special pass in OpenMPOpt that deletes this
symbol if it is unused during linking. That means at `O0` the RPC server
will always be present now, but will be removed trivially if it's not
used at O1 and higher.
2024-12-02 14:31:51 -06:00
Joseph Huber
506ca19dc9 [OpenMP] Remove use of '__AMDGCN_WAVEFRONT_SIZE' (#113156)
Summary:
This is going to be deprecated in
https://github.com/llvm/llvm-project/pull/112849. This patch ports it to
use the builtin instead. This isn't a compile constant, so it could
slightly negatively affect codegen. There really should be an IR pass to
turn it into a constant if the function has known attributes.

Using the builtin is correct when we just do it for knowing the size
like we do here. Obviously guarding w32/w64 code with this check would
be broken.
2024-11-25 07:38:28 -06:00
Matt Arsenault
a6fc489bb7 AMDGPU: Add gfx950 subtarget definitions (#116307)
Mostly a stub, but adds some baseline tests and
tests for removed instructions.
2024-11-18 10:41:14 -08:00
Carl Ritson
076aac59ac [AMDGPU] Add a new target for gfx1153 (#113138) 2024-10-23 12:56:58 +09:00
Joseph Huber
e8d2057ca4 [OpenMP] Add critical region lock for NVPTX targets (#110148)
Summary:
We define this on AMDGCN but not NVPTX, which leads to some failures
dependong on the target.
2024-09-26 11:33:52 -07:00
Joseph Huber
c3ac3fe825 [OpenMP] Fix redefining stdint.h types (#108607)
Summary:
We can include `stdint.h` just fine as long as we don't allow it to find
system headers, passing `-nostdlibinc` and `-nogpuinc` suppresses these
extra paths so we will just use the clang resource headers for
`stdint.h` and `stddef.h`.
2024-09-13 13:22:44 -05:00
Johannes Doerfert
08533a3ee8 [Offload][NFC] Reorganize utils:: and make Device/Host/Shared clearer (#100280)
We had three `utils::` namespaces, all with different "meaning" (host,
device, hsa_utils). We should, when we can, keep "include/Shared"
accessible from host and device, thus RefCountTy has been moved to a
separate header. `hsa_utils` was introduced to make `utils::` less
overloaded. And common functionality was de-duplicated, e.g.,
`utils::advance` and `utils::advanceVoidPtr` -> `utils:advancePtr`. Type
punning now checks for the size of the result to make sure it matches
the source type.

No functional change was intended.
2024-09-05 13:36:26 -07:00
WÁNG Xuěruì
9adf81182e [Offload] Fix stray libomptarget message helper calls (#106837)
In #92581 the `LibomptargetUitls.cmake` helpers have been removed, but
only uses of `libomptarget_say` were migrated. Migrate the remaining few
warning and error messages so the `check-offload` target would not fail
due to missing `libomptarget_warning_say`.

While at it, update the `check-offload` unavailability message to say
`check-offload` instead of `check-libomptarget`.

Fixes #92581
2024-08-31 07:06:41 -05:00
Ethan Luis McDonough
fde2d23ee2 [PGO][OpenMP] Instrumentation for GPU devices (Revision of #76587) (#102691)
This pull request is a revised version of #76587. This pull request
fixes some build issues that were present in the previous version of
this change.

> This pull request is the first part of an ongoing effort to extends
PGO instrumentation to GPU device code. This PR makes the following
changes:
>
> - Adds blank registration functions to device RTL
> - Gives PGO globals protected visibility when targeting a supported
GPU
> - Handles any addrspace casts for PGO calls
> - Implements PGO global extraction in GPU plugins (currently only
dumps info)
>
> These changes can be tested by supplying `-fprofile-instrument=clang`
while targeting a GPU.
2024-08-22 01:10:54 -05:00
Joseph Huber
74d23f15b6 [OpenMP] Implement 'omp_alloc' on the device (#102526)
Summary:
The 'omp_alloc' function should be callable from a target region. This
patch implemets it by simply calling `malloc` for every non-default
trait value allocator. All the special access modifiers are
unimplemented and return null. The null allocator returns null as the
spec states it should not be usable from the target.
2024-08-14 13:38:55 -05:00