Commit Graph

99 Commits

Author SHA1 Message Date
Johannes Doerfert
7233e42dff [OpenMP][NFC] Move Environment.h and SourceInfo.h into "Shared" folder (#73703) 2023-11-28 15:10:06 -08:00
Johannes Doerfert
3de645efe3 [OpenMP][NFC] Split the reduction buffer size into two components
Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have less teams than the
maximal length. It also allows us to move code from clangs codegen into
the runtime as we now know how large the reduction data is.
2023-11-06 11:50:41 -08:00
Jan Patrick Lehr
07f5cf1992 [OpenMP][libomptarget] Fixes possible no-return warning (#70808)
The UNREACHABLE macro resolves to message + trap, which may still warn, so we add call to __builtin_unreachable.
2023-11-06 16:45:03 +01:00
Johannes Doerfert
d3e7a48cbd [OpenMP][NFC] Remove a no-op function 2023-11-03 10:28:36 -07:00
Johannes Doerfert
f9a89e6b9c [OpenMP][FIX] Allocate per launch memory for GPU team reductions (#70752)
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: https://github.com/llvm/llvm-project/issues/70249
2023-11-01 11:11:48 -07:00
Johannes Doerfert
b8cbc5c02c [OpenMP] Introduce the KernelLaunchEnvironment as implicit argument (#70401)
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.

Fixes: https://github.com/llvm/llvm-project/issues/70249
2023-10-31 19:38:43 -07:00
Johannes Doerfert
d3921e4670 [OpenMP] Basic BumpAllocator for (AMD)GPUs (#69806)
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should consider this choice, for now, this allocator is better than
none. It traps if it is out of memory, making it easy to debug. Heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB.
It allows to track allocation statistics via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.

This is the next step towards fixing
 https://github.com/llvm/llvm-project/issues/66708
2023-10-21 14:49:30 -07:00
Johannes Doerfert
d571af7f62 [OpenMP][FIX] Ensure thread states do not crash on the GPU
The nested parallelism causes thread states which still do not properly
work but at least don't crash anymore.
2023-10-21 14:43:09 -07:00
Johannes Doerfert
1cea309b7e [OpenMP][NFC] Move DebugKind to make it reusable from the host 2023-10-20 19:28:09 -07:00
Joseph Huber
b69081e324 Attributes (#69358)
- [Libomptarget] Make the references to 'malloc' and 'free' weak.
- [Libomptarget][NFC] Use C++ style attributes instead
2023-10-18 12:52:43 -04:00
Joseph Huber
460840c09d [OpenMP] Support 'omp_get_num_procs' on the device (#65501)
Summary:
The `omp_get_num_procs()` function should return the amount of
parallelism availible. On the GPU, this was not defined. We have elected
to define this function as the maximum amount of wavefronts / warps that
can be simultaneously resident on the device. For AMDGPU this is the
number of CUs multiplied byth CU's per wave. For NVPTX this is the
maximum threads per SM divided by the warp size and multiplied by the
number of SMs.
2023-09-06 13:45:05 -05:00
Joseph Huber
aa78e94b0b [Libomptarget] Support mapping indirect host calls to device functions
The changes in D157738 allowed for us to emit stub globals on the device
in the offloading entry section. These globals contain the addresses of
device functions and allow us to map host functions to their
corresponding device equivalent. This patch provides the initial support
required to build a table on the device to lookup the associated value.
This is done by finding these entries and creating a global table on the
device that can be searched with a simple binary search.

This requires an allocation, which supposedly should be automatically
freed at plugin shutdown. This includes a basic test which looks up device
pointers via a host pointer using the added function. This will need to be built
upon to provide full support for these calls in the runtime.

To support reverse offloading it would also be useful to provide a reverse table
that allows us to get host functions from device stubs.

Depends on D157738

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D157918
2023-08-25 18:51:56 -05:00
Johannes Doerfert
ed16143593 [OpenMP][FIX] Ensure __assert_fail is compatible with the host
Fixes: https://github.com/llvm/llvm-project/issues/64360
2023-08-04 11:36:58 -07:00
Joseph Huber
46642cc83d [Libomptarget] Remove debug RAII from libomptarget
This feature was supposed to allow you to trace execution inside of
Libomptarget. However, this never really worked properly. The printing
was always reoganized, only worked for single  threads, and pretty much
only told you a handful of things about a runtime library that's an
implementation detail to all users. Despite this, it contributed about
40% of the total filesize of the deviceRTL. This patch simply removes
this functionalit which I think was past due.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D157001
2023-08-03 09:37:47 -05:00
Johannes Doerfert
1f3a28d4e5 [OpenMP][NFC] Reorganize the ompx::mapping layer in the GPU runtime
This change makes the naming more consistent, I hope.
2023-07-31 13:44:51 -07:00
Shilei Tian
10068cd654 [OpenMP] Introduce kernel environment
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.

This is a combination and refinement of patch series D116908, D116909, and D116910.

Depend on D155886.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D142569
2023-07-26 13:35:14 -04:00
Shilei Tian
6bd74fd65f Revert commits for kernel environment
This reverts commits for kernel environments as they causes issues in AMD BB.
2023-07-23 23:32:31 -04:00
Shilei Tian
c5c8040390 [OpenMP] Introduce kernel environment
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.

This is a combination and refinement of patch series D116908, D116909, and D116910.

Depend on D155886.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D142569
2023-07-23 18:36:01 -04:00
Johannes Doerfert
f914208c43 [OpenMP][NFCI] Avoid storing non-constant values in ICV
If we store a constant in an ICV it is easier for the optimizer to
propagate it. Since we often use the full block for the thread limit and
the parallel team size, we can instead replace that dynamic value with a
constant that otherwise cannot occur, here 0.
2023-07-18 16:50:50 -07:00
Johannes Doerfert
88a68de14c [OpenMP][NFCI] Split assertion message from assertion expression
We ended up with `llvm.assume(icmp ne ptr as(4) null, as(4) @str)`
because the string in address space 4 was not known to be non-null.
There is no need to create these assumes.
2023-07-18 16:50:50 -07:00
Joseph Huber
6764301a6b [Libomptarget] Correctly implement getWTime on AMDGPU
AMDGPU provides a fixed frequency clock since some generations back.
However, the frequency is variable by card and must be looked up at
runtime. This patch adds a new device environment line for the clock
frequency so that we can use it in the same way as NVPTX. This is the
correct implementation and the version in ASO should be replaced.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D154456
2023-07-04 21:50:43 -05:00
Dhruva Chakrabarti
6a1d1f7eef [OpenMP] Added memory scope to atomic::inc API and used the device scope in reduction.
With https://reviews.llvm.org/D137524, memory scope and ordering
attributes are being used to generate the required instructions for
atomic inc/dec on AMDGPU. This patch adds the memory scope attribute to
the atomic::inc API and uses the device scope in reduction. Without
the device scope in atomic_inc, the default system scope leads to
unnecessary L2 write-backs/invalidates.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154172
2023-06-30 15:05:01 -04:00
Doru Bercea
04609b09e9 Enable up to 64 arguments for outlined regions in OpenMP device code.
Co-Author: Fabio Luporini <fabio@devitocodes.com>

Review: https://reviews.llvm.org/D150134
2023-05-24 10:31:39 -04:00
Joseph Huber
47800a12dc [OpenMP][NFC] clang-format the OpenMP device runtime
These files aren't fully formatted. I'm guessing this was a holdover
from when `clang-format` was totally broken for OpenMP offloading.
Format the files to be more consistent.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D151226
2023-05-23 11:19:09 -05:00
Shilei Tian
d4ecd1241c Revert "[OpenMP] Introduce kernel environment"
This reverts commit 35cfadfbe2.

It makes a couple of buildbots unhappy because of the following test failures:
- `Transforms/OpenMP/add_attributes.ll'`
- `mapping/declare_mapper_target_data.cpp` on AMDGPU
2023-04-22 20:56:35 -04:00
Shilei Tian
35cfadfbe2 [OpenMP] Introduce kernel environment
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.

This is a combination and refinement of patch series D116908, D116909, and D116910.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D142569
2023-04-22 20:46:38 -04:00
Johannes Doerfert
67fed132f3 [OpenMP] Ensure memory fences are created with barriers for AMDGPUs
It turns out that the __builtin_amdgcn_s_barrier() alone does not emit
a fence. We somehow got away with this and assumed it would work as it
(hopefully) is correct on the NVIDIA path where we just emit a
__syncthreads. After talking to @arsenm we now (mostly) align with the
OpenCL barrier implementation [1] and emit explicit fences for AMDGPUs.

It seems this was the underlying cause for #59759, but I am not 100%
certain. There is a chance this simply hides the problem.

Fixes: https://github.com/llvm/llvm-project/issues/59759

[1] 07b347366e/opencl/src/workgroup/wgbarrier.cl (L21)
2023-04-17 15:27:17 -07:00
Rafael A. Herrera Guaitero
64549f0903 [OpenMP][5.1] Fix parallel masked is ignored #59939
Code generation support for 'parallel masked' directive.

The `EmitOMPParallelMaskedDirective` was implemented.
In addition, the appropiate device functions were added.

Fix #59939.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D143527
2023-04-03 20:33:55 +00:00
Ye Luo
ead2d86ee9 Revert "[OpenMP] Ensure memory fences are created with barriers for AMDGPUs"
This reverts commit 36d6217c4e.
2023-03-24 21:10:03 -05:00
Ye Luo
36d6217c4e [OpenMP] Ensure memory fences are created with barriers for AMDGPUs
It turns out that the `__builtin_amdgcn_s_barrier()` alone does not emit
a fence. We somehow got away with this and assumed it would work as it
(hopefully) is correct on the NVIDIA path where we just emit a
`__syncthreads`. After talking to @arsenm we now (mostly) align with the
OpenCL barrier implementation [1] and emit explicit fences for AMDGPUs.

It seems this was the underlying cause for #59759, but I am not 100%
certain. There is a chance this simply hides the problem.

Fixes: https://github.com/llvm/llvm-project/issues/59759

[1] 07b347366e/opencl/src/workgroup/wgbarrier.cl (L21)

Reviewed By: ye-luo

Differential Revision: https://reviews.llvm.org/D145290
2023-03-24 20:36:51 -05:00
Joseph Huber
cfd18167c8 Revert "[Libomptarget] Use freestanding stdint.h header for DeviceRTL"
This patch breaks the handling of `printf` in the OpenMP library. Usiing
`-ffreestanding` prevents clang from emitting LLVM builtins, which we
use for OpenMP printing support. Shelve this until we have functioning
`printf` in the GPU `libc` and we can remove that code.

This reverts commit a92eaa3ebe.
2023-03-13 14:17:05 -05:00
Joseph Huber
a92eaa3ebe [Libomptarget] Use freestanding stdint.h header for DeviceRTL
The `stdint.h` header provides the standard types. Previously we used
`-nostdinc` and defined these ourselves. This patch switches to a
freestanding version which should work properly. Without
`-ffreestanding` the `stdint.h` header will include other libraries. But
in a freestanding environment it should work given the primitives.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D145963
2023-03-13 12:32:58 -05:00
Shilei Tian
693358d787 [OpenMP][DeviceRTL][NFC] Use OMPTgtExecModeFlags from llvm/include/llvm/Frontend/OpenMP/OMPDeviceConstants.h
This patch makes preparation for a series that will enable per-kernel information
used in both host and device runtime. Some variables/enums, such as `OMPTgtExecModeFlags`,
have to be shared by both of them. A new header `OMPDeviceConstants.h` is added,
containing code that will be shared by them. We will introduce more variables soon.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D142320
2023-01-22 19:10:54 -05:00
Shilei Tian
18959be84d [OpenMP][DeviceRTL] Fix the support for tasking on the device
This patch fixes the support for tasking on the device.

Note: AMDGPU doesn't support it yet because of no support for `malloc` and `free`.

Fix #59946.

```
➜  ./test_parallel_master_device
[OMPVV_RESULT: test_parallel_master_device.c] Test passed on the device.
```

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D141562
2023-01-11 23:50:35 -05:00
Johannes Doerfert
2b5a99b3d9 [OpenMP] Rename the _OMP namespace in the device runtime to ompx
Differential Revision: https://reviews.llvm.org/D140334
2022-12-19 14:43:59 -08:00
Johannes Doerfert
90609fb68f [OpenMP][NFCI] Remove effectively dead code in clang and the runtime
Differential Revision: https://reviews.llvm.org/D136903
2022-12-13 18:44:19 -08:00
Johannes Doerfert
f9c29878b0 Revert "[OpenMP][NFCI] Remove effectively dead code in clang and the runtime"
This reverts commit c1c8cbbf5f. One of the
tests seems to be flaky/non-deterministic.
2022-12-12 22:08:28 -08:00
Johannes Doerfert
c1c8cbbf5f [OpenMP][NFCI] Remove effectively dead code in clang and the runtime 2022-12-12 20:55:36 -08:00
Joseph Huber
9223315903 [DeviceRTL] Allow IsSPMDMode to be optimized out in LTO mode
A previous patch merged the static and bitcode versions of the
deviceRTL. We previously used the static library's separate compilation
to set a special flag that prevented `IsSPMDMode` from being put in the
used list and preventing it from being optimized out. When they were
merged we could no longer do this separate compilation that allowed
users of LTO to get more optimal code.

This patch rearranges the code. The `IsSPMDMode` global is now
transitively used by its inclusion in the changed `__keep_alive`
function. This allows us to then manually delete the `__keep_alive`
function from the module when building the static library via
`llvm-extract`. The result is that the bitcode library correctly will
maintain the needed shared state, while the static library will be able
to internalize it and optimize it out.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D135280
2022-10-05 14:40:01 -05:00
Johannes Doerfert
f8ee045c6d [OpenMP] Eliminate the ThreadStates array in favor of indirection
If we have thread states, the program is going to be rather slow. If we
don't, we want to avoid wasting shared memory. This patch introduces a
slight penalty (malloc + indirection) for the slow path and reduces
resource usage for the fast path.

Differential Revision: https://reviews.llvm.org/D135037
2022-10-04 20:27:34 -07:00
Johannes Doerfert
b113965073 [OpenMP] Introduce more atomic operations into the runtime
We should use OpenMP atomics but they don't take variable orderings.
Maybe we should expose all of this in the header but that solves only
part of the problem anyway.

Differential Revision: https://reviews.llvm.org/D135036
2022-10-04 20:20:55 -07:00
Johannes Doerfert
f85c1f3b7c [OpenMP] Replace __ATOMIC_XYZ with atomic::xyz for style
Also fixes one ordering argument not used.

Differential Revision: https://reviews.llvm.org/D135035
2022-10-04 19:43:30 -07:00
Johannes Doerfert
abbc3fa17b [OpenMP] Replace pointer comparison with isSharedMemPtr check
The pointer comparison was causing confusion for capture tracking, let's
avoid confusion.

Differential Revision: https://reviews.llvm.org/D135160
2022-10-04 19:24:22 -07:00
Dhruva Chakrabarti
839ac62c50 Revert "[OpenMP] Codegen aggregate for outlined function captures"
This reverts commit 7539e9cf81.
2022-09-15 03:08:46 +00:00
Giorgis Georgakoudis
7539e9cf81 [OpenMP] Codegen aggregate for outlined function captures
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3)  forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.

Reviewed By: jdoerfert, jhuber6, ABataev

Differential Revision: https://reviews.llvm.org/D102107
2022-09-15 00:54:05 +00:00
Joseph Huber
2b8f722e63 [OpenMP] Add option to assert no nested OpenMP parallelism on the GPU
The OpenMP device runtime needs to support the OpenMP standard. However
constructs like nested parallelism are very uncommon in real application
yet lead to complexity in the runtime that is sometimes difficult to
optimize out. As a stop-gap for performance we should supply an argument
that selectively disables this feature. This patch adds the
`-fopenmp-assume-no-nested-parallelism` argument which explicitly
disables the usee of nested parallelism in OpenMP.

Reviewed By: carlo.bertolli

Differential Revision: https://reviews.llvm.org/D132074
2022-08-23 14:09:51 -05:00
Shilei Tian
db5a2afa62 [OpenMP][DeviceRTL] Implement libc function memcmp
We will add some simple implementation of libc functions starting from
this patch, and the first one is `memcmp`, which is reported in #56929. Note that
`malloc` and `free` are not included in this patch because of the use of
`declare variant`. In the near future we will implement the two functions w/o
using any vendor provided function.

This fixes #56929.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D131182
2022-08-04 14:37:54 -04:00
Joseph Huber
b08369f7f2 Revert "[OpenMP] Remove noinline attributes in the device runtime"
The behaviour of this patch is not great, but it has some side-effects
that are required for OpenMPOpt to work. The problem is that when we use
`-mlink-builtin-bitcode` we only import used symbols from the runtime.
Then OpenMPOpt will insert calls to symbols that were not previously
included. This patch removed this implicit behaviour as these functions
were kept alive by the `noinline` simply because it kept calls to them
in the module. This caused regression in some tests that relied on some
OpenMPOpt passes without using LTO. Reverting for the LLVM15 release but
will try to fix it more correctly on main.

This reverts commit d61d72dae6.

Fixes #56752
2022-07-27 11:09:18 -04:00
Joseph Huber
d61d72dae6 [OpenMP] Remove noinline attributes in the device runtime
We previously used the `noinline` attributes to specify some defintions
which should be kept alive in the runtime. These were then stripped
immediately in the OpenMPOpt module pass. However, Since the changes in
D130298, we not explicitly state which functions will have external
visiblity in the bitcode library. Additionally the OpenMPOpt module pass
should run before the inliner pass, so this shouldn't make a difference
in whether or not the functions will be alive for the initial pass of
OpenMPOpt. This should simplify the interface, and additionally save
time spend on scanning funciton names for noinline.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D130368
2022-07-25 15:44:50 -04:00
Johannes Doerfert
d150152615 [OpenMP] Introduce more fine-grained control over the thread state use
We can help optimizations by making sure we use the team state whenever
it is clear there is no thread state. To this end we introduce a new
state flag (`state::HasThreadState`) and explicit control for the
`state::ValueRAII` helpers, including a dedicated "assert equal".

Differential Revision: https://reviews.llvm.org/D130113
2022-07-21 12:30:38 -05:00