Improve the check for whether a type can be passed by copy. Currently,
passing by copy is done via the OMP_MAP_LITERAL mapping, which can only
transfer as much data as can be contained in a pointer representation.
The HAS_DEVICE_ADDR clause indicates that the listed objects exist at
addresses that are valid device addresses. Specifically,
`has_device_addr(x)` means that (in C/C++ terms) `&x` is a device
address.
When entering a target region, `x` does not need to be allocated on the
device, or have its contents copied over (in the absence of additional
mapping clauses). Passing its address verbatim to the region for use is
sufficient, and is the intended goal of the clause.
Some Fortran objects use descriptors in their in-memory representation.
If `x` had a descriptor, both the descriptor and the contents of `x`
would be located in the device memory. However, the descriptors are
managed by the compiler, and can be regenerated at various points as
needed. The address of the effective descriptor may change, hence it's
not safe to pass the address of the descriptor to the target region.
Instead, the descriptor itself is always copied, but for objects like
`x`, no further mapping takes place (as this keeps the storage pointer
in the descriptor unchanged).
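For illustration only, a minimal Fortran sketch of the intended usage
(the names and the `use_device_addr` setup are assumptions, not part of
this change):
```
subroutine sketch(x)
  integer, pointer :: x(:)        ! descriptor-backed object
  !$omp target data map(tofrom: x) use_device_addr(x)
  !$omp target has_device_addr(x)
  ! The descriptor is copied in; the payload of x is neither allocated
  ! nor copied, and its storage pointer is left unchanged.
  x(1) = 42
  !$omp end target
  !$omp end target data
end subroutine sketch
```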
---------
Co-authored-by: Sergio Afonso <safonsof@amd.com>
This PR adds an initial implementation for the map modifiers `close`,
`present` and `ompx_hold`; primarily this just required adding the
appropriate map-type flags to the map-type bits. In the case of
`ompx_hold` it also required adding the map type to the OpenMP dialect.
`close` has a bit of a problem when utilised alongside the ALWAYS map
type on descriptors, so when we later apply `always` to descriptors (to
facilitate movement of descriptor information to the device for
consistency) we will likely have to make sure `close` and `always` are
not applied to a descriptor simultaneously; however, we may find an
alternative to this with further investigation. For the moment, a
TODO/note keeps track of it.
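As a rough sketch (not taken from this patch's tests; the array name is
illustrative), the modifiers appear in source like this:
```
real :: a(100)
!$omp target data map(close, tofrom: a)      ! close: place a near the device
  !$omp target map(present, tofrom: a)       ! present: runtime error if a is unmapped
  a = a + 1.0
  !$omp end target
!$omp end target data
!$omp target data map(ompx_hold, tofrom: a)  ! ompx_hold: LLVM extension, pins the mapping
!$omp end target data
```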
This patch adds OpenMPToLLVMIRTranslation support for the OpenMP Declare
Mapper directive.
Since both MLIR and Clang now support custom mappers, I've changed the
respective function params to no longer be optional as well.
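For reference, a hedged sketch of the kind of Fortran declare mapper
this translation covers (the type and mapper names are made up):
```
type t
  integer :: n
  real, allocatable :: payload(:)
end type t
! Named mapper: map the scalar and the first n elements of payload.
!$omp declare mapper(custom : t :: v) map(tofrom: v%n, v%payload(1:v%n))
```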
Depends on #121005
Summary:
The test failed because it no longer passed `-Rpass` by default without
LTO. I think that's desirable, as it matches the standard behavior.
This reverts commit 6fd99de318.
This pull request is the second part of an ongoing effort to extend PGO
instrumentation to GPU device code and depends on #76587. This PR makes
the following changes:
- Introduces `__llvm_write_custom_profile` to the PGO compiler-rt
library. This is an external function that can be used to write profiles
with custom data to target-specific files.
- Adds `__llvm_write_custom_profile` as a weak symbol to libomptarget so
that it can write the collected data to a profraw file.
- Adds a `PGODump` debug flag and only displays the dump when that flag
is set.
Currently if we generate code for the below target data map that uses an
optional mapping:
```
!$omp target data if(present(a)) map(alloc:a)
  do i = 1, 10
    a(i) = i
  end do
!$omp end target data
```
We yield an LLVM-IR error because the branch for the else path is not
generated. This occurs because we enter the NoDupPriv path of the
callback function when generating the else branch; however, the
emitBranch function needs the builder to be set to a block for it to
generate and link in a follow-up branch. The NoDupPriv path currently
doesn't do this. While it's not supposed to generate anything (as far as
I am aware), we still need to at least restore the builder's insertion
point so that it emits the appropriate follow-up branch. This avoids the
missing-terminator LLVM-IR verification error by correctly generating
the follow-up branch.
This PR aims to fix a mapping error when trying to map null (unallocated
or disassociated) members of a record type; the primary example at the
moment is allocatable/pointer components in Fortran. Mapping such a
member should be legal; writing to it within the target region without
it pointing to anything is not. A common Fortran OpenMP idiom where this
is useful can be found in the added Fortran offload example.
The runtime error arises when we try to map the pointer member utilising
a prescribed constant size that we receive from the lowered type,
resulting in the mapping of data that does not exist when nothing has
been allocated. The fix in this case is to emit a runtime check to see
whether the data has been allocated: if it hasn't been, we select a size
of 0; if it has, we emit the usual type size.
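A minimal sketch of the idiom in question (illustrative only, not the
added example itself):
```
type t
  real, allocatable :: data(:)
end type t
type(t) :: v                    ! v%data deliberately left unallocated
!$omp target map(tofrom: v)
  ! Legal: the member is mapped (with a runtime size of 0 here) as long
  ! as it is never dereferenced without being allocated.
  if (allocated(v%data)) v%data(1) = 1.0
!$omp end target
```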
Summary:
The previous offloading entry type did not fit the current use cases
very well. This widens it and adds a version field to prevent further
annoyances. It also includes the kind to better sort out who is using
it. The first 64 bits are reserved as zero so the OpenMP runtime can
detect the old format for binary compatibility.
Exposed by -Warray-bounds:
```
In file included from ../../../../../../../llvm/offload/plugins-nextgen/common/src/GlobalHandler.cpp:252:
../../../../../../../llvm/llvm/include/llvm/ProfileData/InstrProfData.inc:109:1: error: array index 4 is past the end of the array (that has type 'const std::remove_const<const uint16_t>::type[4]' (aka 'const unsigned short[4]')) [-Werror,-Warray-bounds]
  109 | INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last+1], \
      | ^                                             ~~~~~~~~~~~
../../../../../../../llvm/offload/plugins-nextgen/common/src/GlobalHandler.cpp:250:15: note: expanded from macro 'INSTR_PROF_DATA'
  250 |     outs() << ProfData.Name << " "; \
      |               ^        ~~~~
../../../../../../../llvm/llvm/include/llvm/ProfileData/InstrProfData.inc:109:1: note: array 'NumValueSites' declared here
  109 | INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last+1], \
      | ^
../../../../../../../llvm/offload/plugins-nextgen/common/include/GlobalHandler.h:62:3: note: expanded from macro 'INSTR_PROF_DATA'
   62 |   std::remove_const<Type>::type Name;
```
Avoid accessing out-of-bounds data by skipping the printing of array
data for now, as there is no simple way to print it without hardcoding
the NumValueSites field.
---------
Co-authored-by: Ethan Luis McDonough <ethanluismcdonough@gmail.com>
Currently the compiler will ICE during the device lowering pass on
programs like the following:
```
program main
  implicit none
  type i1_t
    integer :: val(1000)
  end type i1_t
  integer :: i
  type(i1_t), pointer :: newi1
  type(i1_t), pointer :: tab => null()
  integer, dimension(:), pointer :: tabval
!$omp THREADPRIVATE(tab)
  allocate(newi1)
  tab => newi1
  tab%val(:) = 1
  tabval => tab%val
!$omp target teams distribute parallel do
  do i = 1, 1000
    tabval(i) = i
  end do
!$omp end target teams distribute parallel do
end program main
```
This is due to the fact that THREADPRIVATE returns a result operation,
and this result can be used by other LLVM dialect (or other dialect)
operations. However, we currently skip the lowering of threadprivate on
device, so we never generate and bind an LLVM-IR value to the
threadprivate operation's result. When we later lower dependent LLVM
dialect operations, the required LLVM-IR value is missing; we try to
access and use it, and then ICE. The fix in this particular PR is to
allow compilation of threadprivate for device as well as host, and
simply treat the device compilation as a no-op, binding the LLVM-IR
value of the threadprivate input to its result with no alterations. This
allows the rest of the compilation to proceed, and we eventually discard
the host segment in any case.
The other possible solution I can think of is doing something similar to
Flang's passes that run prior to CodeGen to the LLVM dialect, where they
erase/no-op certain unrequired operations or transform them into a
lower-level series of operations. We would erase/no-op threadprivate on
device, as we'd never have these in target regions.
The main issue I can see with this is that we currently do not
specialise this stage based on whether we're compiling for device or
host, so it would set a precedent and add another point where the
separation between target and host compilation has to be understood. I
am also not sure we'd necessarily want to enforce this at the dialect
level, in case someone else wishes to add a different lowering or
translation flow. Another possible issue is that a target operation we
have/utilise might depend on the result of threadprivate, meaning we'd
not be allowed to entirely erase/no-op it; I am not aware of any
situation where this is an issue currently, though.
Attempt to prevent the following example from causing an assert or ICE:
```
subroutine test(a)
  use iso_fortran_env, only: real64   ! needed for real64 with implicit none
  implicit none
  integer :: i
  real(kind=real64), dimension(:) :: a
  real(kind=real64), dimension(size(a, 1)) :: b
!$omp target map(tofrom: b)
  do i = 1, 10
    b(i) = i
  end do
!$omp end target
end subroutine
```
Here we utilise a Fortran intrinsic (`size`) to calculate the extent of
an array at runtime and then map the resulting array to the device.
This re-applies #117867 with a small fix that hopefully prevents
buildbot failures. The fix is to avoid `dyn_cast` on the result of
`getOperation()`. Instead, we can assign the result to `mlir::ModuleOp`
directly, since the type of the operation is known statically (`OpT` in
`OperationPass`).
This is a starting PR to implicitly map allocatable record fields.
This PR contains the following changes:
1. Re-purposes some of the utils used in `Lower/OpenMP.cpp` so that
   these utils work on the `mlir::Value` level rather than the
   `semantics::Symbol` level. This takes one step towards enabling
   MLIR passes to more easily do some lowering themselves (e.g. creating
   `omp.map.bounds` ops for implicitly captured data like this PR
   does).
2. Adds support for implicitly capturing and mapping allocatable fields
   in record types.
There is still quite some distance to cover before we have full support
for this. I added a number of todos to guide further development.
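A hedged sketch of the pattern this enables (assuming the field is
allocated before the region; names are illustrative):
```
type record_t
  real, allocatable :: field(:)
end type record_t
type(record_t) :: r
allocate(r%field(10))
!$omp target
  r%field(1) = 3.14     ! r%field is captured and mapped implicitly
!$omp end target
```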
Co-authored-by: Andrew Gozillon <andrew.gozillon@amd.com>
Summary:
This is going to be deprecated in
https://github.com/llvm/llvm-project/pull/112849. This patch ports it to
use the builtin instead. This isn't a compile-time constant, so it could
slightly negatively affect codegen. There really should be an IR pass to
turn it into a constant when the function has known attributes.
Using the builtin is correct when we just use it to query the size, as
we do here. Obviously, guarding w32/w64 code with this check would be
broken.
This PR is one in a series of 3 that aim to add support for explicit
member mapping of allocatable components in derived types within
OpenMP+Fortran for Flang.
This PR provides all of the runtime tests that are currently
upstreamable; unfortunately, some of the other tests would require
linking the Fortran runtime for offload, which we currently do not do.
Regardless, this is plenty to ensure that the mapping works in most
cases.
This patch primarily updates the MapInfoFinalization pass to utilise the
BlockArgument interface. It also shuffles arguments newly added by the
MapInfoFinalization pass to the end of the BlockArg/relevant MapInfo
lists, instead of placing them just before the owning descriptor type.
During this it was noted that the use_device_ptr/addr handling of target
data was a little too order-dependent, so I've attempted to make it less
so, as we cannot depend on argument ordering being the same as Fortran's
for any future frontends.
This patch fixes the mapping and lowering of arrays with dynamic extents
and adds a new test for the same. The fix discards the incomplete
dynamic extent information and replaces it with just the base type. When
lowering to LLVM later, the bounds information is used instead.
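A minimal sketch of an array with a dynamic extent (names are
illustrative):
```
subroutine fill(n, a)
  integer, intent(in) :: n
  real :: a(n)                  ! extent known only at run time
  !$omp target map(tofrom: a)
  a(1) = 1.0
  !$omp end target
end subroutine fill
```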
It appears that the RUNTIMES build prefers the x86_64-unknown-linux-gnu
triple notation for the host. This fixes runtime/test breakages when
compiler-rt is used as the CLANG_DEFAULT_RTLIB.
This PR aims to unify the map argument generation behavior across both
the implicit capture (captured in a target region) and the explicit
capture (via the map clause); currently, the varPtr field of the MapInfo
for the same variable will differ depending on how it is captured. This
PR tries to align that across the generation of MapInfoOps in the OpenMP
lowering.
Currently, I have opted to utilise the rawInput (the input memref to an
HLFIR DeclareOp) as opposed to the addr field, which includes more
information. The side effect of this is that we have to deal with
BoxTypes less often, which results in simpler maps in these cases. The
negative side effect is that we don't have access to the bounds
information through the resulting value; however, I believe the bounds
information we require in our case is still appropriately stored in the
map bounds, and this seems to be the case from testing so far.
The other fix is for cases where we end up with a BoxType argument to a
function (certain assumed shape and size cases do this) that has no
fir.ref wrapping it. As we need the box to be a reference type to
actually utilise the operation to access the base address stored inside
and create the correct mappings, we currently generate an intermediate
allocation in these cases, store the box into it, and utilise this as
the map argument, as opposed to the original.
However, as we were not sharing the same intermediate allocation across
all of the maps for a variable, this resulted in errors in certain cases
when detaching/attaching the data, e.g. via enter and exit. This PR
adjusts this so the intermediate allocation is shared in these cases.
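A hedged sketch of the kind of argument this affects (an assumed-shape
dummy that lowers to a bare descriptor/BoxType; names are illustrative):
```
subroutine stage(a)
  real :: a(:)                  ! assumed shape: arrives as a descriptor
  !$omp target enter data map(to: a)
  !$omp target exit data map(from: a)
end subroutine stage
```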
Currently we only track intermediate allocations within the current
function scope, as opposed to module scope, primarily because the only
case I am aware of that requires this is where we pass certain types of
arguments to functions (so I opted to minimise the overhead of the pass
for now). It could likely be extended to module scope if we find other
cases where it is applicable and causing issues.
This pull request is a revised version of #76587. This pull request
fixes some build issues that were present in the previous version of
this change.
> This pull request is the first part of an ongoing effort to extend PGO
instrumentation to GPU device code. This PR makes the following changes:
>
> - Adds blank registration functions to device RTL
> - Gives PGO globals protected visibility when targeting a supported
GPU
> - Handles any addrspace casts for PGO calls
> - Implements PGO global extraction in GPU plugins (currently only
dumps info)
>
> These changes can be tested by supplying `-fprofile-instrument=clang`
while targeting a GPU.
When we use the device, e.g., with an API that interacts with it, we
need to ensure the image is loaded and the constructors are executed.
Two tests are included to verify that we 1) load images and run
constructors when needed, and 2) do so lazily, only if the device is
actually used.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Summary:
Now that we use the linker to do LTO / device linking, we need to inform
the `clang` invocation to use `-flto` so it forwards arguments like
`-On` correctly.
Through the new `-foffload-via-llvm` flag, CUDA kernels can now be
lowered to the LLVM/Offload API. On the Clang side, this is simply done
by using the OpenMP offload toolchain and emitting calls to `llvm*`
functions to orchestrate the kernel launch rather than `cuda*`
functions. These `llvm*` functions are implemented on top of the
existing LLVM/Offload API.
As we are about to redefine the Offload API, this will help us in the
design process as a second offload language.
We do not support any CUDA APIs yet; however, we could:
https://www.osti.gov/servlets/purl/1892137
For proper host execution we need to resurrect/rebase
https://tianshilei.me/wp-content/uploads/2021/12/llpp-2021.pdf
(which was designed for debugging).
```
❯❯❯ cat test.cu
#include <cstdio>   // for printf; added so the example compiles standalone

extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}

__global__ void square(int *A) { *A = 42; }

int main(int argc, char **argv) {
  int DevNo = 0;
  int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  square<<<1, 1>>>(Ptr);
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}

❯❯❯ clang++ test.cu -O3 -o test123 -foffload-via-llvm --offload-arch=native
❯❯❯ llvm-objdump --offloading test123

test123: file format elf64-x86-64

OFFLOADING IMAGE [0]:
kind            elf
arch            gfx90a
triple          amdgcn-amd-amdhsa
producer        openmp

❯❯❯ LIBOMPTARGET_INFO=16 ./test123
Ptr 0x155448ac8000, *Ptr 7
Ptr 0x155448ac8000, *Ptr 42
```
In debug mode there is a wrapper (the kernel) around the function in
which we generate the kernel code. We worked around this before to get
the correct kernel name, but now we properly distinguish the two so that
we can attach the launch bounds to the kernel, not the inner function.
This patch handles dependencies specified by the `depend` clause on an
OpenMP target construct. It does this much the same way clang does it by
materializing an OpenMP `task` that is tagged with the dependencies.
The following functions are relevant to this patch -
1) `createTarget` - This function itself is largely unchanged except
that it now accepts a vector of `DependData` objects that it simply
forwards to `emitTargetCall`
2) `emitTargetCall` - This function now checks whether an outer
target-task needs to be materialized (i.e. whether the `target`
construct has a `nowait` or a `depend` clause). If yes, it calls
`emitTargetTask` to do all the heavy lifting for creating and
dispatching the task.
3) `emitTargetTask` - The bulk of the change is here. See the large
comment explaining what it does at the beginning of this function.
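As a rough illustration (not from this patch's tests; names are made
up), a `target` construct carrying a `depend` clause like the following
is what gets wrapped in a target-task:
```
integer :: x
x = 0
!$omp target map(tofrom: x) depend(inout: x) nowait
  x = x + 1
!$omp end target
!$omp taskwait
```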
Many tests in the `offload` project have requirements defined by which
targets are not supported rather than which platforms are supported.
This patch aims to streamline the requirement definitions by adding four
new feature tags: `host`, `gpu`, `amdgpu`, and `nvidiagpu`.
This pull request is the first part of an ongoing effort to extend PGO
instrumentation to GPU device code. This PR makes the following changes:
- Adds blank registration functions to device RTL
- Gives PGO globals protected visibility when targeting a supported GPU
- Handles any addrspace casts for PGO calls
- Implements PGO global extraction in GPU plugins (currently only dumps
info)
These changes can be tested by supplying `-fprofile-instrument=clang`
while targeting a GPU.
This PR attempts to fix common block mapping, both for regular mapping
of these types and for when they have been marked as "declare target
link". It should allow correct mapping of both the members of a common
block and the full common block via its block symbol.
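A hedged sketch of the two mapping forms this should permit (block and
member names are illustrative):
```
subroutine demo()
  integer :: x, y
  common /vars/ x, y
  !$omp declare target link(/vars/)
  !$omp target map(tofrom: /vars/)   ! map the full block via its symbol
  x = 1
  !$omp end target
  !$omp target map(tofrom: y)        ! map an individual member
  y = 2
  !$omp end target
end subroutine demo
```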
The main changes were some adjustments to the Fortran OpenMP lowering to
HLFIR/FIR, to the lowering of the LLVM+OpenMP dialect to LLVM-IR, and to
the way we handle target kernel map argument rebinding inside of the
OMPIRBuilder.
For the Fortran OpenMP lowering there were two changes: one prevents the
implicit capture of common block members when the common block symbol
itself has been marked, and the other creates intermediate member
accesses inside of the target region to be used in place of those
external to the target region; this prevents external usages from
breaking the IsolatedFromAbove pact.
In the latter case, there was an adjustment to the size calculation for
types to better handle cases where we pass an array as the type of a map
(as opposed to the bounds and the type of the element), which occurs in
the case of common blocks. There is also some adjustment to how
handleDeclareTargetMapVar handles the renaming of declare target symbols
in the module to the reference pointer: it now only applies to those
within the kernel currently being generated, and we also replace
constants with instructions as necessary, as we cannot replace constant
uses with our reference pointer (non-constants and constants do not mix
nicely).
In the case of the OpenMPIRBuilder, some changes were made to defer
global symbol rebinding to kernel arguments until all other arguments
have been rebound. This makes sure we do not replace uses that may refer
to the global (e.g. a GEP) but are themselves actually a separate
argument that needs to be bound.
Currently "declare target to" still needs some work, but this may be the
case for all types in conjunction with "declare target to" at the
moment.
Sometimes it might be beneficial to spawn more thread blocks instead of
reusing the existing ones for multiple loop iterations.
**Alternatives considered:**
Make `DefaultNumBlocks` settable via an environment variable.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
This PR seeks to expand/replace the Constant -> Instruction conversion
that needs to occur inside of the OpenMP target kernel generation to
allow kernel argument replacement of uses within the kernel (we cannot
replace constant uses within constant expressions with non-constants).
It does so by making use of the new-ish utility
convertUsersOfConstantsToInstructions, a much more expansive version of
the smaller function I wrote previously; it effectively expands uses of
the input argument that are constant expressions into instructions so
that we can replace them with the appropriate kernel argument.
It also alters convertUsersOfConstantsToInstructions to optionally
restrict the replacement to a single function and to optionally leave
dead constants alone. The latter is necessary when lowering from MLIR,
as we cannot be sure we can remove the constants at this stage: even if
they are rewritten to instructions, the ModuleTranslation may maintain
links to the original constants and utilise them in further lowering
steps (when we're lowering the kernel, the module is still in the
process of being lowered), which can result in unusual ICEs later. These
dead constants can be tidied up later (and appear to be in subsequent
lowering, from checking with emit-llvm).
The current test is not really correct, because the mask is set to
0xffffffff even if it is on an AMDGPU whose wavefront size is 64.
Besides, `__AMDGCN_WAVEFRONT_SIZE` is not set on host compilation, so
the verification happens to work.