This patch separates the lowering dispatch for host and target devices.
For the target device, if the current operation is neither a top-level
operation (e.g. omp.target) nor nested inside a target device code region,
it is ignored, since it belongs to the host code.
This is an alternative approach to #84611; the new test in this PR was
taken from there.
Currently, by-ref reductions will allocate the per-thread reduction
variable in the initialization region. Adding a cleanup region allows
that allocation to be undone. This will allow flang to support reduction
of arrays stored on the heap.
This conflation of allocation and initialization in the initialization
region should be fixed in the future to better match the OpenMP standard,
but that is beyond the scope of this patch.
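As a rough sketch only (symbol names are hypothetical, and the declaration is
written with the omp.declare_reduction spelling), a by-ref reduction that
allocates its per-thread copy in the init region could then undo that
allocation in the new cleanup region; a scalar is used here for brevity, the
motivating case being heap-allocated array temporaries:
```
llvm.func @malloc(i64) -> !llvm.ptr
llvm.func @free(!llvm.ptr)

omp.declare_reduction @add_heap_f32 : !llvm.ptr
init {
^bb0(%mold: !llvm.ptr):
  // allocate and zero-initialize the per-thread copy
  %size = llvm.mlir.constant(4 : i64) : i64
  %priv = llvm.call @malloc(%size) : (i64) -> !llvm.ptr
  %zero = llvm.mlir.constant(0.0 : f32) : f32
  llvm.store %zero, %priv : f32, !llvm.ptr
  omp.yield(%priv : !llvm.ptr)
}
combiner {
^bb0(%lhs: !llvm.ptr, %rhs: !llvm.ptr):
  %a = llvm.load %lhs : !llvm.ptr -> f32
  %b = llvm.load %rhs : !llvm.ptr -> f32
  %sum = llvm.fadd %a, %b : f32
  llvm.store %sum, %lhs : f32, !llvm.ptr
  omp.yield(%lhs : !llvm.ptr)
}
cleanup {
^bb0(%priv: !llvm.ptr):
  // undo the allocation made in the init region
  llvm.call @free(%priv) : (!llvm.ptr) -> ()
  omp.yield
}
```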
The argument to the initialization region of reduction declarations was
never mapped. This meant that if this argument was accessed inside the
initialization region, that mlir operation would be translated to an
llvm operation with a null argument (failing verification).
Adding the mapping ensures that the right LLVM value can be found when
inlining and converting the initialization region.
We have to separately establish and clean up these mappings for each use
of the reduction declaration because repeated usage of the same
declaration will inline it using a different concrete value for the
block argument.
This argument was never used previously because for most cases the
initialized value depends only upon the type of the reduction, not on
the original variable. It is needed now so that we can read the array
extents for the local copy from the mold.
Flang support for reductions on assumed shape arrays patch 2/3
This PR refactors bounds offsetting by combining the two differing
implementations (one applying to the initial derived type member map
implementation for descriptors and the other to regular arrays,
effectively allocatable arrays vs regular arrays in Fortran), now that
it's a little simpler to do.
The PR also moves the use of createAlteredByCaptureMap into
genMapInfoOp, where it will be correctly applied to all MapInfoData,
appropriately offsetting and altering the Pointer data set in the kernel
argument structure on the host. This primarily means bounds offsets will
now correctly apply to enter/exit/update map clauses, as opposed to just
the Target directive as is currently the case. A few Fortran runtime
tests have been added to verify this new behavior.
This PR depends on: https://github.com/llvm/llvm-project/pull/84328 and
is an extraction of the larger derived type member map PR stack (so a
requirement for it to land).
Previously reduction variables were always passed by value into and out
of the initialization and combiner regions of the OpenMP reduction
declare operation.
This worked well for reductions of primitive types (and might perform
better than passing by reference). But passing by reference will be
useful for array and derived type reductions (e.g. to move allocation
inside of the init region).
Passing reductions by reference requires different LLVM-IR generation
when lowering from MLIR because some of the loads/stores/allocations
will now be moved inside of the init and combiner regions. This
alternate code generation is requested using a new attribute to
omp.wsloop and omp.parallel.
Existing lowerings from mlir are unaffected (these will continue to use
by-value argument passing).
Flang will continue to use by-value argument passing for trivial types
unless a (hidden) command line argument is supplied. Non-trivial types
will always use the by-ref lowering.
Array reductions are not ready yet (but are coming very soon). In the
meantime, this is tested by forcing existing reductions to use by-ref.
Commit series for by-ref OpenMP reductions 3/3
---------
Co-authored-by: Mats Petersson <mats.petersson@arm.com>
Use the new copyprivate list from omp.single to emit calls to
__kmpc_copyprivate during the creation of the single operation
in the OMPIRBuilder.
This is patch 4 of 4, to add support for COPYPRIVATE in Flang.
Original PR: https://github.com/llvm/llvm-project/pull/73128
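For context, a rough sketch (names hypothetical, clause syntax as I understand
it) of an omp.single carrying a copyprivate list; the listed variable and its
copy function are what end up forwarded to the __kmpc_copyprivate call:
```
llvm.func @copy_i32(%dst: !llvm.ptr, %src: !llvm.ptr) {
  %0 = llvm.load %src : !llvm.ptr -> i32
  llvm.store %0, %dst : i32, !llvm.ptr
  llvm.return
}

// ... inside a parallel region, with %x : !llvm.ptr ...
omp.single copyprivate(%x -> @copy_i32 : !llvm.ptr) {
  // the executing thread writes %x here; its value is then
  // broadcast to the other threads via __kmpc_copyprivate
  omp.terminator
}
```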
Adds basic support for materializing delayed privatization. So far, the
restrictions on the implementation are:
- Only `private` clauses are supported (`firstprivate` support will be
added in a later PR).
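A rough sketch of the concept (symbol names hypothetical, syntax as I
understand the delayed privatization design): the privatizer is described by a
separate op, and the clause on the consuming op is only materialized when
translating to LLVM IR:
```
omp.private {type = private} @x.privatizer : !llvm.ptr alloc {
^bb0(%orig: !llvm.ptr):
  // allocate the private copy; no initialization for `private`
  %c1 = llvm.mlir.constant(1 : i64) : i64
  %copy = llvm.alloca %c1 x i32 : (i64) -> !llvm.ptr
  omp.yield(%copy : !llvm.ptr)
}

// later, on the op that takes the clause:
omp.parallel private(@x.privatizer %x -> %arg0 : !llvm.ptr) {
  // %arg0 is the delayed-materialized private copy of %x
  omp.terminator
}
```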
GPU kernels generated via typical MLIR mechanisms make the assumption
that all workgroups are of uniform size, and so, as in OpenMP, it is
appropriate to set the "uniform-work-group-size"="true" attribute on
these functions by default. This commit makes that choice.
In the event it is needed, this commit adds
`rocdl.uniform_work_group_size` as an attribute that can be set on LLVM
functions to override the default.
In addition, add proper failure messages to the translation.
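For example, opting a single kernel out of the default might look roughly like
this (the rocdl.kernel marking is only illustrative, and the attribute is
assumed to be a boolean):
```
llvm.func @my_kernel() attributes {rocdl.kernel, rocdl.uniform_work_group_size = false} {
  llvm.return
}
```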
Summary:
Currently, OpenMP handles the `omp requires` clause by emitting a global
constructor into the runtime for every translation unit that requires
it. However, this is not a great solution because it prevents us from
having a defined order in which the runtime is accessed and used.
This patch changes the approach to no longer use global constructors,
but to instead group the flag with the other offloading entries that we
already handle. This has the effect of still registering each flag per
requires TU, but now we have a single constructor that handles
everything.
This patch removes support for the old `__tgt_register_requires` and
replaces it with a warning message. We just had a recent release, and
the OpenMP policy for the past four releases since we switched to LLVM
is that we do not provide strict backwards compatibility between major
LLVM releases now that the library is versioned. This means that a user
will need to recompile if they have an old binary that relied on
`register_requires` having the old behavior. It is important that we
actively deprecate this, as otherwise it would not solve the problem of
having no defined init and shutdown order for `libomptarget`. The
problem of `libomptarget` not having a defined init and shutdown order
cascades into a lot of other issues so I have a strong incentive to be
rid of it.
It is worth noting that the current `__tgt_offload_entry` only has space
for a 32-bit integer here. I am planning to overhaul these at some point
as well.
This is a new ODS feature that allows dialects to define a list of
key/value pairs representing an attribute type and a name.
This will generate helper classes on the dialect to be able to
manage discardable attributes on operations in a type safe way.
For example the `test` dialect can define:
```
let discardableAttrs = (ins
    "mlir::IntegerAttr":$discardable_attr_key
);
```
And the following will be generated in the TestDialect class:
```
/// Helper to manage the discardable attribute `discardable_attr_key`.
class DiscardableAttrKeyAttrHelper {
  ::mlir::StringAttr name;

public:
  static constexpr ::llvm::StringLiteral getNameStr() {
    return "test.discardable_attr_key";
  }
  constexpr ::mlir::StringAttr getName() { return name; }

  DiscardableAttrKeyAttrHelper(::mlir::MLIRContext *ctx)
      : name(::mlir::StringAttr::get(ctx, getNameStr())) {}

  mlir::IntegerAttr getAttr(::mlir::Operation *op) {
    return op->getAttrOfType<mlir::IntegerAttr>(name);
  }
  void setAttr(::mlir::Operation *op, mlir::IntegerAttr val) {
    op->setAttr(name, val);
  }
  bool isAttrPresent(::mlir::Operation *op) {
    return op->hasAttrOfType<mlir::IntegerAttr>(name);
  }
  void removeAttr(::mlir::Operation *op) {
    assert(op->hasAttrOfType<mlir::IntegerAttr>(name));
    op->removeAttr(name);
  }
};

DiscardableAttrKeyAttrHelper getDiscardableAttrKeyAttrHelper() {
  return discardableAttrKeyAttrName;
}
```
User code having an instance of the TestDialect can then manipulate this
attribute on operations using:
```
auto helper = testDialect.getDiscardableAttrKeyAttrHelper();
helper.setAttr(op, value);
helper.isAttrPresent(op);
...
```
This patch reworks the way that wsloop reduction operations function to
better match the expected semantics from the OpenMP specification,
following the rework of parallel reductions.
The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region. These block arguments follow
the loop control block arguments.
---------
Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
Add support for attribute nvvm.grid_constant on LLVM function arguments.
The attribute can be attached only to arguments of type llvm.ptr that
have the llvm.byval attribute.
Generate LLVM metadata for functions with nvvm.grid_constant arguments.
The metadata node is a list of integers, where each integer n denotes
that the nth parameter has the
grid_constant annotation (numbering from 1). The generated metadata node
will be handled by the NVVM compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#supported-properties
for documentation on grid_constant property.
This patch also adds convertParameterAttr to
LLVMTranslationDialectInterface to support the translation of derived
dialect attributes on function parameters.
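For illustration, an LLVM dialect function carrying the attribute might look
roughly like this (the byval element type and the nvvm.kernel marking are
illustrative):
```
llvm.func @kernel(%arg0: !llvm.ptr {llvm.byval = i32, nvvm.grid_constant}) attributes {nvvm.kernel} {
  llvm.return
}
```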
This patch reworks the way that parallel reduction operations function
to better match the expected semantics from the OpenMP specification.
Previously specific omp.reduction operations were used inside the
region, meaning that the reduction only applied when the correct
operation was used, whereas the specification states that any change to
the variable inside the region should be taken into account for the
reduction.
The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region.
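A rough sketch of the new form (the @add_f32 declaration and the reduction
variable are hypothetical): the private copy enters the region as a block
argument and is used like an ordinary variable:
```
omp.parallel reduction(@add_f32 %sum -> %prv : !llvm.ptr) {
  // all updates go through the private copy %prv; it is combined
  // into %sum using @add_f32's combiner when the region ends
  %c1 = llvm.mlir.constant(1.0 : f32) : f32
  %old = llvm.load %prv : !llvm.ptr -> f32
  %new = llvm.fadd %old, %c1 : f32
  llvm.store %new, %prv : f32, !llvm.ptr
  omp.terminator
}
```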
This patch only makes the change for the `parallel` operation; the
change for the `wsloop` operation will be in a separate patch.
---------
Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
This patch seeks to add an initial lowering for pointers and allocatable variables
captured by implicit and explicit map in Flang OpenMP for Target operations that
take map clauses, e.g. Target, Target Update, Target Exit/Enter, etc.
Currently this is done by treating the type that lowers to a descriptor
(allocatable/pointer/assumed shape) as a map of a record type (e.g. a structure) as that's
effectively what descriptor types lower to in LLVM-IR and what they're represented as
in the Fortran runtime (written in C/C++). The descriptor effectively lowers to a structure
containing scalar and array elements that represent various aspects of the underlying
data being mapped (lower bound, upper bound, extent being the main ones of interest
in most cases) and a pointer to the allocated data. In this current iteration of the mapping
we map the structure in its entirety and then attach the underlying data pointer and map
the data to the device; this allows most of the required data to be resident on the device
for use. Currently we do not support the addendum (another block of pointer data), but
it shouldn't be too difficult to extend this to support it.
The MapInfoOp generation for descriptor types is primarily handled in an optimization
pass, where it expands BoxType (descriptor types) map captures into two maps, one for
the structure (scalar elements) and the other for the pointer data (base address) and
links them in a Parent <-> Child relationship. The later lowering processes will then treat
them as a conjoined structure with a pointer member map.
This patch removes the omp.target module attribute, since the
information it held on the target CPU and features is available through
the fir.target_cpu and fir.target_features module attributes. Target
outlining during the MLIR to LLVM IR translation stage is updated, so
that these attributes, at that point available as llvm.func attributes,
are passed along to the newly created function.
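For reference, the information now travels on module attributes that look
roughly like this (the CPU and feature values are made up, and
fir.target_features is assumed to reuse the LLVM dialect's target_features
attribute):
```
module attributes {fir.target_cpu = "gfx90a",
                   fir.target_features = #llvm.target_features<["+gfx9-insts", "+wavefrontsize64"]>} {
}
```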
Provides some context for failing to generate LLVM IR for `target
enter|exit|update` directives when `nowait` is provided. This is
directly helpful for flang users since they would get this error message
if they tried to use `nowait`. Before that we had a very generic
message.
This is a follow-up to https://github.com/llvm/llvm-project/pull/78269,
please only review the latest commit (the one with the same commit
message as the PR title).
After the removal of the OpenMP early outlining MLIR pass in #67319, the
`EarlyOutliningInterface` stopped doing any useful work. It used to be
necessary to tie the name of the function from which a target region was
outlined to that new function, so it would be used when translating to
LLVM IR in place of the outlined function's name.
This is not necessary anymore, so this patch removes all references to
this interface and uses of the `omp.outline_parent_name` discardable
attribute in tests.
If the -nogpulib option is passed by the user, then the OpenMP device
runtime is not used and we should not emit globals to configure
debugging at compile-time for the device runtime.
Link to -nogpulib flag implementation for Clang:
https://reviews.llvm.org/D125314
Extend the `amendOperation` mechanism for translating dialect attributes
attached to operations from another dialect when translating MLIR to
LLVM IR. Previously, this mechanism would have no knowledge of the LLVM
IR instructions created for the given operation, making it impossible
for it to perform local modifications such as attaching operation-level
metadata. Collect instructions inserted by the LLVM IR builder and pass
them to `amendOperation`.
GPU Dialect lowering to SYCL runtime is driven by spirv.target_env
attached to gpu.module. As a result of this, spirv.target_env remains as
an input to LLVMIR Translation.
A SPIRVToLLVMIRTranslation without any actual translation is added to
avoid an unregistered error in mlir-cpu-runner.
SelectObjectAttr.cpp is updated to:
1) Pass binary size argument to getModuleLoadFn
2) Pass parameter count to getKernelLaunchFn
This change does not impact CUDA and ROCM usage since both
mlir_cuda_runtime and mlir_rocm_runtime are already updated to accept
and ignore the extra arguments.
NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clustering of Cooperative
Thread Arrays (CTA) to synchronize and communicate through shared memory
while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms
that we have in the GPU dialect, such as outlining and kernel launching,
making it a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
This patch removes the val field from the `MapInfoOp`.
Previously when lowering `TargetOp`, the bounds information for the
`BoxValues` was also being mapped. Instead these ops are now cloned
inside the target region to prevent mapping of non-reference-typed
values.
This block of code was here to create pseudo handling of implicit
captures in target regions, to prevent gfortran test regressions and
allow certain pieces of code to function. However, with the introduction
of the IFA patch, which adds proper handling of implicits by adding them
to the map operands list alongside explicit mappings at the initial
Fortran -> MLIR generation phase, this should no longer be required and
may at worst cause some adverse effects in the future.
Currently, when deleting the device functions in the second stage of filtering during MLIR to LLVM translation we can end up with invalid calls to these functions. This is because of the removal of the EarlyOutliningPass which would have otherwise gotten rid of any such calls.
This patch aims to alter the function filtering pass in the following way:
- Any host function is completely removed.
- Calls to host functions are also removed and their uses replaced with Undef values.
- Any host function with target region code is marked to be removed during the second stage.
- Calls to such functions are still removed and their uses replaced with Undef values.
Co-authored-by: Sergio Afonso <sergio.afonsofumero@amd.com>
This patch adds a `llvm.linker.options` operation taking a list of
strings to pass to the linker when the resulting object file is linked.
This is particularly useful on Windows to specify the CRT version to use
for this object file.
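A minimal sketch of the intended use (assuming the op's assembly spelling is
llvm.linker_options, which is emitted as !llvm.linker.options module-level
metadata in LLVM IR):
```
module {
  // ask the Windows linker to pull in the static CRT for this object
  llvm.linker_options ["/DEFAULTLIB:", "libcmt"]
}
```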
Despite the fact that the LLVM dialect’s `FuncOp` already supports
calling conventions, there was yet no support for them in the ops that
actually perform function calls, which led to incorrect LLVM IR being
generated if one actually tried setting a `FuncOp`’s calling convention
to anything other than `ccc`.
This commit adds support for calling conventions to `LLVM::CallOp` and
`LLVM::InvokeOp` and makes sure that calling conventions are parsed,
printed, and lowered appropriately.
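A small sketch of what now round-trips and lowers correctly (function names
hypothetical):
```
llvm.func fastcc @callee()

llvm.func @caller() {
  // the call site carries the same calling convention as the callee
  llvm.call fastcc @callee() : () -> ()
  llvm.return
}
```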
This patch removes the OMPEarlyOutliningPass as it is no longer required. The implicit map operand capture has now been moved to the PFT lowering stage.
Depends on #67318.
This patch adds the MLIR translation changes required to add the IsolatedFromAbove and OutlineableOpenMPOpInterface traits to omp.target. It links the newly added block arguments to their corresponding llvm values.
Depends on #67164.
This was a regression introduced by myself in:
6a62707c04
where I too hastily removed the basic handling of implicit captures
we currently have. This will be superseded by all implicit captures
being added to target operations' map_info entries in a soon-to-land
series of patches; however, that is currently not the case, so we must
continue to do some basic handling of these captures for the time
being. This patch re-adds that behaviour to avoid regressions.
Unfortunately this means some test changes as well, because
getUsedValuesDefinedAbove grabs constants used outside
of the target region, which aren't handled particularly
well currently.
This patch seeks to add initial lowering of OpenMP array sections within
target region map clauses from MLIR to LLVM IR.
This patch seeks to support fixed-size contiguous arrays initially (I
don't think OpenMP supports anything other than contiguous sections from
my reading, but I could be wrong), before looking toward assumed-size
and assumed-shape arrays. The patch also currently does not include
stride; it's left for future work.
That said, assumed size works in some fashion (dummy arguments) with some
minor alterations to the OMPEarlyOutliner, so it is possible that changes
made in the IsolatedFromAbove series may allow this to work with no
further patches required.
It utilises the generated omp.bounds to calculate the size of the mapped
OpenMP array (both for sectioned and un-sectioned arrays) as well as the
offset to be passed to the kernel argument structure.
Alongside these changes some refactoring of how map data is handled is
attempted, using a new MapData structure to keep track of information
utilised in the lowering of mapped values.
A more complex createDeviceArgumentAccessor that utilises capture kinds,
similarly to (and loosely based on) Clang, to generate different kernel
argument accesses is also added.
A similar function for altering how the kernel argument is passed to the
kernel argument structure on the host is also utilised
(createAlteredByCaptureMap), which allows modification of the
pointer/basePointer based on their capture (and bounds information).
It's of note that ByRef is the default for explicit mappings and ByCopy
will be the default for implicit captures, so the former is currently
tested in this patch and the latter is not, for the moment.
Remove usage of getElementType in OpenMPTranslation to pave the way for
switching to opaque pointers in MLIR and Flang. The approach chosen
stores the elementType in a new field in MapInfo called varType. A
similar approach was chosen for AtomicReadOp in
81767f52f4