This PR attempts to fix common block mapping for regular mapping of
these types as well as when they have been marked as "declare target
link". This PR should allow correct mapping of both the members of a
common block and the full common block via its block symbol.
The main changes were some adjustments to the Fortran OpenMP lowering to
HLFIR/FIR, the lowering of the LLVM+OpenMP dialect to LLVM-IR and
adjustments to the way the we handle target kernel map argument
rebinding inside of the OMPIRBuilder.
For the Fortran OpenMP lowering were two changes, one to prevent the
implicit capture of common block members when the common block symbol
itself has been marked and the other creates intermediate member access
inside of the target region to be used in-place of those external to the
target region, this prevents external usages breaking the
IsolatedFromAbove pact.
In the latter case, there was an adjustment to the size calculation for
types to better handle cases where we pass an array as the type of a map
(as opposed to the bounds and the type of the element), which occurs in
the case of common blocks. There is also some adjustment to how
handleDeclareTargetMapVar handles renaming of declare target symbols in
the module to the reference pointer, now it will only apply to those
within the kernel that is currently being generated and we also perform
a modification to replace constants with instructions as necessary as we
cannot replace these with our reference pointer (non-constant and
constants do not mix nicely).
In the case of the OpenMPIRBuilder some changes were made to defer
global symbol rebinding to kernel arguments until all other arguments
have been rebound. This makes sure we do not replace uses that may refer
to the global (e.g. a GEP) but are themselves actually a separate
argument that needs bound.
Currently "declare target to" still needs some work, but this may be the
case for all types in conjunction with "declare target to" at the
moment.
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to setInsertPoint.
This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.
Fix for Fujitsu test suite test: 0275_0032.f90. The MLIR to LLVM
translation logic assumed that reduction arguments to an `omp.parallel`
op are always the last set of arguments to the op. However, this is a
wrong assumption since private args come afterward.
Fixes a crash uncovered by test 0007_0019.f90 in the Fujitsu test suite.
Previously, in the `PrivCB`, we cloned the `omp.private` op without
inserting it in the parent module of the original op. This causes issues
whenever there is an op that needs to lookup the parent module (e.g. for
symbol lookup). This PR fixes the issue by cloning in the parent module
instead of creating an orphaned op.
It was incorrect to set the insertion point to the init block after
inlining the initialization region because the code generated in the
init block depends upon the value yielded from the init region. When
there were multiple reduction initialization regions each with multiple
blocks, this could lead to the initilization region being inlined after
the init block which depends upon it.
Moving the insertion point to before inlining the initialization block
turned up further issues around the handling of the terminator for the
initialization block, which are also fixed here.
This fixes a bug in #92430 (but the affected code couldn't compile
before #92430 anyway).
Some of the OpenMP code can change the instruction pointed at by the
insertion point. This leads to an assert in the compiler about
BB->getParent() and IP->getParent() not matching.
The fix is to rebuild the insertionpoint from the block, rather than use
builder.restoreIP.
Also, move some of the alloca generation, rather than skipping back and
forth between insert points (and ensure all the allocas are done before
their users are created).
A simple test, mainly to ensure the minimal reproducer doesn't fail to
compile in the future is also added.
This PR attempts to consolidate the different topological sort utilities
into one place. It adds them to the analysis folder because the
`SliceAnalysis` uses some of these.
There are now two different sorting strategies:
1. Sort only according to SSA use-def chains
2. Sort while taking regions into account. This requires a much more
elaborate traversal and cannot be applied on graph regions that easily.
This additionally reimplements the region aware topological sorting
because the previous implementation had an exponential space complexity.
I'm open to suggestions on how to combine this further or how to fuse
the test passes.
This commit renames the name of the block sorting utility function to
`getBlocksSortedByDominance`. A topological order is not defined on a
general directed graph, so the previous name did not make sense.
Fixes#88935
Toggling reduction by-ref broke when multiple reduction clauses were
used. Decisions made for the by-ref status for later clauses could then
invalidate decisions for earlier clauses. For example,
```
reduction(+:scalar,scalar2) reduction(+:array)
```
The first clause would choose by value reduction and generate by-value
reduction regions, but then after this the second clause would force
by-ref to support the array argument. But by the time the second clause
is processed, the first clause has already had the wrong kind of
reduction regions generated.
This is solved by toggling whether a variable should be reduced by
reference per variable. In the above example, this allows only `array`
to be reduced by ref.
If there is only one non-terminator operation in the update region then
the update operation can be found and we can try to generate an
atomicrmw instruction. Otherwise use the cmpxchg loop.
Fixes#91929
The current comparator doesn't work correctly when two identical entries
with -1 are compared. The comparator returns `first` is case when
`aIndex == -1 && bIndex == -1`, but it should `continue` as those
indexes are the same.
llvm::sort requires the comparator to return `false` for equal elements,
otherwise it triggers `Your comparator is not a valid strict-weak
ordering` assert.
This patch seeks to refactor slightly and extend the current record type map
support that was put in place for Fortran's descriptor types to handle explicit
member mapping for record types at a single level of depth (the case of explicit
mapping of nested record types is currently unsupported).
This patch seeks to support this by extending the OpenMPToLLVMIRTranslation phase
to more generally support record types, building on the prior groundwork in the
Fortran allocatables/pointers patch. It now supports different kinds of record type
mapping, in this case full record type mapping and then explicit member mapping
in which there is a special case for certain types when mapped individually to not
require any parent map link in the kernel argument structure. To facilitate this
required:
* The movement of the setting of the map flag type "ptr_and_obj" to respective
frontends, now supporting it as a possible flag that can be read and printed
in mlir form. Some minor changes to declare target map type setting was
neccesary for this.
* The addition of a member index array operand, which tracks the position
of the member in the parent, required for caclulating the appropriate size
to offload to the target, alongside the parents offload pointer (always the
first member currently being mapped).
* A partial mapping attribute operand, to indicate if the entire record type is
being mapped or just member components, aiding the ability to lower
record types in the different manners that are possible.
* Refactoring bounds calculation for record types and general arrays to one
location (as well as load/store generation prior to assigning to the kernel
argument structure), as a side affect enter/exit/update/data mapping
should now be more correct and fully support bounds mapping, previously
this would have only worked for target.
Pull Request: https://github.com/llvm/llvm-project/pull/82852
Extends current support for delayed privatization during translation to
LLVM IR. This adds support for materlizaing the `dealloc` region in
`omp.private` ops when this region contains clean-up/deallocation logic
that needs to be executed at the end of the parallel region.
This changes the `OMPIRBuilder` slightly to execute the finalization
callback **after** the privatization callback. This allows us to collect
information about privatized variables on the MLIR and LLVM sides so
that we can properly emit deallocation logic.
This patch introduces minimal changes to the MLIR to LLVM IR translation
of `omp.wsloop` to support the loop wrapper approach.
There is `omp.loop_nest` related translation code that should be
extracted and shared among all loop operations (e.g. `omp.simd`). This
would possibly also help in the addition of support for compound
constructs later on. This first approach is only intended to keep things
running after the transition to loop wrappers and not to add support for
other use cases enabled by that transition.
This PR on its own will not pass premerge tests. All patches in the
stack are needed before it can be compiled and passes tests.
This patch updates the definition of `omp.simdloop` to enforce the
restrictions of a wrapper operation. It has been renamed to `omp.simd`,
to better reflect the naming used in the spec. All uses of "simdloop" in
function names have been updated accordingly.
Some changes to Flang lowering and OpenMP to LLVM IR translation are
introduced to prevent the introduction of compilation/test failures. The
eventual long term solution might be different.
I missed this before. For by-ref reductions, the private reduction
variable is a pointer to the pointer to the variable. So an extra load
is required to get the right value.
See the "red.private.value.n" loads in the reduction combiner region for
reference.
This patch separates the lowering dispatch for host and target devices.
For the target device, if the current operation is not a top-level
operation (e.g. omp.target) or is inside a target device code region it
will be ignored, since it belongs to the host code.
This is an alternative approach to #84611, the new test in this PR was
taken from there.
Currently, by-ref reductions will allocate the per-thread reduction
variable in the initialization region. Adding a cleanup region allows
that allocation to be undone. This will allow flang to support reduction
of arrays stored on the heap.
This conflation of allocation and initialization in the initialization
should be fixed in the future to better match the OpenMP standard, but
that is beyond the scope of this patch.
The argument to the initialization region of reduction declarations was
never mapped. This meant that if this argument was accessed inside the
initialization region, that mlir operation would be translated to an
llvm operation with a null argument (failing verification).
Adding the mapping ensures that the right LLVM value can be found when
inlining and converting the initialization region.
We have to separately establish and clean up these mappings for each use
of the reduction declaration because repeated usage of the same
declaration will inline it using a different concrete value for the
block argument.
This argument was never used previously because for most cases the
initialized value depends only upon the type of the reduction, not on
the original variable. It is needed now so that we can read the array
extents for the local copy from the mold.
Flang support for reductions on assumed shape arrays patch 2/3
This PR refactors bounds offsetting by combining the two differing
implementations (one applying to initial derived type member map
implementation for descriptors and the other for regular arrays,
effectively allocatable array vs regular array in fortran) now that it's
a little simpler to do.
The PR also moves the utilization of createAlteredByCaptureMap into
genMapInfoOp, where it will be correctly applied to all MapInfoData,
appropriately offsetting and altering Pointer data set in the kernel
argument structure on the host. This primarily means bounds offsets will
now correctly apply to enter/exit/update map clauses as opposed to just
the Target directive that is currently the case. A few fortran runtime
tests have been added to verify this new behavior.
This PR depends on: https://github.com/llvm/llvm-project/pull/84328 and
is an extraction of the larger derived type member map PR stack (so a
requirement for it to land).
Previously reduction variables were always passed by value into and out
of the initialization and combiner regions of the OpenMP reduction
declare operation.
This worked well for reductions of primitive types (and might perform
better than passing by reference). But passing by reference will be
useful for array and derived type reductions (e.g. to move allocation
inside of the init region).
Passing reductions by reference requires different LLVM-IR generation
when lowering from MLIR because some of the loads/stores/allocations
will now be moved inside of the init and combiner regions. This
alternate code generation is requested using a new attribute to
omp.wsloop and omp.parallel.
Existing lowerings from mlir are unaffected (these will continue to use
the by-value argument passing.
Flang will continue to pass by-value argument passing for trivial types
unless a (hidden) command line argument is supplied. Non-trivial types
will always use the by-ref lowering.
Array reductions are not ready yet (but are coming very soon). In the
meantime, this is tested by forcing existing reductions to use by-ref.
Commit series for by-ref OpenMP reductions 3/3
---------
Co-authored-by: Mats Petersson <mats.petersson@arm.com>
Use the new copyprivate list from omp.single to emit calls to
__kmpc_copyprivate, during the creation of the single operation
in OMPIRBuilder.
This is patch 4 of 4, to add support for COPYPRIVATE in Flang.
Original PR: https://github.com/llvm/llvm-project/pull/73128
Adds basic support for materializing delayed privatization. So far, the
restrictions on the implementation are:
- Only `private` clauses are supported (`firstprivate` support will be
added in a later PR).
Summary:
Currently, OpenMP handles the `omp requires` clause by emitting a global
constructor into the runtime for every translation unit that requires
it. However, this is not a great solution because it prevents us from
having a defined order in which the runtime is accessed and used.
This patch changes the approach to no longer use global constructors,
but to instead group the flag with the other offloading entires that we
already handle. This has the effect of still registering each flag per
requires TU, but now we have a single constructor that handles
everything.
This function removes support for the old `__tgt_register_requires` and
replaces it with a warning message. We just had a recent release, and
the OpenMP policy for the past four releases since we switched to LLVM
is that we do not provide strict backwards compatibility between major
LLVM releases now that the library is versioned. This means that a user
will need to recompile if they have an old binary that relied on
`register_requires` having the old behavior. It is important that we
actively deprecate this, as otherwise it would not solve the problem of
having no defined init and shutdown order for `libomptarget`. The
problem of `libomptarget` not having a define init and shutdown order
cascades into a lot of other issues so I have a strong incentive to be
rid of it.
It is worth noting that the current `__tgt_offload_entry` only has space
for a 32-bit integer here. I am planning to overhaul these at some point
as well.
This patch reworks the way that wsloop reduction operations function to
better match the expected semantics from the OpenMP specification,
following the rework of parallel reductions.
The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region. These block arguments follow
the loop control block arguments.
---------
Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
This patch reworks the way that parallel reduction operations function
to better match the expected semantics from the OpenMP specification.
Previously specific omp.reduction operations were used inside the
region, meaning that the reduction only applied when the correct
operation was used, whereas the specification states that any change to
the variable inside the region should be taken into account for the
reduction.
The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region.
This patch only makes the change for the `parallel` operation, the
change for the `wsloop` operation will be in a separate patch.
---------
Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
This patch seeks to add an initial lowering for pointers and allocatable variables
captured by implicit and explicit map in Flang OpenMP for Target operations that
take map clauses e.g. Target, Target Update. Target Exit/Enter etc.
Currently this is done by treating the type that lowers to a descriptor
(allocatable/pointer/assumed shape) as a map of a record type (e.g. a structure) as that's
effectively what descriptor types lower to in LLVM-IR and what they're represented as
in the Fortran runtime (written in C/C++). The descriptor effectively lowers to a structure
containing scalar and array elements that represent various aspects of the underlying
data being mapped (lower bound, upper bound, extent being the main ones of interest
in most cases) and a pointer to the allocated data. In this current iteration of the mapping
we map the structure in it's entirety and then attach the underlying data pointer and map
the data to the device, this allows most of the required data to be resident on the device
for use. Currently we do not support the addendum (another block of pointer data), but
it shouldn't be too difficult to extend this to support it.
The MapInfoOp generation for descriptor types is primarily handled in an optimization
pass, where it expands BoxType (descriptor types) map captures into two maps, one for
the structure (scalar elements) and the other for the pointer data (base address) and
links them in a Parent <-> Child relationship. The later lowering processes will then treat
them as a conjoined structure with a pointer member map.
This patch removes the omp.target module attribute, since the
information it held on the target CPU and features is available through
the fir.target_cpu and fir.target_features module attributes. Target
outlining during the MLIR to LLVM IR translation stage is updated, so
that these attributes, at that point available as llvm.func attributes,
are passed along to the newly created function.
Provides some context for failing to generate LLVM IR for `target
enter|exit|update` directives when `nowait` is provided. This is
directly helpful for flang users since they would get this error message
if they tried to use `nowait`. Before that we had a very generic
message.
This is a follow-up to https://github.com/llvm/llvm-project/pull/78269,
please only review the latest commit (the one with the same commit
message as the PR title).
After the removal of the OpenMP early outlining MLIR pass in #67319, the
`EarlyOutliningInterface` stopped doing any useful work. It used to be
necessary to tie the name of the function from which a target region was
outlined to that new function, so it would be used when translating to
LLVM IR in place of the outlined function's name.
This is not necessary anymore, so this patch removes all references to
this interface and uses of the `omp.outline_parent_name` discardable
attribute in tests.
If -nogpulib option is passed by the user, then the OpenMP device
runtime is not used and we should not emit globals to configure
debugging at compile-time for the device runtime.
Link to -nogpulib flag implementation for Clang:
https://reviews.llvm.org/D125314
Extend the `amendOperation` mechanism for translating dialect attributes
attached to operations from another dialect when translating MLIR to
LLVM IR. Previously, this mechanism would have no knowledge of the LLVM
IR instructions created for the given operation, making it impossible
for it to perform local modifications such as attaching operation-level
metadata. Collect instructions inserted by the LLVM IR builder and pass
them to `amendOperation`.
This patch removes the val field from the `MapInfoOp`.
Previously when lowering `TargetOp`, the bounds information for the
`BoxValues` were also being mapped. Instead these ops are now cloned
inside the target region to prevent mapping of non reference typed
values.
This block of code was here to create pseudo handling of implicit
captures in target regions to prevent gfortran test regressions and
allow certain pieces of code to function, however, with the introduction
of the IFA patch which adds proper handling of implicits by adding them
to the map operands list alongside explicit mappings at the initial
Fortran -> MLIR generation phase this should no longer be required and
may cause some adverse affects at worse in the future.
Currently, when deleting the device functions in the second stage of filtering during MLIR to LLVM translation we can end up with invalid calls to these functions. This is because of the removal of the EarlyOutliningPass which would have otherwise gotten rid of any such calls.
This patch aims to alter the function filtering pass in the following way:
- Any host function is completely removed.
- Call to the host function are also removed and their uses replaced with Undef values.
- Any host function with target region code is marked to be removed during the the second stage.
- Calls to such functions are still removed and their uses replaced with Undef values.
Co-authored-by: Sergio Afonso <sergio.afonsofumero@amd.com>