Commit Graph

584 Commits

Author SHA1 Message Date
Joseph Huber
470aefb240 [Offload][NFC] Remove omp_ prefix from offloading entries (#88071)
Summary:
These entires are generic for offloading with the new driver now. Having
the `omp` prefix was a historical artifact and is confusing when used
for CUDA. This patch just renames them for now, future patches will
rework the binary format to make it more common.
2024-04-09 15:50:15 -05:00
Billy Zhu
6f6336858e [MLIR][LLVM] Add DebugNameTableKind to DICompileUnit (#87974)
Add the DebugNameTableKind field to DICompileUnit, along with its
importer & exporter.
2024-04-09 06:18:07 -07:00
Matthias Braun
4a812b5912 Verify threadlocal_address constraints (#87841)
Check invariants for `llvm.threadlocal.address` intrinsic in IR
Verifier.
2024-04-08 17:47:57 -07:00
Billy Zhu
81a7b6454e [MLIR][LLVM] Recursion importer handle repeated self-references (#87295)
Followup to this discussion:
https://github.com/llvm/llvm-project/pull/80251#discussion_r1535599920.

The previous debug importer was correct but inefficient. For cases with
mutual recursion that contain more than one back-edge, each back-edge
would result in a new translated instance. This is because the previous
implementation never caches any translated result with unbounded
self-references. This means all translation inside a recursive context
is performed from scratch, which will incur repeated run-time cost as
well as repeated attribute sub-trees in the translated IR (differing
only in their `recId`s).

This PR refactors the importer to handle caching inside a recursive
context.
- In the presence of unbound self-refs, the translation result is cached
in a separate cache that keeps track of the set of dependent unbound
self-refs.
- A dependent cache entry is valid only when all the unbound self-refs
are in scope. Whenever a cached entry goes out of scope, it will be
removed the next time it is looked up.
2024-04-08 01:09:54 -07:00
Fabian Mora
a2c4b7c8e2 [mlir] Add convertInstruction and getSupportedInstructions to LLVMImportInterface (#86799)
This patch adds the `convertInstruction` and `getSupportedInstructions`
to `LLVMImportInterface`, allowing any non-LLVM dialect to specify how
to import LLVM IR instructions and overriding the default import of LLVM instructions.
2024-04-07 08:46:21 +02:00
Jan Leyonberg
9708d09003 [MLIR][OpenMP] Skip host omp ops when compiling for the target device (#85239)
This patch separates the lowering dispatch for host and target devices.
For the target device, if the current operation is not a top-level
operation (e.g. omp.target) or is inside a target device code region it
will be ignored, since it belongs to the host code.


This is an alternative approach to #84611, the new test in this PR was
taken from there.
2024-04-05 09:25:28 -04:00
Tom Eccles
cc34ad91f0 [MLIR][OpenMP] Add cleanup region to omp.declare_reduction (#87377)
Currently, by-ref reductions will allocate the per-thread reduction
variable in the initialization region. Adding a cleanup region allows
that allocation to be undone. This will allow flang to support reduction
of arrays stored on the heap.

This conflation of allocation and initialization in the initialization
should be fixed in the future to better match the OpenMP standard, but
that is beyond the scope of this patch.
2024-04-04 11:19:42 +01:00
Tom Eccles
099ecdf1ec [mlir][OpenMP] map argument to reduction initialization region (#86979)
The argument to the initialization region of reduction declarations was
never mapped. This meant that if this argument was accessed inside the
initialization region, that mlir operation would be translated to an
llvm operation with a null argument (failing verification).

Adding the mapping ensures that the right LLVM value can be found when
inlining and converting the initialization region.

We have to separately establish and clean up these mappings for each use
of the reduction declaration because repeated usage of the same
declaration will inline it using a different concrete value for the
block argument.

This argument was never used previously because for most cases the
initialized value depends only upon the type of the reduction, not on
the original variable. It is needed now so that we can read the array
extents for the local copy from the mold.

Flang support for reductions on assumed shape arrays patch 2/3
2024-04-04 10:55:42 +01:00
Alex Voicu
ab7dba233a [CodeGen][LLVM] Make the va_list related intrinsics generic. (#85460)
Currently, the builtins used for implementing `va_list` handling
unconditionally take their arguments as unqualified `ptr`s i.e. pointers
to AS 0. This does not work for targets where the default AS is not 0 or
AS 0 is not a viable AS (for example, a target might choose 0 to
represent the constant address space). This patch changes the builtins'
signature to take generic `anyptr` args, which corrects this issue. It
is noisy due to the number of tests affected. A test for an upstream
target which does not use 0 as its default AS (SPIRV for HIP device
compilations) is added as well.
2024-03-27 11:41:34 +00:00
Victor Perez
77cbc9bf60 [MLIR][LLVM] Add llvm.experimental.constrained.fptrunc operation (#86260)
Add operation mapping to the LLVM
`llvm.experimental.constrained.fptrunc.*` intrinsic.

The new operation implements the new
`LLVM::FPExceptionBehaviorOpInterface` and
`LLVM::RoundingModeOpInterface` interfaces.

---------

Signed-off-by: Victor Perez <victor.perez@codeplay.com>
2024-03-26 11:02:50 +01:00
Tobias Gysi
aeeb7d566c [MLIR][LLVM] Make subprogram flags optional (#86433)
This revision makes the subprogramFlags field in the DISubprogrammAttr
optional. This is necessary since the DISubprogram attached to a
declaration may have none of the subprogram flags set.
2024-03-25 12:19:22 +01:00
agozillon
8612fa0d84 [MLIR][OpenMP] Refactor bounds offsetting and fix to apply to all directives (#84349)
This PR refactors bounds offsetting by combining the two differing
implementations (one applying to initial derived type member map
implementation for descriptors and the other for regular arrays,
effectively allocatable array vs regular array in fortran) now that it's
a little simpler to do.

The PR also moves the utilization of createAlteredByCaptureMap into
genMapInfoOp, where it will be correctly applied to all MapInfoData,
appropriately offsetting and altering Pointer data set in the kernel
argument structure on the host. This primarily means bounds offsets will
now correctly apply to enter/exit/update map clauses as opposed to just
the Target directive that is currently the case. A few fortran runtime
tests have been added to verify this new behavior.

This PR depends on: https://github.com/llvm/llvm-project/pull/84328 and
is an extraction of the larger derived type member map PR stack (so a
requirement for it to land).
2024-03-22 15:32:39 +01:00
Tobias Gysi
adda597388 [MLIR] Add index bitwidth to the DataLayout (#85927)
When importing from LLVM IR the data layout of all pointer types
contains an index bitwidth that should be used for index computations.
This revision adds a getter to the DataLayout that provides access to
the already stored bitwidth. The function returns an optional since only
pointer-like types have an index bitwidth. Querying the bitwidth of a
non-pointer type returns std::nullopt.

The new function works for the built-in Index type and, using a type
interface, for the LLVMPointerType.
2024-03-21 09:07:57 +01:00
Christian Ulmann
4095a326c0 [MLIR][LLVM] Add extraData field to the DIDerivedType attribute (#85935)
This commit extends the DIDerivedTypeAttr with the `extraData` field.
For now, the type of it is limited to be a `DINodeAttr`, as extending
the debug metadata handling to support arbitrary metadata nodes does not
seem to be necessary so far.
2024-03-20 16:08:38 +01:00
Sergio Afonso
d84252e064 [MLIR][OpenMP] NFC: Uniformize OpenMP ops names (#85393)
This patch proposes the renaming of certain OpenMP dialect operations with the
goal of improving readability and following a uniform naming convention for
MLIR operations and associated classes. In particular, the following operations
are renamed:

- `omp.map_info` -> `omp.map.info`
- `omp.target_update_data` -> `omp.target_update`
- `omp.ordered_region` -> `omp.ordered.region`
- `omp.cancellationpoint` -> `omp.cancellation_point`
- `omp.bounds` -> `omp.map.bounds`
- `omp.reduction.declare` -> `omp.declare_reduction`

Also, the following MLIR operation classes have been renamed:

- `omp::TaskLoopOp` -> `omp::TaskloopOp`
- `omp::TaskGroupOp` -> `omp::TaskgroupOp`
- `omp::DataBoundsOp` -> `omp::MapBoundsOp`
- `omp::DataOp` -> `omp::TargetDataOp`
- `omp::EnterDataOp` -> `omp::TargetEnterDataOp`
- `omp::ExitDataOp` -> `omp::TargetExitDataOp`
- `omp::UpdateDataOp` -> `omp::TargetUpdateOp`
- `omp::ReductionDeclareOp` -> `omp::DeclareReductionOp`
- `omp::WsLoopOp` -> `omp::WsloopOp`
2024-03-20 11:19:38 +00:00
Tom Eccles
27534d69e2 [mlir][LLVM] erase call mappings in forgetMapping() (#84955)
It looks like the mappings for call instructions were forgotten here.
This fixes a bug in OpenMP when in-lining a region containing call
operations multiple times.

OpenMP array reductions 4/6
Previous PR: https://github.com/llvm/llvm-project/pull/84954
Next PR: https://github.com/llvm/llvm-project/pull/84957
2024-03-20 10:03:26 +00:00
Billy Zhu
c242302492 [MLIR][LLVM] DI Recursive Type fix for recursion via scope of composites (#85850)
Fixes this bug for the previous recursive DI type PR:
https://github.com/llvm/llvm-project/pull/80251#issuecomment-2007254788.

Drawing inspiration from how clang uses DIBuilder to build forward
decls, this PR changes how placeholders are created & updated. Instead
of requiring each recursive DIType to do in-place mutation, we simply
ask for a temporary node as the placeholder, and run RAUW at the end
when the concrete node is translated.

This has the side effect of simplifying what's needed to add recursion
support for a type. Now only one additional method needs to be created
for exporting. Concretely, for this PR, `translateImpl` for
DICompositeType is back to the state it was before the previous PR, and
the only net addition for DICompositeType is `translateTemporaryImpl`.

---------

Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
2024-03-19 22:49:50 -07:00
Billy Zhu
1e8dad3bef [MLIR][LLVM] Support Recursive DITypes (#80251)
Following the discussion from [this
thread](https://discourse.llvm.org/t/handling-cyclic-dependencies-in-debug-info/67526/11),
this PR adds support for recursive DITypes.

This PR adds:
1. DIRecursiveTypeAttrInterface: An interface that DITypeAttrs can
implement to indicate that it supports recursion. See full description
in code.
2. Importer & exporter support (The only DITypeAttr that implements the
interface is DICompositeTypeAttr, so the exporter is only implemented
for composites too. There will be two methods that each llvm DI type
that supports mutation needs to implement since there's nothing
general).

---------

Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
2024-03-15 09:58:25 -07:00
Zahi Moudallal
8481fb1698 [MLIR][ROCDL] Fix BallotOp LLVM translation and add doc (#85116)
This modifies the return type of the intrinsic call to handle 32 and 64
bits
properly and document the MLIR operation.
2024-03-14 08:43:48 -07:00
Tom Eccles
f46f5a01f4 [flang][OpenMP][OMPIRBuilder][mlir] Optionally pass reduction vars by ref (#84304)
Previously reduction variables were always passed by value into and out
of the initialization and combiner regions of the OpenMP reduction
declare operation.

This worked well for reductions of primitive types (and might perform
better than passing by reference). But passing by reference will be
useful for array and derived type reductions (e.g. to move allocation
inside of the init region).

Passing reductions by reference requires different LLVM-IR generation
when lowering from MLIR because some of the loads/stores/allocations
will now be moved inside of the init and combiner regions. This
alternate code generation is requested using a new attribute to
omp.wsloop and omp.parallel.

Existing lowerings from mlir are unaffected (these will continue to use
the by-value argument passing.

Flang will continue to pass by-value argument passing for trivial types
unless a (hidden) command line argument is supplied. Non-trivial types
will always use the by-ref lowering.

Array reductions are not ready yet (but are coming very soon). In the
meantime, this is tested by forcing existing reductions to use by-ref.

Commit series for by-ref OpenMP reductions 3/3

---------

Co-authored-by: Mats Petersson <mats.petersson@arm.com>
2024-03-13 14:51:09 +00:00
Zahi Moudallal
accfbf4e49 [MLIR][ROCDL] Add BallotOp and lit test (#84856) 2024-03-12 17:07:16 -07:00
Krzysztof Drewniak
b05c15259b [mlir][AMDGPU] Improve amdgpu.lds_barrier, add warnings (#77942)
On some architectures (currently gfx90a, gfx94*, and gfx10**), we can
implement an LDS barrier using compiler intrinsics instead of inline
assembly, improving optimization possibilities and decreasing the
fragility of the underlying code.

Other AMDGPU chipsets continue to require inline assembly to implement
this barrier, as, by the default, the LLVM backend will insert waits on
global memory (s_waintcnt vmcnt(0)) before barriers in order to ensure
memory watchpoints set by debuggers work correctly.

Use of amdgpu.lds_barrier, on these architectures, imposes a tradeoff
between debugability and performance. The documentation, as well as the
generated inline assembly, have been updated to explicitly call
attention to this fact.

For chipsets that did not require the inline assembly hack, we move to
the s.waitcnt and s.barrier intrinsics, which have been added to the
ROCDL dialect. The magic constants used as an argument to the waitcnt
intrinsic can be derived from
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
2024-03-11 10:06:49 -05:00
Kareem Ergawy
5c54f72901 [MLIR][OpenMP] Extend omp.private materialization support: firstprivate (#82164)
Extends current support for delayed privatization during translation to
LLVM IR. This adds support for one-block `firstprivate` `omp.private`
ops.
2024-03-04 12:28:30 +01:00
Leandro Lupori
64422cf826 [llvm][mlir][OMPIRBuilder] Translate omp.single's copyprivate (#80488)
Use the new copyprivate list from omp.single to emit calls to
__kmpc_copyprivate, during the creation of the single operation
in OMPIRBuilder.

This is patch 4 of 4, to add support for COPYPRIVATE in Flang.
Original PR: https://github.com/llvm/llvm-project/pull/73128
2024-02-28 13:33:42 -03:00
Tobias Gysi
e39e30e952 [mlir][llvm] Fix access group translation (#83257)
This commit fixes the translation of access group metadata to LLVM IR.
Previously, it did not use a temporary metadata node to model the
placeholder of the self-referencing access group nodes. This is
dangerous since, the translation may produce a metadata list with a null
entry that is later on changed changed with a self reference. At the
same time, for example the debug info translation may create the same
uniqued node, which after setting the self-reference the suddenly
references the access group metadata. The commit avoids such breakages.
2024-02-28 14:45:18 +01:00
Kareem Ergawy
9d56be010c [MLIR][OpenMP] Support basic materialization for omp.private ops (#81715)
Adds basic support for materializing delayed privatization. So far, the
restrictions on the implementation are:
- Only `private` clauses are supported (`firstprivate` support will be
  added in a later PR).
2024-02-28 05:00:07 +01:00
Xiang Li
70a7b1e8df Remove test since no test on --debug output. (#83189) 2024-02-27 17:00:01 -05:00
Krzysztof Drewniak
563f414e04 [mlir][AMDGPU] Set uniform-work-group-size=true by default (#79077)
GPU kernels generated via typical MLIR mechanisms make the assumption
that all workgroups are of uniform size, and so, as in OpenMP, it is
appropriate to set the "uniform-work-group-size"="true" attribute on
these functions by default. This commit makes that choice.

In the event it is needed, this commit adds
`rocdl.uniform_work_group_size` as an attribute to be set on LLVM
functions that can be used to override the default.

In addition, add proper failure messages to translation
2024-02-27 12:35:48 -06:00
Xiang Li
f4fad827ca [NFC] Add REQUIRES: asserts to limit the test to debug only. (#83145) 2024-02-27 12:59:56 -05:00
Xiang Li
c11627c2f4 [MLIR][LLVM] Fix memory explosion when converting global variable bodies in ModuleTranslation (#82708)
There is memory explosion when converting the body or initializer region
of a large global variable, e.g. a constant array.

For example, when translating a constant array of 100000 strings:

llvm.mlir.global internal constant @cats_strings() {addr_space = 0 :
i32, alignment = 16 : i64} : !llvm.array<100000 x ptr<i8>> {
    %0 = llvm.mlir.undef : !llvm.array<100000 x ptr<i8>>
    %1 = llvm.mlir.addressof @om_1 : !llvm.ptr<array<1 x i8>>
%2 = llvm.getelementptr %1[0, 0] : (!llvm.ptr<array<1 x i8>>) ->
!llvm.ptr<i8>
    %3 = llvm.insertvalue %2, %0[0] : !llvm.array<100000 x ptr<i8>>
    %4 = llvm.mlir.addressof @om_2 : !llvm.ptr<array<1 x i8>>
%5 = llvm.getelementptr %4[0, 0] : (!llvm.ptr<array<1 x i8>>) ->
!llvm.ptr<i8>
    %6 = llvm.insertvalue %5, %3[1] : !llvm.array<100000 x ptr<i8>>
    %7 = llvm.mlir.addressof @om_3 : !llvm.ptr<array<1 x i8>>
%8 = llvm.getelementptr %7[0, 0] : (!llvm.ptr<array<1 x i8>>) ->
!llvm.ptr<i8>
    %9 = llvm.insertvalue %8, %6[2] : !llvm.array<100000 x ptr<i8>>
    %10 = llvm.mlir.addressof @om_4 : !llvm.ptr<array<1 x i8>>
%11 = llvm.getelementptr %10[0, 0] : (!llvm.ptr<array<1 x i8>>) ->
!llvm.ptr<i8>
    %12 = llvm.insertvalue %11, %9[3] : !llvm.array<100000 x ptr<i8>>

    ... (ignore the remaining part)
}

where @om_1, @om_2, ... are string global constants.

Each time an operation is converted to LLVM, a new constant is created.
When it comes to llvm.insertvalue, a new constant array of 100000
elements is created and the old constant array (input) is not destroyed.
This causes memory explosion. We observed that, on a system with 128 GB
memory, the translation of 100000 elements got killed due to using up
all the memory. On a system with 64 GB, 65536 elements was enough to
cause the translation killed.

There is a previous patch (https://reviews.llvm.org/D148487) which fix
this issue but was reverted for
https://github.com/llvm/llvm-project/issues/62802

The old patch checks generated constants and destroyed them if there is
no use. But the check of use for the constant is too early, which cause
the constant be removed before use.


This new patch added a map was added a map to save expected use count
for a constant. Then decrease when reach each use.
And only erase the constant when the use count reach to zero

With new patch, the repro in
https://github.com/llvm/llvm-project/issues/62802 finished correctly.
2024-02-26 21:15:12 -05:00
agozillon
dcf4ca558c [OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca raise function pass to finalize, utilised in convertTarget (#78818)
This patch seeks to add a mechanism to raise constant (not ConstantExpr
or runtime/dynamic) sized allocations into the entry block for select
functions that have been inserted into a list for processing. This
processing occurs during the finalize call, after OutlinedInfo regions
have completed. This currently has only been utilised for
createOutlinedFunction, which is triggered for TargetOp generation in
the OpenMP MLIR dialect lowering to LLVM-IR.

This currently is required for Target kernels generated by
createOutlinedFunction to avoid subsequent optimization passes doing
some unintentional malformed optimizations for AMD kernels (unsure if it
occurs for other vendors). If the allocas are generated inside of the
kernel and are not in the entry block and are subsequently passed to a
function this can lead to required instructions being erased or
manipulated in a way that causes the kernel to run into a HSA access
error.

This fix is related to a series of problems found in:
https://github.com/llvm/llvm-project/issues/74603

This problem primarily presents itself for Flang's HLFIR AssignOp
currently, when utilised with a scalar temporary constant on the RHS and
a descriptor type on the LHS. It will generate a call to a runtime
function, wrap the RHS temporary in a newly allocated descriptor (an
llvm struct), and pass both the LHS and RHS descriptor into the runtime
function call. This will currently be
embedded into the middle of the target region in the user entry block,
which means the allocas are also embedded in the middle, which seems to
pose
issues when later passes are executed. This issue may present itself in
other HLFIR operations or unrelated operations that generate allocas as
a by product, but for the moment, this one test case is the only
scenario I've found this problem.

Perhaps this is not the appropriate fix, I am very open to other
suggestions, I've tried a few others (at varying levels of the
flang/mlir compiler flow), but this one is the smallest and least
intrusive change set. The other two, that come to mind (but I've not
fully looked into, the former I tried a little with blocks but it had a
few issues I'd need to think through):

- Having a proper alloca only block (or region) generated for TargetOps
that we could merge into the entry block that's generated by
convertTarget's createOutlinedFunction.
- Or diverging a little from Clang's current target generation and using
the CodeExtractor to generate the user code as an outlined function
region invoked from the kernel we make, with our kernel arguments passed
into it. Similar to the current parallel generation. I am not sure how
well this would intermingle with the existing parallel generation though
that's layered in.

Both of these methods seem like quite a divergence from the current
status quo, which I am not entirely sure is merited for the small test
this change aims to fix.
2024-02-23 22:59:41 +01:00
Tobias Gysi
335d34d9ea [MLIR][LLVM] Fix debug intrinsic import (#82637)
This revision handles the case that the translation of a scope fails due
to cyclic metadata. This mainly affects the import of debug intrinsics
that indirectly take such a scope as metadata argument (e.g. via local
variable or label metadata). This commit ensures we drop intrinsics with
such a dependency on cyclic metadata.
2024-02-23 10:30:19 +01:00
Joseph Huber
cc374d8056 [OpenMP] Remove register_requires global constructor (#80460)
Summary:
Currently, OpenMP handles the `omp requires` clause by emitting a global
constructor into the runtime for every translation unit that requires
it. However, this is not a great solution because it prevents us from
having a defined order in which the runtime is accessed and used.

This patch changes the approach to no longer use global constructors,
but to instead group the flag with the other offloading entires that we
already handle. This has the effect of still registering each flag per
requires TU, but now we have a single constructor that handles
everything.

This function removes support for the old `__tgt_register_requires` and
replaces it with a warning message. We just had a recent release, and
the OpenMP policy for the past four releases since we switched to LLVM
is that we do not provide strict backwards compatibility between major
LLVM releases now that the library is versioned. This means that a user
will need to recompile if they have an old binary that relied on
`register_requires` having the old behavior. It is important that we
actively deprecate this, as otherwise it would not solve the problem of
having no defined init and shutdown order for `libomptarget`. The
problem of `libomptarget` not having a define init and shutdown order
cascades into a lot of other issues so I have a strong incentive to be
rid of it.

It is worth noting that the current `__tgt_offload_entry` only has space
for a 32-bit integer here. I am planning to overhaul these at some point
as well.
2024-02-21 11:33:32 -06:00
Guray Ozen
b5d694ba14 [mlir][nvvm] Introduce nvvm.barrier OP (#81487)
This PR that introduces the `nvvm.barrier` OP to the NVVM dialect.
Currently, NVVM only supports the `nvvm.barrier0`, which synchronizes
all threads using barrier resource 0.

The new `nvvm.barrier` has two essential arguments: the barrier resource
and the number of threads. This added flexibility allows for selective
synchronization of threads within a CTA, aligning with the capabilities
provided by LLVM intrinsics or the PTX model.

I think we can deprecate `nvvm.barrier0` in favor of the more generic
`nvvm.barrier`.

```
// Equivalent to nvvm.barrier0 (or __syncthreads() in CUDA)
nvvm.barrier

// Synchronize all threads using the 3rd barrier resource.
nvvm.barrier id = 3

// Synchronize %numberOfThreads threads using the 3rd barrier resource.
nvvm.barrier id = 3 number_of_threads = %numberOfThreads
```
2024-02-14 08:28:45 +01:00
David Truby
be9f8ffd81 [mlir][flang][openmp] Rework wsloop reduction operations (#80019)
This patch reworks the way that wsloop reduction operations function to
better match the expected semantics from the OpenMP specification,
following the rework of parallel reductions.

The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region. These block arguments follow
the loop control block arguments.

---------

Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
2024-02-13 19:13:54 +00:00
Giuseppe Rossini
16140ff219 [mlir][ROCDL] Add synchronization primitives (#80888)
This PR adds two LLVM intrinsics to MLIR:
- llvm.amdgcn.s.setprio which sets the priority of a wave for the GPU
scheduler
- llvm.amdgcn.sched.barrier which sets a software barrier so that the
scheduler cannot move instructions around
2024-02-13 12:29:49 -06:00
Rishi Surendran
fa6850a998 [mlir][nvvm]Add support for grid_constant attribute on LLVM function arguments (#78228)
Add support for attribute nvvm.grid_constant on LLVM function arguments.
The attribute can be attached only to arguments of type llvm.ptr that
have llvm.byval attribute.
Generate LLVM metadata for functions with nvvm.grid_constant arguments.
The metadata node is a list of integers, where each integer n denotes
that the nth parameter has the
grid_constant annotation (numbering from 1). The generated metadata node
will be handled by NVVM compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#supported-properties
for documentation on grid_constant property.

This patch also adds convertParameterAttr to
LLVMTranslationDialectInterface for supporting the translation of
derived dialect attributes on function parameters 
2024-02-12 13:16:59 -08:00
David Truby
9ecf4d20bb [mlir][flang][openmp] Rework parallel reduction operations (#79308)
This patch reworks the way that parallel reduction operations function
to better match the expected semantics from the OpenMP specification.
Previously specific omp.reduction operations were used inside the
region, meaning that the reduction only applied when the correct
operation was used, whereas the specification states that any change to
the variable inside the region should be taken into account for the
reduction.

The new semantics create a private reduction variable as a block
argument which should be used normally for all operations on that
variable in the region; this private variable is then combined with the
others into the shared variable. This way no special omp.reduction
operations are needed inside the region.

This patch only makes the change for the `parallel` operation, the
change for the `wsloop` operation will be in a separate patch.

---------

Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
2024-02-12 17:19:49 +00:00
Benjamin Maxwell
413e82a087 [mlir][ArmSVE] Add intrinsics for the SME2 multi-vector zips (#80985)
These are added to the ArmSVE dialect for consistency with LLVM, which
registers SME2 intrinsics that don't require ZA under SVE.
2024-02-09 13:33:09 +00:00
Kolya Panchenko
9f6c00565a [MLIR][VCIX] Support VCIX intrinsics in LLVMIR dialect (#75875)
The changeset extends LLVMIR intrinsics with VCIX intrinsics.
The VCIX intrinsics allow MLIR users to interact with RISC-V
co-processors that are compatible with `XSfvcp` extension

Source:
https://www.sifive.com/document-file/sifive-vector-coprocessor-interface-vcix-software
2024-02-07 15:23:28 -05:00
Kojo Acquah
16d890ced6 [mlir][ArmNeon] Adds Arm Neon SMMLA, UMMLA, and USMMLA Intrinsics (#80511)
This adds the SMMLA, UMMLA, and  USMMLA intrinsics to Neon dialect bringing it in line with the SVE dialect. 

These ops enable matrix multiply-accumulate instructions with two e 2x8 matrix inputs of respective signage into a 2x2 32-bit integer accumulator. This is equivalent to performing an 8-way dot product per destination element.

Op details: 
https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=mmla
2024-02-06 14:12:40 -08:00
agozillon
95fe47ca7e [Flang][OpenMP] Initial mapping of Fortran pointers and allocatables for target devices (#71766)
This patch seeks to add an initial lowering for pointers and allocatable variables 
captured by implicit and explicit map in Flang OpenMP for Target operations that 
take map clauses e.g. Target, Target Update. Target Exit/Enter etc.

Currently this is done by treating the type that lowers to a descriptor 
(allocatable/pointer/assumed shape) as a map of a record type (e.g. a structure) as that's
effectively what descriptor types lower to in LLVM-IR and what they're represented as
in the Fortran runtime (written in C/C++). The descriptor effectively lowers to a structure
containing scalar and array elements that represent various aspects of the underlying
data being mapped (lower bound, upper bound, extent being the main ones of interest
in most cases) and a pointer to the allocated data. In this current iteration of the mapping
we map the structure in it's entirety and then attach the underlying data pointer and map
the data to the device, this allows most of the required data to be resident on the device
for use. Currently we do not support the addendum (another block of pointer data), but
it shouldn't be too difficult to extend this to support it.

The MapInfoOp generation for descriptor types is primarily handled in an optimization
pass, where it expands BoxType (descriptor types) map captures into two maps, one for
the structure (scalar elements) and the other for the pointer data (base address) and
links them in a Parent <-> Child relationship. The later lowering processes will then treat
them as a conjoined structure with a pointer member map.
2024-02-05 18:45:07 +01:00
Sergio Afonso
bc82d1a6b7 [OpenMPIRBuilder][MLIR] Pass target-cpu and target-features to outlined functions (#80283)
This patch adds support for forwarding the target-cpu and
target-features attributes to functions outlined in the OpenMPIRBuilder.
This, in turn, results in the addition of these attributes for functions
created during the translation of the `omp.parallel`, `omp.task` and
`omp.teams` operations, and for the `omp.wsloop` operation when doing
codegen for an OpenMP target device.
2024-02-05 12:22:56 +00:00
Sergio Afonso
92bbf615f5 [Flang][MLIR][OpenMP] Use function-attached target attributes for OpenMP lowering (#78291)
This patch removes the omp.target module attribute, since the
information it held on the target CPU and features is available through
the fir.target_cpu and fir.target_features module attributes. Target
outlining during the MLIR to LLVM IR translation stage is updated, so
that these attributes, at that point available as llvm.func attributes,
are passed along to the newly created function.
2024-02-02 13:16:36 +00:00
Kai Sasaki
65ac8c16e0 [mlir] Skip invalid test on big endian platform (s390x) (#80246)
The buildbot test running on s390x platform keeps failing since [this
time](https://lab.llvm.org/buildbot/#/builders/199/builds/31136). This
is because of the dependency on the endianness of the platform. It
expects the format invalid in the big endian platform (s390x). We can
simply skip it.

See: https://discourse.llvm.org/t/mlir-s390x-linux-failure/76695
2024-02-02 00:07:44 -08:00
Sander de Smalen
d313614b60 [AArch64] Replace LLVM IR function attributes for PSTATE.ZA. (#79166)
Since https://github.com/ARM-software/acle/pull/276 the ACLE
defines attributes to better describe the use of a given SME state.

Previously the attributes merely described the possibility of it being
'shared' or 'preserved', whereas the new attributes have more semantics
and also describe how the data flows through the program.

For ZT0 we already had to add new LLVM IR attributes:
* aarch64_new_zt0
* aarch64_in_zt0
* aarch64_out_zt0
* aarch64_inout_zt0
* aarch64_preserves_zt0

We have now done the same for ZA, such that we add:
* aarch64_new_za       (previously `aarch64_pstate_za_new`)
* aarch64_in_za (more specific variation of `aarch64_pstate_za_shared`)
* aarch64_out_za (more specific variation of `aarch64_pstate_za_shared`)
* aarch64_inout_za (more specific variation of
`aarch64_pstate_za_shared`)
* aarch64_preserves_za (previously `aarch64_pstate_za_shared,
aarch64_pstate_za_preserved`)

This explicitly removes 'pstate' from the name, because with SME2 and
the new ACLE attributes there is a difference between "sharing ZA"
(sharing
the ZA matrix register with the caller) and "sharing PSTATE.ZA" (sharing
either the ZA or ZT0 register, both part of PSTATE.ZA with the caller).
2024-02-01 13:37:37 +00:00
Dominik Adamski
b4370140b4 [OpenMPIRBuilder] Do not call host runtime for GPU teams codegen (#79984)
Patch ensures that host runtime functions are not called for handling
OpenMP teams clause on the device.

GPU code for pragma `omp target teams distribute parallel do` will
require only one call to OpenMP loop-worksharing GPU runtime. Support
for it will be added later.

This patch does not include changes required for handling `omp target
teams` for the host side.
2024-01-31 12:16:35 +01:00
Cullen Rhodes
95ef8e3868 [mlir][ArmSME] Support 2-way widening outer products (#78975)
This patch introduces support for 2-way widening outer products. This
enables the fusion of 2 'arm_sme.outerproduct' operations that are
chained via the accumulator into a 2-way widening outer product
operation.

Changes:

- Add 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for 2-way variants.
  These map to instruction variants added in SME2 and use different
  intrinsics. Intrinsics are already implemented for widening variants
  from SME1.
- Adds the following operations:
  - fmopa_2way, fmops_2way
  - smopa_2way, smops_2way
  - umopa_2way, umops_2way
- Implements conversions for the above ops to intrinsics in
ArmSMEToLLVM.
- Adds a pass 'arm-sme-outer-product-fusion'  that fuses
  'arm_sme.outerproduct' operations.

For a detailed description of these operations see the
'arm_sme.fmopa_2way' description.

The reason for introducing many operations rather than one is the
signed/unsigned variants can't be distinguished with types (e.g., ui16,
si16) since 'arith.extui' and 'arith.extsi' only support signless
integers. A single operation would require this information and an
attribute (for example) for the sign doesn't feel right if
floating-point types are also supported where this wouldn't apply.
Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32)
introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but
no subtract variant. Whilst these are not supported in this patch, it
felt simpler to have separate ops for add/subtract given this.
2024-01-31 09:13:18 +00:00
Alex Bradbury
748c295908 [MLIR][LLVM] Add fast-math related function attribute support (#79812)
Adds unsafe-fp-math, no-infs-fp-math, no-nans-fp-math,
approx-func-fp-math, and no-signed-zeros-fp-math function attributes.

This allows code generators using the LLVMIR dialect to match the
codegen of Clang.
2024-01-30 14:03:51 +00:00
Mirko Brkušanin
7fdf608cef [AMDGPU] Add GFX12 WMMA and SWMMAC instructions (#77795)
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Co-authored-by: Piotr Sobczak <piotr.sobczak@amd.com>
2024-01-24 13:43:07 +01:00