Re-land PR after being reverted because of buildbot failures.
This patch adds representation for `device_type` clause information on
compute constructs (parallel, kernels, serial).
The `device_type` clause on a compute construct impacts the clauses that
appear after it. The values impacted by `device_type` are now tied to
an attribute array that represents the `device_type` associated with
them. `DeviceType::None` is used to represent the value produced by a
clause before any `device_type`. The operands and the attribute
information are parsed/printed together.
This is an example with the `vector_length` clause. The first value (64)
is not impacted by `device_type`, so it is represented with
`DeviceType::None`, which is not printed. The second value (128) is tied
to the `device_type(multicore)` clause.
```
!$acc parallel vector_length(64) device_type(multicore) vector_length(128)
```
```
acc.parallel vector_length(%c64 : i32, %c128 : i32 [#acc.device_type<multicore>]) {
}
```
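The pairing of values and device types can be sketched outside of MLIR as
two parallel arrays. This is only an illustrative model, not the dialect's
implementation: `DeviceTypedValues`, `lookup`, and `lookupOrDefault` are
hypothetical names, and the fallback to the `DeviceType::None` entry is an
assumption about how a consumer might resolve a value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the dialect's device type enumeration.
enum class DeviceType { None, Multicore, Nvidia };

// One clause (e.g. vector_length): values[i] is tied to deviceTypes[i].
struct DeviceTypedValues {
  std::vector<int64_t> values;
  std::vector<DeviceType> deviceTypes;
};

// Return the value explicitly tied to `dt`, if there is one.
std::optional<int64_t> lookup(const DeviceTypedValues &clause, DeviceType dt) {
  assert(clause.values.size() == clause.deviceTypes.size());
  for (std::size_t i = 0; i < clause.values.size(); ++i)
    if (clause.deviceTypes[i] == dt)
      return clause.values[i];
  return std::nullopt;
}

// Assumption: when nothing is tied to `dt`, fall back to the value that
// appeared before any device_type clause (DeviceType::None).
std::optional<int64_t> lookupOrDefault(const DeviceTypedValues &clause,
                                       DeviceType dt) {
  if (std::optional<int64_t> v = lookup(clause, dt))
    return v;
  return lookup(clause, DeviceType::None);
}
```

With the values from the MLIR example above (64 for `None`, 128 for
`multicore`), a lookup for `multicore` yields 128 and a lookup for any other
device type yields 64 under this model.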
When multiple values can be produced for a single clause, like
`num_gangs` and `wait`, an extra attribute describes the number of
values belonging to each `device_type`. Values and attributes are
parsed/printed together.
```
acc.parallel num_gangs({%c2 : i32, %c4 : i32}, {%c4 : i32} [#acc.device_type<nvidia>])
```
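One way to picture the extra attribute is as a list of segment sizes over a
flattened operand list. The sketch below is a standalone model under that
assumption; `SegmentedClause` and `valuesFor` are hypothetical names and do
not correspond to generated accessors in the dialect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the dialect's device type enumeration.
enum class DeviceType { None, Nvidia };

// All operands are stored flat; segments[i] says how many of them belong
// to deviceTypes[i].
struct SegmentedClause {
  std::vector<int64_t> operands;
  std::vector<int32_t> segments;
  std::vector<DeviceType> deviceTypes;
};

// Return the group of values belonging to `dt` (empty if there is none).
std::vector<int64_t> valuesFor(const SegmentedClause &c, DeviceType dt) {
  assert(c.segments.size() == c.deviceTypes.size());
  std::size_t offset = 0;
  for (std::size_t i = 0; i < c.segments.size(); ++i) {
    std::size_t count = static_cast<std::size_t>(c.segments[i]);
    if (c.deviceTypes[i] == dt) {
      assert(offset + count <= c.operands.size());
      auto first = c.operands.begin() + static_cast<std::ptrdiff_t>(offset);
      return std::vector<int64_t>(first,
                                  first + static_cast<std::ptrdiff_t>(count));
    }
    offset += count;
  }
  return {};
}
```

For the `num_gangs` example above, the flat operands would be `{2, 4, 4}`
with segments `{2, 1}` and device types `{None, Nvidia}`.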
While preparing this patch I noticed that the wait devnum is not part of
the operations and is not lowered. It will be added in a follow-up
patch.
The `acc` dialect operations now implement MemoryEffects interfaces in
the following ways:
- Data entry operations which may read host memory via `varPtr` are now
marked as such. The majority of them do NOT actually read the host
memory. For example, `acc.present` works on the basis of the presence of
the pointer and not necessarily what the data points to - so it is not
marked as reading the host memory. Such operations still use `varPtr`,
but this dependency is reflected through SSA.
- Data clause operations which may mutate the data pointed to by
`accPtr` are marked as doing so.
- Data clause operations which update the required structured or dynamic
runtime counters are marked as reading and writing the newly defined
`RuntimeCounters` resource. Some operations, like `acc.getdeviceptr`, do
not actually use the runtime counters - but are marked as reading them
since the address obtained depends on the mapping operations which do
update the runtime counters. Namely, `acc.getdeviceptr` cannot be moved
across other mapping operations.
- Constructs are marked as writing to the `ConstructResource`. This may
be too strict but is needed for the following reasons: 1) Structured
constructs may not use `accPtr` and instead use `varPtr` - when this is
the case, data actions may be removed even when used. 2) Unstructured
constructs are currently used to aggregate multiple data actions. We do
not want such constructs removed or moved for now.
- Terminators are marked as `Pure` as in other dialects.
The current approach has the following limitations which may require
further improvements:
- Subsequent `acc.copyin` operations on the same data do not actually
read host memory pointed to by `varPtr` but are still marked as doing so.
- Two `acc.delete` operations on the same data may not mutate `accPtr`
until the runtime counters are zero (but are still marked as mutating).
- The `varPtrPtr` argument, when present, points to the address of the
location of `varPtr`. When mapping to the target device, an `accPtrPtr`
needs to be computed and this memory is mutated. This effect is not
captured since the current operations do not produce `accPtrPtr`.
- Runtime counter effects are imprecise since two operations with
differing `varPtr` increment/decrement different counters. Additionally,
operations with `varPtrPtr` mutate attachment counters.
- The `ConstructResource` is too strict and likely can be relaxed with
better modeling.
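The reordering consequence of these effects can be illustrated with a small
standalone model. This is a sketch only: `OpModel`, `Effect`, and
`conflicts` are hypothetical names, the resource strings are labels chosen
for illustration, and the conflict rule (same resource, at least one write)
is a simplification of what MLIR's side effect machinery provides.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class EffectKind { Read, Write };

// A single effect on a named resource (names here are illustrative labels).
struct Effect {
  EffectKind kind;
  std::string resource;
};

// Toy model of an operation: a name plus its declared effects.
struct OpModel {
  std::string name;
  std::vector<Effect> effects;
};

// Two operations conflict, and so cannot be reordered, when they touch the
// same resource and at least one of them writes it.
bool conflicts(const OpModel &a, const OpModel &b) {
  for (const Effect &ea : a.effects)
    for (const Effect &eb : b.effects)
      if (ea.resource == eb.resource &&
          (ea.kind == EffectKind::Write || eb.kind == EffectKind::Write))
        return true;
  return false;
}
```

Modeling `acc.copyin` as reading host memory and reading/writing
`RuntimeCounters`, and `acc.getdeviceptr` as only reading `RuntimeCounters`,
the two conflict, which matches the statement that `acc.getdeviceptr` cannot
be moved across mapping operations. Two `acc.getdeviceptr` operations do not
conflict with each other in this model.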
This commit removes the support for typed pointers from the LLVM
dialect. Typed pointers have been deprecated for a while and thus this
removal was announced in a PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
This change includes:
- Changing the `LLVMPointerType`
- Removing remaining usages of the builders and the now removed element
type
- Fixing assembly formats that require fully qualified pointer types
- Updating ODS pointer constraints
The compute and data constructs implement getNumDataOperands and
getDataOperand. The acc.loop operation similarly has multiple data
operands - thus it makes sense to expose them the same way.
For acc.loop, only private and reduction operands are exposed this way.
Technically, acc.loop also holds cache operands - but these are hints,
not data attributes.
After PR#69417, lowering for combined constructs was updated to adhere
to OpenACC 3.3, section 2.11: `A private or reduction clause on a
combined construct is treated as if it appeared on the loop construct.`
However, the second part of that paragraph notes `In addition, a
reduction clause on a combined construct implies a copy clause`. Since
the acc dialect decomposes combined constructs, it is important to
distinguish between the case where an explicit data clause is required
(as noted in section 2.6.2) and the case where an implicit data action
must be generated by the compiler.
Add lowering support for arrays with dynamic extents in the firstprivate
recipe. Generalize the lowering so statically shaped arrays and arrays
with dynamic extents use the same path.
Some cleanup code is taken from #68836, which has not landed yet.
Add support for assumed shape arrays in lowering of the copy region of
the firstprivate recipe. Information is passed in block arguments as it
is done for the reduction recipe.
Conversion of `hlfir.assign` operations inside OpenACC recipe operations
may result in `fir.alloca` insertion. FIRBuilder can only handle
alloca insertion inside FuncOp's and outlineable OpenMP operations.
I added a simple interface for OpenACC recipe operations that have
executable code inside all their regions, so that an alloca may always
be inserted into the entry blocks of those regions.
With our current approach the OptimizedBufferization pass is supposed
to lower these `hlfir.assign` operations into loops, because there
should not be conflicts between lhs/rhs. The pass is currently
only working on FuncOp, and this is why it does not optimize
`hlfir.assign` inside the recipes. I will fix it in a separate commit.
Since we run OptimizedBufferization only at >O0, these changes
should still be useful.
Note that the OpenACC codegen that applies the recipes should be aware
of potential alloca operations and produce appropriate stack clean-ups.
The `cache` directive may appear at the top of (inside of) a loop. It
specifies array elements or subarrays that should be fetched into the
highest level of the cache for the body of the loop.
The `cache` directive is modeled as data entry operands attached to
the acc.loop operation.
The OpenACC standard specifies an `atomic` construct in section 2.12 (of
3.3 spec), used to ensure that a specific location is accessed or
updated atomically. Four different clauses are allowed: `read`, `write`,
`update`, or `capture`. If no clause appears, it is as if `update` is
used.
The OpenMP specification defines the same clauses for `omp atomic`. The
types of expression and the clauses in the OpenACC spec match the OpenMP
spec exactly. The main difference is that the OpenMP specification is a
superset - it includes clauses for `hint` and `memory order`. It also
allows conditional expression statements. But otherwise, the expression
definition matches.
Thus, for OpenACC, we refactor and reuse the OpenMP implementation as
follows:
* The atomic operations are duplicated in the OpenACC dialect. This is
preferable so that each language's semantics are precisely represented
even if the specs diverge.
* However, since semantics overlap, a common interface between the
atomic operations is being added. The semantics for the interfaces are
not generic enough to be used outside of OpenACC and OpenMP, and thus
new folders were added to hold common pieces of the two dialects.
* The atomic interfaces define common accessors (such as getting `x` or
`v`) which match the OpenMP and OpenACC specs. It also adds common
verifiers intended to be called by each dialect's operation verifier.
* The OpenMP write operation was updated to use `x` and `expr` to be
consistent with its other operations (that use naming based on spec).
The frontend lowering necessary to generate the dialect can also be
reused. This will be done in a follow up change.
The declare attribute has been updated to allow an implicit flag. This
is useful for variables that can be declare'd implicitly - like global
constants. The verifier has been updated to ensure that an implicitly
declare'd variable has an implicit data action. The builder does not
require this flag to be set, so any code creating this attribute
will continue to work as-is.
Reviewed By: vzakhari
Differential Revision: https://reviews.llvm.org/D159124
The standard suggests that the value for the `device_type` clause on the
`set` directive is a list, but this does not make sense. Restrict the
number of values to one so it matches the runtime function.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D158644
Introduce the acc.set operation that models the
acc set directive. It is based on acc.init and acc.shutdown.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D158554
The acc.declare operation represents the implicit
region of a variable in the declare directive in a function
(or subroutine in Fortran).
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D158314
The global ctor for acc declare is treated differently when the variable
is a descriptor. The descriptor is implicitly copied in.
An additional registering function will be generated to deal with
the data pointer when the data is actually allocated. This will come in
a follow-up patch.
The descriptor is not a user-visible detail but an implementation detail.
The intent for declare is that the lifetime is implicitly managed - and the
data must be on the device. Since the descriptor holds a pointer to the
data, it makes sense to also make it available on the device at the same
time. Copyin is used because it contains relevant details about the data
such as bounds.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D157338
Adds representation for `acc routine` under a new operation named
`acc.routine`. This operation is associated with a function symbol.
It also gets its own compiler generated synthetic symbol name so
that it can be referenced from the associated function. The clauses
associated with the `acc routine` directive are captured in the
`acc.routine` op.
The linking between the `func.func` and its `acc.routine` declaration
is done through the `acc.routine_info` attribute. In practice, a
single `acc routine` is associated with a function. But the spec does
not specifically restrict this - thus the 1:N relationship between
`func.func` and `acc.routine` is allowed in the dialect. Additionally,
it makes sense that multiple acc routines could be used for a single
function depending on loop context - to allow flexible parallelization.
Most acc routine clauses are supported including `gang`, `gang(dim:)`,
`vector`, `worker`, `seq`, `nohost`, and `bind`. The only one not
supported is `device_type`. This is because most other clauses also
miss this and the effort to add support for it needs to be coordinated
and consistent.
Reviewed By: clementval, vzakhari
Differential Revision: https://reviews.llvm.org/D156281
The attributes on operations in ops.mlir were not DeclareAttr but
DataClauseAttr with the acc.declare attribute name. Update the test
and the verifier to work correctly with the expected DeclareAttr.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D156262
Allow the init and combiner regions to have more
arguments to pass information.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D155656
For variables in declare clauses, their producing operation should be
marked with the data clause for ease of lookup and consistency
verification. Thus add an attribute that can be used for this purpose
plus verification that declare data operation matches the declare
data clause on variable.
Reviewed By: clementval
Differential Revision: https://reviews.llvm.org/D155640
A declare directive is used to specify the creation of a visible device
copy of a variable for the duration of the implicit data region as it
relates to the scope in which the variable is declared.
In order to support this, the following new operations were added:
1) `acc.global_ctor` and `acc.global_dtor`. These are used whenever the
declare directive applies to a global.
2) `acc.declare_enter` and `acc.declare_exit`. These operations are
modeled similarly to `acc.enter_data` and `acc.exit_data`. The reason
they are not modeled like `acc.data` is so that these operations can be
used both for globals and regions like functions.
3) `acc.declare_device_resident` and `acc.declare_link`. These
operations are modeled in a manner consistent with previously defined
data entry operation model.
The `acc.getdeviceptr` was generalized so that it can be used with
acc.declare_exit.
Reviewed By: clementval, vzakhari
Differential Revision: https://reviews.llvm.org/D155322
OpenACC 3.2 allows the wait clause on the data construct. This patch
adds a unit attribute and a variadic operand to the data operation to model
the wait clause, in the same way it was added to other data operations.
The attribute models the presence of the clause without any argument. When
arguments are provided, they are placed in the wait operand.
Depends on D154111
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D154131
OpenACC 3.2 allows the async clause on the data construct. This patch
adds a unit attribute and an optional operand to the data operation to model
the async clause, in the same way it was added to other data operations.
The attribute models the presence of the clause without any argument. When
an argument is provided, it is placed in the async operand.
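The attribute-plus-operand modeling can be sketched as a flag next to an
optional value. This is an illustrative standalone model: `DataOpModel`,
`hasAsync`, and `verifyAsync` are hypothetical names, and the check that the
two forms are mutually exclusive is an assumption, not something this
message states.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Toy model of the async information on a data operation.
struct DataOpModel {
  bool asyncAttr = false;              // `async` with no argument
  std::optional<int64_t> asyncOperand; // `async(n)`
};

// The clause is present if either form was given.
bool hasAsync(const DataOpModel &op) {
  return op.asyncAttr || op.asyncOperand.has_value();
}

// Assumed verifier rule: the unit attribute and the operand are mutually
// exclusive. Returns an empty string on success, a diagnostic otherwise.
std::string verifyAsync(const DataOpModel &op) {
  if (op.asyncAttr && op.asyncOperand.has_value())
    return "async attribute cannot appear with asyncOperand";
  return "";
}
```

A consumer of such a model would first ask whether the clause is present at
all and only then look at the operand, which is empty for the
argument-less form.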
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D154111
In the latest spec, the `num_gangs` clause accepts up to three
arguments. Update the dialect to switch the `numGangs` operands from
an optional single operand to a variadic operand. The verifier limits
the number of operands to three as specified in the spec.
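The verifier limit amounts to a simple bound check on the variadic operand
list. The following standalone sketch shows the idea; `verifyNumGangs` and
its diagnostic text are hypothetical and not taken from the dialect.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Returns an empty string when the operand count is acceptable, otherwise
// a diagnostic. The limit of three comes from the spec as described above.
std::string verifyNumGangs(const std::vector<int64_t> &numGangs) {
  constexpr std::size_t kMaxNumGangs = 3;
  if (numGangs.size() > kMaxNumGangs)
    return "num_gangs expects at most 3 values";
  return "";
}
```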
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D153796
The acc.reduction operation is used as the data entry operation for the
reduction operands.
Reviewed By: jeanPerier
Differential Revision: https://reviews.llvm.org/D153367
Lower 1d array reductions for the add and mul operators. Multi-dimensional
arrays and other operators will follow.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D153448
For all other compute and data constructs, the data operands list
is named `dataClauseOperands`. Update `acc.host_data` to be
consistent with this naming.
Reviewed By: clementval
Differential Revision: https://reviews.llvm.org/D153425
acc.firstprivate operation will be used as data entry operation
for the firstprivate operands.
Depends on D152970
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D152972
acc.private operation will be used as data entry operation
for the private operands.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D152970
OpenACC 3.3 introduces a dim argument on the gang clause. This patch
adds a new operand for it on acc.loop and updates the custom
gang clause parser/printer for it.
Depends on D151970
Reviewed By: razvanlupusoru, jeanPerier
Differential Revision: https://reviews.llvm.org/D151971
The custom parser for the gang values was not implemented correctly.
This patch fixes the noted issue and allows the num/static values
to appear in any order.
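Order-independent parsing of keyword arguments can be sketched with a small
string-based parser. This is a toy model over plain text such as
`num=8, static=4`, not the MLIR custom parser; `GangArgs` and
`parseGangArgs` are hypothetical names, and rejecting duplicate or unknown
keywords is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <sstream>
#include <string>

struct GangArgs {
  std::optional<int> num;
  std::optional<int> staticSize;
};

// Parse a comma-separated list of `num=<int>` and `static=<int>` items in
// any order. Returns std::nullopt on malformed, unknown, or duplicate items.
std::optional<GangArgs> parseGangArgs(const std::string &text) {
  GangArgs args;
  std::istringstream in(text);
  std::string item;
  while (std::getline(in, item, ',')) {
    // Trim surrounding spaces.
    std::size_t begin = item.find_first_not_of(' ');
    if (begin == std::string::npos)
      return std::nullopt;
    std::size_t end = item.find_last_not_of(' ');
    item = item.substr(begin, end - begin + 1);

    std::size_t eq = item.find('=');
    if (eq == std::string::npos)
      return std::nullopt;
    std::string key = item.substr(0, eq);
    std::string valueText = item.substr(eq + 1);

    int value = 0;
    try {
      std::size_t consumed = 0;
      value = std::stoi(valueText, &consumed);
      if (consumed != valueText.size())
        return std::nullopt;
    } catch (const std::exception &) {
      return std::nullopt;
    }

    // Select the slot by keyword rather than by position.
    std::optional<int> *slot = nullptr;
    if (key == "num")
      slot = &args.num;
    else if (key == "static")
      slot = &args.staticSize;
    if (slot == nullptr || slot->has_value())
      return std::nullopt;
    *slot = value;
  }
  return args;
}
```

Because each item is dispatched on its keyword, `num=8, static=4` and
`static=4, num=8` produce the same result in this sketch.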
Reviewed By: razvanlupusoru, jeanPerier
Differential Revision: https://reviews.llvm.org/D151970
Parallel and serial constructs support reduction clause. Extend
recent D151564 loop reduction clause support to also include these
compute constructs.
Reviewed By: clementval, vzakhari
Differential Revision: https://reviews.llvm.org/D151955
Add initial support to lower reduction clause to its representation in MLIR.
This patch adds support for addition of integer and real scalar types. Other
operators and types will be added with follow up patches.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D151564
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.
Context:
- https://mlir.llvm.org/deprecation/ at "Use the free function variants
for dyn_cast/cast/isa/…"
- Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This patch updates all remaining uses of the deprecated functionality in
mlir/. This was done with clang-tidy as described below and further
modifications to GPUBase.td and OpenMPOpsInterfaces.td.
Steps are described per line, as comments are removed by git:
0. Retrieve the change from the following to build clang-tidy with an
additional check:
main...tpopp:llvm-project:tidy-cast-check
1. Build clang-tidy
2. Run clang-tidy over your entire codebase while disabling all checks
and enabling only the relevant one. Run on all header files also.
3. Delete .inc files that were also modified, so the next build rebuilds
them to a pure state.
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
-header-filter=mlir/ mlir/* -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
```
Differential Revision: https://reviews.llvm.org/D151542
Use the new reduction design in acc.loop operation.
Depends on D151146
Reviewed By: razvanlupusoru, jeanPerier
Differential Revision: https://reviews.llvm.org/D151164
Add the missing check on private list information. The
check is the same as the one done for acc.parallel.
Depends on D151146
Reviewed By: razvanlupusoru, jeanPerier
Differential Revision: https://reviews.llvm.org/D151149
After D150818, the reduction clause is represented
with an acc.reduction.recipe operation and an operand.
This patch updates the acc.parallel op for the new design.
Reviewed By: razvanlupusoru, jeanPerier
Differential Revision: https://reviews.llvm.org/D151146
The destroy region is optional but the verifier was enforcing it.
Update the verifier and make it clear in the definition.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D151239
Update acc.loop private operands list to use the new design
introduced in D150622.
Depends on D150975
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D150984