This patch implements support for the UNROLL directive to control how
many times a loop should be unrolled.
It must be placed immediately before a `DO LOOP` and applies only to the
loop that follows. N is an integer that specifying the unrolling factor.
This is done by adding an attribute to the branch into the loop in LLVM
to indicate that the loop should unrolled.
The code pushed to support the directive `VECTOR ALWAYS` has been
modified to take account of the fact that several directives can be used
before a `DO LOOP`.
This patch adds a call to the CUFInit function just after `ProgramStart`
when CUDA Fortran is enabled to initialize the CUDA context. This allows
us to set up some context information like the stack limit that can be
defined by an environment variable `ACC_OFFLOAD_STACKSIZE=<value>`.
Fixes https://github.com/llvm/llvm-project/issues/113191
Issue: [flang][OpenMP] Runtime segfault when an allocatable variable is
used with copyin
Rootcause: The value of the threadprivate variable is not being copied
from the primary thread to the other threads within a parallel region.
As a result it tries to access a null pointer inside a parallel region
which causes segfault.
Fix: When allocatables used with copyin clause need to ensure that, on
entry to any parallel region each thread’s copy of a variable will
acquire the allocation status of the primary thread, before copying the
value of a threadprivate variable of the primary thread to the
threadprivate variable of each other member of the team.
This fixes a bug when the same variable is used in `firstprivate` and
`lastprivate` clauses on the same construct. The issue boils down to the
fact that `copyHostAssociateVar` was deciding the direction of the copy
assignment (i.e. the `lhs` and `rhs`) based on whether the
`copyAssignIP`
parameter is set. This is not the best way to do it since it is not
related to whether we doing a copy from host to localized copy or the
other way around. When we set the insertion for `firstprivate` in
delayed privatization, this resulted in switching the direction of the
copy assignment. Instead, this PR adds a new paramter to explicitely
tell
the function the direction of the assignment.
This is a follow up PR for
https://github.com/llvm/llvm-project/pull/122471, only the latest commit
is relevant.
Descriptors of allocatable components of firstprivate derived type
copies need to be set-up. Otherwise the program later die when
manipulating them inside OpenMP region.
This commit fixes some but not all memory leaks in Flang. There are
still 91 tests that fail with ASAN.
- Use `mlir::OwningOpRef` instead of `std::unique_ptr`. The latter does
not free allocations of nested blocks.
- Pass `ModuleOp` as value instead of reference.
- Add few missing deallocations in test cases and other places.
Allocatable members of privatized derived types must be allocated,
with the same bounds as the original object, whenever that member
is also allocated in it, but Flang was not performing such
initialization.
The `Initialize` runtime function can't perform this task unless
its signature is changed to receive an additional parameter, the
original object, that is needed to find out which allocatable
members, with their bounds, must also be allocated in the clone.
As `Initialize` is used not only for privatization, sometimes this
other object won't even exist, so this new parameter would need
to be optional.
Because of this, it seemed better to add a new runtime function:
`InitializeClone`.
To avoid unnecessary calls, lowering inserts a call to it only for
privatized items that are derived types with allocatable members.
Fixes https://github.com/llvm/llvm-project/issues/114888
Fixes https://github.com/llvm/llvm-project/issues/114889
Implement the UNSIGNED extension type and operations under control of a
language feature flag (-funsigned).
This is nearly identical to the UNSIGNED feature that has been available
in Sun Fortran for years, and now implemented in GNU Fortran for
gfortran 15, and proposed for ISO standardization in J3/24-116.txt.
See the new documentation for details; but in short, this is C's
unsigned type, with guaranteed modular arithmetic for +, -, and *, and
the related transformational intrinsic functions SUM & al.
This re-applies #117867 with a small fix that hopefully prevents build
bot failures. The fix is avoiding `dyn_cast` for the result of
`getOperation()`. Instead we can assign the result to `mlir::ModuleOp`
directly since the type of the operation is known statically (`OpT` in
`OperationPass`).
This is a starting PR to implicitly map allocatable record fields.
This PR contains the following changes:
1. Re-purposes some of the utils used in `Lower/OpenMP.cpp` so that
these utils work on the `mlir::Value` level rather than the
`semantics::Symbol` level. This takes one step towards to enabling
MLIR passes to more easily do some lowering themselves (e.g. creating
`omp.map.bounds` ops for implicitely caputured data like this PR
does).
2. Adds support for implicitely capturing and mapping allocatable fields
in record types.
There is quite some distant to still cover to have full support for
this. I added a number of todos to guide further development.
Co-authored-by: Andrew Gozillon <andrew.gozillon@amd.com>
Co-authored-by: Andrew Gozillon <andrew.gozillon@amd.com>
-frealloc-lhs is the default.
If -fno-realloc-lhs is specified, then an allocatable on the left
side of an intrinsic assignment is not implicitly (re)allocated
to conform with the right hand side. Fortran runtime will issue
an error if there is a mismatch in shape/type/allocation-status.
Split some headers into headers for public and private declarations in
preparation for #110217. Moving the runtime-private headers in
runtime-private include directory will occur in #110298.
* Do not use `sizeof(Descriptor)` in the compiler. The size of the
descriptor is target-dependent while `sizeof(Descriptor)` is the size of
the Descriptor for the host platform which might be too small when
cross-compiling to a different platform. Another problem is that the
emitted assembly ((cross-)compiling to the same target) is not identical
between Flang's running on different systems. Moving the declaration of
`class Descriptor` out of the included header will also reduce the
amount of #included sources.
* Do not use `sizeof(ArrayConstructorVector)` and
`alignof(ArrayConstructorVector)` in the compiler. Same reason as with
`Descriptor`.
* Compute the descriptor's extra flags without instantiating a
Descriptor. `Fortran::runtime::Descriptor` is defined in the runtime
source, but not the compiler source.
* Move `InquiryKeywordHashDecode` into runtime-private header. The
function is defined in the runtime sources and trying to call it in the
compiler would lead to a link-error.
* Move allocator-kind magic numbers into common header. They are the
only declarations out of `allocator-registry.h` in the compiler as well.
This does not make Flang cross-compile ready yet, the main goal is to
avoid transitive header dependencies from Flang to clang-rt. There are
more assumptions that host platform is the same as the target platform.
Both OpenMP privatization and DO CONCURRENT LOCAL lowering was incorrect
for pointers and derived type with default initialization.
For pointers, the descriptor was not established with the rank/type
code/element size, leading to undefined behavior if any inquiry was made
to it prior to a pointer assignment (and if/when using the runtime for
pointer assignments, the descriptor must have been established).
For derived type with default initialization, the copies were not
default initialized.
When the rhs expression has some constants and a device symbol, an
implicit data transfer needs to be generated for the device symbol and
the computation with the constant is done on the host.
This patch adds a flag to mark hlfir.declare of host variables that are
captured in some internal procedure.
It enables implementing a simple fir.call handling in
fir::AliasAnalysis::getModRef leveraging Fortran language specifications
and without a data flow analysis.
This will allow implementing an optimization for "array =
array_function()" where array storage is passed directly into the hidden
result argument to "array_function" when it can be proven that
arraY_function does not reference "array".
Captured host variables are very tricky because they may be accessed
indirectly in any calls if the internal procedure address was captured
via some global procedure pointer. Without flagging them, there is no
way around doing a complex inter procedural data flow analysis:
- checking that the call is not made to an internal procedure is not
enough because of the possibility of indirect calls made to internal
procedures inside the callee.
- checking that the current func.func has no internal procedure is not
enough because this would be invalid with inlining when an procedure
with internal procedures is inlined inside a procedure without internal
procedure.
We generate `cuf.free` and `func.return` twice if a return statement
exists at the end of program.
```f90
program test
integer, device :: a(10)
return
end
```
```
% flang -x cuda test.cuf -mmlir --mlir-print-ir-after-all
error: loc("/path/to/test.cuf":3:3): 'func.return' op must be the last operation in the parent block
// -----// IR Dump After Fortran::lower::VerifierPass Failed () //----- //
```
Dumped IR:
```mlir
"func.func"() <{function_type = () -> (), sym_name = "_QQmain"}> ({
...
"cuf.free"(%5#1) <{data_attr = #cuf.cuda<device>}> : (!fir.ref<!fir.array<10xi32>>) -> ()
"func.return"() : () -> ()
"cuf.free"(%5#1) <{data_attr = #cuf.cuda<device>}> : (!fir.ref<!fir.array<10xi32>>) -> ()
"func.return"() : () -> ()
}
...
```
The routine `genExitRoutine` in `Bridge.cpp` is guarded by
`blockIsUnterminated()` to make sure that `func.return` is generated
only at the end of a block. However, we redundantly run
`bridge.fctCtx().finalizeAndKeep()` before `genExitRoutine` in this
case, resulting in two pairs of `cuf.free` and `func.return`. This PR
fixes `Bridge.cpp` by using `blockIsUnterminated()` to guard
`finalizeAndKeep` as well.
If you have the following multi-range `do concurrent` loop:
```fortran
do concurrent(i=1:n, j=1:bar(n*m, n/m))
a(i) = n
end do
```
Currently, flang generates the following IR:
```mlir
fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
...
%53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
%54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
%55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
%56 = fir.convert %55 : (i32) -> index
...
fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
...
}
}
```
However, if `bar` is impure, then we have a direct violation of the
standard:
```
C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.
```
Moreover, the standard describes the execution of `do concurrent`
construct in multiple stages:
```
11.1.7.4 Execution of a DO construct
...
11.1.7.4.2 DO CONCURRENT loop control
The concurrent-limit and concurrent-step expressions in the concurrent-control-list are evaluated. ...
11.1.7.4.3 The execution cycle
...
The block of a DO CONCURRENT construct is executed for every active combination of the index-name values.
Each execution of the block is an iteration. The executions may occur in any order.
```
From the above 2 points, it seems to me that execution is divided in
multiple consecutive stages: 11.1.7.4.2 is the stage where we evaluate
all control expressions including the step and then 11.1.7.4.3 is the
stage to execute the block of the concurrent loop itself using the
combination of possible iteration values.
nsw is now added to do-variable increment when -fno-wrapv is enabled as
GFortran seems to do.
That means the option introduced by #91579 isn't necessary any more.
Note that the feature of -flang-experimental-integer-overflow is enabled
by default.
The underlying issue was caused by a file included in two different
places which resulted in duplicate definition errors when linking
individual shared libraries. This was fixed in c3201ddaea
[#109874].
Add support for the -frecord-command-line option that will produce the
llvm.commandline metadata which will eventually be saved in the object
file. This behavior is also supported in clang. Some refactoring of the
code in flang to handle these command line options was carried out. The
corresponding -grecord-command-line option which saves the command line
in the debug information has not yet been enabled for flang.
The new LLVM stack save/restore intrinsic operations are more convenient
than function calls because they do not add function declarations to the
module and therefore do not block the parallelisation of passes.
Furthermore they could be much more easily marked with memory effects
than function calls if that ever proved useful.
This builds on top of #107879.
Resolves#108016
Mark the symbol with OmpShared, and then check that later in lowering to
avoid making a local loop index.
OpenMP 5.2 says: "Loop iteration variables of loops that are not associated
with any OpenMP directive maybe listed in data-sharing attribute clauses on
the surrounding teams, parallel or taskgenerating construct, and on enclosed
constructs, subject to other restrictions."
Tests updated to match the extra OmpShared attribute.
Add regression test for lowering to hlfir.
Closes#102961
---------
Co-authored-by: Tom Eccles <tom.eccles@arm.com>
This patch creates a simple RAII wrapper class for `SymMap` to make it
easier to use and prevent a missing matching `popScope()` for a
`pushScope()` call on simple use cases.
Some push-pop pairs are replaced with instances of the new class by this
patch.
Lowering was crashing when cuf kernels has an unstructured construct.
Blocks created by PFT need to be re-created inside of the operation like
it is done for OpenACC construct.
Code lowering always generates fir.if else blocks for source level if
statements, whether needed or not. Change this to only generate else
blocks that are needed.
ALLOCATE and DEALLOCATE statements can be inlined in device function.
This patch updates the condition that determined to inline these actions
in lowering.
This avoid runtime calls in device function code and can speed up the
execution.
Also move `isCudaDeviceContext` from `Bridge.cpp` so it can be used
elsewhere.
`cuf.data_transfer` will be converted to runtime calls to cuda runtime
api and these are not supported in device code. assignment in OpenACC
region will be handled by the OpenACC code gen so we avoid to generate
data transfer on them.
#106120 Simplify the data transfer when possible by using the reference
and a shape. This bypass the declare op. In order to keep the declare op
around, use the second results of the declare op which achieve the same.
When doing data transfer with dynamic sized array, we are currently
generating a data transfer between two descriptors. If the shape values
can be provided, we can keep the data transfer between two references.
This patch adds the shape operands to the operation.
This will be exploited in lowering in a follow up patch.
This brings the behavior of flang in line with clang which also adds
this metadata unconditionally.
Co-authored-by: Tarun Prabhu <tarun.prabhu@gmail.com>
Reland #96746 with the proper Support/CMakelist.txt change.
fir.type does not contain all Fortran level information about
components. For instance, component lower bounds and default initial
value are lost. For correctness purpose, this does not matter because
this information is "applied" in lowering (e.g., when addressing the
components, the lower bounds are reflected in the hlfir.designate).
However, this "loss" of information will prevent the generation of
correct debug info for the type (needs to know about lower bounds). The
initial value could help building some optimization pass to get rid of
initialization runtime calls.
This patch adds lower bound and initial value information into
fir.type_info via a new fir.dt_component operation. This operation is
generated only for component that needs it, which helps keeping the IR
small for "boring" types.
In general, adding Fortran level info in fir.type_info will allow
delaying the generation of "type descriptors" gobals that are very
verbose in FIR and make it hard to work with FIR dumps from applications
with many derived types.
In nested constructs where a given variable is privatized more than
once, using the default clause, the innermost host association symbol
will point to the previous host association symbol.
Such symbol lacks the allocatable attribute and can't be used to
generate the type of the symbol to be cloned. Use the ultimate
symbol instead.
Fixes#85594, #80398