Instead of emitting globals in the program/default address space, emit
them in the global address space. This also requires changing how
address-of code-gen is handled: we need to cast to the default address
space to prevent code-gen issues.
Polymorphic temporaries are currently propagated as
fir.ref<fir.class<fir.heap<>>> because their allocation may be delayed
to the hlfir.assign copy (using realloc).
This patch moves away from this and directly allocates the temp and
propagates it as a fir.class.
The creation of polymorphic temporaries is also simplified by avoiding
the need to call the runtime to set up the descriptor altogether (the
runtime is still called for the allocation currently because
alloca/allocmem do not support polymorphism).
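For context, a hedged Fortran sketch of a case where such a polymorphic
temporary is needed (illustrative code, not taken from the patch):
```fortran
subroutine reverse_in_place(x)
  ! The overlapping, polymorphic RHS requires a temporary whose dynamic type
  ! matches x; that buffer is the kind of temporary this patch now propagates
  ! directly as a fir.class value.
  class(*), allocatable :: x(:)
  x = x(size(x):1:-1)
end subroutine
```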
Flang at -Ofast usually produces executables that consume more stack than
other Fortran compilers.
This is in part because the allocas created from temporary heap
allocations by the StackArray pass are created at function scope
without lifetimes, and LLVM does not, or cannot, merge allocas
whose lifetimes do not overlap.
This patch adds an option to generate LLVM lifetime markers in the
StackArray pass at the locations of the previous heap allocation/free,
using the LLVM dialect operations for them.
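As a hedged illustration (names and shapes are mine, not from the patch),
consider a subroutine where two array temporaries never live at the same time:
```fortran
subroutine two_temps(a, b, n)
  integer :: n
  real :: a(n), b(n)
  ! Each assignment needs an array temporary. Once the StackArray pass turns
  ! the heap temporaries into allocas, lifetime markers around them allow LLVM
  ! to reuse a single stack slot, since the two lifetimes do not overlap.
  a = a(n:1:-1) + 1.0
  b = b(n:1:-1) + 2.0
end subroutine
```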
Adds support for lowering `do concurrent` nests from PFT to the new
`fir.do_concurrent` MLIR op as well as its special terminator
`fir.do_concurrent.loop` which models the actual loop nest.
To that end, this PR emits the allocations for the iteration variables
within the block of the `fir.do_concurrent` op and creates a region for
the `fir.do_concurrent.loop` op that accepts arguments equal in number
to the number of the input `do concurrent` iteration ranges.
For example, given the following input:
```fortran
do concurrent(i=1:10, j=11:20)
end do
```
the changes in this PR emit the following MLIR:
```mlir
fir.do_concurrent {
  %22 = fir.alloca i32 {bindc_name = "i"}
  %23:2 = hlfir.declare %22 {uniq_name = "_QFsub1Ei"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
  %24 = fir.alloca i32 {bindc_name = "j"}
  %25:2 = hlfir.declare %24 {uniq_name = "_QFsub1Ej"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
  fir.do_concurrent.loop (%arg1, %arg2) = (%18, %20) to (%19, %21) step (%c1, %c1_0) {
    %26 = fir.convert %arg1 : (index) -> i32
    fir.store %26 to %23#0 : !fir.ref<i32>
    %27 = fir.convert %arg2 : (index) -> i32
    fir.store %27 to %25#0 : !fir.ref<i32>
  }
}
```
[RFC on
discourse](https://discourse.llvm.org/t/rfc-volatile-representation-in-flang/85404/1)
Flang currently lacks support for volatile variables. In some cases the
compiler produces TODO error messages; in other cases the attribute is
ignored. Some of our tests are like the example from _C.4 Clause 8
notes: The VOLATILE attribute (8.5.20)_ and require volatile variables.
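For illustration, a minimal hedged example, loosely in the spirit of that
note, where VOLATILE is required for correctness:
```fortran
subroutine wait_for_flag(done)
  ! Without VOLATILE, the compiler may hoist the load of DONE out of the loop
  ! and spin forever even though another agent eventually sets it.
  logical, volatile :: done
  do while (.not. done)
  end do
end subroutine
```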
Prior commits:
```
c9ec1bc753 [flang] Handle volatility in lowering and codegen (#135311)
e42f860985 [flang][nfc] Support volatility in Fir ops (#134858)
b2711e1526 [flang][nfc] Support volatile on ref, box, and class types (#134386)
```
This reverts commit 04b87e15e4.
The reasons for reverting are the following:
1. I still need to upstream some parts of the do concurrent to OpenMP
pass from our downstream implementation, and taking this in downstream
will make things more difficult.
2. I still need to work on a solution for modeling locality specifiers
on `hlfir.do_concurrent` ops. I would prefer to do that and merge the
entire stack together instead of having a partial solution.
After merging the revert, I will reopen the original PR and keep it
updated against main until I finish the above.
* Enable lowering and conversion patterns to pass volatility information
from higher level operations to lower level ones.
* Enable codegen to pass volatility to LLVM dialect ops by setting an
attribute on loads, stores, and memory intrinsics.
* Add utilities for passing along the volatility from an input type to
an output type.
To introduce volatile types into the IR, entities with the volatile
attribute will be given a volatile type in the bridge; this is not
enabled in this patch. User code should not result in IR with volatile
types yet, so this patch contains no tests with Fortran source, only IR
that already contains volatile types.
Part 3 of #132486.
Part two of merging #132486. Support volatility in fir ops.
* Introduce a new operation fir.volatile_cast, whose only purpose is to
add or take away the volatility of an SSA value's type. The types must
be otherwise identical, and any other type conversions must be handled
by fir.convert. fir.convert will give an error if the volatility of the
inputs does not match, such that all changes to volatility must be
handled explicitly through fir.volatile_cast.
* Add memory effects to ops that read from or write to memory. The
precedent for this comes from the LLVM dialect (feb7beaf70) where
llvm.load/store ops with the volatile attribute report read/write
effects to a generic memory resource. This change is similar in spirit
but different in two ways: the volatility of an operation is determined
by the type of its memref, not an attribute on the op, and the memory
effects of a load- or store-like operation on a volatile reference type
are reported against a particular memory resource,
`VolatileMemoryResource`. This is so MLIR optimizations are able to
reorder operations that are not volatile around operations that are,
which we believe more precisely models LLVM's volatile memory semantics.
@vzakhari suggested this in #132486 citing LangRef. See
https://llvm.org/docs/LangRef.html#volatile-memory-accesses
Changes needed to generate IR with volatile types are not included in
this change, so it should be non-functional, containing only the changes
to Fir ops and op utilities that will be needed once we enable lowering
to generate volatile types.
Part one of merging #132486. Add support for representing volatility in
the type system for reference, box, and class types. Don't do anything
with volatile just yet, only support and test their representation and
utility functions.
The naming convention is a little goofy - `fir::isa_volatile_type` and
`fir::updateTypeWithVolatility` use different capitalization, but I put
them near similar functions and tried to match the surrounding
conventions and [the
docs](https://github.com/llvm/llvm-project/blob/main/flang/docs/C%2B%2Bstyle.md#naming)
best I could.
The code generation relies on the `ShallowCopyDirect` runtime function
to copy data between the original and the temporary arrays
(in both directions). The allocations are done by the compiler
generated code. The heap allocations could have been passed
to the `ShallowCopy` runtime, but I decided to expose the allocations
so that the temporary descriptor passed to `ShallowCopyDirect`
has `nocapture` - maybe this will be better for LLVM optimizations.
Implement handling of `NULL()` RHS, polymorphic pointers, as well as
lower bounds or bounds remapping in pointer assignment inside FORALL.
It turns out these cases do not require updating hlfir.region_assign;
lowering can simply prepare the new descriptor for the LHS inside the
RHS region.
Looking more closely at the polymorphic cases, there is no need to call
the runtime: fir.rebox and fir.embox do handle the dynamic type setting
correctly.
After this patch, the last remaining TODO is the allocatable assignment
inside FORALL, which, like some cases here, is more likely an accidental
feature given that FORALL was deprecated in F2003 at the same time that
allocatable components were added.
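For reference, a hedged Fortran sketch of the kinds of pointer assignments
inside FORALL covered here (the names and shapes are illustrative):
```fortran
subroutine forall_pointer_assignments()
  type :: cell
    integer, pointer :: p(:)
  end type
  integer, target :: table(10, 5)
  type(cell) :: a(10)
  integer :: i
  table = 0
  ! Pointer assignment with a lower-bound specification inside FORALL.
  forall (i = 1:10)
    a(i)%p(0:) => table(i, :)
  end forall
  ! NULL() on the RHS of a pointer assignment inside FORALL.
  forall (i = 1:10)
    a(i)%p => null()
  end forall
end subroutine
```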
This change is inspired by a case in the facerec benchmark, where the
performance of scalar code may improve by about 6% on AArch64 due to
getting rid of redundant loads from Fortran descriptors. These
descriptors correspond to subroutine-local ALLOCATABLE, SAVE variables.
The scalar loop nest in the LocalMove subroutine contains calls to
Fortran runtime IO functions, and LLVM's globals-aa analysis cannot
prove that these calls do not modify the globalized descriptors with
internal linkage.
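The shape of the problematic code is roughly the following (a hedged
sketch, not the actual benchmark source):
```fortran
subroutine local_move_like(n)
  integer, intent(in) :: n
  ! The descriptor of BUF is globalized with internal linkage because of SAVE.
  real, allocatable, save :: buf(:)
  integer :: i
  if (.not. allocated(buf)) allocate(buf(n))
  do i = 1, n
    buf(i) = real(i)
    ! The IO runtime call inside the loop prevents globals-aa from proving that
    ! the descriptor is unmodified, so its fields are reloaded every iteration.
    if (mod(i, 1024) == 0) print *, 'progress', i
  end do
end subroutine
```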
This patch sets and propagates the llvm.memory_effects attribute for
fir.call operations calling Fortran runtime functions. In particular, it
tries to set the Other memory effect to NoModRef. The Other memory
effect includes accesses to globals and captured pointers, so we cannot
set it for functions taking Fortran descriptors, with one exception
for calls where the Fortran descriptor arguments are all null.
Since different calls to the same Fortran runtime function may have
different attributes, I decided to attach the attributes to the calls
rather than to the functions. Moreover, attaching the attributes to
func.func would require propagating them to llvm.func, which is not
happening right now.
In addition to llvm.memory_effects, the new pass sets the llvm.nosync
and llvm.nocallback attributes, which may also help LLVM alias analysis
(e.g. see #127707). These attributes are currently ignored;
I will support them in the LLVM IR dialect in a separate patch.
I also added another pass that lets developers print the
declarations/calls of all Fortran runtime functions that are recognized
by the attribute-setting pass. It should help with maintenance
of the LIT tests.
```
.../flang/lib/Optimizer/Builder/FIRBuilder.cpp: In function ‘llvm::SmallVector<mlir::Value> fir::factory::updateRuntimeExtentsForEmptyArrays(fir::FirOpBuilder&, mlir::Location, mlir::ValueRange)’:
.../flang/lib/Optimizer/Builder/FIRBuilder.cpp:1786:10: error: could not convert ‘newExtents’ from ‘SmallVector<[...],15>’ to ‘SmallVector<[...],6>’
 1786 |   return newExtents;
      |          ^~~~~~~~~~
      |          |
      |          SmallVector<[...],15>
```
Remove size from template parameters in the declaration of `newExtents`.
An hlfir.elemental with shape `(0, HUGE)` still runs `HUGE`
iterations when expanded into a loop nest.
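For example (an illustrative sketch with hypothetical shapes):
```fortran
subroutine empty_elemental(a, b, n)
  integer :: n
  real :: a(0, n), b(0, n)
  ! The elemental loop nest has bounds (0, n); a naive expansion still runs all
  ! n iterations of the outer loop even though the result is an empty array.
  a = b + 1.0
end subroutine
```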
HLFIR transformational operations inlined as hlfir.elemental
may execute more slowly than the Fortran runtime implementation.
This patch adds an option for the BufferizeHLFIR pass to reset all
upper bounds in the elemental loop nests to zero when the result
is an empty array.
A separate patch will enable this option in the driver after I do
more performance testing. The option is off by default now.
When a field in a derived type is `c_devptr`, keep checking whether we
can do a memcpy instead of falling back to the runtime assignment.
Many internal CUDA Fortran derived types have a `c_devptr` field, and
this would lead to stack overflow on the device if the assignment is
performed by the runtime function.
Because `c_devptr` has a `c_ptr` field, any such assignment was done via
the Assign runtime function. This leads to stack overflow on the device
and uses too much memory. As we know a c_devptr can be directly copied
on assignment, make it a special case.
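A hedged sketch of the pattern in question (assuming `c_devptr` comes
from the vendor's `cudafor` module in CUDA Fortran; names are illustrative):
```fortran
module dev_types
  use cudafor, only: c_devptr   ! assumed CUDA Fortran module providing c_devptr
  type :: dev_handle
    type(c_devptr) :: ptr
    integer :: n
  end type
contains
  subroutine copy_handle(a, b)
    type(dev_handle), intent(out) :: a
    type(dev_handle), intent(in) :: b
    ! With this change the intrinsic assignment is a plain memcpy instead of a
    ! call to the Assign runtime function, avoiding device stack overflows.
    a = b
  end subroutine
end module
```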
Inlining `hlfir.matmul` as `hlfir.eval_in_mem` does not make it possible
to get rid of a temporary array in many cases, but it may still be
much better because it allows us to:
* Get rid of any overhead related to calling the runtime MATMUL
(such as descriptor creation).
* Use a CPU-specific vectorization cost model for the matmul loops,
which the Fortran runtime cannot currently do.
* Optimize matmul of known-size arrays by complete unrolling.
One of the drawbacks of `hlfir.eval_in_mem` inlining is that
the ops inside it with store memory effects block the current
MLIR CSE, so I decided to run this inlining late in the pipeline.
There is a source comment explaining the CSE issue in more detail.
Straightforward inlining of `hlfir.matmul` as an `hlfir.elemental`
is not good for performance, and I got performance regressions
with it compared to the Fortran runtime implementation. I put it
under an engineering option for experiments.
At the same time, inlining `hlfir.matmul_transpose` as `hlfir.elemental`
seems to be a good approach, e.g. it allows getting rid of a temporary
array in cases like: `A(:)=B(:)+MATMUL(TRANSPOSE(C(:,:)),D(:))`.
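For example (a hedged sketch of that pattern):
```fortran
subroutine matmul_transpose_case(a, b, c, d)
  real :: a(:), b(:), c(:, :), d(:)
  ! When hlfir.matmul_transpose is inlined as hlfir.elemental, the MATMUL result
  ! can be computed element-wise inside the assignment, without a temporary array.
  a = b + matmul(transpose(c), d)
end subroutine
```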
This patch improves performance of galgel and tonto a little bit.
Add `c_devloc` as an intrinsic and inline it during lowering. `c_devloc`
is used in CUDA Fortran to get the address of device variables.
For the moment, we borrow almost all semantic checks from `c_loc`, except
for the pointer or target restriction. The specifications of `c_devloc`
are pretty vague, and we will relax/enforce the restrictions based on
library and apps usage, comparing them to the reference compiler.
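A hedged usage sketch (assuming `c_devptr` and `c_devloc` are made
available through the CUDA Fortran `cudafor` module):
```fortran
subroutine take_device_address()
  use cudafor, only: c_devptr, c_devloc   ! assumed CUDA Fortran module
  real, device :: x(10)
  type(c_devptr) :: p
  ! c_devloc returns the device address of x, analogous to c_loc for host data.
  p = c_devloc(x)
end subroutine
```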
This PR is one of 3 in a PR stack; it is the primary change set, which
seeks to extend the current derived type explicit member mapping support
to handle descriptor member mapping at arbitrary levels of nesting. The
PR stack seems to do this reasonably (from testing so far), but as you
can create quite complex mappings with derived types (in particular when
adding allocatable derived types or arrays of allocatable derived types)
I imagine there will be hiccups, which I am more than happy to address.
There will also be further extensions to this work to handle the
implicit auto-magical mapping of descriptor members in derived types and
a few other changes planned for the future (with some ideas on
optimizing things).
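A hedged example of the kind of mapping this enables (the types and
names are illustrative, not from the PR's tests):
```fortran
subroutine map_nested_allocatable()
  type :: inner
    real, allocatable :: data(:)
  end type
  type :: outer
    type(inner), allocatable :: nested
  end type
  type(outer) :: obj
  allocate(obj%nested)
  allocate(obj%nested%data(10))
  ! Mapping a member whose parents are descriptor (allocatable) types requires
  ! intermediate maps of each parent descriptor and its base address.
  !$omp target map(tofrom: obj%nested%data)
  obj%nested%data(1) = 1.0
  !$omp end target
end subroutine
```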
The changes in this PR primarily occur in the OpenMP lowering and the
OMPMapInfoFinalization pass.
In the OpenMP lowering, several utility functions were added or extended
to support the generation of the intermediate member mappings that are
currently required when the parent (or multiple parents) of a mapped
member is a descriptor type. We need to map the entirety of these types,
or do a "deep copy" for lack of a better term, where we map both the
base address and the descriptor: without copying both, we lack the
information the descriptor carries to access the member or to attach the
pointer's data to the pointer, and we need the base address to map the
chunk of data. Currently we do not segment descriptor-based derived
types as we do with regular non-descriptor derived types; we effectively
map their entirety in all cases at the moment. I hope to address this at
some point in the future, as it adds a fair bit of a performance penalty
to having nestings of allocatable derived types, for example.
The process of mapping all intermediate descriptor members in a member's
path only occurs if a member has an allocatable or object parent in its
symbol path, or if the member itself is allocatable. This happens in the
createParentSymAndGenIntermediateMaps function, which will also generate
the appropriate address for the allocatable member within the derived
type to use as the varPtr field of the map (for intermediate allocatable
maps and final allocatable mappings). In this case it is necessary
because we cannot utilise the usual Fortran::lower functionality, such
as gatherDataOperandAddrAndBounds, without causing issues later in the
lowering due to extra allocas being spawned, which seem to affect the
pointer attachment (at least this is my current assumption; it results
in memory access errors on the device due to incorrect map information
generation). This is similar to why we do not use the MLIR value
generated for this and instead utilise the original symbol provided when
mapping descriptor types external to derived types. Hopefully this can
be rectified in the future so this function can be simplified and more
closely aligned with the other type mappings.
We also make use of fir::CoordinateOp as opposed to the HLFIR version,
as the HLFIR version does not support the appropriate lowering to FIR
necessary at the moment. We also cannot use a single CoordinateOp
(similarly to a single GEP), as indexing through a descriptor operation
(BoxType) causes issues later in the lowering; in either case we need
access to the intermediate descriptors, so individual CoordinateOps aid
this (although being able to compress them into a smaller number of
CoordinateOps may simplify the IR and perhaps result in a better end
product, something to consider for the future).
The other large change area was the OMPMapInfoFinalization pass, which
had to be extended to support the expansion of box types (or multiple
nestings of box types) within derived types, as well as box-type derived
types. This requires expanding each BoxType mapping from one map into
two and then modifying all of the existing member indices of the
overarching parent mapping to account for the addition of these new
members, alongside adjusting the existing member indices to support the
new maps, which extend the original member indices. (The base address of
a box type is currently considered a member of the box type at position
0, as when lowered to LLVM-IR it is a pointer contained at this position
in the descriptor type; this means mapped children of such an expanded
descriptor type must additionally incorporate the new member index in
the correct location in their own index lists.) I believe there is a
reasonable amount of comments that should aid in understanding this
better, alongside the test alterations for the pass.
A subset of the changes was also aimed at making some of the utilities
for packing and unpacking the DenseIntElementsAttr containing the member
indices shareable across the lowering and OMPMapInfoFinalization. This
required moving some functions to the Lower/Support/Utils.h header and
transforming the lowering structure containing the member index data
into something more similar to the version used in
OMPMapInfoFinalization. There were also some other attempts at tidying
things up in relation to the member index data generation in the
lowering, some of which required creating a logical operator for the
OpenMP ID class so it can be utilised as a map key (it simply utilises
the symbol address for the moment, as ordering is not particularly
important).
Otherwise I have added a set of new tests encompassing some of the
mappings currently supported by this PR (unfortunately as you can have
arbitrary nestings of all shapes and types it's not very feasible to
cover them all).
With some restrictions, BIND(C) derived types can be converted to
compatible BIND(C) derived types.
Semantics already support this, but ConvertOp was missing the
conversion of such types.
Fixes https://github.com/llvm/llvm-project/issues/107783
Test code with ignore_tkr(tk) on a character dummy passed by
fir.boxchar<> was crashing the compiler in [an
assert](2afe678f0a/flang/lib/Optimizer/Dialect/FIRType.cpp (L632))
in `changeElementType`.
It makes little sense to call changeElementType on a fir.boxchar, since
this type is lossy (the shape is not part of it). Just skip it in the
code dealing with ignore(tk) when hitting this case.
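A hedged sketch of the kind of interface that triggered the assert
(names are illustrative):
```fortran
subroutine use_ignore_tkr(c)
  interface
    subroutine takes_any(x)
      !dir$ ignore_tkr(tk) x
      character(*) :: x
    end subroutine
  end interface
  character(10) :: c
  ! The assumed-length character dummy is passed via fir.boxchar<>, and
  ! IGNORE_TKR(tk) used to crash in changeElementType when handling it.
  call takes_any(c)
end subroutine
```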
The new LLVM stack save/restore intrinsic operations are more convenient
than function calls because they do not add function declarations to the
module and therefore do not block the parallelisation of passes.
Furthermore they could be much more easily marked with memory effects
than function calls if that ever proved useful.
This builds on top of #107879.
Resolves #108016
This is an extension of CUDA Fortran. The affected iso_c_binding
intrinsic can accept a `TYPE(c_devptr)` as its first argument. This
patch relaxes the semantic check to accept it and updates the lowering
to unwrap the cptr field from the c_devptr.
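Assuming the intrinsic in question is `c_f_pointer` (my assumption; the
text above does not name it) and that `c_devptr` comes from the CUDA
Fortran `cudafor` module, usage would look roughly like:
```fortran
subroutine wrap_device_pointer(devp)
  use iso_c_binding, only: c_f_pointer
  use cudafor, only: c_devptr          ! assumed CUDA Fortran module
  type(c_devptr) :: devp
  real, device, pointer :: p(:)
  ! A c_devptr is accepted as the first argument; lowering unwraps the
  ! embedded cptr field before the usual c_f_pointer handling.
  call c_f_pointer(devp, p, [100])
end subroutine
```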
This PR implements `ComputeRegionOpInterface` to define `getAllocaBlock`
of OpenACC loop and compute constructs (parallel/kernels/serial). The
primary objective here is to accommodate local variables in OpenACC
compute regions. The change in `fir::FirOpBuilder::getAllocaBlock`
allows local variable allocation inside loops and kernels.
Functions returning C_PTR were lowered to functions returning intptr
(i64 on 64-bit targets). This caused conflicts when these functions were
defined as returning !fir.ref<none>/llvm.ptr in other compiler generated
contexts (e.g., malloc).
Lower them to return !fir.ref<none> instead.
This should deal with https://github.com/llvm/llvm-project/issues/97325
and https://github.com/llvm/llvm-project/issues/98644.
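For reference, a hedged sketch of the kind of function affected
(illustrative names):
```fortran
function get_buffer() result(p)
  use iso_c_binding, only: c_ptr, c_null_ptr
  type(c_ptr) :: p
  ! Such a result used to be lowered as i64 (intptr); it is now lowered as
  ! !fir.ref<none>, matching other compiler-generated declarations like malloc.
  p = c_null_ptr
end function
```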
This patch generalizes the MemoryAllocation pass (alloca -> heap) to
handle fir.alloca regardless of their position in the IR. Previously, it
only dealt with fir.alloca in function entry blocks. The logic is placed
in a utility that can be used to replace allocas in an operation on
demand with whatever kind of allocation the utility user wants via
callbacks (allocmem, or custom runtime calls to instrument the code...).
To do so, a concept of ownership, which was already somewhat implied and
used in passes like stack-reclaim, is formalized. Any operation with the
LoopLikeInterface, AutomaticAllocationScope, or IsolatedFromAbove owns
the allocas directly nested inside its regions, and they must not be
used after the operation.
The pass then looks for the exit points of regions with such interfaces
and uses them to insert deallocations. If dominance is not proved, the
pass falls back to storing the new address into a C pointer variable
created in the entry of the owning region, which allows inserting
deallocations as needed, including near the alloca itself to avoid leaks
when the alloca is executed multiple times due to CFG loops between
blocks.
This should fix https://github.com/llvm/llvm-project/issues/88344.
In a next step, I will try to refactor lowering a bit to introduce
lifetime operations for allocas so that the deallocation points can be
inserted as soon as possible.
Reland #96746 with the proper Support/CMakeLists.txt change.
fir.type does not contain all the Fortran-level information about
components. For instance, component lower bounds and default initial
values are lost. For correctness purposes, this does not matter because
this information is "applied" in lowering (e.g., when addressing the
components, the lower bounds are reflected in the hlfir.designate).
However, this "loss" of information prevents the generation of correct
debug info for the type (which needs to know about lower bounds). The
initial value could help building some optimization pass to get rid of
initialization runtime calls.
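For example (a hedged sketch), both pieces of information in the type
below are lost in the plain fir.type:
```fortran
module m
  type :: t
    ! The non-default lower bound (2) and the default initial value (0.0) are
    ! not represented in fir.type; fir.dt_component in fir.type_info can
    ! record them.
    real :: x(2:11) = 0.0
  end type
end module
```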
This patch adds lower bound and initial value information to
fir.type_info via a new fir.dt_component operation. This operation is
generated only for components that need it, which helps keep the IR
small for "boring" types.
In general, adding Fortran-level info to fir.type_info will allow
delaying the generation of the "type descriptor" globals that are very
verbose in FIR and make it hard to work with FIR dumps from applications
with many derived types.
The number of operations dedicated to CUF grew, and they were all still
in FIR. In order to have a better organization, the CUF operations,
attributes, and code are moved into their own dialect and files. The CUF
dialect is tightly coupled with HLFIR/FIR and their types.
The CUF attributes are bundled into their own library since some
HLFIR/FIR operations depend on them and the CUF dialect depends on the
FIR types. Without putting the attributes in a separate library there
would be a dependency cycle.
Lower local allocations of CUDA device, managed, and unified variables to
fir.cuda_alloc. Add fir.cuda_free in the function context finalization.
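A hedged sketch of the local declarations concerned (CUDA Fortran
attribute syntax; treat the exact spelling of `unified` as an assumption):
```fortran
subroutine device_locals()
  real, device :: a(10)    ! local device variable: fir.cuda_alloc / fir.cuda_free
  real, managed :: b(10)   ! managed-memory local
  real, unified :: c(10)   ! unified-memory local
end subroutine
```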
@vzakhari For some reason the PR #90526 has been closed when I merged PR
#90525. Just reopening one.
…ted. (#89998)" (#90250)
This partially reverts commit 7aedd7dc75.
This change removes calls to the deprecated member functions. It does
not mark the functions deprecated yet and does not disable the
deprecation warning in TypeSwitch. This seems to cause problems with
MSVC.
Both arrays and trivial scalars are supported. Both cases must use
by-ref reductions because both are boxed.
My understanding of the standards is that OpenMP says this should
follow the rules of the intrinsic reduction operators in Fortran, and
Fortran says that unallocated allocatable variables can only be
referenced to allocate them or to test whether they are already
allocated. Therefore we do not need a null pointer check in the combiner
region.
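A hedged sketch of the supported pattern (illustrative names):
```fortran
subroutine reduce_allocatable(n)
  integer :: n, i
  integer, allocatable :: total(:)
  allocate(total(4))
  total = 0
  ! TOTAL is boxed, so a by-ref reduction is used; it is allocated on entry to
  ! the construct, so the combiner region needs no null-pointer check.
  !$omp parallel do reduction(+:total)
  do i = 1, n
    total = total + 1
  end do
end subroutine
```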
The all-ones mask was not properly created for i128 types because
builder.createIntegerConstant ended up truncating -1 to something
positive.
Add builder.createAllOnesInteger/createMinusOneInteger helpers and use
them where createIntegerConstant(..., -1) was used.
Add an assert in createIntegerConstant to catch negative numbers for the
i128 type.