This patch adds a canonicalization pattern for optimizing redundant
"pointer" fir.converts. Such converts prevent the StackArrays pass
to recognize fir.freemem for the corresponding fir.allocmem, e.g.:
```
%69 = fir.allocmem !fir.array<2xi32>
%71:2 = hlfir.declare %69(%70) {uniq_name = ".tmp.arrayctor"} :
(!fir.heap<!fir.array<2xi32>>, !fir.shape<1>) ->
(!fir.heap<!fir.array<2xi32>>, !fir.heap<!fir.array<2xi32>>)
%95 = fir.convert %71#1 :
(!fir.heap<!fir.array<2xi32>>) -> !fir.ref<!fir.array<2xi32>>
%100 = fir.convert %95 :
(!fir.ref<!fir.array<2xi32>>) -> !fir.heap<!fir.array<2xi32>>
fir.freemem %100 : !fir.heap<!fir.array<2xi32>>
```
I found this in `tonto`, but the change does not affect performance at all.
Anyway, it looks like a reasonable thing to do, and it makes easier
to compare the performance profiles with other compilers'.
getElementType() was missing from Sequence and Vector types. Did a
replace of the obvious places getEleTy() was used for these two types
and updated to use this name instead.
Co-authored-by: Scott Manley <scmanley@nvidia.com>
With some restrictions, BIND(C) derived types can be converted to
compatible BIND(C) derived types.
Semantics already support this, but ConvertOp was missing the
conversion of such types.
Fixes https://github.com/llvm/llvm-project/issues/107783
Allow some interaction between LLVM and FIR dialect by allowing
conversion between FIR memory types and llvm.ptr type.
This is meant to help experimentation where FIR and LLVM dialect
coexists, and is useful to deal with cases where LLVM type makes it
early into the MLIR produced by flang, like when inserting LLVM stack
intrinsic here:
0a00d32c5f/flang/lib/Optimizer/Transforms/StackReclaim.cpp (L57)
First patch to fix a BIND(C) ABI issue
(https://github.com/llvm/llvm-project/issues/102113). I need to keep
track of BIND(C) in more locations (fir.dispatch and func.func
operations), and I need to fix a few passes that are dropping the
attribute on the floor. Since I expect more procedure attributes that
cannot be reflected in mlir::FunctionType will be needed for ABI,
optimizations, or debug info, this NFC patch adds a new enum attribute
to keep track of procedure attributes in the IR.
This patch is not updating lowering to lower more attributes, this will
be done in a separate patch to keep the test changes low here.
Adding the attribute on fir.dispatch and func.func will also be done in
separate patches.
This patch generalizes the MemoryAllocation pass (alloca -> heap) to
handle fir.alloca regardless of their postion in the IR. Currently, it
only dealt with fir.alloca in function entry blocks. The logic is placed
in a utility that can be used to replace alloca in an operation on
demand to whatever kind of allocation the utility user wants via
callbacks (allocmem, or custom runtime calls to instrument the code...).
To do so, a concept of ownership, that was already implied a bit and
used in passes like stack-reclaim, is formalized. Any operation with the
LoopLikeInterface, AutomaticAllocationScope, or IsolatedFromAbove owns
the alloca directly nested inside its regions, and they must not be used
after the operation.
The pass then looks for the exit points of region with such interface,
and use that to insert deallocation. If dominance is not proved, the
pass fallbacks to storing the new address into a C pointer variable
created in the entry of the owning region which allows inserting
deallocation as needed, included near the alloca itself to avoid leaks
when the alloca is executed multiple times due to block CFGs loops.
This should fix https://github.com/llvm/llvm-project/issues/88344.
In a next step, I will try to refactor lowering a bit to introduce
lifetime operation for alloca so that the deallocation points can be
inserted as soon as possible.
Reland #96746 with the proper Support/CMakelist.txt change.
fir.type does not contain all Fortran level information about
components. For instance, component lower bounds and default initial
value are lost. For correctness purpose, this does not matter because
this information is "applied" in lowering (e.g., when addressing the
components, the lower bounds are reflected in the hlfir.designate).
However, this "loss" of information will prevent the generation of
correct debug info for the type (needs to know about lower bounds). The
initial value could help building some optimization pass to get rid of
initialization runtime calls.
This patch adds lower bound and initial value information into
fir.type_info via a new fir.dt_component operation. This operation is
generated only for component that needs it, which helps keeping the IR
small for "boring" types.
In general, adding Fortran level info in fir.type_info will allow
delaying the generation of "type descriptors" gobals that are very
verbose in FIR and make it hard to work with FIR dumps from applications
with many derived types.
fir.type does not contain all Fortran level information about
components. For instance, component lower bounds and default initial
value are lost. For correctness purpose, this does not matter because
this information is "applied" in lowering (e.g., when addressing the
components, the lower bounds are reflected in the hlfir.designate).
However, this "loss" of information will prevent the generation of
correct debug info for the type (needs to know about lower bounds). The
initial value could help building some optimization pass to get rid of
initialization runtime calls.
This patch adds lower bound and initial value information into
fir.type_info via a new fir.dt_component operation. This operation is
generated only for component that needs it, which helps keeping the IR
small for "boring" types.
In general, adding Fortran level info in fir.type_info will allow
delaying the generation of "type descriptors" gobals that are very
verbose in FIR and make it hard to work with FIR dumps from applications
with many derived types.
This patch adds more precise side effects to the current ops with memory
effects, allowing us to determine which OpOperand/OpResult/BlockArgument
the
operation reads or writes, rather than just recording the reading and
writing
of values. This allows for convenient use of precise side effects to
achieve
analysis and optimization.
Related discussions:
https://discourse.llvm.org/t/rfc-add-operandindex-to-sideeffect-instance/79243
The alloca can be maximized with the maximum number or ranks, which is
reasonable (15 currently as per the standard). Introducing a rank based
dynamic allocation would complexify alloca hoisting and stack size
analysis (this can be revisited if the standard changes to allow more
ranks).
No change is needed since this is already reflected in how the fir.box
type is translated to LLVM.
Derived from #92480. This PR introduces reduction semantics into loops
for DO CONCURRENT REDUCE. The `fir.do_loop` operation now invisibly has
the `operandSegmentsizes` attribute and takes variable-length reduction
operands with their operations given as `fir.reduce_attr`. For the sake
of compatibility, `fir.do_loop`'s builder has additional arguments at
the end. The `iter_args` operand should be placed in front of the
declaration of result types, so the new operand for reduction variables
(`reduce`) is put in the middle of arguments.
In a simple case like this:
```
program test
integer :: u(120, 2)
u(1:120,1:2) = u(1:120,1:2) + 2
end program
```
Flang is creating a copy loop with fir.array_coor using
a result of fir.embox inserted before the loop. This results in split
address computations before and inside the loop, which can be seen
as many more arithmetic operations than required after converting
FIR to LLVM dialect. Even though LLVM SROA/mem2reg are able
to optimize the temporary descriptor, and then LICM is able to hoist
the invariant computations, we seem to get better mix of LLVM dialect
operations after FIR-to-LLVM codegen. This may also slightly reduce
the compilation time taken by LLVM to optimize the generate LLVM IR.
This may also slightly reduce the time spent by FIR AliasAnalysis
to reach the memory reference source.
With MLIR inlining (e.g. `flang-new -mmlir -inline-all=true`)
the current TBAA tags attachment is suboptimal, because
we may lose information about the callee's dummy arguments
(by bypassing fir.declare in AliasAnalysis::getSource).
This is a conservative first step to improve the situation.
This patch makes AddAliasTagsPass to account for fir.dummy_scope
hierarchy after MLIR inlining and use it to place the TBAA tags
into TBAA trees corresponding to different function scopes.
The pass uses special mode of AliasAnalysis to find the instantiation
point of a Fortran variable (a [hl]fir.decalre) when searching
for the source of a memory reference. In this mode, AliasAnalysis
will always stop at fir.declare operations that have dummy_scope
operands - there should not be a reason to past throught it
for the purpose of TBAA tags attachment.
Fixes a bug uncovered by
[pr43337.f90](https://github.com/llvm/llvm-test-suite/blob/main/Fortran/gfortran/regression/gomp/pr43337.f90)
in the test suite.
In particular, this emits `argNo` debug info only if the parent op of a
block is a `func.func` op. This avoids DI conflicts when a function
contains a nested OpenMP region that itself has block arguments with DI
attached to them; for example, `omp.parallel` with delayed privatization
enabled.
First commit is reviewed in
https://github.com/llvm/llvm-project/pull/93682.
Lower RANK using fir.box_rank. This patches updates fir.box_rank to
accept box reference, this avoids the need of generating an assumed-rank
fir.load just for the sake of reading ALLOCATABLE/POINTER rank. The
fir.load would generate a "dynamic" memcpy that is hard to optimize
without further knowledge. A read effect is conditionally given to the
operation.
The number of operations dedicated to CUF grew and where all still in
FIR. In order to have a better organization, the CUF operations,
attributes and code is moved into their specific dialect and files. CUF
dialect is tightly coupled with HLFIR/FIR and their types.
The CUF attributes are bundled into their own library since some
HLFIR/FIR operations depend on them and the CUF dialect depends on the
FIR types. Without having the attributes into a separate library there
would be a dependency cycle.
Lower locals allocation of cuda device, managed and unified variables to
fir.cuda_alloc. Add fir.cuda_free in the function context finalization.
@vzakhari For some reason the PR #90526 has been closed when I merged PR
#90525. Just reopening one.
…ted. (#89998)" (#90250)
This partially reverts commit 7aedd7dc75.
This change removes calls to the deprecated member functions. It does
not mark the functions deprecated yet and does not disable the
deprecation warning in TypeSwitch. This seems to cause problems with
MSVC.
Fix parsing of cuda_kernel: it missed a mlir::succeeded check and it was
not setting up the `types` and causing mismatch between values and types
of the grid/block (CUFKernelValues). @clementval
---------
Co-authored-by: Iman Hosseini <imanh@nvidia.com>
Co-authored-by: Valentin Clement (バレンタイン クレメン) <clementval@gmail.com>
Add MemRead effect on the box operand as the descriptor might be read
when performing the allocation of the data.
Also update the expected type of the box operand to be a reference.
Check in the verifier that this is a reference to a box or class type.
This addresses the comment made post commit on #88586
Add the fir.cuda_deallocate operation that perform device deallocation
of data hold by a descriptor. This will replace the call to
AllocatableDeallocate from the runtime.
This is a companion operation to the one added in #88586
Allocatable with cuda device attribute have special semantic for the
allocate statement. In flang the allocate statement is lowered to a
sequence of runtime call initializing the descriptor and then allocating
the descriptor data. This new operation will replace the last runtime
call and abstract all the device memory allocation needed.
The lowering patch will follow.
Whenever lowering is checking if a function or global already exists in
the mlir::Module, it was doing module->lookup.
On big programs (~5000 globals and functions), this causes important
slowdowns because these lookups are linear. Use mlir::SymbolTable to
speed-up these lookups. The SymbolTable has to be created from the
ModuleOp and maintained in sync. It is therefore placed in the
converter, and FirOPBuilders can take a pointer to it to speed-up the
lookups.
This patch does not bring mlir::SymbolTable to FIR/HLFIR passes, but
some passes creating a lot of runtime calls could benefit from it too.
More analysis will be needed.
As an example of the speed-ups, this patch speeds-up compilation of
Whizard compare_amplitude_UFO.F90 from 5 mins to 2 mins on my machine
(there is still room for speed-ups).
This patch introduces a new operation to represent the CUDA Fortran
kernel loop directive. This operation is modeled as a LoopLikeOp
operation in a similar way to acc.loop.
The CUFKernelDoConstruct parse tree node is also placed correctly in the
PFTBuilder to be available in PFT evaluations.
Lowering from the flang parse-tree to MLIR is also done.
These hardcoded attribute name are a leftover from the upstreaming
period when there was no way to get the attribute name without an
instance of the operation. It is since possible to do without them and
they should be removed to avoid duplication.
This PR cleanup the fir.dt_entry and fir.dispatch ops of these hardcoded
attribute name and use their generated getters. Some other PRs will
follow to cleanup other operations.
These hardcoded attribute name are a leftover from the upstreaming
period when there was no way to get the attribute name without an
instance of the operation. It is since possible to do without them and
they should be removed to avoid duplication.
This PR cleanup the fir.global op of these hardcoded attribute name and
use their generated getters. Some other PRs will follow to cleanup other
operations.
The custom printer for `fir.global` was eluding all the attributes
present on the op when printing the attribute dictionary. So any
attribute that is not part of the pretty printing was therefore
discarded.
This patch fix the printer and also make use of the getters for the
attribute names when they are hardcoded.
hlfir.declare introduce some boxes that can be later optimized away. The
OpenACC lowering is currently setting some attributes on FIR operations
to track declare variables. When the boxes are optimized away these
attributes are lost. This patch propagate OpenACC attributes from
box_addr op to the defining op of the folding result.
Using `LoopLikeOpInterface` as the basis for the implementation unifies
all the tiling logic for both `scf.for` and `scf.forall`. The only
difference is the actual loop generation. This is a follow up to
https://github.com/llvm/llvm-project/pull/72178
Instead of many entry points for each loop type, the loop type is now
passed as part of the options passed to the tiling method.
This is a breaking change with the following changes
1) The `scf::tileUsingSCFForOp` is renamed to `scf::tileUsingSCF`
2) The `scf::tileUsingSCFForallOp` is deprecated. The same
functionality is obtained by using `scf::tileUsingSCF` and setting
the loop type in `scf::SCFTilingOptions` passed into this method to
`scf::SCFTilingOptions::LoopType::ForallOp` (using the
`setLoopType` method).
3) The `scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp` is
renamed to `scf::tileConsumerAndFuseProducerUsingSCF`. The use of
the `controlFn` in `scf::SCFTileAndFuseOptions` allows implementing
any strategy with the default callback implemeting the greedy fusion.
4) The `scf::SCFTilingResult` and `scf::SCFTileAndFuseResult` now use
`SmallVector<LoopLikeOpInterface>`.
5) To make `scf::ForallOp` implement the parts of
`LoopLikeOpInterface` needed, the `getOutputBlockArguments()`
method is replaced with `getRegionIterArgs()`
These changes now bring the tiling and fusion capabilities using
`scf.forall` on par with what was already supported by `scf.for`
This commit adds extra assertions to `OperationFolder` and `OpBuilder`
to ensure that the types of the folded SSA values match with the result
types of the op. There used to be checks that discard the folded results
if the types do not match. This commit makes these checks stricter and
turns them into assertions.
Discarding folded results with the wrong type (without failing
explicitly) can hide bugs in op folders. Two such bugs became apparent
in MLIR (and some more in downstream projects) and are fixed with this
change.
Note: The existing type checks were introduced in
https://reviews.llvm.org/D95991.
Migration guide: If you see failing assertions (`folder produced value
of incorrect type`; make sure to run with assertions enabled!), run with
`-debug` or dump the operation right before the failing assertion. This
will point you to the op that has the broken folder. A common mistake is
a mismatch between static/dynamic dimensions (e.g., input has a static
dimension but folded result has a dynamic dimension).
This operation allows computing the address of descriptor fields. It is
needed to help attaching descriptors in OpenMP/OpenACC target region.
The pointers inside the descriptor structure must be mapped too, but the
fir.box is abstract, so these fields cannot be computed with
fir.coordinate_of.
To preserve the abstraction of the descriptor layout in FIR, introduce
an operation specifically to !fir.ref<fir.box<>> address fields based on
field names (base_addr or derived_type).
Expose a `MutableArrayRef<OpOperand>` instead of
`ValueRange`/`OperandRange`. This allows users of this interface to
change the yielded values and the init values. The names of the
interface methods are the same as the auto-generated op accessor names
(`get...()` returns `OperandRange`, `get...Mutable()` returns
`MutableOperandRange`).
Note: The interface methods return a `MutableArrayRef` instead of a
`MutableOperandRange` because a loop op may not implement
`getYieldedValuesMutable` etc. and there is no safe way to return an
"empty" range with a `MutableOperandRange`.
Add a new interface method that returns the yielded values.
Also add a verifier that checks the number of inits/iter_args/yielded
values. Most of the checked invariants (but not all of them) are already
covered by the `RegionBranchOpInterface`, but the `LoopLikeOpInterface`
now provides (additional) error messages that are easier to read.
This interface allows (HL)FIR passes to add TBAA information to fir.load
and fir.store. If present, these TBAA tags take precedence over those
added during CodeGen.
We can't reuse mlir::LLVMIR::AliasAnalysisOpInterface because that uses
the mlir::LLVMIR namespace so it tries to define methods for fir
operations in the wrong namespace. But I did re-use the tbaa tag type to
minimise boilerplate code.
The new builders are to preserve the old interface without the tbaa tag.
The goal is to progressively propagate all the derived type info that is
currently in the runtime type info globals into a FIR operation that can
be easily queried and used by FIR/HLFIR passes.
When this will be complete, the last step will be to stop generating the
runtime info global in lowering, but to do that later in or just before
codegen to keep the FIR files readable (on the added type-info.f90
tests, the lowered runtime info globals takes a whooping 2.6 millions
characters on 1600 lines of the FIR textual output. The fir.type_info that
contains all the info required to generate those globals for such
"trivial" types takes 1721 characters on 9 lines).
So far this patch simply starts by replacing the fir.dispatch_table
operation by the fir.type_info operation and to add the noinit/
nofinal/nodestroy flags to it. These flags will soon be used in HLFIR to
better rewrite hlfir.assign with derived types.