The all-ones mask was not properly created for i128 types because
builder.createIntegerConstant ended up truncating -1 to something
positive.
Add builder.createAllOnesInteger/createMinusOneInteger helpers and use
them where createIntegerConstant(..., -1) was used.
Add an assert in createIntegerConstant to catch negative numbers for
the i128 type.
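A minimal sketch of the idea behind the helper, using plain MLIR APIs (the actual helper name and signature in FirOpBuilder may differ): build the constant from an APInt that is all ones at the full bit width instead of going through a 64-bit signed value.
```cpp
#include "llvm/ADT/APInt.h"
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Builders.h"

// Sketch only: create an all-ones constant of the given integer type without
// going through an int64_t value, which cannot represent the i128 bit pattern.
static mlir::Value createAllOnesInteger(mlir::OpBuilder &builder,
                                        mlir::Location loc,
                                        mlir::IntegerType type) {
  llvm::APInt allOnes = llvm::APInt::getAllOnes(type.getWidth());
  return builder.create<mlir::arith::ConstantOp>(
      loc, mlir::IntegerAttr::get(type, allOnes));
}
```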
Straightforward computation of `A - FLOOR(A / P) * P` should produce
NaN when P is infinity: A / P is zero, FLOOR(0) is zero, and
0 * infinity is NaN. The -menable-no-infs lowering can still use the
relaxed operation sequence.
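A standalone C++ demonstration of the arithmetic (not flang code):
```cpp
#include <cmath>
#include <cstdio>

// Naive MODULO sequence: A - FLOOR(A / P) * P.
// With P = infinity: A / P == 0, FLOOR(0) == 0, 0 * infinity == NaN,
// so the whole expression evaluates to NaN.
int main() {
  double a = 3.0;
  double p = INFINITY;
  double result = a - std::floor(a / p) * p;
  std::printf("%f\n", result); // prints "nan"
  return 0;
}
```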
Whenever lowering checks whether a function or global already exists in
the mlir::Module, it was doing a module->lookup.
On big programs (~5000 globals and functions), this causes significant
slowdowns because these lookups are linear. Use mlir::SymbolTable to
speed up these lookups. The SymbolTable has to be created from the
ModuleOp and kept in sync with it. It is therefore placed in the
converter, and FirOpBuilders can take a pointer to it to speed up the
lookups.
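A minimal sketch of the idea with standard MLIR APIs (how the table is threaded through the converter and FirOpBuilder is not shown): build the symbol table once from the ModuleOp and query it instead of scanning the module.
```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"

// Sketch only: one SymbolTable per ModuleOp, reused for every lookup. It must
// be kept in sync (e.g. via symbolTable.insert) when new functions or globals
// are created.
static mlir::func::FuncOp lookupFunction(mlir::SymbolTable &symbolTable,
                                         llvm::StringRef name) {
  // Hash-table lookup, instead of the linear scan done by
  // moduleOp.lookupSymbol<mlir::func::FuncOp>(name).
  return symbolTable.lookup<mlir::func::FuncOp>(name);
}
```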
This patch does not bring mlir::SymbolTable to FIR/HLFIR passes, but
some passes that create a lot of runtime calls could benefit from it
too. More analysis will be needed.
As an example of the speed-up, this patch reduces the compilation time
of Whizard compare_amplitude_UFO.F90 from 5 minutes to 2 minutes on my
machine (there is still room for further speed-ups).
The ExternalNameConversion pass can be surprisingly slow on big
programs. On an example with a 50kloc Fortran file with about 10000
calls to external procedures, the pass alone took 25s on my machine.
This patch reduces this to 0.16s.
The root cause is that using `replaceAllSymbolUses` on each modified
FuncOp is very expensive: it walks all operations and attributes
every time.
An alternative would be to use mlir::SymbolUserMap to avoid walking the
module again and again, but this is still much more expensive than
needed because it essentially caches all symbol uses of the module, and
there is no need for such caching here.
Instead (see the sketch below):
- Do a shallow walk of the module (top-level operations only) to detect
the FuncOps/GlobalOps that need to be updated. Update them and record
the name remapping in a DenseMap.
- If any remapping was done, do a single deep walk of the module and
update any SymbolRefAttr that matches a remapped name.
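A simplified, illustrative sketch of the two-phase approach (the real pass also handles nested symbol references and the Fortran-specific renaming rules; `mangle` is a stand-in for the actual renaming logic):
```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"

static void renameExternalSymbols(
    mlir::ModuleOp moduleOp,
    llvm::function_ref<std::string(llvm::StringRef)> mangle) {
  mlir::MLIRContext *ctx = moduleOp.getContext();
  llvm::DenseMap<mlir::StringAttr, mlir::FlatSymbolRefAttr> remap;

  // Phase 1: shallow walk over the operations directly nested in the module.
  for (mlir::Operation &op : moduleOp.getBody()->getOperations()) {
    auto oldName = op.getAttrOfType<mlir::StringAttr>(
        mlir::SymbolTable::getSymbolAttrName());
    if (!oldName)
      continue;
    std::string mangled = mangle(oldName.getValue());
    if (oldName.getValue() == mangled)
      continue;
    auto newName = mlir::StringAttr::get(ctx, mangled);
    mlir::SymbolTable::setSymbolName(&op, newName);
    remap.try_emplace(oldName, mlir::FlatSymbolRefAttr::get(newName));
  }
  if (remap.empty())
    return;

  // Phase 2: one deep walk, patching flat symbol references that were
  // remapped (nested SymbolRefAttrs omitted for brevity).
  moduleOp.walk([&](mlir::Operation *op) {
    for (mlir::NamedAttribute attr : op->getAttrs())
      if (auto ref = llvm::dyn_cast<mlir::FlatSymbolRefAttr>(attr.getValue()))
        if (auto it = remap.find(ref.getAttr()); it != remap.end())
          op->setAttr(attr.getName(), it->second);
  });
}
```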
In CUDA Fortran, data transfers can be done via assignment statements
between host and device variables.
This patch introduces a `fir.cuda_data_transfer` operation that
materializes the data transfer between two memory references.
Simple host-to-device transfers that do not involve descriptors are
also lowered in this patch. When the rhs is an expression that requires
evaluation, a temporary is created. The evaluation is done on the host
and then the transfer is initiated.
Implicit transfers when device symbols are present on the rhs are not
part of this patch. Transfers from device to host are not part of this
patch.
This patch speeds up the compilation time of the example in
https://github.com/llvm/llvm-project/issues/76478#issuecomment-2011023289
from 2 minutes with my builds to about 2 seconds.
MLIR timers showed that more than 98% of the time was spent in
BoxedProcedure trying to figure out if a type needs to be converted.
This is because walking the fir.type members is very expensive for
types containing many components and/or components with many
sub-components.
Extend the lifetime of the visited-type cache from "the type being
visited" to "the whole pass". Use a DenseMap since it is no longer OK
to assume this container will only hold a few elements.
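An illustrative sketch of the memoization (names are hypothetical, not the pass's actual members): the answer to "does this type need conversion?" is computed once per type and kept for the whole pass.
```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "mlir/IR/Types.h"

// Sketch only: pass-lifetime memoization of an expensive type query.
class NeedsConversionCache {
public:
  bool get(mlir::Type type, llvm::function_ref<bool(mlir::Type)> compute) {
    auto it = cache.find(type);
    if (it != cache.end())
      return it->second;
    // Walking a fir.type with many (nested) components is expensive, so the
    // result is remembered for the rest of the pass.
    bool result = compute(type);
    cache[type] = result;
    return result;
  }

private:
  llvm::DenseMap<mlir::Type, bool> cache;
};
```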
This PR extracts `FIROpConversion` and `FIROpAndTypeConversion`
templated base patterns to a header file. All the functions from
FIROpConversion that do not require the template argument are moved to a
base class named `ConvertFIRToLLVMPattern`.
This move is done so the `FIROpConversion` pattern and all its utility
functions can be reused outside of the codegen pass.
For the most part the code is only moved to the new files and not
modified. The only update is the addition of a PatternBenefit argument
with a default value to the constructor so it can be forwarded to the
`ConversionPattern` ctor.
This split mirrors the way the `ConvertOpToLLVMPattern` base pattern is
built on top of the `ConvertToLLVMPattern` base class in
`mlir/include/mlir/Conversion/LLVMCommon/Pattern.h`.
This has been tested with arrays with compile-time constant bounds.
Allocatable arrays and arrays with non-constant bounds are not yet
supported. User-defined reduction functions are also not yet supported.
The design is intended to work for arrays with non-constant bounds too
without a lot of extra work (mostly there are bugs in OpenMPIRBuilder I
haven't fixed yet).
We need some way to get these runtime bounds into the reduction init
and combiner regions. To keep things simple for now, I opted to always
box the array arguments so that the box can be passed as one argument
and the lower bounds and extents read from the box. This has the
disadvantage of resulting in fir.box_dims operations inside of the
critical section. If these prove to be a performance issue, we could
follow OpenACC and read the box lower bounds and extents before the
reduction, passing them as block arguments to the reduction init and
combiner regions. I would prefer to keep things simple for now.
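An illustrative sketch (not the exact lowering code) of reading the bounds back from the boxed argument inside the init/combiner regions, using the usual fir.box_dims idiom:
```cpp
#include <utility>

#include "flang/Optimizer/Builder/FIRBuilder.h"
#include "flang/Optimizer/Dialect/FIROps.h"

// Sketch only: read the lower bound and extent of dimension `dim` (a
// zero-based index value) from a boxed array argument.
static std::pair<mlir::Value, mlir::Value>
readBoxBounds(fir::FirOpBuilder &builder, mlir::Location loc, mlir::Value box,
              mlir::Value dim) {
  mlir::Type idxTy = builder.getIndexType();
  auto dims =
      builder.create<fir::BoxDimsOp>(loc, idxTy, idxTy, idxTy, box, dim);
  return {dims->getResult(0) /*lower bound*/, dims->getResult(1) /*extent*/};
}
```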
Note: this implementation only works when the HLFIR lowering is used. I
don't think it is worth supporting FIR-only lowering because the plan is
for that to be removed soon.
OpenMP array reductions 6/6
Previous PR: https://github.com/llvm/llvm-project/pull/84957
OpenMP reduction declare operations can contain FIR code which needs to
be lowered to LLVM. With array reductions, these regions can contain
more complicated operations which need PreCGRewriting. A similar extra
case was already needed for fir::GlobalOp.
OpenMP array reductions 3/6
Previous PR: https://github.com/llvm/llvm-project/pull/84953
Next PR: https://github.com/llvm/llvm-project/pull/84955
Most FIR passes only look for FIR operations inside of functions (either
because they run only on func.func or they run on the module but iterate
over functions internally). But there can also be FIR operations inside
of fir.global and inside some OpenMP and OpenACC container operations.
This has worked so far for fir.global and OpenMP reductions because
they only contained very simple FIR code that doesn't need most passes
in order to be lowered into LLVM IR. I am not sure how OpenACC handles
this.
In the long run, I hope to see a more systematic approach to making sure
that every pass runs on all of these container operations. I will write
an RFC for this soon.
In the meantime, this patch duplicates the CFG conversion pass so that
it also runs on omp reduction operations. This is similar to how the
AbstractResult pass is already duplicated for fir.global operations.
OpenMP array reductions 2/6
Previous PR: https://github.com/llvm/llvm-project/pull/84952
Next PR: https://github.com/llvm/llvm-project/pull/84954
---------
Co-authored-by: Mats Petersson <mats.petersson@arm.com>
Advise placing the alloca at the start of the first block of whichever
region (init or combiner) we are currently inside.
It probably isn't safe to put an alloca inside of a combiner region
because it will be executed multiple times. But that would be a bug to
fix in Lower/OpenMP.cpp, not here.
OpenMP array reductions 1/6
Next PR: https://github.com/llvm/llvm-project/pull/84953
Comparing c_ptr types for equality or inequality raises an error:
```
not yet implemented: intrinsic module procedure: c_ptr_eq
```
or this one for inequality
```
not yet implemented: intrinsic module procedure: c_ptr_ne
```
This patch adds a lowering for them and fixes the `__fortran_builtins.f90` module for inequality.
Reland after fix has landed for circular modules #85309
Comparing c_ptr types for equality or inequality raises an error:
```
not yet implemented: intrinsic module procedure: c_ptr_eq
```
or this one for inequality
```
not yet implemented: intrinsic module procedure: c_ptr_ne
```
This patch adds a lowering for them and fixes the `__fortran_builtins.f90`
module for inequality.
A mold argument needs to be added to hlfir.element_addr and set in
lowering so that, when the hlfir.element_addr needs to be turned into
an hlfir.elemental operation because the designator must be turned into
a value, the mold can be set on the hlfir.elemental to later allocate
the temporary according to the dynamic type.
This situation happens whenever the vector-subscripted polymorphic
designator does not appear as an assignment left-hand side or as an
I/O input item.
I initially thought retrieving the mold would be tricky if the dynamic
type of the designator was set by a part-ref to the right of the vector
subscripts ("array(vector)%polymorphic_comp"), but this turned out to
be impossible because:
1. A derived type component can be polymorphic only if it has the
POINTER or ALLOCATABLE attribute (F2023 C708).
2. The vector-subscripted part is ranked, and F2023 C919 prohibits any
part-ref to the right of a ranked part from having the POINTER or
ALLOCATABLE attribute.
=> If a vector-subscripted designator is polymorphic, the
vector-subscripted part is the rightmost part, and the mold is the base
of the vector-subscripted part. This makes retrieving the mold easy in
lowering: the mold argument is always set to the base of the
vector-subscripted part when lowering that part, and it is removed at
the end of the designator lowering if the designator is not
polymorphic. This way there is no need to recover the mold from inside
the hlfir.element_addr body.
For non-polymorphic entities, semantics knows the type size and
rewrites sizeof to `"cst element size" * size(x)`.
Lowering has to deal with the polymorphic case, where the type size
must be retrieved from the descriptor (note that the lowering
implementation would work with any entity, polymorphic or not; it is
just not used for the non-polymorphic cases).
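A minimal sketch of the polymorphic path, assuming the element count is already available as an index value (the actual lowering also has to compute it):
```cpp
#include "flang/Optimizer/Builder/FIRBuilder.h"
#include "flang/Optimizer/Dialect/FIROps.h"
#include "mlir/Dialect/Arith/IR/Arith.h"

// Sketch only: SIZEOF for a polymorphic entity, computed as
// "element size read from the descriptor" * "number of elements".
static mlir::Value genPolymorphicSizeof(fir::FirOpBuilder &builder,
                                        mlir::Location loc, mlir::Value box,
                                        mlir::Value numElements /*index*/) {
  mlir::Type idxTy = builder.getIndexType();
  mlir::Value eleSize = builder.create<fir::BoxEleSizeOp>(loc, idxTy, box);
  return builder.create<mlir::arith::MulIOp>(loc, eleSize, numElements);
}
```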
FIROpsDialect was declared manually with a class inheriting from the
MLIR Dialect class. Another declaration is done using tablegen in
`flang/include/flang/Optimizer/Dialect/FIRDialect.td`. This patch
merges the two declarations so that we can use the tablegen-generated
class for everything FIROpsDialect needs.
This is part of a series of patches to bring FIR up to date with the
current MLIR infra.
Modifies the privatization logic so that the emitted code only uses the
HLFIR base (i.e. SSA value `#0` returned from `hlfir.declare`). Before
this change, the emitted privatization logic was a mix of using `#0`
and `#1`, which led to some difficulties when trying to move to delayed
privatization (see the discussion on #84033).
When walking over functions (in pre-order), if the function being
visited needs to be erased, skip visiting its regions.
This was detected by address sanitizer.
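An illustrative sketch of the pattern (`shouldErase` is a hypothetical predicate): when a pre-order walk erases the operation being visited, it must also skip that operation's regions so the walk does not descend into freed IR.
```cpp
#include "llvm/ADT/STLExtras.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/Visitors.h"

// Sketch only: erase functions during a pre-order walk.
static void eraseDeadFuncs(
    mlir::ModuleOp moduleOp,
    llvm::function_ref<bool(mlir::func::FuncOp)> shouldErase /*hypothetical*/) {
  moduleOp.walk<mlir::WalkOrder::PreOrder>([&](mlir::func::FuncOp func) {
    if (shouldErase(func)) {
      func.erase();
      // Do not visit the regions of the (now erased) function.
      return mlir::WalkResult::skip();
    }
    return mlir::WalkResult::advance();
  });
}
```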
Rewriting an op can invalidate the range of users being iterated over.
Store the users in a separate list and iterate over that list instead.
This was detected by address sanitizer.
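A minimal illustration of the fix (`rewriteUser` stands in for whatever rewrite is applied): snapshot the users before mutating the IR.
```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"

// Sketch only: copy the users into a separate list first, because rewriting a
// user invalidates the use range returned by getUsers().
static void rewriteUsers(
    mlir::Value value,
    llvm::function_ref<void(mlir::Operation *)> rewriteUser /*hypothetical*/) {
  auto userRange = value.getUsers();
  llvm::SmallVector<mlir::Operation *> users(userRange.begin(),
                                             userRange.end());
  for (mlir::Operation *user : users)
    rewriteUser(user);
}
```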
The TBAA builder assumed that all loads/stores are inside functions and
hit an assertion once it found loads and stores inside an
omp::ReductionDeclareOp.
For now, just don't add TBAA tags to those loads and stores. They would
end up in a different TBAA tree from the host function's after
OpenMPIRBuilder inlines them anyway, so there isn't an easy way to make
this work.
This will be useful for OpenMP too.
I changed the definition slightly to use `fir::isa_ref_type` (which also
includes llvm pointers) because I think it reads better using the common
type helpers. There shouldn't be any llvm pointers in lowering so this
isn't a functional change.
For `LDBL_MANT_DIG == 113` targets, the FortranFloat128Math library
is just an interface library that provides sources and compilation
options to be used for building FortranRuntime. There are no extra
dependencies on other libraries, so it can be part of FortranRuntime,
which helps avoid extra linking steps in the compiler driver.
Targets with __float128 support in libc will also use this path.
On other targets, where the math support comes from
FLANG_RUNTIME_F128_MATH_LIB,
FortranFloat128Math is built as a standalone static library,
and the compiler driver needs to conduct the linking.
Flang APIs for COMPLEX(16) are just thin C wrappers around
the C math functions. Flang uses the C _Complex ABI for
passing/returning COMPLEX values, so the runtime is aligned with this.
The FIR dialect was created before many interfaces were introduced to
MLIR. This patch exposes the FIR-to-LLVM patterns in a
`populateFIRToLLVMConversionPatterns` function. The idea is to be able
to add the `ConvertToLLVMPatternInterface`. This is not directly
possible yet since the FIR dialect does not currently use the tablegen
infrastructure for its definition.
Follow-up patches will move the FIR dialect definition to tablegen and
then implement the interface.
This makes an adjustment to the existing fir minloc/maxloc generation
code to handle calls with dim=1 that produce a scalar result. This
should allow us to get the same benefits as the existing generated
minmax reductions.
This is a recommit of #76194 with an extra alteration at the end of
genRuntimeMinMaxlocBody to make sure we convert the output array to the
correct type (a `box<heap<i32>>`, not `box<heap<array<1xi32>>>`) to
prevent writing the wrong type of box into it. This still allocates the
data as an `array<1xi32>`, converting it into an i32 on the assumption
that this is safe. An alternative would be to allocate the data as an
i32 and change more of the accesses to it throughout
genRuntimeMinMaxlocBody.
In #83253 @matthias-springer pointed out that LowerHLFIRIntrinsics.cpp
should not be using rewrite patterns with the dialect conversion driver.
The intention of this pass is to lower HLFIR intrinsic operations into
FIR so it conceptually fits dialect conversion. However, dialect
conversion is much stricter about changing types when replacing
operations. This pass sometimes loses track of array bounds, resulting
in replacements by operations with different but compatible types
(expressions of the same rank and element type, with or without
compile-time-known array bounds). This is difficult to accommodate with
the dialect conversion driver, so I have switched to the greedy pattern
rewriter.
There is a lot of test churn because the greedy pattern rewriter also
performs canonicalization.
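A hypothetical sketch of the driver switch (not the actual pass code; pattern population is elided). The greedy driver also folds and canonicalizes as it goes, which explains the test churn:
```cpp
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

// Sketch only: drive the HLFIR-intrinsic rewrite patterns with the greedy
// rewriter instead of mlir::applyPartialConversion.
static mlir::LogicalResult
applyIntrinsicLoweringPatterns(mlir::ModuleOp module,
                               mlir::RewritePatternSet &&patterns) {
  return mlir::applyPatternsAndFoldGreedily(module, std::move(patterns));
}
```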
Changed the lowering to call Norm2DimReal16 for REAL(16).
Added the corresponding entry point to FortranFloat128Math,
which required some restructuring in the related templates.
Internal procedures cannot be called directly from outside the host
procedure, so there is no point giving them external linkage. The only
reason flang did so is that external linkage is the default in MLIR.
Giving them external linkage:
- prevents LLVM from deleting them when they are unused or inlined
- causes bugs with shared libraries (at least on Linux x86-64), because
a call to the internal function can trigger a dynamic loader call that
overwrites the r10 register (the static chain pointer) due to system
calls and does not restore it (it seems the loader does not expect r10
to be used for PLT calls).
This patch gives internal linkage to internal procedures.
Note: the llvm.linkage attribute name cannot be obtained via
getLinkageAttrName since it is not the same name as the one used in the
LLVM dialect. It is just a placeholder defined in
mlir/lib/Conversion/FuncToLLVM/FuncToLLVM.cpp until the func dialect
gets a real linkage model, so simply avoid hard-coding it too many
times in lowering.
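An illustrative snippet (not necessarily the exact helper added by this patch) of attaching internal linkage to a func.func via the placeholder attribute mentioned above:
```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"

// Sketch only: mark an internal procedure with internal linkage. The
// "llvm.linkage" string is the placeholder attribute name consumed by the
// func-to-LLVM conversion.
static void setInternalLinkage(mlir::func::FuncOp func) {
  auto linkage = mlir::LLVM::LinkageAttr::get(func.getContext(),
                                              mlir::LLVM::Linkage::Internal);
  func->setAttr("llvm.linkage", linkage);
}
```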
This fix is a temporary workaround. `LowerHLFIRIntrinsics.cpp` should be
using the greedy pattern rewriter or a manual IR traversal. All patterns
in this file are rewrite patterns. The test failure was caused by
`replaceAllUsesWith`, which is not supported by the dialect conversion;
additional asserts were added recently to prevent incorrect API usage.
These trigger now.
Alternatively, turning the patterns into conversion patterns and
specifying a type converter may work.
Failing test case:
`Fortran/gfortran/regression/gfortran-regression-compile-regression__inline_matmul_14_f90.test`
The current code was doing an unconditional `fir.store %optional_box to
%host_link`, which caused a crash when %optional_box is absent because
it attempts to copy a descriptor from a null address.
Add code to conditionally do the copy at runtime.
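An illustrative sketch of the conditional copy (variable names follow the message; the actual lowering code may differ):
```cpp
#include "flang/Optimizer/Builder/FIRBuilder.h"
#include "flang/Optimizer/Dialect/FIROps.h"

// Sketch only: copy the descriptor into the host link only when the OPTIONAL
// box is actually present.
static void copyIfPresent(fir::FirOpBuilder &builder, mlir::Location loc,
                          mlir::Value optionalBox, mlir::Value hostLink) {
  mlir::Value isPresent =
      builder.create<fir::IsPresentOp>(loc, builder.getI1Type(), optionalBox);
  auto ifOp =
      builder.create<fir::IfOp>(loc, isPresent, /*withElseRegion=*/false);
  builder.setInsertionPointToStart(&ifOp.getThenRegion().front());
  builder.create<fir::StoreOp>(loc, optionalBox, hostLink);
  builder.setInsertionPointAfter(ifOp);
}
```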
The polymorphic array case with lower bounds can be handled by the
array case that already deals with descriptor arguments, with a few
modifications; just use that.
This patch introduces a new operation to represent the CUDA Fortran
kernel loop directive. This operation is modeled as a LoopLikeOp
operation in a similar way to acc.loop.
The CUFKernelDoConstruct parse tree node is also placed correctly in the
PFTBuilder to be available in PFT evaluations.
Lowering from the flang parse-tree to MLIR is also done.