Whenever lowering is checking if a function or global already exists in
the mlir::Module, it was doing module->lookup.
On big programs (~5000 globals and functions), this causes important
slowdowns because these lookups are linear. Use mlir::SymbolTable to
speed-up these lookups. The SymbolTable has to be created from the
ModuleOp and maintained in sync. It is therefore placed in the
converter, and FirOPBuilders can take a pointer to it to speed-up the
lookups.
This patch does not bring mlir::SymbolTable to FIR/HLFIR passes, but
some passes creating a lot of runtime calls could benefit from it too.
More analysis will be needed.
As an example of the speed-ups, this patch speeds-up compilation of
Whizard compare_amplitude_UFO.F90 from 5 mins to 2 mins on my machine
(there is still room for speed-ups).
The ExternalNameConversion pass can be surprisingly slow on big
programs. On an example with a 50kloc Fortran file with about 10000
calls to external procedures, the pass alone took 25s on my machine.
This patch reduces this to 0.16s.
The root cause is that using `replaceAllSymbolUses` on each modified
FuncOp is very expensive: it is walking all operations and attribute
every time.
An alternative would be to use mlir::SymbolUserMap to avoid walking the
module again and again, but this is still much more expensive than what
is needed because it is essentially caching all symbol uses of the
module, and there is no need to such caching here.
Instead:
- Do a shallow walk of the module (only top level operation) to detect
FuncOp/GlobalOp that needs to be updated. Update them and place the name
remapping in a DenseMap.
- If any remapping were done, do a single deep walk of the module
operation, and update any SymbolRefAttr that matches a name that was
remapped.
Most FIR passes only look for FIR operations inside of functions (either
because they run only on func.func or they run on the module but iterate
over functions internally). But there can also be FIR operations inside
of fir.global, some OpenMP and OpenACC container operations.
This has worked so far for fir.global and OpenMP reductions because they
only contained very simple FIR code which doesn't need most passes to be
lowered into LLVM IR. I am not sure how OpenACC works.
In the long run, I hope to see a more systematic approach to making sure
that every pass runs on all of these container operations. I will write
an RFC for this soon.
In the meantime, this pass duplicates the CFG conversion pass to also
run on omp reduction operations. This is similar to how the
AbstractResult pass is already duplicated for fir.global operations.
OpenMP array reductions 2/6
Previous PR: https://github.com/llvm/llvm-project/pull/84952
Next PR: https://github.com/llvm/llvm-project/pull/84954
---------
Co-authored-by: Mats Petersson <mats.petersson@arm.com>
When walking over functions (in pre-order), if the function being
visited needs to be erased, skip visiting its regions.
This was detected by address sanitizer.
This makes an adjustment to the existing fir minloc/maxloc generation
code to handle functions with a dim=1 that produce a scalar result. This
should allow us to get the same benefits as the existing generated
minmax reductions.
This is a recommit of #76194 with an extra alteration to the end of
genRuntimeMinMaxlocBody to make sure we convert the output array to the
correct type (a `box<heap<i32>>`, not `box<heap<array<1xi32>>>`) to
prevent writing the wrong type of box into it. This still allocates the
data as a `array<1xi32>`, converting it into a i32 assuming that is
safe. An alternative would be to allocate the data as a i32 and change
more of the accesses to it throughout genRuntimeMinMaxlocBody.
In certain case "extreme" values like Nan, Inf and 0xffffffff could lead
to generating different code via the inline-generated intrinsics vs the
versions in the runtimes (and other compilers like gfortran). There are
some examples I was using for testing in
https://godbolt.org/z/x4EfqEss5.
This changes the generation for the intrinsics to be more like the
runtimes, using a condition that is similar to:
isFirst || (prev != prev && elem == elem) || elem < prev
The middle part is only used for floating point operations, and checks
if the values are Nan. This should then hopefully make the logic closer
to - return the first element with the lowest value, with Nans ignored
unless there are only Nans. The initial limit value for floats are also
changed from the largest float to Inf, to make sure it is handled
correctly.
The integer reductions are also changed to use a similar scheme to make
sure they work with masked values. This means that the preamble after
the loop can be removed.
This is one option for attempting to move genMinMaxlocReductionLoop to a
better location. It moves it into Transforms and makes HLFIRTranforms
depend upon FIRTransforms.
It passes a build locally, both with and without -DBUILD_SHARED_LIBS,
and does OK on the windows CI.
These hardcoded attribute name are a leftover from the upstreaming
period when there was no way to get the attribute name without an
instance of the operation. It is since possible to do without them and
they should be removed to avoid duplication.
This PR cleanup the fir.global op of these hardcoded attribute name and
use their generated getters. Some other PRs will follow to cleanup other
operations.
The implemented logic matches the logic used for Clang in emitting these
attributes. Although it's hoped that function attributes won't be needed
in the future (vs using fast math flags in individual IR instructions),
there are codegen differences currently with/without these attributes,
as can be seen in issues like #79257 or by hacking Clang to avoid
producing these attributes and observing codegen changes.
This patch seeks to add an initial lowering for pointers and allocatable variables
captured by implicit and explicit map in Flang OpenMP for Target operations that
take map clauses e.g. Target, Target Update. Target Exit/Enter etc.
Currently this is done by treating the type that lowers to a descriptor
(allocatable/pointer/assumed shape) as a map of a record type (e.g. a structure) as that's
effectively what descriptor types lower to in LLVM-IR and what they're represented as
in the Fortran runtime (written in C/C++). The descriptor effectively lowers to a structure
containing scalar and array elements that represent various aspects of the underlying
data being mapped (lower bound, upper bound, extent being the main ones of interest
in most cases) and a pointer to the allocated data. In this current iteration of the mapping
we map the structure in it's entirety and then attach the underlying data pointer and map
the data to the device, this allows most of the required data to be resident on the device
for use. Currently we do not support the addendum (another block of pointer data), but
it shouldn't be too difficult to extend this to support it.
The MapInfoOp generation for descriptor types is primarily handled in an optimization
pass, where it expands BoxType (descriptor types) map captures into two maps, one for
the structure (scalar elements) and the other for the pointer data (base address) and
links them in a Parent <-> Child relationship. The later lowering processes will then treat
them as a conjoined structure with a pointer member map.
The existing type size computation in LoopVersioning does not work
for REAL*10, because the compute element size is 10 bytes,
which violates the power-of-two assertion.
We'd better use the DataLayout for computing the storage size
of each element of an array of the given type.
The shared library build doesn't like references of genMinMaxlocReductionLoop,
in Optimizer/Transforms, from HLFIR/Optimizer/Transforms. For the moment I've
moved the code to the header file where it can be shared, like other methods in
Utils.h
Currently the lowering of a minloc intrinsic with a mask will look something
like:
%e = hlfir.elemental %shape ({
...
})
%m = hlfir.minloc %array mask %e
hlfir.assign %m to %result
hlfir.destroy %m
The elemental will be expanded into a temporary+loop, the minloc into a
FortranAMinloc call (which hopefully gets simplified to a specialized call that
can be inlined at the call site), and the assign might get expanded to a
FortranAAssign. It would be better to generate the entire construct as single
loop if we can - one that performs the minloc calculation with the mask
elemental computed inline.
This patch attempt to do that, adding a hlfir version of the expansion code
from SimplifyIntrinsics that turns an minloc+elemental into a single combined
loop nest. It attempts to reuse the methods in genMinlocReductionLoop for
constructing the loop with a modified loop body. The declaration for the
function is currently in Optimizer/Support/Utils.h, but there might be a better
place for it.
It is added as part of the OptimizedBufferizationPass, like the
similar count/any/all that have been added recently.
Currently lowering sets the extents of assumed-size array to "undef"
which was OK as long as the value was not expected to be read.
But when interfacing with the runtime and when passing assumed-size to
assumed-rank, this last extent may be read and must be -1 as specified
in the BIND(C) case in 18.5.3 point 5.
Set this value to -1, and update all the lowering code that was looking
for an undef defining op to identify assumed-size: much safer to
propagate and use semantic info here, the previous check actually did
not work if the array was used in an internal procedure (defining op not
visible anymore).
@clementval and @agozillon, I left assumed-size extent to zero in the
acc/omp bounds op as it was, please double check that is what you want
(I can imagine -1 may create troubles here, and 0 makes some sense as it
would lead to no data transfer).
This also allows removing special cases in UBOUND/LBOUND lowering.
Also disable allocation of cray pointee. This was never intended and
would now lead to crashes with the -1 value for assumed-size cray
pointee.
The added test case has a loop that is versioned, which has a use of the
loop in an if block after the loop. The current code replaces all uses
of the loop with the new version If, but only if the parent blocks
match. As far as I can see it should be safe to replace all the uses,
then construct the result for the If with op.op.
After the removal of the OpenMP early outlining MLIR pass in #67319, the
`EarlyOutliningInterface` stopped doing any useful work. It used to be
necessary to tie the name of the function from which a target region was
outlined to that new function, so it would be used when translating to
LLVM IR in place of the outlined function's name.
This is not necessary anymore, so this patch removes all references to
this interface and uses of the `omp.outline_parent_name` discardable
attribute in tests.
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
This commit changes the MLIR to LLVMIR export to also attach subprogram
debug attachements to function declarations.
This commit additonally fixes the two passes that produce subprograms to
not attach the "Definition" flag to function declarations. This
otherwise results in invalid LLVM IR.
This commit adds an optional distinct attribute parameter to the
DISubprogramAttr. This enables modeling of distinct subprograms, as
required for LLVM IR. This change is required to avoid accidential
uniquing of subprograms on functions that would lead to invalid LLVM IR
post export.
This commit adds a distinct attribute parameter to the DICompileUnit to
enable the modeling of distinctness. LLVM requires DICompileUnits to be
distinct and there are cases where one gets two equivalent compilation
units but LLVM still requires differentiates them. We observed such
cases for combinations of LTO and inline functions.
This patch also changes the DIScopeForLLVMFuncOp pass to a module pass,
to ensure that only one distinct DICompileUnit is created, instead of
one for each function.
[Flang] Revert "Allow Intrinsic simpification with min/maxloc dim
and…scalar result (#76194)"
This reverts commit 9b7cf5bfb0.
See merge request #76194.
This change was causing several failures in our internal tests. I'm
reverting now and will work on creating a test that David Green can use
to reproduce the problem.
This makes an adjustment to the existing fir minloc/maxloc generation
code to handle functions with a dim=1 that produce a scalar result. This
should allow us to get the same benefits as the existing generated
minmax reductions.
This is a recommit of #75820 with the typename added to the generated
function.
This patch fixes:
flang/lib/Optimizer/Transforms/StackArrays.cpp:452:7: error:
ignoring return value of function declared with 'nodiscard'
attribute [-Werror,-Wunused-result]
This makes an adjustment to the existing fir minloc/maxloc generation
code to handle functions with a dim=1 that produce a scalar result. This
should allow us to get the same benefits as the existing generated
minmax reductions.
This takes the code from D144103 and extends it to maxloc, to allow the
simplifyMinMaxlocReduction method to work with both min and max
intrinsics by switching condition and limit/initial value.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
We pass TBAA alias information with separate TBAA trees per function (to
prevent incorrect alias information after inlining). These TBAA trees
are identified by a unique string per function. Naturally, we use the
mangled name of the function.
TBAA tags are added in two places: during a dedicated pass relatively
early (structured control flow makes fir::AliasAnalysis more accurate),
then again during CodeGen (when implied box loads and stores become
visible). In between these two passes, the ExternalNameConversion pass
changes the name of some functions.
These functions with changed names previously ended up with separate
TBAA trees from the TBAA tags pass and from CodeGen - leading LLVM to
think that all data accesses alias with all descriptor accesses.
This patch solves this by storing the original name of a function in an
attribute during the ExternalNameConversion pass, and using the name
from that attribute when creating TBAA trees during CodeGen.
Currently, when deleting the device functions in the second stage of filtering during MLIR to LLVM translation we can end up with invalid calls to these functions. This is because of the removal of the EarlyOutliningPass which would have otherwise gotten rid of any such calls.
This patch aims to alter the function filtering pass in the following way:
- Any host function is completely removed.
- Call to the host function are also removed and their uses replaced with Undef values.
- Any host function with target region code is marked to be removed during the the second stage.
- Calls to such functions are still removed and their uses replaced with Undef values.
Co-authored-by: Sergio Afonso <sergio.afonsofumero@amd.com>
This patch removes the OMPEarlyOutliningPass as it is no longer required. The implicit map operand capture has now been moved to the PFT lowering stage.
Depends on #67318.
The stack arrays pass uses data flow analysis to determine whether heap
allocations are freed on all paths out of the function.
`interp_domain_em_part2` in spec2017 wrf generates over 120k operations,
including almost 5k fir.if operations and over 200 fir.do_loop
operations, all in the same function. The MLIR data flow analysis
framework cannot provide reasonable performance for such cases because
there is a combinatorial explosion in the number of control flow paths
through the function, all of which must be checked to determine if the
heap allocations will be freed.
This patch skips the stack arrays pass for ridiculously large functions
(defined as having more than 1000 fir.allocmem operations). This
threshold is configurable at runtime with a command line argument.
With this patch, compiling this file is more than 80% faster.
These turn out to be useful for spec2017/fotonik3d and safe so long as
they are not used along side TBAA tags for local allocations. LLVM may
be able to figure out local allocations by itself anyway.
PR #68727
This patch makes changes to the early outlining pass to avoid compiler
crashes due to not handling `hlfir.declare` operations correctly. That
pass is intended to eventually be removed (#67319), but in the meantime
this fixes some issues arising in different parts of the OpenMP
offloading compilation process.
The main changes included in this patch are the following:
- Added support for mapped values defined by an `hlfir.declare`
operation. These operations are now kept in outlined target functions,
so that both of their outputs (base and original base) are available to
the corresponding `omp.target`'s map arguments and region.
- Added a fix by @agozillon to prevent unused map clauses from producing
a compiler crash. All these unused mapped variables are added to the
outlined function's inputs.
- Added a fix to the OpenMP translation to MLIR to support integer
arguments to these outlined functions. This enables successfully
compiling and running the tests in
opemp/libomptarget/test/offloading/fortran using HLFIR.
Co-authored-by: agozillon <Andrew.Gozillon@amd.com>
This avoids trying to version loops that can't be versioned, and thus
avoids hitting an assert.
Co-authored with Slava Zakharin (who provided the test-code).
See RFC at
https://discourse.llvm.org/t/rfc-propagate-fir-alias-analysis-information-using-tbaa/73755
This pass adds TBAA tags to all accesses to non-pointer/target dummy
arguments. These TBAA tags tell LLVM that these accesses cannot alias:
allowing better dead code elimination, hoisting out of loops, and
vectorization.
Each function has its own TBAA tree so that accesses between funtions
MayAlias after inlining.
I also included code for adding tags for local allocations and for
global variables. Enabling all three kinds of tag is known to produce a
miscompile and so these are disabled by default. But it isn't much
code and I thought it could be interesting to play with these later if
one is looking at a benchmark which looks like it would benefit from
more alias information. I'm open to removing this code too.
TBAA tags are also added separately by TBAABuilder during CodeGen.
TBAABuilder has to run during CodeGen because it adds tags to box
accesses, many of which are implicit in FIR. This pass cannot (easily)
run in CodeGen because fir::AliasAnalysis has difficulty tracing values
between blocks, and by the time CodeGen runs, structured control flow
has already been lowered.
Coming in follow up patches
- Change CodeGen/TBAABuilder to use TBAAForest to add tags within the
same per-function trees as are used here (delayed to a later patch
to make it easier to revert)
- Command line argument processing to actually enable the pass
The goal is to progressively propagate all the derived type info that is
currently in the runtime type info globals into a FIR operation that can
be easily queried and used by FIR/HLFIR passes.
When this will be complete, the last step will be to stop generating the
runtime info global in lowering, but to do that later in or just before
codegen to keep the FIR files readable (on the added type-info.f90
tests, the lowered runtime info globals takes a whooping 2.6 millions
characters on 1600 lines of the FIR textual output. The fir.type_info that
contains all the info required to generate those globals for such
"trivial" types takes 1721 characters on 9 lines).
So far this patch simply starts by replacing the fir.dispatch_table
operation by the fir.type_info operation and to add the noinit/
nofinal/nodestroy flags to it. These flags will soon be used in HLFIR to
better rewrite hlfir.assign with derived types.
Add vscale range attirbute for the Scalable Vector Extension (SVE) if
provided on the command-line (options in a previous commit)
If no command-line option is provided, if the target-feature of SVE is
specified and the architecture is AArch64, it defualts to 128-2048. in
other words a vscale-min of 1, vscale-max of 16.
A pass is used to add the atribute to all functions. The vectorizer will
use this attribute to generate the SVE instruction to match the range
specified. The attribute is harmless if there is no vectorizable
operations in the function.
This patch fixes two issues introduced by the D149368 patch, one is
a memory leak from using the removeFromParent rather
than eraseFromParent (the erase also had to be moved to not create
use after deletes).
And the other is a possible iterator invalidation bug, better to be safe
than sorry.
This patch adds initial lowering for DeclareTargetAttr on
GlobalOp's utilising registerTargetGlobalVariable
and getAddrOfDeclareTargetVar from the
OMPIRBuilder.
It also adds initial processing of declare target map
operands, populating the combinedInfo that the
OMPIRBuilder requires to generate kernels and
it's kernel argument structure.
The combination of these additions allows simple mapping
of declare target globals to Target regions, as such a simple
runtime test showcasing this and testing it has been added.
The patch currently does not factor in filtering
based on device_type clauses (e.g. no emission of
globals for device if host specified), this will come in
a future iteration. And for the moment it's only been
tested with 1-D arrays and basic fortran data types,
more complex types (such as user defined derived
types from Fortran, allocatables or Fortran pointers)
may need further work.
reviewers: kiranchandramohan, skatrak
Differential Revision: https://reviews.llvm.org/D149368
Defining a procedure with a BIND(C, NAME="...") where the binding label
matches the assembly name of a non BIND(C) external procedure in the
same file causes a failure when generating the LLVM IR because of the
assembly symbol name clash.
Prevent this crash with a clearer semantic error.
This patch is a required change for the device side IR to
maintain apporpiate links for declare target variables to
their global variables for later lowering.
It is also a requirement to clone over map bounds and
entry operations to maintain the correct information for
later lowering of the IR.
It simply tries to clone over the relevant information
maintaining the appropriate links they would have
maintained prior to the pass, rather than redirecting
them to new function arguments which causes a
loss of information in the case of Declare Target
and map information.
Depends on D158734
reviewers: TIFitis, razvanlupusoru
Differential Revision: https://reviews.llvm.org/D158735
The first test case added in the LIT test demonstrates the problem.
Even though we did not consider the inner loop as a candidate for
the transformation due to the array_coor with a slice, we decided to
version the outer loop for the same function argument.
During the cloning of the outer loop we dropped the slicing completely
producing invalid code.
I restructured the code so that we record all arg uses that cannot be
transformed (regardless of the reason), and then fixup the usage
information across the loop nests. I also noticed that we may generate
redundant contiguity checks for the inner loops, so I fixed it
since it was easy with the new way of keeping the usage data.
This patch changes how common blocks are aggregated and named in
lowering in order to:
* fix one obvious issue where BIND(C) and non BIND(C) with the same
Fortran name were "merged"
* go further and deal with a derivative where the BIND(C) C name matches
the assembly name of a Fortran common block. This is a bit unspecified
IMHO, but gfortran, ifort, and nvfortran "merge" the common block
without complaints as a linker would have done. This required getting
rid of all the common block mangling early in FIR (\_QC) instead of
leaving that to the phase that emits LLVM from FIR because BIND(C)
common blocks did not have mangled names. Care has to be taken to deal
with the underscoring option of flang-new.
See added flang/test/Lower/HLFIR/common-block-bindc-conflicts.f90 for an
illustration.
This is the last piece required for the loop versioning patch to work on
code lowered via HLFIR. With this patch, HLFIR performance on spec2017
roms is now similar to the FIR lowering.
Adding support for fir.array_coor means that many more loops will be
versioned, even in the FIR lowering. So far as I have seen, these do not
seem to have an impact on performance for the benchmarks I tried, but I
expect it would speed up some programs, if the loop being versioned
happened to be the hot code.
The main difference between fir.array_coor and fir.coordinate_of is
that fir.coordinate_of uses zero-based indices, whereas fir.array_coor
uses the indices as specified in the Fortran program (starting from 1 by
default, but also supporting non default lower bounds). I opted to
transform fir.array_coor operations into fir.coordinate_of operations
because this allows both to share the same offset calculation logic.
The tricky bit of this patch is getting the correct lower bounds for the
array operand to subtract from the fir.array_coor indices to get a
zero-based indices. So far as I can tell, the FIR lowering will always
provide lower bounds (shift) information in the shape operand to the
fir.array_coor when non-default lower bounds are used. If none is given,
I originally tried falling back to reading lower bounds from the box,
but this led to misscompilation in SPEC2017 cam4. Therefore the pass
instead assumes that if it can't already find an SSA value for the shift
information, the default lower bound (1) should be used.
A suspect the incorrect lower bounds in the box for the FIR lowering was
already a known issue (see https://reviews.llvm.org/D158119).
Differential Revision: https://reviews.llvm.org/D158597