In #83253 @matthias-springer pointed out that LowerHLFIRIntrinsics.cpp
should not be using rewrite patterns with the dialect conversion driver.
The intention of this pass is to lower HLFIR intrinsic operations into
FIR so it conceptually fits dialect conversion. However, dialect
conversion is much stricter about changing types when replacing
operations. This pass sometimes looses track of array bounds, resulting
in replacements with operations with different but compatible types
(expressions of the same rank and element types but with or without
compile time known array bounds). This is difficult to accommodate with
the dialect conversion driver and so I have changed to use the greedy
pattern rewriter.
There is a lot of test churn because the greedy pattern rewriter also
performs canonicalization.
This fix is a temporary workaround. `LowerHLFIRIntrinsics.cpp` should be
using the greedy pattern rewriter or a manual IR traversal. All patterns
in this file are rewrite patterns. The test failure was caused by
`replaceAllUsesWith`, which is not supported by the dialect conversion;
additional asserts were added recently to prevent incorrect API usage.
These trigger now.
Alternatively, turning the patterns into conversion patterns and
specifying a type converter may work.
Failing test case:
`Fortran/gfortran/regression/gfortran-regression-compile-regression__inline_matmul_14_f90.test`
In certain case "extreme" values like Nan, Inf and 0xffffffff could lead
to generating different code via the inline-generated intrinsics vs the
versions in the runtimes (and other compilers like gfortran). There are
some examples I was using for testing in
https://godbolt.org/z/x4EfqEss5.
This changes the generation for the intrinsics to be more like the
runtimes, using a condition that is similar to:
isFirst || (prev != prev && elem == elem) || elem < prev
The middle part is only used for floating point operations, and checks
if the values are Nan. This should then hopefully make the logic closer
to - return the first element with the lowest value, with Nans ignored
unless there are only Nans. The initial limit value for floats are also
changed from the largest float to Inf, to make sure it is handled
correctly.
The integer reductions are also changed to use a similar scheme to make
sure they work with masked values. This means that the preamble after
the loop can be removed.
Context: Conversion patterns provide a `ConversionPatternRewriter` to
modify the IR. `ConversionPatternRewriter` provides the public API. Most
function calls are forwarded/handled by `ConversionPatternRewriterImpl`.
The dialect conversion uses the listener infrastructure to get notified
about op/block insertions.
In the current design, `ConversionPatternRewriter` inherits from both
`PatternRewriter` and `Listener`. The conversion rewriter registers
itself as a listener. This is problematic because listener functions
such as `notifyOperationInserted` are now part of the public API and can
be called from conversion patterns; that would bring the dialect
conversion into an inconsistent state.
With this commit, `ConversionPatternRewriter` no longer inherits from
`Listener`. Instead `ConversionPatternRewriterImpl` inherits from
`Listener`. This removes the problematic public API and also simplifies
the code a bit: block/op insertion notifications were previously
forwarded to the `ConversionPatternRewriterImpl`. This is no longer
needed.
This is one option for attempting to move genMinMaxlocReductionLoop to a
better location. It moves it into Transforms and makes HLFIRTranforms
depend upon FIRTransforms.
It passes a build locally, both with and without -DBUILD_SHARED_LIBS,
and does OK on the windows CI.
I was looking through to check whether Nan was being handled correctly,
and couldn't work out why simple cases were behaving differently than
they should. It turns out the initial limit values was backwards for
minloc/maxloc reductions in general. This fixes that, introduced in
#79469.
The newly introduced `CUDAAttribute` is meant for CUDA attributes
associated with variable. In order to not clash with the future
attribute for function/subroutine, rename `CUDAAttribute` to
`CUDADataAttribute`.
This is a first simple patch to introduce a new FIR attribute to carry
the CUDA variable attribute information to hlfir.declare and fir.declare
operations. It currently lowers this information for local variables.
The texture attribute is omitted since it is rejected by semantic and
will not make its way to MLIR.
This new attribute is added as optional attribute to the hlfir.declare
and fir.declare operations.
https://github.com/llvm/llvm-project/pull/80683 revealed that
hlfir.as_expr was propagating the temporary buffer for polymorphic
values as an allocatable while codegen later expects to be working with
fir.box/fir.class but not fir.ref<box/class> when processing the
operations using the hlfir.as_expr result.
Dereference the temporary allocatable as soon as it is created.
The MSCV build doesn't allow the constexpr isMax variable to be used in lambda
without a capture. The -Weverything build does not allow isMax to be used in a
lambda capture as it is a constexpr. I've removed the constexpr as it shouldn't
be necessary.
This change makes the callback consistent with
`notifyOperationInserted`: both now notify about IR insertion, not IR
creation. See also #78988.
This change also simplifies the dialect conversion: it is no longer
necessary to override the `inlineRegionBefore` method. All information
that is necessary for rollback is provided with the
`notifyBlockInserted` callback.
Currently the lowering of a minloc intrinsic with a mask will look something
like:
%e = hlfir.elemental %shape ({
...
})
%m = hlfir.minloc %array mask %e
hlfir.assign %m to %result
hlfir.destroy %m
The elemental will be expanded into a temporary+loop, the minloc into a
FortranAMinloc call (which hopefully gets simplified to a specialized call that
can be inlined at the call site), and the assign might get expanded to a
FortranAAssign. It would be better to generate the entire construct as single
loop if we can - one that performs the minloc calculation with the mask
elemental computed inline.
This patch attempt to do that, adding a hlfir version of the expansion code
from SimplifyIntrinsics that turns an minloc+elemental into a single combined
loop nest. It attempts to reuse the methods in genMinlocReductionLoop for
constructing the loop with a modified loop body. The declaration for the
function is currently in Optimizer/Support/Utils.h, but there might be a better
place for it.
It is added as part of the OptimizedBufferizationPass, like the
similar count/any/all that have been added recently.
The pattern rewriter documentation states that "*all* IR mutations [...]
are required to be performed via the `PatternRewriter`." This commit
adds two functions that were missing from the rewriter API:
`moveOpBefore` and `moveOpAfter`.
After an operation was moved, the `notifyOperationInserted` callback is
triggered. This allows listeners such as the greedy pattern rewrite
driver to react to IR changes.
This commit narrows the discrepancy between the kind of IR modification
that can be performed and the kind of IR modifications that can be
listened to.
When running the `flang/test/HLFIR/simplify-hlfir-intrinsics.fir` test
case on AIX we encounter issues building op as they are not found in the
mlir context:
```
LLVM ERROR: Building op `arith.subi` but it isn't known in this MLIRContext: the dialect may not be loaded or this operation hasn't been added by the dialect. See also https://mlir.llvm.org/getting_started/Faq/#registered-loaded-dependent-whats-up-with-dialects-management
LLVM ERROR: Building op `hlfir.yield_element` but it isn't known in this MLIRContext: the dialect may not be loaded or this operation hasn't been added by the dialect. See also https://mlir.llvm.org/getting_started/Faq/#registered-loaded-dependent-whats-up-with-dialects-management
LLVM ERROR: Building op `hlfir.yield_element` but it isn't known in this MLIRContext: the dialect may not be loaded or this operation hasn't been added by the dialect. See also https://mlir.llvm.org/getting_started/Faq/#registered-loaded-dependent-whats-up-with-dialects-management
```
The issue is caused by the "Merge disjoint stack slots" pass and the
error is not present if the source is built with `-mllvm
--no-stack-coloring`
Thanks to investigation by @stefanp-ibm we found that "the
initializer_list {inputIndices[1], inputIndices[0]} has a lifetime that
only exists for the range of the constructor for ValueRange. Once we get
to stack coloring we merge the stack slot for that element with another
stack slot and then it gets overwritten which corrupts
transposedIndices"
The changes below prevents the corruption of transposedIndices and
passes the test case.
Co-authored-by: Mark Danial <mark.danial@ibm.com>
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
When dealing with overlaps in user defined assignments, some entities
with descriptors (fir.box) may be saved without descriptors. The current
code was replacing the original box entity with the "raw" copy with a
simple cast instead of creating a box for the copy. This patch ensures a
fir.embox is emitted instead.
Certain compilers do not seem to like the static assert with a string, causing
a implicit conversion. It can be removed as it should not be reachable and the
mlir::failure should handle it correctly in case it is.
This adds a ReductionElementalConversion transform to
OptimizedBufferizationPass, taking hlfir::count(hlfir::elemental) and
generating the inline loop to perform the count of true elements. This
lets us generate a single loop instead of ending up as two plus a
temporary.
Any and All should be able to share the same code with a different
function/initial value.
The adds a hlfir minloc intrinsic, similar to the minval intrinsic
already added, to help in the lowering of minloc. The idea is to later
add maxloc too, and from there add a simplification for producing minloc
with inlined elemental and hopefully less temporaries.
This got "lost" in the HLFIR transformation. This patch applies the old
attribute to the AssociateOp that needs it, and forwards it to the
AllocaOp that is generated when lowering to FIR.
To avoid the overhead of deallocating allocatable components of the
elemental temporary result on every iteration of the elemental operation,
we can use a shallow copy instead of deep-copy assign.
Assumed shape array are using descriptor and must be handled differently
than known shape arrays. This patch adds support to generate the `init`
and `combiner` region for the reduction recipe operation with assumed
shape array by using the descriptor and the HLFIR lowering path.
`createTempFromMold` function is moved from
`flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp` to
`flang/include/flang/Optimizer/Builder/HLFIRTools.h` to be reused to
create the private copy.
This set of commits resolves some of the issues with elemental calls producing
results that may require finalization, and also some memory leak issues due to
the missing deallocation of allocatable components of the temporary buffers
created by the bufferization pass.
- [flang][runtime] Expose Finalize API for derived types.
- [flang][hlfir] Add 'finalize' attribute for DestroyOp.
- [flang][hlfir] Postpone result finalization for elemental calls.
The results of elemental calls generated inside hlfir.elemental must not
be finalized/destructed before they are copied into the resulting
array. The finalization must be done on the array as a whole
(e.g. there might be different scalar and array finalization routines).
The finalization work is left to the hlfir.destroy corresponding
to this hlfir.elemental.
- [flang][hlfir] Tighten requirements on hlfir.end_associate operand.
If component deallocation might be required for the operand of
hlfir.end_associate, we have to be able to get the variable
shape/params to create a descriptor for calling the runtime.
This commit adds verification that we can do so.
- [flang][hlfir] Lower argument clean-ups using valid hlfir.end_associate.
The operand must be a Fortran entity, when allocatable component
deallocation may be required.
- [flang][hlfir] Properly clean-up temporary buffers in bufferization pass.
This commit combines changes for proper finalization and component
deallocation of the temporary buffers. The finalization part
relates to hlfir.destroy operations with 'finalize' attribute.
The component deallocation might be invoked for both hlfir.destroy
and hlfir.end_associate, if the operand is of a derived type
with allocatable component(s).
The changes are mostly in one function, so I decided not to split them.
- [flang][hlfir] Disable optimizations for hlfir.elemental requiring finalization.
If hlfir.elemental is coupled with hlfir.destroy with 'finalize' attribute,
the temporary array result of hlfir.elemental needs to be created
for the purpose of finalization. We cannot do certain optimizations
on such hlfir.elemental operations.
I was not able to come up with a test for the OptimizedBufferization pass,
but I put the check there as well.
This reverts commit 775754e328.
Relanding with removing part of the LIT test. There seems to be operations
ordering indeterminism that is unrelated to my change. I will address this
issue separately.
This patch places the finalization code for the RHS of a user-defined
assignment after the assignment code. The change only affects
standalone RegionAssignOp operations.
This is a copy of the corresponding ArrayValueCopy analysis
for non-overlapping array slices. It is required to achieve
the same performance for Polyhedron/nf, though, additional
changes are needed in the alias analysis for disambiguating
host associated accesses.
F18 10.2.1.3 p. 3 states:
If the variable is an unallocated allocatable array, expr shall have the same rank.
So if LHS is an array and RHS is a scalar, then LHS must be allocated and
the assignment is performed according to F18 10.2.1.3 p. 5:
If expr is a scalar and the variable is an array,
the expr is treated as if it were an array of the same shape as the
variable with every element of the array equal to the scalar value of expr.
This resolves performance regression in CPU2006/437.leslie3d caused
by extra Assign runtime calls for ALLOCATABLE local arrays.
Note that the extra calls do not add overhead themselves.
The problem is that the descriptor for ALLOCATABLE is passed
to Assign runtime function, and this messes up the points-to
analysis.
Example:
```
ALLOCATABLE DUDX(:),DUDY(:),DUDZ(:)
...
ALLOCATE( QS(IMAX-1),FSK(IMAX-1,0:KMAX,ND),
> QDIFFZ(IMAX-1), RMU(IMAX-1), EKCOEF(IMAX-1),
> DUDX(IMAX-1),DUDY(IMAX-1),DUDZ(IMAX-1),
...
DUDZ=0D0
...
DO I = I1, I2
DUDZ(I) =
> DZI * ABD * ((U(I,J,KBD) - U(I,J,KCD)) +
> 8.0D0 * (U(I,J, KK) - U(I,J,KBD))) * R6I
```
When we are not lowering `DUDZ=0D0` to Assign call, the `base_addr` of
`DUDZ`'s descriptor is a result of `malloc`, and LLVM is able to figure out
that the accesses through this `base_addr` cannot overlap with accesses of,
for exmaple, module (global) variable DZI. This enables CSE and LICM
for the loop, eventually, resulting in clean vectorization.
When `DUDZ`'s descriptor "escapes" to Assign runtime function,
there are no guarantees about where `base_addr` can point to.
I do not think this can be resolved by using any existing LLVM function/argument
attributes. Maybe we will be able to communicate the no-aliasing information
to LLVM using `Full Restrict Support` representation.
For the purpose of enabling HLFIR by default, I am just aligning the IR
with what we have with FIR lowering.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D159391
This case is important for `Polyhedron/channel2`:
```
u(2:M-1,1:N,new) = u(2:M-1,1:N,old) &
+2.d0*dt*f(2:M-1,1:N)*v(2:M-1,1:N,mid) &
-2.d0*dt/(2.d0*dx)*g*dhdx(2:M-1,1:N)
```
The slices of `u` on the left and the right hand sides are completely
disjoint, but `old` and `new` are unknown runtime values. So the slices
may also be identical rather than disjoint. For the purpose of
hlfir.assign expansion we do not care whether they are identical or
disjoint. Such kind of an answer does not fit well into the alias
analysis definition, so I added a very simplified check to handle
this case. This drops icelake execution time from 120 to 70 seconds.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D159323
Expand hlfir.assign with in-memory array RHS and LHS into
a loop nest with element-by-element assignments.
For small arrays this may result in further loop nest unrolling
enabling more value propagation and redundancy elimination.
Note the change in flang/test/HLFIR/opt-bufferization.fir:
the hlfir.assign inside hlfir.elemental gets expanded by the new
pattern.
Depends on D159151
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D159246
Expanding hlfir.assign's with scalar RHS late in MLIR optimization
pipeline allows LLVM to recognize most of them as simple memset loops.
This is especially important for small size LHS arrays, because
the assign loop nest may be completely unrolled enabling more value
propagation.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D159151
This effectively reverts D154715.
The issue appears as the dialect conversion error because we try to
erase an op that has already been erased. See the added LIT test case
with HLFIR that may appear as a result of CSE.
The `adaptor.getSource()` is an operation producing a tuple,
which does not have users, so `allOtherUsesAreSafeForAssociate`
just looks at the empty list of users. So we get completely wrong
answers from it. This causes problems with the following
`eraseAllUsesInDestroys` that tries to remove the `DestroyOp` twice
during both `hflir.associate` processing.
But we also cannot use `associate.getSource()` *efficiently*, because
the original users may still hang around: one example is the original body
of hlfir.elemental (see D154715), another example is other already converted
AssociateOp's that are pending removal in the rewriter
(that is why we have a temporary created for each hlfir.associate
in the newly added LIT case).
This patch just fixes the correctness issue. I think we have to separate
the buffer reuse analysis from the conversion itself.
I also tried to address the issues with the cloned bodies of `hlfir.elemental`,
but this should not matter since D155778: if `hlfir.associate` is inside
`hlfir.elemental`, it will end up inside a do-loop body region, so the early
exit added in D155778 will prevent the buffer reuse.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D158471
This pass is intended to spot cases where we can do better than the
default bufferization and to rewrite those specific cases. Then the
default bufferization (bufferize-hlfir pass) can handle everything else.
The transformation added in this patch rewrites simple element-wise
updates to an array to a do-loop modifying the array in place instead of
creating and assigning an array temporary.
See the RFC at
https://discourse.llvm.org/t/rfc-hlfir-optimized-bufferization-for-elemental-array-updates
This patch gets the improvement to exchange2 but not the improvement to cam4
described in the RFC. I think the cam4 improvement will require better alias
analysis. I aim to follow up to fix this in a later patch. With changes
since the RFC, the pass improves polyhedron channel2 by about 52%.
Depends on: D156805 D157718 D157626
Differential Revision: https://reviews.llvm.org/D157107
This patch makes use of the HLFIR box produced for hlfir.declare
in place of the FIR box (the memref of hlfir.declare) when possible.
This makes the representation a little bit more clear, because
all accesses are made via a single box.
This reduces the life range of the original box, because the new
temporary box produced by embox/rebox is used from now.
Apparently, this works around some issues in the current HLFIR codegen,
for example, look at the LIT tests changes around fir.array_coor
produced by hlfir.designate codegen - using the FIR box for fir.array_coor
might result in using incorrect lbounds.
Apparently, this change enables more intrinsics simplifications
because the SimplifyIntrinsicsPass looks for explicit embox/rebox
in findBoxDef() to decide whether to apply the optimization.
This change also provides better association of the base addresses
referenced by OpenACC clauses with the corresponding boxes
that might be used explicitly in OpenACC regions (e.g. for reading
the lbounds).
Reviewed By: razvanlupusoru, clementval
Differential Revision: https://reviews.llvm.org/D158119
Handle special case of element-per-element assignments generated
for creating a temp for unlimited polymorphic hlfir.expr.
We currently end up generating an invalid fir.store. We should use
the assignment runtime instead.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D157752
The declare enter operations are failing verification after HLFIR to FIR
conversion, because we drop acc.declare attribute when converting
hlfir.declare to fir.declare. This patch just propagates all attributes
during the conversion. This may need to be changed in future, if some
attributes are invalid for fir.declare or they need to be mapped
into other attributes.
The added LIT test is a copy of Lower/OpenACC/acc-declare.f90
with the updated checks. It tests HLFIR after lowering and FIR
after HLFIR-to-FIR conversion.
The Lower/OpenACC/acc-declare.f90 is being actively changed, so
it may be better to put the new RUN commands and the new checks
into the original file. For example, when you add more testing
for OpenACC declare in Lower/OpenACC/acc-declare.f90, you
add checks just for FIR-lowering path. I will be able to add
HLFIR checks later for those pieces. When you change something
for the existing OpenACC declare, you will have to update
the checks for all three pipelines. Alternatively, we can keep
it in a separate file, but this will complicate the migration
a little bit. Please let me know what you would prefer.
Reviewed By: razvanlupusoru, clementval
Differential Revision: https://reviews.llvm.org/D157560
The polymorphic temporary array is created using the provided mold
and the shape of the hlfir.elemental. The array is allocated right away,
because it is going to be initialized element per element.
Depends on D157315
Reviewed By: clementval, tblah
Differential Revision: https://reviews.llvm.org/D157316
To properly create temporary array for a polymorphic result
of hlfir.elemental we need to keep the mold as its operand.
This patch adds just the basic support.
Reviewed By: clementval, tblah
Differential Revision: https://reviews.llvm.org/D157315
The assignments inside where/elsewhere may affect variables participating
in the mask expression, but execution of the assignments must not affect
the established control mask(s) (F'18 10.2.3.2 p. 13).
The following example must print all 42's:
```
program test
integer c(3)
logical :: mask(3) = .true.
where (mask)
c = f()
end where
print *, c
contains
integer function f()
mask = .false.
f = 42
end function f
end program test
```
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D156959
hlfir.count lowering was using incorrect default integer kind
by ignoring the kind specified in the ModuleOp.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D156017
Some operations using the result of hlfir.dot_product can tolerate
that the type of the result changes from !fir.logical to i1 during
intrinsics lowering, but some won't. I added a separate LIT case with
fir.store to mimic one of the nag tests.
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D155914
If end_associate may execute more times than the expr value producer,
then it cannot take ownership of the expr buffer. Otherwise, it may
result in double-free errors.
Note that the LIT test exposes a different issue with fir.alloca
inside the do-loop produced for hlfir.elemental. This may cause
out-of-stack conditions in valid Fortran programs that are not expected
to run out of stack. I will create an issue for this.
Reviewed By: tblah
Differential Revision: https://reviews.llvm.org/D155778