This patch addresses post commit review comments from #124859.
The extra compile definition is not necessary and goes against the
effort to separate the runtimes from the flang compiler itself. The
function declaration for `CUFInit` can be accessed anyway since the
headers are always present. The insertion of the call is only based on
the language feature options from the folding context.
A program compiled with CUDA enabled but without the CUF runtime will
simply fail at link time, as expected.
For now, parse METADIRECTIVE as a standalone executable directive.
This will allow testing the parser code.
There is no lowering yet, not even clause conversion. There is also no
verification of the allowed values for trait sets and trait properties.
An hlfir.elemental with a shape of `(0, HUGE)` still runs `HUGE`
iterations when expanded into a loop nest. As a result, HLFIR
transformational operations inlined as hlfir.elemental may execute
slower than the Fortran runtime implementation.
This patch adds an option for the BufferizeHLFIR pass to reset all
upper bounds in the elemental loop nests to zero if the result
is an empty array.
A separate patch will enable this option in the driver after I do
more performance testing. The option is off by default now.
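For reference, a minimal Fortran sketch of the kind of case this targets (the shape is illustrative, not taken from a real benchmark):
```
! The result is an empty array, yet a naive elemental loop nest over
! shape (0, n) still iterates n times in the outer dimension.
subroutine s(n, y)
  integer, intent(in) :: n        ! imagine n as large as HUGE(1)
  real, allocatable :: y(:, :)
  allocate (y(0, n))
  y = y + 1.0                     ! elemental expression with shape (0, n)
end subroutine
```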
A trait property can be one of several alternatives (name, expression,
etc.), and each property in a list was parsed as if it could be any of
these alternatives independently of the other properties. This made the
parsing vulnerable to certain ambiguities in the trait grammar (as given
in the OpenMP spec).
At the same time the OpenMP spec gives the expected types of properties
for almost every trait: all properties listed for a given trait are
usually of the same type, e.g. names, clauses, etc.
Incorporate these restrictions into the parser, and additionally use
property extensions as the fallback if the parsing of the expected
property type failed. This is intended to allow the parser to succeed,
and instead let the semantic-checking code emit a more user-friendly
message.
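For illustration, a hedged sketch of the trait property types involved (the selectors and directive variants here are made up for the example, not taken from the patch's tests):
```
! arch/kind properties are names, condition takes an expression; each
! trait's property list is parsed with its expected property type.
subroutine s(x, use_gpu)
  real :: x(:)
  logical :: use_gpu
  integer :: i
  !$omp metadirective &
  !$omp&  when(device={arch(nvptx), kind(gpu)}: target teams distribute parallel do) &
  !$omp&  when(user={condition(use_gpu)}: parallel do) &
  !$omp&  default(simd)
  do i = 1, size(x)
    x(i) = x(i) + 1.0
  end do
end subroutine
```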
Add the implementation of the IERRNO intrinsic to get the last system
error number, as given by the C errno variable.
This intrinsic is also used in RAMSES
(https://github.com/ramses-organisation/ramses/).
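A hedged usage sketch (the failing OPEN is just one way the underlying errno may get set; the exact value depends on the C library):
```
! IERRNO() returns the current value of the C errno variable.
program show_errno
  implicit none
  integer :: ios, last_err
  open (unit=10, file='no-such-file.dat', status='old', iostat=ios)
  if (ios /= 0) then
    last_err = ierrno()
    print *, 'last system error number:', last_err
  end if
end program
```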
This patch implements support for the UNROLL directive to control how
many times a loop should be unrolled.
It must be placed immediately before a `DO` loop and applies only to the
loop that follows. N is an integer specifying the unrolling factor.
This is done by adding an attribute to the branch into the loop in LLVM
to indicate that the loop should be unrolled.
The code pushed to support the `VECTOR ALWAYS` directive has been
modified to take into account that several directives can appear
before a `DO` loop.
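A sketch of the intended usage (assuming the `!dir$ unroll N` spelling; the factor here is 4):
```
! The directive is placed immediately before the DO loop it applies to.
subroutine saxpy(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  !dir$ unroll 4
  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do
end subroutine
```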
This patch adds a call to the CUFInit function just after `ProgramStart`
when CUDA Fortran is enabled, to initialize the CUDA context. This allows
us to set up context information such as the stack limit, which can be
defined via the environment variable `ACC_OFFLOAD_STACKSIZE=<value>`.
Lower Fortran RESHAPE intrinsic into hlfir.reshape, and then lower
hlfir.reshape into a runtime call.
A later patch will add hlfir.reshape inlining as hlfir.elemental.
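For reference, a simple example that should now go through hlfir.reshape (and from there to the runtime call):
```
! RESHAPE of a rank-1 array into a 3x4 matrix.
subroutine pack_matrix(a, b)
  real, intent(in)  :: a(12)
  real, intent(out) :: b(3, 4)
  b = reshape(a, shape(b))
end subroutine
```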
When constructing the characteristics of a particular reference to an
intrinsic procedure that was passed a non-coindexed reference to local
coarray data as an actual argument, don't add the corank of the actual
argument to those characteristics.
Also clean up the TypeAndShape characteristics class a little; the
Attr::Coarray is redundant since the corank() accessor can be used to
the same effect.
GetShape() needed to be called with a FoldingContext in order to
properly construct an extent expression for the shape of an array
constructor whose elements (nested in an implied DO loop) were not
scalars.
Fixes https://github.com/llvm/llvm-project/issues/124191.
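A sketch of the triggering pattern (illustrative, not the exact reproducer from the issue):
```
! Each implied-DO element is a rank-1 section, not a scalar, so the
! constructor's total extent (n*n) must be computed as an expression.
subroutine flatten(a, r, n)
  integer, intent(in) :: n
  real, intent(in) :: a(n, n)
  real, intent(out) :: r(n*n)
  integer :: j
  r = [(a(:, j), j = 1, n)]
end subroutine
```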
The event variable in an EVENT POST/WAIT statement can be a coarray
reference, and need not be an entire coarray.
Variables and potential subobject components with EVENT_TYPE/LOCK_TYPE
must be coarrays, unless they are potential subobjects nested within
coarrays or pointers.
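A hedged sketch of what is now accepted (image numbers are illustrative):
```
! The event variable may be an element of a coarray of EVENT_TYPE rather
! than an entire coarray; in EVENT POST it may also be coindexed.
program events_demo
  use iso_fortran_env, only: event_type
  type(event_type) :: done(2)[*]
  if (this_image() == 2) then
    event post (done(1)[1])   ! coindexed element on image 1
  else if (this_image() == 1) then
    event wait (done(1))      ! non-coindexed element of the coarray
  end if
end program
```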
This allows the Flang parser to accept the !$OMP DISPATCH construct and
its related clauses.
Lowering is currently not implemented. Tests for unparse and parse-tree
dump are provided, along with one checking that lowering ends in a "not
yet implemented" error.
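A hedged sketch of the construct now accepted by the parser (the callee and clause are illustrative; lowering still reaches the TODO):
```
! !$omp dispatch wraps the call that may be replaced by a declared variant.
subroutine caller(x, n)
  integer, intent(in) :: n
  real, intent(inout) :: x(n)
  !$omp dispatch nowait
  call work(x, n)
end subroutine
```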
---------
Co-authored-by: Kiran Chandramohan <kiran.chandramohan@arm.com>
The backtrace may at least print the backtrace name in the call stack,
but this does not happen with the release builds of the runtime.
Surprisingly, specifying "no-omit-frame-pointer" did not work
with GCC, so I decided to fall back to -O0 for these functions.
With these changes, CUF atomic operations are handled as cudadevice
intrinsics and are converted straight to the LLVM dialect with the
`llvm.atomicrmw` operation.
I am only submitting changes for `atomicadd` to gather feedback. If we
are to proceed with these changes I will add support for all other
applicable atomic operations following this pattern.
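A hedged sketch at the CUDA Fortran source level (reduced to the atomic operation itself; assumes the usual implicit availability of the cudadevice intrinsics in device code):
```
! Device kernel using atomicadd, which now maps straight to
! llvm.atomicrmw instead of going through a runtime call.
module kernels
contains
  attributes(global) subroutine accumulate(counter)
    real, device :: counter
    real :: old
    old = atomicadd(counter, 1.0)   ! atomically add 1.0, return the old value
  end subroutine
end module
```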
Move the OpenACC/OpenMP helpers from Lower/DirectivesCommon.h that are
also used in OpenACC and OpenMP MLIR passes into a new
Optimizer/Builder/DirectivesCommon.h, so that parser and evaluate headers
are not included in Optimizer libraries (which introduces pointless
compile-time and link-time overhead).
This should fix https://github.com/llvm/llvm-project/issues/123377
Because `c_devptr` has a `c_ptr` field, any assignment was done via the
Assign runtime function. This leads to stack overflows on the device and
excessive memory use. Since we know `c_devptr` can be copied directly
on assignment, make it a special case.
This reverts commit afc43a7b62.
Fixed declaration of hlfir::genExtentsVector().
Some good results for induct2, where dot_product is applied
to a vector of unknown size and a known 3-element vector:
the inlining ends up generating a 3-iteration loop, which
is then fully unrolled. With late FIR simplification
it is not happening even when the simplified intrinsics
implementation is inlined by LLVM (because the loop bounds
are not known).
This change just follows the current approach to expose
the loops for later worksharing application.
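A sketch of the pattern referred to above (the real induct2 code differs):
```
! dot_product of an assumed-shape vector with a known 3-element vector:
! inlining can use the known extent and produce a 3-iteration loop that
! is then fully unrolled.
function proj(a, b) result(r)
  real, intent(in) :: a(:)
  real, intent(in) :: b(3)
  real :: r
  r = dot_product(a, b)
end function
```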
Runtime function calls to void functions produce an SSA value
because the FunctionType result is set to NoneType, which is later
translated to an empty struct. This is not an issue when going to LLVM IR,
but it breaks when lowering a GPU module to PTX. This patch updates the
RTModel to correctly set the FunctionType result type to nothing.
This is one such runtime call before this patch, at the LLVM dialect step:
```
%45 = llvm.call @_FortranAAssign(%arg0, %1, %44, %4) : (!llvm.ptr, !llvm.ptr, !llvm.ptr, i32) -> !llvm.struct<()>
```
After the patch, the call is correctly formed:
```
llvm.call @_FortranAAssign(%arg0, %1, %44, %4) : (!llvm.ptr, !llvm.ptr, !llvm.ptr, i32) -> ()
```
Without the patch, it leads to errors like:
```
ptxas /tmp/mlir-cuda_device_mod-nvptx64-nvidia-cuda-sm_60-e804b6.ptx, line 10; error : Output parameter cannot be an incomplete array.
ptxas /tmp/mlir-cuda_device_mod-nvptx64-nvidia-cuda-sm_60-e804b6.ptx, line 125; error : Call has wrong number of parameters
```
The change is pretty much mechanical.
This fixes a bug when the same variable is used in `firstprivate` and
`lastprivate` clauses on the same construct. The issue boils down to the
fact that `copyHostAssociateVar` was deciding the direction of the copy
assignment (i.e. the `lhs` and `rhs`) based on whether the `copyAssignIP`
parameter is set. This is not the best way to do it since it is not
related to whether we are doing a copy from the host variable to the
localized copy or the other way around. When we set the insertion point
for `firstprivate` in delayed privatization, this resulted in switching
the direction of the copy assignment. Instead, this PR adds a new
parameter to explicitly tell the function the direction of the
assignment.
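A hedged sketch of the problematic pattern (reduced; the original reproducer may differ):
```
! The same variable appears in both firstprivate and lastprivate, so both
! a host-to-private copy and a private-to-host copy are generated.
subroutine s(n, x)
  integer, intent(in) :: n
  real :: x
  integer :: i
  !$omp parallel do firstprivate(x) lastprivate(x)
  do i = 1, n
    x = x + real(i)
  end do
  !$omp end parallel do
end subroutine
```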
This is a follow-up PR to
https://github.com/llvm/llvm-project/pull/122471; only the latest commit
is relevant.
Inlining `hlfir.matmul` as `hlfir.eval_in_mem` does not allow us
to get rid of a temporary array in many cases, but it may still be
much better, allowing us to:
* Get rid of any overhead related to calling the runtime MATMUL
(such as descriptor creation).
* Use CPU-specific vectorization cost model for matmul loops,
which Fortran runtime cannot currently do.
* Optimize matmul of known-size arrays by complete unrolling.
One of the drawbacks of `hlfir.eval_in_mem` inlining is that
the ops inside it with store memory effects block the current
MLIR CSE, so I decided to run this inlining late in the pipeline.
There is a source comment explaining the CSE issue in more detail.
Straightforward inlining of `hlfir.matmul` as an `hlfir.elemental`
is not good for performance, and I got performance regressions
with it compared to the Fortran runtime implementation. I put it
under an engineering option for experiments.
At the same time, inlining `hlfir.matmul_transpose` as `hlfir.elemental`
seems to be a good approach, e.g. it allows getting rid of a temporary
array in cases like `A(:)=B(:)+MATMUL(TRANSPOSE(C(:,:)),D(:))`.
This patch improves performance of galgel and tonto a little bit.
Intrinsic module procedures ieee_get_modes, ieee_set_modes,
ieee_get_status, and ieee_set_status store and retrieve opaque data
values whose size varies by machine and OS environment. These data
values are usually, but not always, small. Their sizes are not directly
known in a cross compilation environment. Address this issue by
implementing two mechanisms for processing these data values.
Environments that use typical small data sizes can access storage
defined at compile time. When this is not valid, data storage of any
size can be allocated at runtime.
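For context, a sketch of the user-level pattern these procedures serve (save the opaque value, then restore it later):
```
! The saved status value is opaque; its size is what the two mechanisms
! described above have to accommodate.
program fp_status_demo
  use ieee_exceptions, only: ieee_status_type, ieee_get_status, ieee_set_status
  implicit none
  type(ieee_status_type) :: saved
  call ieee_get_status(saved)   ! capture the current floating-point status
  ! ... computation that may raise exception flags ...
  call ieee_set_status(saved)   ! restore the saved status
end program
```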
Introduce a new op to get the device address from a host symbol. This
simplifies the current conversion and is also in preparation for some
legalization work that needs to be done in cuf kernel and cuf kernel
launch, similar to
https://github.com/llvm/llvm-project/pull/122802
A module can't USE itself, either directly within the top-level module
or from one of its submodules. Add a test for this case (which we
already caught), and improve the diagnostic for the more confusing case
involving a submodule.
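A sketch of the two rejected cases (the error text shown is illustrative):
```
! Case 1: a module USE-ing itself directly (already diagnosed).
module m1
  use m1          ! error: module cannot USE itself
end module

! Case 2: a submodule USE-ing its own ancestor module (clearer message now).
module m2
end module
submodule (m2) s2
  use m2          ! error: a submodule cannot USE its ancestor module
end submodule
```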
Add tests for negative array extents where necessary, motivated by a
compiler crash exposed by yet another fuzzer test, and improve overall
error message quality for RESHAPE().
Fixes https://github.com/llvm/llvm-project/issues/122060.
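A hedged example of the kind of invalid input involved (constructed, not the fuzzer's exact test):
```
! A negative extent in the SHAPE argument should be diagnosed cleanly
! rather than crash constant folding.
program bad_reshape
  implicit none
  print *, reshape([1, 2, 3, 4, 5, 6], [2, -3])
end program
```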
The expression traversal library needs to use interfaces into triplets
(and substrings) that return pointers to nested expressions, rather than
optional copies of them, since at least one semantic analysis collects a
set of references to some subexpression representation class instances,
and those references obviously can't point to local copies of objects.
Fixes https://github.com/llvm/llvm-project/issues/121999.
The newly introduced MappableType interface in `acc` dialect was
primarily intended to allow variables with non-materialized storage to
be used in acc data clauses (previously everything was required to be
`pointer-like`). One motivator for this was `fir.box`, since it can be
passed to functions without a wrapping `fir.ref` and can also be
generated directly via operations like `fir.embox`. And,
unlike other variable representations in FIR, the underlying storage for
it does not get materialized until LLVM codegen.
The new interface is being attached to both `fir.box` and `fir.array`.
Strictly speaking, attaching to the latter is primarily for consistency
since the MappableType interface requires implementation of utilities to
compute byte size - and it made sense that a
`fir.box<fir.array<10xi32>>` and `fir.array<10xi32>` would have a
consistently computable size. This decision may be revisited as
MappableType interface evolves.
The new interface attachments are made in a new library named
`FIROpenACCSupport`. The reason for this is to avoid circular
dependencies since the implementation of this library is reusing code
from lowering of OpenACC. More specifically, the types are defined in
`FIRDialect` and `FortranLower` depends on it. Thus we cannot attach
these interfaces in `FIRDialect`.