This does add a little computational complexity because now every
freemem operation has to be tested against every allocation. This could
be improved with some more memoisation, but I think it is easier to read
this way. Let me know if you would prefer me to change this to
pre-compute the normalised addresses that each freemem operation is using.
Weirdly, this change resulted in a verifier failure for the fir.declare
in the previous test case. Maybe it was previously removed as dead code
and now it isn't. Anyway, I fixed that too.
Currently, fir::isDummyArgument is being used to check whether a DeclareOp
represents a dummy argument. The argument passed to the function is
declOp.getMemref(). This bypasses the code in isDummyArgument that
checks for dummy_scope, because the `Value` returned by getMemref()
may not have a DeclareOp as its defining op.
This bypassing means that sometimes a variable will be marked as an
argument when it should not be. This happened in a case where the same
argument was used for two different result variables through the use of
`entry` in the function.
The solution is to check directly whether the declOp has a dummy_scope. If
it does, we know this is a dummy argument. We can then check whether the
memref points to a BlockArgument and use its argument number. This will
still miss arguments whose memref does not directly point to a
BlockArgument, but those are missed currently too. Note that we can still
evaluate those variables in the debugger; it is just that they are not
marked as arguments.
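A minimal sketch of that check, assuming the fir::DeclareOp accessors are named as below (the exact API may differ):
```cpp
#include "flang/Optimizer/Dialect/FIROps.h"
#include "mlir/IR/Value.h"

// Sketch only: getDummyScope/getMemref are assumed accessor names.
// Returns the 1-based argument number, or 0 when the variable should
// not be marked as an argument.
static unsigned getDummyArgNumber(fir::DeclareOp declOp) {
  // A declare op carrying a dummy_scope describes a dummy argument.
  if (!declOp.getDummyScope())
    return 0;
  // Use the block argument number when the memref points directly at one.
  if (auto blockArg = mlir::dyn_cast<mlir::BlockArgument>(declOp.getMemref()))
    return blockArg.getArgNumber() + 1;
  // Otherwise leave the variable unmarked; it can still be evaluated in
  // the debugger, just without being shown as an argument.
  return 0;
}
```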
Fixes #116525.
Introduce cuf.sync_descriptor, to be used to sync the device global
descriptor after pointer association.
Also move CUFCommon so it can be used in the FIRBuilder lib as well.
We were passing the size in bytes for the sizeInBits field in
DIDerivedTypeAttr with DW_TAG_pointer_type. Although this field is
unused in this case, it is better to be accurate.
Loops resulting from array expressions like array(:,i)
may be versioned for the unit stride of the innermost dimension
when the initial array is an assumed-shape array (such arrays are
contiguous in many Fortran programs).
This speeds up facerec by about 12% due to further vectorization
of the innermost loop produced for the total SUM reduction.
Note that PointerUnion::{is,get} have been soft-deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
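A minimal sketch of the mechanical part of the migration (the union element types here are arbitrary):
```cpp
#include "llvm/ADT/PointerUnion.h"
#include "llvm/Support/Casting.h"

void demo(llvm::PointerUnion<int *, double *> pu) {
  // Deprecated member spellings:
  //   if (pu.is<int *>())
  //     int *i = pu.get<int *>();
  // Free-function replacements:
  if (llvm::isa<int *>(pu)) {
    int *i = llvm::cast<int *>(pu);
    (void)i;
  }
  // dyn_cast is the tricky one: llvm::dyn_cast<int *>(pu) expects a
  // non-null union, while llvm::dyn_cast_if_present<int *>(pu) also
  // accepts a null one.
}
```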
The greedy rewriter is used in many different flows and it has a lot of
conveniences (worklist management, debugging actions, tracing, etc.). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.
These are independent forms of greediness, and combining them leads to
inefficiency, e.g., cases where one needs to create different phases in
lowering and is required to apply patterns in a specific order, split
across different passes. Using the driver, one ends up needlessly
retrying folding or running multiple rounds of folding attempts where one
final run would have sufficed.
Of course folks can avoid this behavior locally by building their own
driver, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated name should just be a find and replace (e.g., of the
`find ./ -type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety),
as the API arguments haven't changed between the two.
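For illustration, a call site before and after the rename might look like this (a sketch; the surrounding pass code is assumed):
```cpp
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

void runGreedily(mlir::Operation *op, mlir::RewritePatternSet &&patterns) {
  // Deprecated spelling:
  //   (void)mlir::applyPatternsAndFoldGreedily(op, std::move(patterns));
  // New spelling, same arguments:
  if (mlir::failed(mlir::applyPatternsGreedily(op, std::move(patterns))))
    op->emitError("greedy pattern application did not converge");
}
```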
In order to get the pointer to a structure member, `getelementptr`
typically requires two indices: one to indicate the structure itself,
and another to specify the member's position. We are missing the former
in `GPULaunchKernelConversion`, so the generated code may cause stack
corruption. This PR corrects the indices of a structure used as a kernel
launch temp.
For the call to _FortranACUFLaunchKernel, we store the pointer to a
member of a temporary structure in a parameter array. However, when we
obtain an element pointer from the parameter array, its address is
calculated based on the type of the structure. This PR properly treats
the parameter array as an array of pointers.
Example:
```mlir
%30 = llvm.load %29 : !llvm.ptr -> i32
%31 = llvm.mlir.constant(1 : i32) : i32
%32 = llvm.alloca %31 x !llvm.struct<(i64, i64, i32, ptr)> : (i32) -> !llvm.ptr
%33 = llvm.mlir.constant(4 : i32) : i32
%34 = llvm.alloca %33 x !llvm.ptr : (i32) -> !llvm.ptr
%35 = llvm.mlir.constant(0 : i32) : i32
%36 = llvm.getelementptr %32[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>
llvm.store %8, %36 : i64, !llvm.ptr
%37 = llvm.getelementptr %34[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>
llvm.store %36, %37 : !llvm.ptr, !llvm.ptr
...
llvm.call @_FortranACUFLaunchKernel(%47, %8, %8, %8, %2, %8, %8, %7, %34, %48) : (!llvm.ptr, i64, i64, i64, i64, i64, i64, i32, !llvm.ptr, !llvm.ptr) -> ()
```
In this example, `%37 = llvm.getelementptr %34[%35] : (!llvm.ptr, i32)
-> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>` becomes `%37 =
llvm.getelementptr %34[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.ptr`.
Use `pm.nest` to schedule the pass on the nested `func.func` and
`gpu.func` operations in the `gpu.module`.
The AbstractResult pass is not meant to run on a whole gpu.module at once.
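A sketch of the scheduling, assuming a `createAbstractResultPass` factory function (the actual constructor name in flang may differ):
```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Pass/PassManager.h"

void scheduleAbstractResult(mlir::PassManager &pm) {
  // fir::createAbstractResultPass is an assumed factory name.
  // Run on the func.func operations at the module level, as before.
  pm.addNestedPass<mlir::func::FuncOp>(fir::createAbstractResultPass());
  // Nest into gpu.module, then schedule the pass on the func.func and
  // gpu.func operations it contains, not on the gpu.module itself.
  mlir::OpPassManager &gpuPm = pm.nest<mlir::gpu::GPUModuleOp>();
  gpuPm.addNestedPass<mlir::func::FuncOp>(fir::createAbstractResultPass());
  gpuPm.addNestedPass<mlir::gpu::GPUFuncOp>(fir::createAbstractResultPass());
}
```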
1. In `CufOpConversion`, `isDeviceGlobal` was renamed
`isRegisteredGlobal` and moved to the common file. `isRegisteredGlobal`
excludes constant `fir.global` operations from registration. This is to
avoid calls to `_FortranACUFGetDeviceAddress` on globals which do not
have any symbols in the runtime. This was done for
`_FortranACUFRegisterVariable` in #118582, but it also needs to be done
here after #118591.
2. `CufDeviceGlobal` no longer adds the `#cuf.cuda<constant>` attribute
to the constant global. As discussed in #118582, a module variable with
the `#cuf.cuda<constant>` attribute is not a compile-time constant. Yet
the compile-time constant also needs to be copied into the GPU module.
The candidates for copying to the GPU module are:
- the globals needing registration, regardless of their uses in device
code (they can be referred to in host code as well)
- the compile-time constants when used in device code
3. The registration of "constant" module device variables
(`#cuf.cuda<constant>`) can be restored in `CufAddConstructor`.
Add a pattern that updates the fir.declare memref when it comes from a
device global and is not a descriptor. In that case, we recover the
device address that needs to be used in ops like `fir.array_coor` and
so on.
Global constants have no symbols in library files. They are replaced
with literal constants during lowering, before kernels are moved into a
GPU module. Do not register them, because they would result in
unresolved symbols.
In CUDA Fortran, device functions are converted to `gpu.func` inside the
`gpu.module` operation. Update the AbstractResult pass so it can run
on `func.func` and `gpu.func` operations inside the `gpu.module`.
Materialize the box when the src comes from an embox or rebox operation.
This was done in the case of a transfer to a descriptor, but not when
transferring from a descriptor.
fir.call side effects are hard to describe in a useful way using
`MemoryEffectOpInterface` because it is impossible to list which memory
locations a user procedure reads/writes without doing a data flow
analysis of its body (even PURE procedures may read from any module
variable; Fortran SIMPLE procedures from F2023 will address that, but
they are far from common at this point).
Fortran language specifications allow the compiler to deduce, in many
cases, that a procedure call cannot access a variable.
This patch leverages this to extend `fir::AliasAnalysis::getModRef` to
deal with fir.call.
This will allow implementing the "array = array_function()" optimization
in a future patch.
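A hedged usage sketch of the extended analysis (the exact getModRef signature and the surrounding names are assumptions):
```cpp
#include "flang/Optimizer/Analysis/AliasAnalysis.h"
#include "mlir/Analysis/AliasAnalysis.h"

// Sketch: query whether a fir.call may read or write a given variable.
void queryCall(fir::AliasAnalysis &aliasAnalysis, mlir::Operation *firCall,
               mlir::Value var) {
  mlir::ModRefResult result = aliasAnalysis.getModRef(firCall, var);
  if (result.isNoModRef()) {
    // Fortran aliasing rules prove the call can neither read nor write
    // `var` (e.g. a local that is not a target and is not passed to the
    // callee), which is what "array = array_function()" would rely on.
  }
}
```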
- Update the runtime entry points to accept stream information.
- Update the conversion of `cuf.allocate` to correctly pass the stream
information when present.
Note that the stream is not currently used in the runtime. This will be
done in a separate patch, as a design/solution needs to be worked out
together with the allocators.
When the src of the data transfer is a constant, it needs to be
materialized in memory in order to perform the data transfer.
```fortran
subroutine sub1()
  real, device :: a(10)
  integer :: i
  do i = 5, 10
    a(i) = -4.0
  end do
end
```
Update the implicit global detection by looking for such globals in the
CUF kernel, and switch to a walk so that `fir.address_of` operations in
nested statements are also accounted for.
This PR adds handling of `ClassType`. It is treated as a pointer to
the underlying type. Note that a `ClassType` passed to a function
has double indirection, so it is represented as a pointer to the type
(compared to other types, which may have a single indirection).
If `ClassType` wraps a pointer or allocatable, then we take care to
generate it as PTR -> type (and not PTR -> PTR -> type).
This is how it looks in the debugger:
```fortran
subroutine test_proc(this)
  class(test_type), intent(inout) :: this
  allocate (this%b(3, 2))
  call fill_array_2d(this%b)
  print *, this%a
end
```
```
(gdb) p this
$6 = (PTR TO -> ( Type test_type )) 0x2052a0
(gdb) p this%a
$7 = 0
(gdb) p this%b
$8 = ((1, 2, 3) (4, 5, 6))
```
The runtime Assign function is not meant to initialize an array from a
scalar. For that we need to use DoAssignFromSource. Update the data
transfer from scalar to descriptor to use a new entry point that uses
this function underneath.
When an array is declared with a non-default lower bound, the declare
op's `getShape` will return a `ShapeShiftOp`. This result is used in the
data transfer operation to compute the number of bytes to transfer.
Update the op to support `ShapeShiftOp`.
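A sketch of the extent gathering under that update (accessor names are assumptions):
```cpp
#include "flang/Optimizer/Dialect/FIROps.h"
#include "llvm/ADT/SmallVector.h"

// Sketch only: collect the extents whether the shape value comes from a
// fir.shape or a fir.shape_shift; the lower bounds carried by the latter
// do not change the number of bytes to transfer.
static llvm::SmallVector<mlir::Value> gatherExtents(mlir::Value shape) {
  if (auto shapeOp = shape.getDefiningOp<fir::ShapeOp>())
    return llvm::SmallVector<mlir::Value>(shapeOp.getExtents().begin(),
                                          shapeOp.getExtents().end());
  if (auto shiftOp = shape.getDefiningOp<fir::ShapeShiftOp>()) {
    // fir.shape_shift interleaves (lower bound, extent) operand pairs;
    // keep only the extents.
    llvm::SmallVector<mlir::Value> extents;
    auto pairs = shiftOp.getPairs();
    for (unsigned i = 1, e = pairs.size(); i < e; i += 2)
      extents.push_back(pairs[i]);
    return extents;
  }
  return {};
}
// bytes to transfer = element size * product of all extents
```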