Commit Graph

460 Commits

Author SHA1 Message Date
Michael Kruse
b815a3942a [Flang] Move non-common headers to FortranSupport (#124416)
Move non-common files from FortranCommon to FortranSupport (analogous to
LLVMSupport) such that

* declarations and definitions that are only used by the Flang compiler,
but not by the runtime, are moved to FortranSupport

* declarations and definitions that are used by both ("common"), the
compiler and the runtime, remain in FortranCommon

* generic STL-like/ADT/utility classes and algorithms remain in
FortranCommon

This allows a for cleaner separation between compiler and runtime
components, which are compiled differently. For instance, runtime
sources must not use STL's `<optional>` which causes problems with CUDA
support. Instead, the surrogate header `flang/Common/optional.h` must be
used. This PR fixes this for `fast-int-sel.h`.

Declarations in include/Runtime are also used by both, but are
header-only. `ISO_Fortran_binding_wrapper.h`, a header used by compiler
and runtime, is also moved into FortranCommon.
2025-02-06 15:29:10 +01:00
Abid Qadeer
7ece824b6f [flang][debug] Improve check for global variable detection. (#118326)
When a global variable is used in the OpenMP target region, it is passed
as an argument to the function that implements target region. But the
`DeclareOp` for this incarnation still have the original name of the
variable. As some of our checks to decide if a variable is global or nor
are based on the name, this can result in a local variable being treated
as global. This PR hardens the check a bit. We now also check that
memory ref is actually an `AddrOfOp` before looking at the name.
2025-02-04 13:17:14 +00:00
jeanPerier
327d627066 [mlir] share argument attributes interface between calls and callables (#123176)
This patch shares core interface methods dealing with argument and
result attributes from CallableOpInterface with the CallOpInterface and
makes them mandatory to gives more consistent guarantees about concrete
operations using these interfaces.

This allows adding argument attributes on call like operations, which is
sometimes required to get proper ABI, like with  llvm.call (and llvm.invoke).


The patch adds optional `arg_attrs` and `res_attrs` attributes to operations using
these interfaces that did not have that already.
They can then re-use the common "rich function signature"
printing/parsing helpers if they want (for the LLVM dialect, this is
done in the next patch).

Part of RFC: https://discourse.llvm.org/t/mlir-rfc-adding-argument-and-result-attributes-to-llvm-call/84107
2025-02-03 11:27:14 +01:00
Valentin Clement (バレンタイン クレメン)
f1b075df2e [flang][cuda] Pass the pinned variable in allocate calls (#125310) 2025-02-02 18:05:59 -08:00
agozillon
4186805060 [Flang][MLIR] Extend DataLayout utilities to have basic GPU Module support (#123149)
As there is now certain areas where we now have the possibility of
having either a ModuleOp or GPUModuleOp and both of these modules can
have DataLayout's and we may require utilising the DataLayout utilities
in these areas I've taken the liberty of trying to extend them for use
with both.

Those with more knowledge of how they wish the GPUModuleOp's to interact
with their parent ModuleOp's DataLayout may have further alterations
they wish to make in the future, but for the moment, it'll simply
utilise the basic data layout construction which I believe combines
parent and child datalayouts from the ModuleOp and GPUModuleOp. If there
is no GPUModuleOp DataLayout it should default to the parent ModuleOp.

It's worth noting there is some weirdness if you have two module
operations defining builtin dialect DataLayout Entries, it appears the
combinatorial functionality for DataLayouts doesn't support the merging
of these.

This behaviour is useful for areas like:
https://github.com/llvm/llvm-project/pull/119585/files#diff-19fc4bcb38829d085e25d601d344bbd85bf7ef749ca359e348f4a7c750eae89dR1412
where we have a crossroads between the two different module operations.
2025-01-30 17:31:50 +01:00
Valentin Clement (バレンタイン クレメン)
382d3599c2 [flang][cuda] Propagate the data attribute on the converted calls (#124877) 2025-01-29 08:04:22 -08:00
Abid Qadeer
afa4681ce4 [flang][debug] Add support for common blocks. (#112398)
This PR adds debug support for common block in flang. As variable which
are part of a common block don't have a special marker to recognize
them, we use the following check to find them.

%0 = fir.address_of(@a)
%1 = fir.convert %0
%2 = fir.coordinate_of %1, %c0
%3 = fir.convert %2
%4 = fircg.ext_declare %3

If the memref of a fircg.ext_declare points to a fir.coordinate_of and
that in turn points to an fir.address_of (ignoring immediate
fir.convert) then we assume that it is a common block variable. The
fir.address_of gives us the global symbol which is the storage for
common block and fir.coordinate_of provides the offset in this storage.

The debug hierarchy looks like as

subroutine f3
  integer :: x, y
  common /a/ x, y
end subroutine

@a_ = global { ... } { ... }, !dbg !26, !dbg !28

!23 = !DISubprogram(name: "f3"...)
!24 = !DICommonBlock(scope: !23, name: "a", ...)
!25 = !DIGlobalVariable(name: "x", scope: !24 ...)
!26 = !DIGlobalVariableExpression(var: !25, expr: !DIExpression())
!27 = !DIGlobalVariable(name: "y", scope: !24 ...)
!28 = !DIGlobalVariableExpression(var: !27, expr:
!DIExpression(DW_OP_plus_uconst, 4))

This required following changes:

1. Instead of using DIGlobalVariableAttr in the FusedLoc of GlobalOp, we
use DIGlobalVariableExpressionAttr. This allows us the generate the
DIExpression where we have the information.

2. Previously, only one DIGlobalVariableExpressionAttr could be linked
to one global op. I recently removed this restriction in mlir. To make
use of it, we add an ArrayAttr to the FusedLoc of a GlobalOp. This
allows us to pass multiple DIGlobalVariableExpressionAttr.

3. I was depending on the name of global for the name of the common
block. The name gets a '_' appended. I could not find a utility function
in flang to remove it so I have to brute force it.
2025-01-28 12:54:15 +00:00
Abid Qadeer
2e5a5237da [flang][debug] Avoid redundant debug data generation for derived types. (#124473)
Since https://github.com/llvm/llvm-project/pull/122770, we have seen
that compile time have become extremely slow for cyclic derived types.
In #122770, we made the criteria to cache a derived type very strict. As
a result, some types which are safe to cache were also being
re-generated every type they were required. This increased the compile
time and also the size of the debug info.

Please see the description of PR# 122770. We decided that when
processing `t1`, the type generated for `t2` and `t3` were not safe to
cached. But our algorithm also denied caching to `t1` which as top level
type was safe.

```
type t1
  type(t2), pointer :: p1
end type
type t2
  type(t3), pointer :: p2
end type
type t3
  type(t1), pointer :: p3
end type
```

I have tinkered the check a bit so that top level type is always cached.
To detect a top level type, we use a depth counter that get incremented
before call to `convertRecordType` and decremented after it returns.

After this change, the following
[file](https://github.com/fujitsu/compiler-test-suite/blob/main/Fortran/0394/0394_0031.f90)
from Fujitsu get compiled around 40s which is same as it was before
#122770.


The smaller testcase present in issue #124049 takes less than half a
second. I also added check to make sure that duplicate entries of the
`DICompositeType` are not present in the IR.

Fixes #124049 and #123960.
2025-01-27 19:38:24 +00:00
Valentin Clement (バレンタイン クレメン)
48657bf29b [flang][cuda] Handle launch of cooperative kernel (#124362)
Add `CUFLaunchCooperativeKernel` entry points and lower gpu.launch_func
with grid_global attribute to this entry point.
2025-01-24 15:52:05 -08:00
Valentin Clement (バレンタイン クレメン)
ee054404df [flang][cuda] Carry over the cuf.proc_attr attribute to gpu.launch_func (#124325) 2025-01-24 13:09:58 -08:00
Valentin Clement (バレンタイン クレメン)
544a3cb65b [flang][cuda] Handle variable with initialization in device global pass (#124307) 2025-01-24 10:09:38 -08:00
Valentin Clement (バレンタイン クレメン)
67a8857989 [flang][cuda] Handle pointer allocation with double descriptors (#124183) 2025-01-23 15:54:22 -08:00
Valentin Clement (バレンタイン クレメン)
8c138bee6e [flang][cuda] Handle pointer allocation with source (#124070) 2025-01-23 09:24:06 -08:00
Valentin Clement (バレンタイン クレメン)
6e498bc2cd [flang][cuda] Handle simple device pointer allocation (#123996) 2025-01-22 15:59:32 -08:00
Abid Qadeer
0ec153b9fd [flang][debug] Remove an unused function to fix build. (#123602) 2025-01-20 12:18:27 +00:00
Abid Qadeer
af91372b75 [flang][debug] Improve handling of cyclic derived types. (#122770)
When `RecordType` is converted to corresponding `DIType`, we cache the
information to avoid doing the conversion again.

Our conversion of `RecordType` looks like this:

`ConvertRecordType(RecordType Ty)`
1. If type `Ty` is already in the cache, then return the corresponding
item.
2. Create a place holder `DICompositeTypeAttr` (called `ty_self` below)
for `Ty`
3. Put `Ty->ty_self` in the cache
4. Convert members of `Ty`. This may cause `ConvertRecordType` to be
called again with other types.
5. Create final `DICompositeTypeAttr`
6. Replace the `ty_self` in the cache with one created in step 5 end


The purpose of creating `ty_self` is to handle cases where a member may
have reference to parent type.

Now consider the code below:

```
type t1
  type(t2), pointer :: p1
end type
type t2
   type(t1), pointer :: p2
end type
```

While processing t1, we could have a structure like below. `t1 -> t2 ->
t1_self`

The `t2` created during handling of `t1` cant be cached on its own as it
contains a place holder reference. It will fail an assert in MLIR if it
is processed standalone. To avoid this problem, we have a check in the
step 6 above to not cache such types. But this check was not tight
enough. It just checked if a type should not have a place holder
reference to another type. It missed the following case where the place
holder reference can be in a type further down the line.

```
type t1
  type(t2), pointer :: p1
end type
type t2
  type(t3), pointer :: p2
end type
type t3
  type(t1), pointer :: p3
end type
```

So while processing `t1`, we have to stop caching of not only `t3` but
also of `t2`. This PR improves the check and moves the logic inside
`convertRecordType`.

Please note that this limitation of why a type cant have a placeholder
reference is because of how such references are resolved in the mlir.
Please see the discussion at the end of this
[PR](https://github.com/llvm/llvm-project/pull/106571).

I have to change `getDerivedType` so that it will also get the derived
type for things like `type(t2), pointer :: p1` which are wrapped in
`BoxType`. Happy to move it to a new function or a local helper in case
this change is problematic.

Fixes #122024.
2025-01-20 12:03:59 +00:00
Michał Górny
6a2cc12229 [flang] Support linking to MLIR dylib (#120966)
Introduce a new `MLIR_LIBS` argument to `add_flang_library`, that uses
`mlir_target_link_libraries` to link the MLIR dylib alterantively to the
component libraries. Use it, along with a few inline
`mlir_target_link_libraries` in tools, to support linking Flang to MLIR
dylib rather than the static libraries.

With these changes, the vast majority of Flang can be linked
dynamically. The only parts still using static libraries are these
requiring MLIR test libraries, that are not included in the dylib.
2025-01-16 13:35:26 +00:00
Valentin Clement (バレンタイン クレメン)
ce30ee53a8 [flang][cuda] Add gpu.launch to device context (#123105)
`gpu.launch` should also be considered device context.
2025-01-15 14:04:38 -08:00
Valentin Clement (バレンタイン クレメン)
a19919f4cd [flang][cuda] Add cuf.device_address operation (#122975)
Introduce a new op to get the device address from a host symbol. This
simplify the current conversion and this is also in preparation for some
legalization work that need to be done in cuf kernel and cuf kernel
launch similar to
https://github.com/llvm/llvm-project/pull/122802
2025-01-14 17:38:04 -08:00
Valentin Clement (バレンタイン クレメン)
25f28ddd69 [flang][cuda][NFC] Fix file header 2025-01-14 14:05:57 -08:00
Valentin Clement (バレンタイン クレメン)
ba4dc5a0d6 [flang][cuda] Pass the device address for global descriptor (#122802) 2025-01-13 17:23:12 -08:00
Slava Zakharin
e3cd88a7be [flang] Fixed StackArrays assertion after #121919. (#122550)
`findAllocaLoopInsertionPoint()` hit assertion not being able
to find the `fir.freemem` because of the `fir.convert`.
I think it is better to look for `fir.freemem` same way
with the look-through walk.
2025-01-13 11:56:11 -08:00
Tom Eccles
303249c449 [flang][StackArrays] track pointers through fir.convert (#121919)
This does add a little computational complexity because now every
freemem operation has to be tested for every allocation. This could be
improved with some more memoisation but I think it is easier to read
this way. Let me know if you would prefer me to change this to
pre-compute the normalised addresses each freemem operation is using.

Weirdly, this change resulted in a verifier failure for the fir.declare
in the previous test case. Maybe it was previously removed as dead code
and now it isn't. Anyway I fixed that too.
2025-01-08 10:05:21 +00:00
Abid Qadeer
f7420a9dff [flang][debug] Fix issue with argument numbering. (#120726)
Currently fir::isDummyArgument is being used to check if a DeclareOp
represents a dummy argument. The argument passed to the function is
declOp.getMemref(). This bypasses the code in isDummyArgument that
checks for dummy_scope because the `Value` returned by the getMemref()
may not have DeclareOp as its defining op.

This bypassing mean that sometime a variable will be marked as argument
when it should not. This happened in this case where same arg was being
used for 2 different result variables with use of `entry` in the
function.

The solution is to check directly if the declOp has a dummy_scope. If
yes, we know this is dummy argument. We can now check if the memref
points to the BlockArgument and use its number. This will still miss
arguments where memref does not directly point to a BlockArgument but
that is missed currently too. Note that we can still evaluate those
variable in debugger. It is just that they are not marked as arguments.

Fixes #116525.
2025-01-03 19:41:48 +00:00
Valentin Clement (バレンタイン クレメン)
7531672712 [flang][cuda][NFC] Remove unused variable (#121533)
Failed buildbot after https://github.com/llvm/llvm-project/pull/121524
2025-01-02 17:37:44 -08:00
Valentin Clement (バレンタイン クレメン)
6dcd2b035d [flang][cuda] Convert cuf.sync_descriptor to runtime call (#121524)
Convert the op to a new entry point in the runtime
`CUFSyncGlobalDescriptor`
2025-01-02 17:02:59 -08:00
Valentin Clement (バレンタイン クレメン)
4b17a8b10e [flang][cuda] Add operation to sync global descriptor (#121520)
Introduce cuf.sync_descriptor to be used to sync device global
descriptor after pointer association.

Also move CUFCommon so it can be used in FIRBuilder lib as well.
2025-01-02 17:02:45 -08:00
Abid Qadeer
328ff042e3 [flang][NFC] Replace dyn_cast_or_null with dyn_cast_if_present. (#120785) 2025-01-02 10:30:08 +00:00
Abid Qadeer
460e7d5f30 [flang][debug] Correct pointer size. (#120781)
We were passing size in bytes for the sizeInBits field in
DIDerivedTypeAttr with DW_TAG_pointer_type. Although this field is
un-used in this case but better to be accurate.
2025-01-02 10:26:27 +00:00
Slava Zakharin
711419e302 [flang] Enable loop-versioning for slices. (#120344)
Loops resulting from array expressions like array(:,i)
may be versioned for the unit stride of the innermost dimension,
when the initial array is an assumed-shape array (which are contiguous
in many Fortran programs).
This speeds up facerec for about 12% due to further vectorization
of the innermost loop produced for the total SUM reduction.
2024-12-23 07:53:10 -08:00
Kazu Hirata
392651a7ec [flang] Migrate away from PointerUnion::{is,get} (NFC) (#120880)
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:

  // FIXME: Replace the uses of is(), get() and dyn_cast() with
  //        isa<T>, cast<T> and the llvm::dyn_cast<T>

I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
2024-12-22 13:30:16 -08:00
Valentin Clement (バレンタイン クレメン)
415cfaf339 [flang][cuda][NFC] Fix type in CUFFreeDescriptor (#120799) 2024-12-20 14:43:12 -08:00
Valentin Clement (バレンタイン クレメン)
e650ac1654 [flang][cuda][NFC] Fix typo in CUFAllocDescriptor (#120797)
Missing `r` in the function name.
2024-12-20 13:57:47 -08:00
Jacques Pienaar
09dfc5713d [mlir] Enable decoupling two kinds of greedy behavior. (#104649)
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) how ops are matched, 2)
folding wherever it can.

These are independent forms of greedy and leads to inefficiency. E.g.,
cases where one need to create different phases in lowering and is
required to applying patterns in specific order split across different
passes. Using the driver one ends up needlessly retrying folding/having
multiple rounds of folding attempts, where one final run would have
sufficed.

Of course folks can locally avoid this behavior by just building their
own, but this is also a common requested feature that folks keep on
working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating
from the deprecated should just be a find and replace (e.g., `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments hasn't changed between the two.
2024-12-20 08:15:48 -08:00
Valentin Clement (バレンタイン クレメン)
e93d226664 [flang][cuda] Update CompilerGeneratedNames pass to work on gpu module (#120660)
- Update `CompilerGeneratedNames` so it can perform renaming in
gpu.module
- Update Codegen so it look in the correct module for the type
descriptor.
2024-12-19 19:07:00 -08:00
Valentin Clement (バレンタイン クレメン)
37978c466b [flang][cuda] Remove unused variable 2024-12-12 15:04:16 -08:00
Valentin Clement (バレンタイン クレメン)
ea04148c27 [flang][cuda] Extend implicit global handling to any type descriptor (#119769)
Relax the check to also handle other type descriptor globals.
2024-12-12 14:52:49 -08:00
Valentin Clement (バレンタイン クレメン)
956d0dd624 [flang][cuda] Support builtin global in device global pass (#119626) 2024-12-11 17:09:56 -08:00
khaki3
609899f443 [flang][cuda] Avoid stack corruption when setting kernel launch parameters (#119469)
In order to get the pointer to a structure member, `getelementptr`
typically requires two indices: one to indicate the structure itself,
and another to specify the member's position. We are missing the former
in `GPULaunchKernelConversion`, so generated code may cause stack
corruption. This PR corrects the indices of a structure used as a kernel
launch temp.
2024-12-10 16:08:22 -08:00
Valentin Clement (バレンタイン クレメン)
850c932f05 [flang][cuda] Walk through cuf kernel for implicit globals (#119455)
Globals used in cuf kernel need to be flagged as well.
2024-12-10 14:01:53 -08:00
khaki3
e9866d5d14 [flang][cuda] Fix GPULaunchKernelConversion to generate correct kernel launch parameters (#119431)
For the call to _FortranACUFLaunchKernel, we store the pointer to a
member of a temporary structure in a parameter array. However, when we
obtain an element pointer from the parameter array, its address is
calculated based on the type of the structure. This PR properly treats
the parameter array as an array of pointers.

Example:

```mlir
%30 = llvm.load %29 : !llvm.ptr -> i32
%31 = llvm.mlir.constant(1 : i32) : i32
%32 = llvm.alloca %31 x !llvm.struct<(i64, i64, i32, ptr)> : (i32) -> !llvm.ptr
%33 = llvm.mlir.constant(4 : i32) : i32
%34 = llvm.alloca %33 x !llvm.ptr : (i32) -> !llvm.ptr
%35 = llvm.mlir.constant(0 : i32) : i32
%36 = llvm.getelementptr %32[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>
llvm.store %8, %36 : i64, !llvm.ptr
%37 = llvm.getelementptr %34[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>
llvm.store %36, %37 : !llvm.ptr, !llvm.ptr
...
llvm.call @_FortranACUFLaunchKernel(%47, %8, %8, %8, %2, %8, %8, %7, %34, %48) : (!llvm.ptr, i64, i64, i64, i64, i64, i64, i32, !llvm.ptr, !llvm.ptr) -> () 
```
In this example, `%37 = llvm.getelementptr %34[%35] : (!llvm.ptr, i32)
-> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>` will be `%37 =
llvm.getelementptr %34[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.ptr`.
2024-12-10 11:32:32 -08:00
Yusuke MINATO
a88677edc0 Reland "[flang] Integrate the option -flang-experimental-integer-overflow into -fno-wrapv" (#118933)
This relands #110063.
The performance issue on 503.bwaves_r is found not to be related to the
patch, and is resolved by fbd89bcc when LTO is enabled.
2024-12-10 16:26:53 +09:00
Valentin Clement (バレンタイン クレメン)
a1d71c3693 [flang][cuda] Additional update to ExternalNameConversion (#119276) 2024-12-09 17:39:51 -08:00
Valentin Clement (バレンタイン クレメン)
75623bfe1b [flang][cuda] Handle gpu.return in AbstractResult pass (#119035) 2024-12-09 17:39:16 -08:00
Valentin Clement (バレンタイン クレメン)
1d4b5c161f [flang][cuda] Change how abstract result pass is scheduled on func.func and gpu.func (#119034)
Use `pm.nest` to schedule the pass on nested `func.func` and `gpu.func`
in the `gpu.module`.

AbstractResult pass is not meant to run on the whole gpu.module at once.
2024-12-09 13:31:27 -08:00
Renaud Kauffmann
27e458c8cb [flang][cuda] Distinguish constant fir.global from globals with a #cuf.cuda<constant> attribute (#118912)
1. In `CufOpConversion` `isDeviceGlobal` was renamed
`isRegisteredGlobal` and moved to the common file. `isRegisteredGlobal`
excludes constant `fir.global` operation from registration. This is to
avoid calls to `_FortranACUFGetDeviceAddress` on globals which do not
have any symbols in the runtime. This was done for
`_FortranACUFRegisterVariable` in #118582, but also needs to be done
here after #118591
2. `CufDeviceGlobal` no longer adds the `#cuf.cuda<constant>` attribute
to the constant global. As discussed in #118582 a module variable with
the #cuf.cuda<constant> attribute is not a compile time constant. Yet,
the compile time constant also needs to be copied into the GPU module.
The candidates for copy to the GPU modules are
- the globals needing regsitrations regardless of their uses in device
code (they can be referred to in host code as well)
       - the compile time constant when used in device code 

3. The registration of "constant" module device variables (
#cuf.cuda<constant>) can be restored in `CufAddConstructor`
2024-12-05 18:36:48 -08:00
Valentin Clement (バレンタイン クレメン)
7efd6139f2 [flang][cuda] Get device address in fir.declare (#118591)
Add pattern that update fir.declare memref when it comes from a device
global and is not a descriptor. In that case, we recover the device
address that needs to be used in ops like `fir.array_coor` and so on.
2024-12-04 13:36:58 -08:00
Renaud Kauffmann
ed2db3be61 [flang][cuda] Do not register global constants (#118582)
Global constants have no symbols in library files. They are replaced
with literal constants during lowering before kernels are moved into a
GPU module. Do not register them because they will result in unresolved
symbols.
2024-12-04 09:37:08 -08:00
Valentin Clement (バレンタイン クレメン)
5522d2462e [flang][cuda] Allow AbstractResult to run in gpu.module (#118529)
in CUDA Fortran, device function are converted to `gpu.func` inside the
`gpu.module` operation. Update the AbstractResult pass to be able to run
on `func.func` and `gpu.func` operations inside the `gpu.module`.
2024-12-03 14:04:49 -08:00
s-watanabe314
f3cf24fcc4 [flang] Apply nocapture attribute to dummy arguments (#116182)
Apply llvm.nocapture attribute to dummy arguments that do not have the
target, asynchronous, volatile, or pointer attributes in a procedure
that is not a bind(c). This was discussed in


https://discourse.llvm.org/t/applying-the-nocapture-attribute-to-reference-passed-arguments-in-fortran-subroutines/81401
2024-11-28 15:39:26 +09:00