Commit Graph

410 Commits

Author SHA1 Message Date
Joseph Huber
42eb54b774 [Clang] Put offloading globals in the .llvm.rodata.offloading section (#111890)
Summary:
For our offloading entries, we currently store all the string names of
kernels that the runtime will need to load from the target executable.
These are available via a pointer in the `__tgt_offload_entry` struct;
however, this makes them difficult to obtain from the object itself. This
patch simply puts the strings in a named section so they can be easily
queried.

The motivation behind this is that when the linker wrapper is doing
linking, it wants to know which kernels the host executable is calling.
We *could* get this already via the `.relaomp_offloading_entries`
section and trawling through the string table, but that's quite annoying
and not portable. The follow-up to this should be to make the linker
wrapper get a list of all used symbols the device link job should count
as "needed" so we can handle static linking more directly.
2024-10-28 07:17:50 -07:00
Alex Voicu
2074de252b [clang][HIP] Don't use the OpenCLKernel CC when targeting AMDGCNSPIRV (#110447)
When compiling HIP source for AMDGCN flavoured SPIR-V that is expected
to be consumed by the ROCm HIP RT, it's not desirable to set the OpenCL
Kernel CC on `__global__` functions. On one hand, this is not an OpenCL
RT, so it doesn't compose with e.g. OCL specific attributes. On the
other it is a "noisy" CC that carries semantics, and breaks overload
resolution when using generic dispatchers such as those used by RAJA
(see src/common/HipDataUtils.hpp, L39).
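A hedged sketch of the dispatcher shape this breaks (all names hypothetical; the real RAJA helper is more involved):

```
// Hedged sketch: if __global__ functions carry the OpenCL kernel CC,
// Kern deduces to a function type carrying that CC, and generic launch
// helpers like this one stop resolving. Names are hypothetical.
template <typename Kern, typename... Args>
void generic_launch(Kern kernel, Args... args) {
  kernel<<<1, 1>>>(args...);
}

__global__ void set_zero(int *p) { *p = 0; }

void host_side(int *p) { generic_launch(set_zero, p); }
```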
2024-10-22 17:16:46 +01:00
Youngsuk Kim
0f0a96b862 [llvm][NVPTX] Strip unneeded '+0' in PTX load/store (#113017)
Remove the extraneous '+0' immediate offset part in PTX load/stores, to
improve readability of output PTX code.
2024-10-19 10:05:36 -04:00
Matt Arsenault
51b4ada458 clang/AMDGPU: Set noalias.addrspace metadata on atomicrmw (#102462) 2024-10-17 17:10:45 +04:00
Matt Arsenault
d50302f31c clang/AMDGPU: Stop emitting amdgpu-unsafe-fp-atomics attribute (#111579) 2024-10-09 08:52:32 +04:00
Alex Voicu
e203a67f4c [cuda][HIP] __constant__ should imply constant (#110182)
Currently, `__constant__` variables do not get unconditionally marked as
`constant` in IR, which seems a bit odd given their definition. This is
generally inconsequential for NVPTX/AMDGPU, since said variables get
emitted in the constant address space for those BEs. However, it is
potentially significant for e.g. HIP-on-SPIR-V cases, as SPIR-V does not
allow casts to/from the constant AS (`UniformConstant`), which forces
`__constant__` variables to be emitted in the global AS, thus making IR
constness meaningful.
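A hedged example of the kind of variable affected (the kernel is hypothetical):

```
// Hedged sketch: after this change the IR global backing Mask should be
// marked 'constant' even where it has to live in the global AS, as in
// the HIP-on-SPIR-V case described above.
__constant__ int Mask[4] = {1, 2, 4, 8};

__global__ void apply_mask(int *out) {
  out[threadIdx.x] &= Mask[threadIdx.x % 4];
}
```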
2024-09-29 01:22:52 +01:00
jofrn
b5fd9463a3 [HIP][Clang][CodeGen] Handle hip bin symbols properly. (#107458)
Remove '_' in fatbin and gpubin symbol suffixes when the TU hash ID is
missing. Internalize the gpubin symbol so that it is not unresolved at
link time when the symbol is not relocatable.
2024-09-11 18:46:46 -04:00
Alex Voicu
ad435bcc14 [clang][CodeGen][SPIR-V][AMDGPU] Tweak AMDGCNSPIRV ABI to allow for the correct handling of aggregates passed to kernels / functions. (#102776)
The AMDGPU kernel ABI is not directly representable in SPIR-V, since it
relies on passing aggregates `byref`, and SPIR-V only encodes `byval`
(which the AMDGPU BE disallows for kernel arguments). As a temporary
solution to this mismatch, we add special handling for AMDGCN flavoured
SPIR-V, whereby aggregates are passed as direct, both to kernels and to
normal functions. This is not ideal (there are pathological cases where
performance is heavily impacted), but it is empirically robust and guaranteed
to work as the AMDGPU BE retains handling of `direct` passing for legacy
reasons.

We will revisit this in the future, but as it stands it is enough to
pass a wide array of integration tests and generates correct SPIR-V and
correct reverse translation into LLVM IR. The
amdgpu-kernel-arg-pointer-type test is updated via the automated script,
and thus becomes quite noisy.
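For concreteness, a hedged sketch of an affected kernel signature (names hypothetical):

```
// Hedged sketch: on plain AMDGCN this aggregate is passed byref to the
// kernel; for AMDGCN flavoured SPIR-V, per this patch, it is passed
// direct so the result stays representable in SPIR-V.
struct Payload {
  float a, b;
  int n;
};

__global__ void consume(Payload p, float *out) { *out = p.a + p.b + p.n; }
```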
2024-08-21 13:16:59 +01:00
Johannes Doerfert
80525dfcde [Offload][CUDA] Allow CUDA kernels to use LLVM/Offload (#94549)
Through the new `-foffload-via-llvm` flag, CUDA kernels can now be
lowered to the LLVM/Offload API. On the Clang side, this is simply done
by using the OpenMP offload toolchain and emitting calls to `llvm*`
functions to orchestrate the kernel launch rather than `cuda*`
functions. These `llvm*` functions are implemented on top of the
existing LLVM/Offload API.

As we are about to redefine the Offload API, this will help us in the
design process as a second offload language.

We do not support any CUDA APIs yet, however, we could:
  https://www.osti.gov/servlets/purl/1892137

For proper host execution we need to resurrect/rebase
  https://tianshilei.me/wp-content/uploads/2021/12/llpp-2021.pdf
(which was designed for debugging).

```
❯❯❯ cat test.cu
extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}

__global__ void square(int *A) { *A = 42; }

int main(int argc, char **argv) {
  int DevNo = 0;
  int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  square<<<1, 1>>>(Ptr);
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}

❯❯❯ clang++ test.cu -O3 -o test123 -foffload-via-llvm --offload-arch=native

❯❯❯ llvm-objdump --offloading test123

test123:        file format elf64-x86-64

OFFLOADING IMAGE [0]:
kind            elf
arch            gfx90a
triple          amdgcn-amd-amdhsa
producer        openmp

❯❯❯ LIBOMPTARGET_INFO=16 ./test123
Ptr 0x155448ac8000, *Ptr 7
Ptr 0x155448ac8000, *Ptr 42
```
2024-08-12 17:44:58 -07:00
Artem Belevich
5629249575 [CUDA] Emit used function list in deterministic order. (#102661)
Fixes https://github.com/llvm/llvm-project/issues/101560
2024-08-12 10:21:23 -07:00
Hari Limaye
94473f4db6 [IRBuilder] Generate nuw GEPs for struct member accesses (#99538)
Generate nuw GEPs for struct member accesses, as inbounds + non-negative
implies nuw.

Regression tests are updated using update scripts where possible, and by
find + replace where not.
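A hedged illustration of the reasoning (types hypothetical):

```
// Hedged sketch: the member access below is an inbounds GEP with a
// non-negative constant offset, so IRBuilder can additionally tag it
// nuw (printed as "getelementptr inbounds nuw ...").
struct Pair {
  int first;
  int second;
};

int load_second(Pair *p) { return p->second; }
```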
2024-08-09 13:25:04 +01:00
Eli Friedman
1762e01cca Fix codegen of consteval functions returning an empty class, and related issues (#93115)
Fix codegen of consteval functions returning an empty class, and related
issues

If a class is empty, don't store it to memory: the store might overwrite
useful data. Similarly, if a class has tail padding that might overlap
other fields, don't store the tail padding to memory.

The problem here turned out to be a bit more general than I initially thought:
basically all uses of EmitAggregateStore were broken. Call lowering had
a method that did mostly the right thing, though: CreateCoercedStore.
Adapt CreateCoercedStore so it always does the conservatively right
thing, and use it for both calls and ConstantExpr.

Also, along the way, fix the "overlap" bit in AggValueSlot: the bit was
set incorrectly for empty classes in some cases.

Fixes #93040.
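A hedged sketch of the kind of pattern involved (types hypothetical; see the issue for the original reproducer):

```
// Hedged sketch: the consteval call yields an empty class, and storing
// its "value" must not write any bytes, or it can clobber data that
// overlaps the empty member. All types here are hypothetical.
struct Empty {};

struct Holder {
  [[no_unique_address]] Empty e;
  int payload;
};

consteval Empty make_empty() { return {}; }

void refresh(Holder &h) {
  h.e = make_empty(); // must leave h.payload untouched
}
```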
2024-08-01 16:18:20 -07:00
Matt Arsenault
41439d5bb7 AMDGPU: Handle remote/fine-grained memory in atomicrmw fmin/fmax lowering (#96759)
Consider the new atomic metadata when choosing to expand as cmpxchg
instead.
2024-08-01 22:08:01 +04:00
darkbuck
fa84297002 [clang][CUDA] Add 'noconvergent' function and statement attribute
- For languages following the SPMD/SIMT programming model, functions and
  call sites are marked 'convergent' by default. 'noconvergent' is added
  in this patch to allow developers to remove that 'convergent'
  attribute when it's safe.

Reviewers:
nhaehnle, Sirraide, yxsamliu, Artem-B, ilovepi, jayfoad, ssahasra, arsenm

Reviewed By: arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/100637
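A hedged usage sketch (assuming the GNU-style spelling; the body is hypothetical):

```
// Hedged sketch: opt a helper out of the default 'convergent' marking
// when it is known not to depend on cross-lane execution.
__device__ __attribute__((noconvergent)) float scale(float x) {
  return x * 2.0f;
}
```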
2024-07-31 11:30:48 -04:00
Matt Arsenault
e108853ac8 clang: Allow targets to set custom metadata on atomics (#96906)
Use this to replace the emission of the amdgpu-unsafe-fp-atomics
attribute in favor of per-instruction metadata. In the future
new fine grained controls should be introduced that also cover
the integer cases.

Add a wrapper around CreateAtomicRMW that appends the metadata,
and update a few use contexts to use it.
2024-07-26 09:57:28 +04:00
Yaxun (Sam) Liu
77fd30f7ce [CUDA][HIP] Fix template static member (#98580)
We should check host/device attributes before emitting static members of
template instantiations.

Fixes: https://github.com/llvm/llvm-project/issues/98151
2024-07-12 10:08:34 -04:00
Matt Arsenault
8f63d154ec clang/AMDGPU: Use atomicrmw for ds fmin/fmax builtins (#96738) 2024-06-27 15:32:08 +02:00
Matt Arsenault
a440a96ec2 AMDGPU: Start selecting flat/global atomicrmw fmin/fmax. (#95592)
Define subtarget features for atomic fmin/fmax support.

The flat/global support is a real mess. We had float/double support at
the beginning in gfx6 and gfx7. gfx8 removed these. gfx10 reintroduced them.
gfx11 removed the f64 versions again.

gfx9 partially reintroduced them, in gfx90a and gfx940 but only for f64.
2024-06-23 10:10:41 +02:00
Matt Arsenault
76894c5e6e clang/AMDGPU: Emit atomicrmw from ds_fadd builtins (#95395)
We should have done this for the f32/f64 case a long time ago. Now that
codegen handles atomicrmw selection for the v2f16/v2bf16 case, start emitting
it instead.

This also upgrades the behavior to respect a volatile-qualified pointer,
which was previously ignored (for the cases that don't have an explicit
volatile argument).
2024-06-18 20:51:14 +02:00
Stephen Tozer
094572701d [RemoveDIs] Print IR with debug records by default (#91724)
This patch makes the final major change of the RemoveDIs project, changing the
default IR output from debug intrinsics to debug records. This is expected to
break a large number of tests: every single one that tests for uses or
declarations of debug intrinsics and does not explicitly disable writing
records. 

If this patch has broken your downstream tests (or upstream tests on a
configuration I wasn't able to run):
1. If you need to immediately unblock a build, pass
`--write-experimental-debuginfo=false` to LLVM's option processing for all
failing tests (remember to use `-mllvm` for clang/flang to forward arguments to
LLVM).
2. For most test failures, the changes are trivial and mechanical, enough that
they can be done by script; see the migration guide for a guide on how to do
this: https://llvm.org/docs/RemoveDIsDebugInfo.html#test-updates
3. If any tests fail for reasons other than FileCheck check lines that need
updating, such as assertion failures, that is most likely a real bug with this
patch and should be reported as such.

For more information, see the recent PSA:
https://discourse.llvm.org/t/psa-ir-output-changing-from-debug-intrinsics-to-debug-records/79578
2024-06-14 15:07:27 +01:00
Nikita Popov
cc2dc0916a Reapply [ConstantFold] Drop gep of gep fold entirely (#95126)
Reapplying without changes. The flang+openmp buildbot failure
should be addressed by https://github.com/llvm/llvm-project/pull/94541.

-----

This is a followup to https://github.com/llvm/llvm-project/pull/93823
and drops the DataLayout-unaware GEP of GEP fold entirely. All cases are
now left to the DataLayout-aware constant folder, which will fold
everything to a single i8 GEP.

We didn't have any test coverage for this fold in LLVM, but some Clang
tests change.
2024-06-13 17:03:35 +02:00
Nikita Popov
cece0a105b Revert "[ConstantFold] Drop gep of gep fold entirely (#95126)"
This reverts commit 3b3b839c66.

This broke the flang+openmp+offload buildbot, as reported in
https://github.com/llvm/llvm-project/pull/95126#issuecomment-2162424019.
2024-06-12 11:52:12 +02:00
Nikita Popov
3b3b839c66 [ConstantFold] Drop gep of gep fold entirely (#95126)
This is a followup to https://github.com/llvm/llvm-project/pull/93823
and drops the DataLayout-unaware GEP of GEP fold entirely. All cases are
now left to the DataLayout-aware constant folder, which will fold
everything to a single i8 GEP.

We didn't have any test coverage for this fold in LLVM, but some Clang
tests change.
2024-06-12 09:50:14 +02:00
Alex Voicu
88e2bb4092 [clang][SPIR-V] Add support for AMDGCN flavoured SPIRV (#89796)
This change seeks to add support for vendor flavoured SPIRV - more
specifically, AMDGCN flavoured SPIRV. The aim is to generate SPIRV that
carries some extra bits of information that are only usable by AMDGCN
targets, forfeiting absolute genericity to obtain greater expressiveness
for target features:

- AMDGCN inline ASM is allowed/supported, under the assumption that the
[SPV_INTEL_inline_assembly](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_inline_assembly.asciidoc)
extension is enabled/used
- AMDGCN target specific builtins are allowed/supported, under the
assumption that e.g. the `--spirv-allow-unknown-intrinsics` option is
enabled when using the downstream translator
- the featureset matches the union of AMDGCN targets' features
- the datalayout string is overspecified to affix both the program
address space and the alloca address space, the latter under the
assumption that the
[SPV_INTEL_function_pointers](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_function_pointers.asciidoc)
extension is enabled/used, in which case the extant SPIRV datalayout
string would lead to pointers to functions pointing to the private
address space, which would be wrong.

Existing AMDGCN tests are extended to cover this new target. It is
currently dormant / will require some additional changes, but I thought
I'd rather put it up for review to get feedback as early as possible. I
will note that an alternative option is to place this under AMDGPU, but
that seems slightly less natural, since this is still SPIRV, albeit
relaxed in terms of preconditions & constrained in terms of
postconditions, and only guaranteed to be usable on AMDGCN targets (it
is still possible to obtain pristine portable SPIRV through usage of the
flavoured target, though).
2024-06-07 11:50:23 +01:00
Alex MacLean
e8500a7054 fixup cuda-builtin-vars.cu broken in IntrRange change (#94639) 2024-06-06 10:10:00 -07:00
Alex MacLean
435addbf50 [NVPTX] Revamp NVVMIntrRange pass (#94422)
Revamp the NVVMIntrRange pass making the following updates:
- Use range attributes over range metadata. This is what instcombine has
moved to for ranges on intrinsics in
https://github.com/llvm/llvm-project/pull/88776 and it seems a bit
cleaner.
- Consider the `!"maxntid{x,y,z}"` and `!"reqntid{x,y,z}"` function
metadata when adding ranges for `tid` intrinsics. This can allow
for smaller ranges and more optimization.
- When range attributes are already present, use the intersection of the
old and new range. This complements the metadata change by allowing
ranges to be shrunk when an intrinsic is in a function which is inlined
into a kernel with metadata. While we don't call this more than once
yet, we should consider adding a second call after inlining, once this
has had a chance to soak for a while and no issues have arisen.

I've also re-enabled this pass in the TM, it was disabled years ago due
to "numerical discrepancies" https://reviews.llvm.org/D96166. In our
testing we haven't seen any issues with adding ranges to intrinsics, and
I cannot find any further info about what issues were encountered.
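A hedged sketch of how the launch-bounds metadata tightens ranges:

```
// Hedged sketch: __launch_bounds__(128) emits !"maxntidx" metadata, so
// the revamped pass can attach a [0, 128) range to reads of tid.x
// rather than the architectural maximum.
__global__ void __launch_bounds__(128) fill(int *out) {
  out[threadIdx.x] = threadIdx.x;
}
```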
2024-06-06 06:42:46 -07:00
Yaxun (Sam) Liu
be5075ab8d [CUDA] make kernel stub ICF-proof (#90155)
The MSVC linker merges comdat functions that have identical sets of
instructions. CUDA uses the kernel stub function as the key to look up
kernels in device executables. If kernel stub functions for different
kernels are merged by ICF, incorrect kernels will be launched.

To prevent ICF from merging kernel stub functions, a unique global
variable is created for each kernel stub function having comdat, and a
store to it is added to the kernel stub function. This makes the set of
instructions in each kernel stub function unique.

Fixes: https://github.com/llvm/llvm-project/issues/88883
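A hedged sketch of the emitted shape (all names hypothetical; clang generates the real stubs):

```
// Hedged sketch: each comdat stub gets its own global plus a store to
// it, so no two stubs have identical instruction streams and ICF
// cannot merge them. All names are hypothetical.
static char __hip_stub_guard_kernelA;
static char __hip_stub_guard_kernelB;

void __device_stub_kernelA() { __hip_stub_guard_kernelA = 0; /* launch A */ }
void __device_stub_kernelB() { __hip_stub_guard_kernelB = 0; /* launch B */ }
```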
2024-05-01 10:24:23 -04:00
Yaxun (Sam) Liu
748ef7eccc [CUDA][HIP] Fix record layout on Windows (#87651)
On Windows, the record layout should be consistent with the host side;
otherwise host code is not able to access fields of the record correctly.

Fixes: https://github.com/llvm/llvm-project/issues/51031

Fixes: SWDEV-446010
2024-04-17 21:44:12 -04:00
Joseph Huber
470aefb240 [Offload][NFC] Remove omp_ prefix from offloading entries (#88071)
Summary:
These entries are generic for offloading with the new driver now. Having
the `omp` prefix was a historical artifact and is confusing when used
for CUDA. This patch just renames them for now, future patches will
rework the binary format to make it more common.
2024-04-09 15:50:15 -05:00
Jun Wang
c4e517f59c [AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
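A hedged usage sketch, assuming the attribute takes the x, y, z workgroup counts in order (the kernel is hypothetical):

```
// Hedged sketch: promise the compiler this kernel is always launched
// with a 256x1x1 grid of workgroups so it can optimize accordingly.
__global__ void
__attribute__((amdgpu_num_work_groups(256, 1, 1)))
reduce_step(float *data) {
  data[threadIdx.x] *= 0.5f;
}
```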
2024-03-12 10:30:39 -07:00
Yaxun (Sam) Liu
b46f980454 [HIP] fix host-used external kernel (#83870)
In -fgpu-rdc mode, when an external kernel is used by a host function
with weak_odr linkage (e.g. explicitly instantiated template function),
the kernel should not be marked as a host-used external kernel, since the
host function may be dropped by the linker. Marking the external kernel as
a host-used external kernel would force a reference to it, which the user
may not define in another TU.

Fixes: https://github.com/llvm/llvm-project/issues/83771
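A hedged sketch of the problematic pattern (names hypothetical):

```
// Hedged sketch: the weak_odr host instantiation below may be dropped
// by the linker, so its reference to ext_kernel must not force
// ext_kernel to be treated as a host-used external kernel.
__global__ void ext_kernel(); // defined (or not) in some other TU

template <typename T> void launch_it() { ext_kernel<<<1, 1>>>(); }
template void launch_it<int>(); // explicit instantiation: weak_odr
```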
2024-03-08 10:50:38 -05:00
Emma Pilkington
4490003a22 [AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
Yaxun (Sam) Liu
33a6ce1837 [HIP] Allow partial linking for -fgpu-rdc (#81700)
`-fgpu-rdc` mode allows device functions to call device functions in a
different TU. However, currently all device objects have to be linked
together since only one fat binary is supported. This is time consuming
for AMDGPU backend since it only supports LTO.

There are use cases that objects can be divided into groups in which
device functions are self-contained but host functions are not. It is
desirable to link/optimize/codegen the device code and generate a fatbin
for each group, while partially linking the host code with `ld -r` or
generating a static library using clang's `--emit-static-lib` option.
This avoids linking all device code together and therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin` for all
objects for `-fgpu-rdc`. With this patch, clang emits a unique external
symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a
group of objects are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary. Each group has its
own fat binary. One executable or shared library can have multiple fat
binaries. Device linking is done for undefined fat binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also uniquified and
merged to avoid repeated registration. Symbol `__hip_cuid_{cuid}` is
introduced to facilitate debugging and tooling.

Fixes: https://github.com/llvm/llvm-project/issues/77018
2024-02-22 13:51:31 -05:00
Mészáros Gergely
5942868a21 [clang][AMDGPU][CUDA] Handle __builtin_printf for device printf (#68515)
Previously `__builtin_printf` would result in emitting a call to `printf`,
even though directly calling `printf` was translated.

Ref: #68478
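A hedged example of the now-working pattern:

```
// Hedged example: __builtin_printf in device code is now handled the
// same way as a direct printf call.
__global__ void report(int *p) {
  __builtin_printf("value: %d\n", *p);
}
```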
2024-02-05 23:23:13 +05:30
Pierre van Houtryve
500846d2f5 [AMDGPU] Introduce Code Object V6 (#76954)
Introduce Code Object V6 in Clang, LLD, Flang and LLVM. This is the same
as V5 except a new "generic version" flag can be present in EFLAGS. This
is related to new generic targets that'll be added in a follow-up patch.
It's also likely V6 will have new changes (possibly new metadata
entries) added later.

Docs change are part of the follow-up patch #76955
2024-02-05 08:19:53 +01:00
Saiyedul Islam
082f87c9d4 [AMDGPU] Change default AMDHSA Code Object version to 5 (#79038)
Also update LIT tests and docs.
For more details, see
https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata

Corresponding llvm-objdump AMDGPU lit tests are updated
in a follow-up PR.
2024-01-23 17:08:18 +05:30
Yingwei Zheng
1228becf7d [FuncAttrs] Deduce noundef attributes for return values (#76553)
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any 
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding 
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6

Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|

The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
2023-12-31 20:44:48 +08:00
Joseph Huber
97f3be2c5a [CUDA][HIP] Improve variable registration with the new driver (#73177)
Summary:
This patch adds support for registering texture / surface variables from
CUDA / HIP. Additionally, we now properly track the `extern` and `const`
flags that are also used in these runtime functions.

This does not implement the `managed` variables yet as those seem to
require some extra handling I'm not familiar with. The issue is that the
current offload entry isn't large enough to carry size and alignment
information along with an extra global.
2023-12-07 15:44:23 -06:00
Joseph Huber
52204a29ab [Offload] Initial support for registering offloading entries on COFF targets (#72697)
Summary:
This patch provides the initial support to allow handling the new
driver's offloading entries. Normally, the ELF target can emit varibles
at C-identifier named sections and the linker will provide a pointer to
the section. For COFF target, instead the linker merges sections
containing a `$` in alphabetical order. We thus can emit these variables
at sections and then emit two variables that are guaranteed to be sorted
before and after the others to traverse it. Previous patches
consolidated the handling of offloading entries so that this patch more
easily can handle mapping them to the appropriate section.

Ideally, the only remaining step to allow the new driver to run on
Windows targets is to accurately map the following `ld.lld` arguments to
their `llvm-link` equivalents. These are used inside the linker-wrapper,
so we should simply need to remap the arguments to the same
functionality if possible.
```
-o, -output
-l, --library
-L, --library-path
-v, --version
-rpath
-whole-archive, -no-whole-archive
```

I have not tested this at runtime as I do not have access to a Windows
machine.

This patch was adapted from some initial efforts in
https://reviews.llvm.org/D137470.
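A hedged sketch of the `$`-section trick on COFF (section and symbol names hypothetical):

```
// Hedged sketch: COFF linkers merge "entry$a".."entry$z" into a single
// "entry" section in alphabetical order, so sentinels in $a and $z
// bracket every record emitted into, say, $m. Names are hypothetical.
#pragma section("entry$a", read)
#pragma section("entry$z", read)

struct Entry { const char *name; void *addr; };

__declspec(allocate("entry$a")) static const Entry Start = {};
__declspec(allocate("entry$z")) static const Entry Stop = {};

// Traverse: for (const Entry *e = &Start + 1; e != &Stop; ++e) ...
```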
2023-11-21 06:48:34 -06:00
Yaxun (Sam) Liu
9774d0ce5f [CUDA][HIP] Make template implicitly host device (#70369)
Added option -foffload-implicit-host-device-templates which is off by
default.

When the option is on, template functions and specializations without
host/device attributes have implicit host device attributes.

They can be overridden by device template functions with the same
signature.
They are emitted on device side only if they are used on device side.

This feature is added as an extension.
`__has_extension(cuda_implicit_host_device_templates)` can be used to
check whether it is enabled.

This is to facilitate using standard C++ headers for device.

Fixes: https://github.com/llvm/llvm-project/issues/69956

Fixes: SWDEV-428314
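A hedged sketch of what the extension enables (names hypothetical):

```
// Hedged sketch: with -foffload-implicit-host-device-templates, this
// attribute-less template (e.g. from a standard header) is implicitly
// host+device, so the kernel below can call it.
template <typename T> T clamp_lo(T x, T lo) { return x < lo ? lo : x; }

__global__ void kern(float *p) { *p = clamp_lo(*p, 0.0f); }
```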
2023-11-09 20:36:38 -05:00
Saiyedul Islam
21861991e7 [OpenMP] Cleanup and fixes for ABI agnostic DeviceRTL (#71234)
Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses the
already available global variable "oclc_ABI_version" instead of
"llvm.amdgcn.abi.version".

It also adds some minor fields to the ImplicitArg structure.
2023-11-09 10:34:35 +05:30
Pierre van Houtryve
4428b01faa Reland: [AMDGPU] Remove Code Object V3 (#67118)
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.

- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
2023-11-07 12:23:03 +01:00
pvanhout
868abf0961 Revert "[AMDGPU] Remove Code Object V3 (#67118)"
This reverts commit 544d91280c.
2023-10-18 12:55:36 +02:00
Pierre van Houtryve
544d91280c [AMDGPU] Remove Code Object V3 (#67118)
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.

- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
2023-10-16 08:21:48 +02:00
Noah Goldstein
2da4960f20 [Inliner] Also propagate noundef and align ret attributes during inlining
Both of these can potentially be lost otherwise.
2023-10-03 16:12:19 -05:00
Jakub Chlanda
3f8d4a8ef2 Reland [NVPTX] Add support for maxclusterrank in launch_bounds (#66496) (#67667)
This reverts commit 0afbcb20fd.
2023-09-29 08:39:31 +02:00
Sam McCall
0afbcb20fd Revert "[NVPTX] Add support for maxclusterrank in launch_bounds (#66496)"
This reverts commit dfab31b41b.

SemaDeclAttr.cpp cannot depend on Basic's private headers
(lib/Basic/Targets/NVPTX.h)
2023-09-27 10:59:04 +02:00
Jakub Chlanda
dfab31b41b [NVPTX] Add support for maxclusterrank in launch_bounds (#66496)
Since SM_90, CUDA supports specifying an additional argument to the
launch_bounds attribute: maxBlocksPerCluster, to express the maximum
number of CTAs that can be part of the cluster. See:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-dimension-directives-maxclusterrank
and

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds
for details.
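A hedged usage sketch (the third argument is the new maxBlocksPerCluster bound):

```
// Hedged sketch, SM_90+: cap this kernel at 4 blocks per cluster via
// the third __launch_bounds__ argument added by this patch.
__global__ void __launch_bounds__(256, 2, 4) cluster_kernel(float *out) {
  out[threadIdx.x] = 1.0f;
}
```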
2023-09-27 08:51:26 +02:00
Pierre van Houtryve
fe2f67e4ba [AMDGPU] Remove Code Object V2 (#65715)
Code Object V2 has been deprecated for more than a year now. We can
safely remove it from LLVM.

- [clang] Remove support for the `-mcode-object-version=2` option.
- [lld] Remove/refactor tests that were still using COV2
- [llvm] Update AMDGPUUsage.rst
- Code Object V2 docs are left for informational purposes because those
code objects may still be supported by the runtime/loaders for a while.
- [AMDGPU] Remove COV2 emission capabilities.
- [AMDGPU] Remove `MetadataStreamerYamlV2` which was only used by COV2
- [AMDGPU] Update all tests that were still using COV2 - They are either
deleted or ported directly to code object v4 (as v3 is also planned to
be removed soon).
2023-09-21 12:00:45 +02:00
Yaxun (Sam) Liu
d7e1932f85 [HIP] Fix comdat of template kernel handle (#66283)
Currently, clang emits LLVM IR that fails the verifier for the following
code:

```
template<typename T>
__global__ void foo(T x);

void bar() {
  foo<<<1, 1>>>(0);
}
```
This is due to clang putting the kernel handle for foo into comdat,
which is not allowed, since the kernel handle is a declaration.

The situation is similar to calling a declaration-only template
function. The callee will be a declaration in LLVM IR and won't be put
into comdat. This is in contrast to calling a template function with
body, which will be put into comdat.

Fixes: SWDEV-419769
2023-09-14 15:56:02 -04:00