Commit Graph

725 Commits

Author SHA1 Message Date
Changpeng Fang
350bda4419 AMDGPU: Rename intrinsics and remove f16/bf16 versions for load transpose (#86313)
Rename the intrinsics to close to the instruction mnemonic names:
Use global_load_tr_b64 and global_load_tr_b128 instead of
global_load_tr.

This patch also removes f16/bf16 versions of builtins/intrinsics. To
simplify the design, we should avoid enumerating all possible types in
implementing builtins. We can always use bitcast.
2024-03-25 16:55:22 -07:00
Changpeng Fang
3054d0dae7 AMDGPU: Rename and add bf16 support for global_load_tr builtins (#86202)
Make the name of a clang builtin as close to the mnemonic instruction
name as possible. The data type suffix may not be enough to tell what
instruction the builtin is going to produce.
  This patch also add the bf16 support for global_load_tr_b128 builtins.
2024-03-22 08:51:53 -07:00
Antonio Frighetto
b433076fcb [clang][CodeGen] Allow memcpy replace with trivial auto var init
When emitting the storage (or memory copy operations) for constant
initializers, the decision whether to split a constant structure or
array store into a sequence of field stores or to use `memcpy` is
based upon the optimization level and the size of the initializer.
In afe8b93ffd, we extended this by
allowing constants to be split when the array (or struct) type does
not match the type of data the address to the object (constant) is
expected to contain. This may happen when `emitStoresForConstant` is
called by `EmitAutoVarInit`, as the element type of the address gets
shrunk. When this occurs, let the initializer be split into a bunch
of stores only under `-ftrivial-auto-var-init=pattern`.

Fixes: https://github.com/llvm/llvm-project/issues/84178.
2024-03-21 09:55:04 +01:00
Jun Wang
c4e517f59c [AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-03-12 10:30:39 -07:00
Changpeng Fang
96813de52d AMDGPU: Define a feature for v_dot4_f32_* instructions (#84248)
FeatureDot11Insts (dot11-insts) for:
  v_dot4_f32_fp8_fp8, v_dot4_f32_fp8_bf8,
  v_dot4_f32_bf8_fp8, v_dot4_f32_bf8_bf8
2024-03-06 14:37:03 -08:00
Emma Pilkington
4490003a22 [AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
Joseph Huber
1fc5e50ceb [AMDGPU] Implement 'llvm.get.fpenv' and 'llvm.set.fpenv' (#83906)
Summary:
This patch implements the LLVM floating point environment control
intrinsics and also exposes it through clang. We encode the floating
point environment as a 64-bit value that simply concatenates the values
of the mode registers and the current trap status. We only fetch the
bits relevant for floating point instructions. That is, rounding mode,
denormalization mode, ieee, dx10 clamp, debug, enabled traps, f16
overflow, and active exceptions.
2024-03-06 08:11:54 -06:00
Shilei Tian
2ad43fa467 [AMDGPU] Fix operand types for V_DOT2_F32_BF16 (#82044) 2024-02-20 08:25:01 -05:00
Shilei Tian
46734aa1e5 [AMDGPU] Use bf16 instead of i16 for bfloat (#80908)
Currently we generally use `i16` to represent `bf16` in those tablegen
files. This patch is trying to use `bf16` directly.

Fix #79369.
2024-02-16 15:58:30 -05:00
Logikable
5fdd094837 [clang][CodeGen] Emit atomic IR in place of optimized libcalls. (#73176)
In the beginning, Clang only emitted atomic IR for operations it knew
the
underlying microarch had instructions for, meaning it required
significant
knowledge of the target. Later, the backend acquired the ability to
lower
IR to libcalls. To avoid duplicating logic and improve logic locality,
we'd like to move as much as possible to the backend.

There are many ways to describe this change. For example, this change
reduces the variables Clang uses to decide whether to emit libcalls or
IR, down to only the atomic's size.
2024-02-12 09:33:09 -08:00
Joseph Huber
3c707310a3 [NVPTX] Add clang builtin for __nvvm_reflect intrinsic (#81277)
Summary:
Some recent support made usage of `__nvvm_reflect` more consistent. We
should expose it as a builtin rather than forcing users to externally
define the function.
2024-02-09 14:11:01 -06:00
Joseph Huber
dbed89814e [AMDGPU] Add missing __builtin_amdgcn_wavefrontsize builtin (#80741)
Summary:
The backend supports the wavefrontsize intrinsic, and suggests that it
is tied to a corresponding clang builtin, but it is not actually
present. This simply adds it in so it can be used from clang. This
attribute likely isn't the best to rely on, but for the `libc` use-case
we will need to detect a struct's differing size in a way that will
depend on the wavefront size.
2024-02-05 17:58:19 -06:00
Valery Pykhtin
b8025d1482 Reapply "[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode." (#80303)
Reapply #71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`.

This reverts commit 9791e54149.
2024-02-02 13:09:25 +01:00
Jay Foad
b5c0b67bc2 [AMDGPU] Check wavefrontsize for GFX11 WMMA builtins (#79980) 2024-02-01 10:49:42 +00:00
Mariusz Sikora
f96e85b949 [AMDGPU][GFX12] Add tests for unsupported builtins (#78729)
__builtin_amdgcn_mfma* and __builtin_amdgcn_smfmac*
2024-01-31 14:04:04 +01:00
Joseph Huber
f2a78e68ee [AMDGPU] Do not emit arch dependent macros with unspecified cpu (#80035)
Summary:
Currently, the AMDGPU toolchain accepts not passing `-mcpu` as a means
to create a sort of "generic" IR. The resulting IR will not contain any
target dependent attributes and can then be inserted into another
program via `-mlink-builtin-bitcode` to inherit its attributes.

However, there are a handful of macros that can leak incorrect
information when compiling for an unspecified architecture. Currently,
things like the wavefront size will default to 64, which is actually
variable. We should not expose these macros unless it is known.
2024-01-30 13:05:29 -06:00
Joseph Huber
72d4fc1b4d Revert "[AMDGPU] Do not emit arch dependent macros with unspecified cpu (#79660)"
This reverts commit c9a6e993f7.

This breaks HIP code that incorrectly depended on GPU-specific macros to
be set. The code is totally wrong as using `__WAVEFRTONSIZE__` on the
host is absolutely meaningless, but it seems this entire corner of the
toolchain is fundmentally broken. Reverting for now to avoid breakages.
2024-01-29 11:11:25 -06:00
Joseph Huber
c9a6e993f7 [AMDGPU] Do not emit arch dependent macros with unspecified cpu (#79660)
Summary:
Currently, the AMDGPU toolchain accepts not passing `-mcpu` as a means
to create a sort of "generic" IR. The resulting IR will not contain any
target dependent attributes and can then be inserted into another
program via `-mlink-builtin-bitcode` to inherit its attributes.

However, there are a handful of macros that can leak incorrect
information when compiling for an unspecified architecture. Currently,
things like the wavefront size will default to 64, which is actually
variable. We should not expose these macros unless it is known.
2024-01-29 08:46:14 -06:00
Mirko Brkušanin
7fdf608cef [AMDGPU] Add GFX12 WMMA and SWMMAC instructions (#77795)
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Co-authored-by: Piotr Sobczak <piotr.sobczak@amd.com>
2024-01-24 13:43:07 +01:00
Mariusz Sikora
cfddb59be2 [AMDGPU][GFX12] VOP encoding and codegen - add support for v_cvt fp8/… (#78414)
…bf8 instructions

    Add VOP1, VOP1_DPP8, VOP1_DPP16, VOP3, VOP3_DPP8, VOP3_DPP16
    instructions that were supported on GFX940 (MI300):
    - V_CVT_F32_FP8
    - V_CVT_F32_BF8
    - V_CVT_PK_F32_FP8
    - V_CVT_PK_F32_BF8
    - V_CVT_PK_FP8_F32
    - V_CVT_PK_BF8_F32
    - V_CVT_SR_FP8_F32
    - V_CVT_SR_BF8_F32

---------

Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com>
Co-authored-by: Mirko Brkušanin <Mirko.Brkusanin@amd.com>
2024-01-24 12:21:15 +01:00
Saiyedul Islam
082f87c9d4 [AMDGPU] Change default AMDHSA Code Object version to 5 (#79038)
Also update LIT tests and docs.
For more details, see
https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata

Corresponding llvm-objdump AMDGPU lit tests are updated
in a follow-up PR.
2024-01-23 17:08:18 +05:30
Jay Foad
e21b0b083e [AMDGPU] Remove gws feature from GFX12 (#78711)
This was already done for LLVM. This patch just updates the Clang
builtin handling to match.
2024-01-19 15:45:53 +00:00
Jay Foad
ed12388082 [AMDGPU] Do not emit V_DOT2C_F32_F16_e32 on GFX12 (#78709)
That instruction is not supported on GFX12.
Added a testcase which previously crashed without this change.

Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>
2024-01-19 14:36:27 +00:00
Piotr Sobczak
57f6a3f7ea [AMDGPU] Add global_load_tr for GFX12 (#77772)
Support new amdgcn_global_load_tr instructions for load with transpose.

* MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128
* Intrinsic int_amdgcn_global_load_tr
* Clang builtins amdgcn_global_load_tr*
2024-01-18 15:14:42 +01:00
Mariusz Sikora
3e6589f21c [AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917)
- image_atomic_pk_add_f16
- image_atomic_pk_add_bf16
- ds_pk_add_bf16
- ds_pk_add_f16
- ds_pk_add_rtn_bf16
- ds_pk_add_rtn_f16
- flat_atomic_pk_add_f16
- flat_atomic_pk_add_bf16
- global_atomic_pk_add_f16
- global_atomic_pk_add_bf16
- buffer_atomic_pk_add_f16
- buffer_atomic_pk_add_bf16
2024-01-18 14:01:09 +01:00
Mariusz Sikora
28b7e498b6 AMDGPU/GFX12: Add new dot4 fp8/bf8 instructions (#77892)
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).

---------

Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
2024-01-18 14:00:27 +01:00
Jay Foad
4c65787f1e [AMDGPU] Add GFX12 __builtin_amdgcn_s_sleep_var (#77926) 2024-01-18 10:14:01 +00:00
Mariusz Sikora
264fd9e13e [AMDGPU][NFC] Rename feature FP8Insts to FP8ConversionInsts (#78439) 2024-01-18 08:46:53 +01:00
Valery Pykhtin
9791e54149 Revert "[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode." (#78429)
Reverts llvm/llvm-project#71556

Fixes failures:
https://lab.llvm.org/buildbot/#/builders/188/builds/40541
https://lab.llvm.org/buildbot/#/builders/91/builds/21847
https://lab.llvm.org/buildbot/#/builders/98/builds/31671
https://lab.llvm.org/buildbot/#/builders/139/builds/57289
2024-01-17 14:12:07 +01:00
Valery Pykhtin
57b50ef017 [AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode. (#71556)
Substitute with zero-extended to i64 ballot.i32 intrinsic.
2024-01-17 17:02:05 +07:00
Nikita Popov
158d72d728 [Clang] Set writable and dead_on_unwind attributes on sret arguments (#77116)
Set the writable and dead_on_unwind attributes for sret arguments. These
indicate that the argument points to writable memory (and it's legal to
introduce spurious writes to it on entry to the function) and that the
argument memory will not be used if the call unwinds.

This enables additional MemCpyOpt/DSE/LICM optimizations.
2024-01-11 09:46:54 +01:00
Yingwei Zheng
1228becf7d [FuncAttrs] Deduce noundef attributes for return values (#76553)
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any 
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding 
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6

Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|

The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
2023-12-31 20:44:48 +08:00
Nikita Popov
a3d2d34e84 [Clang] Use poison as base for vector literals
When constructing vectors from elements, use poison instead of
undef as the base value. These literals always initialize all
elements (padding the remainder with zero), so that the choice
of base value does not affect semantics.
2023-12-19 11:53:18 +01:00
Jessica Del
32f9983c06 [AMDGPU] - Add address space for strided buffers (#74471)
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.

We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
2023-12-15 15:49:25 +01:00
Mariusz Sikora
966416b9e8 [AMDGPU][GFX12] Add new v_permlane16 variants (#75475) 2023-12-15 10:14:38 +01:00
Mariusz Sikora
7f55d7de1a [AMDGPU] GFX12: Add Split Workgroup Barrier (#74836)
Co-authored-by: Vang Thao <Vang.Thao@amd.com>
2023-12-13 15:01:13 +01:00
Romaric Jodin
d56e0d07cc clang/OpenCL: set sqrt fp accuracy on call to Z4sqrt (#66651)
This is reverting the previous implementation to avoid adding inline
function in opencl headers.
This was breaking clspv flow google/clspv#1231, while
https://reviews.llvm.org/D156743 mentioned that just decorating the call
node with `!pfmath` was enough.
This PR is implementing this idea.
The test has been updated with this implementation.
2023-12-01 16:34:44 +09:00
serge-sans-paille
afe8b93ffd [clang] Avoid memcopy for small structure with padding under -ftrivial-auto-var-init (#71677)
Recommit of 0d2860b795 with extra test
cases fixed.
2023-11-25 00:11:20 +01:00
Florian Hahn
419a4e41fc Revert "[clang] Avoid memcopy for small structure with padding under -ftrivial-auto-var-init (#71677)"
This reverts commit fe5c360a9a.
The commit causes the tests below to fail on many buildbots, e.g.
https://lab.llvm.org/buildbot/#/builders/245/builds/17047

  Clang :: CodeGen/aapcs-align.cpp
  Clang :: CodeGen/aapcs64-align.cpp
2023-11-23 20:18:55 +00:00
Jay Foad
cf1e0c0b07 [AMDGPU] Define new targets gfx1200 and gfx1201 (#73133)
Define target names and ELF numbers for new GFX12 targets gfx1200 and
gfx1201. For now they behave identically to GFX11.
2023-11-23 16:44:05 +00:00
serge-sans-paille
fe5c360a9a [clang] Avoid memcopy for small structure with padding under -ftrivial-auto-var-init (#71677)
Recommit of 0d2860b795 with extra test
cases fixed.
2023-11-23 17:37:03 +01:00
Jessica Del
b025864af8 [AMDGPU] - Add clang builtins for tied WMMA intrinsics (#70669)
Add clang builtins for the new tied wmma intrinsics. 
These variations tie the destination
accumulator matrix to the input
accumulator matrix.

See https://github.com/llvm/llvm-project/pull/69903 for context.
2023-11-13 13:23:26 +01:00
Rana Pratap Reddy
13ea1146a7 [AMDGPU] Lower __builtin_amdgcn_read_exec_hi to use amdgcn_ballot (#69567)
Currently __builtin_amdgcn_read_exec_hi lowers to llvm.read_register,
this patch lowers it to use amdgcn_ballot.
2023-10-26 10:26:11 +05:30
Nikita Popov
3b25407d97 [IR] Mark zext/sext constant expressions as undesirable
Introduce isDesirableCastOp() which determines whether IR builder
and constant folding should produce constant expressions for a
given cast type. This mirrors what we do for binary operators.

Mark zext/sext as undesirable, which prevents most creations of such
constant expressions. This is still somewhat incomplete and there
are a few more places that can create zext/sext expressions.

This is part of the work for
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.

The reason for the odd result in the constantexpr-fneg.c test is
that initially the "a[]" global is created with an [0 x i32] type,
at which point the icmp expression cannot be folded. Later it is
replaced with an [1 x i32] global and the icmp gets folded away.
But at that point we no longer fold the zext.
2023-10-02 12:40:20 +02:00
Matt Arsenault
ddc3346a6b clang/AMDGPU: Fix accidental behavior change for __builtin_amdgcn_ldexph (#66340) 2023-09-14 18:15:44 +03:00
Matt Arsenault
15e0fe0b61 clang/OpenCL: Add inline implementations of sqrt in builtin header
We want the !fpmath metadata to be attached to the sqrt intrinsic to
make it to the backend lowering. Emit an available_externally
definition which uses the builtin, which emits the !fpmath.

Fixes #64264

https://reviews.llvm.org/D156743
2023-09-12 23:23:00 +03:00
Saiyedul Islam
466a8149b3 Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)" (#66060)
This reverts commit 0a8d17e79b.
2023-09-12 15:13:59 +05:30
Saiyedul Islam
0a8d17e79b [AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)
Also update LIT tests and docs.
For more details, see
https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata

Reviewed By: arsenm, jhuber6

Github PR: #65410

Differential Revision: https://reviews.llvm.org/D129818
2023-09-12 13:53:31 +05:30
Matt Arsenault
6a08cf12d9 clang: Add __builtin_exp10* and use new llvm.exp10 intrinsic
https://reviews.llvm.org/D157911
2023-09-09 23:14:12 +03:00
Saiyedul Islam
f616c3eeb4 [OpenMP][DeviceRTL][AMDGPU] Support code object version 5
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.

CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.

Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Differential Revision: https://reviews.llvm.org/D139730

Reviewed By: jhuber6, yaxunl
2023-08-29 06:35:44 -05:00