Summary:
This patch adds support for registering texture / surface variables from
CUDA / HIP. Additionally, we now properly track the `extern` and `const`
flags that are also used in these runtime functions.
This does not implement the `managed` variables yet as those seem to
require some extra handling I'm not familiar with. The issue is that the
current offload entry isn't large enough to carry size and alignment
information along with an extra global.
Summary:
This patch provides the initial support to allow handling the new
driver's offloading entries. Normally, the ELF target can emit varibles
at C-identifier named sections and the linker will provide a pointer to
the section. For COFF target, instead the linker merges sections
containing a `$` in alphabetical order. We thus can emit these variables
at sections and then emit two variables that are guaranteed to be sorted
before and after the others to traverse it. Previous patches
consolidated the handling of offloading entries so that this patch more
easily can handle mapping them to the appropriate section.
Ideally, the only remaining step to allow the new driver to run on
Windows targets is to accurately map the following `ld.lld` arguments to
their `llvm-link` equivalents. These are used inside the linker-wrapper,
so we should simply need to remap the arguments to the same
functionality if possible.
```
-o, -output
-l, --library
-L, --library-path
-v, --version
-rpath
-whole-archive, -no-whole-archive
```
I have not tested this at runtime as I do not have access to a windows
machine.
This patch was adapted from some initial efforts in
https://reviews.llvm.org/D137470.
Added option -foffload-implicit-host-device-templates which is off by
default.
When the option is on, template functions and specializations without
host/device attributes have implicit host device attributes.
They can be overridden by device template functions with the same
signagure.
They are emitted on device side only if they are used on device side.
This feature is added as an extension.
`__has_extension(cuda_implicit_host_device_templates)` can be used to
check whether it is enabled.
This is to facilitate using standard C++ headers for device.
Fixes: https://github.com/llvm/llvm-project/issues/69956
Fixes: SWDEV-428314
Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses
already available global variable "oclc_ABI_version" instead of
"llvm.amdgcn.abi.verion".
It also adds some minor fields in ImplicitArg structure.
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.
- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.
- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
Code Object V2 has been deprecated for more than a year now. We can
safely remove it from LLVM.
- [clang] Remove support for the `-mcode-object-version=2` option.
- [lld] Remove/refactor tests that were still using COV2
- [llvm] Update AMDGPUUsage.rst
- Code Object V2 docs are left for informational purposes because those
code objects may still be supported by the runtime/loaders for a while.
- [AMDGPU] Remove COV2 emission capabilities.
- [AMDGPU] Remove `MetadataStreamerYamlV2` which was only used by COV2
- [AMDGPU] Update all tests that were still using COV2 - They are either
deleted or ported directly to code object v4 (as v3 is also planned to
be removed soon).
Currently, clang emits LLVM IR that fails verifier for the following
code:
```
template<typename T>
__global__ void foo(T x);
void bar() {
foo<<<1, 1>>>(0);
}
```
This is due to clang putting the kernel handle for foo into comdat,
which is not allowed, since the kernel handle is a declaration.
The siutation is similar to calling a declaration-only template
function. The callee will be a declaration in LLVM IR and won't be put
into comdat. This is in contrast to calling a template function with
body, which will be put into comdat.
Fixes: SWDEV-419769
Summary:
We use the `llvm.amgcn.abi.version` varaible to control code generation.
This is emitted in every module now to indicate what should be used when
compiling. Previously, the logic caused us to emit an external reference
to this variable when creating the code for the `none` type. This would
then cause us not to emit the actual definition. This patch refines the
logic to create the external reference, and then update it if it is
found unset by the time we emit the global. I had to remove the
reference to `GetOrCreateLLVmGlobal` because it did not accept the
proper address space.
Fixes: https://github.com/llvm/llvm-project/issues/65806
Currently clang put extern shared var ODR-used by host device functions
in global var __clang_gpu_used_external. This behavior was due to
https://reviews.llvm.org/D123441. However, clang should not do that for
extern shared vars since their addresses are per warp, therefore cannot
be accessed by host code.
Buitlins from AMD's device-libs are compiled without specifying a
target-cpu, which results in builtins without the target-features
attribute set.
Before this patch, when linking this builtins with -mlink-builtin-bitcode
the target-features were not propagated in the incoming builtins.
With this patch, the default target features are propagated
if they are compatible with the target-features in the incoming builtin.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D159206
https://reviews.llvm.org/D158247 caused regressions for HIP on Windows
and was reverted.
A reduced test case is:
```
typedef void (__stdcall* funcTy)();
void invoke(funcTy f);
static void __stdcall callee() noexcept {
}
void foo() {
invoke(callee);
}
```
It is due to clang missing handling host/device attributes for calling
convention at a few places
This patch fixes that.
This reverts commit de0df63972.
It was reverted due to regression in HIP unit test on Windows:
In file included from C:\hip-tests\catch\unit\graph\hipGraphClone.cc:37:
In file included from C:\hip-tests\catch\.\include\hip_test_common.hh:24:
In file included from C:\hip-tests\catch\.\include/hip_test_context.hh:24:
In file included from C:/install/native/Release/x64/hip/include\hip/hip_runtime.h:54:
C:/dk/win\vc\14.31.31107\include\thread:76:70: error: cannot initialize a parameter of type '_beginthreadex_proc_type' (aka 'unsigned int (*)(void *) __attribute__((stdcall))') with an lvalue of type 'const unsigned int (*)(void *) noexcept __attribute__((stdcall))': different exception specifications
76 | reinterpret_cast<void*>(_CSTD _beginthreadex(nullptr, 0, _Invoker_proc, _Decay_copied.get(), 0, &_Thr._Id));
| ^~~~~~~~~~~~~
C:\hip-tests\catch\unit\graph\hipGraphClone.cc:290:21) &>' requested here
90 | _Start(_STD forward<_Fn>(_Fx), _STD forward<_Args>(_Ax)...);
| ^
C:\hip-tests\catch\unit\graph\hipGraphClone.cc:290:21) &, 0>' requested here
311 | std::thread t(lambdaFunc);
| ^
C:/dk/win\ms_wdk\e22621\Include\10.0.22621.0\ucrt\process.h:99:40: note: passing argument to parameter '_StartAddress' here
99 | _In_ _beginthreadex_proc_type _StartAddress,
| ^
1 error generated when compiling for gfx1030.
Currently, clang does not resolve certain overloaded functions correctly in the initializer
of global variables, e.g.
template<typename T1, typename U>
T1 mypow(T1, U);
__attribute__((device)) double mypow(double, int);
double t_extent = mypow(1.0, 2);
In the above example, mypow is supposed to resolve to the host version
but clang resolves it to the device version instead, and emits an error
(https://godbolt.org/z/17xxzaa67).
However, if the variable is assigned in a host function, there is no error.
The discrepancy in overloading resolution inside and outside of
a function is due to clang not accounting for the host/device target
when resolving functions called in the initializer of a global variable.
This patch introduces a global host/device target context for CUDA/HIP
for functions called outside of functions. For global variable initialization,
it is determined by the host/device attribute of the variable. For other
situations, a default value of host_device is sufficient.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D158247
Fixes: SWDEV-416731
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.
CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.
Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.
AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.
Differential Revision: https://reviews.llvm.org/D139730
Reviewed By: jhuber6, yaxunl
We had some instances when LLVM would not inline fixed-count memcpy and ended up
attempting to lower it a a libcall, which would not work on NVPTX as there's no
standard library to call.
The patch relaxes the threshold used for -Os compilation so we're always allowed
to inline memory copy functions.
Differential Revision: https://reviews.llvm.org/D158226
Summary:
Byval requires allocating additional stack space, and always requires an implicit copy to be inserted in codegen,
where it can be difficult to optimize. In this work, we use byref/IndirectAliased promotion method instead of
byval with the implicit copy semantics.
Reviewers:
arsenm
Differential Revision:
https://reviews.llvm.org/D155986
Currently clang does not consider host/device preference
when resolving delete operator in the file scope, which
causes device operator delete selected for class member
initialization.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D156795
The patch in D155211 added basic support for the `.alias` keyword in
PTX. This means we should be able to permit use of this in clang.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D156014
By default, clang assumes HIP kernels are launched with uniform block size,
which is the case for kernels launched through triple chevron or
hipLaunchKernelGGL. Clang adds uniform-work-group-size function attribute
to HIP kernels to allow the backend to do optimizations on that.
However, in some rare cases, HIP kernels can be launched
through hipExtModuleLaunchKernel where global work size is specified,
which may result in non-uniform block size.
To be able to support non-uniform block size for HIP kernels,
an option `-f[no-]offload-uniform-block is added. This option
is generic for offloading languages. Its default value is on for
CUDA/HIP and off otherwise.
Make -cl-uniform-work-group-size an alias to -foffload-uniform-block.
Reviewed by: Siu Chi Chan, Matt Arsenault, Fangrui Song, Johannes Doerfert
Differential Revision: https://reviews.llvm.org/D155213
Fixes: SWDEV-406592
Currently CUDA/HIP defines their own language standards in
LanguageStandards.def but they are redundant. They are the same as stdc++14.
The fact that CUDA/HIP uses c++* in option -std= indicates that they have the
same language standards as C++. The CUDA/HIP specific language features are
conveyed through language options, not language standards features. It makes
sense to let CUDA/HIP uses the same default language standard as C++.
Reviewed by: Siu Chi Chan, Artem Belevich
Differential Revision: https://reviews.llvm.org/D155539
Fixes: SWDEV-407685
Rename HIP_API_PER_THREAD_DEFAULT_STREAM and __HIP_NO_IMAGE_SUPPORT
so that they follow the convention with prefix and postfix __.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D155480
OpenCL and HIP have -cl-fp32-correctly-rounded-divide-sqrt and
-fno-hip-correctly-rounded-divide-sqrt. The corresponding fpmath metadata
was only set on fdiv, and not sqrt. The backend is currently underutilizing
sqrt lowering options, and the responsibility is split between the libraries
and backend and this metadata is needed.
CUDA/NVCC has -prec-div and -prev-sqrt but clang doesn't appear to be
aiming for compatibility with those. Don't know if OpenMP has a similar
control.
I'm using clang to compile CUDA code. And just found that clang doesn't support the per-thread stream option for NV CUDA. I don't know if there is another solution.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D154822
Currently, bf16 has been scatteredly added to the PTX codegen. This patch aims to complete the set of instructions and code path required to support bf16 data type.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D144911
Co-authored-by: Artem Belevich <tra@google.com>
Device libs make use of patterns like this:
```
__attribute__((target("gfx11-insts")))
static unsigned do_intrin_stuff(void)
{
return __builtin_amdgcn_s_sendmsg_rtnl(0x0);
}
```
For functions that are assumed to be eliminated if the currennt GPU target doesn't support them.
At O0 such functions aren't eliminated by common optimizations but often by AMDGPURemoveIncompatibleFunctions instead, which sees the "+gfx11-insts" attribute on, say, GFX9 and knows it's not valid, so it removes the function.
D142907 accidentally made it so such attributes were dropped during bitcode linking, making it impossible for RemoveIncompatibleFunctions to catch the functions and causing ISel to catch fire eventually.
This fixes the issue and adds a new test to ensure we don't accidentally fall into this trap again.
Fixes SWDEV-403642
Reviewed By: arsenm, yaxunl
Differential Revision: https://reviews.llvm.org/D152251
i16/f16/bf16 will use the same .b16 registers and
i32/v2f16 and v2bf16 will share .b32 registers.
The changes are mostly mechanical, intended to remove unnecessary register
classes which tend to produce redundant register moves.
Differential Revision: https://reviews.llvm.org/D151601
v2f16 regtype conversion to i32
LLVM IR already allows floating point type in atomicrmw.
Update clang atomic fetch max/min builtins to accept
floating point type like we did for fetch add/sub.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D150985
Fixes: SWDEV-401056
The rest of the fetch/op intrinsics were added in e13246a2ec but sub
was conspicuous by its absence.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D151701
Pursuant to discussions at
https://discourse.llvm.org/t/rfc-c-23-p1467r9-extended-floating-point-types-and-standard-names/70033/22,
this commit enhances the handling of the __bf16 type in Clang.
- Firstly, it upgrades __bf16 from a storage-only type to an arithmetic
type.
- Secondly, it changes the mangling of __bf16 to DF16b on all
architectures except ARM. This change has been made in
accordance with the finalization of the mangling for the
std::bfloat16_t type, as discussed at
https://github.com/itanium-cxx-abi/cxx-abi/pull/147.
- Finally, this commit extends the existing excess precision support to
the __bf16 type. This applies to hardware architectures that do not
natively support bfloat16 arithmetic.
Appropriate tests have been added to verify the effects of these
changes and ensure no regressions in other areas of the compiler.
Reviewed By: rjmccall, pengfei, zahiraam
Differential Revision: https://reviews.llvm.org/D150913
This is stricter than the default "ieee", and should probably be the
default. This patch leaves the default alone. I can change this in a
future patch.
There are non-reversible transforms I would like to perform which are
legal under IEEE denormal handling, but illegal with flushing zero
behavior. Namely, conversions between llvm.is.fpclass and fcmp with
zeroes.
Under "ieee" handling, it is legal to translate between
llvm.is.fpclass(x, fcZero) and fcmp x, 0.
Under "preserve-sign" handling, it is legal to translate between
llvm.is.fpclass(x, fcSubnormal|fcZero) and fcmp x, 0.
I would like to compile and distribute some math library functions in
a mode where it's callable from code with and without denormals
enabled, which requires not changing the compares with denormals or
zeroes.
If an IEEE function transforms an llvm.is.fpclass call into an fcmp 0,
it is no longer possible to call the function from code with denormals
enabled, or write an optimization to move the function into a denormal
flushing mode. For the original function, if x was a denormal, the
class would evaluate to false. If the function compiled with denormal
handling was converted to or called from a preserve-sign function, the
fcmp now evaluates to true.
This could also be of use for strictfp handling, where code may be
changing the denormal mode.
Alternative name could be "unknown".
Replaces the old AMDGPU custom inlining logic with more conservative
logic which tries to permit inlining for callees with dynamic handling
and avoids inlining other mismatched modes.
Fixes clang crash caused by a stale function pointer.
The bug has been present for a pretty long time, but we were lucky not to
trigger it until D140663.
Differential Revision: https://reviews.llvm.org/D146448
The asan functions are now attributed as used
in the device library, no need to keep the
declaration of asan device preserve function.
Patch by: Praveen Velliengiri
Reviewed by: Yaxun Liu
Differential Revision: https://reviews.llvm.org/D143495