clang-p2996

Author	SHA1	Message	Date
Ross Brunton	4785832144	[Offload] Fix cmake warning (#145488 ) Cmake was unhappy that there was no space between arguments, now it is.	2025-06-24 13:42:03 +01:00
Johannes Doerfert	57a90edacd	[OpenMP][GPU][FIX] Enable generic barriers in single threaded contexts (#140786 ) The generic GPU barrier implementation checked if it was the main thread in generic mode to identify single threaded regions. This doesn't work since inside of a non-active (=sequential) parallel, that thread becomes the main thread of a team, and is not the main thread in generic mode. At least that is the implementation of the APIs today. To identify single threaded regions we now check the team size explicitly. This exposed three other issues; one is, for now, expected and not a bug, the second one is a bug and has a FIXME in the single_threaded_for_barrier_hang_1.c file, and the final one is also benign as described in the end. The non-bug issue comes up if we ever initialize a thread state. Afterwards we will never run any region in parallel. This is a little conservative, but I guess thread states are really bad for performance anyway. The bug comes up if we optimize single_threaded_for_barrier_hang_1 and execute it in Generic-SPMD mode. For some reason we loose all the updates to b. This looks very much like a compiler bug, but could also be another logic issue in the runtime. Needs to be investigated. Issue number 3 comes up if we have nested parallels inside of a target region. The clang SPMD-check logic gets confused, determines SPMD (which is fine) but picks an unreasonable thread count. This is all benign, I think, just weird: ``` #pragma omp target teams #pragma omp parallel num_threads(64) #pragma omp parallel num_threads(10) {} ``` Was launched with 10 threads, not 64.	2025-05-20 19:33:54 -07:00
Joseph Huber	dbe070eb3e	[Offload] Fix PowerPC builds that pass -mcpu (#138327 ) Summary: Another hacky fix done until https://github.com/llvm/llvm-project/pull/136729 lands. This time for `-mcpu`.	2025-05-06 14:14:16 -05:00
Joseph Huber	dfcb8cb2a9	[OpenMP] Add pre sm_70 load hack back in (#138589 ) Summary: Different ordering modes aren't supported for an atomic load, so we just do an add of zero as the same thing. It's less efficient, but it works. Fixes https://github.com/llvm/llvm-project/issues/138560	2025-05-05 16:33:41 -05:00
Ye Luo	dcb43307ce	[Offload] Fix dependency issue #126143 in CMake	2025-05-05 00:38:48 -05:00
Joseph Huber	346792aafb	[Offload] Override linker for device build (#137246 ) Summary: Override the default linker in case the user is passing it separately. This requires `lld` but it always did. This will be fixed properly when https://github.com/llvm/llvm-project/pull/136729 lands. Fixes https://github.com/llvm/llvm-project/issues/136822	2025-04-25 17:22:07 +02:00
Joseph Huber	6d0d50f0ac	[OpenMP] Update the bitcode library install and search path (#136754 ) Summary: This was accidentally kept in the old location when we moved to the new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the delta with https://github.com/llvm/llvm-project/pull/136729.	2025-04-23 08:20:15 -05:00
Joseph Huber	92bba68634	[Offload] Fix handling of 'bare' mode when environment missing (#136794 ) Summary: We treated the missing kernel environment as a unique mode, but it was kind of this random bool that was doing the same thing and it explicitly expects the kernel environment to be zero. It broke after the previous change since it used to default to SPMD and didn't handle zero in any of the other cases despite being used. This fixes that and queries for it without needing to consume an error.	2025-04-23 08:16:39 -05:00
Joseph Huber	56bf0e7202	[OpenMP] Remove dependency on LLVM include directory from DeviceRTL (#136359 ) Summary: Currently we depend on a single LLVM include directory. This is actually only required to define one enum, which is highly unlikely to change. THis patch makes the `Environment.h` include directory more hermetic so we no long depend on other libraries. In exchange, we get a simpler dependency list for the price of hard-coding `1` somewhere. I think it's a valid trade considering that this flag is highly unlikely to change at this point. @ronlieb AMD version https://gist.github.com/jhuber6/3313e6f957be14dc79fe85e5126d2cb3	2025-04-21 15:21:47 -05:00
Michał Górny	ac8fc09688	[offload] Unset `-march` when building GPU libraries (#136442 ) Unset `-march` when invoking the compiler and linker to build the GPU libraries. These libraries use GPU targets rather than the CPU targets, and an incidental `-march=native` causes Clang to be able to determine the GPU used — which causes the build to fail when there is no GPU available. Resetting `-march=` should suffice to revert to building generic code for the time being. See the discussion in: https://github.com/llvm/llvm-project/pull/126143#issuecomment-2816718492	2025-04-20 04:16:19 +00:00
Joseph Huber	db0f754c5a	[OpenMP] Remove 'libomptarget.devicertl.a' fatbinary and use static library (#126143 ) Summary: Currently, we build a single `libomptarget.devicertl.a` which is a fatbinary. It is a host object file that contains the embedded archive files for both the NVIDIA and AMDGPU targets. This was done primarily as a convenience due to naming conflicts. Now that the clang driver for the GPU targets can appropriate link via the per-target runtime-dir, we can just make two separate static libraries and remove the indirection. This patch creates two new static libraries that get installed into ``` /lib/amdgcn-amd-amdhsa/libomp.a /lib/nvptx64-nvidia-cuda/libomp.a ``` for AMDGPU and NVPTX respectively. The link job created by the linker wrapper now simply needs to do `-lomp` and it will search those directories and link those static libraries. This requires far less special handling. This patch is a precursor to changing the build system entirely to be a runtimes based one. Soon this target will be a standard `add_library` and done through the GPU runtime targets. NOTE that this actually does remove an additional optimization step. Previously we merged all of the files into a single bitcode object and forcibly internalized some definitions. This, instead, just treats them like a normal static library. This may possibly affect performance for some files, but I think it's better overall to use static library semantics because it allows us to have an 'include-what-you-use' relationship with the library. Performance testing will be required. If we really need the merged blob then we can simply pack that into a new static library.	2025-04-18 07:43:31 -05:00
Joseph Huber	2f41fa387d	[AMDGPU] Fix code object version not being set to 'none' (#135036 ) Summary: Previously, we removed the special handling for the code object version global. I erroneously thought that this meant we cold get rid of this weird `-Xclang` option. However, this also emits an LLVM IR module flag, which will then cause linking issues.	2025-04-10 11:31:21 -05:00
Sergio Afonso	66fca0674d	[OpenMP] Fix num_iters in __kmpc__loop DeviceRTL functions (#133435 ) This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions: - `__kmpc_distribute_for_static_loop` - `__kmpc_distribute_static_loop` - `__kmpc_for_static_loop` Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received `num_iters` variable, worksharing can produce incorrect results. This impacts flang OpenMP offloading of `do`, `distribute` and `distribute parallel do` constructs. Expecting the application to pass `tripcount - 1` as the argument seems unexpected as well, so rather than updating flang I think it makes more sense to update the runtime.	2025-04-01 10:29:08 +01:00
Joseph Huber	772173f548	[Clang][AMDGPU] Remove special handling for COV4 libraries (#132870 ) Summary: When we were first porting to COV5, this lead to some ABI issues due to a change in how we looked up the work group size. Bitcode libraries relied on the builtins to emit code, but this was changed between versions. This prevented the bitcode libraries, like OpenMP or libc, from being used for both COV4 and COV5. The solution was to have this 'none' functionality which effectively emitted code that branched off of a global to resolve to either version. This isn't a great solution because it forced every TU to have this variable in it. The patch in https://github.com/llvm/llvm-project/pull/131033 removed support for COV4 from OpenMP, which was the only consumer of this functionality. Other users like HIP and OpenCL did not use this because they linked the ROCm Device Library directly which has its own handling (The name was borrowed from it after all). So, now that we don't need to worry about backward compatibility with COV4, we can remove this special handling. Users can still emit COV4 code, this simply removes the special handling used to make the OpenMP device runtime bitcode version agnostic.	2025-03-28 07:35:16 -05:00
macurtis-amd	21a8c63cdc	[offload] Remove bad assert in StaticLoopChunker::Distribute (#132705 ) When building with asserts enabled, this can actually cause strange miscompilations because an incorrect llvm.assume is generated at the point of the assertion.	2025-03-28 04:53:00 -05:00
Joseph Huber	cb493d2bab	[OpenMP] Replace utilities with 'gpuintrin.h' definitions (#131644 ) Summary: Port more instructions. AMD version is at https://gist.github.com/jhuber6/235d7ee95f747c75f9a3cfd8eedac6aa	2025-03-19 10:47:21 -05:00
Jon Chesterfield	deb0f3c09b	[openmp][nfc] Use builtin align in the devicertl (#131918 ) Noticed while extracting the smartstack as a test case	2025-03-18 21:31:49 +00:00
Jon Chesterfield	395bdebebd	Revert "[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905 )" This reverts commit `c02b935a9b`. Failed a check-offload test under CI	2025-03-18 20:43:05 +00:00
Joseph Huber	206f78dfec	[OpenMP] Use 'gpuintrin.h' definitions for simple block identifiers (#131631 ) Summary: This patch ports the runtime to use `gpuintrin.h` instead of calling the builtins for most things. The `lanemask_gt` stuff was left for now with a fallback. AMD version for Ron https://gist.github.com/jhuber6/42014d635b9a8158727640876bf47226.	2025-03-18 15:38:46 -05:00
Jon Chesterfield	c02b935a9b	[openmp][nfc] Refactor shared/lds smartstack for spirv (#131905 ) Spirv doesn't have implicit conversions between address spaces (at least at present, we might need to change that) and address space qualified *this pointers are not handled well by clang. This commit changes the single instance of the smartstack to be explicitly a singleton, for fractionally simpler IR generation (no this pointer) and to sidestep the work in progress spirv64-- openmp target not being able to compile the original version.	2025-03-18 20:33:24 +00:00
Joseph Huber	ed9107f2d7	[OpenMP] Replace use of target address space with <gpuintrin.h> local (#126119 ) Summary: This definition is more portable since it defines the correct value for the target. I got rid of the helper mostly because I think it's easy enough to use now that it's a type and being explicit about what's `undef` or `poison` is good.	2025-02-09 10:25:25 -06:00
Joseph Huber	bb7ab2557c	[OpenMP] Port the OpenMP device runtime to direct C++ compilation (#123673 ) Summary: This removes the use of OpenMP offloading to build the device runtime. The main benefit here is that we no longer need to rely on offloading semantics to build a device only runtime. Things like variants are now no longer needed and can just be simple if-defs. In the future, I will remove most of the special handling here and fold it into calls to the `<gpuintrin.h>` functions instead. Additionally I will rework the compilation to make this a separate runtime. The current plan is to have this, but make including OpenMP and offloading either automatically add it, or print a warning if it's missing. This will allow us to use a normal CMake workflow and delete all the weird 'lets pull the clang binary out of the build' business. ``` -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload -DLLVM_RUNTIME_TARGETS=amdgcn-amd-amdhsa ``` After that, linking the OpenMP device runtime will be `-Xoffload-linker -lomp`. I.e. no more fat binary business. Only look at the most recent commit since this includes the two dependencies (fix to AMDGPUEmitPrintfBinding and the PointerToMember bug).	2025-02-05 08:18:52 -06:00
Christian Clauss	1f56bb3137	[Offload][NFC] Fix typos discovered by codespell (#125119 ) https://github.com/codespell-project/codespell % `codespell --ignore-words-list=archtype,hsa,identty,inout,iself,nd,te,ths,vertexes --write-changes`	2025-01-31 09:35:29 -06:00
Joseph Huber	760a786d15	[Clang] Prevent `mlink-builtin-bitcode` from internalizing the RPC client (#118661 ) Summary: Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA compiliations. This has the problem that it will internalize the RPC client symbol which needs to be visible to the host. To counteract that, I put `retain` on it, but this also prevents optimizations on the global itself, so the passes we have that remove the symbol don't work on OpenMP anymore. This patch does the dumbest solution, adding a special string check for it in clang. Not the best solution, the runner up would be to have a clang attribute for `externally_initialized` because those can't be internalized, but that might have some unfortunate side-effects. Alternatively we could make NVIDIA compilations do LTO all the time, but that would affect some users and it's harder than I thought.	2025-01-27 19:30:59 -06:00
Joseph Huber	f233a54ae8	[OpenMP] Remove usage of pointer-to-member in lookup (#123671 ) Summary: This is buggy and is currently being tracked in https://github.com/llvm/llvm-project/issues/123241. For now, replace it with a macro so that we can use address spaces directly.	2025-01-21 07:50:40 -06:00
Joseph Huber	3274bf6b42	[OpenMP] Make each atomic helper take an atomic scope argument (#122786 ) Summary: Right now we just default to device for each type, and mix an ad-hoc scope with the one used by the compiler's builtins. Unify this can make each version take the scope optionally. For @ronlieb, this will remove the need for `add_system` in the fork as well as the extra `cas` with system scope, just pass `system`.	2025-01-20 21:58:27 -06:00
Joseph Huber	2d9f406943	[OpenMP] Adjust 'printf' handling in the OpenMP runtime (#123670 ) Summary: We used to avoid a lot of this stuff because we didn't properly handle variadics in device code. That's been solved for now, so we can just make an internal printf handler that forwards to the external `vprintf` function. This is either provided by NVIDIA's SDK or by the GPU libc implementation. The main reason for doing this is because it prevents the stupid AMDGPU printf pass from mangling our beautiful printfs!	2025-01-20 21:56:46 -06:00
Joseph Huber	723a3e746a	[OpenMP] Fix mispelled attribute and warning Summary: This is spelled `ompx_aligned_barrier` when used directly, but wasn't included in the list of known assumptions. Fix that so now th test works.	2025-01-20 08:40:19 -06:00
Joseph Huber	58af82b462	[OpenMP] Remove 'omp assumes' scopes now that we have no inline ASM (#123611 ) Summary: We used this globally scoped `ext_no_call_asm` as a sort of hack around the compiler that allowed the attributor to optimize out inline assembly calls to PTX instructions. Quite some time ago I got rid of every inline assembly call and replaced it with a builitin, so this can just be deleted. Furthermore, I use the `[[omp::assume]]` attribute directly for the aligned barrier usage. This prints an unknown assumption warning (even though it isn't) so I'm just silencing that for now until I fix it later. --------- Co-authored-by: Michael Kruse <github@meinersbur.de>	2025-01-20 08:11:06 -06:00
Joseph Huber	1c00d0d776	[OpenMP] Remove hack around missing atomic load (#122781 ) Summary: We used to do a fetch add of zero to approximate a load. This is because the NVPTX backend didn't handle this properly. It's not an issue anymore so simply use the proper atomic builtin.	2025-01-16 15:17:15 -06:00
Joseph Huber	74d5373f49	[OpenMP] Fix missing type getter for SFINAE helper Summary: This didn't get the type, which made using this always return false.	2025-01-10 19:35:41 -06:00
Joseph Huber	f53cb84df6	[OpenMP] Use __builtin_bit_cast instead of UB type punning (#122325 ) Summary: Use a normal bitcast, remove from the shared utils since it's not available in GCC 7.4	2025-01-09 13:59:21 -06:00
Joseph Huber	b57c0bac81	[OpenMP] Update atomic helpers to just use headers (#122185 ) Summary: Previously we had some indirection here, this patch updates these utilities to just be normal template functions. We use SFINAE to manage the special case handling for floats. Also this strips address spaces so it can be used more generally.	2025-01-09 13:57:39 -06:00
Joseph Huber	34f8573a51	[OpenMP] Use generic IR for the OpenMP DeviceRTL (#119091 ) Summary: We previously built this for every single architecture to deal with incompatibility. This patch updates it to use the 'generic' IR that `libc` and other projects use. Who knows if this will have any side-effects, probably worth testing more but it passes the tests I expect to pass on my side.	2024-12-24 18:05:28 -06:00
Joseph Huber	b0fbddde38	[OpenMP] Only put `retain` for NVPTX so it can be optimized out for AMD Summary: This is a hack that only NVPTX needs.	2024-12-17 15:16:51 -06:00
Joseph Huber	f4ee5a673f	[OpenMP] Replace AMDGPU fences with generic scoped fences (#119619 ) Summary: This is simpler and more common. I would've replaced the CUDA uses and made this the same but currently it doesn't codegen these fences fully and just emits a full system wide barrier as a fallback.	2024-12-12 07:54:51 -06:00
hidekisaito	f2bceb2311	[Offload][AMDGPU] accept generic target (#118919 ) Enables generic ISA, e.g., "--offload-arch=gfx11-generic" device code to run on gfx11-generic ISA capable device. Executable may contain one ELF that has specific target ISA and another ELF that has compatible generic ISA. Under that circumstance, this code should say both ELFs are compatible, leaving the rest to PluginManager to handle. Suggestions on how best to address that is welcome.	2024-12-09 19:11:38 -05:00
Michał Górny	69227a11fe	[offload] Support LIBOMPTARGET_DEVICE_ARCHITECTURES={amdgpu\|nvptx} (#119070 ) Add two more special values for LIBOMPTARGET_DEVICE_ARCHITECTURES: `amdgpu` and `nvptx`, to support building for all AMDGPU and NVPTX targets respectively. This can be used in place of `all` when offload is built with one of the GPU plugins only.	2024-12-07 15:37:28 +00:00
Michał Górny	b54ba5361e	[offload] Add gfx1012 (Navi 14) to AMDGPU models list (#118857 ) Fixes #118824	2024-12-06 03:24:55 +00:00
Jan Patrick Lehr	c7babfa6a3	[Offload] Find libc relative to DeviceRTL path (#118497 ) This was discussed as a potential solution in https://github.com/llvm/llvm-project/pull/118173	2024-12-03 16:37:57 +01:00
Joseph Huber	91f5f974cb	[OpenMP] Unconditionally provide an RPC client interface for OpenMP (#117933 ) Summary: This patch adds an RPC interface that lives directly in the OpenMP device runtime. This allows OpenMP to implement custom opcodes. Currently this is only providing the host call interface, which is the raw version of reverse offloading. Previously this lived in `libc/` as an extension which is not the correct place. The interface here uses a weak symbol for the RPC client by the same name that the `libc` interface uses. This means that it will defer to the libc one if both are present so we don't need to set up multiple instances. The presense of this symbol is what controls whether or not we set up the RPC server. Because this is an external symbol it normally won't be optimized out, so there's a special pass in OpenMPOpt that deletes this symbol if it is unused during linking. That means at `O0` the RPC server will always be present now, but will be removed trivially if it's not used at O1 and higher.	2024-12-02 14:31:51 -06:00
Joseph Huber	506ca19dc9	[OpenMP] Remove use of '__AMDGCN_WAVEFRONT_SIZE' (#113156 ) Summary: This is going to be deprecated in https://github.com/llvm/llvm-project/pull/112849. This patch ports it to use the builtin instead. This isn't a compile constant, so it could slightly negatively affect codegen. There really should be an IR pass to turn it into a constant if the function has known attributes. Using the builtin is correct when we just do it for knowing the size like we do here. Obviously guarding w32/w64 code with this check would be broken.	2024-11-25 07:38:28 -06:00
Matt Arsenault	a6fc489bb7	AMDGPU: Add gfx950 subtarget definitions (#116307 ) Mostly a stub, but adds some baseline tests and tests for removed instructions.	2024-11-18 10:41:14 -08:00
Carl Ritson	076aac59ac	[AMDGPU] Add a new target for gfx1153 (#113138 )	2024-10-23 12:56:58 +09:00
Joseph Huber	e8d2057ca4	[OpenMP] Add critical region lock for NVPTX targets (#110148 ) Summary: We define this on AMDGCN but not NVPTX, which leads to some failures dependong on the target.	2024-09-26 11:33:52 -07:00
Joseph Huber	c3ac3fe825	[OpenMP] Fix redefining `stdint.h` types (#108607 ) Summary: We can include `stdint.h` just fine as long as we don't allow it to find system headers, passing `-nostdlibinc` and `-nogpuinc` suppresses these extra paths so we will just use the clang resource headers for `stdint.h` and `stddef.h`.	2024-09-13 13:22:44 -05:00
Johannes Doerfert	08533a3ee8	[Offload][NFC] Reorganize `utils::` and make Device/Host/Shared clearer (#100280 ) We had three `utils::` namespaces, all with different "meaning" (host, device, hsa_utils). We should, when we can, keep "include/Shared" accessible from host and device, thus RefCountTy has been moved to a separate header. `hsa_utils` was introduced to make `utils::` less overloaded. And common functionality was de-duplicated, e.g., `utils::advance` and `utils::advanceVoidPtr` -> `utils:advancePtr`. Type punning now checks for the size of the result to make sure it matches the source type. No functional change was intended.	2024-09-05 13:36:26 -07:00
WÁNG Xuěruì	9adf81182e	[Offload] Fix stray libomptarget message helper calls (#106837 ) In #92581 the `LibomptargetUitls.cmake` helpers have been removed, but only uses of `libomptarget_say` were migrated. Migrate the remaining few warning and error messages so the `check-offload` target would not fail due to missing `libomptarget_warning_say`. While at it, update the `check-offload` unavailability message to say `check-offload` instead of `check-libomptarget`. Fixes #92581	2024-08-31 07:06:41 -05:00
Ethan Luis McDonough	fde2d23ee2	[PGO][OpenMP] Instrumentation for GPU devices (Revision of #76587 ) (#102691 ) This pull request is a revised version of #76587. This pull request fixes some build issues that were present in the previous version of this change. > This pull request is the first part of an ongoing effort to extends PGO instrumentation to GPU device code. This PR makes the following changes: > > - Adds blank registration functions to device RTL > - Gives PGO globals protected visibility when targeting a supported GPU > - Handles any addrspace casts for PGO calls > - Implements PGO global extraction in GPU plugins (currently only dumps info) > > These changes can be tested by supplying `-fprofile-instrument=clang` while targeting a GPU.	2024-08-22 01:10:54 -05:00
Joseph Huber	74d23f15b6	[OpenMP] Implement 'omp_alloc' on the device (#102526 ) Summary: The 'omp_alloc' function should be callable from a target region. This patch implemets it by simply calling `malloc` for every non-default trait value allocator. All the special access modifiers are unimplemented and return null. The null allocator returns null as the spec states it should not be usable from the target.	2024-08-14 13:38:55 -05:00

1 2

70 Commits