Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in [PTX ISA 11.6.4.
Linking Directives: .common](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common):
> Declares identifier to be globally visible but “common”.
>
> Common symbols are similar to globally visible symbols. However
> multiple object files may declare the same common symbol and they may
> have different types and sizes and references to a symbol get resolved
> against a common symbol with the largest size.
>
> Only one object file can initialize a common symbol and that must have
> the largest size among all other definitions of that common symbol from
> different object files.
>
> .common linking directive can be used only on variables with .global
> storage. It cannot be used on function symbols or on symbols with opaque
> type.
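For illustration, these rules mirror how tentative definitions behave in C under `-fcommon`-style linking (a sketch, not taken from the patch itself):

```c
/* a.c */
int buf[16];        /* tentative definition: a 64-byte common symbol     */

/* b.c */
int buf[32] = {1};  /* the one initialized definition, which must also be
                     * the largest; references from a.c resolve against
                     * this 128-byte symbol, as in the quoted rules      */
```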
I've updated the logic and tests to only use `.common` for PTX 5.0 or
greater and verified that the new tests now pass with `ptxas`.
Add support for unaligned parameters and return values. These must be
loaded and stored one byte at a time and then bit manipulation is used
to assemble the correct final result.
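A rough C equivalent of the byte-wise assembly (a sketch of the idea, not the actual lowering code):

```c
#include <stdint.h>

/* Load a 32-bit value from a pointer with no alignment guarantee by
 * reading single bytes and OR-ing them into place (little-endian). */
uint32_t load_unaligned_u32(const uint8_t *p) {
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* The store side mirrors this: shift the value down and write one
 * byte at a time. */
void store_unaligned_u32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}
```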
We did something pretty naive:
- round FP64 -> BF16 by first rounding to FP32
- skip FP32 -> BF16 rounding entirely
- take the top 16 bits of an FP32, which will turn some NaNs into
infinities
Let's do this in a more principled way by rounding types with more
precision than FP32 to FP32 using round-inexact-to-odd, which avoids
double-rounding issues.
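A minimal C sketch of the two-step scheme (assumes normal, in-range values; NaN and overflow handling omitted; not the patch's actual code):

```c
#include <stdint.h>
#include <string.h>

/* Step 1: FP64 -> FP32 with round-inexact-to-odd: truncate the extra
 * mantissa bits and, if any were nonzero, force the result's LSB to 1.
 * The sticky LSB prevents the later FP32 -> BF16 rounding from
 * double-rounding. */
static float f64_to_f32_round_odd(double d) {
    uint64_t db; memcpy(&db, &d, sizeof db);
    uint32_t sign = (uint32_t)(db >> 32) & 0x80000000u;
    uint32_t exp  = (uint32_t)((db >> 52) & 0x7FF) - 1023u + 127u; /* rebias */
    uint32_t man  = (uint32_t)(db >> 29) & 0x7FFFFFu; /* top 23 mantissa bits */
    uint32_t fb   = sign | (exp << 23) | man;
    if (db & 0x1FFFFFFFu)      /* any discarded bit set -> inexact */
        fb |= 1;               /* round to odd */
    float f; memcpy(&f, &fb, sizeof f);
    return f;
}

/* Step 2: FP32 -> BF16 with round-to-nearest-even on the top 16 bits. */
static uint16_t f32_to_bf16_rne(float f) {
    uint32_t fb; memcpy(&fb, &f, sizeof fb);
    fb += 0x7FFFu + ((fb >> 16) & 1);  /* RNE on the discarded halfword */
    return (uint16_t)(fb >> 16);
}
```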
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter, available on some targets,
that returns a fixed-frequency clock usable for measuring elapsed time;
this differs from the cycle counter, which often runs at a variable
frequency.
This patch only adds support for the NVPTX and AMDGPU targets.
This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
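A sketch of the intended use (the builtin name `__builtin_readsteadycounter` is assumed here for illustration):

```c
extern void do_work(void);

/* Time a region in fixed-frequency ticks. Because the steady counter's
 * rate does not change with frequency scaling, tick deltas convert
 * directly to elapsed wall-clock time. */
unsigned long long time_work_ticks(void) {
    unsigned long long begin = __builtin_readsteadycounter();
    do_work();
    unsigned long long end = __builtin_readsteadycounter();
    return end - begin;
}
```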
The replacement should have had BFE() as the argument of the comparison,
not the source register.
While at it, tighten the patterns a bit and expand them to cover
variants with immediate arguments. Also change the default lowering of
bfe() to use the unsigned variant, so the value of the upper bits is
predictable.
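For reference, the unsigned bit-field extract that lowering now defaults to behaves like this C sketch (illustrative only; PTX clamping of out-of-range operands omitted):

```c
#include <stdint.h>

/* Extract `len` bits of `x` starting at bit `pos`. The unsigned form
 * zero-fills the upper bits, so their value is predictable; the signed
 * form would replicate the extracted field's top bit instead. */
uint32_t bfe_u32(uint32_t x, uint32_t pos, uint32_t len) {
    if (len == 0) return 0;
    uint32_t mask = len >= 32 ? 0xFFFFFFFFu : ((1u << len) - 1);
    return (x >> pos) & mask;
}
```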
Summary:
This patch simply states that `__builtin_readcyclecounter` is legal on
NVPTX and makes it return the value from the `clock64` sreg. The timer
intrinsics are marked as having side effects, which is desirable for
timing primitives and required to pattern-match the intrinsic DAG.
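For example (a sketch; `%clock64` is the PTX special register name):

```c
/* On NVPTX this now lowers to a read of the %clock64 special register. */
unsigned long long read_cycles(void) {
    return __builtin_readcyclecounter();
}
```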
Summary:
Some recent support made usage of `__nvvm_reflect` more consistent. We
should expose it as a builtin rather than forcing users to externally
define the function.
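A before/after sketch (illustrative):

```c
/* Before: users had to declare the function by hand to use it. */
extern int __nvvm_reflect(const char *);

int has_sm70(void) {
    /* After this patch the declaration above is unnecessary: Clang
     * exposes __nvvm_reflect as a builtin. */
    return __nvvm_reflect("__CUDA_ARCH") >= 700;
}
```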
Summary:
This test requires at least sm_30 to run, but that is still below the
current minimum supported version of sm_52. Just set this to sm_60 so
the tests pass in the future.
Summary:
The previous patch did very simple folding that only worked for directly
used branches. This patch improves this by traversing the use-def chain
to simplify every constant subexpression until it reaches a terminator
we can delete. The support should work for all expected cases now.
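For instance, a case like the following (illustrative) is now folded even though the reflect result reaches the branch only indirectly:

```c
extern int __nvvm_reflect(const char *);  /* provided by the NVPTX toolchain */

void check_arch(void) {
    int arch  = __nvvm_reflect("__CUDA_ARCH");
    int major = arch / 100;   /* constant subexpression on the use-def chain */
    if (major >= 7)           /* now simplified through to the terminator    */
        asm("volta_only;");
}
```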
Summary:
The `__nvvm_reflect` function is used to guard invalid code that varies
between architectures. One problem with this feature is that if it is
used without optimizations, it will leave invalid code in the module
that will then make it to the backend. The `__nvvm_reflect` pass is
already mandatory, so it should do some trivial branch removal to ensure
that constants are handled correctly. This dead branch elimination only
works in the trivial case of a compare on a branch and does not touch
any conditionals that were not related to the `__nvvm_reflect` call in
order to preserve `O0` semantics as much as possible. This should allow
the following to work on NVPTX targets
```c
void foo() {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    asm("valid;\n");
}
```
Relanding after fixing a bug.
The current implementation of aliases tries to remove all the aliases in
the module to prevent the generic version of `AsmPrinter` from emitting
them incorrectly. Unfortunately, if the aliases are used this will fail.
Instead let's override the function to print aliases directly.
In addition, the declarations of the alias functions must occur before
the uses. To fix this we emit alias declarations as part of
`emitDeclarations` and only emit the `.alias` directives at the end
(where we can assume the aliasee has also already been declared).
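As an illustration, source like the following (hypothetical names) produces an alias whose declaration must precede its uses in the emitted PTX:

```c
int impl(int x) { return x * 2; }

/* `fancy` is emitted as a .alias of `impl`: its declaration now goes
 * out with the other declarations, before any call sites, while the
 * .alias directive itself is emitted at the end of the module. */
int fancy(int x) __attribute__((alias("impl")));
```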
Add TableGen patterns to convert more instructions to boolean
expressions:
- **mul -> and/or**: i1 multiply instructions currently cannot be
selected, causing the compiler to crash. See
https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more
optimizations. `InstCombine` cannot do this as aggressively due to
poison semantics. The 1-bit equivalences are sketched below.
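A C sketch of those equivalences (illustrative only):

```c
#include <stdbool.h>

/* For 1-bit values, multiplication is just AND. */
bool mul_i1(bool a, bool b) { return a & b; }

/* An i1 select `s ? a : b` can be rewritten with AND/OR. */
bool select_i1(bool s, bool a, bool b) { return (s & a) | (!s & b); }
```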
This commit extends separate-const-offset-from-gep to look at the
newly-added `disjoint` flag on `or` instructions so as to preserve
additional opportunities for optimization.
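The key fact (a sketch): when two values share no set bits, OR and addition coincide, so a `disjoint` OR can still yield a separable constant offset:

```c
#include <stdint.h>
#include <assert.h>

/* If (base & 4) == 0 -- e.g. base is 8-byte aligned -- then
 * (base | 4) == base + 4, so the 4 can be split off as a constant
 * GEP offset even though the IR used an `or`. */
uint64_t index_with_or(uint64_t base) {
    assert((base & 7) == 0);
    return base | 4;   /* equivalent to base + 4 here */
}
```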
The tests were pre-committed in #76972.
The PTX ISA specifies that initializers may be incomplete ([5.4.4.
Initializers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers))
> As in C, array initializers may be incomplete, i.e., the number of
> initializer elements may be less than the extent of the corresponding
> array dimension, with remaining array locations initialized to the
> default value for the specified array type.
Emitting initializers in this form is preferable because it reduces the
size of the PTX, in some cases significantly, and can improve compile
time of ptxas as a result.
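For example (a sketch; the variable is made up), a mostly-zero global no longer needs every element spelled out:

```c
/* Only the first two elements are explicit; the remaining 254 default
 * to zero, so the emitted PTX initializer can stop after {1, 2}
 * instead of listing all 256 values. */
int table[256] = {1, 2};
```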
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for
`nvvm.{min,max,mulhi,sad}`:
- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select`
instructions, but it is still nice to support the 16-bit variants in
case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad`: Added both the 16- and 64-bit variants and marked the
intrinsic as speculatable. These directly correspond to the PTX
`sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi`: Added the 16-bit variants. These directly correspond to
the PTX `mul.hi.{s,u}16` instructions. Both operations are sketched
below.
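C sketches of the two operations (illustrative only):

```c
#include <stdint.h>

/* sad(a, b, c) = |a - b| + c, matching PTX sad.u16. */
uint16_t sad_u16(uint16_t a, uint16_t b, uint16_t c) {
    return (uint16_t)((a > b ? a - b : b - a) + c);
}

/* mulhi(a, b) = high 16 bits of the full 32-bit product,
 * matching PTX mul.hi.u16. */
uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}
```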
When generating the permute-byte selectors for the prmt instruction, an
undefined initial value initializes the i32 that holds the mask to all
ones (0xFFFFFFFF). Because each subsequent selector byte is ORed into
that value, and OR can never clear bits, the final mask stays a constant
-1 regardless of the selector values that follow.
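Schematically (a simplified sketch of the bug, not the backend's actual code):

```c
#include <stdint.h>

uint32_t build_prmt_mask(const uint8_t sel[4]) {
    uint32_t mask = 0xFFFFFFFFu;  /* BUG: undef seeded as all-ones      */
    /* uint32_t mask = 0; */      /* fix: start from a clean value      */
    for (int i = 0; i < 4; i++)
        mask |= (uint32_t)sel[i] << (8 * i); /* OR can never clear bits */
    return mask;                  /* stays 0xFFFFFFFF with the bug      */
}
```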
This patch adds an intrinsic for setmaxnreg PTX instruction.
* PTX Doc link for this instruction:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg
* The i32 argument, an immediate value, specifies the actual
absolute register count for the instruction.
* The `setmaxnreg` instruction is available only on sm_90a, so this
patch adds a `hasSM90a` predicate to the NVPTX backend.
* lit tests are added to verify the lowering of the intrinsic.
* Verifier logic (and tests) is added to check the register-count
range and divisibility-by-8 requirements, sketched below.
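A sketch of the verifier's checks (the 24-256 range follows the PTX ISA description of setmaxnreg; treat it as an assumption here):

```c
#include <stdbool.h>

/* The immediate register count must be a multiple of 8 and fall
 * within [24, 256] (range assumed from the PTX ISA). */
bool is_valid_setmaxnreg_count(unsigned n) {
    return n >= 24 && n <= 256 && n % 8 == 0;
}
```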
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Targets affected:
- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).
Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.
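For example (a sketch; the libcall signature in the comment is simplified):

```c
#include <stdatomic.h>

/* On NVPTX the maximum inlined atomic width is now 64 bits, so a
 * 128-bit atomic load is expanded by AtomicExpandPass into a call to
 * the __atomic_load_16 runtime function rather than inline code. */
_Atomic __int128 big;

__int128 load_big(void) {
    return atomic_load(&big);   /* -> __atomic_load_16(&big, SEQ_CST) */
}
```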
The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6
Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|
The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
sm_80 only has f32->bf16 conversions; the remaining integer conversions
arrived with sm_90. Use a two-step conversion for sm_80.
There doesn't seem to be a way to express this promotion directly within
the legalization framework, so fall back on Custom lowering.
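The two-step scheme, sketched in C (illustrative; not the backend's actual code):

```c
#include <stdint.h>
#include <string.h>

/* sm_80 lacks a direct i32 -> bf16 conversion, so go through f32
 * first, then use the f32 -> bf16 rounding that sm_80 does have. */
uint16_t i32_to_bf16_sm80(int32_t v) {
    float f = (float)v;                  /* step 1: i32 -> f32        */
    uint32_t fb; memcpy(&fb, &f, sizeof fb);
    fb += 0x7FFFu + ((fb >> 16) & 1);    /* step 2: f32 -> bf16 (RNE) */
    return (uint16_t)(fb >> 16);
}
```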
Lower 16xi8 vector stores efficiently in NVPTX ISel using a single
st.v4.b32 instead of multiple st.v4.u8, along the same lines as vector
loads and 8xf16. Similarly, lower 8xi8 stores using st.v2.u32.
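The repacking, sketched in C (illustrative only):

```c
#include <stdint.h>
#include <string.h>

/* Sixteen i8 lanes are reinterpreted as four 32-bit words so the whole
 * vector can be written with one st.v4.b32 instead of several
 * byte-element stores. */
void store_16xi8(uint8_t *dst, const uint8_t v[16]) {
    uint32_t w[4];
    memcpy(w, v, sizeof w);   /* four b32 words hold all 16 bytes      */
    memcpy(dst, w, sizeof w); /* conceptually: st.v4.b32 {w0,w1,w2,w3} */
}
```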
While 0b80288e9e allowed more efficient lowering for 16xi8 loads, its
test case was closer to an "integration" one. Add a much simpler unit
test case that exercises it.