Commit Graph

570 Commits

Author SHA1 Message Date
Alex MacLean
7daa65a088 Reland "[NVPTX] Use .common linkage for common globals" (#86824)
Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in
[PTX ISA 11.6.4. Linking Directives: .common]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common)
> Declares identifier to be globally visible but “common”.
>
>Common symbols are similar to globally visible symbols. However
multiple object files may declare the same common symbol and they may
have different types and sizes and references to a symbol get resolved
against a common symbol with the largest size.
>
>Only one object file can initialize a common symbol and that must have
the largest size among all other definitions of that common symbol from
different object files.
>
>.common linking directive can be used only on variables with .global
storage. It cannot be used on function symbols or on symbols with opaque
type.

I've updated the logic and tests to only use `.common` for PTX 5.0 or
greater and verified that the new tests now pass with `ptxas`.
2024-03-29 11:58:41 -07:00
Alex MacLean
888e284903 [NVPTX] Use PTX prmt for llvm.bswap (#85545) 2024-03-19 15:18:53 -07:00
Adrian Kuegel
f0a5e50550 [llvm][NVPTX] Add missing feature guard. 2024-03-19 06:53:14 +00:00
Sterling Augustine
d4a8e979e4 Revert "[NVPTX] Use .common linkage for common globals (#84416)"
This reverts commit 8f0012d3dc.

The common-linkage.ll test fails with ptxas enabled.
2024-03-15 21:27:46 +00:00
Alex MacLean
89b7b3b995 [NVPTX] support dynamic allocas with PTX alloca instruction (#84585)
Add support for dynamically sized alloca instructions with the PTX
alloca instruction introduced in PTX 7.3 
([9.7.15.3. Stack Manipulation Instructions: alloca]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#stack-manipulation-instructions-alloca))
2024-03-15 11:51:46 -07:00
Adrian Kuegel
4372cab914 Reland "[NVPTX] Add support for atomic add for f16 type" (#85197)
atom.add.noftz.f16 is supported since SM 7.0
2024-03-15 08:13:59 +01:00
Alex MacLean
8f0012d3dc [NVPTX] Use .common linkage for common globals (#84416)
Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in [PTX ISA 11.6.4.
Linking Directives:
.common](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common)
> Declares identifier to be globally visible but “common”.
>
>Common symbols are similar to globally visible symbols. However
multiple object files may declare the same common symbol and they may
have different types and sizes and references to a symbol get resolved
against a common symbol with the largest size.
>
>Only one object file can initialize a common symbol and that must have
the largest size among all other definitions of that common symbol from
different object files.
>
>.common linking directive can be used only on variables with .global
storage. It cannot be used on function symbols or on symbols with opaque
type.
2024-03-14 13:30:02 -07:00
Nikita Popov
20b15e645c [Tests] Drop inrange attribute from some tests (NFC)
These don't actually test anything related to inrange, so drop the
attribute.
2024-03-13 11:49:16 +01:00
Danial Klimkin
afd4758703 Revert "[NVPTX] Add support for atomic add for f16 type" (#84918)
Reverts llvm/llvm-project#84295 due to breakages.
2024-03-12 15:01:18 +01:00
Adrian Kuegel
8e0f4b943f [NVPTX] Add support for atomic add for f16 type (#84295)
atom.add.noftz.f16 is supported since SM 7.0
2024-03-12 09:12:44 +01:00
Benjamin Kramer
20895965b2 [NVPTX] Remove sub.s16x2 instruction
According to the PTX ISA this doesn't exist (and ptxas rejects it)

See https://github.com/pytorch/pytorch/issues/118589
2024-03-05 12:43:02 +01:00
Alex MacLean
590c968e79 [NVPTX] fixup support for unaligned parameters and returns (#82562)
Add support for unaligned parameters and return values. These must be
loaded and stored one byte at a time and then bit manipulation is used
to assemble the correct final result.
2024-02-22 17:27:28 -08:00
David Majnemer
cc13f3ba45 Correctly round FP -> BF16 when SDAG expands such nodes (#82399)
We did something pretty naive:
- round FP64 -> BF16 by first rounding to FP32
- skip FP32 -> BF16 rounding entirely
- taking the top 16 bits of a FP32 which will turn some NaNs into
infinities

Let's do this in a more principled way by rounding types with more
precision than FP32 to FP32 using round-inexact-to-odd which will negate
double rounding issues.
2024-02-21 12:37:02 -05:00
Joseph Huber
11fcae69db [LLVM] Add __builtin_readsteadycounter intrinsic and builtin for realtime clocks (#81331)
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter that some targets have which
returns a fixed frequency clock that can be used to determine elapsed
time, this is different compared to the cycle counter which often has
variable frequency.

This patch only adds support for the NVPTX and AMDGPU targets.

This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
2024-02-13 10:06:25 -06:00
Artem Belevich
61a0fc7947 [NVPTX] pass correct GPU arch to ptxas test (#81535) 2024-02-12 13:18:08 -08:00
Artem Belevich
8799d7143f [NVPTX] Fix the error in a pattern match in v4i8 comparisons. (#81308)
The replacement should've had BFE() as the arguments for the comparison,
not the source register.

While at that, tighten the patterns a bit, and expand them to cover
variants with immediate arguments. Also change the default lowering of
bfe() to use unsigned variant, so the value of the upper bits is
predictable.
2024-02-12 12:59:03 -08:00
Joseph Huber
2ac8e6b7f5 [NVPTX] Implement __builtin_readcyclecounter on NVPTX (#81344)
Summary:
This patch simply states that `__builtin_readcyclecounter` is legal on
NVPTX and makes it  return the value from the `clock64` sreg. The timer
intrinsics are marked as having side effects, which is desireable for
timing primitives and required to pattern match the instrinic DAG.
2024-02-12 07:07:48 -06:00
Adrian Kuegel
da9559d69a Do not use PerformEXTRACTCombine for v8i8 types (#81242)
Same as with v4i8 types, we should not be using PerformEXTRACTCombine
for v8i8 types.
2024-02-12 07:31:31 +01:00
Joseph Huber
3c707310a3 [NVPTX] Add clang builtin for __nvvm_reflect intrinsic (#81277)
Summary:
Some recent support made usage of `__nvvm_reflect` more consistent. We
should expose it as a builtin rather than forcing users to externally
define the function.
2024-02-09 14:11:01 -06:00
Joseph Huber
bb180856ec [NVPTX][Fix] Update minimum CPU for NVPTX intrinsics test
Summary:
This test requires at least sm_30 to run, but that is still below the
minimum supported version of sm_52 currently. Just set this to sm_60 so
the tests pass in the future.
2024-02-09 14:05:40 -06:00
Joseph Huber
07dc85ba0c [NVVMReflect] Improve folding inside of the NVVMReflect pass (#81253)
Summary:
The previous patch did very simple folding that only worked for driectly
used branches. This patch improves this by traversing the use-def chain
to sipmlify every constant subexpression until it reaches a terminator
we can delete. The support should work for all expected cases now.
2024-02-09 13:39:03 -06:00
Joseph Huber
ffabcbcf8f [NVVMReflect][Reland] Force dead branch elimination in NVVMReflect (#81189)
Summary:
The `__nvvm_reflect` function is used to guard invalid code that varies
between architectures. One problem with this feature is that if it is
used without optimizations, it will leave invalid code in the module
that will then make it to the backend. The `__nvvm_reflect` pass is
already mandatory, so it should do some trivial branch removal to ensure
that constants are handled correctly. This dead branch elimination only
works in the trivial case of a compare on a branch and does not touch
any conditionals that were not realted to the `__nvvm_reflect` call in
order to preserve `O0` semantics as much as possible. This should allow
the following to work on NVPTX targets

```c
int foo() {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    asm("valid;\n");
}
```

Relanding after fixing a bug.
2024-02-08 20:09:44 -06:00
Joseph Huber
0800a36053 Revert "[NVVMReflect] Force dead branch elimination in NVVMReflect (#81189)"
This reverts commit 9211e67da3.

Summary:
This seemed to crash one one of the CUDA math tests. Revert until it can
be fixed.
2024-02-08 17:32:04 -06:00
Joseph Huber
9211e67da3 [NVVMReflect] Force dead branch elimination in NVVMReflect (#81189)
Summary:
The `__nvvm_reflect` function is used to guard invalid code that varies
between architectures. One problem with this feature is that if it is
used without optimizations, it will leave invalid code in the module
that will then make it to the backend. The `__nvvm_reflect` pass is
already mandatory, so it should do some trivial branch removal to ensure
that constants are handled correctly. This dead branch elimination only
works in the trivial case of a compare on a branch and does not touch
any conditionals that were not realted to the `__nvvm_reflect` call in
order to preserve `O0` semantics as much as possible. This should allow
the following to work on NVPTX targets

```c
int foo() {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    asm("valid;\n");
}
```
2024-02-08 17:16:31 -06:00
Alex MacLean
9affa177b5 [NVPTX] Add support for calling aliases (#81170)
The current implementation of aliases tries to remove all the aliases in
the module to prevent the generic version of `AsmPrinter` from emitting
them incorrectly. Unfortunately, if the aliases are used this will fail.
Instead let's override the function to print aliases directly.

In addition, the declarations of the alias functions must occur before
the uses. To fix this we emit alias declarations as part of
`emitDeclarations` and only emit the `.alias` directives at the end
(where we can assume the aliasee has also already been declared).
2024-02-08 17:14:13 -06:00
Nikita Popov
ff9af4c43a [CodeGen] Convert tests to opaque pointers (NFC) 2024-02-05 14:07:09 +01:00
Alex MacLean
5e3ae4c4af [NVPTX] improve Boolean ISel (#80166)
Add TableGen patterns to convert more instructions to boolean
expressions:

- **mul -> and/or**: i1 multiply instructions currently cannot be
selected causing the compiler to crash. See
https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more
optimizations. `InstCombine` cannot do this as aggressively due to
poison semantics.
2024-01-31 14:37:27 -08:00
Justin Fargnoli
577738a12d Revert "Disable incorrect peephole optimizations" (#79916)
This reverts commit ff77058141.
2024-01-29 16:22:07 -08:00
Justin Fargnoli
ff77058141 Disable incorrect peephole optimizations 2024-01-29 15:54:40 -08:00
Joseph Huber
e633807a1f [NVPTX] Add builtin support for 'globaltimer' (#79765)
Summary:
This patch adds support for `globaltimer` to match `clock` and
`clock64`. See the PTX ISA reference for details. This patch does not
implement the `hi` or `lo` variants for brevity as they can be obtained
from this with the cost of an additional register.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-globaltimer-globaltimer-lo-globaltimer-hi
2024-01-29 14:11:54 -06:00
Joseph Huber
ea8014046c [NVPTX] Add builtin for 'exit' handling (#79777)
Summary:
The PTX ISA has always supported the 'exit' instruction to terminate
individual threads. This patch adds a builtin to handle it. See the PTX
documentation for further details.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#control-flow-instructions-exit
2024-01-29 14:09:34 -06:00
Joseph Huber
5f12cc912a [NVPTX] Add builtin support for 'nanosleep' PTX instrunction (#79888)
Summary:
This patch adds a builtin for the `nanosleep` PTX function. It takes
either an immediate or a register and sleeps for [0, 2t] nanoseconds
given t. More information at the documentation:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-nanosleep
2024-01-29 14:07:58 -06:00
Joseph Huber
d492faa7aa [NVPTX] Add 'activemask' builtin and intrinsic support (#79768)
Summary:
This patch adds support for getting the 'activemask' instruction's value
without needing to use inline assembly. See the relevant PTX reference
for details.


https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-activemask
2024-01-29 14:07:30 -06:00
Alex MacLean
1d5820aafe [NVPTX] improve identifier renaming for PTX (#79459)
Update `NVPTXAssignValidGlobalNames` to convert all characters which are
illegal in PTX identifiers to `_$_`. ([PTX ISA: 4.4
Identifiers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers)).
2024-01-26 13:49:00 -08:00
Krzysztof Drewniak
63fe80fb18 [SeperateConstOffsetFromGEP] Handle or disjoint flags (#76997)
This commit extends separate-const-offset-from-gep to look at the
newly-added `disjoint` flag on `or` instructions so as to preserve
additional opportunities for optimization.

The tests were pre-committed in #76972.
2024-01-26 09:56:06 -06:00
Alex MacLean
3b8539c9dc [NVPTX] use incomplete aggregate initializers (#79062)
The PTX ISA specifies that initializers may be incomplete ([5.4.4.
Initializers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers))
> As in C, array initializers may be incomplete, i.e., the number of
initializer elements may be less than the extent of the corresponding
array dimension, with remaining array locations initialized to the
default value for the specified array type.

Emitting initializers in this form is preferable because it reduces the
size of the PTX, in some cases significantly, and can improve compile
time of ptxas as a result.
2024-01-24 09:24:28 -08:00
Durgadoss R
43531e7196 [LLVM][NVPTX] Add cp.async.bulk.commit/wait intrinsics (#78698)
This patch adds NVVM intrinsics and NVPTX codegen for the bulk variants
of the async-copy commit/wait instructions.
lit tests are added to verify the generated PTX.

PTX Doc link:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-commit-group

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-01-19 10:42:33 -08:00
Alex MacLean
430a40d12e [NVPTX] extend type support for nvvm.{min,max,mulhi,sad} (#78385)
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for for
`nvvm.{min,max,mulhi,sad}`

- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select`
instructions but it is still nice to support the 16 bit variants just in
case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad` added both the 16 and 64 bit variants, also marked this
instruction as speculateble. These directly correspond to the PTX
`sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi` added the 16 bit variants. These directly correspond to
the PTX `mul.hi.{s,u}16` instructions.
2024-01-17 16:18:39 -08:00
Alex MacLean
da7462a6ae [NVPTX] Add tex.grad.cube{array} intrinsics (#77693)
Extend IR support for PTX `tex` instruction described in [PTX ISA.
9.7.9.3. Texture Instructions:
tex](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-instructions-tex).
Add support for unified-move versions of `tex.grad.cube{array}` variants
added in PTX ISA 4.3.
2024-01-17 10:41:11 -08:00
mmoadeli
aa23e493f2 [NVPTX] Fix generating permute bytes from register pair when the initial values are undefined (#74437)
When generating the permute bytes for the prmt instruction, the
existence of an undefined initial value initialises the int32 that holds
the mask with all 1's (0xFFFFFFFF). That initialization subsequently
leads to complications during the subsequent OR operation, leading to
inaccuracies in populating mask values for the following bytes.
Consequently, the final value persists as a constant -1, irrespective of
the actual mask values that succeed the initial set value.
2024-01-16 11:05:41 -08:00
Durgadoss R
8d817f6479 [LLVM][NVPTX]: Add aligned versions of cluster barriers (#77940) 2024-01-13 10:41:19 +01:00
Durgadoss R
340cc1702e [LLVM][NVPTX]: Add intrinsic for setmaxnreg (#77289)
This patch adds an intrinsic for setmaxnreg PTX instruction.
* PTX Doc link for this instruction:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg

* The i32 argument, an immediate value, specifies the actual
  absolute register count for the instruction.
* The `setmaxnreg` instruction is available in SM90a.
  So, this patch adds 'hasSM90a' predicate to use in
  the NVPTX backend.
* lit tests are added to verify the lowering of the intrinsic.
* Verifier logic (and tests) are added to test the register
  count range and divisibility-by-8 requirements.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-01-09 12:04:13 -08:00
James Y Knight
b856e77b2d Set MaxAtomicSizeInBitsSupported for remaining targets. (#75703)
Targets affected:

- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).

Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.

This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.

The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
2024-01-08 22:34:28 -05:00
Yingwei Zheng
1228becf7d [FuncAttrs] Deduce noundef attributes for return values (#76553)
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any 
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding 
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6

Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|

The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
2023-12-31 20:44:48 +08:00
Youngsuk Kim
f9304974cc [llvm][NVPTX] Inform that 'DYNAMIC_STACKALLOC' is unsupported (#74684)
Catch unsupported path early up, and emit error with information.

Motivated by the following threads:
* https://discourse.llvm.org/t/nvptx-problems-with-dynamic-alloca/70745
* #64017
2023-12-14 22:06:22 -05:00
Benjamin Kramer
9458bae553 [NVPTX] Custom lower integer<->bf16 conversions for sm_80 (#74827)
sm_80 only has f32->bf16 conversions, the remaining integer conversions
arrived with sm_90. Use a two-step conversion for sm_80.

There doesn't seem to be a way to express this promotion directly within
the legalization framework, so fallback on Custom lowering.
2023-12-11 21:06:46 +01:00
Benjamin Kramer
06ebe3b237 [NVPTX] Fix a typo that makes the output invalid PTX
It's surprisingly tricky to trigger this as it's only used by abs/neg
which expand into and/xor in the integer domain.
2023-12-08 14:22:07 +01:00
Artem Belevich
a2d3bb1fa9 Revert "[NVPTX] Lower 16xi8 and 8xi8 stores efficiently (#73646)" (#74518)
This reverts commit 173fcf7da5.

We need to constrain the optimization to properly aligned loads/stores
only.
https://github.com/llvm/llvm-project/pull/73646#issuecomment-1841454559
2023-12-06 10:48:43 -08:00
Uday Bondhugula
173fcf7da5 [NVPTX] Lower 16xi8 and 8xi8 stores efficiently (#73646)
Lower 16xi8 vector stores in NVPTX ISel efficiently using
st.v4.b32 instead of multiple st.v4.u8 along the lines of vector loads
and 8xf16. Similarly, 8xi8 using st.v2.u32.
2023-12-01 11:00:01 +05:30
Uday Bondhugula
b5d132010d [NFC][NVPTX] Add a simpler test case for 0b80288e9e (#73379)
While 0b80288e9e allowed more efficient lowering for 16xi8 loads, its
test case was closer to an "integration" one. Add a much simpler unit
test case that exercises it.
2023-11-28 19:28:51 +05:30