Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in [PTX ISA 11.6.4.
Linking Directives: .common](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common):
> Declares identifier to be globally visible but “common”.
>
> Common symbols are similar to globally visible symbols. However
> multiple object files may declare the same common symbol and they may
> have different types and sizes and references to a symbol get resolved
> against a common symbol with the largest size.
>
> Only one object file can initialize a common symbol and that must have
> the largest size among all other definitions of that common symbol from
> different object files.
>
> .common linking directive can be used only on variables with .global
> storage. It cannot be used on function symbols or on symbols with opaque
> type.
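For illustration, these rules mirror how tentative definitions behave in C under `-fcommon`-style linking (a sketch, not taken from the patch itself):

```c
/* a.c */
int buf[16];        /* tentative definition: a 64-byte common symbol     */

/* b.c */
int buf[32] = {1};  /* the one initialized definition, which must also be
                     * the largest; references from a.c resolve against
                     * this 128-byte symbol, as in the quoted rules      */
```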
I've updated the logic and tests to only use `.common` for PTX 5.0 or
greater and verified that the new tests now pass with `ptxas`.
Add support for unaligned parameters and return values. These must be
loaded and stored one byte at a time and then bit manipulation is used
to assemble the correct final result.
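A rough C equivalent of the byte-wise assembly (a sketch of the idea, not the actual lowering code):

```c
#include <stdint.h>

/* Load a 32-bit value from a pointer with no alignment guarantee by
 * reading single bytes and OR-ing them into place (little-endian). */
uint32_t load_unaligned_u32(const uint8_t *p) {
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* The store side mirrors this: shift the value down and write one
 * byte at a time. */
void store_unaligned_u32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}
```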
We did something pretty naive:
- round FP64 -> BF16 by first rounding to FP32
- skip FP32 -> BF16 rounding entirely
- take the top 16 bits of an FP32, which will turn some NaNs into
infinities
Let's do this in a more principled way by rounding types with more
precision than FP32 to FP32 using round-inexact-to-odd, which avoids
double-rounding issues.
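A minimal C sketch of the two-step scheme (assumes normal, in-range values; NaN and overflow handling omitted; not the patch's actual code):

```c
#include <stdint.h>
#include <string.h>

/* Step 1: FP64 -> FP32 with round-inexact-to-odd: truncate the extra
 * mantissa bits and, if any were nonzero, force the result's LSB to 1.
 * The sticky LSB prevents the later FP32 -> BF16 rounding from
 * double-rounding. */
static float f64_to_f32_round_odd(double d) {
    uint64_t db; memcpy(&db, &d, sizeof db);
    uint32_t sign = (uint32_t)(db >> 32) & 0x80000000u;
    uint32_t exp  = (uint32_t)((db >> 52) & 0x7FF) - 1023u + 127u; /* rebias */
    uint32_t man  = (uint32_t)(db >> 29) & 0x7FFFFFu; /* top 23 mantissa bits */
    uint32_t fb   = sign | (exp << 23) | man;
    if (db & 0x1FFFFFFFu)      /* any discarded bit set -> inexact */
        fb |= 1;               /* round to odd */
    float f; memcpy(&f, &fb, sizeof f);
    return f;
}

/* Step 2: FP32 -> BF16 with round-to-nearest-even on the top 16 bits. */
static uint16_t f32_to_bf16_rne(float f) {
    uint32_t fb; memcpy(&fb, &f, sizeof fb);
    fb += 0x7FFFu + ((fb >> 16) & 1);  /* RNE on the discarded halfword */
    return (uint16_t)(fb >> 16);
}
```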
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter, available on some targets,
that returns a fixed-frequency clock usable for measuring elapsed time;
this differs from the cycle counter, which often runs at a variable
frequency.
This patch only adds support for the NVPTX and AMDGPU targets.
This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
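A sketch of the intended use (the builtin name `__builtin_readsteadycounter` is assumed here for illustration):

```c
extern void do_work(void);

/* Time a region in fixed-frequency ticks. Because the steady counter's
 * rate does not change with frequency scaling, tick deltas convert
 * directly to elapsed wall-clock time. */
unsigned long long time_work_ticks(void) {
    unsigned long long begin = __builtin_readsteadycounter();
    do_work();
    unsigned long long end = __builtin_readsteadycounter();
    return end - begin;
}
```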
The replacement should have had BFE() as the argument of the comparison,
not the source register.
While at it, tighten the patterns a bit and expand them to cover
variants with immediate arguments. Also change the default lowering of
bfe() to use the unsigned variant, so the value of the upper bits is
predictable.
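For reference, the unsigned bit-field extract that lowering now defaults to behaves like this C sketch (illustrative only; PTX clamping of out-of-range operands omitted):

```c
#include <stdint.h>

/* Extract `len` bits of `x` starting at bit `pos`. The unsigned form
 * zero-fills the upper bits, so their value is predictable; the signed
 * form would replicate the extracted field's top bit instead. */
uint32_t bfe_u32(uint32_t x, uint32_t pos, uint32_t len) {
    if (len == 0) return 0;
    uint32_t mask = len >= 32 ? 0xFFFFFFFFu : ((1u << len) - 1);
    return (x >> pos) & mask;
}
```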
Summary:
This patch simply states that `__builtin_readcyclecounter` is legal on
NVPTX and makes it return the value from the `clock64` sreg. The timer
intrinsics are marked as having side effects, which is desirable for
timing primitives and required to pattern-match the intrinsic DAG.
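For example (a sketch; `%clock64` is the PTX special register name):

```c
/* On NVPTX this now lowers to a read of the %clock64 special register. */
unsigned long long read_cycles(void) {
    return __builtin_readcyclecounter();
}
```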
Summary:
Some recent support made usage of `__nvvm_reflect` more consistent. We
should expose it as a builtin rather than forcing users to externally
define the function.
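A before/after sketch (illustrative):

```c
/* Before: users had to declare the function by hand to use it. */
extern int __nvvm_reflect(const char *);

int has_sm70(void) {
    /* After this patch the declaration above is unnecessary: Clang
     * exposes __nvvm_reflect as a builtin. */
    return __nvvm_reflect("__CUDA_ARCH") >= 700;
}
```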
Summary:
This test requires at least sm_30 to run, but that is still below the
current minimum supported version of sm_52. Just set this to sm_60 so
the tests pass in the future.
Summary:
The previous patch did very simple folding that only worked for directly
used branches. This patch improves this by traversing the use-def chain
to simplify every constant subexpression until it reaches a terminator
we can delete. The support should work for all expected cases now.
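For instance, a case like the following (illustrative) is now folded even though the reflect result reaches the branch only indirectly:

```c
extern int __nvvm_reflect(const char *);  /* provided by the NVPTX toolchain */

void check_arch(void) {
    int arch  = __nvvm_reflect("__CUDA_ARCH");
    int major = arch / 100;   /* constant subexpression on the use-def chain */
    if (major >= 7)           /* now simplified through to the terminator    */
        asm("volta_only;");
}
```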
Summary:
The `__nvvm_reflect` function is used to guard invalid code that varies
between architectures. One problem with this feature is that if it is
used without optimizations, it will leave invalid code in the module
that will then make it to the backend. The `__nvvm_reflect` pass is
already mandatory, so it should do some trivial branch removal to ensure
that constants are handled correctly. This dead branch elimination only
works in the trivial case of a compare on a branch and does not touch
any conditionals that were not related to the `__nvvm_reflect` call in
order to preserve `O0` semantics as much as possible. This should allow
the following to work on NVPTX targets
```c
void foo() {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    asm("valid;\n");
}
```
Relanding after fixing a bug.
The current implementation of aliases tries to remove all the aliases in
the module to prevent the generic version of `AsmPrinter` from emitting
them incorrectly. Unfortunately, if the aliases are used this will fail.
Instead let's override the function to print aliases directly.
In addition, the declarations of the alias functions must occur before
the uses. To fix this we emit alias declarations as part of
`emitDeclarations` and only emit the `.alias` directives at the end
(where we can assume the aliasee has also already been declared).
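As an illustration, source like the following (hypothetical names) produces an alias whose declaration must precede its uses in the emitted PTX:

```c
int impl(int x) { return x * 2; }

/* `fancy` is emitted as a .alias of `impl`: its declaration now goes
 * out with the other declarations, before any call sites, while the
 * .alias directive itself is emitted at the end of the module. */
int fancy(int x) __attribute__((alias("impl")));
```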
Add TableGen patterns to convert more instructions to boolean
expressions:
- **mul -> and/or**: i1 multiply instructions currently cannot be
selected, causing the compiler to crash. See
https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more
optimizations. `InstCombine` cannot do this as aggressively due to
poison semantics. The 1-bit equivalences are sketched below.
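A C sketch of those equivalences (illustrative only):

```c
#include <stdbool.h>

/* For 1-bit values, multiplication is just AND. */
bool mul_i1(bool a, bool b) { return a & b; }

/* An i1 select `s ? a : b` can be rewritten with AND/OR. */
bool select_i1(bool s, bool a, bool b) { return (s & a) | (!s & b); }
```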
This commit extends separate-const-offset-from-gep to look at the
newly-added `disjoint` flag on `or` instructions so as to preserve
additional opportunities for optimization.
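The key fact (a sketch): when two values share no set bits, OR and addition coincide, so a `disjoint` OR can still yield a separable constant offset:

```c
#include <stdint.h>
#include <assert.h>

/* If (base & 4) == 0 -- e.g. base is 8-byte aligned -- then
 * (base | 4) == base + 4, so the 4 can be split off as a constant
 * GEP offset even though the IR used an `or`. */
uint64_t index_with_or(uint64_t base) {
    assert((base & 7) == 0);
    return base | 4;   /* equivalent to base + 4 here */
}
```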
The tests were pre-committed in #76972.
The PTX ISA specifies that initializers may be incomplete ([5.4.4.
Initializers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers))
> As in C, array initializers may be incomplete, i.e., the number of
> initializer elements may be less than the extent of the corresponding
> array dimension, with remaining array locations initialized to the
> default value for the specified array type.
Emitting initializers in this form is preferable because it reduces the
size of the PTX, in some cases significantly, and can improve compile
time of ptxas as a result.
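For example (a sketch; the variable is made up), a mostly-zero global no longer needs every element spelled out:

```c
/* Only the first two elements are explicit; the remaining 254 default
 * to zero, so the emitted PTX initializer can stop after {1, 2}
 * instead of listing all 256 values. */
int table[256] = {1, 2};
```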
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for
`nvvm.{min,max,mulhi,sad}`:
- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select`
instructions, but it is still nice to support the 16-bit variants in
case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad`: Added both the 16- and 64-bit variants and marked the
intrinsic as speculatable. These directly correspond to the PTX
`sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi`: Added the 16-bit variants. These directly correspond to
the PTX `mul.hi.{s,u}16` instructions. Both operations are sketched
below.
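C sketches of the two operations (illustrative only):

```c
#include <stdint.h>

/* sad(a, b, c) = |a - b| + c, matching PTX sad.u16. */
uint16_t sad_u16(uint16_t a, uint16_t b, uint16_t c) {
    return (uint16_t)((a > b ? a - b : b - a) + c);
}

/* mulhi(a, b) = high 16 bits of the full 32-bit product,
 * matching PTX mul.hi.u16. */
uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}
```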
When generating the permute-byte selectors for the prmt instruction, an
undefined initial value initializes the i32 that holds the mask to all
ones (0xFFFFFFFF). Because each subsequent selector byte is ORed into
that value, and OR can never clear bits, the final mask stays a constant
-1 regardless of the selector values that follow.
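Schematically (a simplified sketch of the bug, not the backend's actual code):

```c
#include <stdint.h>

uint32_t build_prmt_mask(const uint8_t sel[4]) {
    uint32_t mask = 0xFFFFFFFFu;  /* BUG: undef seeded as all-ones      */
    /* uint32_t mask = 0; */      /* fix: start from a clean value      */
    for (int i = 0; i < 4; i++)
        mask |= (uint32_t)sel[i] << (8 * i); /* OR can never clear bits */
    return mask;                  /* stays 0xFFFFFFFF with the bug      */
}
```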
This patch adds an intrinsic for setmaxnreg PTX instruction.
* PTX Doc link for this instruction:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg
* The i32 argument, an immediate value, specifies the actual
absolute register count for the instruction.
* The `setmaxnreg` instruction is available only on sm_90a, so this
patch adds a `hasSM90a` predicate to the NVPTX backend.
* lit tests are added to verify the lowering of the intrinsic.
* Verifier logic (and tests) is added to check the register-count
range and divisibility-by-8 requirements, sketched below.
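A sketch of the verifier's checks (the 24-256 range follows the PTX ISA description of setmaxnreg; treat it as an assumption here):

```c
#include <stdbool.h>

/* The immediate register count must be a multiple of 8 and fall
 * within [24, 256] (range assumed from the PTX ISA). */
bool is_valid_setmaxnreg_count(unsigned n) {
    return n >= 24 && n <= 256 && n % 8 == 0;
}
```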
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Targets affected:
- NVPTX and BPF: set to 64 bits.
- ARC, Lanai, and MSP430: set to 0 (they don't implement atomics).
Those which didn't yet add AtomicExpandPass to their pass pipeline now
do so.
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass. On all these targets, this
now matches what Clang already does in the frontend.
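For example (a sketch; the libcall signature in the comment is simplified):

```c
#include <stdatomic.h>

/* On NVPTX the maximum inlined atomic width is now 64 bits, so a
 * 128-bit atomic load is expanded by AtomicExpandPass into a call to
 * the __atomic_load_16 runtime function rather than inline code. */
_Atomic __int128 big;

__int128 load_big(void) {
    return atomic_load(&big);   /* -> __atomic_load_16(&big, SEQ_CST) */
}
```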
The only targets which do not configure AtomicExpandPass now are:
- DirectX and SPIRV: they aren't normal backends.
- AVR: a single-cpu architecture with no privileged/user divide, which
could implement all atomics by disabling/enabling interrupts, regardless
of size/alignment. Will be addressed by future work.
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6
Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|
The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
sm_80 only has f32->bf16 conversions; the remaining integer conversions
arrived with sm_90. Use a two-step conversion for sm_80.
There doesn't seem to be a way to express this promotion directly within
the legalization framework, so fall back on Custom lowering.
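The two-step scheme, sketched in C (illustrative; not the backend's actual code):

```c
#include <stdint.h>
#include <string.h>

/* sm_80 lacks a direct i32 -> bf16 conversion, so go through f32
 * first, then use the f32 -> bf16 rounding that sm_80 does have. */
uint16_t i32_to_bf16_sm80(int32_t v) {
    float f = (float)v;                  /* step 1: i32 -> f32        */
    uint32_t fb; memcpy(&fb, &f, sizeof fb);
    fb += 0x7FFFu + ((fb >> 16) & 1);    /* step 2: f32 -> bf16 (RNE) */
    return (uint16_t)(fb >> 16);
}
```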
Lower 16xi8 vector stores efficiently in NVPTX ISel using a single
st.v4.b32 instead of multiple st.v4.u8, along the same lines as vector
loads and 8xf16. Similarly, lower 8xi8 stores using st.v2.u32.
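The repacking, sketched in C (illustrative only):

```c
#include <stdint.h>
#include <string.h>

/* Sixteen i8 lanes are reinterpreted as four 32-bit words so the whole
 * vector can be written with one st.v4.b32 instead of several
 * byte-element stores. */
void store_16xi8(uint8_t *dst, const uint8_t v[16]) {
    uint32_t w[4];
    memcpy(w, v, sizeof w);   /* four b32 words hold all 16 bytes      */
    memcpy(dst, w, sizeof w); /* conceptually: st.v4.b32 {w0,w1,w2,w3} */
}
```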
While 0b80288e9e allowed more efficient lowering for 16xi8 loads, its
test case was closer to an "integration" one. Add a much simpler unit
test case that exercises it.