LLVM intrinsics `get_fpmode`, `set_fpmode` and `reset_fpmode` operate
control modes, the bits of FP environment that affect FP operations. On
ARM these bits are in FPSCR together with the status bits. The
implementation of these intrinsics produces code close to that of
functions `fegetmode` and `fesetmode` from GLIBC.
Pull request: https://github.com/llvm/llvm-project/pull/74054
In some cases, the machine outliner needs to preserve LR across an
outlined call by pushing it onto the stack. Previously, this also
generated unwind table instructions, which is incorrect because EHABI
unwind tables cannot represent different stack frames a different points
in the function, so the extra unwind info applied to the entire
function.
The outliner code already avoided generating CFI instructions, but EHABI
unwind data is generated later from the actual instructions, so we need
to avoid using the FrameSetup and FrameDestroy flags to prevent unwind
data being generated.
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
The floating-point and MVE features together specify the MVE
functionality that is supported on the Cortex-M85 processor. But the FPU
extension for the underlying architecture(armv8.1-m.main) is FPV5 which
does not include MVE-F. So Compiler's -S output and `-save-temps=obj`
loses MVE feature which leads to assembler error. What happening here is
.fpu directive overrides any previously set features by .cpu directive.
Since the the corresponding .fpu generated (.fpu fpv5-d16) does not
include MVE-F, it overrides those features even though it is supported
and set by the .cpu directive. Looks like .fpu is supposed to do this.
In this case, there should be an .arch_extension directive re-enabling
the relevant extensions after .fpu if the goal is to keep these
extensions enabled. GCC also does the same.
So this patch enables the MVE features by emitting the below arch
extension:
.fpu fpv5-d16
.arch_extension mve.fp
---------
Co-authored-by: Simi Pallipurath <simi.pallipurath.com>
When x is not known to be nonzero, ctpop(x) == 1 is expanded to
x != 0 && (x & (x - 1)) == 0
resulting in codegen like
leal -1(%rdi), %eax
testl %eax, %edi
sete %cl
testl %edi, %edi
setne %al
andb %cl, %al
But another expression that works is
(x ^ (x - 1)) > x - 1
which has nicer codegen:
leal -1(%rdi), %eax
xorl %eax, %edi
cmpl %eax, %edi
seta %al
Remove support for zext and sext constant expressions. All places
creating them have been removed beforehand, so this just removes the
APIs and uses of these constant expressions in tests.
There is some additional cleanup that can be done on top of this, e.g.
we can remove the ZExtInst vs ZExtOperator footgun.
This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
The ELF code from https://reviews.llvm.org/D112811 emits LDRLIT_ga_pcrel
when `TM.isPositionIndependent()` but uses a different condition
`Subtarget.isGVIndirectSymbol(GV)` (aka dso_preemptable on ELF targets).
This would cause incorrect access for dso_preemptable
`__stack_chk_guard` with the static relocation model.
Regarding whether `__stack_chk_guard` gets the dso_local specifier,
https://reviews.llvm.org/D150841 switched to
`M.getDirectAccessExternalData()` (implied by "PIC Level") instead of
`TM.getRelocationModel() == Reloc::Static`.
The result is that when non-zero "PIC Level" is used with static
relocation model (e.g. -fPIE/-fPIC LTO compiles with -no-pie linking),
`__stack_chk_guard` accesses are incorrect.
```
ldr r0, .LCPI0_0
ldr r0, [r0]
ldr r0, [r0] // incorrectly dereferences __stack_chk_guard
...
.LCPI0_0:
.long __stack_chk_guard
```
To fix this, for dso_preemptable `__stack_chk_guard`, emit a GOT PIC
code sequence like for -fpic using `LDRLIT_ga_pcrel`:
```
ldr r0, .LCPI0_0
.LPC0_0:
add r0, pc, r0
ldr r0, [r0]
ldr r0, [r0]
...
LCPI0_0:
.Ltmp0:
.long __stack_chk_guard(GOT_PREL)-((.LPC0_0+8)-.Ltmp0)
```
Technically, `LDRLIT_ga_abs` with `R_ARM_GOT_ABS` could be used, but
`R_ARM_GOT_ABS` does not have GNU or integrated assembler support.
(Note, `.LCPI0_0: .long __stack_chk_guard@GOT` produces an
`R_ARM_GOT_BREL`, which is not desired).
This patch fixes#6499 while not changing behavior for the following
configurations:
```
run arm.linux.nopic --target=arm-linux-gnueabi -fno-pic
run arm.linux.pie --target=arm-linux-gnueabi -fpie
run arm.linux.pic --target=arm-linux-gnueabi -fpic
run armv6.darwin.nopic --target=armv6-apple-darwin -fno-pic
run armv6.darwin.dynamicnopic --target=armv6-apple-darwin -mdynamic-no-pic
run armv6.darwin.pic --target=armv6-apple-darwin -fpic
run armv7.darwin.nopic --target=armv7-apple-darwin -mcpu=cortex-a8 -fno-pic
run armv7.darwin.dynamicnopic --target=armv7-apple-darwin -mcpu=cortex-a8 -mdynamic-no-pic
run armv7.darwin.pic --target=armv7-apple-darwin -mcpu=cortex-a8 -fpic
run arm64.darwin.pic --target=arm64-apple-darwin
```
llvm/test/LTO/ARM/ssp-static-reloc.ll is more about using the static
relocation model with "PIC Level" and unrelated to the LTO infrastructure.
Move the test. Update stack_guard_remat.ll to clearly test "PIC Level"
with the relevant relocation models.
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:
* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
The MVETRUNC operation can perform the same truncate of two vectors, without
requiring lane inserts/extracts from every vector lane. This moves the concat
i1 lowering to use it for v8i1 and v16i1 result types, trading a bit of extra
stack space for less instructions.
Split a virtual register with hint may generate COPY instructions in
multiple cold basic blocks, and increase code size. So disable this
split when the function is optimized for size.
Reapply now that generation of incorrect debuginfo for FnDef
in rustc has been fixed.
-----
Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.
This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.
Differential Revision: https://reviews.llvm.org/D158743
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:
- We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
- The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
- This cleans up a lot of legalizer output and even improves compilation-times.
AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count).
Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D159140
Split a register generated from another split usually doesn't bring us
too much benefit. It may also cause dead loop as pr67188 shows if the
heuristic cost always satisfy the split condition. So prevent such
splitting.
It fixed pr67188.
Add the command line option --no-trap-after-noreturn, which exposes the
pre-existing TargetOption `NoTrapAfterNoreturn`.
This pull request was split off from this one:
https://github.com/llvm/llvm-project/pull/65876
If a virtual register is not assigned preferred physical register, it means some
COPY instructions will be changed to real register move instructions. In this
case we can try to split the virtual register in colder blocks, if success, the
original COPY instructions can be deleted, and the new COPY instructions in
colder blocks will be generated as register move instructions. It results in
fewer dynamic register move instructions executed.
The new test case split-reg-with-hint.ll gives an example, the hot path contains
24 instructions without this patch, now it is only 4 instructions with this
patch.
Differential Revision: https://reviews.llvm.org/D156491
The indirect lowering hinders the outliner's ability to see that
sequences are in fact common, since the sequence similarity is rendered
opaque by the register callee. The size savings from making them
indirect seems to be dwarfed by the outliner's savings from
de-duplication.
rdar://115178034
rdar://115459865
Reapply after fixing a clang bug this exposed in D158972 and
adjusting a number of tests that failed for 32-bit targets.
-----
Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.
This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.
Differential Revision: https://reviews.llvm.org/D158743
The instruction of ISD::FMINNUM/FMAXNUM should be legal if HasFPARMv8 &&
HasNEON.
For the combination of armv7+fp-armv8, armv7 imply the feature HasNEON
on, and fp-armv8 matchs the feature HasFPARMv8, so it is legal
Fixes https://github.com/llvm/llvm-project/issues/65820
This removes the backend requirement for crc instructions on HasV8, relying on
just HasCRC instead. This should allow them to be selected with ArmV7 + crc,
making them more usable whilst hopefully not making them incorrectly generated
(they only come from intrinsics, and HasCRC usually requires HasV8). This is
how most other instructions are specified.
We currently have log, log2, log10, exp and exp2 intrinsics. Add exp10
to fix this asymmetry. AMDGPU already has most of the code for f32
exp10 expansion implemented alongside exp, so the current
implementation is duplicating nearly identical effort between the
compiler and library which is inconvenient.
https://reviews.llvm.org/D157871
When resolving a frame index with a large offset for v6M execute-only,
we emit a tMOVimm32 pseudo-instruction, which later gets lowered to a
sequence of instructions, all of which are flag-setting. However, a
frame index may be generated for a register spill or reload instruction,
which can be inserted at a point where CPSR is live. This patch inserts
MRS and MSR instructions around the tMOVimm32 to save and restore the
value of CPSR, if CPSR is live at that point.
This may need up to two virtual registers (one to build the immediate
value, one to save CPSR) during frame index lowering, which happens
after register allocation, so we need to ensure two spill slots are
avilable to the register scavenger to ensure it can free up enough
registers for this.
There is no test for the emission (or not) of the MRS/MSR pair, because
it requires a spill or reload to be inserted at a point where CPSR is
live, which requires a large, complex function and is fragile enough
that any optimisation changes will break the test. This bug was easily
found by csmith with -verify-machineinstrs, which I now run regularly on
v6M execute-only (and many other combinations).
Patch by John Brawn and myself.
Reviewed By: stuij
Differential Revision: https://reviews.llvm.org/D158404
When adjusting the Stack Pointer at the end of the function epilogue,
use a callee-saved register, rather than explicitly using R4 which may
not have been saved.
Differential Revision: https://reviews.llvm.org/D157500
Aligning functions yields small performance gains on
embedded cores, moreso with numerous small function calls.
Similar to aligning loops, if the function can fit within
a single cache line then the performance overhead of
fetching more instructions can be limited.
Differential Revision: https://reviews.llvm.org/D157514