This is used by the Linux kernel built with CONFIG_THUMB2_KERNEL.
Because different operands are not permitted for `movs`, the
diagnostics now provide multiple suggestions, along the lines of using
a non-pc destination operand or an lr source operand.
Forked from D95586.
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D96304
This was taking the calling convention from the parent function,
instead of the callee. Avoids regressions in a future patch when the
caller and callee have different type breakdowns.
For some reason AArch64's lowerFormalArguments seems to intentionally
ignore the parent isVarArg.
This reverts commit 502a67dd7f.
This exposed a failure in the test-suite build on PowerPC, so revert
to unblock the buildbot first. Dave will re-commit in
https://reviews.llvm.org/D96287.
Thanks Dave.
A One-Off Identity mask is a shuffle that is mostly an identity mask
from a single source but contains a single element out-of-place, either
from a different vector or from another position in the same vector. As
opposed to lowering this via an ARMISD::BUILD_VECTOR, we can generate an
extract/insert pair directly. Under ARM with individually accessible
lane elements this often becomes a simple lane move.
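For illustration, a minimal hypothetical example of such a mask, where
the out-of-place lane 0 is taken from the second source:
```
define <4 x i32> @oneoff(<4 x i32> %a, <4 x i32> %b) {
  ; identity mask except lane 0, which comes from %b (element 4)
  %s = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x i32> %s
}
```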
This also alters the LowerVECTOR_SHUFFLEUsingMovs code to use v4f32 (not
v4i32), a more natural type for lane moves.
Differential Revision: https://reviews.llvm.org/D95551
Because we mark all operations as expand for v2f64, scalar_to_vector
would end up lowering through a stack store/reload. But it is pretty
simple to implement, only inserting a D reg into an undef vector. This
helps clear up some inefficient codegen from soft calling conventions.
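Roughly the shape involved, as a hypothetical IR example; the insert of
a scalar into lane 0 of an undef vector is what becomes the
scalar_to_vector node:
```
define <2 x double> @scalar_to_vec(double %d) {
  ; lowers to inserting the D register into an undef Q register
  %v = insertelement <2 x double> undef, double %d, i32 0
  ret <2 x double> %v
}
```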
Differential Revision: https://reviews.llvm.org/D96153
This adds another tablegen fold that converts an i16 odd-lane-insert of
an even-lane-extract into a VINS. We extract the existing f32 value from
the destination register and VINS the new value into it. The rest of the
backend then is able to optimize the INSERT_SUBREG / COPY_TO_REGCLASS /
EXTRACT_SUBREG.
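A hypothetical example of the insert/extract pair this matches (even
source lane, odd destination lane):
```
define <8 x i16> @vins_fold(<8 x i16> %src, <8 x i16> %dst) {
  %e = extractelement <8 x i16> %src, i32 2  ; even lane extract
  %v = insertelement <8 x i16> %dst, i16 %e, i32 3  ; odd lane insert
  ret <8 x i16> %v
}
```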
Differential Revision: https://reviews.llvm.org/D95456
getIntrinsicInstrCost takes an IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing, many backends end up getting it wrong.
Instead of trying to work with those two values separately, this removes
the VF parameter, widening the RetTy/ArgTys by VF when called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.
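As a sketch of what the widening means (intrinsic and values chosen
purely for illustration):
```
; previously costed by the vectorizer as RetTy==float with VF==4
%r = call float @llvm.fabs.f32(float %x)
; now queried directly with the widened types and no VF parameter
%v = call <4 x float> @llvm.fabs.v4f32(<4 x float> %xs)
```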
For most backends this looks like it will be an improvement (or they
were not using getIntrinsicInstrCost). AMDGPU needed the most changes to
keep the code from c230965ccf working. ARM removed the fix in
dfac521da1, WebAssembly happens to get a fixup for an SLP cost issue,
and both X86 and AArch64 seem to now be using better costs from the
vectorizer.
Differential Revision: https://reviews.llvm.org/D95291
As mentioned in the TODO comment, casting double to float causes NaNs
to change bits. To avoid the change, this patch adds support for
single-precision floating-point immediate values in MachineCode.
Patch by Yuta Saito.
Differential Revision: https://reviews.llvm.org/D77384
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle as it would try to treat an f16 vector as an i8.
Add f16 handling, treating them like an i16.
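A hypothetical shuffle of roughly this shape illustrates the case:
```
define <4 x half> @f16_shuffle(<4 x half> %a, <4 x half> %b) {
  %s = shufflevector <4 x half> %a, <4 x half> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x half> %s
}
```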
Differential Revision: https://reviews.llvm.org/D95446
This allows the peephole optimizer to know that a MVE_VMOV_to_lane_32 is
the same as an insert subreg, allowing it to optimize some redundant
lane moves.
Differential Revision: https://reviews.llvm.org/D95433
A v4i32 insert of an extract can become a simple lane move, as opposed
to round-tripping via a GPR. This adds a pattern that turns a v4i32
insert-extract pair into an EXTRACT_SUBREG/INSERT_SUBREG, with the
required COPY_TO_REGCLASS. These get better optimized into a simple
lane move by the rest of the backend.
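For example (lane numbers hypothetical):
```
define <4 x i32> @lane_move(<4 x i32> %a, <4 x i32> %b) {
  ; previously round-tripped through a GPR; now a lane-to-lane move
  %e = extractelement <4 x i32> %b, i32 2
  %v = insertelement <4 x i32> %a, i32 %e, i32 0
  ret <4 x i32> %v
}
```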
Differential Revision: https://reviews.llvm.org/D95428
This patch adds tablegen patterns for pairs of i16/f16 inserts and
extracts. If we are inserting into two adjacent vector lanes (0 and 1,
for example), we can use either a vmov;vins or vmovx;vins to insert the
pair together, avoiding a round-trip through GPR registers. These are
quite large patterns with a number of EXTRACT_SUBREG/INSERT_SUBREG/
COPY_TO_REGCLASS nodes, but hopefully, as most of those become copies,
all of that will be cleaned up by further optimizations.
The VINS pattern was also adjusted to allow it to represent that it is
inserting into the top half of an existing register.
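A sketch of the adjacent-lane pair the patterns match (lanes 0 and 1
here, names hypothetical):
```
define <8 x i16> @pair_insert(<8 x i16> %v, i16 %a, i16 %b) {
  %i0 = insertelement <8 x i16> %v, i16 %a, i32 0
  %i1 = insertelement <8 x i16> %i0, i16 %b, i32 1
  ret <8 x i16> %i1
}
```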
Differential Revision: https://reviews.llvm.org/D95381
A DLS lr, lr instruction only moves lr to itself. It need not be
emitted on its own, saving an instruction in the loop preheader.
Differential Revision: https://reviews.llvm.org/D78916
Given a shuffle(vqdmulh(shuffle, shuffle)), we can flatten the shuffles
out if they become an identity mask. This can come up during lane
interleaving, once we do that better.
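A sketch of the pattern, with masks chosen so that composing the outer
shuffle with the inner ones gives an identity (the MVE vqdmulh
intrinsic name here is an assumption):
```
define <4 x i32> @flatten(<4 x i32> %a, <4 x i32> %b) {
  %s1 = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %s2 = shufflevector <4 x i32> %b, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %m = call <4 x i32> @llvm.arm.mve.vqdmulh.v4i32(<4 x i32> %s1, <4 x i32> %s2)
  ; the shuffles cancel out, so this can become vqdmulh(%a, %b) directly
  %r = shufflevector <4 x i32> %m, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i32> %r
}
declare <4 x i32> @llvm.arm.mve.vqdmulh.v4i32(<4 x i32>, <4 x i32>)
```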
Differential Revision: https://reviews.llvm.org/D94034
Under the softfp calling convention, we are often left with
VMOVRRD(extract(bitcast(build_vector(a, b, c, d)))) for the return value
of the function. These can be simplified to a,b or c,d directly,
depending on the value of the extract.
Big endian is a little different because the bitcast switches the lanes
around, meaning we end up with b,a or d,c.
Differential Revision: https://reviews.llvm.org/D94989
This adds a DAG combine for converting a sext_inreg of a VGetLaneu into
a VGetLanes, provided the types match correctly.
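In IR terms, a pattern along these lines can produce those DAG nodes (a
hypothetical example; the zero-extending lane read plus the later
sign-extension is what becomes the sext_inreg of a VGetLaneu):
```
define i32 @get_lane_signed(<8 x i16> %v) {
  %e = extractelement <8 x i16> %v, i32 0
  %s = sext i16 %e to i32
  ret i32 %s
}
```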
Differential Revision: https://reviews.llvm.org/D95073
Under SoftFP calling conventions, we can be left with
extract(bitcast(BUILD_VECTOR(VMOVDRR(a, b), ..))) patterns that can
simplify to a or b, depending on the extract lane.
Differential Revision: https://reviews.llvm.org/D94990
This patch allows targets to define multiple cost
values for each register so that the cost model
can be more flexible and better used during
register allocation, as per the target requirements.
For AMDGPU the VGPR allocation will be more efficient
if the register cost can be associated dynamically
based on the calling convention.
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D86836
The MVE VLD2/4 and VST2/4 instructions require the pointer to be aligned
to at least the size of the element type. This adds a check for that
into the ARM lowerInterleavedStore and lowerInterleavedLoad functions,
not creating the intrinsics if they are invalid for the alignment of
the load/store.
Unfortunately this is one of those bug fixes that does affect some
useful codegen, as we were sometimes able to do some nice lowering of
q15 types. But they can cause problems with low-aligned pointers.
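A sketch of the kind of access this affects (types and alignment chosen
for illustration); an under-aligned vld2-style deinterleaving load will
no longer be turned into the intrinsic:
```
define void @deinterleave(<8 x i16>* %p) {
  ; align 1 is below the i16 element alignment required by MVE VLD2
  %wide = load <8 x i16>, <8 x i16>* %p, align 1
  %even = shufflevector <8 x i16> %wide, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd = shufflevector <8 x i16> %wide, <8 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  ret void
}
```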
Differential Revision: https://reviews.llvm.org/D95319
This adds some simple fp16 scalar_to_vector patterns, preventing a
selection failure if this came up.
Differential Revision: https://reviews.llvm.org/D95427
STRT, STRHT, and STRBT are store instructions and their source register
$Rt should be treated as an input operand instead of an output operand.
This should fix things (e.g., liveness tracking in LivePhysRegs) if
these instructions were used in CodeGen.
Differential Revision: https://reviews.llvm.org/D95074
Recent shouldAssumeDSOLocal changes (introduced by 961f31d8ad) no
longer take the relocation model into consideration. The ARM fast-isel
pass uses the function's return value to set whether a global symbol is
loaded indirectly or not, and without the expected information llvm now
generates an extra load for the following code:
```
$ cat test.ll
@__asan_option_detect_stack_use_after_return = external global i32
define dso_local i32 @main(i32 %argc, i8** %argv) #0 {
entry:
%0 = load i32, i32* @__asan_option_detect_stack_use_after_return, align 4
%1 = icmp ne i32 %0, 0
br i1 %1, label %2, label %3
2:
ret i32 0
3:
ret i32 1
}
attributes #0 = { noinline optnone }
$ llc test.ll -o -
[...]
main:
.fnstart
[...]
movw r0, :lower16:__asan_option_detect_stack_use_after_return
movt r0, :upper16:__asan_option_detect_stack_use_after_return
ldr r0, [r0]
ldr r0, [r0]
cmp r0, #0
[...]
```
And without 'optnone' it produces:
```
[...]
main:
.fnstart
[...]
movw r0, :lower16:__asan_option_detect_stack_use_after_return
movt r0, :upper16:__asan_option_detect_stack_use_after_return
ldr r0, [r0]
clz r0, r0
lsr r0, r0, #5
bx lr
[...]
```
This triggered a lot of invalid memory accesses in sanitizers for
arm-linux-gnueabihf. I checked this patch with both a stage1 built with
gcc and a stage2 bootstrap, and it fixes all the Linux sanitizer
issues.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D95379
The only caller of this function is in LocalStackSlotAllocation, and
it creates a base register of the class returned by the target's
getPointerRegClass(). AMDGPU wants to use a different register class
here, so let materializeFrameBaseRegister just create and return
whatever it wants.
Differential Revision: https://reviews.llvm.org/D95268
I may have given bad advice, and skipping sext_inreg when matching SSAT
patterns is not valid on its own. It at least needs to sext_inreg the
input again, but as far as I can tell is still only valid based on
demanded bits. For the moment disable that part of the combine,
hopefully reimplementing it more correctly in the future.
This replaces the isSaturatingConditional function with
LowerSaturatingConditional that directly returns a new SSAT or
USAT SDValue, instead of returning true and the components of it.
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to work OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
```
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
```
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 variants are particularly interesting as there are no native
i64 add/mul instructions, leading to the i64 add and mul naturally
getting very high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempts to improve the cost modelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match reduction patterns that are optionally extended and/or MLA
patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do.)
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reductions larger than the
highest vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together, this can greatly improve performance for reduction loops
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and vector VF. This updates the cost model to
handle that, still producing the correct vector costs.
A vectorizer test is added to show it vectorizing at the correct factor
again.
We have no lowering for VSELECT vXi1, vXi1, vXi1, so mark them as
expanded to turn them into a series of logical operations.
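The expansion is the usual select-to-logic lowering; roughly, as a
sketch:
```
define <4 x i1> @vselect_i1(<4 x i1> %c, <4 x i1> %a, <4 x i1> %b) {
  ; select %c, %a, %b == (%c & %a) | (~%c & %b)
  %t = and <4 x i1> %c, %a
  %notc = xor <4 x i1> %c, <i1 true, i1 true, i1 true, i1 true>
  %f = and <4 x i1> %notc, %b
  %r = or <4 x i1> %t, %f
  ret <4 x i1> %r
}
```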
Differential Revision: https://reviews.llvm.org/D94946
This adds some basic MVE sadd_sat/ssub_sat/uadd_sat/usub_sat costs,
based on when the instruction is legal. With smaller-than-legal types
that are promoted we generate shr(qadd(shl, shl)), so a cost of 4 is
appropriate.
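A sketch of that promoted expansion for a scalar i8, with the ARM qadd
intrinsic standing in for the saturating-add node and the shift amounts
assumed; the four marked operations give the cost of 4:
```
define i8 @promoted_sadd_sat(i8 %a, i8 %b) {
  %wa = sext i8 %a to i32
  %wb = sext i8 %b to i32
  %sa = shl i32 %wa, 24                             ; shl
  %sb = shl i32 %wb, 24                             ; shl
  %q = call i32 @llvm.arm.qadd(i32 %sa, i32 %sb)    ; qadd
  %sr = ashr i32 %q, 24                             ; shr
  %t = trunc i32 %sr to i8
  ret i8 %t
}
declare i32 @llvm.arm.qadd(i32, i32)
```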
Differential Revision: https://reviews.llvm.org/D94958
This patch handles cases where we have to save/restore the link
register into the stack and load/store instructions which use the stack
are part of the outlined region. It checks that there will be no
overflow introduced by the new offset and fixes up these instructions
accordingly.
Differential Revision: https://reviews.llvm.org/D92934
It turns out that the BranchFolder and IfCvt do not like unanalyzable
branches that fall through. This means that removing the unconditional
branches from the end of tail predicated blocks can run into asserts
and verifier issues.
This effectively reverts 372eb2bbb6, but adds handling for
t2DoLoopEndDec instructions, which are not branches and so can be
safely skipped.
If the previous block in a function does not fall through, nops added
to align the next block will never be executed. This means we can
freely (except for codesize) align more branches. This happens in the
constant islands pass (as it cannot happen later) and only at
aggressive optimization levels, as it does increase codesize.
Differential Revision: https://reviews.llvm.org/D94394
This treats low overhead loop branches the same as jump tables and
indirect branches in analyzeBranch - they cannot be analyzed but the
direct branches on the end of the block may be removed. This helps
remove the unnecessary branches earlier, which can help produce better
codegen (and change block layout in a number of cases).
Differential Revision: https://reviews.llvm.org/D94392
The TripCount for a predicated vector loop body will be
ceil(ElementCount/Width). This alters the conversion of an
active.lane.mask to a VCTP intrinsic to match.
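Sketched for a width of 4 (names hypothetical):
```
; trip count = ceil(%N / 4)
%n.rnd = add i32 %N, 3
%tripcount = lshr i32 %n.rnd, 2
; the per-iteration lane mask
%mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %N)
; is converted to the MVE VCTP intrinsic on the remaining element count
%vctp = call <4 x i1> @llvm.arm.mve.vctp32(i32 %remaining)
```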
Differential Revision: https://reviews.llvm.org/D94608
Not all machine loops will have a predecessor, so the pass needs to
check this before continuing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D94780
For the ARM hard-float calling convention, calls to variadic functions
need to be treated differently, even if only the fixed arguments are
provided.
This fixes GCC-C-execute-pr68390 in the test-suite, which is failing on
the ARM GlobalISel bot.
Blocks can be laid out such that a t2WhileLoopStart branches backwards.
This is forbidden by the architecture and so it fails to be converted
into a low-overhead loop. This new pass checks for these cases and
moves the target block, fixing any fall-through that would then be
broken.
Differential Revision: https://reviews.llvm.org/D92385