Commit Graph

1488 Commits

Author SHA1 Message Date
Philip Reames
8d85e945b2 [SCEV] Canonicalize X - urem X, Y patterns
There are multiple possible ways to represent the X - urem X, Y pattern. SCEV was not canonicalizing, and thus, depending on which you analyzed, you could get different results. The sub representation appears to produce strictly inferior results in practice, so I decided to canonicalize to the Y * X/Y version.

The motivation here is that runtime unroll produces the sub X - (and X, Y-1) pattern when Y is a power of two. SCEV is thus unable to recognize that an unrolled loop exits because we don't figure out that the new unrolled step evenly divides the trip count of the unrolled loop. After instcombine runs, we convert to the andn form which SCEV recognizes, so essentially, this is just fixing a nasty pass ordering dependency.

The ARM loop hardware interaction in the test diff is opaque to me, but the comments in the review from others with knowledge of the infrastructure appear to indicate these are improvements in loop recognition, not regressions.

Differential Revision: https://reviews.llvm.org/D114018
2021-11-16 11:59:21 -08:00
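The equivalence behind the canonicalization in 8d85e945b2 can be checked by brute force. The following is a minimal sketch, not SCEV code: the 8-bit width and all helper names are assumptions chosen for illustration. It checks that X - urem(X, Y) equals Y * (X udiv Y), and that for power-of-two Y the X - (and X, Y-1) form produced by runtime unrolling computes the same value.

```python
BW = 8                      # illustrative bit width (assumption, not from the commit)
MASK = (1 << BW) - 1

def sub_urem_form(x, y):
    """X - (X urem Y), wrapped to BW bits."""
    return (x - (x % y)) & MASK

def mul_udiv_form(x, y):
    """Y * (X udiv Y), wrapped to BW bits."""
    return (y * (x // y)) & MASK

def sub_and_form(x, y):
    """X - (X & (Y - 1)); only equivalent when Y is a power of two."""
    return (x - (x & (y - 1))) & MASK

def forms_agree():
    """Exhaustively compare the three forms over all unsigned BW-bit inputs."""
    for x in range(1 << BW):
        for y in range(1, 1 << BW):          # Y == 0 is excluded (division by zero)
            if sub_urem_form(x, y) != mul_udiv_form(x, y):
                return False
            is_pow2 = (y & (y - 1)) == 0
            if is_pow2 and sub_and_form(x, y) != mul_udiv_form(x, y):
                return False
    return True
```

For example, with X = 13 and Y = 4, every form yields 12, the largest multiple of 4 not exceeding 13.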
David Green
4c3bfdc7f1 [ARM] Fix GatherScatter AddLikeOr condition 2021-11-15 09:44:41 +00:00
Simon Pilgrim
82b74363a9 [DAG] reassociateOpsCommutative - peek through bitcasts to find constants
Now that FoldConstantArithmetic can fold bitcasted constants, we should peek through bitcasts of binop operands to try and find foldable constants
2021-11-11 12:00:22 +00:00
Sanjay Patel
e2b1d3260a [AArch][x86] add tests for vselect; NFC
This is a potential follow-up suggested in D113212.
2021-11-08 15:21:19 -05:00
Sanjay Patel
7e30404c3b [DAGCombiner] add fold for vselect based on mask of signbit, part 2
This is the 'or' sibling for the fold added with:
D113212

https://alive2.llvm.org/ce/z/tgnp7K

Note that neither of these transforms is poison-safe,
but it does not seem to matter at this level. We have
had the scalar version of D113212 for a long time, so
this is just making optimizer behavior consistent.

We do not have the scalar version of *this* fold,
however, so that is another follow-up.
2021-11-05 15:02:12 -04:00
Sanjay Patel
4d513f2527 [AArch] add tests for vselect; NFC
These are copy/pasted from the related test patterns in D113212.
2021-11-05 15:02:12 -04:00
Sanjay Patel
4fc1fc4005 [DAGCombiner] add fold for vselect based on mask of signbit
(X s< 0) ? Y : 0 --> (X s>> BW-1) & Y

We canonicalize to the icmp+select form in IR, and we already have this fold
for scalar select in SDAG, so I think it's an oversight that we don't have
the fold for vectors. It seems neutral for AArch64 and saves some instructions
on x86.

Whether we should also have the sibling folds for the inverse condition or
all-ones true value may depend on target-specific factors such as whether
there's an "and-not" instruction.

Differential Revision: https://reviews.llvm.org/D113212
2021-11-05 10:06:16 -04:00
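The fold in 4fc1fc4005, (X s< 0) ? Y : 0 --> (X s>> BW-1) & Y, can be sanity-checked per lane with a small model. This is only a sketch under assumptions: the 8-bit lane width and the helper names are illustrative, and it models a single lane rather than the DAGCombiner code.

```python
BW = 8                      # illustrative lane width (assumption)
MASK = (1 << BW) - 1

def to_signed(v):
    """Interpret the low BW bits of v as a two's complement value."""
    v &= MASK
    return v - (1 << BW) if v >> (BW - 1) else v

def select_form(x, y):
    """(X s< 0) ? Y : 0"""
    return (y & MASK) if to_signed(x) < 0 else 0

def shift_and_form(x, y):
    """(X s>> BW-1) & Y; Python's >> on a negative int is an arithmetic shift."""
    sign_splat = (to_signed(x) >> (BW - 1)) & MASK   # all ones if X is negative, else 0
    return sign_splat & y & MASK
```

The arithmetic shift smears the sign bit across the lane, so the mask is either all ones (keeping Y) or zero (discarding it).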
Simon Pilgrim
c1e7911c3b [DAG] FoldConstantArithmetic - fold bitlogic(bitcast(x),bitcast(y)) -> bitcast(bitlogic(x,y))
To constant fold bitwise logic ops where we've legalized constant build vectors to a different type (e.g. v2i64 -> v4i32), this patch adds a basic ability to peek through the bitcasts and perform the constant fold on the inner operands.

The MVE predicate v2i64 regressions will be addressed by future basic v2i64 type support.

One of the yak shaving fixes for D113192....

Differential Revision: https://reviews.llvm.org/D113202
2021-11-05 12:00:59 +00:00
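The identity behind c1e7911c3b, bitlogic(bitcast(x),bitcast(y)) -> bitcast(bitlogic(x,y)), holds because and/or/xor act on each bit independently of how the bits are grouped into lanes. A minimal sketch, assuming a little-endian lane layout and using hypothetical helper names, for the v2i64 -> v4i32 case mentioned in the message:

```python
import struct

def bitcast_v2i64_to_v4i32(v):
    """Reinterpret two 64-bit lanes as four 32-bit lanes (little-endian assumed)."""
    return list(struct.unpack("<4I", struct.pack("<2Q", *v)))

def bitlogic(op, a, b):
    """Lane-wise and/or/xor on two equally sized vectors of unsigned lanes."""
    ops = {
        "and": lambda p, q: p & q,
        "or": lambda p, q: p | q,
        "xor": lambda p, q: p ^ q,
    }
    return [ops[op](p, q) for p, q in zip(a, b)]

def fold_commutes(x, y):
    """Check that doing the logic op before or after the bitcast gives the same bits."""
    for op in ("and", "or", "xor"):
        outer = bitlogic(op, bitcast_v2i64_to_v4i32(x), bitcast_v2i64_to_v4i32(y))
        inner = bitcast_v2i64_to_v4i32(bitlogic(op, x, y))
        if outer != inner:
            return False
    return True
```

The same argument would not hold for arithmetic such as add, where carries cross lane boundaries, which is why the fold is limited to bitwise logic.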
David Green
cb62c3761f [ARM] Extra MVE constant select test. NFC 2021-11-05 10:57:38 +00:00
David Green
091244023a [ARM] Move VPTBlock pass after post-ra scheduling
Currently when tail predicating loops, vpt blocks need to be created
with the vctp predicate in case we need to revert to non-tail predicated
form. This has the unfortunate side effect of severely hampering post-ra
scheduling at times as the instructions are already stuck in vpt blocks,
not allowed to be independently ordered.

This patch addresses that by just moving the creation of VPT blocks
later in the pipeline, after post-ra scheduling has been performed. This
allows more optimal scheduling post-ra before the vpt blocks are
created, leading to more optimal tail predicated loops.

Differential Revision: https://reviews.llvm.org/D113094
2021-11-04 18:42:12 +00:00
David Green
3bc586b9aa [ARM] Treat MVE gather add-like-or's like adds
LLVM has the habit of turning adds with no common bits set into ors,
which means we need to detect them and treat them like adds again in the
MVE gather/scatter lowering pass.

Differential Revision: https://reviews.llvm.org/D112922
2021-11-03 11:41:06 +00:00
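The "add-like or" that 3bc586b9aa detects relies on a simple fact: when two operands have no set bits in common, no carries occur, so or and add give the same result. A brute-force sketch over 8-bit values (the width and function names are illustrative assumptions, not taken from the pass):

```python
BW = 8                      # illustrative bit width (assumption)
MASK = (1 << BW) - 1

def is_add_like_or(a, b):
    """An 'or' behaves like an 'add' when the operands share no set bits."""
    return (a & b) == 0

def check_add_like_or():
    """Exhaustively confirm a | b == a + b whenever a & b == 0."""
    for a in range(1 << BW):
        for b in range(1 << BW):
            if is_add_like_or(a, b) and (a | b) != ((a + b) & MASK):
                return False
    return True
```

A typical instance is an index such as (i << 2) | 3, where the low two bits of the shifted value are known to be zero.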
David Green
d36dd1f842 [ARM] Push gather/scatter shl index updates out of loops
This teaches the MVE gather scatter lowering pass that SHL is
essentially the same as Mul, where we are able to optimize the
induction of a gather/scatter address by pushing them out of loops.
https://alive2.llvm.org/ce/z/wG4VyT

Differential Revision: https://reviews.llvm.org/D112920
2021-11-03 11:00:05 +00:00
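The reason SHL can be treated like Mul in d36dd1f842 is that a left shift by a constant distributes over wrapping addition, so the shift can be applied once to the start value and the step instead of on every iteration. The sketch below models that with 32-bit wrapping arithmetic; the function names and the scalar (rather than vector) induction are simplifying assumptions.

```python
BW = 32                     # offsets modelled as wrapping 32-bit values (assumption)
MASK = (1 << BW) - 1

def offsets_shl_in_loop(start, step, shift, trips):
    """Shift the induction value on every iteration."""
    out, i = [], start & MASK
    for _ in range(trips):
        out.append((i << shift) & MASK)
        i = (i + step) & MASK
    return out

def offsets_shl_hoisted(start, step, shift, trips):
    """Shift start and step once before the loop, then only add inside it."""
    out = []
    i = (start << shift) & MASK
    scaled_step = (step << shift) & MASK
    for _ in range(trips):
        out.append(i)
        i = (i + scaled_step) & MASK
    return out
```

Both versions produce the same sequence of offsets, including when the induction value wraps around.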
David Green
0e3a5f1ab3 [ARM] Some extra gather/scatter tests. NFC 2021-11-02 10:32:22 +00:00
David Green
2c4a9e830c [ValueTracking] Teach computeConstantRange that the maximum value of a half is 65504
The maximal value of a half is 0x7bff, which is 65504 when converted to
an integer. This patch teaches that to computeConstantRange to compute a
constant range with the correct maximum value.
https://alive2.llvm.org/ce/z/BV_Spb
https://alive2.llvm.org/ce/z/Nwuqvb

The maximum value for a float converted in the same way is 3.4e38, which
requires 129 bits of data. I have not added that here, as integer types
that large are rare compared to the integer types larger than 17 bits
required for half floats.

The MVE tests change because instsimplify happens to be run as a part of
the backend, where it doesn't tend to for other backends.

Differential Revision: https://reviews.llvm.org/D112694
2021-10-30 14:27:38 +01:00
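The constants quoted in 2c4a9e830c can be reproduced from the IEEE 754 bit patterns. This is a small sketch using Python's struct module (the "e" format is binary16); the helper names are chosen here for illustration.

```python
import struct

def bits_to_half(bits):
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def bits_to_float(bits):
    """Decode a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def signed_bits_needed(max_value):
    """Width of a signed integer type able to hold max_value (magnitude bits plus sign)."""
    return int(max_value).bit_length() + 1

HALF_MAX = bits_to_half(0x7BFF)          # largest finite half
FLOAT_MAX = bits_to_float(0x7F7FFFFF)    # largest finite float
```

Here HALF_MAX is 65504.0, which needs 16 magnitude bits and so 17 bits as a signed integer, while FLOAT_MAX needs 128 magnitude bits and so 129 bits as a signed integer, matching the figures in the commit message.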
David Green
1ad9b072e5 [ARM] Add some fp convert with saturate MVE tests. NFC 2021-10-30 12:08:17 +01:00
David Green
e6df795759 [ARM] Add a complex dotprod test case. 2021-10-25 10:52:12 +01:00
David Green
9bfe7af159 [ARM] Add new abs test. NFC 2021-10-21 13:03:18 +01:00
David Green
73346f5848 [ARM] Introduce a MQPRCopy
Currently when creating tail predicated loops, we need to validate that
all the live-outs of a loop will be equivalent with and without tail
predication, and if they are not we cannot legally create a
tail-predicated loop, leaving expensive vctp and vpst instructions in
the loop. These notably can include register-allocation instructions
like stack loads and stores, and copies lowered from COPYs to MVE_VORRs.

Instead of trying to prove this is valid late in the pipeline, this
patch introduces a MQPRCopy pseudo instruction that COPY is lowered to.
This can then either be converted to a MVE_VORR where possible, or to a
couple of VMOVD instructions if not. This way they do not behave
differently within and outside of tail-predication regions, and we can
know by construction that they are always valid. The idea is that we can
do the same with stack load and stores, converting them to VLDR/VSTR or
VLDM/VSTM where required to prove tail predication is always valid.

This does unfortunately mean inserting multiple VMOVD instructions,
instead of a single MVE_VORR, but my experiments show it to be an
improvement in general.

Differential Revision: https://reviews.llvm.org/D111048
2021-10-07 12:52:12 +01:00
David Green
bf916cdbd2 [ARM] Add tests for code that spills in tail predicate loops. 2021-10-07 11:35:02 +01:00
David Green
f9aa8623fe [ARM] Add more MVE intrinsics to sink splats to
This adds a few more unpredicated intrinsics to sink splats to, in order
to create more qr instruction variants. Notably this includes
saddsat/uaddsat but also some of the unpredicated mve intrinsics.

Differential Revision: https://reviews.llvm.org/D110333
2021-09-30 14:41:23 +01:00
Jay Foad
156d7d2df7 [LiveIntervals] Remove unused subreg ranges in repairIntervalsInRange
If the old instructions mentioned a subreg that the new instructions do
not, remove the subrange for that subreg.

For example, in TwoAddressInstructionPass::eliminateRegSequence, if a
use operand in the REG_SEQUENCE has the undef flag then we don't
generate a copy for it so after the elimination there should be no live
interval at all for the corresponding subreg of the def.

This is a small step towards switching TwoAddressInstructionPass over
from LiveVariables to LiveIntervals. Currently this path is only tested
if you explicitly enable -early-live-intervals.

Differential Revision: https://reviews.llvm.org/D110542
2021-09-30 09:15:10 +01:00
David Green
fdd8c10959 [ARM] Delay reverting WLS in arm-block-placement
As we have to split blocks, we may be left in an invalid loop state
after a WLS is reverted to a DLS. Instead remember the WLS that could
not be fixed and revert them after finishing processing all other loops.

Differential Revision: https://reviews.llvm.org/D110567
2021-09-28 15:38:29 +01:00
David Green
2c53215e99 [ARM] Skip debug info in recomputeVPTBlockMask
The ARMLowOverheadLoops pass recalculates VPT block masks when it
converts VCMP's inside VPT blocks into VPT's. The function to do so
doesn't seem to handle debug info though, leading to invalid block
creation or asserts at compile time. Make sure the function skips any
debug info between the MVE instructions it inspects.

Differential Revision: https://reviews.llvm.org/D110564
2021-09-28 14:58:13 +01:00
Jay Foad
20c0280733 [LiveIntervals] Repair subreg ranges in processTiedPairs
In TwoAddressInstructionPass::processTiedPairs, update subranges of the
live interval for RegB as well as the main range.

This is a small step towards switching TwoAddressInstructionPass over
from LiveVariables to LiveIntervals. Currently this path is only tested
if you explicitly enable -early-live-intervals.

Differential Revision: https://reviews.llvm.org/D110526
2021-09-28 08:10:16 +01:00
David Green
bb2d23dcd4 [ARM] Improve detection of fallthrough when aligning blocks
We align non-fallthrough branches under Cortex-M at O3 to lead to fewer
instruction fetches. This improves that for the block after a LE or
LETP. These blocks will still have terminating branches until the
LowOverheadLoops pass is run (as they are not handled by analyzeBranch,
the branch is not removed until later), so canFallThrough will return
false. These extra branches will eventually be removed, leaving a
fallthrough, so treat them as such and don't add unnecessary alignments.

Differential Revision: https://reviews.llvm.org/D107810
2021-09-27 11:21:21 +01:00
David Green
883758ed48 [ARM] Fix Arm block placement creating branches after jump tables.
Given:
 - A jump table
 - Which jumps to the next block
 - The next block ends in a WLS
 - Where the WLS conditionally jumps to a block earlier in the program.

The Arm block placement pass would attempt to move the block containing
the WLS earlier, as the WLS instruction can only branch forward. In
doing so it would add a branch from the jumptable block to the WLS
block, thinking it previously fell-through.

This in itself would be fine, if a little inefficient, but the constant
island pass expects all instructions after a jump-table branch to have
been removed by analyzeBranch. So it gets confused and can assign the
same labels to multiple jump table blocks.

I've changed the condition to the same as used in analyzeBranch.
2021-09-25 11:32:25 +01:00
David Green
a5211bf365 [ARM] Additional jump table plus while loop block placement pass test.
Also regenerated the file, whilst here.
2021-09-24 19:30:49 +01:00
Stanislav Mekhanoshin
08d7eec06e Revert "Allow rematerialization of virtual reg uses"
Reverted due to two distinct performance regression reports.

This reverts commit 92c1fd19ab.
2021-09-24 10:26:11 -07:00
Jay Foad
e4e95f14f1 [LiveIntervals] Repair live intervals that gain subranges
In repairIntervalsInRange, if the new instructions refer to subregs but
the old instructions did not, make sure any existing live interval for
the superreg is updated to have subranges. Also skip repairing any range
that we have recalculated from scratch, partly for efficiency but also
to avoid some cases that repairOldRegInRange can't handle.

The existing test/CodeGen/AMDGPU/twoaddr-regsequence.mir provides some
test coverage for this change: when TwoAddressInstructionPass converts
REG_SEQUENCE into subreg copies, the live intervals will now get
subranges and MachineVerifier will verify that the subranges are
correct. Unfortunately MachineVerifier does not complain if the
subranges are not present, so the test also passed before this patch.

This patch also fixes ~800 of the ~1500 failures in the whole CodeGen
lit test suite when -early-live-intervals is forced on.

Differential Revision: https://reviews.llvm.org/D110328
2021-09-24 11:58:08 +01:00
David Green
e2050f94b6 [ARM] Extra tests for unpredicated qr MVE intrinsics. 2021-09-23 18:07:08 +01:00
David Green
02cd8a6b91 [ARM] Allow smaller VMOVL in tail predicated loops
This allows VMOVL in tail predicated loops so long as the vector
size the VMOVL is extending into is less than or equal to the size of
the VCTP in the tail predicated loop. These cases represent a
sign-extend-inreg (or zero-extend-inreg), which needn't block tail
predication as in https://godbolt.org/z/hdTsEbx8Y.

For this a vecsize has been added to the TSFlag bits of MVE
instructions, which stores the size of the elements that the MVE
instruction operates on. In the case of multiple sizes (such as a
MVE_VMOVLs8bh that extends from i8 to i16), the largest size was
chosen. The sizes are encoded as 00 = i8, 01 = i16, 10 = i32 and 11 =
i64, which often (but not always) comes from the instruction encoding
directly. A unit test was added, and although only a subset of the
vecsizes are currently used, the rest should be useful for other cases.

Differential Revision: https://reviews.llvm.org/D109706
2021-09-22 12:07:52 +01:00
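The 2-bit vecsize encoding described in 02cd8a6b91 (00 = i8, 01 = i16, 10 = i32, 11 = i64) and the "less than or equal to the VCTP size" rule can be sketched as follows. The position of the field within TSFlags is not given in the message, so this models only the 2-bit value itself; all names are hypothetical.

```python
# Element size in bits -> 2-bit vecsize field, as listed in the commit message.
VECSIZE_ENCODING = {8: 0b00, 16: 0b01, 32: 0b10, 64: 0b11}

def encode_vecsize(element_bits):
    """Map an element size in bits to its 2-bit encoding."""
    return VECSIZE_ENCODING[element_bits]

def decode_vecsize(field):
    """Map a 2-bit encoding back to an element size in bits."""
    return 8 << (field & 0b11)

def vmovl_allowed(vmovl_dest_bits, vctp_bits):
    """VMOVL is acceptable if the size it extends into does not exceed the VCTP size."""
    return vmovl_dest_bits <= vctp_bits
```

Under this model, an MVE_VMOVLs8bh (recorded as i16, the larger of its two sizes) would be acceptable in a loop predicated by a 16-bit or 32-bit VCTP, but not an 8-bit one.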
David Green
636fc0ef86 [ARM] Add additional tests for VMOVL in tail predicated loops. 2021-09-22 09:33:36 +01:00
David Green
3f90df22f1 [ARM] MVE reverse shuffles.
The vectorizer can sometimes make reverse shuffles from indices that
count down. In MVE, we don't have a 128bit rev instruction, but we can
select this to a VREV64 with some lane movs to swap the two halves.

Ideally this would use VMOVD's, but only gets as far as VMOVS's at the
moment.

Differential Revision: https://reviews.llvm.org/D69510
2021-09-20 13:48:01 +01:00
David Green
cb5e3f7959 [ARM] Prevent large integer VQDMULH pattern crashes
Put a limit on the size of constant integers we test when looking for
VQDMULH, to prevent it from crashing on values wider than 64 bits.
2021-09-18 18:47:02 +01:00
Matt Arsenault
4a36e96c3f RegAllocGreedy: Account for reserved registers in num regs heuristic
This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.

There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.

The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.
2021-09-14 21:00:29 -04:00
Nikita Popov
90ec6dff86 [OpaquePtr] Forbid mixing typed and opaque pointers
Currently, opaque pointers are supported in two forms: The
-force-opaque-pointers mode, where all pointers are opaque and
typed pointers do not exist. And as a simple ptr type that can
coexist with typed pointers.

This patch removes support for the mixed mode. You either get
typed pointers, or you get opaque pointers, but not both. In the
(current) default mode, using ptr is forbidden. In -opaque-pointers
mode, all pointers are opaque.

The motivation here is that the mixed mode introduces additional
issues that don't exist in fully opaque mode. D105155 is an example
of a design problem. Looking at D109259, it would probably need
additional work to support mixed mode (e.g. to generate GEPs for
typed base but opaque result). Mixed mode will also end up
inserting many casts between i8* and ptr, which would require
significant additional work to consistently avoid.

I don't think the mixed mode is particularly valuable, as it
doesn't align with our end goal. The only thing I've found it to
be moderately useful for is adding some opaque pointer tests in
between typed pointer tests, but I think we can live without that.

Differential Revision: https://reviews.llvm.org/D109290
2021-09-10 15:18:23 +02:00
Roman Lebedev
909cba9699 [SimplifyCFG] performBranchToCommonDestFolding(): require block-closed SSA form for bonus instructions (PR51125)
I can't seem to wrap my head around the proper fix here,
we should be fine without this requirement, iff we can form this form,
but the naive attempt (https://reviews.llvm.org/D106317) has failed.
So just to unblock the release, put up a restriction.

Fixes https://bugs.llvm.org/show_bug.cgi?id=51125
2021-09-09 12:28:09 +03:00
David Green
adfd12e6d1 [ARM] Add patterns for store(fptosisat(..))
As an extension to D107866, this adds store(fptosisat(..)) patterns,
similar to the existing fptosi patterns, to prevent unnecessarily moving
into gpr regs where we can use fp stores directly.

Differential Revision: https://reviews.llvm.org/D108378
2021-09-03 19:22:11 +01:00
David Green
f37e132263 [ARM] Add VFP lowering for fptosi.sat
This extends D107865 to the VFP instructions, lowering llvm.fptosi.sat
and llvm.fptoui.sat to VCVT instructions that inherently perform the
saturate.

Differential Revision: https://reviews.llvm.org/D107866
2021-09-03 18:11:08 +01:00
David Green
9cb8f4d1ad [ARM] Add a tail-predication loop predicate register
The semantics of tail predication loops means that the value of LR as an
instruction is executed determines the predicate. In other words:

mov r3, #3
DLSTP lr, r3        // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.

This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been set up, leading to
alignment exceptions.

This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg meaning not predicated, being set to the LR phi
value in the MVETPAndVPTOptimisationsPass. This will prevent it from
being spilled anywhere that it needs to be used.

A lot of tests needed updating.

Differential Revision: https://reviews.llvm.org/D107638
2021-09-02 13:42:58 +01:00
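The assembly example in 9cb8f4d1ad can be mimicked with a toy model in which the value of lr at the time an instruction executes decides how many lanes are written. This is a deliberately simplified sketch: it assumes four 32-bit lanes, ignores the loop-end decrement of lr, and uses a hypothetical function name.

```python
LANES = 4   # four 32-bit lanes in a 128-bit q register

def predicated_vadd(dst, a, b, lr):
    """Lane-wise add that only writes the first min(lr, LANES) lanes of dst."""
    active = min(lr, LANES)
    return [a[i] + b[i] if i < active else dst[i] for i in range(LANES)]
```

With q1 = [1, 2, 3, 4] and q2 = [10, 20, 30, 40], calling the model with lr = 3 writes lanes 0 to 2, while lr = 1 writes only lane 0. If lr were spilled and reloaded with a different value, a different set of lanes would be written, which is the hazard the new predicate operand guards against.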
David Green
49476a4d66 [ARM] Add MVE lowering for fptosi.sat
This adds lowering of the llvm.fptosi.sat and llvm.fptoui.sat intrinsics,
selecting a VCVT instruction which under MVE will inherently perform the
saturate.

Differential Revision: https://reviews.llvm.org/D107865
2021-09-01 22:38:47 +01:00
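For reference, the saturating behaviour that the VCVT instruction provides inherently can be written out as a scalar model of llvm.fptosi.sat and llvm.fptoui.sat: NaN becomes 0, out-of-range inputs clamp to the integer limits, and everything else rounds toward zero. This is a sketch of the intrinsic semantics only, not of the MVE lowering; the function names are chosen here for illustration.

```python
import math

def fptosi_sat(value, bits):
    """Saturating float -> signed integer conversion of the given width."""
    if math.isnan(value):
        return 0
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if value <= lo:
        return lo
    if value >= hi:
        return hi
    return math.trunc(value)       # round toward zero

def fptoui_sat(value, bits):
    """Saturating float -> unsigned integer conversion of the given width."""
    if math.isnan(value):
        return 0
    hi = (1 << bits) - 1
    if value <= 0:
        return 0
    if value >= hi:
        return hi
    return math.trunc(value)       # round toward zero
```

Infinities are handled by the range checks before truncation, so they clamp like any other out-of-range input.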
David Green
22c384129e [ARM] Add missing validForTailPredication for VMINNM/VMAXNM
Apparently this was missing, preventing the generation of tail
predication loops containing VMINNM, VMAXNM, VMINNMA and VMAXNMA.
2021-08-31 18:19:03 +01:00
David Green
198259becb [ARM] Test for VMINNM/VMAXNM in tail predicated loops. 2021-08-31 18:19:03 +01:00
David Green
bd0959354f [ARM] Add Extra FpToIntSat tests.
This adds extra MVE vector fptosi.sat and fptoui.sat tests, along with
adding or adjusting the existing scalar tests to cover more
architectures and instruction combinations.
2021-08-25 20:10:18 +01:00
Stanislav Mekhanoshin
92c1fd19ab Allow rematerialization of virtual reg uses
Currently isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization on any virtual register use on the grounds
that it is not a trivial rematerialization and that we do not want to
extend liveranges.

It appears that LRE logic does not attempt to extend a liverange of
a source register for rematerialization so that is not an issue.
That is checked in the LiveRangeEdit::allUsesAvailableAt().

The only non-trivial aspect of it is accounting for tied-defs which
normally represent a read-modify-write operation and are not rematerializable.

The test for a tied-def situation already exists in the
/CodeGen/AMDGPU/remat-vop.mir,
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.

The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets
where I more or less understand the asm it seems to reduce spilling
(as expected) or be neutral. However, it needs a review by all targets'
specialists.

Differential Revision: https://reviews.llvm.org/D106408
2021-08-24 11:09:02 -07:00
David Green
605489d593 [ARM] Fix VQDMULH fold for scalar smin
Add a variant of mve-vqdmulh tests that uses min/max intrinsics
directly, including a scalar test that shows it misbehaving for min
intrinsics and a fix for the combine to prevent it from misbehaving.
2021-08-21 16:33:18 +01:00
David Green
d10f23a25d [ISel] Expand saddsat and ssubsat via asr and xor
This changes the lowering of saddsat and ssubsat so that instead of
using:
  r,o = saddo x, y
  c = setcc r < 0
  s = c ? INTMAX : INTMIN
  ret o ? s : r
it uses asr and xor to materialize the INTMAX/INTMIN constants:
  r,o = saddo x, y
  s = ashr r, BW-1
  x = xor s, INTMIN
  ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD

This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.

Differential Revision: https://reviews.llvm.org/D105853
2021-08-19 16:08:07 +01:00
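The two saddsat expansions described in d10f23a25d can be compared exhaustively at a small bit width. The sketch below models both in 8-bit two's complement arithmetic and checks them against a clamped sum; the width and helper names are assumptions for illustration, and ssubsat is omitted.

```python
BW = 8                                  # illustrative bit width (assumption)
INTMIN = -(1 << (BW - 1))
INTMAX = (1 << (BW - 1)) - 1

def wrap(v):
    """Wrap an integer to a signed BW-bit two's complement value."""
    v &= (1 << BW) - 1
    return v - (1 << BW) if v > INTMAX else v

def saddo(x, y):
    """Wrapping add plus an overflow flag, like the saddo node."""
    r = wrap(x + y)
    return r, r != x + y

def saddsat_select(x, y):
    """Old expansion: pick INTMAX or INTMIN with a compare and select."""
    r, o = saddo(x, y)
    s = INTMAX if r < 0 else INTMIN
    return s if o else r

def saddsat_asr_xor(x, y):
    """New expansion: ashr smears the sign bit, xor with INTMIN flips it into the limit."""
    r, o = saddo(x, y)
    s = r >> (BW - 1)                   # 0 if r >= 0, -1 (all ones) if r < 0
    sat = wrap(s ^ INTMIN)              # INTMIN if r >= 0, INTMAX if r < 0
    return sat if o else r
```

On overflow the wrapped result has the opposite sign from the true sum, which is why a negative r selects INTMAX and a non-negative r selects INTMIN.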
David Green
765a421276 [ARM] Add MVE min/max intrinsic tests. NFC 2021-08-19 14:33:34 +01:00
Petr Hosek
2d4470ab89 Revert "Allow rematerialization of virtual reg uses"
This reverts commit 877572cc19 which
introduced PR51516.
2021-08-18 00:12:41 -07:00
David Green
52e0cf9d61 [ARM] Enable subreg liveness
This enables subreg liveness in the arm backend when MVE is present,
which allows the register allocator to detect when subregisters are
alive/dead, compared to only acting on full registers. This can help
produce better code on MVE with the way MQPR registers are made up of
SPR registers, but is especially helpful for MQQPR and MQQQQPR
registers, where there are very few "registers" available and being able
to split them up into subregs can help produce much better code.

Differential Revision: https://reviews.llvm.org/D107642
2021-08-17 14:10:33 +01:00