Commit Graph

3501 Commits

Author SHA1 Message Date
Diana Picus
6beef3c087 [ARM] GlobalISel: Select double G_FADD and copies
Just use VADDD if available, bail out if not.

llvm-svn: 295309
2017-02-16 12:19:52 +00:00
Diana Picus
9b32faa821 [ARM] GlobalISel: Assert that we don't use the FPR bank if we don't have VFP
llvm-svn: 295308
2017-02-16 11:25:09 +00:00
Diana Picus
a93803b9fe [ARM] GlobalISel: Add reg bank mappings for G_SEQUENCE and G_EXTRACT
Support G_SEQUENCE and G_EXTRACT as needed for passing double precision floating
point values in the soft-fp float mode.

llvm-svn: 295306
2017-02-16 11:00:31 +00:00
Diana Picus
7f82c87022 [ARM] GlobalISel: Make the FPR bank 64-bit wide
Also add mappings for single and double precision FP, and use them for G_FADD
and G_LOAD.

llvm-svn: 295302
2017-02-16 10:12:49 +00:00
Diana Picus
21c3d8e0fc [ARM] GlobalISel: Legalize 64-bit G_FADD and G_LOAD
For now we just mark them as legal all the time and let the other passes bail
out if they can't handle it. In the future, we'll want to move more of the
brains into the legalizer.

llvm-svn: 295300
2017-02-16 09:09:49 +00:00
Diana Picus
ca6a890d7f [ARM] GlobalISel: Lower double precision FP args
For the hard float calling convention, we just use the D registers.

For the soft-fp calling convention, we use the R registers and move values
to/from the D registers by means of G_SEQUENCE/G_EXTRACT. While doing so, we
make sure to honor the endianness of the target, since the CCAssignFn doesn't do
that for us.

For pure soft float targets, we still bail out because we don't support the
libcalls yet.

llvm-svn: 295295
2017-02-16 07:53:07 +00:00
Kyle Butt
7fbec9bdf1 Codegen: Make chains from trellis-shaped CFGs
Lay out trellis-shaped CFGs optimally.
A trellis of the shape below:

  A     B
  |\   /|
  | \ / |
  |  X  |
  | / \ |
  |/   \|
  C     D

would be laid out A; B->C ; D by the current layout algorithm. Now we identify
trellises and lay them out either A->C; B->D or A->D; B->C. This scales with an
increasing number of predecessors. A trellis is a a group of 2 or more
predecessor blocks that all have the same successors.

because of this we can tail duplicate to extend existing trellises.

As an example consider the following CFG:

    B   D   F   H
   / \ / \ / \ / \
  A---C---E---G---Ret

Where A,C,E,G are all small (Currently 2 instructions).

The CFG preserving layout is then A,B,C,D,E,F,G,H,Ret.

The current code will copy C into B, E into D and G into F and yield the layout
A,C,B(C),E,D(E),F(G),G,H,ret

define void @straight_test(i32 %tag) {
entry:
  br label %test1
test1: ; A
  %tagbit1 = and i32 %tag, 1
  %tagbit1eq0 = icmp eq i32 %tagbit1, 0
  br i1 %tagbit1eq0, label %test2, label %optional1
optional1: ; B
  call void @a()
  br label %test2
test2: ; C
  %tagbit2 = and i32 %tag, 2
  %tagbit2eq0 = icmp eq i32 %tagbit2, 0
  br i1 %tagbit2eq0, label %test3, label %optional2
optional2: ; D
  call void @b()
  br label %test3
test3: ; E
  %tagbit3 = and i32 %tag, 4
  %tagbit3eq0 = icmp eq i32 %tagbit3, 0
  br i1 %tagbit3eq0, label %test4, label %optional3
optional3: ; F
  call void @c()
  br label %test4
test4: ; G
  %tagbit4 = and i32 %tag, 8
  %tagbit4eq0 = icmp eq i32 %tagbit4, 0
  br i1 %tagbit4eq0, label %exit, label %optional4
optional4: ; H
  call void @d()
  br label %exit
exit:
  ret void
}

here is the layout after D27742:
straight_test:                          # @straight_test
; ... Prologue elided
; BB#0:                                 # %entry ; A (merged with test1)
; ... More prologue elided
	mr 30, 3
	andi. 3, 30, 1
	bc 12, 1, .LBB0_2
; BB#1:                                 # %test2 ; C
	rlwinm. 3, 30, 0, 30, 30
	beq	 0, .LBB0_3
	b .LBB0_4
.LBB0_2:                                # %optional1 ; B (copy of C)
	bl a
	nop
	rlwinm. 3, 30, 0, 30, 30
	bne	 0, .LBB0_4
.LBB0_3:                                # %test3 ; E
	rlwinm. 3, 30, 0, 29, 29
	beq	 0, .LBB0_5
	b .LBB0_6
.LBB0_4:                                # %optional2 ; D (copy of E)
	bl b
	nop
	rlwinm. 3, 30, 0, 29, 29
	bne	 0, .LBB0_6
.LBB0_5:                                # %test4 ; G
	rlwinm. 3, 30, 0, 28, 28
	beq	 0, .LBB0_8
	b .LBB0_7
.LBB0_6:                                # %optional3 ; F (copy of G)
	bl c
	nop
	rlwinm. 3, 30, 0, 28, 28
	beq	 0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
	bl d
	nop
.LBB0_8:                                # %exit ; Ret
	ld 30, 96(1)                    # 8-byte Folded Reload
	addi 1, 1, 112
	ld 0, 16(1)
	mtlr 0
	blr

The tail-duplication has produced some benefit, but it has also produced a
trellis which is not laid out optimally. With this patch, we improve the layouts
of such trellises, and decrease the cost calculation for tail-duplication
accordingly.

This patch produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have
back edges, which is a negative, but it has a bigger compensating
positive, which is that it handles the case where there are long strings
of skipped blocks much better than the original layout. Both layouts
handle runs of executed blocks equally well. Branch prediction also
improves if there is any correlation between subsequent optional blocks.

Here is the resulting concrete layout:

straight_test:                          # @straight_test
; BB#0:                                 # %entry ; A (merged with test1)
	mr 30, 3
	andi. 3, 30, 1
	bc 12, 1, .LBB0_4
; BB#1:                                 # %test2 ; C
	rlwinm. 3, 30, 0, 30, 30
	bne	 0, .LBB0_5
.LBB0_2:                                # %test3 ; E
	rlwinm. 3, 30, 0, 29, 29
	bne	 0, .LBB0_6
.LBB0_3:                                # %test4 ; G
	rlwinm. 3, 30, 0, 28, 28
	bne	 0, .LBB0_7
	b .LBB0_8
.LBB0_4:                                # %optional1 ; B (Copy of C)
	bl a
	nop
	rlwinm. 3, 30, 0, 30, 30
	beq	 0, .LBB0_2
.LBB0_5:                                # %optional2 ; D (Copy of E)
	bl b
	nop
	rlwinm. 3, 30, 0, 29, 29
	beq	 0, .LBB0_3
.LBB0_6:                                # %optional3 ; F (Copy of G)
	bl c
	nop
	rlwinm. 3, 30, 0, 28, 28
	beq	 0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
	bl d
	nop
.LBB0_8:                                # %exit

Differential Revision: https://reviews.llvm.org/D28522

llvm-svn: 295223
2017-02-15 19:49:14 +00:00
Reid Kleckner
a622fc9bdf [BranchFolding] Tail common all identical unreachable blocks
Summary:
Blocks ending in unreachable are typically cold because they end the
program or throw an exception, so merging them with other identical
blocks is usually profitable because it reduces the size of cold code.
MachineBlockPlacement generally does not arrange to fall through to such
blocks, so commoning these blocks will not introduce additional
unconditional branches.

Reviewers: hans, iteratee, haicheng

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D29153

llvm-svn: 295105
2017-02-14 21:02:24 +00:00
Arnold Schwaighofer
8f3df731dc swiftcc: Don't emit tail calls from callers with swifterror parameters
Backends don't support this yet. They would have to move to the swifterror
register before the tail call to make sure it is live-in to the call.

rdar://30495920

llvm-svn: 294982
2017-02-13 19:58:28 +00:00
James Molloy
0ae2202235 [ARM] Fix crash caused by r294945
I'd missed a creator of FCMP nodes - duplicateCmp().

Kindly and promptly reported by Gabor Ballabas, due to his CSiBE test suite.

llvm-svn: 294968
2017-02-13 17:18:00 +00:00
Sanne Wouda
490d4a6da6 [CodeGen] fix alignment of JUMPTABLE_INSTS on v8M.base
Summary:
The attached test case fails with "fatal error: error in backend:
misaligned pc-relative fixup value" as the jump table is misaligned.
The EmitAlignment existed already for ARM and Thumb-1 code, but was
missing for Thumb-2.

The test checks that the fatal error disappears when generating an obj
file, as well as checking the align directive is there when producing an
asm file.


Reviewers: rengolin, grosbach, t.p.northover, jmolloy, SjoerdMeijer, samparker

Reviewed By: samparker

Subscribers: samparker, aemerson, llvm-commits

Differential Revision: https://reviews.llvm.org/D29650

llvm-svn: 294950
2017-02-13 14:07:45 +00:00
James Molloy
d508789668 [ARM] Use VCMP, not VCMPE, for floating point equality comparisons
When generating a floating point comparison we currently unconditionally
generate VCMPE. This has the sideeffect of setting the cumulative Invalid
bit in FPSCR if any of the operands are QNaN.

It is expected that use of a relational predicate on a QNaN value should
raise Invalid. Quoting from the C standard:

  The relational and equality operators support the usual mathematical
  relationships between numeric values. For any ordered pair of numeric
  values exactly one of relationships the less, greater, equal and is true.
  Relational operators may raise the floating-point exception when argument
  values are NaNs.

The standard doesn't explicitly state the expectation for equality operators,
but the implication and obvious expectation is that equality operators
should not raise Invalid on a QNaN input, as those predicates are wholly
defined on unordered inputs (to return not equal).

Therefore, add a new operand to ARMISD::FPCMP and FPCMPZ indicating if
QNaN should raise Invalid, and pipe that through to TableGen.

llvm-svn: 294945
2017-02-13 12:32:47 +00:00
John Brawn
e60f4e4b8d [ARM] Fix incorrect mask bits in MSR encoding for write_register intrinsic
In the encoding of system registers in the M-class MSR instruction the mask bits
should be 2 for registers that don't take a _<bits> qualifier (the instruction
is unpredictable otherwise), and should also be 2 if the register takes a
_<bits> qualifier but it's not present as no _<bits> is an alias for _nzcvq.

Differential Revision: https://reviews.llvm.org/D29828

llvm-svn: 294762
2017-02-10 17:41:08 +00:00
George Burgess IV
ccf11c2f9f [ARM] Add support for armv7ve triple in llvm (PR31358).
Gcc supports target armv7ve which is armv7-a with virtualization
extensions. This change adds support for this in llvm for gcc
compatibility.

Also remove redundant FeatureHWDiv, FeatureHWDivARM for a few models as
this is specified automatically by FeatureVirtualization.

Patch by Manoj Gupta.

Differential Revision: https://reviews.llvm.org/D29472

llvm-svn: 294661
2017-02-09 23:29:14 +00:00
Artur Pilipenko
0e4583b56c Add DAGCombiner load combine tests for partially available values
If some of the trailing or leading bytes of a load combine pattern are zeroes we can combine the pattern to a load + zext and shift. Currently we don't support it, so the tests check the current codegen without load combine. This change will make the patch to support this kind of combine a bit more clear.

llvm-svn: 294591
2017-02-09 15:13:40 +00:00
Diana Picus
7232af352f [ARM] GlobalISel: Lower single precision FP args
Both for aapcscc and aapcs_vfpcc. We currently filter out soft float targets
because we don't support libcalls yet.

llvm-svn: 294584
2017-02-09 13:09:59 +00:00
Artur Pilipenko
4a64031954 [DAGCombiner] Support non-zero offset in load combine
Enable folding patterns which load the value from non-zero offset:

  i8 *a = ...
  i32 val = a[4] | (a[5] << 8) | (a[6] << 16) | (a[7] << 24)
=>
  i32 val = *((i32*)(a+4))

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D29394

llvm-svn: 294582
2017-02-09 12:06:01 +00:00
Arnold Schwaighofer
26f016f143 SwiftCC: swifterror register cannot be as the base register
Functions that have a dynamic alloca require a base register which is defined to
be X19 on AArch64 and r6 on ARM.  We have defined the swifterror register to be
the same register. Use a different callee save register for swifterror instead:

 X21 on AArch64
 R8 on ARM

rdar://30433803

llvm-svn: 294551
2017-02-09 01:52:17 +00:00
Arnold Schwaighofer
db7bbcbe78 [ARM/AArch ISel] SwiftCC: First parameters that are marked swiftself are not 'this returns'
We mark X0 as preserved by a call that passes the returned parameter.

 x0 = ...
 fun(x0) // no implicit def of x0

This no longer is valid if we pass the parameter in a different register then
the returned value as is the case with a swiftself parameter (passed in x20).

x20 = ...
fun(x20) // there should be an implict def of x8

rdar://30425845

llvm-svn: 294527
2017-02-08 22:30:47 +00:00
Diana Picus
e79e5ee244 Fix test to work on swift/cyclone too
I forgot to remove the neonfp target feature from the test, which means we'd
have trouble selecting VADDS on targets that have neonfp enabled by default.

llvm-svn: 294451
2017-02-08 14:23:30 +00:00
Diana Picus
4fa83c03fd [ARM] GlobalISel: Add FPR reg bank
Add a register bank for floating point values and select simple instructions
using them (add, copies from GPR).

This assumes that the hardware can cope with a single precision add (VADDS)
instruction, so the legalizer will treat G_FADD as legal and the instruction
selector will refuse to select if the hardware doesn't support it. In the future
we'll want to be more careful about this, and legalize to libcalls if we have to
use soft float.

llvm-svn: 294442
2017-02-08 13:23:04 +00:00
Artur Pilipenko
469596ef87 Add DAGCombiner load combine tests for {a|s}ext, {a|z|s}ext load nodes
Currently we don't support these nodes, so the tests check the current codegen without load combine. This change makes the review of the change to support these nodes more clear.

Separated from https://reviews.llvm.org/D29591 review.

llvm-svn: 294305
2017-02-07 14:09:37 +00:00
Christof Douma
d3ed8380e0 [ARM] Make RWPI use movw/movt when available
When constructing global address literals while targeting the RWPI
relocation model. LLVM currently only uses literal pools. If MOVW/MOVT
instructions are available we can use these instead. Beside being more
efficient it allows -arm-execute-only to work with
-relocation-model=RWPI as well.

When we generate MOVW/MOVT for global addresses when targeting the RWPI
relocation model, we need to use base relative relocations. This patch
does the needed plumbing in MC to generate these for MOVW/MOVT.

Differential Revision: https://reviews.llvm.org/D29487

Change-Id: I446786e43a6f5aa9b6a5bb2cd216d60d41c7755d
llvm-svn: 294298
2017-02-07 13:07:12 +00:00
Artur Pilipenko
d3464bf9ad [DAGCombiner] Support bswap as a part of load combine patterns
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D29397

llvm-svn: 294201
2017-02-06 17:48:08 +00:00
Artur Pilipenko
bdf3c5af6a Add DAGCombiner load combine tests with non-zero offset
This is separated from https://reviews.llvm.org/D29394 review.

llvm-svn: 294185
2017-02-06 14:15:31 +00:00
Matthias Braun
82e7f4d877 MachineCopyPropagation: Respect implicit operands of COPY
The code missed to check implicit operands of COPY instructions for
defs/uses.

Differential Revision: https://reviews.llvm.org/D29522

llvm-svn: 294088
2017-02-04 02:27:20 +00:00
Sanne Wouda
a994185757 [ARM] Change TCReturn to tBL if tailcall optimization fails.
Summary:
The tail call optimisation is performed before register allocation, so
at that point we don't know if LR is being spilt or not. If LR was spilt
to the stack, then we cannot do a tail call optimisation. That would
involve popping back into LR which is not possible in Thumb1 code.

Reviewers: rengolin, jmolloy, rovka, olista01

Reviewed By: olista01

Subscribers: llvm-commits, aemerson

Differential Revision: https://reviews.llvm.org/D29020

llvm-svn: 294000
2017-02-03 11:15:53 +00:00
Sanne Wouda
57b63d6ade [LLC] Add an inline assembly diagnostics handler.
Summary:
llc would hit a fatal error for errors in inline assembly. The
diagnostics message is now printed.

Reviewers: rengolin, MatzeB, javed.absar, anemet

Reviewed By: anemet

Subscribers: jyknight, nemanjai, llvm-commits

Differential Revision: https://reviews.llvm.org/D29408

llvm-svn: 293999
2017-02-03 11:14:39 +00:00
Javed Absar
bb8dcc6aec [ARM] Classification Improvements to ARM Sched-Model. NFCI.
This is the second in the series of patches to enable adding
of machine sched-models for ARM processors easier and compact.
This patch focuses on integer instructions and adds missing
sched definitions.

Reviewers: rovka, rengolin
Differential Revision: https://reviews.llvm.org/D29127

llvm-svn: 293935
2017-02-02 21:08:12 +00:00
Nirav Dave
93f9d5ce04 Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r293893 which is miscompiling lua on ARM and
bootstrapping for x86-windows.

llvm-svn: 293915
2017-02-02 18:24:55 +00:00
Nirav Dave
4442667fc5 In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommiting after fixing X86 inc/dec chain bug.

    * Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 293893
2017-02-02 14:39:42 +00:00
Diana Picus
32cd9b434c [ARM] GlobalISel: Lower pointer args and returns
It is important to change the ArgInfo's type from pointer to integer, otherwise
the CC assign function won't know what to do. Instead of hacking it up, we use
ComputeValueVTs and introduce some of the helpers that we will need later on for
lowering more complex types.

llvm-svn: 293889
2017-02-02 14:01:00 +00:00
Diana Picus
fc19a8ff07 [ARM] GlobalISel: Legalize loading pointers
Make it legal to load pointer values. Also check that pointers are assigned
to the GPR reg bank by default.

llvm-svn: 293886
2017-02-02 13:20:49 +00:00
Diana Picus
f8c5d93212 [ARM] GlobalISel: Test default banks for load results. NFC.
Check that all scalars are loaded into the GPR by default.

llvm-svn: 293883
2017-02-02 13:00:24 +00:00
Javed Absar
e5ad87e939 [ARM] Enable Cortex-M23 and Cortex-M33 support.
Add both cores to the target parser and TableGen. Test that eabi
attributes are set correctly for both cores. Additionally, test the
absence and presence of MOVT in Cortex-M23 and Cortex-M33, respectively.

Committed on behalf of Sanne Wouda.
Reviewers : rengolin, olista01.

Differential Revision: https://reviews.llvm.org/D29073

llvm-svn: 293761
2017-02-01 11:55:03 +00:00
Kyle Butt
b15c06677c CodeGen: Allow small copyable blocks to "break" the CFG.
When choosing the best successor for a block, ordinarily we would have preferred
a block that preserves the CFG unless there is a strong probability the other
direction. For small blocks that can be duplicated we now skip that requirement
as well, subject to some simple frequency calculations.

Differential Revision: https://reviews.llvm.org/D28583

llvm-svn: 293716
2017-01-31 23:48:32 +00:00
Sam Parker
9bf658d5fe [ARM] Avoid using ARM instructions in Thumb mode
The Requires class overrides the target requirements of an instruction,
rather than adding to them, so all ARM instructions need to include the
IsARM predicate when they have overwitten requirements.

This caused the swp and swpb instructions to be allowed in thumb mode
assembly, and the ARM encoding of CDP to be selected in codegen (which
is different for conditional instructions).

Differential Revision: https://reviews.llvm.org/D29283

llvm-svn: 293634
2017-01-31 14:35:01 +00:00
Matt Arsenault
0c687390fe DAG: Constant fold fp16_to_fp/fp16_to_fp
This fixes emitting conversions of constants on targets
without legal f16 that need to use these for legalization.

llvm-svn: 293499
2017-01-30 16:57:41 +00:00
Saleem Abdulrasool
5282eed06c ARM: support -mlong-calls with AEABI TLS on ELF
Support lowering AEABI TLS access (__aeabi_read_tp) with long calls.
This requires adjusting the call sequence to use an indirect call to get
full addressability.

Resolves PR31769!

llvm-svn: 293433
2017-01-29 16:46:22 +00:00
Matthew Simpson
3650df13be [ARM/AArch64] Relocate and update InterleavedAccessPass tests (NFC)
The interleaved access pass is an IR-to-IR transformation that runs before code
generation. It matches interleaved memory operations to target-specific
intrinsics (that are later lowered to load and store multiple instructions on
ARM/AArch64). We place tests for similar passes (e.g., GlobalMergePass) under
test/Transforms. This patch moves the InterleavedAccessPass tests out of
test/CodeGen and into target-specific directories under
test/Transforms/InterleavedAccess.

Although the pass is an IR pass, many of the existing tests were llc tests
rather opt tests. For example, the tests would check for ldN/stN instructions
generated by llc rather than the intrinsic calls the pass actually inserts.
Thus, this patch updates all tests to be opt tests that check for the inserted
intrinsics. We already have separate CodeGen tests that ensure we lower the
interleaved access intrinsics to their corresponding ldN/stN instructions. In
addition to migrating the tests to opt, this patch also performs some minor
clean-up (to ensure consistent naming, etc.).

Differential Revision: https://reviews.llvm.org/D29184

llvm-svn: 293309
2017-01-27 17:33:16 +00:00
Saleem Abdulrasool
26c00e3700 ARM: fix vectorized division on WoA
The Windows on ARM target uses custom division for normal division as
the backend needs to insert division-by-zero checks.  However, it is
designed to only handle non-vectorized division.  ARM has custom
lowering for vectorized division as that can avoid loading registers
with the values and invoke a division routine for each one, preferring
to lower using NEON instructions.  Fall back to the custom lowering for
the NEON instructions if we encounter a vectorized division.

Resolves PR31778!

llvm-svn: 293259
2017-01-27 03:41:53 +00:00
Nirav Dave
d32a421f75 Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."
This reverts commit r293184 which is failing in LTO builds

llvm-svn: 293188
2017-01-26 16:46:13 +00:00
Nirav Dave
de6516c466 In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
* Simplify Consecutive Merge Store Candidate Search

    Now that address aliasing is much less conservative, push through
    simplified store merging search and chain alias analysis which only
    checks for parallel stores through the chain subgraph. This is cleaner
    as the separation of non-interfering loads/stores from the
    store-merging logic.

    When merging stores search up the chain through a single load, and
    finds all possible stores by looking down from through a load and a
    TokenFactor to all stores visited.

    This improves the quality of the output SelectionDAG and the output
    Codegen (save perhaps for some ARM cases where we correctly constructs
    wider loads, but then promotes them to float operations which appear
    but requires more expensive constant generation).

    Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

    Additional Minor Changes:

      1. Finishes removing unused AliasLoad code

      2. Unifies the chain aggregation in the merged stores across code
         paths

      3. Re-add the Store node to the worklist after calling
         SimplifyDemandedBits.

      4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
         arbitrary, but seems sufficient to not cause regressions in
         tests.

      5. Remove Chain dependencies of Memory operations on CopyfromReg
         nodes as these are captured by data dependence

      6. Forward loads-store values through tokenfactors containing
          {CopyToReg,CopyFromReg} Values.

      7. Peephole to convert buildvector of extract_vector_elt to
         extract_subvector if possible (see
         CodeGen/AArch64/store-merge.ll)

      8. Store merging for the ARM target is restricted to 32-bit as
         some in some contexts invalid 64-bit operations are being
         generated. This can be removed once appropriate checks are
         added.

    This finishes the change Matt Arsenault started in r246307 and
    jyknight's original patch.

    Many tests required some changes as memory operations are now
    reorderable, improving load-store forwarding. One test in
    particular is worth noting:

      CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
      forwarding converts a load-store pair into a parallel store and
      a memory-realized bitcast of the same value. However, because we
      lose the sharing of the explicit and implicit store values we
      must create another local store. A similar transformation
      happens before SelectionDAG as well.

    Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle

llvm-svn: 293184
2017-01-26 16:02:24 +00:00
Diana Picus
278c722e6d [ARM] GlobalISel: Load i1, i8 and i16 args from stack
Add support for loading i1, i8 and i16 arguments from the stack, with or without
the ABI extension flags.

When the ABI extension flags are present, we load a 4-byte value, otherwise we
preserve the size of the load and let the instruction selector replace it with a
LDRB/LDRH. This generates the same thing as DAGISel.

Differential Revision: https://reviews.llvm.org/D27803

llvm-svn: 293163
2017-01-26 09:20:47 +00:00
Tim Northover
470f070b7d SDag: fix how initial loads are formed when splitting vector ops.
Later code expects the vector loads produced to be directly
concatenable, which means we shouldn't pad anything except the last load
produced with UNDEF.

llvm-svn: 293088
2017-01-25 20:58:26 +00:00
Artur Pilipenko
41c0005aa3 [DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2.
The previous patch (https://reviews.llvm.org/rL289538) got reverted because of a bug. Chandler also requested some changes to the algorithm.
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161212/413479.html

This is an updated patch. The key difference is that collectBitProviders (renamed to calculateByteProvider) now collects the origin of one byte, not the whole value. It simplifies the implementation and allows to stop the traversal earlier if we know that the result won't be used.

From the original commit:

Match a pattern where a wide type scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load or a load and a bswap if the targets supports it.

Assuming little endian target:
  i8 *a = ...
  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
  i32 val = *((i32)a)

  i8 *a = ...
  i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
  i32 val = BSWAP(*((i32)a))

This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.

Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
  i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)

Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.

The general scheme is to match OR expressions by recursively calculating the origin of individual bytes which constitute the resulting OR value. If all the OR bytes come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.

Reviewed By: RKSimon, filcab, chandlerc 

Differential Revision: https://reviews.llvm.org/D27861

llvm-svn: 293036
2017-01-25 08:53:31 +00:00
Diana Picus
d83df5d372 [ARM] GlobalISel: Support i1 add and ABI extensions
Add support for:
* i1 add
* i1 function arguments, if passed through registers
* i1 returns, with ABI signext/zeroext

Differential Revision: https://reviews.llvm.org/D27706

llvm-svn: 293035
2017-01-25 08:47:40 +00:00
Diana Picus
8b6c6bedcb [ARM] GlobalISel: Support i8/i16 ABI extensions
At the moment, this means supporting the signext/zeroext attribute on the return
type of the function. For function arguments, signext/zeroext should be handled
by the caller, so there's nothing for us to do until we start lowering calls.

Note that this does not include support for other extensions (i8 to i16), those
will be added later.

Differential Revision: https://reviews.llvm.org/D27705

llvm-svn: 293034
2017-01-25 08:10:40 +00:00
Javed Absar
00cce41752 [ARM] Classification Improvements to ARM Sched-Models. NFCI.
This is a series of patches to enable adding of machine sched
models for ARM processors easier and compact. They define new
sched-readwrites for groups of ARM instructions. This has been
missing so far, and as a consequence, machine scheduler models
for individual sub-targets have tended to be larger than they
needed to be. 

The current patch focuses on floating-point instructions.

Reviewers: Diana Picus (rovka), Renato Golin (rengolin)

Differential Revision: https://reviews.llvm.org/D28194

llvm-svn: 292825
2017-01-23 20:20:39 +00:00
Sjoerd Meijer
2db2a947f6 [Thumb] Add support for tMUL in the compare instruction peephole optimizer.
We also want to optimise tests like this: return a*b == 0.  The MULS
instruction is flag setting, so we don't need the CMP instruction but can
instead branch on the result of the MULS. The generated instructions sequence
for this example was: MULS, MOVS, MOVS, CMP. The MOVS instruction load the
boolean values resulting from the select instruction, but these MOVS
instructions are flag setting and were thus preventing this optimisation. Now
we first reorder and move the MULS to before the CMP and generate sequence
MOVS, MOVS, MULS, CMP so that the optimisation could trigger. Reordering of the
MULS and MOVS is safe to do because the subsequent MOVS instructions just set
the CPSR register and don't use it, i.e. the CPSR is dead.

Differential Revision: https://reviews.llvm.org/D27990

llvm-svn: 292608
2017-01-20 13:10:12 +00:00