clang-p2996

Author	SHA1	Message	Date
Florian Hahn	68469a80cb	[LV] Disable runtime unrolling for vectorized loops. This patch adds metadata to disable runtime unrolling to the vectorized loop. If runtime unrolling/interleaving is considered profitable, LV will interleave the loop directly. There should be no need to perform runtime unrolling at a later stage. Note that we already add metadata to disable runtime unrolling to the scalar loop after vectorization. The additional unrolling unnecessarily increases code size and compile time. In addition to that we have several bug reports of unncessary runtime unrolling for vectorized loops, e.g. PR40961 Compile-time improvements: NewPM-O3: -1.04% NewPM-ReleaseThinLTO: -0.59% NewPM-ReleaseLTO-g: -0.97% https://llvm-compile-time-tracker.com/compare.php?from=ce1be13a868d0f8afa367975558c1a6175cce33a&to=78bc2e67f22e9e10e61cdb6cdac4bb857d95eb1b&stat=instructions:u Fixes #40306. Reviewed By: lebedev.ri, nikic Differential Revision: https://reviews.llvm.org/D115261	2023-01-06 10:56:17 +00:00
Sanjay Patel	d5f8878a6e	[InstCombine] canonicalize insertelement order based on index This puts lower insert indexes before higher. This is independent of endian, so it requires an adjustment to a fold added with `4446f71ce3`, but it makes that fold more robust. That's also where this patch was suggested - D139668. This matches what we already do in DAGCombiner, but there is one more constraint because there's an existing canonicalization for insert-of-scalar-constant. I'm not sure if that is still needed, so it may be adjusted/removed as a follow-up.	2022-12-18 07:08:48 -05:00
Sanjay Patel	4446f71ce3	[InstCombine] try to fold a pair of insertelements into one insertelement This replaces patches that tried to convert related patterns to shuffles (D138872, D138873, D138874 - reverted/abandoned) but caused codegen problems and were questionable as a canonicalization because an insertelement is a simpler op than a shuffle. This detects a larger pattern -- insert-of-insert -- and replaces with another insert, so this hopefully does not cause any problems. As noted by TODO items in the code and tests, this could go a lot further. But this is enough to reduce the motivating test from issue #17113. Example proofs: https://alive2.llvm.org/ce/z/NnUv3a I drafted a version of this for AggressiveInstCombine, but it seems that would uncover yet another phase ordering gap. If we do generalize this to handle the full range of potential patterns, that may be worth looking at again. Differential Revision: https://reviews.llvm.org/D139668	2022-12-12 10:39:58 -05:00
Sanjay Patel	05dbdb0088	Revert "[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1 (2nd try)" This reverts commit `e71b81cab0`. As discussed in the planned follow-on to this patch (D138874), this and the subsequent patches in this set can cause trouble for the backend, and there's probably no quick fix. We may even want to canonicalize in the opposite direction (towards insertelt).	2022-12-08 14:16:46 -05:00
Roman Lebedev	75d1a815c3	[NFC] Port all PhaseOrdering tests to `-passes=` syntax	2022-12-08 02:38:50 +03:00
Sanjay Patel	e71b81cab0	[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1 (2nd try) The first attempt was reverted because a clang test changed unexpectedly - the file is already marked with a FIXME, so I just updated it this time to pass. Original commit message: This is the main patch for converting a truncated scalar that is inserted into a vector to bitcast+shuffle. We could go either way on patterns like this, but this direction will allow collapsing a pair of these sequences on the motivating example from issue The patch is split into 3 parts to make it easier to see the progression of tests diffs. We allow inserting/shuffling into a different size vector for flexibility, so there are several test variations. The length-changing is handled by shortening/padding the shuffle mask with undef elements. In part 1, handle the basic pattern: inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask Proof for the endian-dependency behaving as expected: https://alive2.llvm.org/ce/z/BsA7yC The TODO items for handling shifts and insert into an arbitrary base vector value are implemented as follow-ups. Differential Revision: https://reviews.llvm.org/D138872	2022-11-30 14:52:20 -05:00
Sanjay Patel	5eacdcff06	Revert "[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1" This reverts commit `a4c466766d`. This broke clang tests that are wrongly dependent on the optimizer.	2022-11-30 14:10:50 -05:00
Sanjay Patel	a4c466766d	[InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 1 This is the main patch for converting a truncated scalar that is inserted into a vector to bitcast+shuffle. We could go either way on patterns like this, but this direction will allow collapsing a pair of these sequences on the motivating example from issue The patch is split into 3 parts to make it easier to see the progression of tests diffs. We allow inserting/shuffling into a different size vector for flexibility, so there are several test variations. The length-changing is handled by shortening/padding the shuffle mask with undef elements. In part 1, handle the basic pattern: inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask Proof for the endian-dependency behaving as expected: https://alive2.llvm.org/ce/z/BsA7yC The TODO items for handling shifts and insert into an arbitrary base vector value are implemented as follow-ups. Differential Revision: https://reviews.llvm.org/D138872	2022-11-30 13:22:04 -05:00
Sanjay Patel	c7bd82dfd8	[PhaseOrdering] add test for vector load combining; NFC This is another example from issue #17113	2022-11-28 16:00:06 -05:00
Matt Arsenault	1c55cc600e	PhaseOrdering: Convert tests to opaque pointers Required manually running update_test_checks: AArch64/hoisting-sinking-required-for-vectorization.ll AArch64/peel-multiple-unreachable-exits-for-vectorization.ll ARM/arm_mult_q15.ll X86/hoist-load-of-baseptr.ll X86/spurious-peeling.ll	2022-11-27 21:26:41 -05:00
Sanjay Patel	163bb6d64e	[Passes][VectorCombine] enable early run generally and try load folds An early run of VectorCombine was added with D102496 specifically to deal with unnecessary vector ops produced with the C matrix extension. This patch is proposing to try those folds in general and add a pair of load folds to the menu. The load transform will partly solve (see PhaseOrdering diffs) a longstanding vectorization perf bug by removing redundant loads via GVN: issue #17113 The main reason for not enabling the extra pass generally in the initial patch was compile-time cost. The cost of VectorCombine was significantly (surprisingly) improved with: `87debdadaf` https://llvm-compile-time-tracker.com/compare.php?from=ffe05b8f57d97bc4340f791cb386c8d00e0739f2&to=87debdadaf18f8a5c7e5d563889e10731dc3554d&stat=instructions:u ...so the extra run is going to cost very little now - the total cost of the 2 runs should be less than the 1 run before that micro-optimization: https://llvm-compile-time-tracker.com/compare.php?from=5e8c2026d10e8e2c93c038c776853bed0e7c8fc1&to=2c4b68eab5ae969811f422714e0eba44c5f7eefb&stat=instructions:u It may be possible to reduce the cost slightly more with a few more earlier-exits like that, but it's probably in the noise based on timing experiments. Differential Revision: https://reviews.llvm.org/D138353	2022-11-21 13:57:55 -05:00
Roman Lebedev	8adfa29706	[Pipelines] Introduce SROA after (final, run-time) loop unrolling Now that we are done with loop unrolling, be it either by LoopVectorizer, or LoopUnroll passes, some variable-offset GEP's into alloca's could have become constant-offset, thus enabling SROA and alloca promotion, yet we don't capitalize on that, which is surprizing. While it would be good to not introduce one more SROA invocation, but instead move the one from `PassBuilder::buildFunctionSimplificationPipeline()`, the existing test coverage says that is a bad idea, though it would be fine compile-time wise: https://llvm-compile-time-tracker.com/compare.php?from=b150d34c47efbd8fa09604bce805c0920360f8d7&to=5a9a5c855158b482552be8c7af3e73d67fa44805&stat=instructions So instead, i add yet another SROA run. I have checked, and it needs to be at least after said final loop unrolling. This is still fine compile-time wise: https://llvm-compile-time-tracker.com/compare.php?from=70324cd88328c0924e605fa81b696572560aa5c9&to=fb489bbef687ad821c3173a931709f9cad9aee8a&stat=instructions I've encountered this in a real code, `SROA-after-final-loop-unrolling.ll` has been reduced from https://godbolt.org/z/fsdMhETh3 Reviewed By: spatel Differential Revision: https://reviews.llvm.org/D136806	2022-11-17 21:31:30 +03:00
Sanjay Patel	3682180e5f	[PhaseOrdering] add test for load combining; NFC Based on an example in issue #17113	2022-11-17 10:33:35 -05:00
Arthur Eubanks	70dc3b811e	[AggressiveInstCombine] Remove legacy PM pass As part of legacy PM optimization pipeline removal. This shouldn't be used in codegen pipelines so it should be ok to remove. Reviewed By: asbirlea Differential Revision: https://reviews.llvm.org/D137116	2022-11-15 14:35:15 -08:00
Bjorn Pettersson	893e351f2f	[test] Avoid legacy PM default pipelines (O0,O1 etc) when running opt Two lit tests were found running something like this: opt -O<n> -pass-locked-to-legacy-PM ... The expand-atomicrmw-xchg-fp.ll seem to have used -O1 just to ensure that the -atomic-expand pass were thinking that it wasn't running at O0 level. Same thing can be ensured by using the -codegen-opt-level=1 option, making it possible to avoid using O1 in that test case. In the vector-reductions-expanded.ll test case it was possible to split the RUN line into using two opt invocations. First running "opt -O2" using the new PM, and then running "opt -expand-reductions" using the legacy PM. I think that given this patch we get closer to removing code related to 'AddOptimizationPasses' in opt.cpp. Differential Revision: https://reviews.llvm.org/D137626	2022-11-09 09:57:57 +01:00
Roman Lebedev	e14f30584d	[NFC][PhaseOrdering] Add one more test for SROA after partial unroll https://reviews.llvm.org/D136806	2022-10-27 19:10:27 +03:00
Roman Lebedev	70324cd883	[NFC][PhaseOrdering] Add new test for SROA misplacement	2022-10-27 03:11:23 +03:00
Yaxun (Sam) Liu	9d5adc7e49	Revert "reland `e5581df60a` [SimplifyCFG] accumulate bonus insts cost" This reverts commit `bd7949bcd8`. Revert this patch since reviwers have different opinions regarding the approach in post-commit review. Will open RFC for further discussion. Differential Revision: https://reviews.llvm.org/D132408	2022-10-25 12:15:39 -04:00
Yaxun (Sam) Liu	bd7949bcd8	reland `e5581df60a` [SimplifyCFG] accumulate bonus insts cost Fixed compile time increase due to always constructing LocalCostTracker. Now only construct LocalCostTracker when needed.	2022-10-24 15:43:53 -04:00
bipmis	38f3e44997	[AggressiveInstCombine] Load merge the reverse load pattern of consecutive loads. This patch extends the load merge/widen in AggressiveInstCombine() to handle reverse load patterns. Differential Revision: https://reviews.llvm.org/D135137	2022-10-19 11:22:58 +01:00
bipmis	82e3056255	Add test for combinations of four i8-loads spliced into a 32-bit value	2022-10-18 15:40:56 +01:00
Nikita Popov	627a0c6b40	[PhaseOrdering] Name instructions in test (NFC) Run through opt -instnamer.	2022-10-05 17:04:11 +02:00
Simon Pilgrim	5849fcb635	Revert rG1b7089fe67b924bdd5ecef786a34bdba7a88778f "[SLP] Add ScalarizationOverheadBuilder helper to track vector extractions" Revert rGef89409a59f3b79ae143b33b7d8e6ee6285aa42f "Fix 'unused-lambda-capture' gcc warning. NFCI." Revert rG926ccfef032d206dcbcdf74ca1e3a9ebf4d1be45 "[SLP] ScalarizationOverheadBuilder - demand all elements for scalarization if the extraction index is unknown / out of bounds" Revert ScalarizationOverheadBuilder sequence from D134605 - when accumulating extraction costs by Type (instead of specific Value), we are not distinguishing enough when they are coming from the same source or not, and we always just count the cost once. This needs addressing before we can use getScalarizationOverhead properly.	2022-09-30 11:22:48 +01:00
Simon Pilgrim	1b7089fe67	[SLP] Add ScalarizationOverheadBuilder helper to track vector extractions Instead of accumulating all extraction costs separately and then adjusting for repeated subvector extractions, this patch collects all the extractions and then converts to calls to getScalarizationOverhead to improve the accuracy of the costs. I'm not entirely satisfied with the getExtractWithExtendCost handling yet - this still just adds all the getExtractWithExtendCost costs together - it really needs to be replaced with a "getScalarizationOverheadWithExtend", but that will require further refactoring first. This replaces my initial attempt in D124769. Differential Revision: https://reviews.llvm.org/D134605	2022-09-27 14:49:07 +01:00
Sanjay Patel	271f3b91bb	[PhaseOrdering] add test for issue #50778 ; NFC Several different passes are involved to get the expected IR, and we don't want that to break again.	2022-09-23 12:12:13 -04:00
Sanjay Patel	34f8112b79	Revert "[PhaseOrdering] add test for issue #50778 ; NFC" This reverts commit `cdc012fa26`. This accidentally deleted a test file (not sure how that became part of the commit).	2022-09-23 12:06:29 -04:00
Sanjay Patel	cdc012fa26	[PhaseOrdering] add test for issue #50778 ; NFC Several different passes are involved to get the expected IR, and we don't want that to break again.	2022-09-23 12:03:03 -04:00
Nikita Popov	dd61726d5b	Revert "[SimplifyCFG] accumulate bonus insts cost" This reverts commit `e5581df60a`. This causes major compile-time regressions, about 2-3% end-to-end on CTMark.	2022-09-19 14:46:43 +02:00
Yaxun (Sam) Liu	e5581df60a	[SimplifyCFG] accumulate bonus insts cost SimplifyCFG folds bool foo() { if (cond1) return false; if (cond2) return false; return true; } as bool foo() { if (cond1 \| cond2) return false return true; } 'cond2' is called 'bonus insts' in branch folding since they introduce overhead since the original CFG could do early exit but the folded CFG always executes them. SimplifyCFG calculates the costs of 'bonus insts' of a folding a BB into its predecessor BB which shares the destination. If it is below bonus-inst-threshold, SimplifyCFG will fold that BB into its predecessor and cond2 will always be executed. When SimplifyCFG calculates the cost of 'bonus insts', it only consider 'bonus' insts in the current BB to be considered for folding. This causes issue for unrolled loops which share destinations, e.g. bool foo(int a) { for (int i = 0; i < 32; i++) if (a[i] > 0) return false; return true; } After unrolling, it becomes bool foo(int a) { if(a[0]>0) return false if(a[1]>0) return false; //... if(a[31]>0) return false; return true; } SimplifyCFG will merge each BB with its predecessor BB, and ends up with 32 'bonus insts' which are always executed, which is much slower than the original CFG. The root cause is that SimplifyCFG does not consider the accumulated cost of 'bonus insts' which are folded from different BB's. This patch fixes that by introducing a ValueMap to track costs of 'bonus insts' coming from different BB's into the same BB, and cuts off if the accumulated cost exceeds a threshold. Reviewed by: Artem Belevich, Florian Hahn, Nikita Popov, Matt Arsenault Differential Revision: https://reviews.llvm.org/D132408	2022-09-18 20:21:14 -04:00
Sanjay Patel	d6498abc24	[InstCombine] remove multi-use add demanded constant fold This was originally part of D133788. There are no visible regressions. All of the diffs show a large unsigned constant becoming a small negative constant. This should be better for analysis (and slightly less compile-time) and codegen.	2022-09-18 14:23:43 -04:00
Valery N Dmitriev	18dde772d6	[SLP] Unify main/alternate selection for CmpInst instructions Make main/alternate operation selection logic for CmpInst consistent across SLP vectorizer. Differential Revision: https://reviews.llvm.org/D133430	2022-09-13 09:20:25 -07:00
Simon Pilgrim	626a84db47	[CostModel][X86] getTypeBasedIntrinsicInstrCost - convert to CostKindTblEntry Begin the refactoring to use CostKindTblEntry and return real latency/codesize/sizelatency costs instead of reusing the throughput numbers	2022-09-04 17:59:08 +01:00
Simon Pilgrim	8dc99180a6	[PhaseOrdering] Move X86 unsigned-multiply-overflow-check.ll test under X86	2022-09-04 17:54:32 +01:00
Simon Pilgrim	3edec9ba60	[CostModel][X86] Support cost kind specific look up tables (REAPPLIED) Most of our cost model tables have been created assuming cost kind == recip-throughput. But we're starting to see passes wanting to get accurate costs for the other kinds as well. Some of these can be determined procedurally (e.g. codesize by default could just be the split count after type legalization), but others are going to need to be handled in cost tables - this is especially true for x86 which has so many ISA combinations. I've created a 'CostKindCosts' struct which can hold cost values for the 4 cost kinds, defaulting to -1U for unknown cost, this can be used with the existing CostTblEntryT/CostTableLookup template code. I've also added a [TargetCostKind] accessor to make it much easier to look up individual <Optional> costs. This just changes the ISD::SELECT costs to check the effect (and also to check that the ISD::SETCC are correctly handled for default/None cost kinds) - the plan would be to slowly extend this and move the CostKindTblEntry type somewhere generic to allow other targets to use it once its matured. I'm also going to resurrect D103695 so that it can help with latency/codesize/sizelatency coverage testing. For sizelatency - IIRC the definition was vague to let it be target specific - I've tried to use typical uop counts so they're comparable to MicroOpBufferSize etc. REAPPLIED: Added early out to prevent getCmpSelInstrCost being used for anything but generic integer/float scalar/vector types - getTypeLegalizationCost can't handle the "exotic" TypeID enums that some passes attempt to get a costs for (aggregates etc.). Differential Revision: https://reviews.llvm.org/D132216	2022-08-25 16:49:17 +01:00
Benjamin Kramer	ab85996e47	Revert "[CostModel][X86] Support cost kind specific look up tables" This reverts commit `45846854a2`. This triggers an assertion failure during Clang selfhost Unknown type! UNREACHABLE executed at llvm/lib/CodeGen/ValueTypes.cpp:548! * SIGABRT received by PID 6107 (TID 6107) on cpu 218 from PID 6107; stack trace: * @ 0x556c8827c2d1 64 llvm::llvm_unreachable_internal() @ 0x556c82a5542a 32 llvm::MVT::getVT() @ 0x556c82a54a28 80 llvm::EVT::getEVT() @ 0x556c7dda1526 80 llvm::TargetLoweringBase::getValueType() @ 0x556c8174dd38 112 llvm::BasicTTIImplBase<>::getTypeLegalizationCost() @ 0x556c81755e72 144 llvm::X86TTIImpl::getCmpSelInstrCost() @ 0x556c8174cadf 512 llvm::TargetTransformInfoImplCRTPBase<>::getInstructionCost() @ 0x556c84ab4dd2 32 llvm::TargetTransformInfo::getInstructionCost() @ 0x556c82ead283 1968 llvm::sinkRegion()	2022-08-25 15:42:44 +02:00
Simon Pilgrim	2e5f16516a	[CostModel][X86] Add CodeSize handling for fdiv ops Eventually this will be part of the cost table lookup	2022-08-25 14:08:03 +01:00
Simon Pilgrim	45846854a2	[CostModel][X86] Support cost kind specific look up tables Most of our cost model tables have been created assuming cost kind == recip-throughput. But we're starting to see passes wanting to get accurate costs for the other kinds as well. Some of these can be determined procedurally (e.g. codesize by default could just be the split count after type legalization), but others are going to need to be handled in cost tables - this is especially true for x86 which has so many ISA combinations. I've created a 'CostKindCosts' struct which can hold cost values for the 4 cost kinds, defaulting to -1U for unknown cost, this can be used with the existing CostTblEntryT/CostTableLookup template code. I've also added a [TargetCostKind] accessor to make it much easier to look up individual <Optional> costs. This just changes the ISD::SELECT costs to check the effect (and also to check that the ISD::SETCC are correctly handled for default/None cost kinds) - the plan would be to slowly extend this and move the CostKindTblEntry type somewhere generic to allow other targets to use it once its matured. I'm also going to resurrect D103695 so that it can help with latency/codesize/sizelatency coverage testing. For sizelatency - IIRC the definition was vague to let it be target specific - I've tried to use typical uop counts so they're comparable to MicroOpBufferSize etc. Differential Revision: https://reviews.llvm.org/D132216	2022-08-25 12:23:36 +01:00
Jay Foad	2754ff883d	[InstCombine] Try not to demand low order bits for Add Don't demand low order bits from the LHS of an Add if: - they are not demanded in the result, and - they are known to be zero in the RHS, so they can't possibly overflow and affect higher bit positions This is intended to avoid a regression from a future patch to change the order of canonicalization of ADD and AND. Differential Revision: https://reviews.llvm.org/D130075	2022-08-22 20:03:53 +01:00
Simon Pilgrim	53c0be28a7	[PhaseOrdering][X86] Regenerate vdiv.ll Noticed while cleaning up x86 cost tables for upcoming cost kind support and it affected this test	2022-08-21 13:39:55 +01:00
Simon Pilgrim	08d153d806	[ValueTracking] computeKnownBits - attempt to use a branch condition feeding a phi to improve known bits range (PR38280) If computeKnownBits encounters a phi node, and we fail to determine any known bits through direct analysis, see if the incoming value is part of a branch condition feeding the phi. Handle cases where icmp(IncomingValue PRED Constant) is driving a branch instruction feeding that phi node - at the moment this only handles EQ/ULT/ULE predicate cases as they are the most straightforward to handle and most likely for branch-loop 'max upper bound' cases - we can extend this if/when necessary. I investigated a more general icmp(LHS PRED RHS) KnownBits system, but the hard limits we put on value tracking depth through phi nodes meant that we were mainly catching constants anyhow. Fixes the pointless vectorization in PR38280 / Issue #37628 (excessive unrolling still needs handling though) Differential Revision: https://reviews.llvm.org/D131838	2022-08-16 16:54:44 +01:00
Florian Hahn	1638ad1ebf	[PhaseOrdering] Add test showing excessive unrolling of vector loop. Test cases based on #42332 showing excessive unrolling with both known and runtime trip counts.	2022-08-16 16:29:15 +01:00
Simon Pilgrim	c771d0fd4b	[Instcombine] Add (simplified) pointless loop unroll / vectorization test for Issue #37628	2022-08-13 14:34:08 +01:00
Sanjay Patel	bfb9b8e075	[Passes] add a tail-call-elim pass near the end of the opt pipeline We call tail-call-elim near the beginning of the pipeline, but that is too early to annotate calls that get added later. In the motivating case from issue #47852, the missing 'tail' on memset leads to sub-optimal codegen. I experimented with removing the early instance of tail-call-elim instead of just adding another pass, but that appears to be slightly worse for compile-time: +0.15% vs. +0.08% time. "tailcall" shows adding the pass; "tailcall2" shows moving the pass to later, then adding the original early pass back (so 1596886802 is functionally equivalent to 180b0439dc ): https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright Note that there was an effort to split the tail call functionality into 2 passes - that could help reduce compile-time if we find that this change costs more in compile-time than expected based on the preliminary testing: D60031 Differential Revision: https://reviews.llvm.org/D130374	2022-07-25 15:25:47 -04:00
Nikita Popov	41d5033eb1	[IR] Enable opaque pointers by default This enabled opaque pointers by default in LLVM. The effect of this is twofold: * If IR that contains neither explicit ptr nor %T* types is passed to tools, we will now use opaque pointer mode, unless -opaque-pointers=0 has been explicitly passed. * Users of LLVM as a library will now default to opaque pointers. It is possible to opt-out by calling setOpaquePointers(false) on LLVMContext. A cmake option to toggle this default will not be provided. Frontends or other tools that want to (temporarily) keep using typed pointers should disable opaque pointers via LLVMContext. Differential Revision: https://reviews.llvm.org/D126689	2022-06-02 09:40:56 +02:00
Nikita Popov	1721ff1dfd	[GVN] Enable enable-split-backedge-in-load-pre option by default This option was added in D89854. It prevents GVN from performing load PRE in a loop, if doing so would require critical edge splitting on the backedge. From the review: > I know that GVN Load PRE negatively impacts peeling, > loop predication, so the passes expecting that latch has > a conditional branch. In the PhaseOrdering test in this patch, splitting the backedge negatively affects vectorization: After critical edge splitting, the loop gets rotated, effectively peeling off the first loop iteration. The effect is that the first element is handled separately, then the bulk of the elements use a vectorized reduction (but using unaligned, off-by-one memory accesses) and then a tail of 15 elements is handled separately again. It's probably worth noting that the loop load PRE from D99926 is not affected by this change (as it does not need backedge splitting). This is about normal load PRE that happens to occur inside a loop. Differential Revision: https://reviews.llvm.org/D126382	2022-05-30 09:55:58 +02:00
Nikita Popov	6346a026af	[PhaseOrdering] Add test for unprofitable loop load PRE backedge splitting (NFC)	2022-05-25 16:06:41 +02:00
Nikita Popov	b2a13d3e2d	[InstCombine] Use IRBuilder in freeze pushing transform (PR55619) Use IRBuilder so that the newly created freeze instructions automatically gets inserted back into the IC worklist. The changed worklist processing order leads to some cosmetic differences in tests. Fixes https://github.com/llvm/llvm-project/issues/55619.	2022-05-24 15:48:28 +02:00
Florian Hahn	b7315ffc3c	[LAA,LV] Add initial support for pointer-diff memory checks. This patch adds initial support for a pointer diff based runtime check scheme for vectorization. This scheme requires fewer computations and checks than the existing full overlap checking, if it is applicable. The main idea is to only check if source and sink of a dependency are far enough apart so the accesses won't overlap in the vector loop. To do so, it is sufficient to compute the difference and compare it to the `VF * UF * AccessSize`. It is sufficient to check `(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards dependence in the vector loop with the given VF and UF. If Src >=u Sink, there is not dependence preventing vectorization, hence the overflow should not matter and using the ULT should be sufficient. Note that the initial version is restricted in multiple ways: 1. Pointers must only either be read or written, by a single instruction (this allows re-constructing source/sink for dependences with the available information) 2. Source and sink pointers must be add-recs, with matching steps 3. The step must be a constant. 3. abs(step) == AccessSize. Most of those restrictions can be relaxed in the future. See https://github.com/llvm/llvm-project/issues/53590. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D119078	2022-05-16 15:27:22 +01:00
Alexey Bataev	7ea03f0b4e	[SLP]Improve reductions analysis and emission, part 1. Currently SLP vectorizer walks through the instructions and selects 3 main classes of values: 1) reduction operations - instructions with same reduction opcode (add, mul, min/max, etc.), which build the reduction, 2) reduced values - instructions with the same opcodes, but different from the reduction opcode, 3) extra arguments - all other values, instructions from the different basic block rather than the root node, instructions with to many/less uses. This scheme is not very efficient. It excludes some instructions and all non-instruction values from the reductions (constants, proficient gathers), to many possibly reduced values are marked as extra arguments. Patch improves this process by introducing a bit extended analysis stage. During this stage, we still try to select 3 classes of the values: 1) reduction operations - same as before, 2) possibly reduced values - all instructions from the current block/non-instructions, which may build a vectorization tree, 3) extra arguments - instructions from the different basic blocks. Additionally, an extra sorting of the possibly reduced values occurs to build the scalar sequences which highly likely will bed vectorized, e.g. loads are grouped by the distance between them, constants are grouped together, cmp instructions are sorted by their compare types and predicates, extractelement instructions are sorted by the vector operand, etc. Also, these groups are reordered by their length so the longest group is the first in the list of the possibly reduced values. The vectorization process tries to emit the reductions for all these groups. These reductions, remaining non-vectorized possible reduced values and extra arguments are then combined into the final expression just like it was before. Differential Revision: https://reviews.llvm.org/D114171	2022-05-02 12:03:58 -07:00
Simon Pilgrim	6f80830f06	[PhaseOrdering][X86] Use passes="" instead of passes='' so DOS can evaluate the cmd lines Fix regenerating the tests on windows builds	2022-04-30 19:56:49 +01:00

1 2 3 4

195 Commits