It's possible that we encounter Irreducible control flow, due to which,
we may find that a few predecessors of BB are not a part of the CurLoop.
Currently we crash in the function for such cases. This patch ensures
that we only return Predecessors that are a part of CurLoop and
gracefully ignore other Predecessors.
For example, consider Irreducible IR of this form:
```
define i64 @baz() {
bb:
br label %bb1
bb1: ; preds = %bb3, %bb
br label %bb3
bb2: ; No predecessors!
br label %bb3
bb3: ; preds = %bb2, %bb1
%load = load ptr addrspace(1), ptr addrspace(1) null, align 8
br label %bb1
}
```
This crashes when `collectTransitivePredecessors` is called on the
`%bb1<Header>, %bb3<latch>` loop, because the loop body has a
predecessor `%bb2` which is not a part of the loop.
See https://godbolt.org/z/E9fM1q3cT for the crash
In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16
without zvfh, and marked the cost as invalid to prevent the vectorizers
from emitting them. However it inadvertently didn't change the cost for
ordered reductions, so this moves the check earlier to fix this.
This also uses BasicTTIImpl instead which now assigns a valid but
expensive cost for fixed-length vectors, which reflects how codegen will
actually scalarize them.
We currently can't lower scalable vector lrint and llrint nodes for bf16
and f16, even with zvfh, and will crash.
Mark the cost as invalid for now to prevent the vectorizers from
emitting them.
Note that we can actually lower fixed-length vectors fine by scalarizing
them, but we were still undercosting these too so I've also included
them. I presume there's an opportunity to improve the codegen later on.
Many tests that were in test/Analysis/CostModel were actually
loop vectoriser tests. I've moved them as follows:
Analysis/CostModel/X86 -> Transforms/LoopVectorize/X86/CostModel
Analysis/CostModel/AArch64/arith-fp-frem.ll ->
Transforms/LoopVectorize/AArch64/arith-fp-frem-costs.ll
This computes a structural hash while allowing for selective ignoring of
certain operands based on a custom function that is provided. Instead of
a single hash value, it now returns FunctionHashInfo which includes a
hash value, an instruction mapping, and a map to track the operand
location and its corresponding hash value that is ignored.
Depends on https://github.com/llvm/llvm-project/pull/112621.
This is a patch for
https://discourse.llvm.org/t/rfc-global-function-merging/82608.
With the introduction of the nusw flag in GEPNoWrapFlags, it should be
safe to weaken the check in LoopAccessAnalysis to just check the nusw
flag on the GEP, instead of inbounds.
isNoWrap has exactly one caller which handles Assume = true separately,
but too conservatively. Instead, pass Assume to isNoWrap, so it is
threaded into getPtrStride, which has the correct handling for the
Assume flag. Also note that the Stride == 1 check in isNoWrap is
incorrect: getPtrStride returns Strides == 1 or -1, except when
isNoWrapAddRec or Assume are true, assuming ShouldCheckWrap is true; we
can include the case of -1 Stride, and when isNoWrapAddRec is true. With
this change, passing Assume = true to getPtrStride could return a
non-unit stride, and we correctly handle that case as well.
For regular floating point types we mark these as expanded on scalable
vectors so they're not legal in the cost model, so this does the same
for f16 w/ zvfhmin and bf16.
This patch contains following changes to fix vp intrinsics tests.
1. v\*float -> v\*f32, v\*double -> v\*f64 and v\*half -> v\*f16
2. Fix the order of the vp-intrinsics.
A call to a function that has this attribute is not a source of
divergence, as used by UniformityAnalysis. That allows a front-end to
use known-name calls as an instruction extension mechanism (e.g.
https://github.com/GPUOpen-Drivers/llvm-dialects ) without such a call
being a source of divergence.
It turns out that {s,u}int_to_fp nodes get their operation action from
their operand's type, not the result type, so we don't need to set it
for fp16 or bf16. vp_{s,u}int_to_fp uses the result type though so we
need to keep it.
This also means that we can lower int_to_fp for fixed length bf16
vectors already, so this adds tests for that.
The cost model test changes are due to BasicTTIImpl's getCastInstrCost
not taking into account that int_to_fp needs its legal type swapped.
This can be fixed in a later patch, but its worth noting that the
affected types in the tests currently crash when lowered anyway (due to
them needing split at LMUL > 8)
When trying to add RISC-V fadd reduction cost model tests for bf16, I
noticed a crash when the vector was of <1 x bfloat>.
It turns out that this was being scalarized because unlike f16/f32/f64,
there's no v1bf16 value type, and the existing cost model code assumed
that the legalized type would always be a vector.
This adds v1bf16 to bring bf16 in line with the other fp types.
It also adds some more RISC-V bf16 reduction tests which previously
crashed, including tests to ensure that SLP won't emit fadd/fmul
reductions for bf16 or f16 w/ zvfhmin after #111000.
In https://reviews.llvm.org/D153848, promotion was added for a variety
of f16 ops with zvfhmin, including VP reductions.
However I don't believe it's correct to promote f16 fadd or fmul
reductions to f32 since we need to round the intermediate results.
Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two
different results depending on whether we compiled with +zvfh or
+zvfhmin, for example with a 3 element reduction:
; v9 = [0.1563, 5.97e-8, 0.00006104]
; zvfh
vsetivli x0, 3, e16, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v9, v8
vfmv.f.s fa0, v8
; fa0 = 0.1563
; zvfhmin
vsetivli x0, 3, e16, m1, ta, ma
vfwcvt.f.f.v v10, v9
vsetivli x0, 3, e32, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v10, v8
vfmv.f.s fa0, v8
fcvt.h.s fa0, fa0
; fa0 = 0.1564
This same thing happens with reassociative reductions e.g. vfredusum.vs,
and this also applies for bf16.
I couldn't find anything in the LangRef for reductions that suggest the
excess precision is allowed. There may be something we can do in Clang
with -fexcess-precision=fast, but I haven't looked into this yet.
I presume the same precision issue occurs with fmul, but not with
fmin/fmax/fminimum/fmaximum.
I can't think of another way of lowering these other than scalarizing,
and we can't scalarize scalable vectors, so this just removes the
promotion and adjusts the cost model to return an invalid cost. (It
looks like we also don't currently cost fmul reductions, so presumably
they also have an invalid cost?)
I think this should be enough to stop the loop vectorizer or SLP from
emitting these intrinsics.
Allow AllowEphemerals in isValidAssumeForContext, as the CxtI might
be the producer of the pointer in the bundle. At the moment, align
assumptions aren't optimized away.
This allows using the assumption in the computeKnownBits call in
getConstantMultipleImpl.
We could extend the computeKnownBits API to allow callers to specify if
ephemerals are allowed, if the info from computeKnownBitsFromContext is
used to remove alignment assumptions.
PR: https://github.com/llvm/llvm-project/pull/108632
When expanding SCEV adds to geps, transfer the nuw flag to the resulting
gep. (Note that this doesn't apply to IV increment GEPs, which go
through a different code path.)
After pr96656.ll were added to LAA and LoopVersioning, it was decided
that the bug is in a caller of LoopVersioning, not in LAA or
LoopVersioning itself. The new candidate was LoopLoadElim, but #96656
has since been marked invalid. Hence, re-organize the added tests to
avoid confusion, and the testcase from the investigation to
LoopLoadElim.
When we're lowering to a split sequence, we only need one
materialization of the zero constant. Our codegen looks something like
this:
vmv.v.i v24, 0
vmerge.vim v8, v24, -1, v0
vmv1r.v v0, v16
vmerge.vim v16, v24, -1, v0
Note: Doing this specific case since it was pointed out in
https://github.com/llvm/llvm-project/pull/110164#discussion_r1778268391,
but it's worth noting that we have the same basic problem (over costing
split operations with split invariant terms) at multiple places through
this file.
Calling into BasicTTI is not always safe. In particular, BasicTTI does
not have a full legalization implementation (vector widening is
missing), and falls back on scalarization. The problem is that
scalarization for <N x i1> vectors is cost in terms of the cast API and
we can end up in an infinite recursive cycle.
The "right" fix for this would be teach BasicTTI how to model the full
legalization state machine, but several attempts at doing so have
resulted in dead ends or undesirable cost changes for targets I don't
understand.
This patch instead papers over the issue by avoiding the call to the
base class when dealing with an i1 source or dest. This doesn't
necessarily produce correct costs, but it should at least return
something semi-sensible and not crash.
Fixes https://github.com/llvm/llvm-project/issues/108708
KnownBits::srem does not correctly set the leader zero-bits, omitting
the fact that LHS may be known-negative or known-non-negative. Fix this.
Alive2 proof: https://alive2.llvm.org/ce/z/Ugh-Dq
There is an underlying bug in KnownBits, and we should theoretically be
able to determine the high-bits of an srem as shown in the test, just
like urem. In preparation to fix this bug, add pre-commit tests testing
high-bits of srem and urem.
This is mostly for test: under contextual profiling, we perform ICP for those indirect callsites which have targets marked as `alwaysinline`.
This helped uncover a bug with the way the profile was updated upon ICP, where we were skipping over the update if the target wasn't called in that context. That was resulting in incorrect counts for the indirect BB.
Also flyby fix to the total/direct count values, they should be 64-bit (as all counters are in the contextual profile)
This follows in the spirit of 7d82c99403,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
Arithmetic half or bfloat ops on zvfhmin and zvfbfmin respectively will
be promoted and carried out in f32, so this updates
getArithmeticInstrCost to check for this.