Each vslide1down operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique inserts, each with a cost linear in LMUL, the overall cost is O(VL*LMUL). Since VL is a linear function of LMUL, this means the current lowering is quadratic in both LMUL and VL. To avoid the degenerate case, fall back to the stack if the cost is more than a fixed (linear) threshold.

For context, here are the sifive-x280 llvm-mca results for the current lowering and the stack-based lowering for each LMUL (using e64). This assumes the code was compiled for V (i.e. zvl128b).

buildvector_m1_via_stack.mca:Total Cycles: 1904
buildvector_m2_via_stack.mca:Total Cycles: 2104
buildvector_m4_via_stack.mca:Total Cycles: 2504
buildvector_m8_via_stack.mca:Total Cycles: 3304
buildvector_m1_via_vslide1down.mca:Total Cycles: 804
buildvector_m2_via_vslide1down.mca:Total Cycles: 1604
buildvector_m4_via_vslide1down.mca:Total Cycles: 6400
buildvector_m8_via_vslide1down.mca:Total Cycles: 25599

There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, the stack-based lowering seems to cost better on all LMULs, so we can just go with the simpler scheme.

Arguably, this patch is fixing a regression introduced with my D149667, as before that change we'd always fall back to the stack, and thus didn't have the non-linearity.

Differential Revision: https://reviews.llvm.org/D159332
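As a rough illustration of the cost argument above, here is a toy model of the O(VL*LMUL) slide cost and a linear fallback threshold. It is only a sketch: the per-slide cost of LMUL units, the helper names, and the budget_per_element constant are assumptions made for illustration, not the values or code used by the patch.

```python
# Toy cost model; every constant and name here is an illustrative assumption.
VLEN = 128  # vector register width in bits, matching zvl128b
SEW = 64    # element width in bits, matching e64


def vl_for(lmul):
    """Number of elements in a register group of the given LMUL."""
    return lmul * VLEN // SEW


def slide_cost(lmul):
    """VL inserts, each assumed to cost LMUL units, so VL * LMUL overall."""
    return vl_for(lmul) * lmul


def use_stack(lmul, budget_per_element=2):
    """Fall back to the stack when the slide cost exceeds a linear budget.

    budget_per_element is a hypothetical tuning constant.
    """
    return slide_cost(lmul) > budget_per_element * vl_for(lmul)


if __name__ == "__main__":
    for lmul in (1, 2, 4, 8):
        print(lmul, vl_for(lmul), slide_cost(lmul), use_stack(lmul))
```

Under these assumptions the slide cost grows as 2*LMUL^2 while the budget grows as 4*LMUL, so the fallback only triggers at the larger LMULs.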