This PR adds a new vector intrinsic, `@llvm.experimental.vector.compress`, that "compresses" data within a vector based on a selection mask: all selected values (i.e., lanes where `mask[i] == 1`) are moved to consecutive lanes at the start of the result vector. An optional `passthru` vector can be provided, from which the remaining lanes are filled.

The main motivation is that the existing `@llvm.masked.compressstore` has very strong constraints: it may only write the values that were selected, which requires guard branches on all targets except AVX-512 (and even there, the AMD implementation is _very_ slow). More instruction sets support "compress" logic, but only within registers, so an additional store is needed to write the values out. Even so, this combination is likely significantly faster on many targets because it avoids branches.

In follow-up PRs, I plan to add target-specific lowerings for x86, SVE, and possibly RISC-V. I also want to combine this intrinsic with a store, as that is probably a common case and would let us avoid some memory writes. See the [discussion in the forum](https://discourse.llvm.org/t/new-intrinsic-for-masked-vector-compress-without-store/78663) for the initial design discussion.
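For illustration, the lane-by-lane semantics can be sketched as a scalar reference model (`vector_compress` is a hypothetical helper written for this description, not part of the PR itself):

```python
def vector_compress(vec, mask, passthru):
    """Scalar reference model of @llvm.experimental.vector.compress:
    selected lanes are packed to the front of the result, and lanes at
    index >= popcount(mask) keep the corresponding lanes of passthru."""
    selected = [v for v, m in zip(vec, mask) if m]
    return selected + passthru[len(selected):]

# Lanes 1 and 2 are selected, so their values move to lanes 0 and 1;
# lanes 2 and 3 of the result come from passthru.
result = vector_compress([10, 20, 30, 40], [0, 1, 1, 0], [-1, -2, -3, -4])
# result == [20, 30, -3, -4]
```

Without a `passthru`, the tail lanes of the actual intrinsic's result would be undefined rather than preserved.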