ScalarizedMaskedMemIntr contains an optimization where the <N x i1> mask is bitcast into an iN, and bit-tests against powers of two then decide whether each lane's load/store/... is performed. However, on machines with branch divergence (mainly GPUs), this is a mis-optimization: each i1 in the mask is stored in a condition register, so each of these "i1"s is likely to be a word or two wide, which makes the bit operations counterproductive. Therefore, amend this pass to skip the optimization on targets that it pessimizes. Pre-commit tests #104645
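To make the two strategies concrete, here is a minimal host-side C++ sketch of a 4-lane masked load. It is only an illustration, not the pass's code: `kLanes`, `packMask`, `maskedLoadPacked`, and `maskedLoadPerLane` are hypothetical names, and the real transformation operates on LLVM IR rather than on arrays.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count for the illustration (stands in for N in <N x i1>).
constexpr std::size_t kLanes = 4;

// Models the bitcast <N x i1> -> iN: lane i becomes bit i of the result.
std::uint8_t packMask(const std::array<bool, kLanes> &mask) {
  std::uint8_t bits = 0;
  for (std::size_t i = 0; i < kLanes; ++i)
    if (mask[i])
      bits |= static_cast<std::uint8_t>(1u << i);
  return bits;
}

// Strategy A (the existing optimization): test the packed mask against
// powers of two to decide whether each lane is loaded.
std::array<int, kLanes> maskedLoadPacked(const int *ptr,
                                         std::uint8_t packedMask,
                                         std::array<int, kLanes> passthru) {
  assert(ptr != nullptr);
  for (std::size_t i = 0; i < kLanes; ++i)
    if (packedMask & (1u << i))
      passthru[i] = ptr[i];
  return passthru;
}

// Strategy B (what a divergent target prefers): branch on each lane's
// condition directly, with no packing and no bit arithmetic.
std::array<int, kLanes> maskedLoadPerLane(const int *ptr,
                                          const std::array<bool, kLanes> &mask,
                                          std::array<int, kLanes> passthru) {
  assert(ptr != nullptr);
  for (std::size_t i = 0; i < kLanes; ++i)
    if (mask[i])
      passthru[i] = ptr[i];
  return passthru;
}
```

Both functions produce the same result for the same mask. The difference is only in how the per-lane condition is obtained, which is where the cost shows up when each condition already occupies its own register.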