Currently, a chain of 12 x i8 accesses is split into 6 x i8 + 6 x i8, which produces suboptimal code. This change first attempts to split off the trailing part of the chain that is not a multiple of 4 bytes; if that does not work, it splits at the next smaller power-of-2 boundary.

Differential Revision: https://reviews.llvm.org/D147976
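
For illustration only, the splitting strategy described above can be sketched as a standalone function. This is not the code from the patch: the name `splitPoint` and its parameters are hypothetical. The sketch assumes at least two elements per chain and a power-of-2 element size of at least one byte.

```cpp
#include <cstddef>

// Hypothetical helper (not the LLVM implementation): returns how many
// elements go into the left part when a chain of `numElts` elements,
// each `eltBytes` bytes wide, has to be split in two.
// Assumes numElts >= 2 and eltBytes is a power of 2 (>= 1).
std::size_t splitPoint(std::size_t numElts, std::size_t eltBytes) {
  std::size_t sizeBytes = numElts * eltBytes;

  // First choice: split off the trailing bytes that are not a multiple of 4.
  std::size_t leftBytes = sizeBytes - (sizeBytes % 4);

  // That does not help if the chain is already a multiple of 4 bytes
  // (nothing to trim) or shorter than 4 bytes (everything trimmed).
  // In that case, fall back to the largest power of 2 strictly below
  // the chain size.
  if (leftBytes == sizeBytes || leftBytes == 0) {
    std::size_t pow2 = 1;
    while (pow2 * 2 < sizeBytes)
      pow2 *= 2;
    leftBytes = pow2;
  }

  // Always keep at least one element on the left.
  std::size_t numLeft = leftBytes / eltBytes;
  return numLeft == 0 ? 1 : numLeft;
}
```

Under these assumptions, `splitPoint(12, 1)` yields 8, so the 12 x i8 chain becomes 8 x i8 + 4 x i8 rather than 6 x i8 + 6 x i8, and `splitPoint(7, 1)` yields 4, splitting off the trailing 3 bytes.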