The logic in ARMParallelDSP is setup to merge two 16-bits loads into
a 32-bit load and feed them into the smlads. This requires that four
loads are combined for the four inputs, but there wasn't actually a
check for this.
Differential Revision: https://reviews.llvm.org/D78492