shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
This is motivated by an example in D111800
(although that patch avoids the problem for that particular example).
The pattern is shown in reduced form with:
https://llvm.org/PR52178https://alive2.llvm.org/ce/z/d8zB4D
There is no difference on the PhaseOrdering test from D111800
because the aarch64 cost model says that the shuffle cost is 3 while
the fadd cost is 2.
Differential Revision: https://reviews.llvm.org/D111901