This special case tried to measure if the shuffle vector will be multiple
inserts into an existing vector, with one of the lanes already in-place. If so
it reduces the cost by 1 to to represent it will can insert n-1 vector lanes.
This isn't always true though as the original vector may need to be moved to a
new value to start inserting new values into it, if other values from the
original are still needed.
This didn't effect performance much when I tried it, but should hopefully start
to address a regression we see from differences in SLP vectorization lane
orders.