This is a follow up to 7f6bbb3. When lowering a <N x i1> build_vector,
we currently chose to extend to i8, perform the build_vector there, and
then truncate back in vector. Our costing on the other hand accounts for
it as if we performed a vector extend, an insert, and a vector extract
for every element. This significantly over estimates the cost.
Note that we can likely do better in our build_vector lowering here by
packing the bits in scalar, and doing a build_vector of the packed bits.
Regardless, our costing should match our lowering.