PTX 8.8+ introduces 256-bit-wide vector loads/stores under certain conditions. This change extends the backend to lower these loads/stores. It also overrides getLoadStoreVecRegBitWidth for NVPTX, allowing the LoadStoreVectorizer to create these wider vector operations. See the spec for the three relevant PTX instructions here: - https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld - https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld-global-nc - https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st