This change adds the `dot.accumulate.2way` Op to the NVVM dialect. The Op models the PTX `dp2a` instruction, a two-way dot product of 16-bit and 8-bit operands accumulated into a 32-bit result. PTX Spec Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#integer-arithmetic-instructions-dp2a
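
For readers unfamiliar with `dp2a`, its arithmetic can be sketched as a host-side C++ emulation based on the pseudocode in the PTX spec. This is only an illustration of the instruction's semantics, not the Op definition or lowering added here; the helper name `dp2a_ref` and its parameters are hypothetical.

```cpp
#include <cstdint>

// Hypothetical reference emulation of PTX dp2a (not part of this change).
//   a        : 32-bit word holding two packed 16-bit lanes
//   b        : 32-bit word holding four packed 8-bit lanes
//   c        : 32-bit accumulator (raw bits)
//   a_signed : treat the lanes of `a` as signed (s32) or unsigned (u32)
//   b_signed : treat the lanes of `b` as signed (s32) or unsigned (u32)
//   hi       : use the upper two bytes of `b` (true) or the lower two (false)
// Returns the raw 32-bit result; reinterpret it as signed when either
// input is signed.
uint32_t dp2a_ref(uint32_t a, uint32_t b, uint32_t c,
                  bool a_signed, bool b_signed, bool hi) {
  uint32_t acc = c;
  const unsigned b_select = hi ? 2u : 0u;
  for (unsigned i = 0; i < 2; ++i) {
    // Extract the i-th 16-bit lane of a and the selected 8-bit lane of b.
    const uint16_t raw_a = static_cast<uint16_t>(a >> (16 * i));
    const uint8_t raw_b = static_cast<uint8_t>(b >> (8 * (b_select + i)));
    // Sign- or zero-extend each lane to 32 bits.
    const int32_t va = a_signed ? static_cast<int32_t>(static_cast<int16_t>(raw_a))
                                : static_cast<int32_t>(raw_a);
    const int32_t vb = b_signed ? static_cast<int32_t>(static_cast<int8_t>(raw_b))
                                : static_cast<int32_t>(raw_b);
    // The product always fits in 32 bits; accumulate with wraparound.
    acc += static_cast<uint32_t>(va * vb);
  }
  return acc;
}
```

For example, with `a = 0x00030002` (lanes 2 and 3), `b = 0x07060504` (bytes 4, 5, 6, 7) and `c = 10`, the low mode computes 2*4 + 3*5 + 10 and the high mode computes 2*6 + 3*7 + 10.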