[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
Add warp synchronous matrix-multiply accumulate ops in GPU and NVVM
dialect. Add following three ops to GPU dialect :-
1.) subgroup_mma_load_matrix
2.) subgroup_mma_store_matrix
3.) subgroup_mma_compute
Add following three ops to NVVM dialect :-
1.) wmma.m16n16k16.load.[a,b,c].[f16,f32].row.stride
2.) wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]
Reviewed By: bondhugula, ftynse, ThomasRaoux
Differential Revision: https://reviews.llvm.org/D95330