clang-p2996

Author	SHA1	Message	Date
River Riddle	ffde975e21	NFC: Move AffineOps dialect to the Dialect sub-directory. PiperOrigin-RevId: 264482571	2019-08-20 15:36:39 -07:00
River Riddle	ba0fa92524	NFC: Move LLVMIR, SDBM, and StandardOps to the Dialect/ directory. PiperOrigin-RevId: 264193915	2019-08-19 11:01:25 -07:00
Jacques Pienaar	79f53b0cf1	Change from llvm::make_unique to std::make_unique Switch to C++14 standard method as llvm::make_unique has been removed ( https://reviews.llvm.org/D66259). Also mark some targets as c++14 to ease next integrates. PiperOrigin-RevId: 263953918	2019-08-17 11:06:03 -07:00
Mehdi Amini	926fb685de	Express ownership transfer in PassManager API through std::unique_ptr (NFC) Since raw pointers are always passed around for IR construct without implying any ownership transfer, it can be error prone to have implicit ownership transferred the same way. For example this code can seem harmless: Pass *pass = .... pm.addPass(pass); pm.addPass(pass); pm.run(module); PiperOrigin-RevId: 263053082	2019-08-12 19:13:12 -07:00
River Riddle	1e42954032	NFC: Standardize the terminology used for parent ops/regions/etc. There are currently several different terms used to refer to a parent IR unit in 'get' methods: getParent/getEnclosing/getContaining. This cl standardizes all of these methods to use 'getParent*'. PiperOrigin-RevId: 262680287	2019-08-09 20:07:52 -07:00
Alex Zinenko	58e66d71e7	AffineDataCopyGeneration: don't use CL flag values inside the pass AffineDataCopyGeneration pass relied on command line flags for internal logic in several places, which makes it unusable in a library context (i.e. outside a standalone mlir-opt binary that does the command line parsing). Define configuration flags in the constructor instead, and set them up to command line-based defaults to maintain the original behavior. PiperOrigin-RevId: 261322364	2019-08-02 08:04:30 -07:00
Uday Bondhugula	18b8d4352b	Introduce explicit copying optimization by generalizing the DMA generation pass Explicit copying to contiguous buffers is a standard technique to avoid conflict misses and TLB misses, and improve hardware prefetching performance. When done in conjunction with cache tiling, it nearly eliminates all cache conflict and TLB misses, and a single hardware prefetch stream is needed per data tile. - generalize/extend DMA generation pass (renamed data copying pass) to perform either point-wise explicit copies to fast memory buffers or DMAs (depending on a cmd line option). All logic is the same as erstwhile -dma-generate. - -affine-dma-generate is now renamed -affine-data-copy; when -dma flag is provided, DMAs are generated, or else explicit copy loops are generated (point-wise) by default. - point-wise copying could be used for CPUs (or GPUs); some indicative performance numbers with a "C" version of the MLIR when compiled with and without this optimization (about 2x improvement here). With a matmul on 4096^2 matrices on a single core of an Intel Core i7 Skylake i7-8700K with clang 8.0.0: clang -O3: 518s clang -O3 with MLIR tiling (128x128): 24.5s clang -O3 with MLIR tiling + data copying 12.4s (code equivalent to test/Transforms/data-copy.mlir func @matmul) - fix some misleading comments. - change default fast-mem space to 0 (more intuitive now with the default copy generation using point-wise copies instead of DMAs) On a simple 3-d matmul loop nest, code generated with -affine-data-copy: ``` affine.for %arg3 = 0 to 4096 step 128 { affine.for %arg4 = 0 to 4096 step 128 { %0 = affine.apply #map0(%arg3, %arg4) %1 = affine.apply #map1(%arg3, %arg4) %2 = alloc() : memref<128x128xf32, 2> // Copy-in Out matrix. affine.for %arg5 = 0 to 128 { %5 = affine.apply #map2(%arg3, %arg5) affine.for %arg6 = 0 to 128 { %6 = affine.apply #map2(%arg4, %arg6) %7 = load %arg2[%5, %6] : memref<4096x4096xf32> affine.store %7, %2[%arg5, %arg6] : memref<128x128xf32, 2> } } affine.for %arg5 = 0 to 4096 step 128 { %5 = affine.apply #map0(%arg3, %arg5) %6 = affine.apply #map1(%arg3, %arg5) %7 = alloc() : memref<128x128xf32, 2> // Copy-in LHS. affine.for %arg6 = 0 to 128 { %11 = affine.apply #map2(%arg3, %arg6) affine.for %arg7 = 0 to 128 { %12 = affine.apply #map2(%arg5, %arg7) %13 = load %arg0[%11, %12] : memref<4096x4096xf32> affine.store %13, %7[%arg6, %arg7] : memref<128x128xf32, 2> } } %8 = affine.apply #map0(%arg5, %arg4) %9 = affine.apply #map1(%arg5, %arg4) %10 = alloc() : memref<128x128xf32, 2> // Copy-in RHS. affine.for %arg6 = 0 to 128 { %11 = affine.apply #map2(%arg5, %arg6) affine.for %arg7 = 0 to 128 { %12 = affine.apply #map2(%arg4, %arg7) %13 = load %arg1[%11, %12] : memref<4096x4096xf32> affine.store %13, %10[%arg6, %arg7] : memref<128x128xf32, 2> } } // Compute. affine.for %arg6 = #map7(%arg3) to #map8(%arg3) { affine.for %arg7 = #map7(%arg4) to #map8(%arg4) { affine.for %arg8 = #map7(%arg5) to #map8(%arg5) { %11 = affine.load %7[-%arg3 + %arg6, -%arg5 + %arg8] : memref<128x128xf32, 2> %12 = affine.load %10[-%arg5 + %arg8, -%arg4 + %arg7] : memref<128x128xf32, 2> %13 = affine.load %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2> %14 = mulf %11, %12 : f32 %15 = addf %13, %14 : f32 affine.store %15, %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2> } } } dealloc %10 : memref<128x128xf32, 2> dealloc %7 : memref<128x128xf32, 2> } %3 = affine.apply #map0(%arg3, %arg4) %4 = affine.apply #map1(%arg3, %arg4) // Copy out result matrix. affine.for %arg5 = 0 to 128 { %5 = affine.apply #map2(%arg3, %arg5) affine.for %arg6 = 0 to 128 { %6 = affine.apply #map2(%arg4, %arg6) %7 = affine.load %2[%arg5, %arg6] : memref<128x128xf32, 2> store %7, %arg2[%5, %6] : memref<4096x4096xf32> } } dealloc %2 : memref<128x128xf32, 2> } } ``` With -affine-data-copy -dma: ``` affine.for %arg3 = 0 to 4096 step 128 { %0 = affine.apply #map3(%arg3) %1 = alloc() : memref<128xf32, 2> %2 = alloc() : memref<1xi32> affine.dma_start %arg2[%arg3], %1[%c0], %2[%c0], %c128_0 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32> affine.dma_wait %2[%c0], %c128_0 : memref<1xi32> %3 = alloc() : memref<1xi32> affine.for %arg4 = 0 to 4096 step 128 { %5 = affine.apply #map0(%arg3, %arg4) %6 = affine.apply #map1(%arg3, %arg4) %7 = alloc() : memref<128x128xf32, 2> %8 = alloc() : memref<1xi32> affine.dma_start %arg0[%arg3, %arg4], %7[%c0, %c0], %8[%c0], %c16384, %c4096, %c128_2 : memref<4096x4096xf32>, memref<128x128xf32, 2>, memref<1xi32> affine.dma_wait %8[%c0], %c16384 : memref<1xi32> %9 = affine.apply #map3(%arg4) %10 = alloc() : memref<128xf32, 2> %11 = alloc() : memref<1xi32> affine.dma_start %arg1[%arg4], %10[%c0], %11[%c0], %c128_1 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32> affine.dma_wait %11[%c0], %c128_1 : memref<1xi32> affine.for %arg5 = #map3(%arg3) to #map5(%arg3) { affine.for %arg6 = #map3(%arg4) to #map5(%arg4) { %12 = affine.load %7[-%arg3 + %arg5, -%arg4 + %arg6] : memref<128x128xf32, 2> %13 = affine.load %10[-%arg4 + %arg6] : memref<128xf32, 2> %14 = affine.load %1[-%arg3 + %arg5] : memref<128xf32, 2> %15 = mulf %12, %13 : f32 %16 = addf %14, %15 : f32 affine.store %16, %1[-%arg3 + %arg5] : memref<128xf32, 2> } } dealloc %11 : memref<1xi32> dealloc %10 : memref<128xf32, 2> dealloc %8 : memref<1xi32> dealloc %7 : memref<128x128xf32, 2> } %4 = affine.apply #map3(%arg3) affine.dma_start %1[%c0], %arg2[%arg3], %3[%c0], %c128 : memref<128xf32, 2>, memref<4096xf32>, memref<1xi32> affine.dma_wait %3[%c0], %c128 : memref<1xi32> dealloc %3 : memref<1xi32> dealloc %2 : memref<1xi32> dealloc %1 : memref<128xf32, 2> } ``` Signed-off-by: Uday Bondhugula <uday@polymagelabs.com> Closes tensorflow/mlir#50 PiperOrigin-RevId: 261221903	2019-08-01 16:31:58 -07:00

7 Commits