Commit Graph

1342 Commits

Author SHA1 Message Date
Adam Paszke
fbfff1caff [MLIR][CAPI] Add C API dialect registration methods for Arith, Math, MemRef and Vector dialects
Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D155450
2023-07-17 14:45:49 +00:00
Alex Zinenko
371366ce27 [mlir][nvgpu] add simple pipelining for shared memory copies
Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an `scf.for` loop with either
`vector.transfer` or `nvgpu.device_async_copy` operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.
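To show what "software pipelining of copies" means, here is a host-side C++ sketch of a two-stage pipeline with a double buffer. The copy for iteration `i + 1` is issued before the computation on iteration `i`, followed by a prologue and an epilogue. The function name `pipelinedSum` and the tile layout are hypothetical. Everything here runs sequentially, whereas the actual transform rewrites `scf.for` loops that use truly asynchronous copies such as `nvgpu.device_async_copy`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: each inner vector is a "global memory" tile and
// `shared` is a two-slot "shared memory" buffer. Copies are plain
// assignments here; on a GPU they would be asynchronous and followed by a wait.
inline long pipelinedSum(const std::vector<std::vector<int>>& globalTiles) {
  const std::size_t n = globalTiles.size();
  if (n == 0) return 0;
  std::array<std::vector<int>, 2> shared;
  long acc = 0;
  // Prologue: issue the copy for iteration 0.
  shared[0] = globalTiles[0];
  // Steady state: issue the copy for iteration i + 1, then compute on iteration i.
  for (std::size_t i = 0; i + 1 < n; ++i) {
    shared[(i + 1) % 2] = globalTiles[i + 1];
    for (int v : shared[i % 2]) acc += v;
  }
  // Epilogue: compute on the last tile; no copy is left to issue.
  for (int v : shared[(n - 1) % 2]) acc += v;
  return acc;
}
```

The double buffer is what makes the overlap legal: the slot being filled for the next iteration is never the slot being read by the current one.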

This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.

This is loosely inspired by the version in IREE, but makes fewer unsafe
assumptions and has a more principled way of communicating decisions.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155223
2023-07-17 14:29:12 +00:00
Matthias Springer
a4f4d82c35 [mlir][NVGPU][NFC] Clean up code structure
* Move passes to `Transforms` directory.
* Add `Utils.h` (will be utilized in a subsequent change).

Differential Revision: https://reviews.llvm.org/D155427
2023-07-17 14:15:42 +02:00
Guillaume Chatelet
b38dda74fa [libc][NFC] Split memcmp implementations per platform
This is a follow-up to D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D155181
2023-07-17 11:35:31 +00:00
Guillaume Chatelet
83f3920854 [libc][NFC] Split memset implementations per platform
This is a follow-up to D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D155174
2023-07-17 11:12:19 +00:00
Oleg Shyshkov
4592543a01 [mlir][bazel] Fix build. 2023-07-17 11:00:20 +02:00
Matthias Springer
88f4292a16 [mlir][bufferization] OneShotBufferizeOp: Add options to use linalg.copy
This new option allows users to specify a custom memcpy op.

Differential Revision: https://reviews.llvm.org/D155280
2023-07-14 13:34:22 +02:00
Hanhan Wang
8fc433f055 [mlir][MemRef] Move narrow type emulation common methods to MemRefUtils.
It also unifies the computation of StridedLayoutAttr: if the stride is a
statically known value, it is used directly.
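For reference, row-major (identity) strides are a suffix product of the sizes, and a stride is statically known only when every size to its right is static. The sketch below uses a hypothetical `kDynamic` sentinel standing in for MLIR's dynamic-size marker. It illustrates the arithmetic only and is not the `MemRefUtils` code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sentinel for an unknown (dynamic) size or stride.
constexpr std::int64_t kDynamic = std::numeric_limits<std::int64_t>::min();

// Row-major strides: stride[i] = product of sizes[i+1..]. Once a dynamic size
// is seen (walking from the innermost dimension outwards), every stride to
// its left becomes dynamic as well.
inline std::vector<std::int64_t> rowMajorStrides(const std::vector<std::int64_t>& sizes) {
  std::vector<std::int64_t> strides(sizes.size());
  std::int64_t running = 1;
  for (std::size_t i = sizes.size(); i-- > 0;) {
    strides[i] = running;
    if (running != kDynamic)
      running = sizes[i] == kDynamic ? kDynamic : running * sizes[i];
  }
  return strides;
}
```

Note that a dynamic outermost size does not affect any stride, which is why such memrefs can still carry a fully static strided layout.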

Differential Revision: https://reviews.llvm.org/D155017
2023-07-13 14:43:21 -07:00
Guillaume Chatelet
8cc440b3e7 [libc][NFC] Split memcpy implementations per platform
This is a follow-up to D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D155099
2023-07-13 10:30:38 +00:00
Guillaume Chatelet
1c4e4e03bd [libc][NFC] Split bcmp implementations per platform
This is a follow-up to D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D155076
2023-07-13 10:19:00 +00:00
Sterling Augustine
39d6fe790c Add bazel support for new DebugBTF component. 2023-07-12 14:57:15 -07:00
Andrés Villegas
4f92557bfc [NFC][llvm-dwp] Switch from llvm::cl to OptTable
Switch the parsing of command line options from llvm::cl to OptTable.

The motivation for this change is to continue adding LLVM-based tools
to the llvm driver multicall. For more information about the proposal
and motivation, please see https://discourse.llvm.org/t/rfc-llvm-busybox-proposal/58494

Reviewed By: abrachet

Differential Revision: https://reviews.llvm.org/D154642
2023-07-12 19:12:48 +00:00
Adrian Kuegel
a69b2e3d1c [clang][Bazel] Add dependency to the right target. 2023-07-12 10:19:06 +02:00
Adrian Kuegel
93e7ef5907 [clang][Bazel] Add missing dependency. 2023-07-12 10:14:14 +02:00
Sterling Augustine
8df8f01065 Fix bazel build for 5a1cdcbd86 2023-07-11 14:27:40 -07:00
Arthur Eubanks
4cca3de87e [bazel][docs] Update build documentation
Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D155004
2023-07-11 13:36:27 -07:00
Alex Zinenko
8a918c54bb [mlir] add backward dense dataflow analysis
This is the counterpart to the forward dense dataflow analysis and
integrates into the dataflow framework. The implementation follows the
structure of existing dataflow analyses.
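Liveness is the textbook example of a backward analysis: state flows from successors to predecessors and is iterated to a fixpoint. The sketch below computes per-block live-in sets on a hypothetical `Block` model in plain C++. It is meant to illustrate the direction of propagation and does not use MLIR's dataflow framework API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical IR model: a block with use/def sets and successor indices.
struct Block {
  std::set<std::string> uses, defs;
  std::vector<int> succs;
};

// live_out(b) = union of live_in(s) over successors s
// live_in(b)  = uses(b) ∪ (live_out(b) \ defs(b))
inline std::vector<std::set<std::string>> liveIn(const std::vector<Block>& cfg) {
  std::vector<std::set<std::string>> in(cfg.size());
  bool changed = true;
  while (changed) {  // sets only grow, so the iteration reaches a fixpoint
    changed = false;
    for (int b = static_cast<int>(cfg.size()) - 1; b >= 0; --b) {
      std::set<std::string> out;
      for (int s : cfg[b].succs) out.insert(in[s].begin(), in[s].end());
      std::set<std::string> newIn = cfg[b].uses;
      for (const auto& v : out)
        if (!cfg[b].defs.count(v)) newIn.insert(v);
      if (newIn != in[b]) {
        in[b] = std::move(newIn);
        changed = true;
      }
    }
  }
  return in;
}
```

Visiting blocks in reverse order is only a heuristic that speeds up convergence; the fixpoint is the same in any visiting order.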

Reviewed By: Mogball, phisiart

Differential Revision: https://reviews.llvm.org/D154713
2023-07-11 16:47:53 +00:00
Aliia Khasanova
be29fe2f98 Fix bazel build file for D154060.
Differential Revision: https://reviews.llvm.org/D154976
2023-07-11 17:33:58 +02:00
NAKAMURA Takumi
82371e68e4 [Bazel] Fixup for D153758, D153850, and D153861 (global-isel-combiner-matchtable) 2023-07-11 22:53:38 +09:00
Fangrui Song
7f7f4a6b17 [bazel] Adjust llvm:DebugInfo after D149501 (BTF.h) 2023-07-10 15:05:51 -07:00
Guillaume Chatelet
bfd94882f2 [libc][NFC] Move aligned access implementations to separate header
Follow up on https://reviews.llvm.org/D154770

Differential Revision: https://reviews.llvm.org/D154800
2023-07-09 22:17:05 +00:00
Guillaume Chatelet
dbaa5838c1 [libc][NFC] Move memfunction's byte per byte implementations to a separate header
There will be subsequent patches to move things around and make the file layout more principled.

Differential Revision: https://reviews.llvm.org/D154770
2023-07-09 07:21:58 +00:00
Alex Zinenko
9ab34689b0 [mlir] add a simple gpu barrier elimination mechanism
GPU code generation, and specifically shared memory copy insertion, may
introduce spurious barriers guarding read-after-read dependencies or
read-after-write dependencies on non-aliasing data, which degrades
performance due to unnecessary synchronization. Add a pattern and transform
op that remove such barriers by analyzing the memory effects that the barrier
actually guards and that are not also guarded by other barriers. The code is
adapted from the Polygeist incubator project.
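The core criterion can be sketched as a conflict check between the memory effects before and after a barrier. A barrier is needed only if some pair of accesses to aliasing memory includes a write. This C++ sketch is a simplification under stated assumptions: `Effect` and `barrierNeeded` are hypothetical, and aliasing is modeled as equality of a buffer id. The actual pattern works on MLIR memory effects together with alias analysis.

```cpp
#include <cassert>
#include <vector>

enum class Kind { Read, Write };

// Hypothetical effect record: equal `buffer` ids are assumed to alias,
// distinct ids are assumed not to.
struct Effect {
  Kind kind;
  int buffer;
};

// `before`/`after`: effects between this barrier and the neighboring barriers.
// Read-after-read pairs and accesses to non-aliasing buffers need no sync.
inline bool barrierNeeded(const std::vector<Effect>& before,
                          const std::vector<Effect>& after) {
  for (const Effect& a : before)
    for (const Effect& b : after)
      if (a.buffer == b.buffer && (a.kind == Kind::Write || b.kind == Kind::Write))
        return true;
  return false;
}
```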

Co-authored-by: William Moses <gh@wsmoses.com>
Co-authored-by: Ivan Radanov Ivanov <ivanov.i.aa@m.titech.ac.jp>

Reviewed By: nicolasvasilache, wsmoses

Differential Revision: https://reviews.llvm.org/D154720
2023-07-07 18:51:49 +00:00
Guillaume Chatelet
cb1468d3cb [libc] Adding a version of memcpy w/ software prefetching
For machines with a lot of cores, hardware prefetchers can saturate the memory bus when utilization is high.
In this case it is desirable to turn off the hardware prefetcher completely.
This has a big impact on the performance of memory functions such as `memcpy` that rely on the fact that the next cache line will be readily available.

This patch adds the 'LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING' compile-time option that generates a version of memcpy with software prefetching. While it does not fully restore the original performance, it mitigates the impact to an acceptable level.
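The general technique is to prefetch a fixed distance ahead of the block currently being copied, so that upcoming cache lines are already in flight when the hardware prefetcher is disabled. The sketch below uses the GCC/Clang `__builtin_prefetch` builtin. The function name `memcpyWithPrefetch`, the block size and the prefetch distance are all hypothetical and are not the values or code used by LLVM libc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical tuning constants, chosen for illustration only.
constexpr std::size_t kBlockSize = 64;          // assumed cache-line size
constexpr std::size_t kPrefetchDistance = 256;  // bytes ahead to prefetch

inline void* memcpyWithPrefetch(void* dst, const void* src, std::size_t n) {
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);
  std::size_t i = 0;
  for (; i + kBlockSize <= n; i += kBlockSize) {
    // Request a future source line for reading (rw = 0) with high locality (3).
    if (i + kPrefetchDistance < n)
      __builtin_prefetch(s + i + kPrefetchDistance, 0, 3);
    std::memcpy(d + i, s + i, kBlockSize);  // copy one block
  }
  if (i < n) std::memcpy(d + i, s + i, n - i);  // copy the tail
  return dst;
}
```

The right distance depends on memory latency and copy throughput of the target machine, which is why it would be a tuning parameter in practice.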

Reviewed By: rtenneti

Differential Revision: https://reviews.llvm.org/D154494
2023-07-07 10:37:32 +00:00
Haojian Wu
99074aafc3 [bazel] Port for 88e95c1e4b 2023-07-07 09:02:05 +02:00
Michael Jones
cfbcbc8f88 [libc] fix MPFR rounding problems in fuzz test
The accuracy for the MPFR numbers in the strtofloat fuzz test was set
too high, causing rounding issues when rounding to a smaller final
result.

Reviewed By: lntue

Differential Revision: https://reviews.llvm.org/D154150
2023-07-05 10:53:40 -07:00
Alexander Belyaev
594643177f Fix bazel build after https://reviews.llvm.org/D150578. 2023-07-05 14:33:49 +02:00
Matthias Springer
cb7bda2ace [mlir][NFC] Use getConstantIntValue instead of casting to ConstantIndexOp
`getConstantIntValue` extracts constant values from all constant-like ops, not just `arith::ConstantIndexOp`.

Differential Revision: https://reviews.llvm.org/D154356
2023-07-04 14:08:37 +02:00
Benjamin Kramer
9846b9e2ca [bazel] Add missing dependency for d9d9be63a5 2023-07-04 13:34:03 +02:00
Matthias Springer
8b8e62d3f6 [mlir][SCF] Add loop.promote_if_one_iteration transform op
This transform op promotes loops with one iteration. I.e., the loop op is replaced by just the loop body.
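For loops with constant bounds, "one iteration" reduces to a trip-count check: `for (i = lb; i < ub; i += step)` runs `ceil((ub - lb) / step)` times when `ub > lb`. The helper names below are hypothetical, and the sketch assumes a positive step and small enough values that `ub - lb + step - 1` does not overflow.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t tripCount(std::int64_t lb, std::int64_t ub, std::int64_t step) {
  return ub <= lb ? 0 : (ub - lb + step - 1) / step;
}

// When this holds, the loop can be replaced by its body with the induction
// variable substituted by `lb`.
constexpr bool canPromote(std::int64_t lb, std::int64_t ub, std::int64_t step) {
  return step > 0 && tripCount(lb, ub, step) == 1;
}
```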

Differential Revision: https://reviews.llvm.org/D154361
2023-07-04 08:58:49 +02:00
Matthias Springer
fa1a23a720 [mlir][transform] Add transform.apply_licm op
This op applies loop-invariant code motion to the targeted loop-like op.
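To make "loop-invariant code motion" concrete, here is a hand-written before/after pair in C++. The transform op performs the analogous rewrite on MLIR loop-like ops, not on C++ source, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Before: `a * b` does not depend on `i` but is recomputed in every iteration.
inline unsigned sumBefore(const unsigned* xs, std::size_t n, unsigned a, unsigned b) {
  unsigned s = 0;
  for (std::size_t i = 0; i < n; ++i) s += xs[i] * (a * b);
  return s;
}

// After: the invariant product is hoisted above the loop. Executing it even
// when n == 0 is safe because unsigned multiplication has no side effects
// and wraps instead of invoking undefined behavior.
inline unsigned sumAfter(const unsigned* xs, std::size_t n, unsigned a, unsigned b) {
  const unsigned ab = a * b;
  unsigned s = 0;
  for (std::size_t i = 0; i < n; ++i) s += xs[i] * ab;
  return s;
}
```

The speculation concern in the second comment is the general reason LICM only hoists operations that are side-effect free and safe to execute unconditionally.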

Differential Revision: https://reviews.llvm.org/D154327
2023-07-03 15:28:53 +02:00
Adrian Kuegel
630b8d36c0 [mlir][Bazel] Add missing dependencies after 564713c471 2023-07-03 13:16:28 +02:00
Matthias Springer
180f9ef8b7 [mlir][linalg] LinalgOp-anchored empty tensor elimination
This revision adds a pre-bufferization transform that can reduce the number of allocations. It is similar to `bufferization.eliminate_empty_tensors`, but specific to LinalgOp.

The transform looks for `tensor.empty` ops where the SSA use-def chain ends in an "ins" operand of a `LinalgOp`. If the same `LinalgOp` has an unused "outs" operand (and some other conditions are met), this "outs" operand can be used instead of the `tensor.empty` and the "ins" operand can be turned into an "outs" operand.

Differential Revision: https://reviews.llvm.org/D153952
2023-07-03 09:17:48 +02:00
Haojian Wu
b28296c500 [bazel] Port bazel support for 5bf8efd269 2023-07-01 08:27:26 +02:00
Guillaume Chatelet
1c814c99aa [libc] Improve memcmp latency and codegen
This is based on ideas from @nafi to:
 - use a branchless version of 'cmp' for 'uint32_t',
 - completely resolve the lexicographic comparison through vector
   operations when wide types are available. We also get rid of byte
   reloads and serializing '__builtin_ctzll'.

I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' as it did not seem to help the codegen. This can
be revisited in subsequent patches.

The code has been rewritten to reduce nested function calls, making the
job of the inliner easier and preventing harmful code duplication.
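The "branchless `cmp`" idea for `uint32_t` can be sketched as follows. Each 4-byte chunk is loaded as a big-endian value so that integer order matches memcmp's byte-wise lexicographic order. The three-way result is then formed from two comparisons instead of a branch. The helper names are hypothetical, and the byte assembly is written out explicitly to stay endianness-independent. The actual patch may use byte-swap builtins and different code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Branchless three-way comparison: -1, 0 or 1 without a conditional jump.
constexpr int cmp3(std::uint32_t a, std::uint32_t b) {
  return static_cast<int>(a > b) - static_cast<int>(a < b);
}

// Big-endian load: the first byte in memory becomes the most significant one.
inline std::uint32_t loadBigEndian32(const unsigned char* p) {
  return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
         (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// Compares exactly 4 bytes with memcmp semantics (sign of the result).
inline int memcmp4(const void* lhs, const void* rhs) {
  return cmp3(loadBigEndian32(static_cast<const unsigned char*>(lhs)),
              loadBigEndian32(static_cast<const unsigned char*>(rhs)));
}
```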

Reviewed By: nafi3000

Differential Revision: https://reviews.llvm.org/D148717
2023-06-30 13:00:58 +00:00
Aart Bik
6b88c852b6 [mlir][sparse] Start migration to new surface syntax for STEA
We are in the process of migrating to a much improved surface syntax for the Sparse Tensor Encoding Attribute (STEA).

You can see a preview of this in the StableHLO RFC at

 https://github.com/openxla/stablehlo/blob/main/rfcs/20230210-sparsity.md

This design is courtesy of Wren Romano.

This initial revision
(1) Introduces the first version of a new parser written by Wren Romano
(2) Introduces a simple "migration plan" using NEW_SYNTAX on the STEA, which will allow us to test the new parser with new examples, as well as migrate existing examples over without the need to rewrite them all

This first "drop" merely provides the entry points to parse the new syntax. The parser is still under active development. For example, we need to address the "lookahead" issue when parsing the lvl spec (viz. do we see l0 = d0 or a direct d0). Another larger task is to actually implement "affine" parsing (since the MLIR affine parser is not accessible in other parts of the tree).

EXAMPLE:

Currently, CSR looks like

  #CSR = #sparse_tensor.encoding<{
    lvlTypes = ["dense","compressed"],
    dimToLvl = affine_map<(i,j) -> (i,j)>
  }>

but you can "force" the new parser with

  #CSR = #sparse_tensor.encoding<{
    NEW_SYNTAX =
    (d0, d1) -> (l0 = d0 : dense, l1 = d1 : compressed)
  }>

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D153997
2023-06-29 11:32:07 -07:00
Tue Ly
f320fefc4a [libc][math] Implement erff function correctly rounded to all rounding modes.
Implement a correctly rounded `erff` function.

For `x >= 4`, `erff(x) = 1` for `FE_TONEAREST` or `FE_UPWARD`, `0x1.ffffep-1` for `FE_DOWNWARD` or `FE_TOWARDZERO`.

For `0 <= x < 4`, we divide into 32 sub-intervals of length `1/8`, and use a degree-15 odd polynomial to approximate `erff(x)` in each sub-interval:
```
  erff(x) ~ x * (c0 + c1 * x^2 + c2 * x^4 + ... + c7 * x^14).
```

For `x < 0`, we can use the same formula as above, since the odd part is factored out.
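The evaluation scheme above, including the selection of a length-`1/8` sub-interval and the odd polynomial `x * P(x^2)` evaluated with Horner's rule, can be sketched as below. The coefficients are a stand-in: the Taylor series of erf around 0, `c[n] = (2/sqrt(pi)) * (-1)^n / (n! * (2n + 1))`, which is only accurate near `x = 0`. They are not the per-interval coefficients used in libc, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// x * (c[0] + c[1]*x^2 + ... + c[7]*x^14), Horner's scheme in x^2.
inline double oddPoly15(double x, const double (&c)[8]) {
  const double x2 = x * x;
  double p = c[7];
  for (int k = 6; k >= 0; --k) p = std::fma(p, x2, c[k]);
  return x * p;
}

// Index (0..31) of the length-1/8 sub-interval containing |x|, for |x| < 4.
inline int subInterval(double x) { return static_cast<int>(std::fabs(x) * 8.0); }

// Stand-in coefficients (Taylor series of erf at 0), NOT libc's coefficients.
constexpr double kTwoOverSqrtPi = 1.1283791670955126;
constexpr double kTaylor[8] = {
    kTwoOverSqrtPi,        -kTwoOverSqrtPi / 3,   kTwoOverSqrtPi / 10,
    -kTwoOverSqrtPi / 42,  kTwoOverSqrtPi / 216,  -kTwoOverSqrtPi / 1320,
    kTwoOverSqrtPi / 9360, -kTwoOverSqrtPi / 75600};
```

Because `x^2` is the same for `x` and `-x`, the result for a negative input is exactly the negation of the result for its absolute value, which is why the same formula covers `x < 0`.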

Performance tested with `perf.sh` tool from the CORE-MATH project on AMD Ryzen 9 5900X:

Reciprocal throughput (clock cycles / op)
```
$ ./perf.sh erff --path2
GNU libc version: 2.35
GNU libc release: stable
-- CORE-MATH reciprocal throughput --  with -march=native      (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 11.790 + 0.182 clc/call; Median-Min = 0.154 clc/call; Max = 12.255 clc/call;
-- CORE-MATH reciprocal throughput --  with -march=x86-64-v2      (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 14.205 + 0.151 clc/call; Median-Min = 0.159 clc/call; Max = 15.893 clc/call;

-- System LIBC reciprocal throughput --
[####################] 100 %
Ntrial = 20 ; Min = 45.519 + 0.445 clc/call; Median-Min = 0.552 clc/call; Max = 46.345 clc/call;

-- LIBC reciprocal throughput --  with -mavx2 -mfma     (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 9.595 + 0.214 clc/call; Median-Min = 0.220 clc/call; Max = 9.887 clc/call;
-- LIBC reciprocal throughput --  with -msse4.2     (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 10.223 + 0.190 clc/call; Median-Min = 0.222 clc/call; Max = 10.474 clc/call;
```

and latency (clock cycles / op):
```
$ ./perf.sh erff --path2
GNU libc version: 2.35
GNU libc release: stable
-- CORE-MATH latency --  with -march=native      (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 38.566 + 0.391 clc/call; Median-Min = 0.503 clc/call; Max = 39.170 clc/call;
-- CORE-MATH latency --  with -march=x86-64-v2      (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 43.223 + 0.667 clc/call; Median-Min = 0.680 clc/call; Max = 43.913 clc/call;

-- System LIBC latency --
[####################] 100 %
Ntrial = 20 ; Min = 111.613 + 1.267 clc/call; Median-Min = 1.696 clc/call; Max = 113.444 clc/call;

-- LIBC latency --  with -mavx2 -mfma     (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 40.138 + 0.410 clc/call; Median-Min = 0.536 clc/call; Max = 40.729 clc/call;
-- LIBC latency --  with -msse4.2     (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 44.858 + 0.872 clc/call; Median-Min = 0.814 clc/call; Max = 46.019 clc/call;
```

Reviewed By: michaelrj

Differential Revision: https://reviews.llvm.org/D153683
2023-06-28 13:58:37 -04:00
Nicolas Vasilache
13f4e889c5 Revert "Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions""
This reverts commit 6506692fe6.

Differential Revision: https://reviews.llvm.org/D153845
2023-06-28 06:50:05 +00:00
Mehdi Amini
6506692fe6 Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions"
This reverts commit 40deed40ae.
and commit 1660f2174d.

The buildbot is broken: the two tests aren't passing.
2023-06-27 08:46:18 +02:00
Benjamin Kramer
a18266473b [bazel][mlir] Add missing dependencies for 5a1cdcbd86 2023-06-27 01:24:15 +02:00
Andres Villegas
939c03512d [llvm-libtool-darwin] Switch to OptTableSummary
Switch the parse of command line options fromllvm::cl to OptTable.
The motivation for this change is to continue adding llvm based tools
to the llvm driver multicall.

Differential Revision: https://reviews.llvm.org/D153665
2023-06-26 14:37:51 -07:00
Fangrui Song
19e9b9b589 [bazel] Add includes after 5a63b2b304 2023-06-26 12:55:48 -07:00
Nicolas Vasilache
40deed40ae [mlir][Transform] Introduce nvgpu transform extensions
Mapping to NVGPU operations such as mma.sync with mixed precision and ldmatrix with transposes and
various data types involves complex matchings from low-level IR.
This is akin to raising complex patterns after unnecessarily having lost structural information.
To avoid such unnecessary complexity, introduce a direct mapping step from a matmul on memrefs
to distributed NVGPU vector abstractions.
In this context, mapping to specific mma.sync operations is trivial and consists of simply
translating the documentation into indexing expressions.

Correctness is demonstrated with an end-to-end integration test.

Differential Revision: https://reviews.llvm.org/D153420
2023-06-26 16:21:28 +00:00
Christian Sigg
9feed59a91 [Bazel][llvm] Fix after 8de9f2b 2023-06-26 14:55:03 +02:00
Benjamin Kramer
4340ef141c [bazel] Add TargetParser dep to tblgen after 8de9f2b558 2023-06-26 12:04:54 +02:00
Christian Sigg
cd482968dc [Bazel][mlir] Avoid ODR violation introduced in 7ab749c.
This change also prepares for 9119325 to land again.

Adds `mlir_c_runner_utils_hdrs` and `mlir_runner_utils_hdrs` targets which do not depend on `//llvm:Support`.

These can be used by other 'runner.so' targets if they are loaded along with the 'runner_utils.so' without calling `__mlir_execution_engine_init()` twice.
2023-06-22 08:00:50 +02:00
Guillaume Chatelet
bd1cba9f4f Revert D148717 "[libc] Improve memcmp latency and codegen"
Once integrated into our codebase, the patch triggered a number of test
failures. We do not yet understand where the bug is, but we are reverting
it to move forward with integration.
This reverts commit 5e32765c15.
2023-06-21 12:37:14 +00:00
Christian Sigg
699e64c0d9 Revert "[Bazel][mlir] Fix ODR violation introduced in 7ab749c."
This reverts commit e83c8c3600.

Depending only on the support header files is not sufficient.
2023-06-21 14:29:44 +02:00
Christian Sigg
e83c8c3600 [Bazel][mlir] Fix ODR violation introduced in 7ab749c. 2023-06-21 11:15:09 +02:00
Christian Sigg
7ab749c3a8 [Bazel][mlir] Fix after bba2b65611 2023-06-20 23:00:38 +02:00