- Add a `gpr32` suffix to the test name to denote the specific register
class being checked
- Replace `-mtriple=arm64-apple-ios` with `-march=arm64` to broaden the
test to the generic architecture, since the specific triple is not
required
- Port `bl` match to Linux too via the regex: `{{_?foo}}`
- Bump `-mcpu=cyclone` to the newer M-series `-mcpu=apple-m1`
- Use `-mcpu` so that `-mattr=-zcm` has a real effect
- Add a test that generic arm64 doesn't optimize for ZCM
- Distinguish 4 different assembly layouts: NOTCPU, CPU, NOTATTR, ATTR
- Fix broken test logic. For example, `; NOT: mov [[REG2:w[0-9]+]], w3`
matched `mov w1, w3`, so `REG2` captured `w1`, but then `; NOT: mov w1,
[[REG2]]` matched the prefix of `mov w1, w19` even though it should only
have matched `mov w1, w1`. This change adds explicit matches for all of
the generated copies.
Deleting a basic block removes all references from jump tables, which
is O(n). When freeing a MachineFunction, all basic blocks are deleted
before the jump tables, causing O(n^2) runtime. Fix this by deallocating
the jump table first.
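To illustrate the ordering issue, here is a minimal, self-contained sketch using hypothetical stand-ins (not LLVM's actual MachineFunction/MachineJumpTableInfo classes): if each block removal scans the jump tables for references, freeing the jump tables first makes that scan a no-op.
```
#include <memory>
#include <vector>

struct JumpTables {
  std::vector<std::vector<int>> Entries; // block ids referenced by each table
};

struct Function {
  std::vector<int> Blocks;               // stand-in for the basic block list
  std::unique_ptr<JumpTables> JTI;

  void eraseBlock(int Id) {
    if (JTI)                             // O(n) scan over jump-table entries
      for (auto &Table : JTI->Entries)
        std::erase(Table, Id);
  }

  ~Function() {
    JTI.reset();                         // free jump tables first...
    for (int Id : Blocks)                // ...so teardown is O(n), not O(n^2)
      eraseBlock(Id);
  }
};
```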
Test case generator:
```
import sys
n = int(sys.argv[1])
print("define void @f(i64 %c, ptr %p) {")
print(" switch i64 %c, label %d [")
for i in range(n):
    print(f" i64 {i}, label %h{i}")
print(f" ]")
for i in range(n):
    print(f'h{i}:')
    print(f' store i64 {i*i}, ptr %p')
    print(f' ret void')
print('d:')
print(' ret void')
print('}')
```
Improvement at 5000 entries:
Benchmark 1: ./llc.pre -filetype=obj -O0 <switch5k.bc
Time (mean ± σ): 49.7 ms ± 1.0 ms
Range (min … max): 48.0 ms … 52.1 ms 57 runs
Benchmark 2: ./llc.post -filetype=obj -O0 <switch5k.bc
Time (mean ± σ): 39.4 ms ± 0.8 ms
Range (min … max): 37.1 ms … 41.1 ms 72 runs
Summary
./llc.post -filetype=obj -O0 <switch5k.bc ran
1.26 ± 0.04 times faster than ./llc.pre -filetype=obj -O0 <switch5k.bc
Improvement at 20000 entries:
Benchmark 1: ./llc.pre -filetype=obj -O0 <switch20k.bc
Time (mean ± σ): 281.7 ms ± 1.0 ms
Range (min … max): 280.2 ms … 283.0 ms 10 runs
Benchmark 2: ./llc.post -filetype=obj -O0 <switch20k.bc
Time (mean ± σ): 123.9 ms ± 1.5 ms
Range (min … max): 121.4 ms … 129.2 ms 23 runs
Summary
./llc.post -filetype=obj -O0 <switch20k.bc ran
2.27 ± 0.03 times faster than ./llc.pre -filetype=obj -O0 <switch20k.bc
Pull Request: https://github.com/llvm/llvm-project/pull/144108
When dumping evaluate::Expr, show type names, which contain a lot of
useful information.
For example, show
```
expr <Fortran::evaluate::SomeType> {
expr <Fortran::evaluate::SomeKind<Fortran::common::TypeCategory::Integer>> {
expr <Fortran::evaluate::Type<Fortran::common::TypeCategory::Integer, 4>> {
...
```
instead of
```
expr T {
expr T {
expr T {
...
```
### Description
This patch improves the folding efficiency of `vector.insert` and
`vector.extract` operations by not returning early after successfully
converting dynamic indices to static indices.
This PR also renames the test pass `TestConstantFold` to
`TestSingleFold` and adds comprehensive documentation explaining the
single-pass folding behavior.
### Motivation
Since the `OpBuilder::createOrFold` function only calls `fold` **once**,
the current `fold` methods of `vector.insert` and `vector.extract` may
leave the op in a state that can be folded further. For example,
consider the following un-folded IR:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%c0 = arith.constant 0 : index
%e2 = vector.extract %v1[%c0] : f32 from vector<128xf32>
```
If we use `createOrFold` to create the `vector.extract` op, then the
result will be:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%e2 = vector.extract %v1[0] : f32 from vector<128xf32>
```
But this is not the optimal result. `createOrFold` should have returned
`%e1`.
The reason is that `fold` returns immediately after
`extractInsertFoldConstantOp` succeeds, causing the subsequent folding
logic to be skipped.
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
### Summary
This PR resolves https://github.com/llvm/llvm-project/issues/143864
Add the fold `(~a | x) & (a | y) -> (a & (x ^ y)) ^ y` to the
foldMaskedMerge function using SDPatternMatch.
After adding this pattern, running `ninja check-llvm-codegen` shows that
all other cases remain unchanged, so I added a test case
(fold-masked-merge-demorgan.ll) for it.
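As a quick sanity check of the identity (not part of the patch), it can be verified exhaustively over 8-bit values:
```
#include <cassert>
#include <cstdint>

int main() {
  // Check (~a | x) & (a | y) == (a & (x ^ y)) ^ y for all byte combinations.
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned x = 0; x < 256; ++x)
      for (unsigned y = 0; y < 256; ++y) {
        std::uint8_t lhs = (~a | x) & (a | y);
        std::uint8_t rhs = (a & (x ^ y)) ^ y;
        assert(lhs == rhs);
      }
  return 0;
}
```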
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Previously, the segmented iterator optimization was limited to `std::{for_each, for_each_n}`. This patch
extends the optimization to `std::ranges::for_each` and `std::ranges::for_each_n`, ensuring consistent
optimizations across these algorithms. This patch first generalizes the `std` algorithms by introducing
a `Projection` parameter, which is set to `__identity` for the `std` algorithms. Then we let the `ranges`
algorithms directly call their `std` counterparts with a general `__proj` argument. Benchmarks
demonstrate performance improvements of up to 21.4x for ``std::deque::iterator`` and 22.3x for
``join_view`` of ``vector<vector<char>>``.
Addresses a subtask of #102817.
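A rough, self-contained sketch of the shape of the change (hypothetical helper name, not libc++'s actual internals): the shared helper gains a projection parameter defaulting to the identity, and the ranges overloads forward their projection to it.
```
#include <functional>
#include <iostream>
#include <vector>

// Shared helper: a real implementation would also dispatch to a
// segment-by-segment loop for segmented iterators (e.g. deque, join_view).
template <class It, class Fn, class Proj = std::identity>
void for_each_impl(It first, It last, Fn f, Proj proj = {}) {
  for (; first != last; ++first)
    f(std::invoke(proj, *first));
}

int main() {
  std::vector<int> v{1, 2, 3};
  // "std::for_each" flavour: the projection defaults to the identity.
  for_each_impl(v.begin(), v.end(), [](int x) { std::cout << x << ' '; });
  // "ranges::for_each" flavour: forwards its projection to the same helper.
  for_each_impl(v.begin(), v.end(), [](int x) { std::cout << x << ' '; },
                [](int x) { return x * 10; });
  std::cout << '\n';
}
```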
Change the test to check the exit status of the 'ls' command line
(instead of the error message), since the error message differs when
running the 'ls' command on different host machines.
Mark the pointer parameters of as many of the reportXX functions as
possible const. This avoids the need to use const_cast when calling
these functions with an already-const pointer.
Fix reportHeaderCorruption calls where an argument was passed into an
append call that didn't use it.
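A generic illustration of the motivation (the function names below are made up, not the actual report functions): a const pointer parameter removes the need for const_cast at const call sites.
```
#include <cstdio>

// Before: a non-const pointer parameter forces const_cast at const call sites.
static void reportMutable(char *Name) { std::printf("%s\n", Name); }

// After: a const pointer parameter lets const callers pass their data directly.
static void reportConst(const char *Name) { std::printf("%s\n", Name); }

static void check(const char *Name) {
  reportMutable(const_cast<char *>(Name)); // cast needed before the change
  reportConst(Name);                       // no cast needed after
}

int main() { check("corrupt header"); }
```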
This simply copies the structure of the vector.reverse patterns from
just above, and reimplements them for the vp.reverse intrinsics when the
mask is all ones and the EVLs exactly match.
It's unfortunate that we have three different ways to represent a reverse
(shuffle, vector.reverse, and vp.reverse), but I don't see an obvious way
to remove any of them because the semantics are slightly different.
This significantly improves vectorization in TSVC_2's s112 and s1112
loops when using EVL tail folding.
The message "The atomic variable x should occur exactly once among the
arguments of the top-level [...] operator" was intended to convey that
(1) an atomic variable should be an argument, and (2) it should be
exactly one of the arguments. However, the wording turned out to be
sowing confusion instead.
Rework the corresponding check, and emit an individual error message for
each problematic situation:
- "atomic variable cannot be a proper subexpression of an argument",
- "atomic variable should appear as an argument",
- "atomic variable should be exactly one of the arguments".
Fixes https://github.com/llvm/llvm-project/issues/144599
When big-endian we need to use ld1/st1 for vector loads and stores so
that we get the elements in the correct order, but this prevents
postindex addressing from being used. Fix this by adding the appropriate
ISel patterns, plus the relevant changes in ISelLowering and
ISelDAGToDAG to cause postindex addressing to be used.
compiler-rt/lib/fuzzer/tests build was failing on armv7, with undefined
references to unwinder symbols, such as __aeabi_unwind_cpp_pr0.
This occurs because the test is built with `-nostdlib++` but `libunwind`
is not explicitly linked to the final test executable.
This patch resolves the issue by adding CMake logic to explicitly link
the required unwinder to the fuzzer tests, inspired by the same solution
used to fix Scudo build failures by https://reviews.llvm.org/D142888.
Following the addition of TensorLike and BufferLike type interfaces (see
00eaff3e9c), introduce minimal changes
required to bufferize a custom tensor operation into a custom buffer
operation.
To achieve this, new interface methods are added to TensorLike type
interface that abstract away the differences between existing (tensor ->
memref) and custom conversions.
The scope of the changes is intentionally limited (for example,
BufferizableOpInterface is untouched) in order to first understand the
basics and reach consensus design-wise.
---
Notable changes:
* mlir::bufferization::getBufferType() returns BufferLikeType (instead
of BaseMemRefType)
* ToTensorOp / ToBufferOp operate on TensorLikeType / BufferLikeType.
Operation argument "memref" renamed to "buffer"
* ToTensorOp's tensor type inferring builder is dropped (users now need
to provide the tensor type explicitly)
`__has_iterator_typedefs` is only used in the up-to-C++17 implementation
of `type_traits`. To make that clearer the struct is moved into that
code block.
A while ago, the test workflow was updated with a new preemption regex;
however, it was only applied to the test job, and not to the job that
actually restarts the failed libc++ test runs.
This fix should correct the issue and get the restarter working
again.
There have been a number of deprecation warnings added to Flang; however,
these features are only deprecated when the OpenMP version being used is
5.2 or later. Previously, flang did not take the version into account
when emitting the warnings, so they were always emitted.
Flang now ensures warnings are emitted for the appropriate version of
OpenMP, and tests are updated to reflect this change.
Background: The yaml-strtab format looks just like the yaml format,
except that the values in the key/value pairs of the remarks are
deduplicated and replaced by indices into a string table (see removed
test cases for examples). The motivation behind this format was to
reduce size of the remarks files. However, it was quickly superseded by
the bitstream format.
Therefore, remove the yaml-strtab format, as it doesn't have a good
use case anymore:
- It isn't particularly efficient
- It isn't human-readable
- It isn't straightforward to parse in external tools that can't use the
remarks library. We don't even support it in opt-viewer.
llvm-remarkutil is also missing options to parse/convert yaml-strtab, so
the chance that anyone is actually using this format is low.
I don't think DO CONCURRENT fits the definition of a Canonical Loop Nest
(OpenMP 6.0 section 6.4.1).
It is however explicitly allowed for the LOOP construct (6.0 section
13.8).
There's some obscure language in OpenMP 6.0 for the LOOP construct:
> If the collapsed loop is a DO CONCURRENT loop, neither the
> data-sharing attribute clauses nor the collapse clause may be
> specified.
From the surrounding context, I think "collapsed loop" just means the
loop that the LOOP construct applies to. So I will interpret this to
mean that DO CONCURRENT can only be used with the LOOP construct if it
does not contain the COLLAPSE clause.
This also fixes a bug where the associated clause was never cleared
after it was set.
Fixes #144178
Both the declare mapper directive argument, and the iterator modifier
can contain declaration-type-spec, so make sure that the processing of
one ends before processing of the other begins in semantic analysis.
`unresolvedMaterializations` is a mapping from
`UnrealizedConversionCastOp` to `UnresolvedMaterializationRewrite`. This
mapping is needed to find the correct type converter for an unresolved
materialization.
With this commit, `unresolvedMaterializations` is updated immediately
when an op is being erased. This also cleans up the code base a bit:
`SingleEraseRewriter` is now used only during the "cleanup" phase and no
longer needed as a field of `ConversionRewriterImpl`.
This commit is in preparation for the One-Shot Dialect Conversion
refactoring: `allowPatternRollback = false` will in the future trigger
immediate materialization of all IR changes.
Generally, bufferization should be able to create a memref from a tensor
without needing to know more than just a mlir::Type. Thus, change
BufferizationOptions::UnknownTypeConverterFn to accept just a type
(mlir::TensorType for now) instead of mlir::Value. Additionally, apply
the same rationale to getMemRefType() helper function.
Both changes are prerequisites to enable custom types support in
one-shot bufferization.
Currently, when hazard-padding is enabled, a (fixed-size) hazard slot is
placed in the CS area, just after the frame record. The size of this
slot is part of the "CalleeSaveBaseToFrameRecordOffset". The SVE
epilogue emission code assumed this offset was always zero and
incorrectly set the stack pointer, resulting in all SVE registers
being reloaded from incorrect offsets.
```
| prev_lr                           |
| prev_fp                           |
| (a.k.a. "frame record")           |
|-----------------------------------| <- fp(=x29)
| <hazard padding>                  |
|-----------------------------------| <- callee-saved base
|                                   |
| callee-saved fp/simd/SVE regs     |
|                                   |
|-----------------------------------| <- SVE callee-save base
```
i.e. in the above diagram, the code assumed `fp == callee-saved base`.
The expansion of `memref.atomic_rmw` into a `memref.generic_atomic_rmw`
for floating-point min/max operations is no longer necessary as those
are now supported by the LLVM dialect and LLVM IR.
Furthermore, combining this expansion with direct lowering of
`generic_atomic_rmw` could lead to invalid LLVM dialect IR with
`cmpxchg` operating on floating-point values that it does not support.
Fuchsia sets CLANG_DEFAULT_UNWINDLIB to libunwind. As a result, when
rtlib is set to libgcc and unwindlib is not explicitly specified, tests
using Fuchsia as the default platform will fail. To address this, the
affected tests are now xfailed.
This change fixes the following tests introduced in
45ea46c446:
clang/test/Driver/aarch64-toolchain-extra.c
clang/test/Driver/arm-toolchain-extra.c
clang/test/Driver/aarch64-toolchain.c
clang/test/Driver/arm-toolchain.c