## Purpose
This patch ensures that the BLAKE3 implementation in the LLVM Support
library exports its public interface with `__declspec(dllexport)` when
building LLVM as a Windows DLL.
## Background
The effort to support building LLVM as a Windows DLL is tracked in
#109483. Additional context is provided in [this
discourse](https://discourse.llvm.org/t/psa-annotating-llvm-public-interface/85307).
## Overview
Replicate [this
logic](https://github.com/llvm/llvm-project/blob/main/llvm/cmake/modules/AddLLVM.cmake#L662-L664)
from `llvm_add_library()` for the `LLVMSupportBlake3` target. Without
this change, the `llvm_blake3_*` functions will only be annotated with
`__declspec(dllimport)` when building LLVM as a Windows DLL, which leads
to inconsistent DLL linkage warnings from MSVC and `clang-cl`.
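For illustration, here is a minimal sketch of the annotation mismatch (the macro name and the simplified signature are placeholders, not the actual LLVM headers): when the library that defines the symbols is compiled without the define that selects `dllexport`, its headers declare the functions `dllimport` while the definitions export nothing, which is exactly what MSVC and `clang-cl` flag as inconsistent DLL linkage.
```cpp
// Placeholder macro; the real LLVM headers use their own export scheme.
#if defined(BUILDING_LLVM_DLL)
#define DEMO_ABI __declspec(dllexport) // compiling the DLL itself
#else
#define DEMO_ABI __declspec(dllimport) // compiling a client of the DLL
#endif

DEMO_ABI void llvm_blake3_hasher_init(void *Self); // header declaration

// If this definition is built without BUILDING_LLVM_DLL, the declaration
// above said dllimport, so the compiler warns about inconsistent linkage.
void llvm_blake3_hasher_init(void *Self) { /* ... */ }
```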
Add remark format 'Auto', which performs automatic detection of the
remark format using the magic numbers at the beginning of the remarks
files.
The RemarkLinker already did something similar, so that logic has been
streamlined and exposed to llvm-remarkutil.
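As a rough sketch of what the detection amounts to (the magic strings below are illustrative assumptions, not necessarily the exact container magics), the tool peeks at the first bytes of the remarks buffer and picks a parser:
```cpp
#include "llvm/ADT/StringRef.h"

enum class DetectedFormat { YAML, Bitstream, Unknown };

// Illustrative sketch only: choose the remark parser from leading bytes.
DetectedFormat detectRemarkFormat(llvm::StringRef Buffer) {
  if (Buffer.starts_with("RMRK")) // assumed bitstream container magic
    return DetectedFormat::Bitstream;
  if (Buffer.starts_with("--- !")) // assumed YAML remark document header
    return DetectedFormat::YAML;
  return DetectedFormat::Unknown;
}
```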
Allows memcpy to memcpy forwarding in cases where the second memcpy is
larger, but the overread is known to be undef, by shrinking the memcpy
size.
Refs https://github.com/llvm/llvm-project/pull/140954 which laid some of
the groundwork for this.
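A conceptual C++ analogue of the case this enables (hypothetical code, not taken from the patch or its tests):
```cpp
#include <string.h>

void example(char *dst, const char *src) {
  char tmp[32];
  memcpy(tmp, src, 16); // first memcpy: only tmp[0..15] are defined
  memcpy(dst, tmp, 32); // second, larger memcpy: tmp[16..31] are undef
  // Since the overread is undef, the second copy can be shrunk to 16
  // bytes and forwarded directly from the source: memcpy(dst, src, 16).
}
```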
Const-qualifying Values in the analysis result makes them unusable with
IRBuilder. The issue was discovered when attempting to use the result of
the analysis for a transform.
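A small illustration of the problem (the helper below is hypothetical): IRBuilder creation methods take non-const `Value *`, so a const-qualified analysis result forces a `const_cast` at every use.
```cpp
#include "llvm/IR/IRBuilder.h"

// Hypothetical helper, for illustration only.
llvm::Value *emitLoad(llvm::IRBuilder<> &B, llvm::Type *Ty,
                      const llvm::Value *Addr) {
  // B.CreateLoad(Ty, Addr);  // does not compile: CreateLoad wants Value *
  return B.CreateLoad(Ty, const_cast<llvm::Value *>(Addr));
}
```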
Big-endian CRC tables are incorrect due to the initial value of CRC in
genSarwateTable being hard-coded for CRC-8. 128 is the signed-min value
for CRC-8, but it should be generalized to APInt::getSignedMinValue. The
issue was found when writing CRC verification tests for llvm-test-suite.
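A minimal sketch of the generalization (the helper name is made up; the actual change is inside genSarwateTable):
```cpp
#include "llvm/ADT/APInt.h"

// The initial CRC value should have only the top bit set for the table's
// bit width, rather than being hard-coded to 128 (which is CRC-8 only).
llvm::APInt initialCRC(unsigned BitWidth) {
  return llvm::APInt::getSignedMinValue(BitWidth); // 0x80, 0x8000, ...
}
```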
This patch is closely related to #139293 and addresses an existing issue
in the loop transformation codebase. Specifically, it corrects the
handling of the `NumGeneratedLoops` variable in
`OMPLoopTransformationDirective` AST nodes and its inheritors (such as
OMPUnrollDirective, OMPTileDirective, etc.).
Previously, this variable was inaccurately set for certain
transformations like reverse or tile. While this did not lead to
functional bugs, since the value was only checked to determine whether
it was greater than zero or equal to zero, the inconsistency could
introduce problems when supporting more complex directives in the
future.
This just adds some convenience methods to feature control and rewrites
old code in terms of those methods. Also cleans up some names that I
just realized were overloads of another method.
PPCTTIImpl defines hasActiveVectorLength and also getVPMemoryOpCost, but
they appear unused (i.e. no changes to tests).
Remove them, as they complicate the interface for hasActiveVectorLength.
This simplifies the only use in LV, as placeholder values no longer need
to be passed.
PR: https://github.com/llvm/llvm-project/pull/142310
- Add a `gpr32` suffix to test name to denote the specific register
class being checked
- Expand `-mtriple=arm64-apple-ios` to `-march=arm64` to broaden the
test context to the generic architecture, as the specific triple is not
required
- Port `bl` match to Linux too via the regex: `{{_?foo}}`
- Advance `-mcpu=cyclone` to the newer M-series `-mcpu=apple-m1`
- Use `-mcpu` so that `-mattr=-zcm` has a real effect
- Add a test that generic arm64 doesn't optimize for ZCM
- Distinguish 4 different assembly layouts: NOTCPU, CPU, NOTATTR, ATTR
- Fix broken test logic, for example: `; NOT: mov [[REG2:w[0-9]+]], w3`
matched `mov w1, w3`, so `REG2` captured `w1`, but then `; NOT: mov w1,
[[REG2]]` matched `mov w1, w19` by prefix even though it should have
matched `mov w1, w1`. This change adds explicit matches for all of the
generated copies.
Deleting a basic block removes all references from jump tables, which
is O(n). When freeing a MachineFunction, all basic blocks are deleted
before the jump tables, causing O(n^2) runtime. Fix this by deallocating
the jump table first.
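The shape of the fix, as a standalone sketch with made-up types (the actual change is in MachineFunction's teardown, not in code like this):
```cpp
#include <memory>
#include <vector>

struct Block {};
struct JumpTables {}; // deleting a Block normally erases its entries here

struct Func {
  std::vector<std::unique_ptr<Block>> Blocks;
  std::unique_ptr<JumpTables> JTI;

  ~Func() {
    // Free the jump tables first, so deleting each block no longer has to
    // walk them to remove references: O(n) total instead of O(n^2).
    JTI.reset();
    Blocks.clear();
  }
};
```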
Test case generator:
```python
import sys
n = int(sys.argv[1])
print("define void @f(i64 %c, ptr %p) {")
print(" switch i64 %c, label %d [")
for i in range(n):
    print(f" i64 {i}, label %h{i}")
print(f" ]")
for i in range(n):
    print(f'h{i}:')
    print(f' store i64 {i*i}, ptr %p')
    print(f' ret void')
print('d:')
print(' ret void')
print('}')
```
Improvement at 5000 entries:
```
Benchmark 1: ./llc.pre -filetype=obj -O0 <switch5k.bc
  Time (mean ± σ):     49.7 ms ±  1.0 ms
  Range (min … max):   48.0 ms … 52.1 ms    57 runs

Benchmark 2: ./llc.post -filetype=obj -O0 <switch5k.bc
  Time (mean ± σ):     39.4 ms ±  0.8 ms
  Range (min … max):   37.1 ms … 41.1 ms    72 runs

Summary
  ./llc.post -filetype=obj -O0 <switch5k.bc ran
    1.26 ± 0.04 times faster than ./llc.pre -filetype=obj -O0 <switch5k.bc
```
Improvement at 20000 entries:
```
Benchmark 1: ./llc.pre -filetype=obj -O0 <switch20k.bc
  Time (mean ± σ):    281.7 ms ±  1.0 ms
  Range (min … max):  280.2 ms … 283.0 ms    10 runs

Benchmark 2: ./llc.post -filetype=obj -O0 <switch20k.bc
  Time (mean ± σ):    123.9 ms ±  1.5 ms
  Range (min … max):  121.4 ms … 129.2 ms    23 runs

Summary
  ./llc.post -filetype=obj -O0 <switch20k.bc ran
    2.27 ± 0.03 times faster than ./llc.pre -filetype=obj -O0 <switch20k.bc
```
Pull Request: https://github.com/llvm/llvm-project/pull/144108
When dumping evaluate::Expr, show type names, which contain a lot of
useful information.
For example, show
```
expr <Fortran::evaluate::SomeType> {
expr <Fortran::evaluate::SomeKind<Fortran::common::TypeCategory::Integer>> {
expr <Fortran::evaluate::Type<Fortran::common::TypeCategory::Integer, 4>> {
...
```
instead of
```
expr T {
expr T {
expr T {
...
```
### Description
This patch improves the folding efficiency of `vector.insert` and
`vector.extract` operations by not returning early after successfully
converting dynamic indices to static indices.
This PR also renames the test pass `TestConstantFold` to
`TestSingleFold` and adds comprehensive documentation explaining the
single-pass folding behavior.
### Motivation
Since the `OpBuilder::createOrFold` function only calls `fold` **once**,
the current `fold` methods of `vector.insert` and `vector.extract` may
leave the op in a state that can be folded further. For example,
consider the following un-folded IR:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%c0 = arith.constant 0 : index
%e2 = vector.extract %v1[%c0] : f32 from vector<128xf32>
```
If we use `createOrFold` to create the `vector.extract` op, then the
result will be:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%e2 = vector.extract %v1[0] : f32 from vector<128xf32>
```
But this is not the optimal result. `createOrFold` should have returned
`%e1`.
The reason is that `fold` returns immediately after
`extractInsertFoldConstantOp` succeeds, causing the subsequent folding
logic to be skipped.
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
### Summary
This PR resolves https://github.com/llvm/llvm-project/issues/143864
Add the fold `(~a | x) & (a | y) -> (a & (x ^ y)) ^ y` to the
`foldMaskedMerge` function using SDPatternMatch.
After adding this pattern, `ninja check-llvm-codegen` shows that all
other cases remain unchanged, so I added a test case
(fold-masked-merge-demorgan.ll) for it.
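The identity itself can be checked exhaustively per bit; a quick standalone verification (not part of the patch):
```cpp
#include <cassert>
#include <cstdint>

// Verify (~a | x) & (a | y) == (a & (x ^ y)) ^ y over all 8-bit values;
// the identity is bitwise, so this covers every bit pattern.
int main() {
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned x = 0; x < 256; ++x)
      for (unsigned y = 0; y < 256; ++y) {
        uint8_t lhs = (~a | x) & (a | y);
        uint8_t rhs = (a & (x ^ y)) ^ y;
        assert(lhs == rhs);
        (void)lhs;
        (void)rhs;
      }
  return 0;
}
```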
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Previously, the segmented iterator optimization was limited to `std::{for_each, for_each_n}`. This patch
extends the optimization to `std::ranges::for_each` and `std::ranges::for_each_n`, ensuring consistent
optimizations across these algorithms. This patch first generalizes the `std` algorithms by introducing
a `Projection` parameter, which is set to `__identity` for the `std` algorithms. Then we let the `ranges`
algorithms directly call their `std` counterparts with a general `__proj` argument. Benchmarks
demonstrate performance improvements of up to 21.4x for ``std::deque::iterator`` and 22.3x for
``join_view`` of ``vector<vector<char>>``.
Addresses a subtask of #102817.
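A simplified sketch of the layering (names here are illustrative, not libc++'s internal spellings): the classic algorithm gains a projection parameter defaulted to identity, and the ranges overload forwards the user's projection to the same implementation.
```cpp
#include <functional>

// Shared implementation; in libc++ the segmented-iterator dispatch would
// live at this level, so both entry points benefit from it.
template <class It, class Sent, class Fn, class Proj>
void for_each_impl(It first, Sent last, Fn f, Proj proj) {
  for (; first != last; ++first)
    f(std::invoke(proj, *first));
}

// std-style entry point: projection fixed to identity.
template <class It, class Fn>
void classic_for_each(It first, It last, Fn f) {
  for_each_impl(first, last, f, std::identity{});
}

// ranges-style entry point: forwards the user's projection unchanged.
template <class It, class Sent, class Fn, class Proj = std::identity>
void ranges_for_each(It first, Sent last, Fn f, Proj proj = {}) {
  for_each_impl(first, last, f, proj);
}
```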
Change the test to check the exit status of the 'ls' command line
(instead of the error message), since the error message differs depending
on the host machine on which 'ls' is run.
Make as many of the reportXX functions that take pointers as possible
take const pointers. This avoids the need to use const_cast when calling
these functions on an already-const pointer.
Fix reportHeaderCorruption calls where an argument was passed for an
append call that didn't use it.
This simply copies the structure of the vector.reverse patterns from
just above, and reimplements them for the vp.reverse intrinsics when the
mask is all ones and the EVLs exactly match.
It's unfortunate that we have three different ways to represent a reverse
(shuffle, vector.reverse, and vp.reverse), but I don't see an obvious way
to remove any of them because the semantics are slightly different.
This significantly improves vectorization in TSVC_2's s112 and s1112
loops when using EVL tail folding.
The message "The atomic variable x should occur exactly once among the
arguments of the top-level [...] operator" was intended to convey that
(1) an atomic variable should be an argument, and (2) it should be
exactly one of the arguments. However, the wording turned out to sow
confusion instead.
Rework the corresponding check, and emit an individual error message for
each problematic situation:
- "atomic variable cannot be a proper subexpression of an argument",
- "atomic variable should appear as an argument",
- "atomic variable should be exactly one of the arguments".
Fixes https://github.com/llvm/llvm-project/issues/144599
On big-endian targets we need to use ld1/st1 for vector loads and stores so
that we get the elements in the correct order, but this prevents
postindex addressing from being used. Fix this by adding the appropriate
ISel patterns, plus the relevant changes in ISelLowering and
ISelDAGToDAG to cause postindex addressing to be used.
The compiler-rt/lib/fuzzer/tests build was failing on armv7 with
undefined references to unwinder symbols such as __aeabi_unwind_cpp_pr0.
This occurs because the test is built with `-nostdlib++` but `libunwind`
is not explicitly linked to the final test executable.
This patch resolves the issue by adding CMake logic to explicitly link
the required unwinder to the fuzzer tests, inspired by the same solution
used to fix Scudo build failures in https://reviews.llvm.org/D142888.
Following the addition of TensorLike and BufferLike type interfaces (see
00eaff3e9c), introduce minimal changes
required to bufferize a custom tensor operation into a custom buffer
operation.
To achieve this, new interface methods are added to the TensorLike type
interface that abstract away the differences between the existing (tensor
-> memref) and custom conversions.
The scope of the changes is intentionally limited (for example,
BufferizableOpInterface is untouched) in order to first understand the
basics and reach consensus design-wise.
---
Notable changes:
* mlir::bufferization::getBufferType() returns BufferLikeType (instead
of BaseMemRefType)
* ToTensorOp / ToBufferOp operate on TensorLikeType / BufferLikeType.
Operation argument "memref" renamed to "buffer"
* ToTensorOp's tensor type inferring builder is dropped (users now need
to provide the tensor type explicitly)
`__has_iterator_typedefs` is only used in the up-to-C++17 implementation
of `type_traits`. To make that clearer, the struct is moved into that
code block.