clang-p2996/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp at 41a94de75caacb979070ec7a010dfe3c4e9f116f

Fabian Ritter a4fd3dba6e [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332)

When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.
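
To illustrate the difference between the two loop shapes described above, here is a minimal sketch in plain C++, not the actual IR that LowerMemIntrinsics.cpp emits. The names Chunk16, copySinglePair and copyUnrolled are hypothetical, and the 16-byte access width is an assumption chosen only for illustration. Both helpers assume the length is a multiple of the per-iteration step; residual handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte unit standing in for one legal load/store access.
struct Chunk16 {
  std::uint8_t bytes[16];
};

// Baseline shape: a single load/store pair per loop iteration.
inline void copySinglePair(std::uint8_t *dst, const std::uint8_t *src,
                           std::size_t n) {
  assert(n % sizeof(Chunk16) == 0 && "sketch handles whole chunks only");
  for (std::size_t i = 0; i < n; i += sizeof(Chunk16)) {
    Chunk16 tmp;
    std::memcpy(&tmp, src + i, sizeof(tmp)); // load
    std::memcpy(dst + i, &tmp, sizeof(tmp)); // store
  }
}

// Wider shape: Unroll load/store pairs per loop iteration. All loads are
// issued before the stores, which is roughly what one too-wide access
// looks like after it has been split into several legal accesses.
template <std::size_t Unroll>
inline void copyUnrolled(std::uint8_t *dst, const std::uint8_t *src,
                         std::size_t n) {
  static_assert(Unroll > 0, "need at least one pair per iteration");
  constexpr std::size_t Step = Unroll * sizeof(Chunk16);
  assert(n % Step == 0 && "sketch handles whole iterations only");
  for (std::size_t i = 0; i < n; i += Step) {
    Chunk16 tmp[Unroll];
    for (std::size_t u = 0; u < Unroll; ++u) // Unroll loads
      std::memcpy(&tmp[u], src + i + u * sizeof(Chunk16), sizeof(Chunk16));
    for (std::size_t u = 0; u < Unroll; ++u) // Unroll stores
      std::memcpy(dst + i + u * sizeof(Chunk16), &tmp[u], sizeof(Chunk16));
  }
}
```

For a 1024-byte copy, copySinglePair runs 64 iterations while copyUnrolled<4> runs 16, so the loop overhead is paid a quarter as often in this sketch.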

This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths are left unchanged because they
would be slowed down if the dynamic length were smaller than, or only
slightly larger than, what an unrolled iteration copies.
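
The arithmetic behind the concern about short dynamic lengths can be sketched as follows. This is an illustration under stated assumptions, not code from the patch. The helper splitLength and the struct LengthSplit are hypothetical names, the 256 bytes per unrolled iteration used in the comments is an assumed figure, and the sketch assumes that bytes not covered by whole wide iterations are copied by a slower residual path.

```cpp
#include <cstddef>

// Hypothetical result type: how a copy length divides into whole wide
// iterations and leftover bytes.
struct LengthSplit {
  std::size_t iterations; // full unrolled iterations
  std::size_t residual;   // bytes left for the residual path
};

// Splits len by the number of bytes one unrolled iteration copies.
constexpr LengthSplit splitLength(std::size_t len, std::size_t bytesPerIter) {
  return {len / bytesPerIter, len % bytesPerIter};
}

// With an assumed 256 bytes per iteration:
//   len = 1024 gives 4 iterations and 0 residual bytes.
//   len = 200 gives 0 iterations and 200 residual bytes.
//   len = 300 gives 1 iteration and 44 residual bytes.
```

In the second and third cases most or all of the bytes fall outside the wide loop, which is the situation the commit message describes for dynamic lengths that are smaller or only slightly larger than one unrolled iteration.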

The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.

Part of SWDEV-455845.

2024-10-28 09:04:19 +01:00

52 KiB
