A proposed fix for the issue #95611, [OpenMP][SIMD] ordered has no effect in a loop SIMD region as of LLVM 18.1.0 Changes: - Implement new lowering behavior: Conservatively serialize "omp simd" loops that have `omp simd ordered` directive to prevent incorrect vectorization (which results in incorrect execution behavior of the miscompiled program). Implementation outline: - We start with the optimistic default initial value of `LoopStack.setParallel(/Enable=/true);` in `CodeGenFunction::EmitOMPSimdInit(const OMPLoopDirective &D)`. - We only disable the loop parallel memory access assumption with `if (HasOrderedDirective) LoopStack.setParallel(/Enable=/false);` using the `HasOrderedDirective` (which tests for the presence of an `OMPOrderedDirective`). - This results in no longer incorrectly vectorizing the loop when the `omp simd ordered` directive is present. Motivation: We'd like to prevent incorrect vectorization of the loops marked with the `#pragma omp ordered simd` directive which has previously resulted in miscompiled code. At the same time, we'd like the usage outside of the `#pragma omp ordered simd` context to remain unaffected: Note that in the test "clang/test/OpenMP/ordered_codegen.cpp" we only "lose" the `!llvm.access.group` metadata in `foo_simd` alone. This is conservative, in that it's possible some of the loops would be possible to vectorize, but we prefer to avoid miscompilation of the loops that are currently illegal to vectorize. A concrete example follows: ```cpp // "test.c" #include <float.h> #include <math.h> #include <omp.h> #include <stdio.h> #include <stdlib.h> #include <time.h> int compare_float(float x1, float x2, float scalar) { const float diff = fabsf(x1 - x2); x1 = fabsf(x1); x2 = fabsf(x2); const float l = (x2 > x1) ? x2 : x1; if (diff <= l * scalar * FLT_EPSILON) return 1; else return 0; } #define ARRAY_SIZE 256 __attribute__((noinline)) void initialization_loop( float X[ARRAY_SIZE][ARRAY_SIZE], float Y[ARRAY_SIZE][ARRAY_SIZE]) { const float max = 1000.0; srand(time(NULL)); for (int r = 0; r < ARRAY_SIZE; r++) { for (int c = 0; c < ARRAY_SIZE; c++) { X[r][c] = ((float)rand() / (float)(RAND_MAX)) * max; Y[r][c] = X[r][c]; } } } __attribute__((noinline)) void omp_simd_loop(float X[ARRAY_SIZE][ARRAY_SIZE]) { for (int r = 1; r < ARRAY_SIZE; ++r) { for (int c = 1; c < ARRAY_SIZE; ++c) { #pragma omp simd for (int k = 2; k < ARRAY_SIZE; ++k) { #pragma omp ordered simd X[r][k] = X[r][k - 2] + sinf((float)(r / c)); } } } } __attribute__((noinline)) int comparison_loop(float X[ARRAY_SIZE][ARRAY_SIZE], float Y[ARRAY_SIZE][ARRAY_SIZE]) { int totalErrors_simd = 0; const float scalar = 1.0; for (int r = 1; r < ARRAY_SIZE; ++r) { for (int c = 1; c < ARRAY_SIZE; ++c) { for (int k = 2; k < ARRAY_SIZE; ++k) { Y[r][k] = Y[r][k - 2] + sinf((float)(r / c)); } } // check row for simd update for (int k = 0; k < ARRAY_SIZE; ++k) { if (!compare_float(X[r][k], Y[r][k], scalar)) { ++totalErrors_simd; } } } return totalErrors_simd; } int main(void) { float X[ARRAY_SIZE][ARRAY_SIZE]; float Y[ARRAY_SIZE][ARRAY_SIZE]; initialization_loop(X, Y); omp_simd_loop(X); const int totalErrors_simd = comparison_loop(X, Y); if (totalErrors_simd) { fprintf(stdout, "totalErrors_simd: %d \n", totalErrors_simd); fprintf(stdout, "%s : %d - FAIL: error in ordered simd computation.\n", __FILE__, __LINE__); } else { fprintf(stdout, "Success!\n"); } return totalErrors_simd; } ``` Before: ``` $ clang -fopenmp-simd -O3 -ffast-math -lm test.c -o test && ./test totalErrors_simd: 15408 test.c : 76 - FAIL: error in ordered simd computation. ``` clang 19.1.0: https://godbolt.org/z/6EvhxqEhe After: ``` $ clang -fopenmp-simd -O3 -ffast-math test.c -o test && ./test Success! ``` Co-authored-by: Matt P. Dziubinski <matt-p.dziubinski@hpe.com>
README for the LLVM* OpenMP* Runtime Library
============================================
How to Build Documentation
==========================
The main documentation is in Doxygen* format, and this distribution
should come with pre-built PDF documentation in doc/Reference.pdf.
However, an HTML version can be built by executing:
% doxygen doc/doxygen/config
in the runtime directory.
That will produce HTML documentation in the doc/doxygen/generated
directory, which can be accessed by pointing a web browser at the
index.html file there.
If you don't have Doxygen installed, you can download it from
www.doxygen.org.
How to Build the LLVM* OpenMP* Runtime Library
==============================================
In-tree build:
$ cd where-you-want-to-live
Check out openmp into llvm/projects
$ cd where-you-want-to-build
$ mkdir build && cd build
$ cmake path/to/llvm -DCMAKE_C_COMPILER=<C compiler> -DCMAKE_CXX_COMPILER=<C++ compiler>
$ make omp
Out-of-tree build:
$ cd where-you-want-to-live
Check out openmp
$ cd where-you-want-to-live/openmp/runtime
$ mkdir build && cd build
$ cmake path/to/openmp -DCMAKE_C_COMPILER=<C compiler> -DCMAKE_CXX_COMPILER=<C++ compiler>
$ make
For details about building, please look at README.rst in the parent directory.
Architectures Supported
=======================
* IA-32 architecture
* Intel(R) 64 architecture
* Intel(R) Many Integrated Core Architecture
* ARM* architecture
* Aarch64 (64-bit ARM) architecture
* IBM(R) Power architecture (big endian)
* IBM(R) Power architecture (little endian)
* MIPS and MIPS64 architecture
* RISCV64 architecture
* LoongArch64 architecture
Supported RTL Build Configurations
==================================
Supported Architectures: IA-32 architecture, Intel(R) 64, and
Intel(R) Many Integrated Core Architecture
----------------------------------------------
| icc/icl | gcc | clang |
--------------|---------------|----------------------------|
| Linux* OS | Yes(1,5) | Yes(2,4) | Yes(4,6,7) |
| FreeBSD* | No | No | Yes(4,6,7,8) |
| OS X* | Yes(1,3,4) | No | Yes(4,6,7) |
| Windows* OS | Yes(1,4) | No | No |
------------------------------------------------------------
(1) On IA-32 architecture and Intel(R) 64, icc/icl versions 12.x are
supported (12.1 is recommended).
(2) GCC* version 4.7 is supported.
(3) For icc on OS X*, OS X* version 10.5.8 is supported.
(4) Intel(R) Many Integrated Core Architecture not supported.
(5) On Intel(R) Many Integrated Core Architecture, icc/icl versions 13.0
or later are required.
(6) Clang* version 3.3 is supported.
(7) Clang* currently does not offer a software-implemented 128 bit extended
precision type. Thus, all entry points reliant on this type are removed
from the library and cannot be called in the user program. The following
functions are not available:
__kmpc_atomic_cmplx16_*
__kmpc_atomic_float16_*
__kmpc_atomic_*_fp
(8) Community contribution provided AS IS, not tested by Intel.
Supported Architectures: IBM(R) Power 7 and Power 8
-----------------------------
| gcc | clang |
--------------|------------|--------------|
| Linux* OS | Yes(1,2) | Yes(3,4) |
-------------------------------------------
(1) On Power 7, gcc version 4.8.2 is supported.
(2) On Power 8, gcc version 4.8.2 is supported.
(3) On Power 7, clang version 3.7 is supported.
(4) On Power 8, clang version 3.7 is supported.
Front-end Compilers that work with this RTL
===========================================
The following compilers are known to do compatible code generation for
this RTL: clang (from the OpenMP development branch at
http://clang-omp.github.io/ ), Intel compilers, GCC. See the documentation
for more details.
-----------------------------------------------------------------------
Notices
=======
*Other names and brands may be claimed as the property of others.