dda73336ad22bd0b5ecda17040c50fb10fcbe5fb
169 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
935e699173 |
[libc++] Optimize ranges::minmax (#87335)
This allows Clang to vectorize the loop. ``` --------------------------------------------------------------------- Benchmark old new --------------------------------------------------------------------- BM_std_minmax<char>/1 0.659 ns 1.41 ns BM_std_minmax<char>/2 1.08 ns 2.16 ns BM_std_minmax<char>/3 2.16 ns 2.96 ns BM_std_minmax<char>/4 2.82 ns 3.81 ns BM_std_minmax<char>/5 3.43 ns 4.69 ns BM_std_minmax<char>/6 4.08 ns 5.63 ns BM_std_minmax<char>/7 4.75 ns 6.51 ns BM_std_minmax<char>/8 5.42 ns 7.41 ns BM_std_minmax<char>/9 6.05 ns 8.34 ns BM_std_minmax<char>/10 6.68 ns 9.29 ns BM_std_minmax<char>/11 7.47 ns 10.6 ns BM_std_minmax<char>/12 7.95 ns 11.4 ns BM_std_minmax<char>/13 8.64 ns 12.4 ns BM_std_minmax<char>/14 9.35 ns 13.4 ns BM_std_minmax<char>/15 10.1 ns 14.4 ns BM_std_minmax<char>/16 10.6 ns 2.25 ns BM_std_minmax<char>/17 11.3 ns 2.82 ns BM_std_minmax<char>/18 11.8 ns 3.71 ns BM_std_minmax<char>/19 12.6 ns 4.52 ns BM_std_minmax<char>/20 13.2 ns 5.47 ns BM_std_minmax<char>/21 14.1 ns 6.67 ns BM_std_minmax<char>/22 14.5 ns 7.78 ns BM_std_minmax<char>/23 15.1 ns 8.67 ns BM_std_minmax<char>/24 15.7 ns 9.68 ns BM_std_minmax<char>/25 16.4 ns 10.7 ns BM_std_minmax<char>/26 17.1 ns 11.7 ns BM_std_minmax<char>/27 17.8 ns 12.8 ns BM_std_minmax<char>/28 18.4 ns 14.1 ns BM_std_minmax<char>/29 19.0 ns 15.0 ns BM_std_minmax<char>/30 19.6 ns 16.0 ns BM_std_minmax<char>/31 20.2 ns 17.0 ns BM_std_minmax<char>/32 20.8 ns 2.46 ns BM_std_minmax<char>/64 41.5 ns 2.97 ns BM_std_minmax<char>/512 340 ns 6.05 ns BM_std_minmax<char>/1024 667 ns 8.83 ns BM_std_minmax<char>/4000 2571 ns 28.6 ns BM_std_minmax<char>/4096 2632 ns 25.8 ns BM_std_minmax<char>/5500 3554 ns 51.1 ns BM_std_minmax<char>/64000 41175 ns 480 ns BM_std_minmax<char>/65536 42039 ns 490 ns BM_std_minmax<char>/70000 44931 ns 528 ns BM_std_minmax<short>/1 0.708 ns 1.20 ns BM_std_minmax<short>/2 1.18 ns 1.78 ns BM_std_minmax<short>/3 1.98 ns 2.42 ns BM_std_minmax<short>/4 2.47 ns 3.05 ns BM_std_minmax<short>/5 3.09 ns 3.72 ns BM_std_minmax<short>/6 3.49 ns 4.37 ns BM_std_minmax<short>/7 4.24 ns 5.03 ns BM_std_minmax<short>/8 4.65 ns 2.12 ns BM_std_minmax<short>/9 5.34 ns 2.51 ns BM_std_minmax<short>/10 5.82 ns 3.18 ns BM_std_minmax<short>/11 6.36 ns 3.97 ns BM_std_minmax<short>/12 6.73 ns 4.68 ns BM_std_minmax<short>/13 7.59 ns 5.49 ns BM_std_minmax<short>/14 7.77 ns 6.45 ns BM_std_minmax<short>/15 8.54 ns 7.55 ns BM_std_minmax<short>/16 8.74 ns 2.38 ns BM_std_minmax<short>/17 9.59 ns 2.76 ns BM_std_minmax<short>/18 9.88 ns 3.37 ns BM_std_minmax<short>/19 10.7 ns 4.17 ns BM_std_minmax<short>/20 10.9 ns 4.88 ns BM_std_minmax<short>/21 12.1 ns 5.70 ns BM_std_minmax<short>/22 12.6 ns 6.64 ns BM_std_minmax<short>/23 13.5 ns 7.72 ns BM_std_minmax<short>/24 13.2 ns 2.87 ns BM_std_minmax<short>/25 14.2 ns 3.10 ns BM_std_minmax<short>/26 14.2 ns 3.59 ns BM_std_minmax<short>/27 15.4 ns 4.35 ns BM_std_minmax<short>/28 15.3 ns 5.10 ns BM_std_minmax<short>/29 16.2 ns 5.87 ns BM_std_minmax<short>/30 16.2 ns 6.88 ns BM_std_minmax<short>/31 17.0 ns 7.78 ns BM_std_minmax<short>/32 17.2 ns 3.45 ns BM_std_minmax<short>/64 34.1 ns 3.35 ns BM_std_minmax<short>/512 279 ns 8.37 ns BM_std_minmax<short>/1024 549 ns 14.2 ns BM_std_minmax<short>/4000 2111 ns 50.1 ns BM_std_minmax<short>/4096 2167 ns 47.9 ns BM_std_minmax<short>/5500 2895 ns 69.7 ns BM_std_minmax<short>/64000 33454 ns 953 ns BM_std_minmax<short>/65536 34474 ns 970 ns BM_std_minmax<short>/70000 36691 ns 1037 ns BM_std_minmax<int>/1 0.664 ns 1.17 ns BM_std_minmax<int>/2 1.11 ns 1.69 ns BM_std_minmax<int>/3 2.36 ns 2.29 ns BM_std_minmax<int>/4 2.53 ns 2.91 ns BM_std_minmax<int>/5 3.23 ns 3.56 ns BM_std_minmax<int>/6 3.56 ns 4.23 ns BM_std_minmax<int>/7 4.28 ns 4.91 ns BM_std_minmax<int>/8 4.60 ns 5.60 ns BM_std_minmax<int>/9 5.38 ns 6.31 ns BM_std_minmax<int>/10 5.69 ns 7.03 ns BM_std_minmax<int>/11 6.41 ns 7.70 ns BM_std_minmax<int>/12 6.73 ns 8.39 ns BM_std_minmax<int>/13 7.38 ns 9.07 ns BM_std_minmax<int>/14 7.74 ns 9.79 ns BM_std_minmax<int>/15 8.53 ns 10.5 ns BM_std_minmax<int>/16 8.79 ns 11.2 ns BM_std_minmax<int>/17 9.63 ns 12.0 ns BM_std_minmax<int>/18 9.84 ns 12.7 ns BM_std_minmax<int>/19 10.6 ns 13.5 ns BM_std_minmax<int>/20 11.0 ns 14.3 ns BM_std_minmax<int>/21 11.7 ns 15.0 ns BM_std_minmax<int>/22 12.0 ns 15.7 ns BM_std_minmax<int>/23 13.1 ns 16.5 ns BM_std_minmax<int>/24 13.0 ns 17.3 ns BM_std_minmax<int>/25 13.7 ns 17.9 ns BM_std_minmax<int>/26 14.0 ns 18.6 ns BM_std_minmax<int>/27 14.8 ns 19.4 ns BM_std_minmax<int>/28 15.1 ns 20.3 ns BM_std_minmax<int>/29 15.8 ns 20.9 ns BM_std_minmax<int>/30 16.1 ns 21.7 ns BM_std_minmax<int>/31 16.9 ns 22.5 ns BM_std_minmax<int>/32 17.2 ns 3.40 ns BM_std_minmax<int>/64 33.9 ns 4.04 ns BM_std_minmax<int>/512 275 ns 14.6 ns BM_std_minmax<int>/1024 541 ns 27.5 ns BM_std_minmax<int>/4000 2093 ns 96.3 ns BM_std_minmax<int>/4096 2146 ns 98.3 ns BM_std_minmax<int>/5500 2866 ns 157 ns BM_std_minmax<int>/64000 33619 ns 1954 ns BM_std_minmax<int>/65536 34252 ns 2009 ns BM_std_minmax<int>/70000 36618 ns 2125 ns BM_std_minmax<long long>/1 0.709 ns 1.19 ns BM_std_minmax<long long>/2 1.01 ns 1.65 ns BM_std_minmax<long long>/3 2.14 ns 2.21 ns BM_std_minmax<long long>/4 2.45 ns 2.83 ns BM_std_minmax<long long>/5 3.09 ns 3.47 ns BM_std_minmax<long long>/6 3.44 ns 4.11 ns BM_std_minmax<long long>/7 4.16 ns 4.79 ns BM_std_minmax<long long>/8 4.54 ns 5.47 ns BM_std_minmax<long long>/9 5.37 ns 6.20 ns BM_std_minmax<long long>/10 5.71 ns 6.93 ns BM_std_minmax<long long>/11 6.00 ns 7.60 ns BM_std_minmax<long long>/12 6.43 ns 8.27 ns BM_std_minmax<long long>/13 7.01 ns 8.94 ns BM_std_minmax<long long>/14 7.45 ns 9.65 ns BM_std_minmax<long long>/15 8.16 ns 10.4 ns BM_std_minmax<long long>/16 8.46 ns 5.22 ns BM_std_minmax<long long>/17 9.16 ns 5.22 ns BM_std_minmax<long long>/18 9.53 ns 5.52 ns BM_std_minmax<long long>/19 10.2 ns 6.02 ns BM_std_minmax<long long>/20 10.5 ns 6.89 ns BM_std_minmax<long long>/21 11.3 ns 7.83 ns BM_std_minmax<long long>/22 11.6 ns 8.59 ns BM_std_minmax<long long>/23 12.3 ns 9.91 ns BM_std_minmax<long long>/24 12.6 ns 10.1 ns BM_std_minmax<long long>/25 13.2 ns 12.0 ns BM_std_minmax<long long>/26 13.6 ns 13.5 ns BM_std_minmax<long long>/27 14.2 ns 14.8 ns BM_std_minmax<long long>/28 14.7 ns 15.9 ns BM_std_minmax<long long>/29 15.3 ns 16.6 ns BM_std_minmax<long long>/30 15.8 ns 17.3 ns BM_std_minmax<long long>/31 16.3 ns 18.2 ns BM_std_minmax<long long>/32 16.7 ns 7.18 ns BM_std_minmax<long long>/64 33.1 ns 11.5 ns BM_std_minmax<long long>/512 268 ns 71.0 ns BM_std_minmax<long long>/1024 532 ns 138 ns BM_std_minmax<long long>/4000 2056 ns 533 ns BM_std_minmax<long long>/4096 2112 ns 539 ns BM_std_minmax<long long>/5500 2823 ns 749 ns BM_std_minmax<long long>/64000 32956 ns 8590 ns BM_std_minmax<long long>/65536 33795 ns 8791 ns BM_std_minmax<long long>/70000 36084 ns 9442 ns BM_std_minmax<unsigned char>/1 0.714 ns 1.41 ns BM_std_minmax<unsigned char>/2 0.955 ns 1.96 ns BM_std_minmax<unsigned char>/3 1.90 ns 2.63 ns BM_std_minmax<unsigned char>/4 2.40 ns 3.34 ns BM_std_minmax<unsigned char>/5 2.87 ns 4.10 ns BM_std_minmax<unsigned char>/6 3.47 ns 4.88 ns BM_std_minmax<unsigned char>/7 4.04 ns 5.66 ns BM_std_minmax<unsigned char>/8 4.65 ns 6.45 ns BM_std_minmax<unsigned char>/9 5.18 ns 7.24 ns BM_std_minmax<unsigned char>/10 5.80 ns 8.05 ns BM_std_minmax<unsigned char>/11 6.24 ns 8.86 ns BM_std_minmax<unsigned char>/12 6.78 ns 9.70 ns BM_std_minmax<unsigned char>/13 7.30 ns 10.6 ns BM_std_minmax<unsigned char>/14 7.86 ns 11.4 ns BM_std_minmax<unsigned char>/15 8.46 ns 12.3 ns BM_std_minmax<unsigned char>/16 9.00 ns 2.12 ns BM_std_minmax<unsigned char>/17 9.58 ns 2.83 ns BM_std_minmax<unsigned char>/18 10.1 ns 3.37 ns BM_std_minmax<unsigned char>/19 10.7 ns 4.11 ns BM_std_minmax<unsigned char>/20 11.2 ns 4.85 ns BM_std_minmax<unsigned char>/21 11.9 ns 5.69 ns BM_std_minmax<unsigned char>/22 12.3 ns 6.77 ns BM_std_minmax<unsigned char>/23 13.1 ns 7.56 ns BM_std_minmax<unsigned char>/24 13.5 ns 8.40 ns BM_std_minmax<unsigned char>/25 14.2 ns 9.30 ns BM_std_minmax<unsigned char>/26 14.4 ns 10.1 ns BM_std_minmax<unsigned char>/27 15.0 ns 11.1 ns BM_std_minmax<unsigned char>/28 15.3 ns 11.9 ns BM_std_minmax<unsigned char>/29 16.2 ns 12.9 ns BM_std_minmax<unsigned char>/30 16.5 ns 13.9 ns BM_std_minmax<unsigned char>/31 17.2 ns 14.8 ns BM_std_minmax<unsigned char>/32 17.6 ns 2.36 ns BM_std_minmax<unsigned char>/64 35.6 ns 3.21 ns BM_std_minmax<unsigned char>/512 288 ns 6.00 ns BM_std_minmax<unsigned char>/1024 573 ns 8.80 ns BM_std_minmax<unsigned char>/4000 2222 ns 28.6 ns BM_std_minmax<unsigned char>/4096 2265 ns 25.9 ns BM_std_minmax<unsigned char>/5500 3047 ns 48.8 ns BM_std_minmax<unsigned char>/64000 35059 ns 480 ns BM_std_minmax<unsigned char>/65536 35941 ns 491 ns BM_std_minmax<unsigned char>/70000 38922 ns 525 ns BM_std_minmax<unsigned short>/1 0.711 ns 1.18 ns BM_std_minmax<unsigned short>/2 0.957 ns 1.65 ns BM_std_minmax<unsigned short>/3 2.13 ns 2.21 ns BM_std_minmax<unsigned short>/4 2.14 ns 2.78 ns BM_std_minmax<unsigned short>/5 3.06 ns 3.29 ns BM_std_minmax<unsigned short>/6 2.89 ns 3.87 ns BM_std_minmax<unsigned short>/7 3.80 ns 4.55 ns BM_std_minmax<unsigned short>/8 3.68 ns 2.02 ns BM_std_minmax<unsigned short>/9 4.53 ns 2.40 ns BM_std_minmax<unsigned short>/10 4.60 ns 2.94 ns BM_std_minmax<unsigned short>/11 5.67 ns 3.67 ns BM_std_minmax<unsigned short>/12 5.39 ns 4.22 ns BM_std_minmax<unsigned short>/13 6.58 ns 4.78 ns BM_std_minmax<unsigned short>/14 6.33 ns 5.54 ns BM_std_minmax<unsigned short>/15 7.34 ns 6.30 ns BM_std_minmax<unsigned short>/16 7.17 ns 2.25 ns BM_std_minmax<unsigned short>/17 8.19 ns 2.61 ns BM_std_minmax<unsigned short>/18 8.02 ns 3.19 ns BM_std_minmax<unsigned short>/19 9.03 ns 3.72 ns BM_std_minmax<unsigned short>/20 8.89 ns 4.36 ns BM_std_minmax<unsigned short>/21 9.77 ns 5.10 ns BM_std_minmax<unsigned short>/22 9.70 ns 5.55 ns BM_std_minmax<unsigned short>/23 10.8 ns 6.29 ns BM_std_minmax<unsigned short>/24 10.6 ns 2.41 ns BM_std_minmax<unsigned short>/25 11.6 ns 2.75 ns BM_std_minmax<unsigned short>/26 11.4 ns 3.26 ns BM_std_minmax<unsigned short>/27 12.4 ns 3.86 ns BM_std_minmax<unsigned short>/28 12.3 ns 4.45 ns BM_std_minmax<unsigned short>/29 13.2 ns 5.07 ns BM_std_minmax<unsigned short>/30 13.1 ns 5.77 ns BM_std_minmax<unsigned short>/31 13.9 ns 6.65 ns BM_std_minmax<unsigned short>/32 13.9 ns 2.72 ns BM_std_minmax<unsigned short>/64 27.8 ns 3.25 ns BM_std_minmax<unsigned short>/512 220 ns 8.30 ns BM_std_minmax<unsigned short>/1024 435 ns 14.1 ns BM_std_minmax<unsigned short>/4000 1703 ns 49.8 ns BM_std_minmax<unsigned short>/4096 1746 ns 47.9 ns BM_std_minmax<unsigned short>/5500 2350 ns 69.9 ns BM_std_minmax<unsigned short>/64000 27388 ns 953 ns BM_std_minmax<unsigned short>/65536 28040 ns 975 ns BM_std_minmax<unsigned short>/70000 29967 ns 1040 ns BM_std_minmax<unsigned int>/1 0.712 ns 1.18 ns BM_std_minmax<unsigned int>/2 0.965 ns 1.65 ns BM_std_minmax<unsigned int>/3 2.13 ns 2.14 ns BM_std_minmax<unsigned int>/4 2.09 ns 2.64 ns BM_std_minmax<unsigned int>/5 3.02 ns 3.21 ns BM_std_minmax<unsigned int>/6 2.94 ns 3.81 ns BM_std_minmax<unsigned int>/7 3.91 ns 4.38 ns BM_std_minmax<unsigned int>/8 3.75 ns 4.93 ns BM_std_minmax<unsigned int>/9 4.71 ns 5.60 ns BM_std_minmax<unsigned int>/10 4.59 ns 6.26 ns BM_std_minmax<unsigned int>/11 5.57 ns 6.80 ns BM_std_minmax<unsigned int>/12 5.43 ns 7.47 ns BM_std_minmax<unsigned int>/13 6.45 ns 8.10 ns BM_std_minmax<unsigned int>/14 6.32 ns 8.69 ns BM_std_minmax<unsigned int>/15 7.29 ns 9.37 ns BM_std_minmax<unsigned int>/16 7.12 ns 9.99 ns BM_std_minmax<unsigned int>/17 8.24 ns 10.6 ns BM_std_minmax<unsigned int>/18 8.00 ns 11.2 ns BM_std_minmax<unsigned int>/19 8.94 ns 12.0 ns BM_std_minmax<unsigned int>/20 8.91 ns 12.6 ns BM_std_minmax<unsigned int>/21 9.73 ns 17.2 ns BM_std_minmax<unsigned int>/22 9.75 ns 13.8 ns BM_std_minmax<unsigned int>/23 10.6 ns 14.5 ns BM_std_minmax<unsigned int>/24 10.6 ns 15.1 ns BM_std_minmax<unsigned int>/25 11.5 ns 15.7 ns BM_std_minmax<unsigned int>/26 11.4 ns 16.3 ns BM_std_minmax<unsigned int>/27 12.3 ns 17.0 ns BM_std_minmax<unsigned int>/28 12.3 ns 17.6 ns BM_std_minmax<unsigned int>/29 13.2 ns 18.3 ns BM_std_minmax<unsigned int>/30 13.2 ns 19.0 ns BM_std_minmax<unsigned int>/31 14.0 ns 19.6 ns BM_std_minmax<unsigned int>/32 14.0 ns 3.39 ns BM_std_minmax<unsigned int>/64 27.6 ns 4.05 ns BM_std_minmax<unsigned int>/512 221 ns 14.2 ns BM_std_minmax<unsigned int>/1024 439 ns 25.5 ns BM_std_minmax<unsigned int>/4000 1720 ns 96.3 ns BM_std_minmax<unsigned int>/4096 1762 ns 97.8 ns BM_std_minmax<unsigned int>/5500 2364 ns 146 ns BM_std_minmax<unsigned int>/64000 27874 ns 1905 ns BM_std_minmax<unsigned int>/65536 28012 ns 1961 ns BM_std_minmax<unsigned int>/70000 29899 ns 2087 ns BM_std_minmax<unsigned long long>/1 0.707 ns 1.18 ns BM_std_minmax<unsigned long long>/2 0.909 ns 1.65 ns BM_std_minmax<unsigned long long>/3 1.65 ns 2.70 ns BM_std_minmax<unsigned long long>/4 1.93 ns 2.69 ns BM_std_minmax<unsigned long long>/5 2.45 ns 3.34 ns BM_std_minmax<unsigned long long>/6 2.78 ns 3.81 ns BM_std_minmax<unsigned long long>/7 3.28 ns 4.43 ns BM_std_minmax<unsigned long long>/8 3.70 ns 4.92 ns BM_std_minmax<unsigned long long>/9 4.12 ns 5.64 ns BM_std_minmax<unsigned long long>/10 4.44 ns 6.15 ns BM_std_minmax<unsigned long long>/11 4.91 ns 6.81 ns BM_std_minmax<unsigned long long>/12 5.31 ns 7.41 ns BM_std_minmax<unsigned long long>/13 5.72 ns 7.96 ns BM_std_minmax<unsigned long long>/14 6.05 ns 8.66 ns BM_std_minmax<unsigned long long>/15 6.55 ns 9.37 ns BM_std_minmax<unsigned long long>/16 6.89 ns 7.98 ns BM_std_minmax<unsigned long long>/17 7.34 ns 8.13 ns BM_std_minmax<unsigned long long>/18 7.73 ns 8.42 ns BM_std_minmax<unsigned long long>/19 8.26 ns 8.63 ns BM_std_minmax<unsigned long long>/20 8.54 ns 8.96 ns BM_std_minmax<unsigned long long>/21 9.14 ns 9.37 ns BM_std_minmax<unsigned long long>/22 9.39 ns 9.67 ns BM_std_minmax<unsigned long long>/23 10.1 ns 10.1 ns BM_std_minmax<unsigned long long>/24 10.4 ns 10.6 ns BM_std_minmax<unsigned long long>/25 11.0 ns 11.3 ns BM_std_minmax<unsigned long long>/26 11.3 ns 12.1 ns BM_std_minmax<unsigned long long>/27 11.8 ns 14.2 ns BM_std_minmax<unsigned long long>/28 12.1 ns 15.8 ns BM_std_minmax<unsigned long long>/29 12.6 ns 17.4 ns BM_std_minmax<unsigned long long>/30 13.1 ns 18.1 ns BM_std_minmax<unsigned long long>/31 13.4 ns 18.8 ns BM_std_minmax<unsigned long long>/32 13.8 ns 10.4 ns BM_std_minmax<unsigned long long>/64 27.3 ns 15.5 ns BM_std_minmax<unsigned long long>/512 222 ns 80.6 ns BM_std_minmax<unsigned long long>/1024 443 ns 156 ns BM_std_minmax<unsigned long long>/4000 1731 ns 591 ns BM_std_minmax<unsigned long long>/4096 1752 ns 609 ns BM_std_minmax<unsigned long long>/5500 2340 ns 819 ns BM_std_minmax<unsigned long long>/64000 27166 ns 9652 ns BM_std_minmax<unsigned long long>/65536 27869 ns 9876 ns BM_std_minmax<unsigned long long>/70000 29920 ns 10680 ns ``` |
||
|
|
985c1a44f8 |
[libc++] Optimize the two range overload of mismatch (#86853)
``` ----------------------------------------------------------------------------- Benchmark old new ----------------------------------------------------------------------------- bm_mismatch_two_range_overload<char>/1 0.941 ns 1.88 ns bm_mismatch_two_range_overload<char>/2 1.43 ns 2.15 ns bm_mismatch_two_range_overload<char>/3 1.95 ns 2.55 ns bm_mismatch_two_range_overload<char>/4 2.58 ns 2.90 ns bm_mismatch_two_range_overload<char>/5 3.75 ns 3.31 ns bm_mismatch_two_range_overload<char>/6 5.00 ns 3.83 ns bm_mismatch_two_range_overload<char>/7 5.59 ns 4.35 ns bm_mismatch_two_range_overload<char>/8 6.37 ns 4.84 ns bm_mismatch_two_range_overload<char>/16 11.8 ns 6.72 ns bm_mismatch_two_range_overload<char>/64 45.5 ns 2.59 ns bm_mismatch_two_range_overload<char>/512 366 ns 12.6 ns bm_mismatch_two_range_overload<char>/4096 2890 ns 91.6 ns bm_mismatch_two_range_overload<char>/32768 23038 ns 758 ns bm_mismatch_two_range_overload<char>/262144 142813 ns 6573 ns bm_mismatch_two_range_overload<char>/1048576 366679 ns 26710 ns bm_mismatch_two_range_overload<short>/1 0.934 ns 1.88 ns bm_mismatch_two_range_overload<short>/2 1.30 ns 2.58 ns bm_mismatch_two_range_overload<short>/3 1.76 ns 3.28 ns bm_mismatch_two_range_overload<short>/4 2.24 ns 3.98 ns bm_mismatch_two_range_overload<short>/5 2.80 ns 4.92 ns bm_mismatch_two_range_overload<short>/6 3.58 ns 6.01 ns bm_mismatch_two_range_overload<short>/7 4.29 ns 7.03 ns bm_mismatch_two_range_overload<short>/8 4.67 ns 7.39 ns bm_mismatch_two_range_overload<short>/16 9.86 ns 13.1 ns bm_mismatch_two_range_overload<short>/64 38.9 ns 4.55 ns bm_mismatch_two_range_overload<short>/512 348 ns 27.7 ns bm_mismatch_two_range_overload<short>/4096 2881 ns 225 ns bm_mismatch_two_range_overload<short>/32768 23111 ns 1715 ns bm_mismatch_two_range_overload<short>/262144 184846 ns 14416 ns bm_mismatch_two_range_overload<short>/1048576 742885 ns 57264 ns bm_mismatch_two_range_overload<int>/1 0.838 ns 1.19 ns bm_mismatch_two_range_overload<int>/2 1.19 ns 1.65 ns bm_mismatch_two_range_overload<int>/3 1.83 ns 2.06 ns bm_mismatch_two_range_overload<int>/4 2.38 ns 2.42 ns bm_mismatch_two_range_overload<int>/5 3.60 ns 2.47 ns bm_mismatch_two_range_overload<int>/6 3.68 ns 3.05 ns bm_mismatch_two_range_overload<int>/7 4.32 ns 3.36 ns bm_mismatch_two_range_overload<int>/8 5.18 ns 3.58 ns bm_mismatch_two_range_overload<int>/16 10.6 ns 2.84 ns bm_mismatch_two_range_overload<int>/64 39.0 ns 7.78 ns bm_mismatch_two_range_overload<int>/512 247 ns 53.9 ns bm_mismatch_two_range_overload<int>/4096 1927 ns 429 ns bm_mismatch_two_range_overload<int>/32768 15569 ns 3393 ns bm_mismatch_two_range_overload<int>/262144 125413 ns 28504 ns bm_mismatch_two_range_overload<int>/1048576 504549 ns 112729 ns ``` |
||
|
|
beaff78528 |
[libc++] Optimize the std::mismatch tail (#83440)
This adds vectorization to the last 0-3 vectors and, if the range is large enough, the remaining elements that don't fill a vector completely. ``` ----------------------------------------------------------------------- Benchmark old full vectors partial vector ----------------------------------------------------------------------- bm_mismatch<char>/1 1.40 ns 1.62 ns 2.09 ns bm_mismatch<char>/2 1.88 ns 2.10 ns 2.33 ns bm_mismatch<char>/3 2.67 ns 2.56 ns 2.72 ns bm_mismatch<char>/4 3.01 ns 3.20 ns 3.70 ns bm_mismatch<char>/5 3.51 ns 3.73 ns 3.64 ns bm_mismatch<char>/6 4.71 ns 4.85 ns 4.37 ns bm_mismatch<char>/7 5.12 ns 5.33 ns 4.37 ns bm_mismatch<char>/8 5.79 ns 6.02 ns 4.75 ns bm_mismatch<char>/15 9.20 ns 10.5 ns 7.23 ns bm_mismatch<char>/16 10.2 ns 10.1 ns 7.46 ns bm_mismatch<char>/17 10.2 ns 10.8 ns 7.57 ns bm_mismatch<char>/31 17.6 ns 17.1 ns 10.8 ns bm_mismatch<char>/32 17.4 ns 1.64 ns 1.64 ns bm_mismatch<char>/33 23.3 ns 2.10 ns 2.33 ns bm_mismatch<char>/63 31.8 ns 16.9 ns 2.33 ns bm_mismatch<char>/64 32.6 ns 2.10 ns 2.10 ns bm_mismatch<char>/65 33.6 ns 2.57 ns 2.80 ns bm_mismatch<char>/127 67.3 ns 18.1 ns 3.27 ns bm_mismatch<char>/128 2.17 ns 2.14 ns 2.57 ns bm_mismatch<char>/129 2.36 ns 2.80 ns 3.27 ns bm_mismatch<char>/255 67.5 ns 19.6 ns 4.68 ns bm_mismatch<char>/256 3.76 ns 3.71 ns 3.97 ns bm_mismatch<char>/257 3.77 ns 4.04 ns 4.43 ns bm_mismatch<char>/511 70.8 ns 22.1 ns 7.47 ns bm_mismatch<char>/512 7.27 ns 7.30 ns 6.95 ns bm_mismatch<char>/513 7.11 ns 7.05 ns 6.96 ns bm_mismatch<char>/1023 75.9 ns 27.4 ns 13.3 ns bm_mismatch<char>/1024 13.9 ns 13.8 ns 12.4 ns bm_mismatch<char>/1025 13.6 ns 13.6 ns 12.8 ns bm_mismatch<char>/2047 87.3 ns 37.5 ns 25.4 ns bm_mismatch<char>/2048 26.8 ns 27.4 ns 24.0 ns bm_mismatch<char>/2049 26.7 ns 27.3 ns 25.5 ns bm_mismatch<char>/4095 112 ns 64.7 ns 48.7 ns bm_mismatch<char>/4096 53.0 ns 54.2 ns 46.8 ns bm_mismatch<char>/4097 52.7 ns 54.2 ns 48.4 ns bm_mismatch<char>/8191 160 ns 118 ns 98.4 ns bm_mismatch<char>/8192 107 ns 108 ns 96.0 ns bm_mismatch<char>/8193 106 ns 108 ns 97.2 ns bm_mismatch<char>/16383 283 ns 234 ns 215 ns bm_mismatch<char>/16384 227 ns 223 ns 217 ns bm_mismatch<char>/16385 221 ns 221 ns 215 ns bm_mismatch<char>/32767 547 ns 499 ns 488 ns bm_mismatch<char>/32768 495 ns 492 ns 492 ns bm_mismatch<char>/32769 491 ns 489 ns 488 ns bm_mismatch<char>/65535 1028 ns 979 ns 971 ns bm_mismatch<char>/65536 976 ns 970 ns 974 ns bm_mismatch<char>/65537 970 ns 965 ns 971 ns bm_mismatch<char>/131071 2031 ns 1948 ns 2005 ns bm_mismatch<char>/131072 1973 ns 1955 ns 1974 ns bm_mismatch<char>/131073 1989 ns 1932 ns 2001 ns bm_mismatch<char>/262143 4469 ns 4244 ns 4223 ns bm_mismatch<char>/262144 4443 ns 4183 ns 4243 ns bm_mismatch<char>/262145 4400 ns 4232 ns 4246 ns bm_mismatch<char>/524287 10169 ns 9733 ns 9592 ns bm_mismatch<char>/524288 10154 ns 9664 ns 9843 ns bm_mismatch<char>/524289 10113 ns 9641 ns 10003 ns bm_mismatch<short>/1 1.86 ns 2.53 ns 2.32 ns bm_mismatch<short>/2 2.57 ns 2.77 ns 2.55 ns bm_mismatch<short>/3 3.26 ns 3.00 ns 2.79 ns bm_mismatch<short>/4 3.95 ns 3.39 ns 3.15 ns bm_mismatch<short>/5 4.83 ns 3.97 ns 3.72 ns bm_mismatch<short>/6 5.43 ns 4.34 ns 4.03 ns bm_mismatch<short>/7 6.11 ns 4.73 ns 4.44 ns bm_mismatch<short>/8 6.84 ns 5.02 ns 4.79 ns bm_mismatch<short>/15 11.5 ns 7.12 ns 6.50 ns bm_mismatch<short>/16 13.9 ns 1.87 ns 2.11 ns bm_mismatch<short>/17 14.0 ns 3.00 ns 2.47 ns bm_mismatch<short>/31 23.1 ns 7.87 ns 2.47 ns bm_mismatch<short>/32 23.8 ns 2.57 ns 2.81 ns bm_mismatch<short>/33 24.5 ns 3.70 ns 2.94 ns bm_mismatch<short>/63 44.8 ns 9.37 ns 3.46 ns bm_mismatch<short>/64 2.32 ns 2.57 ns 2.64 ns bm_mismatch<short>/65 2.52 ns 3.02 ns 3.51 ns bm_mismatch<short>/127 45.6 ns 9.97 ns 5.18 ns bm_mismatch<short>/128 3.85 ns 3.93 ns 3.94 ns bm_mismatch<short>/129 3.82 ns 4.20 ns 4.70 ns bm_mismatch<short>/255 50.4 ns 12.6 ns 8.07 ns bm_mismatch<short>/256 7.23 ns 6.91 ns 6.98 ns bm_mismatch<short>/257 7.24 ns 7.19 ns 7.55 ns bm_mismatch<short>/511 52.3 ns 17.8 ns 14.0 ns bm_mismatch<short>/512 13.6 ns 13.7 ns 13.6 ns bm_mismatch<short>/513 13.9 ns 13.8 ns 18.5 ns bm_mismatch<short>/1023 60.9 ns 30.9 ns 26.3 ns bm_mismatch<short>/1024 26.7 ns 27.7 ns 25.7 ns bm_mismatch<short>/1025 27.7 ns 27.6 ns 25.3 ns bm_mismatch<short>/2047 88.4 ns 58.0 ns 51.6 ns bm_mismatch<short>/2048 52.8 ns 55.3 ns 50.6 ns bm_mismatch<short>/2049 55.2 ns 54.8 ns 48.7 ns bm_mismatch<short>/4095 153 ns 113 ns 102 ns bm_mismatch<short>/4096 105 ns 110 ns 101 ns bm_mismatch<short>/4097 110 ns 110 ns 99.1 ns bm_mismatch<short>/8191 277 ns 219 ns 206 ns bm_mismatch<short>/8192 226 ns 214 ns 250 ns bm_mismatch<short>/8193 226 ns 207 ns 208 ns bm_mismatch<short>/16383 519 ns 492 ns 488 ns bm_mismatch<short>/16384 494 ns 492 ns 492 ns bm_mismatch<short>/16385 492 ns 488 ns 489 ns bm_mismatch<short>/32767 1007 ns 968 ns 964 ns bm_mismatch<short>/32768 977 ns 972 ns 970 ns bm_mismatch<short>/32769 972 ns 962 ns 967 ns bm_mismatch<short>/65535 1978 ns 1918 ns 1956 ns bm_mismatch<short>/65536 1940 ns 1927 ns 1970 ns bm_mismatch<short>/65537 1937 ns 1922 ns 1959 ns bm_mismatch<short>/131071 4524 ns 4193 ns 4304 ns bm_mismatch<short>/131072 4445 ns 4196 ns 4306 ns bm_mismatch<short>/131073 4452 ns 4278 ns 4311 ns bm_mismatch<short>/262143 9801 ns 10188 ns 9634 ns bm_mismatch<short>/262144 9738 ns 10151 ns 9651 ns bm_mismatch<short>/262145 9716 ns 10171 ns 9715 ns bm_mismatch<short>/524287 19944 ns 20718 ns 20044 ns bm_mismatch<short>/524288 21139 ns 20647 ns 20008 ns bm_mismatch<short>/524289 21162 ns 19512 ns 20068 ns bm_mismatch<int>/1 1.40 ns 1.84 ns 1.87 ns bm_mismatch<int>/2 1.87 ns 2.08 ns 2.09 ns bm_mismatch<int>/3 2.36 ns 2.31 ns 2.87 ns bm_mismatch<int>/4 3.06 ns 2.72 ns 2.95 ns bm_mismatch<int>/5 3.66 ns 3.37 ns 3.42 ns bm_mismatch<int>/6 4.55 ns 3.65 ns 3.73 ns bm_mismatch<int>/7 5.03 ns 3.93 ns 3.94 ns bm_mismatch<int>/8 5.67 ns 1.86 ns 1.87 ns bm_mismatch<int>/15 9.89 ns 4.41 ns 2.34 ns bm_mismatch<int>/16 10.1 ns 2.33 ns 2.34 ns bm_mismatch<int>/17 10.2 ns 3.34 ns 2.86 ns bm_mismatch<int>/31 17.2 ns 5.54 ns 3.28 ns bm_mismatch<int>/32 2.16 ns 2.15 ns 2.58 ns bm_mismatch<int>/33 2.36 ns 3.01 ns 3.28 ns bm_mismatch<int>/63 17.7 ns 6.50 ns 4.93 ns bm_mismatch<int>/64 3.81 ns 3.58 ns 3.90 ns bm_mismatch<int>/65 3.74 ns 4.36 ns 4.45 ns bm_mismatch<int>/127 19.5 ns 9.56 ns 7.74 ns bm_mismatch<int>/128 7.30 ns 6.41 ns 6.85 ns bm_mismatch<int>/129 7.09 ns 7.04 ns 7.06 ns bm_mismatch<int>/255 24.7 ns 14.8 ns 13.3 ns bm_mismatch<int>/256 14.0 ns 12.1 ns 12.3 ns bm_mismatch<int>/257 13.8 ns 12.7 ns 12.8 ns bm_mismatch<int>/511 34.3 ns 26.3 ns 24.8 ns bm_mismatch<int>/512 27.6 ns 23.6 ns 23.9 ns bm_mismatch<int>/513 27.3 ns 24.4 ns 25.1 ns bm_mismatch<int>/1023 62.5 ns 50.9 ns 48.3 ns bm_mismatch<int>/1024 54.4 ns 46.1 ns 46.6 ns bm_mismatch<int>/1025 54.2 ns 48.4 ns 47.5 ns bm_mismatch<int>/2047 116 ns 97.8 ns 94.1 ns bm_mismatch<int>/2048 108 ns 92.6 ns 92.4 ns bm_mismatch<int>/2049 108 ns 104 ns 94.0 ns bm_mismatch<int>/4095 233 ns 222 ns 205 ns bm_mismatch<int>/4096 226 ns 223 ns 225 ns bm_mismatch<int>/4097 221 ns 219 ns 210 ns bm_mismatch<int>/8191 499 ns 485 ns 488 ns bm_mismatch<int>/8192 496 ns 490 ns 495 ns bm_mismatch<int>/8193 491 ns 485 ns 488 ns bm_mismatch<int>/16383 982 ns 962 ns 964 ns bm_mismatch<int>/16384 974 ns 971 ns 971 ns bm_mismatch<int>/16385 971 ns 961 ns 968 ns bm_mismatch<int>/32767 2003 ns 1959 ns 1920 ns bm_mismatch<int>/32768 1996 ns 1947 ns 1928 ns bm_mismatch<int>/32769 1990 ns 1945 ns 1926 ns bm_mismatch<int>/65535 4434 ns 4275 ns 4312 ns bm_mismatch<int>/65536 4437 ns 4267 ns 4321 ns bm_mismatch<int>/65537 4442 ns 4261 ns 4321 ns bm_mismatch<int>/131071 9673 ns 9648 ns 9465 ns bm_mismatch<int>/131072 9667 ns 9671 ns 9465 ns bm_mismatch<int>/131073 9661 ns 9653 ns 9464 ns bm_mismatch<int>/262143 20595 ns 19605 ns 19064 ns bm_mismatch<int>/262144 19894 ns 19572 ns 19009 ns bm_mismatch<int>/262145 19851 ns 19656 ns 18999 ns bm_mismatch<int>/524287 39556 ns 39364 ns 38131 ns bm_mismatch<int>/524288 39678 ns 39573 ns 38183 ns bm_mismatch<int>/524289 40168 ns 39301 ns 38121 ns ``` |
||
|
|
b68e2eba0b |
[libc++] Vectorize mismatch (#73255)
``` --------------------------------------------------- Benchmark old new --------------------------------------------------- bm_mismatch<char>/1 0.835 ns 2.37 ns bm_mismatch<char>/2 1.44 ns 2.60 ns bm_mismatch<char>/3 2.06 ns 2.83 ns bm_mismatch<char>/4 2.60 ns 3.29 ns bm_mismatch<char>/5 3.15 ns 3.77 ns bm_mismatch<char>/6 3.82 ns 4.17 ns bm_mismatch<char>/7 4.29 ns 4.52 ns bm_mismatch<char>/8 4.78 ns 4.86 ns bm_mismatch<char>/16 9.06 ns 7.54 ns bm_mismatch<char>/64 31.7 ns 19.1 ns bm_mismatch<char>/512 249 ns 8.16 ns bm_mismatch<char>/4096 1956 ns 44.2 ns bm_mismatch<char>/32768 15498 ns 501 ns bm_mismatch<char>/262144 123965 ns 4479 ns bm_mismatch<char>/1048576 495668 ns 21306 ns bm_mismatch<short>/1 0.710 ns 2.12 ns bm_mismatch<short>/2 1.03 ns 2.66 ns bm_mismatch<short>/3 1.29 ns 3.56 ns bm_mismatch<short>/4 1.68 ns 4.29 ns bm_mismatch<short>/5 1.96 ns 5.18 ns bm_mismatch<short>/6 2.59 ns 5.91 ns bm_mismatch<short>/7 2.86 ns 6.63 ns bm_mismatch<short>/8 3.19 ns 7.33 ns bm_mismatch<short>/16 5.48 ns 13.0 ns bm_mismatch<short>/64 16.6 ns 4.06 ns bm_mismatch<short>/512 130 ns 13.8 ns bm_mismatch<short>/4096 985 ns 93.8 ns bm_mismatch<short>/32768 7846 ns 1002 ns bm_mismatch<short>/262144 63217 ns 10637 ns bm_mismatch<short>/1048576 251782 ns 42471 ns bm_mismatch<int>/1 0.716 ns 1.91 ns bm_mismatch<int>/2 1.21 ns 2.49 ns bm_mismatch<int>/3 1.38 ns 3.46 ns bm_mismatch<int>/4 1.71 ns 4.04 ns bm_mismatch<int>/5 2.00 ns 4.98 ns bm_mismatch<int>/6 2.43 ns 5.67 ns bm_mismatch<int>/7 3.05 ns 6.38 ns bm_mismatch<int>/8 3.22 ns 7.09 ns bm_mismatch<int>/16 5.18 ns 12.8 ns bm_mismatch<int>/64 16.6 ns 5.28 ns bm_mismatch<int>/512 129 ns 25.2 ns bm_mismatch<int>/4096 1009 ns 201 ns bm_mismatch<int>/32768 7776 ns 2144 ns bm_mismatch<int>/262144 62371 ns 20551 ns bm_mismatch<int>/1048576 254750 ns 90097 ns ``` |
||
|
|
07b18c5e1b |
[libc++] Optimize ranges::fill{,_n} for vector<bool>::iterator (#84642)
``` ------------------------------------------------------ Benchmark old new ------------------------------------------------------ bm_ranges_fill_n/1 1.64 ns 3.06 ns bm_ranges_fill_n/2 3.45 ns 3.06 ns bm_ranges_fill_n/3 4.88 ns 3.06 ns bm_ranges_fill_n/4 6.46 ns 3.06 ns bm_ranges_fill_n/5 8.03 ns 3.06 ns bm_ranges_fill_n/6 9.65 ns 3.07 ns bm_ranges_fill_n/7 11.5 ns 3.06 ns bm_ranges_fill_n/8 13.0 ns 3.06 ns bm_ranges_fill_n/16 25.9 ns 3.06 ns bm_ranges_fill_n/64 103 ns 4.62 ns bm_ranges_fill_n/512 711 ns 4.40 ns bm_ranges_fill_n/4096 5642 ns 9.86 ns bm_ranges_fill_n/32768 45135 ns 33.6 ns bm_ranges_fill_n/262144 360818 ns 243 ns bm_ranges_fill_n/1048576 1442828 ns 982 ns bm_ranges_fill/1 1.63 ns 3.17 ns bm_ranges_fill/2 3.43 ns 3.28 ns bm_ranges_fill/3 4.97 ns 3.31 ns bm_ranges_fill/4 6.53 ns 3.27 ns bm_ranges_fill/5 8.12 ns 3.33 ns bm_ranges_fill/6 9.76 ns 3.32 ns bm_ranges_fill/7 11.6 ns 3.29 ns bm_ranges_fill/8 13.2 ns 3.26 ns bm_ranges_fill/16 26.3 ns 3.26 ns bm_ranges_fill/64 104 ns 4.92 ns bm_ranges_fill/512 716 ns 4.47 ns bm_ranges_fill/4096 5772 ns 8.21 ns bm_ranges_fill/32768 45778 ns 33.1 ns bm_ranges_fill/262144 351422 ns 241 ns bm_ranges_fill/1048576 1404710 ns 965 ns ``` |
||
|
|
69279a8413 |
[libc++][test] add benchmarks for std::atomic::wait (#70571)
For the mutex vs atomic test: Old: `unique_lock<mutex>` New: a lock implemented with `atomic::wait` On 10 years old Intel Macbook, `atomic::wait` is 50% slower than `mutex` ``` Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------- BM_multi_thread_lock_unlock/1024 +0.3735 +2.4497 1724726 2368935 153159 528354 BM_multi_thread_lock_unlock/2048 +0.4174 +1.2487 3410538 4834012 435062 978311 BM_multi_thread_lock_unlock/4096 +0.5256 +1.9824 6903783 10532681 590266 1760405 BM_multi_thread_lock_unlock/8192 +0.5415 +0.4578 14536391 22408399 1456328 2123075 BM_multi_thread_lock_unlock/16384 +0.5663 +0.0513 30181991 47275023 3316850 |
||
|
|
4e112e5c1c |
Reapply "[libc++] Optimize vector growing of trivially relocatable types" (#80558)
This reapplies #76657. Non-trivial elements didn't get destroyed previously. This fixes the bug and adds tests for all the vector insertion functions. |
||
|
|
2352fdd202 |
Revert "[libc++] Optimize vector growing of trivially relocatable types (#76657)"
Broke sanitizer bots: https://lab.llvm.org/buildbot/#/builders/5/builds/40641
This reverts commit
|
||
|
|
67eee4a029 |
[libc++] Optimize vector growing of trivially relocatable types (#76657)
This patch introduces a new trait to represent whether a type is trivially relocatable, and uses that trait to optimize the growth of a std::vector of trivially relocatable objects. ``` -------------------------------------------------- Benchmark old new -------------------------------------------------- bm_grow<int> 1354 ns 1301 ns bm_grow<std::string> 5584 ns 3370 ns bm_grow<std::unique_ptr<int>> 3506 ns 1994 ns bm_grow<std::deque<int>> 27114 ns 27209 ns ``` This also changes to order of moving and destroying the objects when growing the vector. This should not affect our conformance. |
||
|
|
51e91b64d0 |
[libc++abi] Implement __cxa_init_primary_exception and use it to optimize std::make_exception_ptr (#65534)
This patch implements __cxa_init_primary_exception, an extension to the Itanium C++ ABI. This extension is already present in both libsupc++ and libcxxrt. This patch also starts making use of this function in std::make_exception_ptr: instead of going through a full throw/catch cycle, we are now able to initialize an exception directly, thus making std::make_exception_ptr around 30x faster. |
||
|
|
fdd089b500 |
[libc++] Implement ranges::contains (#65148)
Differential Revision: https://reviews.llvm.org/D159232 ``` Running ./ranges_contains.libcxx.out Run on (10 X 24.121 MHz CPU s) CPU Caches: L1 Data 64 KiB (x10) L1 Instruction 128 KiB (x10) L2 Unified 4096 KiB (x5) Load Average: 3.37, 6.77, 5.27 -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- bm_contains_char/16 1.88 ns 1.87 ns 371607095 bm_contains_char/256 7.48 ns 7.47 ns 93292285 bm_contains_char/4096 99.7 ns 99.6 ns 7013185 bm_contains_char/65536 1296 ns 1294 ns 540436 bm_contains_char/1048576 23887 ns 23860 ns 29302 bm_contains_char/16777216 389420 ns 389095 ns 1796 bm_contains_int/16 7.14 ns 7.14 ns 97776288 bm_contains_int/256 90.4 ns 90.3 ns 7558089 bm_contains_int/4096 1294 ns 1290 ns 543052 bm_contains_int/65536 20482 ns 20443 ns 34334 bm_contains_int/1048576 328817 ns 327965 ns 2147 bm_contains_int/16777216 5246279 ns 5239361 ns 133 bm_contains_bool/16 2.19 ns 2.19 ns 322565780 bm_contains_bool/256 3.42 ns 3.41 ns 205025467 bm_contains_bool/4096 22.1 ns 22.1 ns 31780479 bm_contains_bool/65536 333 ns 332 ns 2106606 bm_contains_bool/1048576 5126 ns 5119 ns 135901 bm_contains_bool/16777216 81656 ns 81574 ns 8569 ``` --------- Co-authored-by: Nathan Gauër <brioche@google.com> |
||
|
|
f7407411a1 |
[libc++] Optimize std::find for segmented iterators (#67224)
``` -------------------------------------------------------------------------- Benchmark old new -------------------------------------------------------------------------- bm_find<std::deque<char>>/1 6.06 ns 10.6 ns bm_find<std::deque<char>>/2 15.5 ns 10.6 ns bm_find<std::deque<char>>/3 19.0 ns 10.6 ns bm_find<std::deque<char>>/4 20.8 ns 10.6 ns bm_find<std::deque<char>>/5 22.0 ns 10.6 ns bm_find<std::deque<char>>/6 23.0 ns 10.5 ns bm_find<std::deque<char>>/7 24.8 ns 10.7 ns bm_find<std::deque<char>>/8 25.7 ns 10.6 ns bm_find<std::deque<char>>/16 28.3 ns 10.6 ns bm_find<std::deque<char>>/64 44.2 ns 27.0 ns bm_find<std::deque<char>>/512 133 ns 37.6 ns bm_find<std::deque<char>>/4096 867 ns 53.1 ns bm_find<std::deque<char>>/32768 6838 ns 160 ns bm_find<std::deque<char>>/262144 52897 ns 1495 ns bm_find<std::deque<char>>/1048576 215621 ns 6077 ns bm_find<std::deque<short>>/1 6.03 ns 6.28 ns bm_find<std::deque<short>>/2 15.8 ns 15.8 ns bm_find<std::deque<short>>/3 20.5 ns 20.3 ns bm_find<std::deque<short>>/4 21.0 ns 21.0 ns bm_find<std::deque<short>>/5 23.0 ns 22.1 ns bm_find<std::deque<short>>/6 22.6 ns 23.0 ns bm_find<std::deque<short>>/7 23.4 ns 23.7 ns bm_find<std::deque<short>>/8 24.4 ns 24.9 ns bm_find<std::deque<short>>/16 26.6 ns 27.2 ns bm_find<std::deque<short>>/64 43.2 ns 40.9 ns bm_find<std::deque<short>>/512 124 ns 90.7 ns bm_find<std::deque<short>>/4096 845 ns 525 ns bm_find<std::deque<short>>/32768 7273 ns 3194 ns bm_find<std::deque<short>>/262144 53710 ns 24385 ns bm_find<std::deque<short>>/1048576 216086 ns 96195 ns bm_find<std::deque<int>>/1 6.03 ns 10.3 ns bm_find<std::deque<int>>/2 15.6 ns 10.3 ns bm_find<std::deque<int>>/3 19.1 ns 10.3 ns bm_find<std::deque<int>>/4 22.3 ns 10.3 ns bm_find<std::deque<int>>/5 23.5 ns 10.4 ns bm_find<std::deque<int>>/6 23.1 ns 10.3 ns bm_find<std::deque<int>>/7 23.7 ns 10.2 ns bm_find<std::deque<int>>/8 24.5 ns 10.2 ns bm_find<std::deque<int>>/16 27.9 ns 26.6 ns bm_find<std::deque<int>>/64 42.6 ns 32.2 ns bm_find<std::deque<int>>/512 123 ns 43.0 ns bm_find<std::deque<int>>/4096 874 ns 93.5 ns bm_find<std::deque<int>>/32768 7031 ns 751 ns bm_find<std::deque<int>>/262144 57723 ns 6169 ns bm_find<std::deque<int>>/1048576 230867 ns 35851 ns bm_ranges_find<std::deque<char>>/1 5.97 ns 10.6 ns bm_ranges_find<std::deque<char>>/2 16.0 ns 10.5 ns bm_ranges_find<std::deque<char>>/3 19.5 ns 10.5 ns bm_ranges_find<std::deque<char>>/4 21.1 ns 10.6 ns bm_ranges_find<std::deque<char>>/5 22.8 ns 10.5 ns bm_ranges_find<std::deque<char>>/6 22.8 ns 10.6 ns bm_ranges_find<std::deque<char>>/7 23.4 ns 10.8 ns bm_ranges_find<std::deque<char>>/8 24.1 ns 10.5 ns bm_ranges_find<std::deque<char>>/16 26.9 ns 10.6 ns bm_ranges_find<std::deque<char>>/64 50.2 ns 27.2 ns bm_ranges_find<std::deque<char>>/512 126 ns 38.3 ns bm_ranges_find<std::deque<char>>/4096 868 ns 53.8 ns bm_ranges_find<std::deque<char>>/32768 6695 ns 161 ns bm_ranges_find<std::deque<char>>/262144 54411 ns 1497 ns bm_ranges_find<std::deque<char>>/1048576 241699 ns 6042 ns bm_ranges_find<std::deque<short>>/1 6.39 ns 6.31 ns bm_ranges_find<std::deque<short>>/2 15.8 ns 15.9 ns bm_ranges_find<std::deque<short>>/3 19.0 ns 19.8 ns bm_ranges_find<std::deque<short>>/4 20.8 ns 20.9 ns bm_ranges_find<std::deque<short>>/5 21.8 ns 22.1 ns bm_ranges_find<std::deque<short>>/6 23.0 ns 23.0 ns bm_ranges_find<std::deque<short>>/7 23.2 ns 23.9 ns bm_ranges_find<std::deque<short>>/8 23.7 ns 24.4 ns bm_ranges_find<std::deque<short>>/16 26.6 ns 26.8 ns bm_ranges_find<std::deque<short>>/64 43.4 ns 39.7 ns bm_ranges_find<std::deque<short>>/512 131 ns 90.5 ns bm_ranges_find<std::deque<short>>/4096 851 ns 523 ns bm_ranges_find<std::deque<short>>/32768 7370 ns 3166 ns bm_ranges_find<std::deque<short>>/262144 60778 ns 24814 ns bm_ranges_find<std::deque<short>>/1048576 229288 ns 99273 ns bm_ranges_find<std::deque<int>>/1 6.43 ns 10.2 ns bm_ranges_find<std::deque<int>>/2 16.6 ns 10.2 ns bm_ranges_find<std::deque<int>>/3 19.6 ns 10.2 ns bm_ranges_find<std::deque<int>>/4 21.0 ns 10.2 ns bm_ranges_find<std::deque<int>>/5 21.9 ns 10.4 ns bm_ranges_find<std::deque<int>>/6 22.7 ns 10.2 ns bm_ranges_find<std::deque<int>>/7 23.9 ns 10.2 ns bm_ranges_find<std::deque<int>>/8 23.8 ns 10.2 ns bm_ranges_find<std::deque<int>>/16 27.2 ns 27.1 ns bm_ranges_find<std::deque<int>>/64 42.4 ns 32.4 ns bm_ranges_find<std::deque<int>>/512 122 ns 43.0 ns bm_ranges_find<std::deque<int>>/4096 895 ns 93.7 ns bm_ranges_find<std::deque<int>>/32768 6890 ns 756 ns bm_ranges_find<std::deque<int>>/262144 54025 ns 6102 ns bm_ranges_find<std::deque<int>>/1048576 221558 ns 32783 ns ``` |
||
|
|
36dd743212 |
[libc++][test] add more benchmarks for stop_token (#69590)
|
||
|
|
1a5af34e6f |
[libc++] Speed up classic locale (take 2) (#73533)
Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. This is a re-application of |
||
|
|
ac8c9f1e39 |
[libc++] Properly guard std::filesystem with >= C++17 (#72701)
<filesystem> is a C++17 addition. In C++11 and C++14 modes, we actually have all the code for <filesystem> but it is hidden behind a non-inline namespace __fs so it is not accessible. Instead of doing this unusual dance, just guard the code for filesystem behind a classic C++17 check like we normally do. |
||
|
|
4e0c48b907 |
Revert "[libc++] Speed up classic locale (#72112)"
Looks like it broke the ASAN build: https://lab.llvm.org/buildbot/#/builders/168/builds/17053/steps/9/logs/stdio
This reverts commit
|
||
|
|
f8afc53d64 |
[libc++] Speed up classic locale (#72112)
Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. Co-authored-by: Louis Dionne <ldionne.2@gmail.com> ``` │ baseline │ optimized │ │ sec/op │ sec/op vs base │ Istream_numbers/0/threads:1 4.672µ ± 0% 4.419µ ± 0% -5.42% (p=0.000 n=30+39) Istream_numbers/0/threads:72 539.817µ ± 0% 9.842µ ± 1% -98.18% (p=0.000 n=30+40) Istream_numbers/1/threads:1 4.890µ ± 0% 4.750µ ± 0% -2.85% (p=0.000 n=30+40) Istream_numbers/1/threads:72 66.44µ ± 1% 10.14µ ± 1% -84.74% (p=0.000 n=30+40) Istream_numbers/2/threads:1 4.888µ ± 0% 4.746µ ± 0% -2.92% (p=0.000 n=30+40) Istream_numbers/2/threads:72 494.8µ ± 0% 410.2µ ± 1% -17.11% (p=0.000 n=30+40) Istream_numbers/3/threads:1 4.697µ ± 0% 4.695µ ± 5% ~ (p=0.391 n=30+37) Istream_numbers/3/threads:72 421.5µ ± 7% 421.9µ ± 9% ~ (p=0.665 n=30) Ostream_number/0/threads:1 183.0n ± 0% 141.0n ± 2% -22.95% (p=0.000 n=30) Ostream_number/0/threads:72 24196.5n ± 1% 343.5n ± 3% -98.58% (p=0.000 n=30) Ostream_number/1/threads:1 250.0n ± 0% 196.0n ± 2% -21.60% (p=0.000 n=30) Ostream_number/1/threads:72 16260.5n ± 0% 407.0n ± 2% -97.50% (p=0.000 n=30) Ostream_number/2/threads:1 254.0n ± 0% 196.0n ± 1% -22.83% (p=0.000 n=30) Ostream_number/2/threads:72 28.49µ ± 1% 18.89µ ± 5% -33.72% (p=0.000 n=30) Ostream_number/3/threads:1 185.0n ± 0% 185.0n ± 0% 0.00% (p=0.017 n=30) Ostream_number/3/threads:72 19.38µ ± 4% 19.33µ ± 5% ~ (p=0.425 n=30) ``` |
||
|
|
c81bfc61da |
[libc++] Optimize for_each for segmented iterators
``` --------------------------------------------------- Benchmark old new --------------------------------------------------- bm_for_each/1 3.00 ns 2.98 ns bm_for_each/2 4.53 ns 4.57 ns bm_for_each/3 5.82 ns 5.82 ns bm_for_each/4 6.94 ns 6.91 ns bm_for_each/5 7.55 ns 7.75 ns bm_for_each/6 7.06 ns 7.45 ns bm_for_each/7 6.69 ns 7.14 ns bm_for_each/8 6.86 ns 4.06 ns bm_for_each/16 11.5 ns 5.73 ns bm_for_each/64 43.7 ns 4.06 ns bm_for_each/512 356 ns 7.98 ns bm_for_each/4096 2787 ns 53.6 ns bm_for_each/32768 20836 ns 438 ns bm_for_each/262144 195362 ns 4945 ns bm_for_each/1048576 685482 ns 19822 ns ``` Reviewed By: ldionne, Mordante, #libc Spies: bgraur, sberg, arichardson, libcxx-commits Differential Revision: https://reviews.llvm.org/D151274 |
||
|
|
511236e074 |
[libc++][test] Add stop_token benchmark (#69117)
This is transforming the `stop_token` benchmark that Lewis Baker had created into Google Bench https://reviews.llvm.org/D154702 |
||
|
|
36bb134ac7 |
[libc++] Use -nostdlib++ on GCC unconditionally (#68832)
We support GCC 13, which supports the flag. This allows simplifying the CMake logic around the use of -nostdlib++. Note that there are other places where we don't assume -nostdlib++ yet in the build, but this patch is intentionally trying to be small because this part of our CMake is pretty tricky. |
||
|
|
a9138cdb36 |
[libc++] Optimize ranges::count for __bit_iterators
``` --------------------------------------------------------------- Benchmark old new --------------------------------------------------------------- bm_vector_bool_count/1 1.92 ns 1.92 ns bm_vector_bool_count/2 1.92 ns 1.92 ns bm_vector_bool_count/3 1.92 ns 1.92 ns bm_vector_bool_count/4 1.92 ns 1.92 ns bm_vector_bool_count/5 1.92 ns 1.92 ns bm_vector_bool_count/6 1.92 ns 1.92 ns bm_vector_bool_count/7 1.92 ns 1.92 ns bm_vector_bool_count/8 1.92 ns 1.92 ns bm_vector_bool_count/16 1.92 ns 1.92 ns bm_vector_bool_count/64 2.24 ns 2.25 ns bm_vector_bool_count/512 3.19 ns 3.20 ns bm_vector_bool_count/4096 14.1 ns 12.3 ns bm_vector_bool_count/32768 84.0 ns 83.6 ns bm_vector_bool_count/262144 664 ns 661 ns bm_vector_bool_count/1048576 2623 ns 2628 ns bm_vector_bool_ranges_count/1 1.07 ns 1.92 ns bm_vector_bool_ranges_count/2 1.65 ns 1.92 ns bm_vector_bool_ranges_count/3 2.27 ns 1.92 ns bm_vector_bool_ranges_count/4 2.68 ns 1.92 ns bm_vector_bool_ranges_count/5 3.33 ns 1.92 ns bm_vector_bool_ranges_count/6 3.99 ns 1.92 ns bm_vector_bool_ranges_count/7 4.67 ns 1.92 ns bm_vector_bool_ranges_count/8 5.19 ns 1.92 ns bm_vector_bool_ranges_count/16 11.1 ns 1.92 ns bm_vector_bool_ranges_count/64 52.2 ns 2.24 ns bm_vector_bool_ranges_count/512 452 ns 3.20 ns bm_vector_bool_ranges_count/4096 3577 ns 12.1 ns bm_vector_bool_ranges_count/32768 28725 ns 83.7 ns bm_vector_bool_ranges_count/262144 229676 ns 662 ns bm_vector_bool_ranges_count/1048576 905574 ns 2625 ns ``` Reviewed By: #libc, ldionne Spies: arichardson, ldionne, libcxx-commits Differential Revision: https://reviews.llvm.org/D156956 |
||
|
|
6fe4e033f0 |
[libc++] Optimize vector push_back to avoid continuous load and store of end pointer
Credits: this change is based on analysis and a proof of concept by gerbens@google.com. Before, the compiler loses track of end as 'this' and other references possibly escape beyond the compiler's scope. This can be see in the generated assembly: 16.28 │200c80: mov %r15d,(%rax) 60.87 │200c83: add $0x4,%rax │200c87: mov %rax,-0x38(%rbp) 0.03 │200c8b: → jmpq 200d4e ... ... 1.69 │200d4e: cmp %r15d,%r12d │200d51: → je 200c40 16.34 │200d57: inc %r15d 0.05 │200d5a: mov -0x38(%rbp),%rax 3.27 │200d5e: mov -0x30(%rbp),%r13 1.47 │200d62: cmp %r13,%rax │200d65: → jne 200c80 We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise', but creates nice and compact fast path code, i.e.: 32.64 │200760: mov %r14d,(%r12) 9.97 │200764: add $0x4,%r12 6.97 │200768: mov %r12,-0x38(%rbp) 32.17 │20076c: add $0x1,%r14d 2.36 │200770: cmp %r14d,%ebx │200773: → je 200730 8.98 │200775: mov -0x30(%rbp),%r13 6.75 │200779: cmp %r13,%r12 │20077c: → jne 200760 Now there is a single store for the push_back value (as before), and a single store for the end without a reload (dependency). For fully local vectors, (i.e., not referenced elsewhere), the capacity load and store inside the loop could also be removed, but this requires more substantial refactoring inside vector. Differential Revision: https://reviews.llvm.org/D80588 |
||
|
|
dd788af74a |
Reapply "[libc++][ranges] Add benchmarks for the from_range constructors of vector and deque." (#67753)
This reverts commit
|
||
|
|
10edd5d943 |
Revert "[libc++][ranges] Add benchmarks for the from_range constructors of vector and deque."
This reverts commit
|
||
|
|
1abff08e9e |
[AIX] cxx_std_23 is currently not a known feature to ibm-clang (#66952)
Temporary workaround for the following CMake error: ``` CMake Error in /llvm/libcxx/benchmarks/CMakeLists.txt: The compiler feature "cxx_std_23" is not known to CXX compiler "IBMClang" ``` |
||
|
|
8bbcc6d548 |
[libc++] Add test coverage for unordered containers comparison (#66692)
This patch is a melting pot of changes picked up from https://llvm.org/D61878. It adds a few tests checking corner cases of unordered containers comparison and adds benchmarks for a few unordered_set operations. |
||
|
|
0218ea4aaa |
[libc++] Implement ranges::ends_with
Reviewed By: #libc, var-const Differential Revision: https://reviews.llvm.org/D150831 |
||
|
|
3df9c2984b |
[libc++] Fix minor warnings in libcxx benchmarks
This fixes some unused variable and "comparison of integers of different signs" warnings in the benchmarks. Differential Revision: https://reviews.llvm.org/D156613 |
||
|
|
d671126ad0 | [libc++][NFC] Remove stray #if 1 that was probably a debugging leftover | ||
|
|
679c0b48d7 |
[libc++abi] Refactor around __dynamic_cast
This commit contains refactorings around __dynamic_cast without changing its behavior. Some important changes include: - Refactor __dynamic_cast into various small helper functions; - Move dynamic_cast_stress.pass.cpp to libcxx/benchmarks and refactor it into a benchmark. The benchmark performance numbers are updated as well. Differential Revision: https://reviews.llvm.org/D138006 |
||
|
|
390ac82317 |
[libc++][ranges] Add benchmarks for the from_range constructors of vector and deque.
Differential Revision: https://reviews.llvm.org/D150747 |
||
|
|
68b1035965 |
[libc++][PSTL] Add a __parallel_sort implementation to libdispatch
Reviewed By: #libc, ldionne Spies: ldionne, libcxx-commits Differential Revision: https://reviews.llvm.org/D155136 |
||
|
|
66bb6b4fda |
[libc++] Optimize internal function in <system_error>
In the event the internal function __init is called with an empty string the code will take unnecessary extra steps, in addition, the code generated might be overall greater because, to my understanding, when initializing a string with an empty `const char*` "" (like in this case), the compiler might be unable to deduce the string is indeed empty at compile time and more code is generated. The goal of this patch is to make a new internal function that will accept just an error code skipping the empty string argument. It should skip the unnecessary steps and in the event `if (ec)` is `false`, it will return an empty string using the correct ctor, avoiding any extra code generation issues. After the conversation about this patch matured in the libcxx channel on the LLVM Discord server, the patch was analyzed quickly with "Compiler Explorer" and other tools and it was discovered that it does indeed reduce the amount of code generated when using the latest stable clang version (16) which in turn produces faster code. This patch targets LLVM 18 as it will break the ABI by addressing https://github.com/llvm/llvm-project/issues/63985 Benchmark tests run on other machines as well show in the best case, that the new version without the extra string as an argument performs 10 times faster. On the buildkite CI run it shows the new code takes less CPU time as well. In conclusion, the new code should also just appear cleaner because there are fewer checks to do when there is no message. Reviewed By: #libc, philnik Spies: emaste, nemanjai, philnik, libcxx-commits Differential Revision: https://reviews.llvm.org/D155820 |
||
|
|
8670b53e11 |
[libc++] Optimize ranges::find for vector<bool>
Benchmark results: ``` ---------------------------------------------------------------- Benchmark old new ---------------------------------------------------------------- bm_vector_bool_ranges_find/1 5.64 ns 6.08 ns bm_vector_bool_ranges_find/2 16.5 ns 6.03 ns bm_vector_bool_ranges_find/3 20.3 ns 6.07 ns bm_vector_bool_ranges_find/4 22.2 ns 6.08 ns bm_vector_bool_ranges_find/5 23.5 ns 6.05 ns bm_vector_bool_ranges_find/6 24.4 ns 6.10 ns bm_vector_bool_ranges_find/7 26.7 ns 6.10 ns bm_vector_bool_ranges_find/8 25.0 ns 6.08 ns bm_vector_bool_ranges_find/16 27.9 ns 6.07 ns bm_vector_bool_ranges_find/64 44.5 ns 5.35 ns bm_vector_bool_ranges_find/512 243 ns 25.7 ns bm_vector_bool_ranges_find/4096 1858 ns 35.6 ns bm_vector_bool_ranges_find/32768 15461 ns 93.5 ns bm_vector_bool_ranges_find/262144 126462 ns 571 ns bm_vector_bool_ranges_find/1048576 497736 ns 2272 ns ``` Reviewed By: #libc, Mordante Spies: var-const, Mordante, libcxx-commits Differential Revision: https://reviews.llvm.org/D156039 |
||
|
|
78addb2c32 |
Use hash value checks optimizations consistently
There are couple of optimizations of `__hash_table::find` which are applicable
to other places like `__hash_table::__node_insert_unique_prepare` and
`__hash_table::__emplace_unique_key_args`.
```
for (__nd = __nd->__next_; __nd != nullptr &&
(__nd->__hash() == __hash
// ^^^^^^^^^^^^^^^^^^^^^^
// (1)
|| std::__constrain_hash(__nd->__hash(), __bc) == __chash);
__nd = __nd->__next_)
{
if ((__nd->__hash() == __hash)
// ^^^^^^^^^^^^^^^^^^^^^^^^^^
// (2)
&& key_eq()(__nd->__upcast()->__value_, __k))
return iterator(__nd, this);
}
```
(1): We can avoid expensive modulo operations from `std::__constrain_hash` if
hashes matched. This one is from commit
|
||
|
|
5aa03b648b |
[libc++][NFC] Apply clang-format on large parts of the code base
This commit does a pass of clang-format over files in libc++ that don't require major changes to conform to our style guide, or for which we're not overly concerned about conflicting with in-flight patches or hindering the git blame. This roughly covers: - benchmarks - range algorithms - concepts - type traits I did a manual verification of all the changes, and in particular I applied clang-format on/off annotations in a few places where the result was less readable after than before. This was not necessary in a lot of places, however I did find that clang-format had pretty bad taste when it comes to formatting concepts. Differential Revision: https://reviews.llvm.org/D153140 |
||
|
|
f26b43fa5c |
[libc++] Removes CMake work-arounds.
CMake older than 3.20.0 is no longer supported. This removes work-arounds for no longer supported versions. Reviewed By: #libc, jloser, philnik Differential Revision: https://reviews.llvm.org/D152099 |
||
|
|
d965960fcf |
Revert "[libc++] Optimize for_each for segmented iterators"
This reverts commit
|
||
|
|
b1dc43aa3a |
[libc++] Optimize for_each for segmented iterators
``` --------------------------------------------------- Benchmark old new --------------------------------------------------- bm_for_each/1 3.00 ns 2.98 ns bm_for_each/2 4.53 ns 4.57 ns bm_for_each/3 5.82 ns 5.82 ns bm_for_each/4 6.94 ns 6.91 ns bm_for_each/5 7.55 ns 7.75 ns bm_for_each/6 7.06 ns 7.45 ns bm_for_each/7 6.69 ns 7.14 ns bm_for_each/8 6.86 ns 4.06 ns bm_for_each/16 11.5 ns 5.73 ns bm_for_each/64 43.7 ns 4.06 ns bm_for_each/512 356 ns 7.98 ns bm_for_each/4096 2787 ns 53.6 ns bm_for_each/32768 20836 ns 438 ns bm_for_each/262144 195362 ns 4945 ns bm_for_each/1048576 685482 ns 19822 ns ``` Reviewed By: ldionne, Mordante, #libc Spies: arichardson, libcxx-commits Differential Revision: https://reviews.llvm.org/D151274 |
||
|
|
1fd08edd58 |
[libc++] Forward to std::{,w}memchr in std::find
Reviewed By: #libc, ldionne Spies: Mordante, libcxx-commits, ldionne, mikhail.ramalho Differential Revision: https://reviews.llvm.org/D144394 |
||
|
|
7bfaa0f09d |
[NFC][Py Reformat] Reformat python files in libcxx/libcxxabi
This is an ongoing series of commits that are reformatting our Python code. Reformatting is done with `black`. If you end up having problems merging this commit because you have made changes to a python file, the best way to handle that is to run git checkout --ours <yourfile> and then reformat it with black. If you run into any problems, post to discourse about it and we will try to help. RFC Thread below: https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style Reviewed By: #libc, kwk, Mordante Differential Revision: https://reviews.llvm.org/D150763 |
||
|
|
63a2b206fa |
[libc++, std::vector] call the optimized version of __uninitialized_allocator_copy for trivial types
See: https://github.com/llvm/llvm-project/issues/61987 Fix suggested by: @philnik and @var-const Reviewers: philnik, ldionne, EricWF, var-const Differential Revision: https://reviews.llvm.org/D147741 Testing: ninja check-cxx check-clang check-llvm Benchmark Testcases (BM_CopyConstruct, and BM_Assignment) added. performance improvement: Run on (8 X 4800 MHz CPU s) CPU Caches: L1 Data 48 KiB (x4) L1 Instruction 32 KiB (x4) L2 Unified 1280 KiB (x4) L3 Unified 12288 KiB (x1) Load Average: 1.66, 3.02, 2.43 Comparing build-runtimes-base/libcxx/benchmarks/vector_operations.libcxx.out to build-runtimes/libcxx/benchmarks/vector_operations.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------- BM_ConstructSize/vector_byte/5140480 +0.0362 +0.0362 116906 121132 116902 121131 BM_CopyConstruct/vector_int/5140480 -0.4563 -0.4577 1755224 954241 1755330 951987 BM_Assignment/vector_int/5140480 -0.0222 -0.0220 990045 968095 989917 968125 BM_ConstructSizeValue/vector_byte/5140480 +0.0308 +0.0307 116970 120567 116977 120573 BM_ConstructIterIter/vector_char/1024 -0.0831 -0.0831 19 17 19 17 BM_ConstructIterIter/vector_size_t/1024 +0.0129 +0.0131 88 89 88 89 BM_ConstructIterIter/vector_string/1024 -0.0064 -0.0018 54455 54109 54208 54112 OVERALL_GEOMEAN -0.0845 -0.0842 0 0 0 0 FYI, the perf improvements for BM_CopyConstruct due to this patch is mostly subsumed by the https://reviews.llvm.org/D149826. However this patch still adds value by converting copy to memmove (the second testcase). Before the patch: ``` define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 { %5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1 %6 = load ptr, ptr %5, align 8, !tbaa !12 %7 = icmp eq ptr %1, %2 br i1 %7, label %16, label %8 8: ; preds = %4, %8 %9 = phi ptr [ %13, %8 ], [ %1, %4 ] %10 = phi ptr [ %14, %8 ], [ %6, %4 ] %11 = icmp ne ptr %10, null tail call void @llvm.assume(i1 %11) %12 = load i32, ptr %9, align 4, !tbaa !14 store i32 %12, ptr %10, align 4, !tbaa !14 %13 = getelementptr inbounds i32, ptr %9, i64 1 %14 = getelementptr inbounds i32, ptr %10, i64 1 %15 = icmp eq ptr %13, %2 br i1 %15, label %16, label %8, !llvm.loop !16 16: ; preds = %8, %4 %17 = phi ptr [ %6, %4 ], [ %14, %8 ] store ptr %17, ptr %5, align 8, !tbaa !12 ret void } ``` After the patch: ``` define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 { %5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1 %6 = load ptr, ptr %5, align 8, !tbaa !12 %7 = ptrtoint ptr %2 to i64 %8 = ptrtoint ptr %1 to i64 %9 = sub i64 %7, %8 %10 = ashr exact i64 %9, 2 tail call void @llvm.memmove.p0.p0.i64(ptr align 4 %6, ptr align 4 %1, i64 %9, i1 false) %11 = getelementptr inbounds i32, ptr %6, i64 %10 store ptr %11, ptr %5, align 8, !tbaa !12 ret void } ``` This is due to the optimized version of uninitialized_allocator_copy function. |
||
|
|
c9d475c937 |
[libc++abi] Improve performance of __dynamic_cast
The original `__dynamic_cast` implementation does not use the ABI-provided `src2dst_offset` parameter which helps improve performance on the hot paths. This patch improves the performance on the hot paths in `__dynamic_cast` by leveraging hints provided by the `src2dst_offset` parameter. This patch also includes a performance benchmark suite for the `__dynamic_cast` implementation. Reviewed By: philnik, ldionne, #libc, #libc_abi, avogelsgesang Spies: mikhail.ramalho, avogelsgesang, xingxue, libcxx-commits Differential Revision: https://reviews.llvm.org/D138005 |
||
|
|
aff3cdc604 |
[libc++] Optimize std::ranges::{min, max} for types that are cheap to copy
Don't forward to `min_element` for small types that are trivially copyable, and instead use a naive loop that keeps track of the smallest element (as opposed to an iterator to the smallest element). This allows the compiler to vectorize the loop in some cases. Reviewed By: #libc, ldionne Spies: ldionne, libcxx-commits Differential Revision: https://reviews.llvm.org/D143596 |
||
|
|
b4ecfd3c46 |
[libc++] Forward to std::memcmp for trivially comparable types in equal
Reviewed By: #libc, ldionne Spies: ldionne, Mordante, libcxx-commits Differential Revision: https://reviews.llvm.org/D139554 |
||
|
|
2a06757a20 |
[libc++][spaceship] Implement lexicographical_compare_three_way
The implementation makes use of the freedom added by LWG 3410. We have two variants of this algorithm: * a fast path for random access iterators: This fast path computes the maximum number of loop iterations up-front and does not compare the iterators against their limits on every loop iteration. * A basic implementation for all other iterators: This implementation compares the iterators against their limits in every loop iteration. However, it still takes advantage of the freedom added by LWG 3410 to avoid unnecessary additional iterator comparisons, as originally specified by P1614R2. https://godbolt.org/z/7xbMEen5e shows the benefit of the fast path: The hot loop generated of `lexicographical_compare_three_way3` is more tight than for `lexicographical_compare_three_way1`. The added benchmark illustrates how this leads to a 30% - 50% performance improvement on integer vectors. Implements part of P1614R2 "The Mothership has Landed" Fixes LWG 3410 and LWG 3350 Differential Revision: https://reviews.llvm.org/D131395 |
||
|
|
21f4232dd9 |
[libc++] Enable segmented iterator optimizations for join_view::iterator
Reviewed By: ldionne, #libc Spies: libcxx-commits Differential Revision: https://reviews.llvm.org/D138413 |
||
|
|
c90801457f |
[libc++] Refactor deque::iterator algorithm optimizations
This has multiple benefits: - The optimizations are also performed for the `ranges::` versions of the algorithms - Code duplication is reduced - it is simpler to add this optimization for other segmented iterators, like `ranges::join_view::iterator` - Algorithm code is removed from `<deque>` Reviewed By: ldionne, huixie90, #libc Spies: mstorsjo, sstefan1, EricWF, libcxx-commits, mgorny Differential Revision: https://reviews.llvm.org/D132505 |
||
|
|
549a5fd0b7 |
[libc++] Make pmr::monotonic_buffer_resource bump down
Bumping down is significantly faster than bumping up. This is ABI breaking, but the ABI of `pmr::monotonic_buffer_resource` was only stabilized in this release cycle, so we can still change it. For a more detailed explanation why bumping down is better, see https://fitzgeraldnick.com/2019/11/01/always-bump-downwards.html. Reviewed By: ldionne, #libc Spies: libcxx-commits Differential Revision: https://reviews.llvm.org/D141435 |
||
|
|
355e0ce3c5 |
[libc++] Extend check for non-ASCII characters to src/, test/ and benchmarks/
Differential Revision: https://reviews.llvm.org/D132180 |