This patch adds the necessary support for the fopen and fclose functions
to work on the GPU via RPC. I added a new test that enables testing this
with the minimal features we have on the GPU. I will update it once we
have `fread` and `fwrite` to actually check the outputted strings. For
now I just relied on checking manually via the outpuot temp file.
Reviewed By: JonChesterfield, sivachandra
Differential Revision: https://reviews.llvm.org/D154519
Another low hanging fruit we can put on the GPU, this ports the tests
over to the hermetic framework so we can run them on the GPU.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D154540
Summary:
Reviewer requested that this routine not be a macro, however that means
that it was not being intitialized as the static initializer was done
before the memcpy from the device. Fix this so we can get timing
information.
This patch adds the necessary support to provide timing information in
`libc` tests. This is useful for determining which tests look what
amount of time. We also can use this as a test basis for providing more
fine-grained timing when implementing things on the GPU.
The main difficulty with this is the fact that the AMDGPU fixed
frequency clock operates at an unknown frequency. We need to read this
on a per-card basis from the driver and then copy it in. NVPTX on the
other hand has a fixed clock at a resolution of 1ns. I have also
increased the resolution of the print-outs as the majority of these are
below a millisecond for me.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D154446
This is based on ideas from @nafi to:
- use a branchless version of 'cmp' for 'uint32_t',
- completely resolve the lexicographic comparison through vector
operations when wide types are available. We also get rid of byte
reloads and serializing '__builtin_ctzll'.
I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' as it did not seem to help the codegen. This can
be revisited in sub-sequent patches.
The code been rewritten to reduce nested function calls, making the
job of the inliner easier and preventing harmful code duplication.
Reviewed By: nafi3000
Differential Revision: https://reviews.llvm.org/D148717
Clean up exhaustive tests. Let check functions return number of failures instead of passed/failed.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D153682
CMake plumbing cargo culted from other tests.
Minor changes to Process to allow statically allocating a buffer.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D153594
Once integrated in our codebase the patch triggered a bunch of failing
tests. We do not yet understand where the bug is but we revert it to
move forward with integration.
This reverts commit 5e32765c15.
Before this change, a separate static method named cleanup was used to
cleanup the file. Instead, now the close method cleans up the full file
object using the platform's close function.
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D153377
This patch:
(1) adds the add_with_carry_const and sub_with_borrow_const constexpr calls
to add and sub, respectively. Both add and sub are constexpr calls and
were call the non-constexpr version of add/sub_with_borrow.
(2) adds explicit UIntType construct calls in some fp tests.
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D150223
These currently give warnings for unused variables or a default case
where everything is covered.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D153137
This patch adds a test directly for the `fputs` function similar to the
existing `puts` test. This lets us know that the default file pointers
are function and the `fputs` interface works.
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D152288
The GPU port of the LLVM C library needs to export a few extensions to
the interface such that users can interface with it. This patch adds the
necessary logic to define a GPU extension. Currently, this only exports
a `rpc_reset_client` function. This allows us to use the server in
D147054 to set up the RPC interface outside of `libc`.
Depends on https://reviews.llvm.org/D147054
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D152283
Fixing an issue with LLVM libc's fenv.h defined rounding mode macros
differently from system libc, making get_round() return different values from
fegetround(). Also letting math tests to skip rounding modes that cannot be
set. This should allow math tests to be run on platforms in which fenv.h is not
implemented yet.
This allows us to re-enable hermatic floating point tests in
https://reviews.llvm.org/D151123 and reverting https://reviews.llvm.org/D152742.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D152873
This patch mimics the behavior of Google Test and allow users to log custom messages after all flavors of ASSERT_ / EXPECT_.
Reviewed By: sivachandra, lntue
Differential Revision: https://reviews.llvm.org/D152630
This patch mimics the behavior of Google Test and allow users to log custom messages after all flavors of ASSERT_ / EXPECT_.
Reviewed By: sivachandra, lntue
Differential Revision: https://reviews.llvm.org/D152630
Most of the time `memmove` is called on buffers that are disjoint, in that case we can use `memcpy` which is faster.
The additional test is branchless on x86, aarch64 and RISCV with the zbb extension (bitmanip).
On x86 this patch adds a latency of 2 to 3 cycles.
Before
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------
BM_Memmove/0/0_median 5.00 ns 5.00 ns 10 bytes_per_cycle=1.25477/s bytes_per_second=2.62933G/s items_per_second=199.87M/s __llvm_libc::memmove,memmove Google A
BM_Memmove/1/0_median 6.21 ns 6.21 ns 10 bytes_per_cycle=3.22173/s bytes_per_second=6.75106G/s items_per_second=160.955M/s __llvm_libc::memmove,memmove Google B
BM_Memmove/2/0_median 8.09 ns 8.09 ns 10 bytes_per_cycle=5.31462/s bytes_per_second=11.1366G/s items_per_second=123.603M/s __llvm_libc::memmove,memmove Google D
BM_Memmove/3/0_median 5.95 ns 5.95 ns 10 bytes_per_cycle=2.71865/s bytes_per_second=5.69687G/s items_per_second=167.967M/s __llvm_libc::memmove,memmove Google L
BM_Memmove/4/0_median 5.63 ns 5.63 ns 10 bytes_per_cycle=2.28294/s bytes_per_second=4.78383G/s items_per_second=177.615M/s __llvm_libc::memmove,memmove Google M
BM_Memmove/5/0_median 5.68 ns 5.68 ns 10 bytes_per_cycle=2.16798/s bytes_per_second=4.54295G/s items_per_second=176.015M/s __llvm_libc::memmove,memmove Google Q
BM_Memmove/6/0_median 7.46 ns 7.46 ns 10 bytes_per_cycle=3.97619/s bytes_per_second=8.332G/s items_per_second=134.044M/s __llvm_libc::memmove,memmove Google S
BM_Memmove/7/0_median 5.40 ns 5.40 ns 10 bytes_per_cycle=1.79695/s bytes_per_second=3.76546G/s items_per_second=185.211M/s __llvm_libc::memmove,memmove Google U
BM_Memmove/8/0_median 5.62 ns 5.62 ns 10 bytes_per_cycle=3.18747/s bytes_per_second=6.67927G/s items_per_second=177.983M/s __llvm_libc::memmove,memmove Google W
BM_Memmove/9/0_median 101 ns 101 ns 10 bytes_per_cycle=9.77359/s bytes_per_second=20.4803G/s items_per_second=9.9333M/s __llvm_libc::memmove,uniform 384 to 4096
```
After
```
BM_Memmove/0/0_median 3.57 ns 3.57 ns 10 bytes_per_cycle=1.71375/s bytes_per_second=3.59112G/s items_per_second=280.411M/s __llvm_libc::memmove,memmove Google A
BM_Memmove/1/0_median 4.52 ns 4.52 ns 10 bytes_per_cycle=4.47557/s bytes_per_second=9.37843G/s items_per_second=221.427M/s __llvm_libc::memmove,memmove Google B
BM_Memmove/2/0_median 5.70 ns 5.70 ns 10 bytes_per_cycle=7.37396/s bytes_per_second=15.4519G/s items_per_second=175.399M/s __llvm_libc::memmove,memmove Google D
BM_Memmove/3/0_median 4.47 ns 4.47 ns 10 bytes_per_cycle=3.4148/s bytes_per_second=7.15563G/s items_per_second=223.743M/s __llvm_libc::memmove,memmove Google L
BM_Memmove/4/0_median 4.53 ns 4.53 ns 10 bytes_per_cycle=2.86071/s bytes_per_second=5.99454G/s items_per_second=220.69M/s __llvm_libc::memmove,memmove Google M
BM_Memmove/5/0_median 4.19 ns 4.19 ns 10 bytes_per_cycle=2.5484/s bytes_per_second=5.3401G/s items_per_second=238.924M/s __llvm_libc::memmove,memmove Google Q
BM_Memmove/6/0_median 5.02 ns 5.02 ns 10 bytes_per_cycle=5.94164/s bytes_per_second=12.4505G/s items_per_second=199.14M/s __llvm_libc::memmove,memmove Google S
BM_Memmove/7/0_median 4.03 ns 4.03 ns 10 bytes_per_cycle=2.47028/s bytes_per_second=5.17641G/s items_per_second=247.906M/s __llvm_libc::memmove,memmove Google U
BM_Memmove/8/0_median 4.70 ns 4.70 ns 10 bytes_per_cycle=3.84975/s bytes_per_second=8.06706G/s items_per_second=212.72M/s __llvm_libc::memmove,memmove Google W
BM_Memmove/9/0_median 90.7 ns 90.7 ns 10 bytes_per_cycle=10.8681/s bytes_per_second=22.7739G/s items_per_second=11.02M/s __llvm_libc::memmove,uniform 384 to 4096
```
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D152811
Add Int<> and Int128 types to replace the usage of __int128_t in math
functions. Clean up to make sure that (U)Int128 and __(u)int128_t are
interchangeable in the code base.
Reviewed By: sivachandra, mikhail.ramalho
Differential Revision: https://reviews.llvm.org/D152459
A patch enabled this test which uses that `add_fp_unittest`.
Unfortunately we do not support these on the GPU because it attempts to
link in the floating point utils which are not built supporting
hermetic tests. This was attempted to be fixed in D151123 but that had
to be reverted. For now disable these so the tests pass.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D152742
This patch adds the reentrent qsort entrypoint, qsort_r. This is done by
extending the qsort functionality and moving it to a shared utility
header. For this reason the qsort_r tests focus mostly on the places
where it differs from qsort, since they share the same sorting code.
Reviewed By: sivachandra, lntue
Differential Revision: https://reviews.llvm.org/D152467
This is based on ideas from @nafi to:
- use a branchless version of 'cmp' for 'uint32_t',
- completely resolve the lexicographic comparison through vector
operations when wide types are available. We also get rid of byte
reloads and serializing '__builtin_ctzll'.
I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' as it did not seem to help the codegen. This can
be revisited in sub-sequent patches.
The code been rewritten to reduce nested function calls, making the
job of the inliner easier and preventing harmful code duplication.
Reviewed By: nafi3000
Differential Revision: https://reviews.llvm.org/D148717
Many math functions need to check for floating point rounding modes to
return correct values. Currently most of them use the internal implementation
of `fegetround`, which is platform-dependent and blocking math functions to be
enabled on platforms with unimplemented `fegetround`. In this change, we add
platform independent rounding mode checks and switching math functions to use
them instead. https://github.com/llvm/llvm-project/issues/63016
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D152280
This is based on ideas from @nafi to:
- use a branchless version of 'cmp' for 'uint32_t',
- completely resolve the lexicographic comparison through vector
operations when wide types are available. We also get rid of byte
reloads and serializing '__builtin_ctzll'.
I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' as it did not seem to help the codegen. This can
be revisited in sub-sequent patches.
The code been rewritten to reduce nested function calls, making the
job of the inliner easier and preventing harmful code duplication.
Reviewed By: nafi3000
Differential Revision: https://reviews.llvm.org/D148717
str method of FPBits class is only used for pretty printing its objects
in tests. It brings cpp::string dependency to FPBits class, which is not ideal
for embedded use case. We move str method to a free function in test utils and
remove this dependency of FPBits class.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D152607
A few of these tests were disabled due to failing on NVPTX. After
looking into it the vast majority of these cases were due to
insufficient stack memory. This can be worked around by increasing the
stack size in the loader or by reducing the memory usage in the case of
large string constants.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D152583
This is a bit of cleanup before working on logging via stream operator (i.e., `EXPECT_XXX() << ...`).
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D152503
This test started failing on Nvidia, we need to disable it to keep the
bot green until we can investigate the root cause.
Differential Revision: https://reviews.llvm.org/D152481
The results for the %Lf tests were calculated on 80 bit long double
systems, meaning the results are not necessarily accurate for float128
systems. In future these will be fixed, but for the moment I'm just
turning them off.
Differential Revision: https://reviews.llvm.org/D152471
This patch adds three options for printf decimal long doubles, and these
can also apply to doubles.
1. Use a giant table which is fast and accurate, but takes up ~5MB).
2. Use dyadic floats for approximations, which only gives ~50 digits of
accuracy but is very fast.
3. Use large integers for approximations, which is accurate but very
slow.
Reviewed By: sivachandra, lntue
Differential Revision: https://reviews.llvm.org/D150399
This patch adds the initial support required to support basic priting in
`stdio.h` via `puts` and `fputs`. This is done using the existing LLVM C
library `File` API. In this sense we can think of the RPC interface as
our system call to dump the character string to the file. We carry a
`uintptr_t` reference as our native "file descriptor" as it will be used
as an opaque reference to the host's version once functions like
`fopen` are supported.
For some unknown reason the declaration of the `StdIn` variable causes
both the AMDGPU and NVPTX backends to crash if I use the `READ` flag.
This is not used currently as we only support output now, but it needs
to be fixed
Reviewed By: sivachandra, lntue
Differential Revision: https://reviews.llvm.org/D151282
This patch adds support for the `malloc` and `free` functions. These
currently aren't implemented in-tree so we first add the interface
filies.
This patch provides the most basic support for a true `malloc` and
`free` by using the RPC interface. This is functional, but in the future
we will want to implement a more intelligent system and primarily use
the RPC interface more as a `brk()` or `sbrk()` interface only called
when absolutely necessary. We will need to design an intelligent
allocator in the future.
The semantics of these memory allocations will need to be checked. I am
somewhat iffy on the details. I've heard that HSA can allocate
asynchronously which seems to work with my tests at least. CUDA uses an
implicit synchronization scheme so we need to use an explicitly separate
stream from the one launching the kernel or the default stream. I will
need to test the NVPTX case.
I would appreciate if anyone more experienced with the implementation details
here could chime in for the HSA and CUDA cases.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D151735