This requires the user to set the upper bounds of the heap by defining
the symbol `__libc_heap_limit`. The heap begins at `_end` and ends
`__libc_heap_limit` bytes afterwards. This prevents a completely unused
heap from requiring any space, and it prevents the heap from being
zeroed at initialization time as part of BSS. It also allows users to
customize the available heap location without recompiling libc.
I'd think this should eventually be replaced with an implemenation based
on a morecore() library. This would allow the same implementation to use
sbrk() on POSIX, `_end` and `__libc_heap_limit` on embedded, and a
buffer in tests. It would also provide better "wilderness" behavior that
tends to decrease heap fragementation (see Wilson et al.)
See #98096
Summary:
We can enable the sscanf function on the GPU now. This required adding
the configs to the scanf list so that the GPU build didn't do float
conversions.
Summary:
Previously, the GPU built the `libc` in a fat binary version that was
used to pass this to the link job in offloading languages like CUDA or
OpenMP. This was mostly required because NVIDIA couldn't consume the
standard static library version. Recent patches have now created the
`clang-nvlink-wrapper` which lets us do that. Now, the C library is just
included implicitly by the toolchain (or passed with -Xoffload-linker
-lc).
This code can be fully removed, which will heavily simplify the build
(and removed some bugs and garbage files I've encoutnered).
Division-less Newton iterations algorithm for cube roots.
1. **Range reduction**
For `x = (-1)^s * 2^e * (1.m)`, we get 2 reduced arguments `x_r` and `a`
as:
```
x_r = 1.m
a = (-1)^s * 2^(e % 3) * (1.m)
```
Then `cbrt(x) = x^(1/3)` can be computed as:
```
x^(1/3) = 2^(e / 3) * a^(1/3).
```
In order to avoid division, we compute `a^(-2/3)` using Newton method
and then
multiply the results by a:
```
a^(1/3) = a * a^(-2/3).
```
2. **First approximation to a^(-2/3)**
First, we use a degree-7 minimax polynomial generated by Sollya to
approximate `x_r^(-2/3)` for `1 <= x_r < 2`.
```
p = P(x_r) ~ x_r^(-2/3),
```
with relative errors bounded by:
```
| p / x_r^(-2/3) - 1 | < 1.16 * 2^-21.
```
Then we multiply with `2^(e % 3)` from a small lookup table to get:
```
x_0 = 2^(-2*(e % 3)/3) * p
~ 2^(-2*(e % 3)/3) * x_r^(-2/3)
= a^(-2/3)
```
with relative errors:
```
| x_0 / a^(-2/3) - 1 | < 1.16 * 2^-21.
```
This step is done in double precision.
3. **First Newton iteration**
We follow the method described in:
Sibidanov, A. and Zimmermann, P., "Correctly rounded cubic root
evaluation
in double precision", https://core-math.gitlabpages.inria.fr/cbrt64.pdf
to derive multiplicative Newton iterations as below:
Let `x_n` be the nth approximation to `a^(-2/3)`. Define the n^th error
as:
```
h_n = x_n^3 * a^2 - 1
```
Then:
```
a^(-2/3) = x_n / (1 + h_n)^(1/3)
= x_n * (1 - (1/3) * h_n + (2/9) * h_n^2 - (14/81) * h_n^3 + ...)
```
using the Taylor series expansion of `(1 + h_n)^(-1/3)`.
Apply to `x_0` above:
```
h_0 = x_0^3 * a^2 - 1
= a^2 * (x_0 - a^(-2/3)) * (x_0^2 + x_0 * a^(-2/3) + a^(-4/3)),
```
it's bounded by:
```
|h_0| < 4 * 3 * 1.16 * 2^-21 * 4 < 2^-17.
```
So in the first iteration step, we use:
```
x_1 = x_0 * (1 - (1/3) * h_n + (2/9) * h_n^2 - (14/81) * h_n^3)
```
Its relative error is bounded by:
```
| x_1 / a^(-2/3) - 1 | < 35/242 * |h_0|^4 < 2^-70.
```
Then we perform Ziv's rounding test and check if the answer is exact.
This step is done in double-double precision.
4. **Second Newton iteration**
If the Ziv's rounding test from the previous step fails, we define the
error
term:
```
h_1 = x_1^3 * a^2 - 1,
```
And perform another iteration:
```
x_2 = x_1 * (1 - h_1 / 3)
```
with the relative errors exceed the precision of double-double.
We then check the Ziv's accuracy test with relative errors < 2^-102 to
compensate for rounding errors.
5. **Final iteration**
If the Ziv's accuracy test from the previous step fails, we perform
another
iteration in 128-bit precision and check for exact outputs.
Summary:
This patch implements `clock_gettime` using the monotonic clock. This
allows users to get time elapsed at nanosecond resolution. This is
primarily to facilitate compiling the `chrono` library from `libc++`.
For this reason we provide both `CLOCK_MONOTONIC`, which we can
implement
with the GPU's global fixed-frequency clock, and `CLOCK_REALTIME` which
we cannot. The latter is provided just to make people who use this
header happy and it will always return failure.
Rather than selecting the errno implementation based on the platform
which doesn't provide the necessary flexibility, make it configurable.
The errno value location is returned by `int *__llvm_libc_errno()` which
is a common design used by other C libraries.
Summary:
This function is used by the CUDA / HIP / OpenMP headers and exists as
an NVIDIA extension basically. This function is implemented in the C23
standard as `pown`, but for now we need to provide `powi` for backwards
compatibility. In the future this entrypoint will just be a redirect to
`pown` once that is implemented.
Summary:
This patch adds the options for configuring floating point and index
mode for scanf just like `printf`. Not enabling it on the GPU yet, need
to fix something else first.
Fixes https://github.com/llvm/llvm-project/issues/92874
Algorithm: Let `x = (-1)^s * 2^e * (1 + m)`.
- Step 1: Range reduction: reduce the exponent with:
```
y = cbrt(x) = (-1)^s * 2^(floor(e/3)) * 2^((e % 3)/3) * (1 + m)^(1/3)
```
- Step 2: Use the first 4 bit fractional bits of `m` to look up for a
degree-7 polynomial approximation to:
```
(1 + m)^(1/3) ~ 1 + m * P(m).
```
- Step 3: Perform the multiplication:
```
2^((e % 3)/3) * (1 + m)^(1/3).
```
- Step 4: Check for exact cases to prevent rounding and clear
`FE_INEXACT` floating point exception.
- Step 5: Combine with the exponent and sign before converting down to
`float` and return.
Adds `dlopen` and friends. This is needed as part of the effort to
compile `libunwind` + `libc` without baremetal mode. This is part of
https://github.com/llvm/llvm-project/issues/97191. This should still be
spec compliant, since `dlopen` always returns `NULL` and `dlerror`
always returns an error message.
> If dlopen() fails for any reason, it returns NULL.
> The function dlclose() returns 0 on success, and nonzero on error.
> Since the value of the symbol could actually be NULL (so that a NULL
return from dlsym() need not indicate an error), the correct way to test
for an error is to call dlerror() to clear any old error conditions,
then call dlsym(), and then call dlerror() again, saving its return
value into a variable, and check whether this saved value is not NULL.
See:
- https://linux.die.net/man/3/dlopen
I also fixed a comment in sinpif.cpp in the first commit. Should this be
included in this PR?
All tests were passed, including the exhaustive test.
CC: @lntue
This defines to LIBC_NAMESPACE with
`__attribute__((visibility("hidden")))` so all the symbols under it have
hidden visibility. This new macro should be used when declaring a new
namespace that will have internal functions/globals and LIBC_NAMESPACE
should be used as a means of accessing functions/globals declared within
LIBC_NAMESPACE_DECL.
Using the same range reduction as `sin`, `cos`, and `sincos`:
1) Reducing `x = k*pi/128 + u`, with `|u| <= pi/256`, and `u` is in
double-double.
2) Approximate `tan(u)` using degree-9 Taylor polynomial.
3) Compute
```
tan(x) ~ (sin(k*pi/128) + tan(u) * cos(k*pi/128)) / (cos(k*pi/128) - tan(u) * sin(k*pi/128))
```
using the fast double-double division algorithm in [the CORE-MATH
project](https://gitlab.inria.fr/core-math/core-math/-/blob/master/src/binary64/tan/tan.c#L1855).
4) Perform relative-error Ziv's accuracy test
5) If the accuracy tests failed, we redo the computations using 128-bit
precision `DyadicFloat`.
Fixes https://github.com/llvm/llvm-project/issues/96930
Summary:
The `errno` variable is expected to be `thread_local` by the standard.
However, the GPU targets do not support `thread_local` and implementing
that would be a large endeavor. Because of that, we previously didn't
provide the `errno` symbol at all. However, to build some programs we at
least need to be able to link against `errno`. Many things that would
normally set `errno` completely ignore it currently (i.e. stdio) but
some programs still need to be able to link against correct C programs.
For this purpose this patch exports the `errno` symbol as a simple
global. Internally, this will be updated atomically so it's at least not
racy. Externally, this will be on the user. I've updated the
documentation to state as such. This is required to get `libc++` to
build.
Summary:
Straightforward RPC implementation of the `remove` function for the GPU.
Copies over the string and calls `remove` on it, passing the result
back. This is required for building some `libc++` functionality.