Summary:
The AMDGPU target needs an external clock symbol so the driver can set
the frequency with the correct value. This was left over from the
previous implementation and I forgot to remove it when actually
implementing the timing utilities.
Summary:
This is a massive patch because it reworks the entire build and
everything that depends on it. This is not split up because various bots
would fail otherwise. I will attempt to describe the necessary changes
here.
This patch completely reworks how the GPU build is built and targeted.
Previously, we used a standard runtimes build and handled both NVPTX and
AMDGPU in a single build via multi-targeting. This added a lot of
divergence in the build system and prevented us from doing various
things like building for the CPU / GPU at the same time, or exporting
the startup libraries or running tests without a full rebuild.
The new appraoch is to handle the GPU builds as strict cross-compiling
runtimes. The first step required
https://github.com/llvm/llvm-project/pull/81557 to allow the `LIBC`
target to build for the GPU without touching the other targets. This
means that the GPU uses all the same handling as the other builds in
`libc`.
The new expected way to build the GPU libc is with
`LLVM_LIBC_RUNTIME_TARGETS=amdgcn-amd-amdhsa;nvptx64-nvidia-cuda`.
The second step was reworking how we generated the embedded GPU library
by moving it into the library install step. Where we previously had one
`libcgpu.a` we now have `libcgpu-amdgpu.a` and `libcgpu-nvptx.a`. This
patch includes the necessary clang / OpenMP changes to make that not
break the bots when this lands.
We unfortunately still require that the NVPTX target has an `internal`
target for tests. This is because the NVPTX target needs to do LTO for
the provided version (The offloading toolchain can handle it) but cannot
use it for the native toolchain which is used for making tests.
This approach is vastly superior in every way, allowing us to treat the
GPU as a standard cross-compiling target. We can now install the GPU
utilities to do things like use the offload tests and other fun things.
Some certain utilities need to be built with
`--target=${LLVM_HOST_TRIPLE}` as well. I think this is a fine
workaround as we
will always assume that the GPU `libc` is a cross-build with a
functioning host.
Depends on https://github.com/llvm/llvm-project/pull/81557
Summary:
Currently, doing `ninja install` will fail in fullbuild mode due to the
startup utilities not being built by default. This was hidden previously
by the fact that if tests were run, it would build the startup utilities
and thus they would be present.
This patch solves this issue by making the `libc-startup` target a
dependncy on the final library. Furthermore we simply factor out the
library install directory into the base CMake directory next to the
include directory handling. This change makes the `crt` files get
installed in `lib/x86_64-unknown-linu-gnu` instead of just `lib`.
This fixes an error I had where doing a runtimes failed to install its
libraries because the install step always errored.
```
AppProperties app;
```
is marked as a weak symbol in header now. One can just use `&app !=
nullptr` to check if `app` is defined. There is no need to define it for
overlay mode.
For some reasons, we are using `-fpie`
(libc/cmake/modules/LLVMLibCObjectRules.cmake:31) without supporting it.
According to @lntue, some of the hermetic tests are broken without
proper PIE support. This patch implements basic relocations support for
PIE.
* separate initialization routines into _start and do_start for all
architectures.
* lift do_start as a separate object library to avoid code duplication.
* (addtionally) address the problem of building hermetic libc with
-fstack-pointer-*
The `crt1.o` is now a merged result of three components:
```
___
|___ x86_64
| |_______ start.cpp.o <- _start (loads process initial stack and aligns stack pointer)
| |_______ tls.cpp.o <- init_tls, cleanup_tls, set_thread_pointer (TLS related routines)
|___ do_start.cpp.o <- do_start (sets up global variables and invokes the main function)
```
As part of startup refactoring, this patch adds a function to merge
multiple objects into a single relocatable object:
cc -r obj1.o obj2.o -o obj.o
A relocatable object is an object file that is not fully linked into an
executable or a shared library. It is an intermediate file format that
can be passed into the linker.
A crt object can have arch-specific code and arch-agnostic code. To
reduce code cohesion, the implementation is splitted into multiple
units. As a result, we need to merge them into a single relocatable
object.
__stack_chk_fail should be provided by libc.a, not startup files.
Add __stack_chk_fail to existing linux and arm entrypoints. On Windows
(when
not targeting MinGW), it seems that the corresponding function
identifier is
__security_check_cookie, so no entrypoint is added for Windows.
Baremetal
targets also ought to be compileable with `-fstack-protector*`
There is no common header for this prototype, since calls to
__stack_chk_fail
are meant to be inserted by the compiler upon function return when
compiled
`-fstack-protector*`.
This patch lifts aux vector related definitions to app.h. Because
startup's refactoring is in progress, this patch still contains
duplicated changes. This problem will be addressed very soon in an
incoming patch.
If a function is declared with stack-protector, the compiler may try to
load the TLS. However, inside certain runtime functions, TLS may not be
available. This patch disables stack protectors for such routines to fix
the problem.
Closes#74487.
Summary:
We call the global constructors by function pointer. For whatever reason
the NVPTX architecture relies very specifically on the arguments to the
function pointer invocation matching what the function is implemented
as. This is problematic as most of these constructors are generated
with no arguments. This patch removes the extended arguments that GNU
and LLVM use for the constructors optionally so that it can support the
common case.
This patch enables the compilation of libc for rv32 by unifying the
current rv64 and rv32 implementation into a single rv implementation.
We updated the cmake file to match the new riscv32 arch and force
LIBC_TARGET_ARCHITECTURE to be "riscv" whenever we find "riscv32" or
"riscv64". This is required as LIBC_TARGET_ARCHITECTURE is used in the
path for several platform specific implementations.
Reviewed By: michaelrj
Differential Revision: https://reviews.llvm.org/D148797
Summary:
This patch removes the `rpc_reset` function. This was previously used to
initialize the RPC client on the device by setting up the pointers to
communicate with the server. The purpose of this was to make it easier
to initialize the device for testing. However, this prevented us from
enforcing an invariant that the buffers are all read-only from the
client side.
The expected way to initialize the server is now to copy it from the
host runtime. This will allow us to maintain that the RPC client is in
the constant address space on the GPU, potentially through inference,
and improving caching behaviour.
This patch changes the default types of argc/argv so it's no longer a
uint64_t in all systems, instead, it's now a uintptr_t, which fixes
crashes in 32-bit systems that expect 32-bit types. This patch also adds
two uintptr_t types (EnvironType and AuxEntryType) for the same reason.
The patch also adds a PgrHdrTableType type behind an ifdef that's
Elf64_Phdr in 64-bit systems and Elf32_Phdr in 32-bit systems.
Summary:
There is currently effort to change over the default AMDGPU code object
version https://github.com/llvm/llvm-project/pull/65410. However, this
unfortunately causes problems in the LLVM LibC test suite that leads to
a hang while executing. This is most likely a bug to do with indirect
call optimization, as it can be avoided without optimizations or with
manually preventing inlining in the AMDGPU startup code.
This patch sets the AMDGPU code object version to be four explicitly on
the LibC test suite. This should unblock the efforts to move the default
to 5 without breaking the test suite. This isn't a great solution, but
there is currently some time pressure to get COV5 landed and this seems
to be the easiest solution.
This patch changes the instruction in set_thread_ptr from ld to mv,
as rv32 doesn't have the ld instruction, and mv is supported by both
rv32 and rv64.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D159110
This patch is large, but is almost entirely just adding casts to calls
to syscall_impl. Much of the work was done programatically, with human
checking when the syntax or types got confusing.
Reviewed By: mcgrathr
Differential Revision: https://reviews.llvm.org/D156950
Currently we keep an internal buffer of device memory that is used to
indicate ownership of a port. Since we only use this as a single bit we
can simply turn this into a bitfield. I did this manually rather than
having a separate type as we need very special handling of the masks
used to interact with the locks.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D155511
Summary:
This caused test failures on the gfx90a buildbot. This works on my
gfx1030 and the Nvidia buildbots, so we'll need to investigate what is
going wrong here. For now revert it to get the bots green.
This reverts commit 05abcc5792.
This patch mostly renames files so it better reflects the function they declare.
Reviewed By: michaelrj
Differential Revision: https://reviews.llvm.org/D155607
Currently we keep an internal buffer of device memory that is used to
indicate ownership of a port. Since we only use this as a single bit we
can simply turn this into a bitfield. I did this manually rather than
having a separate type as we need very special handling of the masks
used to interact with the locks.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D155511
This patch adds the necessary support to provide timing information in
`libc` tests. This is useful for determining which tests look what
amount of time. We also can use this as a test basis for providing more
fine-grained timing when implementing things on the GPU.
The main difficulty with this is the fact that the AMDGPU fixed
frequency clock operates at an unknown frequency. We need to read this
on a per-card basis from the driver and then copy it in. NVPTX on the
other hand has a fixed clock at a resolution of 1ns. I have also
increased the resolution of the print-outs as the majority of these are
below a millisecond for me.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D154446
This pass used to cause huge compile time regressions, That has been
address and can now be re-added.
Differential Revision: https://reviews.llvm.org/D153374
Currently the implementation of the RPC interface requires a flexible
struct. This caused problems when compilling the RPC server with GCC as
would be required if trying to export the RPC server interface. This
required that we either move to the `x[1]` workaround or make it a
template parameter. While just using `x[1]` would be much less noisy,
this is technically undefined behavior. For this reason I elected to use
templates.
The downside to using templates is that the server code must now be able
to handle multiple different types at runtime. I was unable to find a
good solution that didn't rely on type erasure so I simply branch off of
the given value.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D153304
Recently the AMDGPU backend automatically enables a pass to optimize
atomics. This results in the LTO build taking about 10x longer in all
cases. For now we disable this by default as was the case before the
patch in D152649.
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D153232
The AMDGPU backend has a built-in pass to lower constructors. We do this
manually in the `start.cpp` implementation so we can disable this to
keep the binaries smaller.
Differential Revision: https://reviews.llvm.org/D151213
Allows moving the pointer swap between server and client into reset.
Single allocation simplifies whatever allocates the client/server, currently
the libc loaders.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D150337
Replaces the globals currently used. Worth changing to a bitmap
before allowing runtime number of ports >> 64. One bit per port is likely
to be cheap enough that sizing for the worst case is always fine, otherwise
in the future we can change to dynamically allocating it.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D150309
Previously we used a single port to implement the RPC. This was
sufficient for single threaded tests but can potentially cause deadlocks
when using multiple threads. The reason for this is that GPUs make no
forward progress guarantees. Therefore one group of threads waiting on
another group of threads can spin forever because there is no guarantee
that the other threads will continue executing. The typical workaround
for this is to allocate enough memory that a sufficiently large number
of work groups can make progress. As long as this number is somewhat
close to the amount of total concurrency we can obtain reliable
execution around a shared resource.
This patch enables using multiple ports by widening the arrays to a
predetermined size and indexes into them. Empty ports are currently
obtained via a trivial linker scan. This should be imporoved in the
future for performance reasons. Portions of D148191 were applied to
achieve parallel support.
Depends on D149581
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D149598
The GPU has a different execution model to standard `_start`
implementations. On the GPU, all threads are active at the start of a
kernel. In order to correctly intitialize and call the constructors we
want single threaded semantics. Previously, this was done using a
makeshift global barrier with atomics. However, it should be easier to
simply put the portions of the code that must be single threaded in
separate kernels and then call those with only one thread. Generally,
mixing global state between kernel launches makes optimizations more
difficult, similarly to calling a function outside of the TU, but for
testing it is better to be correct.
Depends on D149527 D148943
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D149581
The execution model of the GPU expects that groups of threads will
execute in lock-step in SIMD fashion. It's both important for
performance and correctness that we treat this as the smallest possible
granularity for an RPC operation. Thus, we map multiple threads to a
single larger buffer and ship that across the wire.
This patch makes the necessary changes to support executing the RPC on
the GPU with multiple threads. This requires some workarounds to mimic
the model when handling the protocol from the CPU. I'm not completely
happy with some of the workarounds required, but I think it should work.
Uses some of the implementation details from D148191.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D148943
This patch adds the necessary hacks to support global constructors and
destructors. This is an incredibly hacky process caused by the primary
fact that Nvidia does not provide any binary tools and very little
linker support. We first had to emit references to these functions and
their priority in D149451. Then we dig them out of the module once it's
loaded to manually create the list that the linker should have made for
us. This patch also contains a few Nvidia specific hacks, but it passes
the test, albeit with a stack size warning from `ptxas` for the
callback. But this should be fine given the resource usage of a common
test.
This also adds a dependency on LLVM to the NVPTX loader, which hopefully doesn't
cause problems with our CUDA buildbot.
Depends on D149451
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D149527
This patch makes the necessary changes to support calling global
constructors and destructors on the GPU. The patch in D149340 allows the
`lld` linker to create the symbols pointing us to these globals. These
should be executed by a single thread, which is more difficult on the
GPU because all threads are active. I chose to use an atomic counter to
sync every thread on the GPU. This is very slow if you use more than a
few thousand threads, but for testing purposes it should be sufficient.
Depends on D149340 D149363
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D149398
This patch adds extra intrinsics for the GPU. Some of these are unused
for now but will be used later. We use these currently to update the
`RPC` handling. Currently, every thread can update the RPC client, which
isn't correct. This patch adds code neccesary to allow a single thread
to perfrom the write while the others wait.
Feedback is welcome for the naming of these functions. I'm copying the
OpenMP nomenclature where we call an AMD `wavefront` or NVIDIA `warp` a
`lane`.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D148810
This patch reworks the RPC interface to allow more generic memory
operations using the shared better. This patch decomposes the entire RPC
interface into opening a port and calling `send` or `recv` on it.
The `send` function sends a single packet of the length of the buffer.
The `recv` function is paired with the `send` call to then use the data.
So, any aribtrary combination of sending packets is possible. The only
restriction is that the client initiates the exchange with a `send`
while the server consumes it with a `recv`.
The operation of this is driven by two independent state machines that
tracks the buffer ownership during loads / stores. We keep track of two
so that we can transition between a send state and a recv state without
an extra wait. State transitions are observed via bit toggling, e.g.
This interface supports an efficient `send -> ack -> send -> ack -> send`
interface and allows for the last send to be ignored without checking
the ack.
A following patch will add some more comprehensive testing to this interface. I
I informally made an RPC call that simply incremented an integer and it took
roughly 10 microsends to complete an RPC call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D148288
The NVIDIA compilation path requires some special options. This is
mostly because compilation is dependent on having a valid CUDA
toolchain. We don't actually need the CUDA toolchain to create the
exported `libcgpu.a` library because it's pure LLVM-IR. However, for
some language features we need the PTX version to be set. This is
normally set by checking the CUDA version, but without one installed it
will fail to build. We instead choose a minimum set of features on the
desired target, inferred from
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes
and the PTX refernece for functions like `nanosleep`.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D148532
This patch standardizes the error messages when a syscall is not
available to be in the format: "ABC and DEF syscalls are not available."
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D148373
The packaged version of the `libc` library does not depend on the CUDA
installation because it only uses `clang` and emits LLVM-IR. However,
for testing we directly need the CUDA toolkit to emit and execute the
files. This patch explicitly passes `--cuda-path` to the relevant
compilations for NVPTX testing.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D147653
This patch adds the necessary code to impelement the existing RPC client
/ server interface when targeting NVPTX GPUs. This follows closely to
the implementation in the AMDGPU version. This does not yet enable unit
testing as the `nvlink` linker does not support static libraries. So
that will need to be worked around.
I am ignoring the RPC duplication between the AMDGPU and NVPTX loaders. This
will be changed completely later so there's no point unifying the code at this
stage. The implementation was tested manually with the following file and
compilation flags.
```
namespace __llvm_libc {
void write_to_stderr(const char *msg);
void quick_exit(int);
} // namespace __llvm_libc
using namespace __llvm_libc;
int main(int argc, char **argv, char **envp) {
for (int i = 0; i < argc; ++i) {
write_to_stderr(argv[i]);
write_to_stderr("\n");
}
quick_exit(255);
}
```
```
$ clang++ crt1.o rpc_client.o quick_exit.o io.o main.cpp --target=nvptx64-nvidia-cuda -march=sm_70 -o image
$ ./nvptx_loader image 1 2 3
image
1
2
3
$ echo $?
255
```
Depends on D146681
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D146846