This test checks that MLIR code is lowered according to the schema presented
below:
```
func1() {
  call __kmpc_parallel_51(..., func2, ...)
}
func2() {
  call __kmpc_for_static_loop_4u(..., func3, ...)
}
func3() {
  // loop body
}
```
This test demonstrates how the PhysicalStorageBuffer extension can be
used end-to-end in a SPIR-V module.
This module has been verified to pass serialization, deserialization,
and validation with spirv-val.
Add an emitc.expression operation that models C expressions, and provide
transforms to form and fold expressions. The translator emits the body of
an emitc.expression op as a single C expression. By default, this
expression is emitted as the RHS of an EmitC SSA value; where possible,
however, expressions with a single use that is not another expression are
inlined instead. The inlining of specific expressions can be fine-tuned by
lowering passes and transforms.
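As a rough illustration (not taken from the patch; the variable names below are made up), this is the kind of C the translator can produce for a folded expression, both in its default assigned form and in its inlined form:
```c++
#include <cstdint>

// Illustrative only; names are hypothetical.
int32_t assigned(int32_t v1, int32_t v2, int32_t v3) {
  // Default: the folded body of an emitc.expression is emitted as the RHS of
  // a single assignment rather than one statement per operation.
  int32_t v4 = (v1 + v2) * v3;
  return v4;
}

int32_t inlined(int32_t v1, int32_t v2, int32_t v3) {
  // With a single use that is not another expression, the expression can be
  // inlined into its user directly.
  return (v1 + v2) * v3;
}
```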
Extend the `amendOperation` mechanism for translating dialect attributes
attached to operations from another dialect when translating MLIR to
LLVM IR. Previously, this mechanism would have no knowledge of the LLVM
IR instructions created for the given operation, making it impossible
for it to perform local modifications such as attaching operation-level
metadata. Collect instructions inserted by the LLVM IR builder and pass
them to `amendOperation`.
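As a hedged sketch (parameter order is approximate, and the dialect interface class and attribute name are hypothetical), an `amendOperation` override can now use the collected instructions to attach metadata:
```c++
#include "mlir/Target/LLVMIR/LLVMTranslationInterface.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"

using namespace mlir;

// Hypothetical dialect interface; the hook now receives the LLVM IR
// instructions that were created for `op`.
struct MyDialectLLVMIRTranslationInterface
    : public LLVMTranslationDialectInterface {
  using LLVMTranslationDialectInterface::LLVMTranslationDialectInterface;

  LogicalResult
  amendOperation(Operation *op, ArrayRef<llvm::Instruction *> instructions,
                 NamedAttribute attribute,
                 LLVM::ModuleTranslation &moduleTranslation) const override {
    // Example: attach nontemporal metadata to every instruction emitted for
    // an op carrying a (hypothetical) "mydialect.nontemporal" attribute.
    if (attribute.getName().getValue() == "mydialect.nontemporal") {
      for (llvm::Instruction *inst : instructions)
        inst->setMetadata(llvm::LLVMContext::MD_nontemporal,
                          llvm::MDNode::get(inst->getContext(), {}));
    }
    return success();
  }
};
```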
Check that a workshare loop without a loop body is lowered correctly, i.e.:
1) A null pointer is passed to the OpenMP device RTL function as the
parameter that denotes the aggregated parameters of the loop-body function.
2) The outlined loop-body function has only one parameter: the loop counter.
This is an experimental address space for strided buffers. These buffers
can have structs as elements and a stride > 1.
These pointers allow indexed access in units of the stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.
We assign address space 9 to 192-bit buffer pointers, which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
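A rough C++ sketch of the 192-bit layout described above (field names are hypothetical; the actual representation lives in the backend's type system, not in a C struct):
```c++
#include <cstdint>

// Hypothetical illustration of an address-space-9 strided buffer pointer:
// a fat buffer pointer plus a 32-bit index, 192 bits in total.
struct StridedBufferPointer {
  uint64_t descriptor[2]; // 128-bit buffer resource descriptor
  uint32_t offset;        // 32-bit byte offset into the buffer
  uint32_t index;         // 32-bit index; accesses address buffer[index * stride]
};
static_assert(sizeof(StridedBufferPointer) == 24, "192 bits");
```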
This commit implements the LLVM IR invariant intrinsics in the LLVM
dialect. These intrinsics can be used to mark program regions in which the
contents of a specific memory object will not change.
The LLVM dialect implementation also implements the
PromotableOpInterface to ensure Mem2Reg & SROA are able to promote
pointers that are marked using the invariant intrinsics.
This commit ensures that we model DI information for global constants
correctly. These constructs can lack scopes, names, and linkage names,
so these parameters were made optional for the DIGlobalVariable
attribute.
This test checks that the proper OpenMP device RTL function is called to
handle a workshare loop for GPU.
The code generation for GPU worksharing loops is implemented by the
patch: https://github.com/llvm/llvm-project/pull/73360
There's an issue in the translator today where, for a CallsiteLoc, if
the callee does not have a DI scope (perhaps due to compile options or
optimizations), it may be assigned the DI scope of its callsite's parent
function, creating a non-existent DILocation that combines the line and
column number from one file with the filename from another.
The root problem is that we cannot propagate the parent scope when
translating the callee location, as it no longer applies to inlined
locations (see the code diff; hopefully this will make sense).
To facilitate this, the importer is also changed so that callee scopes
are fused with the callee FileLineCol loc, instead of being attached to
the Callsite loc itself. This comes with the benefit that we now have a
symmetric Callsite loc representation. If we required the callee scope to
always be annotated on the Callsite loc, it would be hard for generic
inlining passes to maintain that, since they would have to somehow
understand the semantics of the fused metadata and pull it out while
inlining.
GPU dialect lowering to the SYCL runtime is driven by the spirv.target_env
attribute attached to gpu.module. As a result, spirv.target_env remains an
input to LLVM IR translation.
A SPIRVToLLVMIRTranslation without any actual translation is added to
avoid an unregistered error in mlir-cpu-runner.
SelectObjectAttr.cpp is updated to:
1) Pass the binary size argument to getModuleLoadFn.
2) Pass the parameter count to getKernelLaunchFn.
This change does not impact CUDA and ROCm usage since both
mlir_cuda_runtime and mlir_rocm_runtime have already been updated to
accept and ignore the extra arguments.
Currently the parser and printer of `CallOp` do not match when both
varargs and the attr-dict are present (round-tripping is broken). This
fixes the parser so that it conforms to the asm format written in the
comments.
Continuation of https://github.com/llvm/llvm-project/pull/74247 to fix
https://github.com/llvm/llvm-project/issues/56962. Fixes verifier for
(Integer Attr):
```mlir
llvm.mlir.constant(1 : index) : f32
```
and (Dense Attr):
```mlir
llvm.mlir.constant(dense<100.0> : vector<1xf64>) : f32
```
## Integer Attr
The addition that this PR makes to `LLVM::ConstantOp::verify` is meant
to be exactly verifying the code in
`mlir::LLVM::detail::getLLVMConstant`:
9f78edbd20/mlir/lib/Target/LLVMIR/ModuleTranslation.cpp (L350-L353)
One failure mode is when the `type` (`llvm.mlir.constant(<value>) :
<type>`) is not an `Integer`, because then the `cast` in
`getIntegerBitWidth` will crash:
dca432cb7b/llvm/include/llvm/IR/DerivedTypes.h (L97-L99)
So that's now caught in the verifier.
Apart from that, I don't see anything we could check for. `sextOrTrunc`
means "Sign extend or truncate to width" and that one is quite
permissive. For example, the following doesn't have to be caught in the
verifier as it doesn't crash during `mlir-translate -mlir-to-llvmir`:
```mlir
llvm.func @main() -> f32 {
%cst = llvm.mlir.constant(100 : i64) : f32
llvm.return %cst : f32
}
```
## Dense Attr
The translation crashes if the type is neither an MLIR vector type nor one of these:
9f78edbd20/mlir/lib/Target/LLVMIR/ModuleTranslation.cpp (L375-L391)
This patch adds a target_features (TargetFeaturesAttr) to the LLVM
dialect to allow setting and querying the features in use on a function.
The motivation for this comes from the Arm SME dialect where we would
like a convenient way to check what variants of an operation are
available based on the CPU features.
Intended usage:
The target_features attribute is populated manually or by a pass:
```mlir
func.func @example() attributes {
target_features = #llvm.target_features<["+sme", "+sve", "+sme-f64f64"]>
} {
// ...
}
```
Then within a later rewrite the attribute can be checked, and used to
make lowering decisions.
```c++
// Finds the "target_features" attribute on the parent
// FunctionOpInterface.
auto targetFeatures = LLVM::TargetFeaturesAttr::featuresAt(op);
// Check a feature.
// Returns false if targetFeatures is null or the feature is not in
// the list.
if (!targetFeatures.contains("+sme-f64f64"))
return failure();
```
For now, this is rather simple and just checks whether the exact feature
is in the list, though it could be extended with implied features using
information from LLVM.
Add:
* an Op for 'cp.async.mbarrier.arrive', targeting the
nvvm_cp_async_mbarrier_arrive* family of intrinsics.
* The 'noinc' intrinsic property is modelled as a default-valued-attr of
type I1.
* Test cases are added to verify the Op as well as the intrinsic
lowering.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This PR introduces DIGlobalVariableAttr and
DIGlobalVariableExpressionAttr so that ModuleTranslation can emit the
metadata needed for debug information about global variables.
The translator implementation for debug metadata needed to be refactored
in order to allow translation of nodes based on MDNode
(DIGlobalVariableExpressionAttr and DIExpression) in addition to
DINode-based nodes.
A DIGlobalVariableExpressionAttr can now be passed to the GlobalOp
operation directly and ModuleTranslation will create the respective
DIGlobalVariable and DIGlobalVariableExpression nodes. The compile unit
that DIGlobalVariable is expected to be configured with will be updated
with the created DIGlobalVariableExpression.
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:
* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so is required to be
a compile-time constant), `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)
As part of this patch we bid farewell to the following operations:
```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```
These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```
The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.
Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).
Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).
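For example, a minimal sketch of what an allocating op's implementation might look like (the `ZeroOp` class, `getVectorType` accessor, and the ZAS/ZAD enumerators are assumptions for illustration; only ZAB and ZAH are named above):
```c++
// Illustrative only: an op producing a complete tile reports the tile type
// matching its result element type; non-allocating ops keep the default
// std::nullopt.
std::optional<arm_sme::ArmSMETileType> ZeroOp::getAllocatedTileType() {
  switch (getVectorType().getElementTypeBitWidth()) {
  case 8:
    return arm_sme::ArmSMETileType::ZAB;
  case 16:
    return arm_sme::ArmSMETileType::ZAH;
  case 32:
    return arm_sme::ArmSMETileType::ZAS;
  case 64:
    return arm_sme::ArmSMETileType::ZAD;
  default:
    return std::nullopt;
  }
}
```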
Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```
Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.
Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
The NVIDIA Hopper architecture introduced the Cooperative Group Array
(CGA). It is a new level of parallelism, allowing clustering of
Cooperative Thread Arrays (CTAs) to synchronize and communicate through
shared memory while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so we've added the CGA
functionality as optional parameters. We want to leverage the mechanisms
we already have in the GPU dialect, such as outlining and kernel
launching, making this a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
In the early days of MLIR-to-LLVM IR translation, the translator had to
forcefully inject declarations of the `malloc` and `free` functions,
since ops of the then-standard (now `memref`) dialect were unconditionally
lowered to libc calls. This is no longer the case. Even when they do lower
to libc calls, the signatures of those functions are injected during
lowering, since calls must target declared functions in valid IR. Don't
inject those declarations anymore.
This extends `LLVM_IntrOpBase` so that it can be passed a list of
`immArgPositions` and a list (of the same length) of `immArgAttrNames`.
`immArgPositions` contains the positions of `immargs` on the LLVM IR
intrinsic, and `immArgAttrNames` maps those to a corresponding MLIR
attribute.
This allows modeling LLVM `immargs` as MLIR attributes, which is the
closest match semantically (and had already been done manually for the
LLVM dialect intrinsics).
This has two upsides:
* It's slightly easier to implement intrinsics with immargs now
(especially if they make use of other features, such as overloads)
* It clearly defines that `immargs` should map to attributes; previously
there was no mention of `immargs` in LLVMOpBase.td, so implementing them
was unclear
This works with other features of the `LLVM_IntrOpBase`, so `immargs`
can be marked as overloaded too (which is used in some intrinsics).
As part of this patch (and to test correctness) existing intrinsics have
been updated to use these new parameters.
This also uncovered a few issues with the
`llvm.intr.vector.insert/extract` intrinsics. First, the argument order
for insert did not match the LLVM intrinsic, and secondly, both were
missing an mlirBuilder (so they failed to import from LLVM IR). This is
corrected with this patch (and a test case is added).
Data layout queries may be issued for types whose size exceeds the range
of a 32-bit integer, as well as for types that don't have a size known at
compile time, such as scalable vectors. Use best practices from LLVM IR
and adopt `llvm::TypeSize` for size-related queries and `uint64_t` for
alignment-related queries.
See #72678.
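A brief sketch of why `llvm::TypeSize` fits here (illustrative usage, not code from the patch):
```c++
#include "llvm/Support/TypeSize.h"
#include <cstdint>

// llvm::TypeSize carries a 64-bit minimum size plus a scalability flag, so a
// query on a scalable vector does not silently collapse to a fixed number,
// and large types no longer overflow a 32-bit size.
void typeSizeExamples() {
  llvm::TypeSize fixed = llvm::TypeSize::getFixed(128);       // exactly 128 bits
  llvm::TypeSize scalable = llvm::TypeSize::getScalable(128); // 128 x vscale bits
  uint64_t knownMin = scalable.getKnownMinValue();            // 128
  bool isScalable = scalable.isScalable();                    // true
  (void)fixed; (void)knownMin; (void)isScalable;
}
```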
The svldr_vnum and svstr_vnum builtins always modify the base register
and tile slice and provide immediate offsets of zero, even when the
offset provided to the builtin is an immediate. This patch optimises the
output of the builtins when the offset is an immediate, to pass it
directly to the instruction and to not need the base register and tile
slice updates.
This renames the `emitc.call` op to `emitc.call_opaque`, as the existing
call op does not refer to the callee by symbol. The rename allows
introducing a new call op alongside a future `emitc.func` op to model and
facilitate functions and function calls.
Align with the rest of the spirv dialect by using a functional type
syntax.
Regex for updating existing code:
`spirv\.VectorShuffle (\[.+\]) (%[^:]+): ([^,]+), (%[^:]+): ([^\s]+) ->(.+)`
==>
`spirv.VectorShuffle $1 $2, $4 : $3, $5 ->$6`
Add initial support for DIExpression in the LLVM dialect.
Similar to LLVM IR, a DIExpression is encoded as a list of uint64 values.
The difference is that LLVM IR has helpers for understanding the
expression (e.g. for verification and pretty printing), whereas the
support added by this PR treats the expression elements as opaque.
Currently there's an edge case where constant indexing in target regions
can lead to incorrect results, as we do not correctly replace uses of
mapped variables in generated target functions with the target arguments
(and accessor instructions) that replace them. This patch seeks to fix
that by extending the current logic in the OMPIRBuilder.
Things like GEPs can come in the form of Constants/ConstantExprs.
Constants and ConstantExprs have no knowledge of what they are contained
in, so we must dig a little to find an instruction that tells us whether
they are used inside the function we are outlining. Only then can we be
sure they are replaceable and that we are not accidentally replacing a
usage elsewhere in the module that is still necessary.
This patch handles these cases by replacing the original constant
expression with an equivalent new instruction. Using an instruction allows
easy modification in the following loop: we now know the constant (as an
instruction) is owned by our target function, and replaceUsesOfWith can be
invoked on it (this does not seem possible with constants). Creating a
brand new instruction also lets us be cautious, since the old expression
may be used inside the function while also existing and being used
externally (unlikely given the nature of a Constant, but still a positive
side effect).
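A hedged sketch of the core idea (the helper name and surrounding variables are hypothetical; the actual logic lives in the OMPIRBuilder):
```c++
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instruction.h"

// Materialize a ConstantExpr as a real instruction owned by the outlined
// target function, so replaceUsesOfWith can rewrite the mapped value with the
// corresponding kernel argument without touching uses of the original
// constant elsewhere in the module.
static llvm::Instruction *
materializeForReplacement(llvm::ConstantExpr *constExpr,
                          llvm::Instruction *insertBefore,
                          llvm::Value *mappedValue, llvm::Value *kernelArg) {
  // Equivalent instruction, initially not linked into any basic block.
  llvm::Instruction *asInst = constExpr->getAsInstruction();
  asInst->insertBefore(insertBefore);
  // Now that it is an instruction inside the target function, the rewrite is
  // guaranteed to be local to that function.
  asInst->replaceUsesOfWith(mappedValue, kernelArg);
  // The caller would then update users of constExpr inside the function to
  // use asInst instead (omitted here).
  return asInst;
}
```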
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).
Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.
Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.
A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
This patch adds a `llvm.linker.options` operation taking a list of
strings to pass to the linker when the resulting object file is linked.
This is particularly useful on Windows to specify the CRT version to use
for this object file.
The function __kmpc_push_num_threads should be called only if we specify
the number of threads for a host parallel region.
The number of threads specified by the user should be passed as one of
the arguments of the __kmpc_parallel_51 function.
Before, we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits that number
into two parts: the size of the reduction data (= all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have fewer teams than the
maximal length. It also allows us to move code from Clang's codegen into
the runtime, as we now know how large the reduction data is.
Fix a corner case missed in #71296 when operands generated by literals
are mixed with the args attribute of a call op.
Additionally remove a range check that is already handled by the CallOp
verifier.
This patch adds the MLIR translation changes required to add the IsolatedFromAbove and OutlineableOpenMPOpInterface traits to omp.target. It links the newly added block arguments to their corresponding LLVM values.
Depends on #67164.
The KernelEnvironment is for compile-time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per-launch memory pools.
Fixes: https://github.com/llvm/llvm-project/issues/70249
This commit ensures that the debug info import skips `DICompositeType`s
that have an array type tag and whose base type failed to translate. This
is necessary because array `DICompositeType`s require a base type to be
valid.
Note that this is currently not verified in LLVM; instead, it leads to an
explosion in the `ASMPrinter`.
This patch seeks to add initial lowering of OpenMP array sections within
target region map clauses from MLIR to LLVM IR.
It initially supports fixed-size contiguous arrays (from my reading,
OpenMP does not appear to support anything other than contiguous
sections, but I could be wrong), before looking toward assumed-size and
assumed-shape arrays. The patch also does not currently handle strides;
that is left for future work.
That said, assumed-size arrays work in some fashion (as dummy arguments)
with some minor alterations to the OMPEarlyOutliner, so it is possible
that the changes made in the IsolatedFromAbove series will allow this to
work with no further patches.
It utilises the generated omp.bounds to calculate the size of the mapped
OpenMP array (both for sectioned and un-sectioned arrays) as well as the
offset to be passed to the kernel argument structure.
Alongside these changes some refactoring of how map data is handled is
attempted, using a new MapData structure to keep track of information
utilised in the lowering of mapped values.
A more complex createDeviceArgumentAccessor, which utilises capture kinds
similarly to (and loosely based on) Clang to generate different kernel
argument accesses, is also added.
A similar function, createAlteredByCaptureMap, for altering how the
kernel argument is passed to the kernel argument structure on the host is
also utilised; it allows modification of the pointer/basePointer based on
their capture kind (and bounds information).
Of note, ByRef is the default for explicit mappings and ByCopy will be
the default for implicit captures, so the former is tested in this patch
and the latter is not for the moment.