This is correct for all values, i.e. the same as promoting the division to fp32 in the NVPTX backend. But it is faster (~10% in average, sometimes more) because:
- it performs less Newton iterations
- it avoids the slow path for e.g. denormals
- it allows reuse of the reciprocal for multiple divisions by the same divisor
Test program:
```
#include <stdio.h>
#include "cuda_fp16.h"
// This is a variant of CUDA's own __hdiv which is fast than hdiv_promote below
// and doesn't suffer from the perf cliff of div.rn.fp32 with 'special' values.
__device__ half hdiv_newton(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float rcp;
asm("{rcp.approx.ftz.f32 %0, %1;\n}" : "=f"(rcp) : "f"(fb));
float result = fa * rcp;
auto exponent = reinterpret_cast<const unsigned&>(result) & 0x7f800000;
if (exponent != 0 && exponent != 0x7f800000) {
float err = __fmaf_rn(-fb, result, fa);
result = __fmaf_rn(rcp, err, result);
}
return __float2half(result);
}
// Surprisingly, this is faster than CUDA's own __hdiv.
__device__ half hdiv_promote(half a, half b) {
return __float2half(__half2float(a) / __half2float(b));
}
// This is an approximation that is accurate up to 1 ulp.
__device__ half hdiv_approx(half a, half b) {
float fa = __half2float(a);
float fb = __half2float(b);
float result;
asm("{div.approx.ftz.f32 %0, %1, %2;\n}" : "=f"(result) : "f"(fa), "f"(fb));
return __float2half(result);
}
__global__ void CheckCorrectness() {
int i = threadIdx.x + blockIdx.x * blockDim.x;
half x = reinterpret_cast<const half&>(i);
for (int j = 0; j < 65536; ++j) {
half y = reinterpret_cast<const half&>(j);
half d1 = hdiv_newton(x, y);
half d2 = hdiv_promote(x, y);
auto s1 = reinterpret_cast<const short&>(d1);
auto s2 = reinterpret_cast<const short&>(d2);
if (s1 != s2) {
printf("%f (%u) / %f (%u), got %f (%hu), expected: %f (%hu)\n",
__half2float(x), i, __half2float(y), j, __half2float(d1), s1,
__half2float(d2), s2);
//__trap();
}
}
}
__device__ half dst;
__global__ void ProfileBuiltin(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = x / x;
}
dst = x;
}
__global__ void ProfilePromote(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_promote(x, x);
}
dst = x;
}
__global__ void ProfileNewton(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_newton(x, x);
}
dst = x;
}
__global__ void ProfileApprox(half x) {
#pragma unroll 1
for (int i = 0; i < 10000000; ++i) {
x = hdiv_approx(x, x);
}
dst = x;
}
int main() {
CheckCorrectness<<<256, 256>>>();
half one = __float2half(1.0f);
ProfileBuiltin<<<1, 1>>>(one); // 1.001s
ProfilePromote<<<1, 1>>>(one); // 0.560s
ProfileNewton<<<1, 1>>>(one); // 0.508s
ProfileApprox<<<1, 1>>>(one); // 0.304s
auto status = cudaDeviceSynchronize();
printf("%s\n", cudaGetErrorString(status));
}
```
Reviewed By: herhut
Differential Revision: https://reviews.llvm.org/D126158
This patch adds the knobs to use peeling in the codegen strategy
infrastructure.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126842
`scf.foreach_thread` results alias with the underlying `scf.foreach_thread.parallel_insert_slice` destination operands
and they bufferize to equivalent buffers in the absence of other conflicts.
`scf.foreach_thread.parallel_insert_slice` conflict detection is similar to `tensor.insert_slice` conflict detection.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126769
Add an option to predicate the epilogue within the kernel instead of
peeling the epilogue. This is a useful option to prevent generating
large amount of code for deep pipeline. This currently require a user
lamdba to implement operation predication.
Differential Revision: https://reviews.llvm.org/D126753
This commit enables providing long-form documentation more seamlessly to the LSP
by revamping decl documentation. For ODS imported constructs, we now also import
descriptions and attach them to decls when possible. For PDLL constructs, the LSP will
now try to provide documentation by parsing the comments directly above the decls
location within the source file. This commit also adds a new parser flag
`enableDocumentation` that gates the import and attachment of ODS documentation,
which is unnecessary in the normal build process (i.e. it should only be used/consumed
by tools).
Differential Revision: https://reviews.llvm.org/D124881
This commit defines a dataflow analysis for integer ranges, which
uses a newly-added InferIntRangeInterface to compute the lower and
upper bounds on the results of an operation from the bounds on the
arguments. The range inference is a flow-insensitive dataflow analysis
that can be used to simplify code, such as by statically identifying
bounds checks that cannot fail in order to eliminate them.
The InferIntRangeInterface has one method, inferResultRanges(), which
takes a vector of inferred ranges for each argument to an op
implementing the interface and a callback allowing the implementation
to define the ranges for each result. These ranges are stored as
ConstantIntRanges, which hold the lower and upper bounds for a
value. Bounds are tracked separately for the signed and unsigned
interpretations of a value, which ensures that the impact of
arithmetic overflows is correctly tracked during the analysis.
The commit also adds a -test-int-range-inference pass to test the
analysis until it is integrated into SCCP or otherwise exposed.
Finally, this commit fixes some bugs relating to the handling of
region iteration arguments and terminators in the data flow analysis
framework.
Depends on D124020
Depends on D124021
Reviewed By: rriddle, Mogball
Differential Revision: https://reviews.llvm.org/D124023
The example was still using the -now- removed sparse_tensor.init_tensor.
Also, I made the input operands of the matrix multiplication sparse too
(since it looks a bit strange to multiply two dense matrices into a sparse).
Reviewed By: bixia
Differential Revision: https://reviews.llvm.org/D126897
These ops complement the tiling/padding transformations by transforming
higher-level named structured operations such as depthwise convolutions into
lower-level and/or generic equivalents that are better handled by some
downstream transformations.
Differential Revision: https://reviews.llvm.org/D126698
Now that we have an AllocTensorOp (previously InitTensorOp) in the bufferization dialect, the InitOp in the sparse dialect is no longer needed.
Differential Revision: https://reviews.llvm.org/D126180
The trick of using an empty token in the `FOREVERY_O` x-macro relies on preprocessor behavior which is only standard since C99 6.10.3/4 and C++11 N3290 16.3/4 (whereas it was undefined behavior up through C++03 16.3/10). Since the `ExecutionEngine/SparseTensorUtils.cpp` file is required to be compile-able under C++98 compatibility mode (unlike the C++11 used elsewhere in MLIR), we shouldn't rely on that behavior.
Also, using a non-empty suffix helps improve uniformity of the API, since all other primary/overhead suffixes are also non-empty. I'm using the suffix `0` since that's the value used by the `SparseTensorEncoding` attribute for indicating the index overhead-type.
Depends On D126720
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D126724
Ctlz is an intrinsic in LLVM but does not have equivalent operations in SPIR-V.
Including a decomposition gives an alternative path for these platforms.
Reviewed By: NatashaKnk
Differential Revision: https://reviews.llvm.org/D126261
Fix the warning: comparison of unsigned expression in ‘>= 0’ is always
true.
Reviewed By: kiranchandramohan, shraiysh
Differential Revision: https://reviews.llvm.org/D126784
This reverts commit 9b7193f852.
This is an older branch that was committed by mistake and does not include addressed review comments, an updated version will come next.
The primary goal of this change is to define readSparseTensorShape. Whereas the SparseTensorFile class is merely introduced as a way to reduce code duplication along the way.
Depends On D126106
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D126233
Remove boilerplate examples and add a text at the dialect level to describe
what kind of operands the operations accept (i.e., scalar, tensor or vector).
Left a shorter sentence describing the input operands for each operation as
this redundancy is convenient when browsing the documentation using the
website.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D126648
This patch supports to convert the llvm intrinsic to the corresponding op. It still leaves some intrinsics to be handled specially.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D126639
The current translation uses the old "ugly"/"raw" form which used PDLValue for the arguments
and results. This commit updates the C++ generation to use the recently added sugar that
allows for directly using the desired types for the arguments and result of PDL functions.
In addition, this commit also properly imports the C++ class for ODS operations, constraints,
and interfaces. This allows for a much more convienent C++ API than previously granted
with the raw/low-level types.
Differential Revision: https://reviews.llvm.org/D124817
This allows for the results of operations to be inferred in certain contexts,
and matches the support in PDL for result type inference. The main two
initial circumstances are when used as a replacement of another operation,
or when the operation being created implements InferTypeOpInterface.
Differential Revision: https://reviews.llvm.org/D124782
Vectorization is a key transformation to achieve high performance on most
architectures. In the transform dialect, vectorization is implemented as a
parameterizable transform op. It currently applies to a scope of payload IR
delimited by some isolated-from-above op, mainly because several enabling
transformations (such as affine simplification) are needed to perform
vectorization and these transformation would apply to ops other than the "main"
computational payload op. A separate "navigation" transform op that obtains the
isolated-from-above ancestor of an op is introduced in the core transform
dialect. Even though it is currently only useful for vectorization,
isolated-from-above ops are a common anchor for transformations (usually
implemented as passes) that is likely to be reused in the future.
Depends On D126374
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D126542
Add ops to the structured transform extension of the transform dialect that
perform interchange, padding and scalarization on structured ops. Along with
tiling that is already defined, this provides a minimal set of transformations
necessary to build vectorizable code for a single structured op.
Define two helper traits: one that implements TransformOpInterface by applying
a function to each payload op independently and another that provides a simple
"functional-style" producer/consumer list of memory effects for the transform
ops.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D126374
This patch adds support for applying a relation on domain/range of a relation.
Reviewed By: arjunp, ftynse
Differential Revision: https://reviews.llvm.org/D126339
Now that analysis and bufferization are better separated, post-analysis steps are no longer needed. Users can directly interleave analysis and bufferization as needed.
Differential Revision: https://reviews.llvm.org/D126571
This supports the operation conversion for threadprivate directive. The
support for memref type conversion is not implemented.
Reviewed By: kiranchandramohan, shraiysh
Differential Revision: https://reviews.llvm.org/D124610
This patch fixes the following compiler error:
error: declaration of ‘mlir::LLVM::cconv::CConv mlir::LLVM::detail::CConvAttrStorage::CConv’ changes meaning of ‘CConv’ [-fpermissive]
CConv as a member variable name was shadowing CConv as an enumeration,
hence the compiler error.
Reviewed By: ftynse, alexbatashev
Differential Revision: https://reviews.llvm.org/D126530
This patch adds support for Calling Convention attribute in LLVM
dialect, including enums, custom syntax and import from LLVM IR.
Additionally fix import of dso_local attribute.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D126161
This change makes the public API of SparseTensorUtils.cpp explicit, whereas before the publicity of these functions was only implicit. Implicit publicity is sufficient for mlir-opt to generate calls to these functions, but it's not enough to enable C/C++ code to call them directly in the usual way (i.e., without going through codegen). Thus, leaving the publicity implicit prevents development of other tools (e.g., microbenchmarks).
In addition this change also marks the functions MLIR_CRUNNERUTILS_EXPORT, which is required by the JIT under certain configurations (albeit not for anything in our test suite).
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D126105
The Transform dialect uses the side effect modeling mechanism to record the
effects of the transform ops on the mapping between Transform IR values and
Payload IR ops. Introduce a checker pass that warns if a Transform IR value is
used after it has been freed (consumed). This pass is mostly intended as a
debugging aid in addition to the verification/assertion mechanisms in the
transform interpreter. It reports all potential use-after-free situations.
The implementation makes a series of simplifying assumptions to be simple and
conservative. A more advanced implementation would rely on the data flow-like
analysis associated with a side-effect resource rather than a value, which is
currently not supported by the analysis infrastructure.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D126381