In OMPIRBuilder, we have many cases where we don't handle the debug
location correctly while changing the location or insertion point. This
is one of those cases.
Please see the following test program.
```
program main
implicit none
integer i, j
integer array(16384)
!$omp target teams distribute
DO i=1,16384
!$omp parallel do
DO j=1,16384
array(j) = i
ENDDO
!$omp end parallel do
ENDDO
!$omp end target teams distribute
print *, array
end program main
```
When tried to compile with the follownig command
`flang -g -O2 -fopenmp test.f90 -o test --offload-arch=gfx90a`
will fail in the verification with the following errors: `!dbg
attachment points at wrong subprogram for function`
This happens because we were dropping the debug location in the
createCanonicalLoop and the call to the functions like
`__kmpc_distribute_static_4u` get generated without a debug location.
When it gets inlined, the locations inside it are not adjusted as the
call instruction does not have the debug locations
(`llvm/lib/Transforms/Utils/InlineFunction.cpp:fixupLineNumbers`). Later
Verifier finds that the caller have instructions with debug locations
that point to another function and fails.
The fix is simple to not drop the debug location.
Add unrolling support for create_tdesc, load, store, prefetch, and update_offset.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
Co-authored-by: Chao Chen <chao.chen@intel.com>
When we offload to the target, the pointers to data used by the kernel
are passed in arrays created by `OMPIRBuilder`. These arrays of pointers
are allocated on the stack on the host. This is fine for the most part
because absent the `nowait` clause, the default behavior is that target
tasks are included tasks. That is, the host is blocked until the
offloaded target kernel is done. In turn, this means that the host's
stack frame is intact and accessing the array of pointers when
offloading is safe. However, when `nowait` is used on the `!$ omp
target` instance, then the target task is a deferred task meaning, the
generating task on the host does not have to wait for the target task
to finish. In such cases, it is very likely that the stack frame of the
function invoking the target call is wound up thereby leading to memory
access errors as shown below.
```
AMDGPU error: Error in hsa_amd_memory_pool_allocate: HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
AMDGPU error: Error in hsa_amd_memory_pool_allocate: HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid. "PluginInterface" error: Failure to allocate device memory: Failed to allocate from memory manager
fort.cod.out: /llvm/llvm-project/offload/plugins-nextgen/common/src/PluginInterface.cpp:1434: Error llvm::omp::target::plugin::PinnedAllocationMapTy::lockMappedHostBuffer(void *, size_t): Assertion `HstPtr && "Invalid pointer"' failed.
Aborted (core dumped)
```
This PR implements support in `OMPIRBuilder` to store these arrays of
pointers in the task structure that is passed to the target task thereby
ensuring it is available to the target task when the target task is
eventually scheduled.
---------
Co-authored-by: Sergio Afonso <safonsof@amd.com>
Moves the transformation logic from the AffineLinearizeOp and
AffineDelinearizeOp lowerings into separate transform functions that can
now be called separately. This provides a more controlled way to apply
the op lowerings.
---------
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Interpret an option value with multiple values, either in the form of an
`ArrayAttr` (either static or passed through a param) or as the multiple
attrs associated to a param, as a comma-separated list, i.e. as a
ListOption on a pass.
Add `gen-attr-constraint-decls` and `gen-attr-constraint-defs`, which
generate public C++ functions for attribute constraints. The name of the C++
function is specified in the `cppFunctionName` field.
This generalize `cppFunctionName` from `TypeConstraint` introduced in
https://github.com/llvm/llvm-project/pull/104577 to be usable also in `AttrConstraint`.
…shapes
Commit 6e5a142 changed the behavior of the function when computing
reassociations between tensors (consisting of unit/dynamic dimensions)
and scalars/0d vectors. The IR representation for such reshapes actually
expects an empty reassociation, like so:
```
func.func @example(%arg0 : tensor<?x?x?xf32>) -> tensor<f32> {
%0 = tensor.collapse_shape %arg0 [] : tensor<?x?x?xf32> into tensor<f32>
}
```
Restore the original behavior - the routine should resort to reporting
failures when compile time-known non-unit dimensions are part of the
attempted reassociation.
Signed-off-by: Artem Gindinson <gindinson@roofline.ai>
This PR introduces support for `scf::ForOp`, `scf::WhileOp`, `scf::If`,
and `scf::Condition` within the workgroup-subgroup-distribution pass,
leveraging the `SCFStructuralTypeConversionsAndLegality`.
Tosa scatter operator disallow duplicate indices (per batch)
This patch adds, to the validation pass, checking for duplicate values
in scatter operator's constant indices values.
Signed-off-by: Tai Ly <tai.ly@arm.com>
This PR fixes a bug in GatherToLDSOpLowering, we were getting the
MemRefType of source for the destination. Additionally, some related
typos are corrected.
CC: @krzysz00 @umangyadav @lialan
`FuncOpVectorUnroll` contains logic that replaces function arguments by
placeholders values. These replacements also involve changing all
instructions in the function that use the arguments to use these
placeholders. These placeholder values will later be changed back to use
the function arguments (either new or original if already legal).
The current implementation however only replaces back (the second
replacement, i.e. replacing the placeholder values to new/legal
arguments) the first block of instructions and not all of the blocks.
This may leave some instructions to use these placeholder values (which
for already legal arguments are just zeroattr values that will get
DCE'd) instead of the arguments, which is incorrect.
Closes#132158.
Adds bf16 support to SPIRV by using the `SPV_KHR_bfloat16` extension.
Only a few operations are supported, including loading from and storing
to memory, conversion to/from other types, cooperative matrix operations
(including coop matrix arithmetic ops) and dot product support.
This PR adds the type definition and implements the basic cast
operations. Arithmetic/coop matrix ops will be added in a separate PR.
When fusing two `linalg.genericOp`, where the producer has index
semantics, invalid `affine.apply` ops can be generated where the number
of indices do not match the number of loops in the fused genericOp.
This patch fixes the issue by directly using the number of loops from
the generated fused op.
Following GitHub organizations were merged into the ROCm org:
* ROCm-Developer-Tools
* RadeonOpenCompute
* ROCmSoftwarePlatform
Ensure that all hyperlinks to the old organizations now point to the new
organization at https://github.com/ROCm.
This patch removes the `test-linalg-to-vector-patterns` option from the
`-test-linalg-transform-patterns=` test flag. It was only used in one
test, where a more specialized transform dialect op can be used instead:
* `transform.apply_patterns.linalg.pad_vectorization`
While we could preserve `test-linalg-to-vector-patterns`, it's better to
rely on finer-grained transformations — this way, we know exactly what
is being run and tested. Now that its only use has been removed, it
feels natural to delete `test-linalg-to-vector-patterns`.
…WriteToBroadcast
The pattern would previously apply in spurious cases and generate
incorrect IR.
In the process, we disable the application of this pattern in the case
where there is no broadcast; this should be handled separately and may
more easily support masking.
The case {no-broadcast, yes-transpose} was previously caught by this
pattern and arguably could also generate incorrect IR (and was also
untested): this case does not apply anymore.
The last cast {yes-broadcast, yes-transpose} continues to apply but
should arguably be removed from the future because creating transposes
as part of canonicalization feels dangerous.
There are other patterns that move permutation logic:
- either into the transfer, or
- outside of the transfer
Ideally, this would be target-dependent and not a canonicalization (i.e.
does your DMA HW allow transpose on the fly or not) but this is beyond
the scope of this PR.
Co-authored-by: Nicolas Vasilache <nicolasvasilache@users.noreply.github.com>
Restores mistakenly removed AMX interface which ensures that the custom
tile type is converted to its LLVM equivalent within other operations
such as control flow.
Fix after #140559
This flag was used to let us incrementally introduce debug records
into LLVM, however everything is now using records. It serves no
purpose now, so delete it.
Prior to this patch, generating test checks in place put the ATTR
definitions at the very top of the file, above the RUN lines and
autogenerated note. All CHECK lines should below the RUN lines and
autogenerated note.
This change ensures that the attribute definitions are emitted with the
rest of the CHECK lines.
---------
Co-authored-by: Michael Maitland <michaelmaitland@meta.com>
Prior to this PR, the script removed the already existing autogenerated
note if we came across a line that was equal to the note. But the
default note is multiple lines, so there would never be a match.
Instead, check to see if the current line is a substring of the
autogenerated note.
Co-authored-by: Michael Maitland <michaelmaitland@meta.com>
Changes `findCollapsingReassociation` to return nullopt in all cases
where source shape has `>=2` dynamic dims. `expand(collapse)` can
reshape to in any valid output shape but a collapse can only collapse
contiguous dimensions. When there are `>=2` dynamic dimensions it is
impossible to determine if it can be simplified to a collapse or if it
is preforming a more advanced reassociation.
This problem was uncovered by
https://github.com/llvm/llvm-project/pull/137963
---------
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
This patch simplifies code by removing the values from
insert/try_emplace. Note that default values inserted by try_emplace
are immediately overrideen in all these cases.
The OpenACC specification for `acc loop` describe that a loop's
parallelism determination mode is either auto, independent, or seq. The
rules are as follows.
- As per OpenACC 3.3 standard section 2.9.6 independent clause: A loop
construct with no auto or seq clause is treated as if it has the
independent clause when it is an orphaned loop construct or its parent
compute construct is a parallel construct.
- As per OpenACC 3.3 standard section 2.9.7 auto clause: When the parent
compute construct is a kernels construct, a loop construct with no
independent or seq clause is treated as if it has the auto clause.
- Additionally, loops marked with gang, worker, or vector are not
guaranteed to be parallel. Specifically noted in 2.9.7 auto clause: If
not, or if it is unable to make a determination, it must treat the auto
clause as if it is a seq clause, and it must ignore any gang, worker, or
vector clauses on the loop construct.
The verifier for `acc.loop` was updated to enforce this marking because
the context in which a loop appears is not trivially determined once IR
transformations begin. For example, orphaned loops are implicitly
`independent`, but after inlining into an `acc.kernels` region they
would be implicitly considered `auto`. Thus now the verifier requires
that a frontend specifically generates acc dialect with this marking
since it knows the context.
PR #143720 adds a requirement to the ACC dialect that every acc.loop
must have a seq, independent, or auto attribute for the 'default'
device_type. The standard has rules for how this can be intuited:
orphan/parallel/parallel loop: independent
kernels/kernels loop: auto
serial/serial loop: seq, unless there is a gang/worker/vector, at which
point it should be 'auto'.
This patch implements all of this rule as a 'cleanup' step on the IR
generation for combined/loop operations. Note that the test impact is
much less since I inadvertently have my 'operation' terminating curley
matching the end curley from 'attribute' instead of the front of the
line, so I've added sufficient tests to ensure I captured the above.
Improve ApplyRegisteredPassOp's support for taking options by taking
them as a dict (vs a list of string-valued key-value pairs).
Values of options are provided as either static attributes or as params
(which pass in attributes at interpreter runtime). In either case, the
keys and value attributes are converted to strings and a single
options-string, in the format used on the commandline, is constructed to
pass to the `addToPipeline`-pass API.
This PR updates the flang lowering to explicitly implement the OpenACC
rules:
- As per OpenACC 3.3 standard section 2.9.6 independent clause: A loop
construct with no auto or seq clause is treated as if it has the
independent clause when it is an orphaned loop construct or its parent
compute construct is a parallel construct.
- As per OpenACC 3.3 standard section 2.9.7 auto clause: When the parent
compute construct is a kernels construct, a loop construct with no
independent or seq clause is treated as if it has the auto clause.
- Loops in serial regions are `seq` if they have no other parallelism
marking such as gang, worker, vector.
For now the `acc.loop` verifier has not yet been updated to enforce
this.
If not truncated the SPIRV serialization would not fail but instead
produce an invalid SPIR-V module.
---------
Signed-off-by: Davide Grohmann <davide.grohmann@arm.com>
This change is trigger by encountering the following error:
```
<unknown>:0: error: 'spirv.Load' op result #0 must be void
or bool or 8/16/32/64-bit integer or 16/32/64-bit float or
vector of bool or 8/16/32/64-bit integer or 16/32/64-bit
float values of length 2/3/4/8/16 or any SPIR-V pointer type
or any SPIR-V array type or any SPIR-V run time array type
or any SPIR-V struct type or any SPIR-V cooperative matrix
type or any SPIR-V matrix type or any SPIR-V sampled image
type, but got '!spirv.image<f32, Dim2D, NoDepth, NonArrayed,
SingleSampled, NoSampler, Rgba8>'<unknown>:0: note: see current
operation:
%126 = "spirv.Load"(%125) {relaxed_precision} : (!spirv.ptr<!spirv.image<f32, Dim2D, NoDepth, NonArrayed, SingleSampled, NoSampler, Rgba8>, UniformConstant>) -> !spirv.image<f32, Dim2D, NoDepth, NonArrayed, SingleSampled, NoSampler, Rgba8>
```