<!--===- docs/Character.md

   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   See https://llvm.org/LICENSE.txt for license information.
   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# Implementation of `CHARACTER` types in f18

```eval_rst
.. contents::
   :local:
```

## Kinds and Character Sets

The f18 compiler and runtime support three kinds of the intrinsic
`CHARACTER` type of Fortran 2018.
The default (`CHARACTER(KIND=1)`) holds 8-bit character codes;
`CHARACTER(KIND=2)` holds 16-bit character codes;
and `CHARACTER(KIND=4)` holds 32-bit character codes.

We assume that code values 0 through 127 correspond to
the 7-bit ASCII character set (ISO-646) in every kind of `CHARACTER`.
This is a valid assumption for Unicode (UCS == ISO/IEC-10646),
ISO-8859, and many legacy character sets and interchange formats.

`CHARACTER` data in memory and unformatted files are not in an
interchange representation (like UTF-8, Shift-JIS, EUC-JP, or a JIS X).
Each character's code in memory occupies a 1-, 2-, or 4-byte
word, and substrings can be indexed with simple arithmetic.
In formatted I/O, however, `CHARACTER` data may be assumed to use
the UTF-8 variable-length encoding when it is selected with
`OPEN(ENCODING='UTF-8')`.

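To make the contrast with the fixed-width in-memory form concrete, here is a minimal C++ sketch of the standard UTF-8 encoding of one 32-bit code. The name `AppendUtf8` is hypothetical and is not an f18 runtime routine; the sketch also assumes its input is a valid code no greater than 0x10FFFF and does not reject surrogates.

```cpp
#include <string>

// Hypothetical helper, not an f18 runtime API: appends the UTF-8 encoding of
// one UCS-4 code (a CHARACTER(KIND=4) element) to `out`.
void AppendUtf8(std::string &out, char32_t code) {
  if (code < 0x80) { // 7-bit ASCII: one byte, unchanged
    out.push_back(static_cast<char>(code));
  } else if (code < 0x800) { // two bytes
    out.push_back(static_cast<char>(0xC0 | (code >> 6)));
    out.push_back(static_cast<char>(0x80 | (code & 0x3F)));
  } else if (code < 0x10000) { // three bytes
    out.push_back(static_cast<char>(0xE0 | (code >> 12)));
    out.push_back(static_cast<char>(0x80 | ((code >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (code & 0x3F)));
  } else { // four bytes; assumes code <= 0x10FFFF
    out.push_back(static_cast<char>(0xF0 | (code >> 18)));
    out.push_back(static_cast<char>(0x80 | ((code >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((code >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (code & 0x3F)));
  }
}
```

Because the encoded width varies from one to four bytes per character, positions in a UTF-8 stream cannot be found by multiplication, which is why the in-memory representation stays fixed-width.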
`CHARACTER(KIND=1)` literal constants in Fortran source files,
Hollerith constants, and formatted I/O with `ENCODING='DEFAULT'`
are not translated.

For the purposes of non-default-kind `CHARACTER` constants in Fortran
source files, formatted I/O with `ENCODING='UTF-8'` or non-default-kind
`CHARACTER` values, and conversions between kinds of `CHARACTER`,
by default:
* `CHARACTER(KIND=1)` is assumed to be ISO-8859-1 (Latin-1),
* `CHARACTER(KIND=2)` is assumed to be UCS-2 (16-bit Unicode), and
* `CHARACTER(KIND=4)` is assumed to be UCS-4 (full Unicode in a 32-bit word).

In particular, conversions between kinds are assumed to be
simple zero-extension or truncation, not table look-ups.

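A minimal C++ sketch of what zero-extension and truncation mean under these default assumptions. The function names are hypothetical, and `std::string` and `std::u32string` merely stand in for `KIND=1` and `KIND=4` values; none of this is the runtime's actual interface.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helpers illustrating the default assumptions; not f18 APIs.

// KIND=1 (Latin-1) to KIND=4 (UCS-4): zero-extend each 8-bit code.
std::u32string WidenKind1ToKind4(const std::string &from) {
  std::u32string to;
  to.reserve(from.size());
  for (char ch : from) {
    // Go through unsigned char so codes 128..255 are not sign-extended.
    to.push_back(static_cast<char32_t>(static_cast<unsigned char>(ch)));
  }
  return to;
}

// KIND=4 to KIND=1: keep only the low 8 bits of each code.
std::string NarrowKind4ToKind1(const std::u32string &from) {
  std::string to;
  to.reserve(from.size());
  for (char32_t ch : from) {
    to.push_back(static_cast<char>(static_cast<std::uint8_t>(ch & 0xFF)));
  }
  return to;
}
```

Widening is lossless because Latin-1 codes coincide with the first 256 Unicode code points. Narrowing is lossy: a code such as U+0141 truncates to 0x41, which reads back as `A`.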
We might want to support one or more environment variables to change these
assumptions, especially for `KIND=1` users of ISO-8859 character sets
besides Latin-1.

## Lengths

Allocatable `CHARACTER` objects in Fortran may defer the specification
of their lengths until the time of their allocation or whole (non-substring)
assignment.
Non-allocatable objects (and non-deferred-length allocatables) have
lengths that are fixed or assumed from an actual argument, or,
in the case of assumed-length `CHARACTER` functions, their local
declaration in the calling scope.

The elements of `CHARACTER` arrays have the same length.

Assignments to targets that are not deferred-length allocatables will
truncate or pad the assigned value to the length of the left-hand side
of the assignment.

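The truncate-or-pad rule can be sketched in C++ for a scalar `KIND=1` target. The helper `AssignFixedLength` is hypothetical and takes raw pointers and lengths for brevity; it is not the runtime's assignment routine, which works on descriptors.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not an f18 runtime API: stores `rhs` into a
// fixed-length KIND=1 target of `lhsLength` characters.
void AssignFixedLength(char *lhs, std::size_t lhsLength, const char *rhs,
    std::size_t rhsLength) {
  std::size_t copied{std::min(lhsLength, rhsLength)};
  // Copy what fits; a longer right-hand side is truncated here.
  // memmove is used so that overlapping source and target are tolerated.
  std::memmove(lhs, rhs, copied);
  // A shorter right-hand side leaves room that is filled with blanks.
  std::memset(lhs + copied, ' ', lhsLength - copied);
}
```

Assigning `"ab"` to a target of length 5 yields `"ab   "`, and assigning `"abcdef"` to a target of length 3 yields `"abc"`.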
Lengths and offsets that are used by or exposed to Fortran programs via
declarations, substring bounds, and the `LEN()` intrinsic function are always
represented in units of characters, not bytes.
In generated code, assumed-length arguments, the runtime support library,
and in the `elem_len` field of the interoperable descriptor `cdesc_t`,
lengths are always in units of bytes.
The distinction matters only for kinds other than the default.

Fortran substrings are rather like subscript triplets into a hidden
"zero" dimension of a scalar `CHARACTER` value, but they cannot have
strides.

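The relationship between character units and byte units, and the arithmetic behind a substring reference, can be sketched as follows. All three function names are hypothetical; the sketch assumes `kind` is 1, 2, or 4 and that substring bounds are 1-based and inclusive, as in Fortran.

```cpp
#include <cstddef>

// Hypothetical helpers, not f18 runtime APIs. `kind` is 1, 2, or 4 and equals
// the number of bytes occupied by one character code.

// Converts a Fortran-visible length (as returned by LEN()) to bytes.
constexpr std::size_t LengthInBytes(std::size_t characters, std::size_t kind) {
  return characters * kind;
}

// Byte offset of the first character of the substring (lower:upper).
// Assumes lower >= 1.
constexpr std::size_t SubstringByteOffset(std::size_t lower, std::size_t kind) {
  return (lower - 1) * kind;
}

// Byte length of the substring (lower:upper); zero when upper < lower.
constexpr std::size_t SubstringByteLength(
    std::size_t lower, std::size_t upper, std::size_t kind) {
  return upper < lower ? 0 : LengthInBytes(upper - lower + 1, kind);
}
```

For a `CHARACTER(KIND=4)` scalar, the substring `(2:5)` starts 4 bytes into the value and spans 16 bytes, while `LEN()` of that substring is still 4.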
## Concatenation

Fortran has one `CHARACTER`-valued intrinsic operator, `//`, which
concatenates its operands (10.1.5.3).
The operands must have the same kind type parameter.
One or both of the operands may be arrays; if both are arrays, their
shapes must be identical.
The effective length of the result is the sum of the lengths of the
operands.
Parentheses may be ignored, so any `CHARACTER`-valued expression
may be "flattened" into a single sequence of concatenations.

The result of `//` may be used
* as an operand to another concatenation,
* as an operand of a `CHARACTER` relation,
* as an actual argument,
* as the right-hand side of an assignment,
* as the `SOURCE=` or `MOLD=` of an `ALLOCATE` statement,
* as the selector or case-expr of an `ASSOCIATE` or `SELECT` construct,
* as a component of a structure or array constructor,
* as the value of a named constant or initializer,
* as the `NAME=` of a `BIND(C)` attribute,
* as the stop-code of a `STOP` statement,
* as the value of a specifier of an I/O statement,
* or as the value of a statement function.

The f18 compiler has a general (but slow) means of implementing concatenation
and a specialized (fast) option to optimize the most common case.

### General concatenation

In the most general case, the f18 compiler's generated code and
runtime support library represent the result as a deferred-length allocatable
`CHARACTER` temporary scalar or array variable that is initialized
as a zero-length array by `AllocatableInitCharacter()`
and then progressively augmented in place by the values of each of the
operands of the concatenation sequence in turn with calls to
`CharacterConcatenate()`.
Conformability errors are fatal -- Fortran has no means by which a program
may recover from them.
The result is then used as any other deferred-length allocatable
array or scalar would be, and finally deallocated like any other
allocatable.

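The shape of this scheme, restricted to scalar `KIND=1` operands, can be sketched in C++. Here `std::string` is only a stand-in for the deferred-length allocatable temporary, and `ConcatenateGeneral` is a hypothetical name; the actual runtime routines operate on descriptors and their signatures are not shown in this document.

```cpp
#include <string>
#include <vector>

// Analogy only: std::string plays the role of the deferred-length allocatable
// temporary. This is not the interface of AllocatableInitCharacter() or
// CharacterConcatenate().
std::string ConcatenateGeneral(const std::vector<std::string> &operands) {
  std::string temp; // starts out with zero length
  for (const std::string &operand : operands) {
    temp += operand; // grows the temporary in place, reallocating as needed
  }
  return temp; // used by the caller, then released like any other temporary
}
```

The repeated growth of the temporary is what makes this path general but slow: each step may reallocate and copy everything accumulated so far.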
The runtime routine `CharacterAssign()` takes care of
truncating, padding, or replicating the value(s) assigned to the left-hand
side, as well as reallocating a nonconforming or deferred-length allocatable
left-hand side. It takes the descriptors of the left- and right-hand sides of
a `CHARACTER` assignment as its arguments.

When the left-hand side of a `CHARACTER` assignment is a deferred-length
allocatable and the right-hand side is a temporary, use of the runtime's
`MoveAlloc()` subroutine instead can save an allocation and a copy.

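The saving is the same one that C++ move assignment provides, which gives a rough analogy. In this sketch `std::string` again stands in for a deferred-length allocatable, and `AssignFromTemporary` is a hypothetical name rather than the interface of `MoveAlloc()`.

```cpp
#include <string>
#include <utility>

// Analogy only, not the runtime's MoveAlloc(): the left-hand side takes over
// the temporary's contents instead of allocating new storage and copying.
void AssignFromTemporary(std::string &lhs, std::string &&temporary) {
  // For heap-allocated contents this typically transfers ownership of the
  // buffer; the temporary is left in a valid but unspecified state.
  lhs = std::move(temporary);
}
```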
### Optimized concatenation

Scalar `CHARACTER(KIND=1)` expressions evaluated as the right-hand sides of
assignments to independent substrings or whole variables that are not
deferred-length allocatables can be optimized into a sequence of
calls to the runtime support library that do not allocate temporary
memory.

The routine `CharacterAppend()` copies data from the right-hand side value
to the remaining space, if any, in the left-hand side object, and returns
the new offset of the reduced remaining space.
It is essentially `memcpy(lhs + offset, rhs, min(lhsLength - offset, rhsLength))`.
It does nothing when `offset > lhsLength`.

`CharacterPad()` adds any necessary trailing blank characters.
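
A minimal C++ sketch of how these two steps might fit together for a scalar `KIND=1` target. The `Sketch` suffix marks both functions as illustrations: their parameter lists are assumptions chosen to match the `memcpy` expression above and are not the runtime's actual declarations.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Illustrations only; the real runtime signatures are not given in this
// document and may differ.

// Copies as much of rhs as fits at lhs + offset and returns the new offset.
std::size_t CharacterAppendSketch(char *lhs, std::size_t lhsLength,
    std::size_t offset, const char *rhs, std::size_t rhsLength) {
  if (offset >= lhsLength) {
    return offset; // no space remains, so nothing is copied
  }
  std::size_t copied{std::min(lhsLength - offset, rhsLength)};
  std::memcpy(lhs + offset, rhs, copied);
  return offset + copied;
}

// Fills whatever space remains after offset with blanks.
void CharacterPadSketch(char *lhs, std::size_t lhsLength, std::size_t offset) {
  if (offset < lhsLength) {
    std::memset(lhs + offset, ' ', lhsLength - offset);
  }
}
```

Under these assumptions, an assignment such as `X = A // B` to a fixed-length `X` becomes one append call per operand, each starting at the offset returned by the previous one, followed by a single pad call. No temporary is allocated, and operands that do not fit are truncated by the `min` in the copy length.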