diff --git a/flang/parser-combinators.txt b/flang/parser-combinators.txt
new file mode 100644
index 000000000000..456d47e13508
--- /dev/null
+++ b/flang/parser-combinators.txt
@@ -0,0 +1,127 @@
+The Fortran language recognizer here is an LL recursive descent parser
+composed from a "parser combinator" library that defines a few fundamental
+parsers and a few ways to compose them into more powerful parsers.
+
+For our purposes here, a *parser* is any object that can attempt to recognize
+an instance of some syntax from an input stream. It may succeed or fail.
+On success, it may return some semantic value to its caller.
+
+In C++ terms, a parser is any instance of a class that
+  (1) has a constexpr default constructor,
+  (2) defines a resultType typedef, and
+  (3) provides a member or static function
+
+        std::optional<resultType> Parse(ParseState *) const;
+        static std::optional<resultType> Parse(ParseState *);
+
+      that accepts a pointer to a ParseState as its argument and returns
+      a std::optional<resultType> as a result, with the presence or absence
+      of a value in the std::optional<> signifying success or failure
+      respectively.
+
+The resultType of a parser is typically the class type of some particular
+node type in the parse tree.
+
+ParseState is a class that encapsulates a position in the source stream,
+collects messages, and holds a few state flags that can affect tokenization
+(e.g., are we in a character literal?). Instances of ParseState are
+independent and complete -- they are cheap to duplicate when necessary to
+implement backtracking.
+
+The constexpr default constructor of a parser is important. The functions
+(below) that operate on instances of parsers are themselves all constexpr.
+This use of compile-time expressions allows the entirety of a recursive
+descent parser for a language to be constructed at compilation time through
+the use of templates.
+
+These objects and functions are (or return) the fundamental parsers:
+
+  ok              always succeeds without advancing
+  pure(x)         always succeeds without advancing, returning some value x
+  fail<T>(msg)    always fails with the given message; optionally typed
+  cut             always fails, with no message
+  guard(pred)     succeeds if the predicate expression evaluates to true
+  rawNextChar     returns the next raw character; fails at EOF
+  cookedNextChar  returns the next character after preprocessing, skipping
+                  Fortran line continuations and comments; fails at EOF
+
+These functions and operators generate new parsers from combinations of
+other parsers:
+
+  !p              ok if p fails, cut if p succeeds
+  p >> q          match p, then q, returning q's value
+  p / q           match p, then q, returning p's value
+  p || q          match p if it succeeds, else match q; p and q must have
+                  the same resultType
+  lookAhead(p)    succeeds iff p does, but doesn't modify state
+  attempt(p)      succeeds iff p does, safely preserving state on failure
+  many(p)         a greedy sequence of zero or more nonempty successes of p;
+                  returns std::list<> of values
+  some(p)         a greedy sequence of one or more successes of p
+  skipMany(p)     same as many(p), but discards result (performance optimization)
+  maybe(p)        try to match p, returning std::optional<> of p's value
+  defaulted(p)    matches p, or else returns a default-constructed instance
+                  of p's resultType
+  nonemptySeparated(p, q)  repeatedly match p q p q p q ... p, returning
+                  the values of the p's
+  extension(p)    parses p if strict standard compliance is disabled,
+                  with a warning if nonstandard usage warnings are enabled
+  deprecated(p)   parses p if strict standard compliance is disabled,
+                  with a warning if deprecated usage warnings are enabled
+  inContext("...", p)  run p within an error message context
+
+Note that "a >> b >> c / d / e" matches a sequence of five parsers,
+
+The following "applicative" combinators modify or combine the values returned
+by parsers:
+
+  construct<T>{}(p1, p2, ...)
+        matches zero or more parsers in succession, collecting their
+        results and then passing them with move semantics to a
+        constructor for the type T if they all succeed
+  applyFunction(f, p1, p2, ...)
+        matches one or more parsers in succession, collecting their
+        results and passing them as rvalue reference arguments to
+        some function, returning its result
+  applyLambda([](auto &&x){}, p1, p2, ...)
+        is the same thing, but for lambdas and other function objects
+  applyMem(mf, p1, p2, ...)
+        is the same thing, but invokes a member function of the
+        result of the first parser
+
+These are non-advancing state inquiry and update parsers:
+
+  getColumn        returns 1-based column position
+  inCharLiteral    succeeds under withinCharLiteral
+  inFortran        succeeds unless in a preprocessing directive
+  inFixedForm      succeeds in fixed-form source
+  setInFixedForm   sets the fixed-form flag, returns prior value
+  columns          returns the 1-based column number after which source is clipped
+  setColumns(c)    sets "columns", returns prior value
+
+When parsing depends on the result values of earlier parses, the
+"monadic bind" combinator is available (but please try to avoid using it,
+as it makes automatic analysis of the grammar difficult):
+
+  p >>= f          match p, yielding some value x on success, then match the
+                   parser returned from the function call f(x)
+
+Last, we have these basic parsers on which the actual grammar of Fortran
+is built. All of the following parsers consume characters acquired from
+"cookedNextChar".
+
+  spaces           always succeeds after consuming any spaces or tabs
+  digit            matches one cooked decimal digit (0-9)
+  letter           matches one cooked letter (A-Z)
+  CharMatch<'c'>{} matches one specific cooked character
+  "..."_tok        match contents, skipping spaces before and after, and
+                   with multiple spaces accepted for any internal space
+  "..." >> p       the _tok suffix is optional on a string before >> and after /
+  parenthesized(p) shorthand for "(" >> p / ")"
+  bracketed(p)     shorthand for "[" >> p / "]"
+
+  withinCharLiteral(p)  apply p, tokenizing for CHARACTER/Hollerith literals
+  nonEmptyListOf(p)     matches a comma-separated list of one or more p's
+  optionalListOf(p)     ditto, but can be empty
+
+  "..."_debug      emit the string and succeed, for parser debugging
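
Illustration (not part of the patch): the parser contract and two of the
combinators described in the file above can be sketched in self-contained
C++17 as follows. This is a toy under stated assumptions, not the library's
implementation. The namespace sketch, the cursor-only ParseState, the class
names Sequence and Many, and the helper DigitsAfterParen are all invented for
this example; CharMatch, many and operator>> only mimic the behavior that the
text documents.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

namespace sketch {

// Hypothetical stand-in for ParseState: only a cursor into the source text.
struct ParseState {
  std::string_view text;
  std::size_t at;
};

// Matches one specific character; the state is untouched on failure.
template<char C> struct CharMatch {
  using resultType = char;
  constexpr CharMatch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] == C) {
      ++state->at;
      return C;
    }
    return std::nullopt;
  }
};

// Matches one decimal digit; the state is untouched on failure.
struct Digit {
  using resultType = char;
  constexpr Digit() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at < state->text.size()) {
      char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;
        return ch;
      }
    }
    return std::nullopt;
  }
};

// p >> q: match p, then q, returning q's value.
template<typename PA, typename PB> class Sequence {
public:
  using resultType = typename PB::resultType;
  constexpr Sequence() {}
  constexpr Sequence(const PA &pa, const PB &pb) : pa_{pa}, pb_{pb} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (!pa_.Parse(state)) {
      return std::nullopt;
    }
    return pb_.Parse(state);
  }
private:
  PA pa_{};
  PB pb_{};
};

template<typename PA, typename PB>
constexpr Sequence<PA, PB> operator>>(const PA &pa, const PB &pb) {
  return Sequence<PA, PB>{pa, pb};
}

// many(p): zero or more nonempty successes of p, collected in a std::list.
template<typename P> class Many {
public:
  using resultType = std::list<typename P::resultType>;
  constexpr Many() {}
  constexpr explicit Many(const P &p) : p_{p} {}
  std::optional<resultType> Parse(ParseState *state) const {
    resultType values;
    for (;;) {
      ParseState saved{*state};  // cheap copy, restored to backtrack
      auto value{p_.Parse(state)};
      if (!value || state->at == saved.at) {  // failed, or consumed nothing
        *state = saved;
        break;
      }
      values.push_back(std::move(*value));
    }
    return values;  // many(p) itself always succeeds
  }
private:
  P p_{};
};

template<typename P> constexpr Many<P> many(const P &p) { return Many<P>{p}; }

// Hypothetical helper: the digits that follow a leading '(' in the source.
inline std::optional<std::string> DigitsAfterParen(std::string_view source) {
  // The composite parser object is built entirely at compilation time.
  constexpr auto parser{CharMatch<'('>{} >> many(Digit{})};
  ParseState state{source, 0};
  auto digits{parser.Parse(&state)};
  if (!digits) {
    return std::nullopt;
  }
  return std::string(digits->begin(), digits->end());
}

}  // namespace sketch
```

In this sketch, sketch::DigitsAfterParen("(42)") holds "42", the input "(x"
yields an empty string because many() succeeds with zero matches, and the
input "42" yields no value because the leading CharMatch fails.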