RPN in Lex & Yacc

Lex & Yacc

Text parsing is an important part of many programs, such as compilers, interpreters, and various kinds of text analysers and transformers. To parse a text means to formally analyse it in order to discover its structure.

It usually helps to preprocess the input stream of characters, splitting it into a sequence of tokens (lexemes). This is called lexical analysis or scanning or tokenizing, and a program that does this is called a scanner or sometimes a lexer. Thus, in the more strict sense, the process of parsing is analysing the token sequence in order to (re)organize it into a structure of some kind. Parsing is also called syntactical analysis, and the corresponding programs are known as parsers.

Sometimes no further structuring is needed after tokenizing; in other cases a parser extracts the tokens itself, one at a time, as it advances through the input stream.

Scanning and parsing can be successfully automated, as exemplified by the Lex and Yacc programs, which are scanner and parser generators, respectively. Lex and Yacc were created in the 1970s as part of the Unix programming tool set. Both have a number of incarnations for various environments, and are still among the most widely used tools in their domain. Flex and Bison are improved, modern versions of Lex and Yacc, distributed as standard with most GNU/Linux variants.

Detailed documentation on Lex and Yacc, and other information on parsing, can be found in several places, e.g. The Lex & Yacc Page. Here we give a brief overview of their structure and use. As the two programs have much in common, and are often used in collaboration with each other, we describe them together.

With respect to C, Lex and Yacc are symbiotic tools. Given input specifications, they generate C sources which are then compiled in the usual way. The generated sources may be combined with other C sources, or they themselves may constitute the whole program. In this sense, there is no use of Lex and Yacc alone, without a C compiler. (But there exist analogues of Lex and Yacc for other programming languages.)

A specification for Lex or Yacc has three sections, separated from each other by %% lines. The middle section is the central one: it contains a grammar, a set of rules that define patterns to be matched against pieces of input. For Lex the input is a plain stream of characters; for Yacc it is a sequence of tokens. The generated analyser systematically applies the rules from the specification, so that a match is found for every piece of input, until all the input is processed. Each rule has a C block attached, which is the code to be executed when the corresponding pattern matches.
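Schematically, both kinds of specification share the same layout; the comments below are illustrative placeholders, not required content:

```yacc
%{
/* initial section: C text between %{ and %}, copied verbatim */
%}
/* initial section: definitions (tokens, precedence, named patterns) */
%%
/* middle section: rules of the form   pattern  { C action }   */
%%
/* final section: C code copied verbatim, e.g. main() and helpers */
```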

All parts enclosed in %{ and %} in the initial section, as well as the entire final section, contain text to be copied verbatim into the generated C program. The initial section also contains definitions of objects used in the main section. For example, a grammar for Yacc may need tokens, precedence levels, and associativity to be defined in order to be complete and unambiguous. In a Lex specification, defining named patterns is often helpful.
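For instance, the initial section of a Yacc specification for a small arithmetic grammar might look as follows; the token name NUMBER and the particular precedence levels are illustrative assumptions, not fixed by Yacc:

```yacc
%{
#include <stdio.h>      /* copied verbatim into the generated parser */
%}
%token NUMBER           /* declare a token so the grammar rules can use it */
%left '+' '-'           /* lower precedence, left-associative */
%left '*' '/'           /* higher precedence, left-associative */
```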

The kinds of grammars that Lex and Yacc use in their specifications are different, and correspondingly the two kinds of analysers recognize structures at different levels of complexity.

Lex's patterns are regular expressions, and different pieces of text are matched by different regular expressions. The generated scanner is, in effect, a deterministic finite automaton.
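A few Lex rules may illustrate this; the token name NUMBER here is a hypothetical one that would have to be declared on the Yacc side:

```lex
%%
[0-9]+      { return NUMBER; }      /* an integer literal */
[ \t]+      ;                       /* skip whitespace: the action is empty */
\n          { return '\n'; }
.           { return yytext[0]; }   /* any other character represents itself */
```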

In Yacc, on the other hand, the set of rules defines a context-free grammar. Each rule describes a syntactical category as a named sequence of tokens and/or other categories; the rule matches if all the items in that sequence match, in the specified order. Any rule may directly or indirectly refer to the same category that it defines. One of the rules is distinguished as the main one, and recognition commences from its category. Thus a Yacc grammar describes the input hierarchically and recursively, from the top down (even though the parser that Yacc generates actually recognizes this structure bottom-up).

Two or more rules can represent variants of the same category, in which case they can be combined for brevity: the variants are then separated by | characters and share a single category name.
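As an illustrative sketch, here is how the variants of an RPN expression category might be combined (NUMBER again being an assumed token):

```yacc
expr : NUMBER              /* a lone number is an expression */
     | expr expr '+'       /* in RPN the operands precede the operator */
     | expr expr '*'       /* the rule refers recursively to expr itself */
     ;
```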

In writing a syntax analyser, one may choose to use Lex to automate scanning, but do the higher-level parsing manually. Or one may use Yacc to generate a parser, but do the scanning by hand. Usually, however, it is even better to use Lex and Yacc together. As the two generated analysers then have to communicate certain information in both directions, Lex and Yacc know and use each other's conventions.

For example, the parser generated by Yacc calls yylex each time it needs a token. Correspondingly, the scanner generated by Lex has the same name, yylex, and produces values in just the way the parser expects them. (The parser's own main function is named yyparse.)
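The final section of a Yacc specification typically supplies these entry points; yyerror is the error-reporting function that the generated parser also expects. A minimal sketch:

```c
/* final section of a Yacc specification, copied verbatim */
int main(void)
{
    return yyparse();            /* yyparse() calls yylex() for each token */
}

void yyerror(const char *msg)    /* called by yyparse() on a syntax error */
{
    fprintf(stderr, "%s\n", msg);
}
```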

Each token returned by yylex is characterized by a ‘type’ and a value. Token types are actually integer codes: the type of a self-represented token, i.e., a single character, is its ASCII code, while all other tokens must have type values above 256, assigned automatically by Yacc from the %token declarations in the initial section of the specification. As for the values of the tokens, the self-represented ones have none, and the rest have their values stored (one token at a time) in the global variable yylval, known to both Lex and Yacc. The type of yylval itself, which is the type of the token values, is user-defined, but the name YYSTYPE must be #defined to refer to it in the Yacc specification.
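This convention can be sketched with a hypothetical NUMBER token carrying a double value; the names and patterns below are illustrative:

```yacc
/* Yacc side, initial section */
%{
#define YYSTYPE double   /* the type of yylval and of all semantic values */
%}
%token NUMBER            /* Yacc assigns NUMBER an integer code above 256 */
```

```lex
/* Lex side, matching rules; atof comes from <stdlib.h>,
   which would be #included in the scanner's initial section */
[0-9]+(\.[0-9]+)?  { yylval = atof(yytext); return NUMBER; }
[-+*/\n]           { return yytext[0]; }   /* self-represented tokens */
```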

Links of Relevance:

The Lex & Yacc Page: manuals, implementations, references.