PLACE
spec
solutions
Lex & Yacc

RPN Calculator in C with Lex & Yacc

Using a tool to automate parsing can simplify text processing in C considerably, and Yacc and Lex used together is the natural choice of a tool. The input to the RPN calculator and the RPN expressions in particular can be defined by a rather simple grammar for Yacc, and Lex is well suited for translating the input into a sequence of tokens for that grammar.

Together, the two specifications shown below are the equivalent of a program implementing an RPN calculator, as they completely and inambiguously define the actual generated program. The overall size of these specifications is noticeably smaller than the program implementing the calculator directly in C, despite the overhead of things not specific to our problem in both Yacc and Lex parts. Moreover, even in the case of a small program like our RPN calculator the gain in simplicity and clarity is obvious.

How the specifications are structured and what each part contains in general is shortly described here. Below, we concentrate on how the input is scanned and parsed according to the grammatical rules presented.

The Yacc specification is very close to an example in the Bison manual, also describing an RPN calculator. The main rule is responsible for parsing the input as a sequence of lines. To this end it recursively defines the category input as a possibly empty sequence of instances of the category line. A line is either empty (an end-of-line character), or an RPN expression, or something else. The last case is of course an erroneous input for our program and is represented by the pseudo-category error which ‘matches’ if no other rule does. Matching error by definition yields calling the user-defined function yyerror, after which we invoke the special action yyerrok to suppress a further propagation of the state of error.

The rpn category matches an RPN expression and so is recursively defined: it is either a number or an operator acting on two smaller expressions. The $-named variables in the action parts of the line and rpn rules are provided by Yacc as carriers of the values associated with the righ-hand side parts within a rule and with the rule as a whole, i.e. its left-hand side part. By use of them the corresponding subexpressions are evaluated and their results propagated up to where the entire expression is finally matched and its result printed (in line).

The Lex specification has a rule to match a number and generate a token of type NUM (which is actually the number 257) and value stored in yylval. The other two rules prescribe that nonblank characters not matched as parts of numbers (including the arithmetic operators), as well as the end-of-line character, generate self-represented tokens, while series of blank characters generate nothing and thus remain unseen by the parser. The variable yytext is a string: the matched part of the input stream for the currently matching rule.

Note that we use the name NUM also to refer to the regular expression that defines the grammar of a number in the initial section of the specification. The two uses of the name NUM in the number matching rule refer to different objects; indeed, these are two different names that happen to be equal.

The function yywrap must be provided by the user in the Lex specification. It is consulted when the scanner encounters end of input. Returning 1 from yywrap causes the scanner (yylex) to return a 0, which signifies end of input to the parser (yyparse) as well.

In the Yacc specification we provide a main function for the generated C program. It has nothing to do except calling yyparse, but its presence ensures completeness of the program in the sense that no other code needs to be compiled and linked besides the two files generated by Lex and Yacc.

Yacc specification:

%{
#include <stdio.h>
void yyerror(char*s) {printf("%s\n",s);}
extern int yylex();
extern int yywrap();
#define YYSTYPE double
%}
%token NUM
%%
input: /**/ | input line
;
line:   '\n' | rpn '\n' {printf("%f\n",$1);}
      | error '\n' {yyerrok;}
;
rpn:    NUM
      | rpn rpn '+' {$$=$1+$2;}
      | rpn rpn '-' {$$=$1-$2;}
      | rpn rpn '*' {$$=$1*$2;}
      | rpn rpn '/' {$$=$1/$2;}
;
%%
int main() {yyparse(); return 0;}

Lex specification:

%{
#include <stdlib.h>
#define NUM 257
double yylval;
%}
INT [0-9]+
NUM [+-]?(\.{INT})|({INT}(\.{INT}?)?)
%%
{NUM}  {yylval = atof(yytext); return NUM;}
[ \t]+ ;
.|\n   {return yytext[0];}
%%
int yywrap()  {return 1;}

After I wrote the above specification for Lex, I realised that the tokenizer it defined deviates from the specification of the calculator which requires that any two consecutive tokens be separated by one or more blank characters. Implemented as above, the calculator accepts expressions even if some pairs of tokens are not separated by blanks, provided that they can be unambiguously split anyway. E.g., it would tokenize the string  2+13-5/+  as  2 +13 -5 / +.

Note that I did not intend to implement this more general behaviour (a feature, one might say). Somehow, it came naturally as a consequence of the separation of scanning from parsing, and perhaps of the way Lex works. This is a good testimony of the value of separation.

However, to achieve equivalence with the implementations in the other languages I present, I show another Lex specification. The parser need not be changed. Once more, notice the convenience of dealing with scanning separately from parsing proper.

This time, instead of explicitly defining the grammar of a number, we rely on the C library function strtod to tell if a string represents a number. If a string of nonblank characters is a number, we pass a NUM token to the parser. If it is not a number, but it consists of only one character, we pass that character as a self-represented token. This includes the arithmetic operators and the end-of-line character. For all other nonblank strings we pass '.' – an arbitrary token that is not valid in an RPN expression. Blanks are skipped as in the previous version of the specification.

The variable yyleng is provided by Lex and is the length of yytext.

%{
#include <stdlib.h>
#define NUM 257
double yylval;
%}
%%
[^ \t\n]+|\n {char *p;  yylval = strtod(yytext,&p);
              return  *p==0 ? NUM : yyleng==1 ? yytext[0] : '.';}
.            ;
%%
int yywrap() {return 1;}

boykobbatgmaildotcom