Advanced Use of Flex

This is G o o g l e's cache of http://www.lrde.epita.fr/people/akim/compil/gnuprog2/Advanced-Use-of-Flex.html.
G o o g l e's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time. Click here for the current page without highlighting.
To link to or bookmark this page, use the following url:

http://www.google.com/search?q=cache:I3NWfSHXDJUC:www.lrde.epita.fr/people/akim/compil/gnuprog2/Advanced-Use-of-Flex.html+&hl=en&ie=UTF-8

Google is not affiliated with the authors of this page nor responsible for its content.

Noeud:Advanced Use of Flex, Noeud «Next»:Using Flex with the GNU Build System, Noeud «Previous»:Start Conditions, Noeud «Up»:Scanning with Flex

Advanced Use of Flex

In this section we will develop a scanner for arithmetics, which will later be used together with a Bison generated parser to implement an alternative implementation of M4's eval builtin, see Bison, ylparse.y (FIXME: Ref Bison, ylparse.y.). Our project is composed of:

yleval.h a header common to all the files,
ylscan.l the scanner for arithmetics
ylparse.y the parser for arithmetics (FIXME: ref.).
yleval.c the driver for the whole module (FIXME: ref.).

Because locations are extremely important in error messages, we will look for absolute preciseness: we will not only track the line and column where a token starts, but also where it ends. Maintaining them by hand is tedious and error prone, so we will insert actions at appropriate places for Flex to maintain them for us. We will rely on Bison's notion of location:

typedef struct yyltype
{
  int first_line, first_column, last_line, last_column;
} yyltype;

which we will handle thanks to the following macros:

LOCATION_RESET (location) Macro

Initialize the location: first and last cursor are set to the first line, first column.

LOCATION_LINE (location, num) Macro

Advance the end cursor of num lines, and of course reset its column. A macro LOCATION_COLUMN is less needed, since it would consist simply in increasing the last_column member.

LOCATION_STEP (location) Macro

Move the start cursor to the end cursor. This is used when we read a new token. For instance, denoting the start cursor S and the end cursor E, we move from
1000 + 1000 ^ ^ S E

to
1000 + 1000 ^ S=E

LOCATION_PRINT (file, location) Macro

Output a human readable representation of the location to the stream file. This hairy macro aims at providing simple locations by factoring common parts: if the start and end cursors are on two different lines, it produces 1.1-2.3; otherwise if the location is wider than a single character it produces 1.1-3, and finally, if the location designates a single character, it results in 1.1.

Their code is part of yleval.h:

/* Initialize LOC. */
# define LOCATION_RESET(Loc)                  \
  (Loc).first_column = (Loc).first_line = 1;  \
  (Loc).last_column =  (Loc).last_line = 1;

/* Advance of NUM lines. */
# define LOCATION_LINES(Loc, Num)             \
  (Loc).last_column = 1;                      \
  (Loc).last_line += Num;

/* Restart: move the first cursor to the last position. */
# define LOCATION_STEP(Loc)                   \
  (Loc).first_column = (Loc).last_column;     \
  (Loc).first_line = (Loc).last_line;

/* Output LOC on the stream OUT. */
# define LOCATION_PRINT(Out, Loc)                               \
  if ((Loc).first_line != (Loc).last_line)                      \
    fprintf (Out, "%d.%d-%d.%d",                                \
             (Loc).first_line, (Loc).first_column,              \
             (Loc).last_line, (Loc).last_column - 1);           \
  else if ((Loc).first_column < (Loc).last_column - 1)          \
    fprintf (Out, "%d.%d-%d", (Loc).first_line,                 \
             (Loc).first_column, (Loc).last_column - 1);        \
  else                                                          \
    fprintf (Out, "%d.%d", (Loc).first_line, (Loc).first_column)

Example 6.14: yleval.h (i) -- Handling Locations

Because we want to remain in the yleval_ name space, we will use %option prefix, but this will also rename the output file. Because we use Automake which expects flex to behave like Lex, we use %option outfile to restore the Lex behavior.

%option debug nodefault noyywrap nounput
%option prefix="yleval_" outfile="lex.yy.c"

%{
#if HAVE_CONFIG_H
#  include <config.h>
#endif
#include <m4module.h>
#include "yleval.h"
#include "ylparse.h"

Example 6.15: ylscan.l -- Scanning Arithmetics

Our strategy to track locations is simple, see Flex Actions. Each time yylex is invoked, we move the first cursor to the last position thanks to the user-yylex-prologue. Each time a rule is matched, we advance the ending cursor of yyleng characters, except for the rule matching a new line. This is performed thanks to YY_USER_ACTION. Each time we read insignificant characters, such as white spaces, we also move the first cursor to the latest position. This is done in the regular actions:

/* Each time we match a string, move the end cursor to its end. */
#define YY_USER_ACTION  yylloc->last_column += yyleng;
%}
%%
%{
  /* At each yylex invocation, mark the current position as the
     start of the next token.  */
  LOCATION_STEP (*yylloc);
%}
  /*  Skip the blanks, i.e., let the first cursor pass over them.  */
[\t ]+     LOCATION_STEP (*yylloc);
\n+        LOCATION_LINES (*yylloc, yyleng); LOCATION_STEP (*yylloc);

The case of the keywords is straightforward and boring:

"+"        return PLUS;
"-"        return MINUS;
"*"        return TIMES;
...

Integers are more interesting: we use strtol to convert a string of digits into an integer. The result is stored into the member number of the variable yylval, provided by Bison via ylparse.h. We support four syntaxes: 10 is decimal (equal to... 10), 0b10 is binary (2), 010 is octal (8), and 0x10 is hexadecimal (16). Notice the risk of reading 010 as a decimal number with the naive pattern [0-9]+; you can either improve the regular expression, or rely on the order of the rules¹. We chose the latter.

  /* Binary numbers. */
0b[01]+   yylval->number = strtol (yytext + 2, NULL, 2); return NUMBER;
  /* Octal numbers. */
0[0-7]+   yylval->number = strtol (yytext + 1, NULL, 8); return NUMBER;
  /* Decimal numbers. */
[0-9]+    yylval->number = strtol (yytext, NULL, 10); return NUMBER;
  /* Hexadecimal numbers. */
0x[:xdigit:]+ yylval->number = strtol (yytext + 2, NULL, 16); return NUMBER;

Finally, we include a catch-all rule for invalid characters: report an error but do not return any token. In other words, invalid characters are neutralized by the scanner:

  /* Catch all the alien characters. */
.   {
      yleval_error (yycontrol, yylloc, "invalid character: %c", *yytext);
      LOCATION_STEP(*yylloc);
    }
%%

where yleval_error is a variadic function (as is fprintf) and yycontrol a variable that will be both defined later.

This scanner is complete, it merely lacks its partner: the parser. But this is yet another chapter...

Notes en bas de page

Note that the two solutions proposed are not equivalent! Spot the difference between

[1-9][0-9]*  return NUMBER;
0[0-7]+      return NUMBER;

and

[0-9]+    return NUMBER;
0[0-7]+   return NUMBER;

`yleval.h`	a header common to all the files,
`ylscan.l`	the scanner for arithmetics
`ylparse.y`	the parser for arithmetics (FIXME: ref.).
`yleval.c`	the driver for the whole module (FIXME: ref.).