Advanced Use of Flex
In this section we will develop a scanner for arithmetic, which will
later be used together with a Bison-generated parser to provide an
alternative implementation of M4's eval builtin (see Bison, ylparse.y).
Our project is composed of:
yleval.h
    a header common to all the files,

ylscan.l
    the scanner for arithmetic,

ylparse.y
    the parser for arithmetic (FIXME: ref.),

yleval.c
    the driver for the whole module (FIXME: ref.).

Because locations are extremely important in error messages, we will aim for absolute precision: we will track not only the line and column where a token starts, but also where it ends. Maintaining them by hand is tedious and error-prone, so we will insert actions at appropriate places for Flex to maintain them for us. We will rely on Bison's notion of location:

    typedef struct yyltype
    {
      int first_line, first_column, last_line, last_column;
    } yyltype;

which we will handle thanks to the following macros:

Macro: LOCATION_RESET (location)
    Initialize the location: the first and last cursors are set to the
    first line, first column.

Macro: LOCATION_LINES (location, num)
    Advance the end cursor by num lines, and of course reset its
    column. A macro LOCATION_COLUMN is less needed, since it would
    simply consist of increasing the last_column member.

Macro: LOCATION_STEP (location)
    Move the start cursor to the end cursor. This is used when we read a
    new token. For instance, denoting the start cursor S and the end
    cursor E, we move from

        1000 + 1000
        ^   ^
        S   E

    to

        1000 + 1000
            ^
            S=E

Macro: LOCATION_PRINT (file, location)
    Output a human-readable representation of the location to the
    stream file. This hairy macro aims at providing simple locations
    by factoring common parts: if the start and end cursors are on two
    different lines, it produces 1.1-2.3; otherwise, if the location
    is wider than a single character, it produces 1.1-3; and finally,
    if the location designates a single character, it results in 1.1.

Their code is part of yleval.h:

    /* Initialize LOC. */
    # define LOCATION_RESET(Loc)                  \
      (Loc).first_column = (Loc).first_line = 1;  \
      (Loc).last_column = (Loc).last_line = 1;

    /* Advance of NUM lines. */
    # define LOCATION_LINES(Loc, Num)             \
      (Loc).last_column = 1;                      \
      (Loc).last_line += Num;

    /* Restart: move the first cursor to the last position. */
    # define LOCATION_STEP(Loc)                   \
      (Loc).first_column = (Loc).last_column;     \
      (Loc).first_line = (Loc).last_line;

    /* Output LOC on the stream OUT. */
    # define LOCATION_PRINT(Out, Loc)                               \
      if ((Loc).first_line != (Loc).last_line)                      \
        fprintf (Out, "%d.%d-%d.%d",                                \
                 (Loc).first_line, (Loc).first_column,              \
                 (Loc).last_line, (Loc).last_column - 1);           \
      else if ((Loc).first_column < (Loc).last_column - 1)          \
        fprintf (Out, "%d.%d-%d", (Loc).first_line,                 \
                 (Loc).first_column, (Loc).last_column - 1);        \
      else                                                          \
        fprintf (Out, "%d.%d", (Loc).first_line, (Loc).first_column)

Example 6.14: yleval.h (i) -- Handling Locations

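To make the three output shapes concrete, here is a minimal sketch in C. It assumes the same yyltype layout as above. The function location_format is a hypothetical helper, not part of yleval.h: it follows the same three-way branching as LOCATION_PRINT but writes into a buffer, which makes the result easy to inspect. Keep in mind that the end cursor designates the column just past the last character, hence the "- 1".

```c
#include <stdio.h>

/* Same layout as the location type shown above. */
typedef struct yyltype
{
  int first_line, first_column, last_line, last_column;
} yyltype;

/* Hypothetical helper (not part of yleval.h): format LOC into BUF using
   the same branching as LOCATION_PRINT.  Returns what snprintf returns. */
int
location_format (char *buf, size_t size, yyltype loc)
{
  if (loc.first_line != loc.last_line)
    /* The cursors are on different lines: print both in full. */
    return snprintf (buf, size, "%d.%d-%d.%d",
                     loc.first_line, loc.first_column,
                     loc.last_line, loc.last_column - 1);
  else if (loc.first_column < loc.last_column - 1)
    /* Same line, several characters: factor the line number. */
    return snprintf (buf, size, "%d.%d-%d",
                     loc.first_line, loc.first_column,
                     loc.last_column - 1);
  else
    /* A single character. */
    return snprintf (buf, size, "%d.%d",
                     loc.first_line, loc.first_column);
}
```

For instance, a location that starts at line 1, column 1 and whose end cursor sits at line 1, column 4 is formatted as 1.1-3; with the end cursor at column 2 it is formatted as 1.1.
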
Because we want to remain in the yleval_ namespace, we will use
%option prefix, but this also renames the output file. Because we use
Automake, which expects flex to behave like Lex, we use %option outfile
to restore the Lex behavior.

    %option debug nodefault noyywrap nounput
    %option prefix="yleval_" outfile="lex.yy.c"

    %{
    #if HAVE_CONFIG_H
    #  include <config.h>
    #endif
    #include <m4module.h>
    #include "yleval.h"
    #include "ylparse.h"

Example 6.15: ylscan.l -- Scanning Arithmetic

Our strategy to track locations is simple (see Flex Actions). Each
time yylex is invoked, we move the first cursor to the last position
thanks to the user-yylex-prologue. Each time a rule is matched, we
advance the end cursor by yyleng characters, except for the rule
matching a new line, whose action resets the column. This is performed
thanks to YY_USER_ACTION. Each time we read insignificant characters,
such as white space, we also move the first cursor to the latest
position. This is done in the regular actions:

    /* Each time we match a string, move the end cursor to its end. */
    #define YY_USER_ACTION  yylloc->last_column += yyleng;
    %}
    %%
    %{
      /* At each yylex invocation, mark the current position as the
         start of the next token.  */
      LOCATION_STEP (*yylloc);
    %}
      /* Skip the blanks, i.e., let the first cursor pass over them. */
    [\t ]+     LOCATION_STEP (*yylloc);
    \n+        LOCATION_LINES (*yylloc, yyleng); LOCATION_STEP (*yylloc);

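The bookkeeping above can be simulated in plain C, without Flex, to check how the cursors move. The following is only a sketch under simplifying assumptions: locate_tokens and location_step are hypothetical names, and a "token" is taken to be any run of characters other than blank, tab and newline (the real scanner splits tokens with its own rules). Each branch applies the same updates as the corresponding action: the YY_USER_ACTION increment, then LOCATION_STEP or LOCATION_LINES.

```c
#include <string.h>

typedef struct yyltype
{
  int first_line, first_column, last_line, last_column;
} yyltype;

/* Same effect as LOCATION_STEP: the start cursor joins the end cursor. */
static void
location_step (yyltype *loc)
{
  loc->first_column = loc->last_column;
  loc->first_line = loc->last_line;
}

/* Hypothetical simulation: store the location of each token of TEXT
   in OUT (at most MAX entries) and return the number of tokens.  */
int
locate_tokens (const char *text, yyltype *out, int max)
{
  yyltype loc = { 1, 1, 1, 1 };          /* as after LOCATION_RESET */
  int count = 0;

  while (*text != '\0' && count < max)
    {
      size_t len;
      if (*text == ' ' || *text == '\t')
        {
          len = strspn (text, " \t");
          loc.last_column += (int) len;  /* YY_USER_ACTION */
          location_step (&loc);          /* action of the blank rule */
        }
      else if (*text == '\n')
        {
          len = strspn (text, "\n");
          loc.last_column += (int) len;  /* YY_USER_ACTION, overridden below */
          loc.last_column = 1;           /* LOCATION_LINES */
          loc.last_line += (int) len;
          location_step (&loc);
        }
      else
        {
          len = strcspn (text, " \t\n");
          loc.last_column += (int) len;  /* YY_USER_ACTION */
          out[count++] = loc;            /* the token would be returned here */
          location_step (&loc);          /* prologue of the next yylex call */
        }
      text += len;
    }
  return count;
}
```

On the input "1000 + 1000" followed by a newline and "42", the second token (the plus sign) starts at column 6 and its end cursor is at column 7, while the last token starts at line 2, column 1.
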
The case of the keywords is straightforward and boring:

    "+"        return PLUS;
    "-"        return MINUS;
    "*"        return TIMES;
    ...

Integers are more interesting: we use strtol to convert a string
of digits into an integer. The result is stored into the member
number of the variable yylval, provided by Bison via ylparse.h. We
support four syntaxes: 10 is decimal (equal to... 10), 0b10 is binary
(2), 010 is octal (8), and 0x10 is hexadecimal (16). Notice the risk of
reading 010 as a decimal number with the naive pattern [0-9]+; you can
either improve the regular expression, or rely on the order of the
rules [1]. We chose the latter.

      /* Binary numbers. */
    0b[01]+          yylval->number = strtol (yytext + 2, NULL, 2); return NUMBER;
      /* Octal numbers. */
    0[0-7]+          yylval->number = strtol (yytext + 1, NULL, 8); return NUMBER;
      /* Decimal numbers. */
    [0-9]+           yylval->number = strtol (yytext, NULL, 10); return NUMBER;
      /* Hexadecimal numbers. */
    0x[[:xdigit:]]+  yylval->number = strtol (yytext + 2, NULL, 16); return NUMBER;

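The interplay between the rules can be checked outside Flex with a short C sketch. It assumes that its argument is a complete token already matched by one of the four patterns; scan_number is a hypothetical name introduced for illustration. Flex prefers the longest match and, among matches of equal length, the rule listed first, so the octal rule wins only when it covers the whole token.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert TEXT, a complete numeric token, the way
   the scanner rules would.  */
long
scan_number (const char *text)
{
  size_t len = strlen (text);

  /* 0b[01]+ : skip the two-character prefix, then read base 2.  */
  if (len > 2 && text[0] == '0' && text[1] == 'b')
    return strtol (text + 2, NULL, 2);

  /* 0x[[:xdigit:]]+ : skip the prefix, then read base 16.  */
  if (len > 2 && text[0] == '0' && text[1] == 'x')
    return strtol (text + 2, NULL, 16);

  /* 0[0-7]+ : applies only if every digit after the leading 0 is
     octal; otherwise [0-9]+ matches a longer string and wins.  */
  if (len > 1 && text[0] == '0'
      && strspn (text + 1, "01234567") == len - 1)
    return strtol (text + 1, NULL, 8);

  /* [0-9]+ : plain decimal.  */
  return strtol (text, NULL, 10);
}
```

With this sketch, "010" yields 8 while "019" yields 19: the octal pattern matches only "01" in the latter, so the longer decimal match takes over.
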
Finally, we include a catch-all rule for invalid characters: report an error but do not return any token. In other words, invalid characters are neutralized by the scanner:

      /* Catch all the alien characters. */
    .          {
                 yleval_error (yycontrol, yylloc,
                               "invalid character: %c", *yytext);
                 LOCATION_STEP (*yylloc);
               }
    %%

where yleval_error is a variadic function (as is fprintf) and
yycontrol is a variable, both of which will be defined later.
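As a reminder of how such a variadic function is typically written in C, here is a minimal sketch. It is not the yleval_error of this project, whose definition comes later: format_error is a hypothetical stand-in that takes no control argument, prefixes the message with the start of the location, and writes into a buffer instead of reporting the error.

```c
#include <stdarg.h>
#include <stdio.h>

typedef struct yyltype
{
  int first_line, first_column, last_line, last_column;
} yyltype;

/* Hypothetical stand-in for an error reporter: write "LINE.COLUMN: "
   followed by the printf-like message into BUF.  Returns the number of
   characters the full message needs, or a negative value on error.  */
int
format_error (char *buf, size_t size, const yyltype *loc,
              const char *format, ...)
{
  va_list args;
  int prefix, body;

  prefix = snprintf (buf, size, "%d.%d: ",
                     loc->first_line, loc->first_column);
  if (prefix < 0 || (size_t) prefix >= size)
    return prefix;                /* error, or no room for the message */

  /* Forward the variable arguments to the v* flavor of printf.  */
  va_start (args, format);
  body = vsnprintf (buf + prefix, size - (size_t) prefix, format, args);
  va_end (args);

  return body < 0 ? body : prefix + body;
}
```

Called with a location starting at line 1, column 5, the format "invalid character: %c" and the character '$', it fills the buffer with "1.5: invalid character: $".
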
This scanner is complete; it merely lacks its partner: the parser. But that is another chapter...
[1] Note that the two solutions proposed are not equivalent! Spot the difference between

    [1-9][0-9]*  return NUMBER;
    0[0-7]+      return NUMBER;

and

    0[0-7]+      return NUMBER;
    [0-9]+       return NUMBER;