In this lecture we will look at scanners, or lexers as
they are sometimes called. We will look at how to build scanners by hand
and how they can be generated automatically using tools such as JLex. We will
also look at the JavaCC compiler-compiler. This tool can help you generate (at
least the front end of) recursive descent compilers.
The slides for this lecture can be found here.
Sebesta, sections 3.1 to 3.4 and sections 4.1 to 4.4
The JLex manual, by Elliot Berk. The manual can be
downloaded from
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html
The article in Java World: “Build your own language
with JavaCC”, by Oliver Enseling, which can be downloaded from http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools_p.html
As background reading, I recommend:
The JavaCC FAQ:
http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq.htm
You can download a free copy of JavaCC from the website
JavaCC Home
There is a repository of grammars for languages,
including Java and SQL, at the following URL:
http://www.cobase.cs.ucla.edu/pub/javacc/
The Java Tree Builder tool can be found on
The JLex system can be found at the following URL:
http://www.cs.princeton.edu/~appel/modern/java/JLex/
An alternative LL(1) compiler generator is the Compiler Generator
Coco/R. There are versions of Coco/R for Java, C#, C++, Oberon, Modula-2
and Pascal.
Exercises for lecture 4 will be done from 12.30 till
14.15 before Lecture 5 on Thursday the 26th of February.
Given an alphabet A = { 0, 1 } and the
languages defined by the following rules (a) - (e), construct (by hand) a
deterministic finite state automaton recognizing each language. Represent your
automata as state diagrams.
(a) The string of three characters, 101.
(b) All strings of arbitrary length that end in 101.
(c) All strings that contain a 101 at least once anywhere.
(d) All strings that contain no consecutive ones.
(e) All strings in which the number of zeros is even.
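A hand-drawn state diagram can be checked by writing its transitions into a table and running test strings through it. The following is a minimal sketch for rule (e) only, assuming a two-state machine in which state 0 means "even number of zeros so far" and state 1 means "odd". The class name EvenZerosDfa, the state numbering and the table layout are illustrative assumptions, not part of the exercise text.

```java
// Sketch of a table-driven DFA simulator for checking a hand-drawn diagram.
// Class name, state numbering and table layout are illustrative assumptions.
final class EvenZerosDfa {
    // TRANSITIONS[state][symbol]; state 0 = even number of zeros, state 1 = odd.
    private static final int[][] TRANSITIONS = {
        {1, 0},  // from state 0: on '0' go to 1, on '1' stay in 0
        {0, 1},  // from state 1: on '0' go to 0, on '1' stay in 1
    };
    private static final int START = 0;
    private static final boolean[] ACCEPTING = {true, false};

    // Returns true if the DFA ends in an accepting state after reading input.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != '0' && c != '1') {
                return false;  // symbol outside the alphabet A = { 0, 1 }
            }
            state = TRANSITIONS[state][c - '0'];
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "1", "0", "1001", "000"}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

The same skeleton should work for rules (a) - (d): only the transition table, the start state and the set of accepting states change, which is the point of representing a DFA as data rather than as nested if statements.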
digit = '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
integer = digit digit*
sign = '+'|'-'
exponent = 'e' (sign | empty) integer
number = integer('.' integer | empty) (exponent | empty) | exponent
Construct a DFA to recognize this language, and
represent it as a state diagram. You may find it useful to construct an
NDFA-ε, then convert it to an NDFA, and finally convert it to a DFA.
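To test a finished DFA, it can help to have an independent reference that says which strings belong to the language. The sketch below transcribes the rules for digit, integer, sign, exponent and number into a java.util.regex pattern, taking the rules literally as written, so a bare exponent such as e10 is accepted. The class name NumberReference and the method name isNumber are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Reference recognizer for the "number" rules, transcribed literally.
// Class and method names are illustrative assumptions.
final class NumberReference {
    private static final String INTEGER = "[0-9]+";               // digit digit*
    private static final String EXPONENT = "e[+-]?" + INTEGER;    // 'e' (sign | empty) integer
    private static final Pattern NUMBER = Pattern.compile(
        INTEGER + "(\\." + INTEGER + ")?(" + EXPONENT + ")?"      // integer ('.' integer | empty) (exponent | empty)
        + "|" + EXPONENT);                                        // | exponent

    // True if the whole string is a number according to the rules.
    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "3.14e-2", "e10", "3.", ".5", "1e"}) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

Strings such as 3. and .5 are rejected because the rules require an integer on both sides of the decimal point; these are useful edge cases to trace through the state diagram.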
The first character must be alphabetic (a letter);
following characters may be alphabetic, numeric, or the underscore character;
however, an underscore may not be the final character, and two underscores may not
be adjacent.
(b) Express the automaton as a regular expression. Use
concatenation, alternation (|), closure (*), and, if needed, parentheses for
grouping the items. You may find it helpful to introduce short-hand notation to
represent any character that is a member of a small specified set, and another
notation for a character that is not a member of a given set.
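When checking an automaton or regular expression for this rule, a direct transcription of the rule in words gives something to compare against. The sketch below is such a transcription, not a solution in automaton or regular-expression form. It assumes that "alphabetic" means the ASCII letters a-z and A-Z and "numeric" means the digits 0-9; the class name IdentifierRule and the method name isValid are hypothetical.

```java
// Direct check of the identifier rule as stated in words.
// Assumes ASCII letters and digits; names are illustrative assumptions.
final class IdentifierRule {
    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static boolean isValid(String s) {
        // The first character must be a letter (so the empty string is invalid).
        if (s.isEmpty() || !isLetter(s.charAt(0))) {
            return false;
        }
        char prev = s.charAt(0);
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            // Following characters: letter, digit or underscore only.
            if (!isLetter(c) && !isDigit(c) && c != '_') {
                return false;
            }
            // Two underscores may not be adjacent.
            if (c == '_' && prev == '_') {
                return false;
            }
            prev = c;
        }
        // An underscore may not be the final character.
        return prev != '_';
    }
}
```

For example, IdentifierRule.isValid("a_b1") holds, while "a_", "a__b" and "_a" are all rejected, each for a different clause of the rule.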
Try JLex on the sample grammar sample.lex:
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/sample.lex