LEX, Flex, JLEX - SisInf Lab

Formal Languages
and Compilers
Master’s Degree Course in
Computer Engineering
A.Y. 2015/2016
FORMAL LANGUAGES AND COMPILERS
LEX, FLEX, JLEX
Floriano Scioscia
DEI – Politecnico di Bari
LEX/FLEX/JLEX (1/2)
• Since the conversion of regular expressions to deterministic finite-state
automata and the implementation of the latter are mechanical (and
boring) processes, automatic scanner generators are often used.
• LEX is a well-known and widely used scanner generator in Unix. It was
designed expressly to work with the parser generator YACC. Many
assumptions in the code generated by LEX fit well with those of YACC.
For example, the scanner produced by LEX is a C function named
yylex(), which is exactly what YACC expects from the lexical
analyzer.
• LEX was developed by M. E. Lesk and E. Schmidt at AT&T Bell
Laboratories. It generates a scanner in C from a set of regular
expressions defining tokens.
• The input is a specification: a text file containing token patterns as
regular expressions. LEX produces a whole scanner module which can
be compiled and linked with the other modules of a compiler.
LEX/FLEX/JLEX (2/2)
• FLEX (Fast LEXical analyzer generator), originally written in C by V. Paxson in
1987, is a free software alternative to LEX and represents a more recent and
faster version.
• It is often used with Bison, which is in turn a parser generator alternative to
YACC.
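• LEX is distributed with Unix operating systems, while FLEX is free software
released under a BSD-style license; it is commonly shipped alongside GNU tools
such as Bison, but it is not a Free Software Foundation product.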
• JLEX is a Java version of LEX. Its regular expressions are very similar to the
ones used by LEX/FLEX. JLEX generates a scanner in Java. It is often paired
with CUP (Constructor of Useful Parsers), a Java alternative to YACC/BISON.
– JLEX: http://www.cs.princeton.edu/~appel/modern/java/JLex/
– CUP: http://www2.cs.tum.edu/projects/cup/
• LEX, FLEX and JLEX are mostly non-procedural: one does not need to state
how the tools must perform scanning. Stating what must be scanned, by means
of a definition of the valid tokens, is all that is needed. This approach greatly
simplifies scanner construction, since most scanning details (I/O, buffering,
etc.) are managed automatically.
FLEX: description and operation
• The FLEX generator takes as input a file with the .l extension (e.g. test.l)
containing a set of regular expressions (rules), each with an associated action
(C code), and produces as output a scanning routine yylex() (in the file
lex.yy.c) which detects and returns the individual lexemes admitted in a
language.
Regular expressions -> LEX -> C program
• The file lex.yy.c contains the scanning routine yylex() together with other
auxiliary routines and macros, but no main(). For a standalone scanner, a
default main() is supplied by library fl when linking with option -lfl.
• The output file is compiled and linked with library fl to generate an
executable file:
lex test.l
cc lex.yy.c -o test -lfl
• When the executable (test) is run, it scans the input file(s) looking for
character sequences that match the defined patterns. Whenever one is
detected, the associated C code is executed.
Useful FLEX links
• Home page: http://flex.sourceforge.net/
• Manual: http://flex.sourceforge.net/manual/
• ANSI C grammar (LEX specification): http://www.quut.com/c/ANSI-C-grammar-l-1998.html
FLEX: description and
operation
• Dual meaning: FLEX is both
– a language (for scanner specifications)
– a compiler (the scanner generator)
file.l -> FLEX compiler -> lex.yy.c -> C compiler -> lexer
• Finally:
source file -> lexer -> tokens
FLEX: structure of the
generated program
• FLEX produces a C program without main() whose entry point is the
function int yylex()
• This function reads from file yyin and copies to file yyout the
unrecognized text.
• If not specified otherwise in the actions (by means of the return
instruction), the function ends only when the whole input file has been
analyzed.
• After each action, the automaton returns to the start state to recognize
new tokens.
• As a default, files yyin and yyout are initialized to stdin and
stdout respectively.
• The user (programmer) can change this setting by re-initializing these
global variables.
Regular expressions in FLEX
(1/4)
• For the specification of the scanner, LEX uses regular expressions, a
formalism more efficient but less powerful than context-free
grammars.
• The difference between CFGs and regular expressions lies in the fact
that regular expressions cannot recognize recursive syntactic
structures, while CFGs can.
• A syntactic structure like balanced arithmetic expressions, requiring the
same number of open and closed parentheses, cannot be recognized
by a scanner. That’s why a parser must be used.
• On the contrary, number constants, identifiers and keywords are
recognized by a scanner.
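The limitation on balanced parentheses can be illustrated with a minimal C sketch. Checking the balance needs a counter that can grow without bound (or, equivalently, a stack), which a finite-state automaton does not have. The function name is_balanced is an illustrative assumption and is not part of LEX/FLEX.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of LEX/FLEX): returns true when every
 * '(' in s is closed by a later ')'. The depth counter is unbounded,
 * which is exactly what a finite-state automaton cannot model. */
bool is_balanced(const char *s)
{
    int depth = 0;
    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)
                return false;   /* closing parenthesis with no opener */
            depth--;
        }
    }
    return depth == 0;          /* all opened parentheses were closed */
}
```

In a real compiler this check is not written by hand: it follows from the grammar rules handled by the parser.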
Regular expressions in FLEX
(2/4)
• Regular expressions describe ASCII character sequences and use a
set of operators:
"\[]^-?.*+|()$/{}%<>
• Letters and digits in a pattern stand for themselves:
– the regular expression val1 represents the sequence ‘v’ ‘a’ ‘l’ ‘1’ in the input text.
• Non-alphanumeric characters are represented in LEX by enclosing
them between double quotes, in order to avoid ambiguity with
operators:
– the expression xyz"++" represents the sequence ‘x’ ‘y’ ‘z’ ‘+’ ‘+’ in the input text.
• Non-alphanumeric characters can also be escaped with a preceding \ symbol:
– the expression xyz\+\+ represents the sequence ‘x’ ‘y’ ‘z’ ‘+’ ‘+’ in the input
text.
Regular expressions in FLEX
(3/4)
• Character classes are described through operators []
– the expression [0123456789] represents a digit in the input text.
• In character class descriptions, the - symbol denotes a character
range:
– the expression [0-9] represents a digit in the input text.
• To include the character - in a character class, it must be specified as
the first or the last one:
– the expression [-+0-9] represents a digit or a sign in the input text.
• In character class descriptions, the ^ symbol at the beginning denotes
a set of characters to exclude:
– the expression [^0-9] represents any character except a digit in the input text.
• The set of all characters except newline is denoted with .
• The newline character is denoted with \n
• The tab character is denoted with \t
Regular expressions in FLEX
(4/4)
• The operator ? denotes that the preceding expression is optional:
– ab?c represents both ac and abc
• The operator * denotes that the preceding expression can repeat 0 or
more times:
– ab*c represents all the sequences starting with a, ending with c and having inside
any number of bs
• The operator + denotes that the preceding expression can repeat 1 or
more times:
– ab+c represents all sequences starting with a, ending with c and having inside at
least one b.
• The operator | denotes an alternative between two expressions:
– ab|cd represents the sequence ab or the sequence cd
• Parentheses ( ) group subexpressions, overriding the default operator precedence:
– (ab|cd+)?ef represents sequences such as ef, abef, cdddef.
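The operators ?, *, +, | and ( ) can be tried out without generating a scanner. The sketch below uses the POSIX regcomp()/regexec() functions, whose extended syntax shares these operators with LEX; other parts of the two syntaxes differ (for instance double-quoted literals and {name} expansion exist only in LEX). The helper name full_match is an illustrative assumption.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when text matches the POSIX extended regular
 * expression pattern. The caller anchors the pattern with ^ and $ so that
 * only matches covering the whole string are accepted. */
bool full_match(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;           /* invalid pattern */
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, full_match("^(ab|cd+)?ef$", "cdddef") accepts the string, while full_match("^ab+c$", "ac") rejects it because + requires at least one b.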
FLEX: main supported
regular expressions
Pattern   Meaning
x         character 'x'
.         any character but '\n'
[xyz]     a character class; in this case, 'x', 'y' or 'z'
[a-z]     a class with a range; any character between 'a' and 'z'
[^A-Z]    a negated class: any character NOT in the class
r*        zero or more r, with r a regular expression
r+        one or more r
r?        zero or one r
r{2,5}    between two and five r
r{2,}     two or more r
r{4}      exactly four r
{name}    the expansion of the definition of name
(r)       r, parenthesized for grouping
rs        concatenation: r followed by s
r|s       alternative: r or s
r/s       restriction: r but only if followed by s
^r        r but only at the beginning of a line
r$        r but only at the end of a line
FLEX input file format (1/5)
A LEX/FLEX input file consists of three distinct sections, separated by the %%
symbol.
Definitions:
    %{
    #include directives, constant definitions, scanner macros
    %}
    basic definitions
%%
Rules:
    token definitions and actions
%%
User code:
    support procedures, C user code
FLEX input file format (2/5)
• Section 1 (which may be empty) contains:
– definitions of custom constants and/or macros and library #include directives of
the user program, enclosed within %{ and %}; this text is copied literally
into the output C program. When the scanner is used in combination with a YACC-
or Bison-generated parser, this section should contain the directive
#include "y.tab.h", which is the header file of the generated parser, containing the
definitions of multi-character tokens for parsing purposes;
– the basic definitions used in the next section to describe regular expressions.
• Section 2 contains the token definitions with the related actions to be
executed, in the form
pattern    action
– Actions must start on the same line where the pattern regular expression ends and
are separated from it by blanks (spaces or tabs).
• Section 3 (which may be empty) contains the support routines the developer
intends to use in the actions defined in the previous section; if this section is
empty, the second %% delimiter can be omitted.
FLEX input file format
(3/5)
• In the first section, every line starting with a non-whitespace character is a
definition:
number    [+-]?[0-9]+
• Expressions defined that way can be used in section 2 by enclosing their name
within braces:
{number}    printf("number found\n");
• Code fragments can be inserted both in the first section (within %{ %}) and in
the second section (within { } right after any regular expression to be
recognized), and they are copied entirely into the output file.
• If no action is specified next to a pattern, when a token of the corresponding
type is recognized during lexical analysis it will be discarded.
• Lines in the third section (for support routines) are also copied into the
lex.yy.c output file generated by LEX.
FLEX input file format (4/5)
• Example 1 of input file content
%{
#include <stdio.h>
%}
%%
[0-9]+                  printf("INTEGER NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("IDENTIFIER\n");
%%
• This file describes 2 patterns, i.e. 2 types of tokens: [0-9]+ with the
associated action of printing INTEGER NUMBER and [a-zA-Z][a-zA-Z0-9]* with
the associated action of printing IDENTIFIER.
• Notice the presence, in section 1, of the directive #include <stdio.h>,
needed to enable the use of printf.
• This simple example presumes LEX/FLEX is used independently of
YACC/BISON.
FLEX input file format (5/5)
• Example 2 of input file content
%{
#include <stdio.h>
int num_lines = 0, num_chars = 0;
%}
%%
\n    { num_lines++; num_chars++; }
.     { num_chars++; }
%%
int main(void)
{
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
    return 0;
}
• This scanner counts the number of characters and of lines in the input file and
prints the values of such counters.
• Notice that the first section declares two global variables, accessible to the
yylex() function as well as to main(), defined after the second %%.
Lexical ambiguities
• There are two types of lexical ambiguities:
1. A prefix of a character sequence recognized by a regular expression is
also matched by another regular expression.
– In this case, the scanner executes the action associated with the regular expression
which has recognized the longest string (longest match or maximal munch rule).
2. The same character sequence is matched by two different regular expressions.
– In this case the scanner executes the action associated with the regular
expression declared first in the LEX/FLEX input file.
• Example: consider the file
%%
for       {return FOR_CMD;}
format    {return FORMAT_CMD;}
[a-z]+    {return GENERIC_ID;}
With the input string "format", the yylex() function returns FORMAT_CMD: the
second rule is preferred to the first one because it matches a longer string,
and to the third one because it is defined earlier in the input file.
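The two resolution rules can be sketched in plain C for the three rules of this example. This is only an illustration of the selection policy under simplifying assumptions (the generated scanner uses a single automaton instead of trying each rule in turn); the names next_token, match_literal and match_lowercase and the enum values are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum token { NO_MATCH, FOR_CMD, FORMAT_CMD, GENERIC_ID };

/* Length of the prefix of input matched by a literal keyword, 0 if none. */
static size_t match_literal(const char *kw, const char *input)
{
    size_t n = strlen(kw);
    return strncmp(kw, input, n) == 0 ? n : 0;
}

/* Length of the longest prefix of input matched by [a-z]+
 * (assumes an ASCII-compatible character set). */
static size_t match_lowercase(const char *input)
{
    size_t n = 0;
    while (input[n] >= 'a' && input[n] <= 'z')
        n++;
    return n;
}

/* Picks the rule with the longest match; ties go to the earliest rule. */
enum token next_token(const char *input, size_t *length)
{
    /* Candidate lengths, listed in the same order as the rules. */
    const size_t lens[] = {
        match_literal("for", input),
        match_literal("format", input),
        match_lowercase(input),
    };
    const enum token toks[] = { FOR_CMD, FORMAT_CMD, GENERIC_ID };

    enum token best = NO_MATCH;
    *length = 0;
    for (size_t i = 0; i < 3; i++) {
        /* Strict '>' keeps the earlier rule when two lengths are equal. */
        if (lens[i] > *length) {
            *length = lens[i];
            best = toks[i];
        }
    }
    return best;
}
```

With the input "formats" the third rule wins, since it is the only one matching all seven characters; with "for" the first rule wins the tie against the third.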
Solving lexical ambiguities
• Given the LEX/FLEX ambiguity resolution strategies, the rules for keywords
must be defined before the ones for identifiers.
• The longest-match principle requires caution:
'.*'    {return QUOTED_STRING;}
matches up to the last quote on the line, as far away as possible: hence, with the
following input
'first' quoted string here, 'second' here
the scanner will take 36 characters instead of 7.
• A better rule is therefore:
'[^'\n]+'    {return QUOTED_STRING;}
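The difference between the two rules can be checked with a small C sketch that computes, by hand, how many characters each pattern would consume at the start of a string. The function names match_greedy_quote and match_bounded_quote are illustrative assumptions; the generated scanner does not contain such functions.

```c
#include <stddef.h>
#include <string.h>

/* Length matched at the start of s by  '.*'  (greedy: runs to the LAST
 * quote before the end of the line), or 0 when there is no match. */
size_t match_greedy_quote(const char *s)
{
    if (s[0] != '\'')
        return 0;
    size_t eol = strcspn(s, "\n");          /* '.' never matches '\n' */
    for (size_t i = eol; i > 1; i--)        /* scan back for a closing quote */
        if (s[i - 1] == '\'')
            return i;
    return 0;
}

/* Length matched at the start of s by  '[^'\n]+'  (stops at the FIRST
 * closing quote and requires at least one character inside), or 0. */
size_t match_bounded_quote(const char *s)
{
    if (s[0] != '\'')
        return 0;
    size_t inner = strcspn(s + 1, "'\n");   /* characters inside the quotes */
    if (inner == 0 || s[1 + inner] != '\'')
        return 0;
    return inner + 2;                       /* both quotes included */
}
```

On the input line of the slide, the first function returns 36 and the second returns 7, matching the figures given above.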
Actions associated with
regular expressions in LEX
• In LEX, each regular expression is associated with an action, executed upon
recognition.
• Actions are expressed in C code: if this code fragment includes more than one
statement or spans more than one line, it must be enclosed in curly braces.
• The simplest action is ignoring the recognized text: an empty action is expressed
with the ; character.
• Recognized text is stored in the yytext variable, defined as char pointer.
Working on this variable, more complex actions can be specified.
• The number of recognized characters is stored in the yyleng variable, defined
as integer.
• A default action exists for text not recognized by any regular expression: the
unrecognized text is copied to the output, character by character.
Examples of LEX input (1/3)
1. Print every input line preceded by its ordinal number.
%{
#include <stdio.h>
int l = 1;
%}
line    .*\n
%%
{line}    {printf("%d %s", l++, yytext);}
%%
int main(void) {
    yylex();
    return 0;
}
Slide callouts: {line} is the regular expression, the printf statement is the
action, yytext holds the lexeme, and the third section is compilable scanner code.
Examples of LEX input (2/3)
2. Replace numbers in decimal notation with hexadecimal notation and print the
number of actual substitutions.
%{
#include <stdio.h>
#include <stdlib.h>
int count = 0;
%}
digit    [0-9]
num      {digit}+
%%
{num}    { int n = atoi(yytext);
           printf("%x", n);
           if (n > 9) count++;
         }
%%
int main(void) {
    yylex();
    fprintf(stderr, "Substitution count = %d\n", count);
    return 0;
}
Note: the default action applies when a string is not part of any token: it is
echoed (ECHO) to the output.
Examples of LEX input (3/3)
3. Delete lines that neither start nor end with the character a.
%{
#include <stdio.h>
%}
a_line    a.*\n
line_a    .*a\n
%%
{a_line}    ECHO;
{line_a}    ECHO;
.*\n        ;    /* empty action */
%%
int main(void) {
    yylex();
    return 0;
}
Notice: the rule set is ambiguous, since a string can match several regular
expressions (e.g. the line a).
Built-in priority guidelines:
1. Maximal munch principle.
2. If several rules match the same string, select the earliest specified one.
Effect of changing the rules:
– if the rule .*\n ; is listed before {a_line} ECHO; and {line_a} ECHO;, the output is empty;
– if only {a_line} ECHO; and {line_a} ECHO; are kept, the output equals the input.
FLEX: combined usage with BISON (1/3)
• The output of LEX/FLEX is a lex.yy.c file: a C program without main(),
containing the yylex() scanning routine with other auxiliary routines and
macros.
• By default, yylex() is declared as:
int yylex()
{
... various definitions and the actions in here ...
}
• When lexical and syntax analysis are combined, the lex.yy.c file (produced
by the scanner generator) is typically included (by means of #include) in the
source code generated by YACC, or compiled separately and linked with it.
Many declarations, such as tokens and data structures to communicate with
the parser, are found in the source generated by YACC, y.tab.c.
FLEX: combined usage with BISON (2/3)
• Example: BASIC compiler
– bas.l: lexical rules
– bas.y: syntax rules with token definitions
– cc: command to create the compiler
– bas.exe: compiler
– y.tab.h: token definitions for LEX
yacc -d bas.y                     # create y.tab.h, y.tab.c
lex bas.l                         # create lex.yy.c
cc lex.yy.c y.tab.c -o bas.exe    # compile/link
FLEX: output (1/3)
• When the generated scanner is executed, it scans input recognizing
character strings compliant with the specified patterns.
• As already noted, if more than one possible match is found, the
longest one is taken. When two matches have equal length, the one whose
rule appears earlier in the FLEX input file is chosen.
• Once the match is found, the corresponding text (which represents a
token) is made globally available through the char pointer
(char *yytext), and its length is globally reported in the integer
(int yyleng).
• Then the scanner executes the action which corresponds to the found
pattern and proceeds to scanning the remaining text.
FLEX: output (2/3)
• Once a token is recognized, various possibilities exist: it can be ignored
(as is usually done for whitespace), so that scanning moves on to the next
token; or the code of the recognized token is returned. When a token is returned,
the yylex() function ends, but it will be called again by the parser when it
needs another token.
• Some supplementary action can be required, besides returning or
ignoring a token. For example, when a newline is found, the input line
counter is increased.
• Even more important is the fact that for some tokens something more
must be known beyond their type. For example, it is not enough to
know a variable has been found: we must know which variable it is.
FLEX: output (3/3)
• At the end of its input, the scanner generated by FLEX calls the yywrap()
function. While scanning, it exposes the global variable char *yytext, storing
the characters of the current token, and the global variable int yyleng,
storing the length of that string.
• If yywrap() returns 0 (false), it signals that yyin has been set to another
input file, so that scanning continues on that file. If it returns a non-zero
value (true), the scanner terminates and yylex() returns value 0 to the caller
function.
Flow of yylex():
1. Initialize file yyin (default: stdin).
2. Symbol processing: scan the input file; an action may return to the caller.
3. End of scanning: when EOF is found, yywrap() is called. If there are more
files, yyin is redirected and scanning resumes; otherwise yylex() returns.
Example exercises
• Write the LEX specification for a scanner which, given a C program in input,
produces an equal but comment-free program in output.
• Change the above specification so that the scanner recognizes #include
directives and, when it does, reports an error and ends the analysis.
JLEX: LEX’s Java version
• LEX generates scanners in C language. To generate scanners in Java
language, the JLEX scanner generator can be used.
• This tool, entirely written in Java, produces in output Java classes
implementing methods to execute lexical analysis of an input string.
• The main produced class is Yylex, containing the yylex() method, which
gets and analyzes the next input token.
• Another method of the class is yytext(), which returns the text recognized
by yylex().
• JLEX also requires an input specification file, containing all details about the
lexical analysis to be performed.
• As said, JLEX is often used together with CUP (Constructor of Useful Parsers),
a Java alternative to YACC/Bison.