|
|
Compiler Construction Lecture Notes
|
Why study compilers?
You may never write a commercial
compiler, but that's not why we study compilers. We study compiler construction
for the following reasons:
-
Writing a compiler gives a student
experience with large-scale applications development. Your compiler program may
be the largest program you write as a student. Experience working with really
big data structures and complex interactions between algorithms will help you
out on your next big programming project.
-
Compiler writing is one of the
shining triumphs of CS theory. It demonstrates the value of theory over the
impulse to just "hack up" a solution.
-
Compiler writing is a basic element
of programming language research. Many language researchers write compilers for
the languages they design.
-
Many applications have similar
properties to one or more phases of a compiler, and compiler expertise and
tools can help an application programmer working on other projects besides
compilers.
|
|
Compilers - a brief overview
Purpose: translate a program in some
language (the source language) into a lower-level language (the target
language).
Phases:
-
Lexical Analysis:
-
Converts a sequence of characters
into words, or tokens
-
Syntax Analysis:
-
Converts a sequence of tokens into a parse
tree
-
Semantic Analysis:
-
Manipulates parse tree to verify
symbol and type information
-
Intermediate Code Generation:
-
Converts parse tree into a sequence
of intermediate code instructions
-
Optimization:
-
Manipulates intermediate code to
produce a more efficient program
-
Final Code Generation:
-
Translates intermediate code into
final (machine/assembly) code
Example of the Compilation Process
Consider the example from Figure 1.10
on p. 13 of the book in detail.
position = initial + rate * 60
30 or so characters, from a single
line of source code, are first transformed by lexical analysis into a sequence
of 7 tokens. Those tokens are then used to build a tree of height 4 during
syntax analysis. Semantic analysis may transform the tree into one of height 5,
that includes a type conversion necessary for real addition on an integer
operand. Intermediate code generation uses a simple traversal algorithm to
linearize the tree back into a sequence of machine-independent
three-address-code instructions.
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Optimization of the intermediate code
allows the four instructions to be reduced to two machine-independent
instructions. Final code generation might implement these two instructions
using 5 machine instructions, in which the actual registers and addressing
modes of the CPU are utilized.
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Overview of Lexical Analysis
Lexical Analyzer a.k.a. scanner
-
Convert from a sequence of
characters into a (shorter) sequence of tokens
-
Identify and categorize specific
character sequences into tokens
-
Skip comments & whitespace
-
Handle lexical errors (illegal
characters, malformed tokens)
-
Efficiency is crucial; scanner may
perform elaborate input buffering
-
Tokens are specified as regular
expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
-
Lexical Analyzers are implemented by
DFAs.
The term "token" usually refers to an
object (or struct) containing complete information about a single lexical
entity, but it is often also used to refer the category ("class" if you prefer)
of that entity. The term "lexeme" denotes the actual string of characters that
comprise a particular occurrence ("instance" if you like) of a token.
Regular Expressions
The notation we use to precisely
capture all the variations that a given category of token may take are called
"regular expressions" (or, less formally, "patterns". The word "pattern" is
really vague and there are lots of other notations for patterns besides regular
expressions). Regular expressions are a shorthand notation for sets of strings.
In order to even talk about "strings" you have to first define an alphabet,
the set of characters which can appear.
-
Epsilon is a regular expression
denoting the set containing the empty string
-
Any letter in the alphabet is also a
regular expression denoting the set containing a one-letter string consisting
of that letter.
-
For regular expressions r and s,
r | s
is a regular expression denoting the union of r and s
-
For regular expressions r and s,
r s
is a regular expression denoting the set of strings consisting of a member of r
followed by a member of s
-
For regular expression r,
r*
is a regular expression denoting the set of strings consisting of zero or more
occurrences of r.
-
You can parenthesize a regular
expression to specify operator precedence (otherwise, alternation is like plus,
concatenation is like times, and closure is like exponentiation)
Although these operators are
sufficient to describe all regular languages, in practice everybody uses
extensions:
-
For regular expression r,
r+
is a regular expression denoting the set of strings consisting of one or more
occurrences of r. Equivalent to rr*
-
For regular expression r,
r?
is a regular expression denoting the set of strings consisting of zero or one
occurrence of r. Equivalent to r|epsilon
-
The notation [abc] is short for
a|b|c. [a-z] is short for a|b|...|z
Finite Automata
A finite automaton is an abstract,
mathematical machine, also known as a finite state machine, with the following
components:
-
A set of states S
-
A set of input symbols E (the
alphabet)
-
A transition function move(state,
symbol) : new state(s)
-
A start state S0
-
A set of final states F
For a deterministic finite automaton
(DFA), the function move(state, symbol) goes to at most one state, and symbol
is never epsilon.
Finite automata correspond in a 1:1
relationship to transition diagrams; from any transition diagram one can write
down the formal automaton in terms of items #1-#5 above, and vice versa. To
draw the transition diagram for a finite automaton:
-
draw a circle for each state s in S;
put a label inside the circles to identify each state by number or name
-
draw an arrow between Si and
Sj, labeled with x whenever the transition says to move(Si,
x) : Sj
-
draw a "wedgie" into the start state
S0 to identify it
-
draw a second circle inside each of
the final states in F
DFA Implementation
The nice part about DFA's is that they
are efficiently implemented on computers. What DFA does the following code
correspond to? What is the corresponding regular expression? You can speed this
code fragment up even further if you are willing to use goto's or write it in
assembler.
state := S0
for(;;)
switch (state) {
case 0:
switch (input) {
'a': state = 1; input = getchar(); break;
'b': input = getchar(); break;
default: printf("dfa error\n"); exit(1);
}
case 1:
switch (input) {
EOF: printf("accept\n"); exit(0);
default: printf("dfa error\n"); exit(1);
}
}
Nondeterministic Finite Automata
(NFA's)
Notation convenience motivates more
flexible machines in which function move() can go to more than one state on a
given input symbol, and some states can move to other states even without
consuming an input symbol.
Fortunately, one can prove that for
any NFA, there is an equivalent DFA. They are just a notational convenience.
So, finite automata help us get from a set of regular expressions to a computer
program that recognizes them efficiently.
Regular expressions can be converted
automatically to NFA's
Each rule in the definition of regular
expressions has a corresponding NFA; NFA's are composed using epsilon
transitions. This is cited in the text as "Thompson's construction" (Algorithm
3.3). We will work examples such as (a|b)*abb in class and during lab.
-
For epsilon, draw two states with a
single epsilon transition.
-
For any letter in the alphabet, draw
two states with a single transition labeled with that letter.
-
For regular expressions r and s,
draw r | s by adding a new start state with epsilon transitions to the start
states of r and s, and a new final state with epsilon transitions from each
final state in r and s.
-
For regular expressions r and s,
draw rs by adding epsilon transitions from the final states of r to the start
state of s.
-
For regular expression r, draw r* by
adding new start and final states, and epsilon transitions (a) from the start
state to the final state, (b) from the final state back to the start state, (c)
from the new start to the old start and from the old final states to the new
final state.
-
For parenthesed regular expression
(r) you can use the NFA for r.
NFA's can be converted automatically
to DFA's
In: NFA N
OUt: DFA D
Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA
state is a set of NFA states. Dtran simulates in parallel all possible moves N
can make on a given string.
Operations to keep track of sets of
NFA states:
-
e_closure(s)
-
set of states reachable from state s
via epsilon
-
e_closure(T)
-
set of states reachable from any
state in set T via epsilon
-
move(T,a)
-
set of states to which there is an
NFA transition from states in T on symbol a
Algorithm:
Dstates := {e_closure(start_state)}
while T := unmarked_member(Dstates) do {
mark(T)
for each input symbol a do {
U := e_closure(move(T,a))
if not member(Dstates, U) then
insert(Dstates, U)
Dtran[T,a] := U
}
}
lex(1) and flex(1)
These programs generally take a
lexical specification given in a .l file and create a corresponding C language
lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked
with the rest of your compiler.
The C code generated by lex has the
following public interface. Note the use of global variables instead of
parameters, and the use of the prefix yy to distinguish scanner names from your
program names. This prefix is also used in the YACC parser generator.
FILE *yyin; /* set this variable prior to calling yylex() */
int yylex(); /* call this function once for each token */
char yytext[]; /* yylex() writes the token's lexeme to this array */
The .l file format consists of a
mixture of lex syntax and C code fragments. The percent sign (%) is used to
signify lex elements. The whole file is divided into three sections separated
by %%:
header
%%
body
%%
helper functions
The header consists of C code
fragments enclosed in %{ and %} as well as macro definitions consisting of a
name and a regular expression denoted by that name. lex macros are invoked
explicitly by enclosing the macro name in curly braces. Following are some
example lex macros.
letter [a-zA-Z]
digit [0-9]
ident {letter}({letter}|{digit})*
The body consists of of a sequence of
regular expressions for different token categories and other lexical entities.
Each regular expression can have a C code fragment enclosed in curly braces
that executes when that regular expression is matched. For most of the regular
expressions this code fragment (also called a semantic action consists
of returning an integer that identifies the token category to the rest of the
compiler, particularly for use by the parser to check syntax. Some typical
regular expressions and semantic actions might include:
" " { /* no-op, discard whitespace */ }
{ident} { return IDENTIFIER; }
"*" { return ASTERISK; }
You also need regular expressions for
lexical errors such as unterminated character constants, or illegal characters.
The helper functions in a lex file
typically compute lexical attributes, such as the actual integer or string
values denoted by literals.
Lex extended regular expressions
Lex further extends the regular
expressions with several helpful operators. Lex's regular expressions include:
-
c
-
normal characters mean themselves
-
\c
-
backslash escapes remove the meaning
from most operator characters. Inside character sets and quotes, backslash
performs C-style escapes.
-
"s"
-
Double quotes mean to match the C
string given as itself. This is particularly useful for multi-byte operators
and may be more readable than using backslash multiple times.
-
[s]
-
This character set operator matches
any one character among those in s.
-
[^s]
-
A negated-set matches any one
character not among those in s.
-
.
-
The dot operator matches any one
character except newline: [^\n]
-
r*
-
match r 0 or more times.
-
r+
-
match r 1 or more times.
-
r?
-
match r 0 or 1 time.
-
r{m,n}
-
match r between m and n times.
-
r1r2
-
concatenation. match r1 followed
by r2
-
r1|r2
-
alternation. match r1 or
r2
-
(r)
-
parentheses specify precedence but
do not match anything
-
r1/r2
-
lookahead. match r1 when
r2 follows, without consuming r2
-
^r
-
match r only when it occurs at the
beginning of a line
-
r$
-
match r only when it occurs at the
end of a line
Lexical Attributes and Token Objects
Besides the token's category, the rest
of the compiler may several pieces of information about a token in order to
perform semantic analysis, code generation, and error handling. These are
stored in an object instance of class Token, or in C, a struct. The fields are
generally something like:
struct token {
int category;
char *text;
int linenumber;
int column;
char *filename;
union literal value;
}
The union literal will hold computed
values of integers, real numbers, and strings.
Lexical Analysis and the Symbol Table
In many compilers, the symbol table
and memory management components of the compiler interact with several phases
of compilation, starting with lexical analysis.
-
Efficient storage is necessary to
handle large input files.
-
There is a colossal amount of
duplication in lexical data: variable names, strings and other literal values
duplicate frequently
-
What token type to use may depend on
previous declarations.
A hash table or other efficient data
structure can avoid this duplication. The software engineering design pattern
to use is called the "flyweight". Example: Figure 3.18, p. 109. Use
"install_id()" instead of "strdup()" to avoid duplication in the lexical data.
Syntax Analysis
Parsing is the act of
performing syntax analysis to verify an input program's compliance with the
source language. A by-product of this process is typically a tree that
represents the structure of the program.
Context Free Grammars
A context free grammar G has:
-
A set of terminal symbols, T
-
A set of nonterminal symbols, N
-
A start symbol, s, which is a member
of N
-
A set of production rules of the
form A -> w, where A is a nonterminal and w is a string of terminal and
nonterminal symbols.
A context free grammar can be used
to generate strings in the corresponding language as follows:
let X = the start symbol s
while there is some nonterminal Y in X do
apply any one production rule using Y, e.g. Y -> w
When X consists only of terminal
symbols, it is a string of the language denoted by the grammar. Each iteration
of the loop is a derivation step. If an iteration has several
nonterminals to choose from at some point, the rules of derviation would allow
any of these to be applied. In practice, parsing algorithms tend to always
choose the leftmost nonterminal, or the rightmost nonterminal, resulting in
strings that are leftmost derivations or rightmost derivations.
Grammar Ambiguity
The grammar
E -> E + E
E -> E * E
E -> ( E )
E -> ident
allows two different derivations for
strings such as "x + y * z". The grammar is ambiguous, but the semantics of the
language dictate a particular operator precedence that should be used. One way
to eliminate such ambiguity is to rewrite the grammar. For example, we can
force the precedence we want by adding some nonterminals and production rules.
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( F )
F -> ident
Recursive Descent Parsing
Perhaps the simplest parsing method,
for a large subset of context free grammars, is called recursive descent. It is
simple because the algorithm closely follows the production rules of
nonterminal symbols.
-
Write 1 procedure per nonterminal
rule
-
Within each procedure, a) match
terminals at appropriate positions, and b) call procedures for non-terminals.
-
Pitfalls: left recursion is FATAL;
must distinguish between several production rules, or potentially, one has to
try all of them via backtracking.
Recursive Descent Parsing Example #1
The grammar
S -> A B C
A -> a A
A -> epsilon
B -> b
C -> c
maps to code like
procedure S()
if A() & B() & C() then succeed
end
procedure A()
if currtoken == a then # use production 2
currtoken = scan()
return A()
else
succeed # production rule 3, match epsilon
end
procedure B()
if currtoken == b then
currtoken = scan()
succeed
else fail
end
procedure C()
if currtoken == c then
currtoken = scan()
succeed
else fail
end
Removing Left Recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | ident
We can remove the left recursion by
introducing new nonterminals and new production rules.
E -> T E'
E' -> + T E' | epsilon
T -> F T'
T' -> * F T' | epsilon
F -> ( E ) | ident
Getting rid of such immediate left
recursion is not enough, one must get rid of indirect left recursion,
where two or more nonterminals are mutually left-recursive. One can rewrite any
CFG to remove left recursion (Algorithm 4.1).
for i := 1 to n do
for j := 1 to i-1 do begin
replace each Ai -> Aj gamma with productions
Ai -> delta1gamma | delta2gamma
end
eliminate immediate left recursion
Backtracking?
Current token could begin more than
one of your possible production rules? Try all of them, remember and reset
state for each try.
Left factoring can often solve such
problems
S -> cAd
A -> a A'
A'-> b
A'-> (epsilon)
One can also perform left factoring
(Algorithm 4.2) to reduce or eliminate the lookahead or backtracking needed to
tell which production rule to use. If the end result has no lookahead or
backtracking needed, the resulting CFG can be solved by a "predictive parser"
and coded easily in a conventional language. If backtracking is needed, a
recursive descent parser takes more work to implement, but is still feasible.
As a more concrete example:
S -> if E then S
S -> if E then S1 else S2
can be factored to:
S -> if E then S S'
S'-> else S2 | epsilon
Some More Parsing Theory
Automatic techniques for
constructing parsers start with computing some basic functions for symbols in
the grammar. These functions are useful in understanding both recursive descent
and bottom-up LR parsers.
First(a)
First(a) is the set of terminals
that begin strings derived from a, which can include epsilon.
-
First(X) starts with the empty set.
-
if X is a terminal, First(X) is {X}.
-
if X -> epsilon is a production,
add epsilon to First(X).
-
if X is a non-terminal and X -> Y1
Y2 ... Yk is a production, add First(Y1) to
First(X).
-
for (i = 1; if Yi can derive epsilon; i++)
add First(Yi+1) to First(X)
Follow(A)
Follow(A) for nonterminal A is the
set of terminals that can appear immediately to the right of A in some
sentential form S -> aAxB... To compute Follow, apply these rules to all
nonterminals in the grammar:
-
Add $ to Follow(S)
-
if A -> aBb then add First(b) -
epsilon to Follow(B)
-
if A -> aB or A -> aBb where
epsilon is in First(b), then add Follow(A) to Follow(B).
Bottom Up Parsing
Bottom up parsers start from the
sequence of terminal symbols and work their way back up to the start symbol by
repeatedly replacing grammar rules' right hand sides by the corresponding
non-terminal. This is the reverse of the derivation process, and is called
"reduction".
Example. For the grammar
(1) S->aABe
(2) A->Abc
(3) A->b
(4) B->d
the string "abbcde" can be parsed
bottom-up by the following reduction steps:
abbcde
aAbcde
aAde
aABe
S
Definition: a handle is a
substring that
-
matches a right hand side of a
production rule in the grammar and
-
whose reduction to the nonterminal
on the left hand side of that grammar rule is a step along the reverse of a
rightmost derivation.
Shift Reduce Parsing
A shift-reduce parser performs its
parsing using the following structure
At each step, the parser performs
one of the following actions.
-
Shift one symbol from the input onto
the parse stack
-
Reduce one handle on the top of the
parse stack. The symbols from the right hand side of a grammar rule are popped
of the stack, and the nonterminal symbol is pushed on the stack in their place.
-
Accept is the operation performed
when the start symbol is alone on the parse stack and the input is empty.
-
Error actions occur when no
successful parse is possible.
LR Parsers
LR denotes a class of bottom up
parsers that is capable of handling virtually all programming language
constructs. LR is efficient; it runs in linear time with no backtracking
needed. The class of languages handled by LR is a proper superset of the class
of languages handled by top down "predictive parsers". LR parsing detects an
error as soon as it is possible to do so. Generally building an LR parser is
too big and complicated a job to do by hand, we use tools to generate LR
parsers.
The LR parsing algorithm is given
below. See Figure 4.29 for a schematic.
ip = first symbol of input
repeat {
s = state on top of parse stack
a = *ip
case action[s,a] of {
SHIFT s': { push(a); push(s') }
REDUCE A->beta: {
pop 2*|beta| symbols; s' = new state on top
push A
push goto(s', A)
}
ACCEPT: return 0 /* success */
ERROR: { error("syntax error", s, a); halt }
}
}
Conflicts in Shift-Reduce Parsing
"Conflicts" occur when an ambiguity
in the grammar creates a situation where the parser does not know which step to
perform at a given point during parsing. There are two kinds of conflicts that
occur.
-
shift-reduce
-
a shift reduce conflict occurs when
the grammar indicates that different successful parses might occur with either
a shift or a reduce at a given point during parsing. The vast majority of
situations where this conflict occurs can be correctly resolved by shifting.
-
reduce-reduce
-
a reduce reduce conflict occurs when
the parser has two or more handles at the same time on the top of the stack.
Whatever choice the parser makes is just as likely to be wrong as not. In this
case it is usually best to rewrite the grammar to eliminate the conflict,
possibly by factoring.
Example shift reduce conflict:
S->if E then S
S->if E then S else S
In many languages two nested "if"
statements produce a situation where an "else" clause could legally belong to
either "if". The usual rule (to shift) attaches the else to the nearest (i.e.
inner) if statement. Example reduce reduce conflict:
(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist, p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist, E
(9) elist -> E
By the point the stack holds ...id
LP id
the parser will not know which rule to use to reduce the id: (5) or (7).
Constructing SLR Parsing Tables:
Definition: An LR(0) item of a
grammar G is a production of G with a dot at some position of the RHS.
Example: The production A->aAb
gives the items:
A->.aAb A->a.Ab A->aA.b
A->aAb.
Note: A production A-> e generates
only one item:
A->.
Intuition: an item A->a. b denotes:
-
a - we have already seen a string
derivable from a
-
b - we hope to see a string
derivable from b
Functions on Sets of Items
Closure: if I is a set of items for
a grammar G, then closure(I) is the set of items constructed as follows:
-
Every item in I is in closure(I).
-
If A->aBb
is in closure(I) and B->g is a production, then add B-> .g
to closure(I).
These two rules are applied repeatedly
until no new items can be added.
Intuition: If A->a.Bb e closure(I)
then we home to see a string derivable from B in the input. So if B-> g is a
production, we should hope to see a string derivable from g. Hence, B->.g e
closure(I).
|
Goto: if I is a set of items and X is
a grammar symbol, then goto(I,X) is defined to be:
goto(I,X) = closure({[A->aX. b] |
[A->a.Xb] e I})
Intuition:
-
[A->a.Xb] e I => we've seen a
string derivable from a; we hope to see a string derivable from Xb.
-
Now suppose we see a string
derivable from X
-
Then, we should "goto" a state where
we've seen a string derivable from aX, and where we hope to see a string
derivable from b. The item corresponding to this is [A->aX. b]
E -> E+T | T
T -> T*F | F
F -> (E) | id
Let I = {[E -> E+.T]} then:
goto(I,+) = closure({[E -> E+.T]})
= closure({[E -> E+.T], [E -> .T*F], [T -> .F]})
= closure({[E -> E+.T], [E -> .T*F], [T -> .F], [F-> .(E)], [F -> .id]})
= { [E -> E + .T],[T -> .T * F],[T -> .F],[F -> .(E)],[F -> .id]}
The Sets of Items Construction
-
Given a grammar G with start symbol
S, construct the augmented grammar by adding a special production S'->S
where S' does no appear in G.
-
Algorithm for constructing the
canonical collection of LR(0) items for an augmented grammar G':
begin
C := { closure({[S' -> .S]}) };
repeat
for each set of items I e C:
for each grammar symbol X:
if goto(I,X) != 0 and goto(I,X) !e C then
add goto(I,X) to C;
until no new sets of items can be added to C;
return C;
end
Valid Items: an item A ->
b 1. b 2 is valid for a viable prefix
ab 1 if there is a derivation:
S' =>*rm aA w =>*rma b1 b 2w
Suppose A -> b1. b 2 is
valid for ab1, and aB1 is on the parsing stack
-
if b2 != e, we should
shift
-
if b2 = e, A -> b1
is the handle, and we should reduce by this production
Note: two valid items may tell us to
do different things for the same viable prefix. Some of these conflicts can be
resolved using lookahead on the input string.
Constructing an SLR Parsing Table
-
Given a grammar G, construct the
augmented grammar by adding the production S' -> S.
-
Construct C = {I0, I1,
… In}, the set of sets of LR(0) items for G'.
-
State I is constructed from Ii,
with parsing action determined as follows:
-
[A -> a.aB] e Ii, a a
terminal; goto(Ii,a) = Ij : set action[i,a] = "shift j"
-
[A -> a.] e Ii : set
action[i,a] to "reduce A -> x" for all a e FOLLOW(A), where A != S'
-
[S' -> S] e Ii : set
action[i,$] to "accept"
-
goto transitions constructed as
follows: for all non-terminals: if goto(Ii, A) = Ij, then
goto[i,A] = j
-
All entries not defined by (3) &
(4) are made "error". If there are any multiply defined entries, grammar is not
SLR.
-
Initial state S0 of
parser: that constructed from I0 or [S' -> S]
Example:
S -> aABe FIRST(S) = {a} FOLLOW(S) = {$}
A -> Abc FIRST{A} = {b} FOLLOW(A) = {b,d}
A -> b FIRST{B} = {d} FOLLOW{B} = {e}
B -> d FIRST{S'}= {a} FOLLOW{S'}= {$}
I0 = closure([S'->.S]
= closure([S'->.S],[S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.Abe])
= closure([S->a.Abe],[A->.Abc],[A->.b]) = I2
goto(I2,A) = closure([S->aA.Be],[A->A.bc])
= closure([S->aA.Be],[A->A.bc],[B->.d]) = I3
goto(I2,B) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9
YACC
YACC files end in .y and take the
form
declarations
%%
grammar
%%
subroutines
The declarations section defines the
terminal symbols (tokens) and nonterminal symbols. The most useful declarations
are:
-
%token a
-
declares terminal symbol a; YACC can
generate a set of #define's that map these symbols onto integers, in a y.tab.h
file
-
%start A
-
specifies the start symbol for the
grammar (defaults to nonterminal on left side of the first production rule).
The grammar gives the production
rules, interspersed with program code fragments called semantic actions that
let the programmer do what's desired when the grammar productions are reduced.
They follow the syntax
Where body is a sequence of 0 or
more terminals, nonterminals, or semantic actions (code, in curly braces)
separated by spaces. As a notational convenience, multiple production rules may
be grouped together using the vertical bar (|).
The Value Stack
-
YACC's parse stack contains only
states
-
YACC maintains a parallel set of
values
-
$ is used in semantic actions to
name elements on the value stack
-
$$ denotes the value associated with
the LHS (nonterminal) symbol
-
$n denotes the value associated with
RHS symbol at position n.
-
Value stack typically used to
construct the parse tree
-
Typical rule with semantic action: A
: b C d { $$ = tree(R,3,$1,$2,$3); }
-
The default value stack is an array
of integers
-
The value stack can hold arbitrary
values in an array of unions
-
The union type is declared with
%union and is named YYSTYPE
YACC precedence and associativity
declarations
YACC headers can specify precedence
and associativity rules for otherwise heavily ambiguous grammars. Precedence is
determined by increasing order of these declarations. Example:
%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr: expr ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr POWER expr
;
YACC error handling and recovery
-
Use special predefined token error
where errors expected
-
On an error, the parser pops states
until it enters one that has an action on the error token.
-
For example: statement: error ';' ;
-
The parser must see 3 good tokens
before it decides it has recovered.
-
yyerrok tells parser to skip the 3
token recovery rule
-
yyclearin throws away the current
(error-causing?) token
-
yyerror(s) is called when a syntax
error occurs (s is the error message)
Semantic Analysis
Semantic ("meaning") analysis refers
to a phase of compilation in which the input program is studied in order to
determine what operations are to be carried out. The two primary components of
a classic semantic analysis phase are variable reference analysis and type
checking. These components both rely on an underlying symbol table.
What we have at the start of
semantic analysis is a tree built definitions; they can have all the
synthesized attributes they want.
In practice, attributes get stored
in parse tree nodes and the semantic rules are evaluated either (a) during
parsing (for easy rules) or (b) during one or more (sub)tree traversals.
Symbol Table Module
Symbol tables are used to resolve
names within name spaces. Symbol tables are generally organized hierarchically
according to the scope rules of the language. See the operations defined on
pages 474-476 of the text. To wit:
-
mktable(parent)
-
creates a new symbol table, whose
scope is local to (or inside) parent
-
enter(table, symbolname, type,
offset)
-
insert a symbol into a table
-
addwidth(table)
-
sums the widths of all entries in
the table
-
enterproc(table, name, newtable)
-
enters the local scope of the named
procedure
Type Checking
Perhaps the primary component of
semantic analysis in many traditional compilers consists of the type checker.
In order to check types, one first must have a representation of those types (a
type system) and then one must implement comparison and composition operators
on those types using the semantic rules of the source language being compiled.
Lastly, type checking will involve adding synthesized attributes through those
parts of the language grammar that involve expressions and values.
Type Systems
Types are defined recursively
according to rules defined by the source language being compiled. A type system
might start with rules like:
-
Base types (int, char, etc.) are
types
-
Named types (via typedef, etc.) are
types
-
Types composed using other types are
types, for example:
-
array(T, indices) is a type. In some
languages indices always start with 0, so array(T, size) works.
-
T1 x T2 is a type (specifying, more
or less, the tuple or sequence T1 followed by T2; x is a so-called
cross-product operator).
-
record((f1 x T1) x (f2 x T2) x ... x
(fn x Tn)) is a type
-
in languages with pointers,
pointer(T) is a type
-
(T1 x ... Tn)
-> Tn+1 is a type denoting a function mapping parameter types to
a return type
-
In some language type expressions
may contain variables whose values are types.
In addition, a type system includes
rules for assigning these types to the various parts of the program; usually
this will be performed using attributes assigned to grammar symbols.
Representing C (C++, Java, etc.)
Types
The type system is represented using
data structures in the compiler's implementation language. In the symbol table
and in the parse tree attributes used in type checking, there is a need to
represent and compare source language types. You might start by trying to
assign a numeric code to each type, kind of like the integers used to denote
each terminal symbol and each production rule of the grammar. But what about
arrays? What about structs? There are an infinite number of types; any attempt
to enumerate them will fail. Instead, you should create a new data type to
explicitly represent type information. This might look something like the
following:
struct c_type {
int base_type; /* 1 = int, 2=float, ... */
union {
struct array {
int size;
struct c_type *elemtype;
} a;
struct ctype *p;
struct struc {
char *label;
struct field **f;
} s;
} u;
}
struct field {
char *name;
struct ctype *elemtype;
}
Given this representation, how would
you initialize a variable to represent each of the following types:
int [10][20]
struct foo { int x; char *s; }
Run-time Environments
Scopes and Bindings
Variables may be declared explicitly
or implicitly in some languages
Scope rules for each language
determine how to go from names to declarations.
Each use of a variable name must be
associated with a declaration. This is generally done via a symbol table. In
most compiled languages it happens at compile time (in contrast, for example
,with LISP).
Environment and State
Environment maps source code names
onto storage addresses (at compile time), while state maps storage addresses
into values (at runtime). Environment relies on binding rules and is used in
code generation; state operations are loads/stores into memory, as well as
allocations and deallocations. Environment is concerned with scope rules, state
is concerned with things like the lifetimes of variables.
Runtime Memory Regions
Operating systems vary in terms of
how the organize program memory for runtime execution, but a typical scheme
looks like this:
|
code
|
|
static data
|
|
stack (grows down)
|
| heap (may grow up, from bottom of address space)
|
The code section may be read-only,
and shared among multiple instances of a program. Dynamic loading may introduce
multiple code regions, which may not be contiguous, and some of them may be
shared by different programs. The static data area may consist of two sections,
one for "initialized data", and one section for uninitialized (i.e. all zero's
at the beginning). Some OS'es place the heap at the very end of the address
space, with a big hole so either the stack or the heap may grow arbitrarily
large. Other OS'es fix the stack size and place the heap above the stack and
grow it down.
Questions to ask about a language,
before writing its code generator
-
May procedures be recursive? (Duh,
all modern languages...)
-
What happens to locals when a
procedure returns? (Lazy deallocation rare)
-
May a procedure refer to non-local,
non-global names? (Pascal-style nested procedures, and object field names)
-
How are parameters passed? (Many
styles possible, different declarations for each (Pascal), rules hardwired by
type (C)?)
-
May procedures be passed as
parameters? (Not too awful)
-
May procedures be return values?
(Adds complexity for non-local names)
-
May storage be allocated dynamically
(Duh, all modern languages... but some languages do it with syntax (new) others
with library (malloc))
-
Must storage by deallocated
explicitly (garbage collector?)
Activation Records
Activation records organize the
stack, one record per method/function call.
|
return value
|
|
parameter
|
|
...
|
|
parameter
|
|
previous frame pointer (FP)
|
|
saved registers
|
|
...
|
| FP-->
|
saved PC
|
|
local
|
|
...
|
|
local
|
|
temporaries
|
| SP-->
|
...
|
At any given instant, the live
activation records form a chain and follow a stack discipline. Over the
lifetime of the program, this information (if saved) would form a gigantic
tree. If you remember prior execution up to a current point, you have a big
tree in which its rightmost edge are live activation records, and the
non-rightmost tree nodes are an execution history of prior calls.
Object-oriented programs are the same,
only every activation record has an associated object instance; they need one
extra "register" in the activation record.
Goal-directed programs have an
activation tree each instant, due to suspended activations that may be resumed
for additional results. The lifetime view is a sort of multidimensional tree,
with three types of nodes.
Block Structure, Variable Access
Issues
Given a variable name, how do we
compute its address?
-
globals
-
easy, symbol table lookup
-
locals
-
easy, symbol table gives offset in
(current) activation record
-
objects
-
easy, symbol table gives offset in
object, activation record has pointer to object in a standard location
-
locals in some enclosing
block/method/procedure
-
ugh. Pascal, Ada, and friends offer
their own unique kind of pain. Q: does the current block support recursion?
Example: for procedures the answer would be yes; for nested { { } } blocks in C
the answer would be no.
-
if no recursion, just count back
some number of frame pointers based on source code nesting
-
if recursion, you need an extra
pointer field in activation record to keep track of the "static link", follow
static link back some # of times to find a name defined in an enclosing scope
Garbage Collection
Automatic storage management is one
of the single most important features that makes programming easier.
Basic problem in garbage collection:
given a piece of memory, are there any pointers to it? (And if so, where
exactly are all of them please). Approaches:
Intermediate Code Generation
Goal: list of machine-independent
instructions for each procedure/method in the program. Basic data layout of all
variables.
Can be formulated as syntax-directed
translation
-
add new attributes where necessary,
e.g. for expression E we might have
-
E.place
-
the name that holds the value of E
-
E.code
-
the sequence of intermediate code
statements evaluating E.
-
new helper functions, e.g.
-
newtemp()
-
returns a new temporary variable
each time it is called
-
newlabel()
-
returns a new label each time it is
called
-
actions that generate intermediate
code formulated as semantic rules
|
Production |
Semantic Rules |
| S -> id ASN E
|
S.code = E.code || gen(ASN, id.place, E.place)
|
|