8.Parsing Introduction
Parsing: Deciding if x∈Σ∗ is in L(G) for
some CFG G.
Consider the CFG G:
S→Aa
A→AA∣ABa∣λ
B→BBa∣b∣λ
Is ba in L(G)? Running time?
How do you determine whether a string is in L(G)?
Note: ba is not in L(G) for this G!
Try all possible derivations, but don’t know when to stop.
This runs forever!
Introduction Example (2)
Same grammar without lambda-rules:
Remove λ-rules, then unit productions, and
then useless productions from the grammar G above.
New grammar G′ is:
S→Aa∣a
A→AA∣ABa∣Aa∣Ba∣a
B→BBa∣Ba∣a∣b
Is ba in L(G)? Running time?
Try all possible derivations, there will be at most |w| rounds.
NOTE THIS IS NOT LINEAR TIME, IT TAKES A LONG TIME.
Actual time is |w|∗p where p is the maximum number of
rules for any variable.
Introduction Example (3)
Grammar:
S→Aa∣a
A→AA∣ABa∣Aa∣Ba∣a
B→BBa∣Ba∣a∣b
Consider string baa.
Goal: We would like to only try the rules that give us the
derivation and ignore false paths. This would be fast!
Top-down Parser
Bottom-up Parser
Start with string, work backwards, and try to derive S.
Examples: Shift-reduce, Operator-Precedence, LR Parser
Making Parse Tables
We want to construct simple tables that tell us:
When you are working on this rule, and see this input, do something
We will use the functions FIRST and FOLLOW to aid in
computing parse tables.
Notation that we will use in defining FIRST and FOLLOW.
G=(V,T,S,P)
w,v∈(V∪T)∗
a∈T
X,A,B∈V
XI∈(V∪T)+
The function FIRST
Definition: FIRST(w)= the set of terminals that
begin strings derived from w.
If w∗⇒av then
If w∗⇒λ then
To compute FIRST (1)
1. FIRST(a)={a}
where a is a terminal.
2. FIRST(X) where X is a variable.
(a) If X→aw then
(b) If X→λ then
(c) If X→Aw
and λ∈FIRST(A) then
Everything in FIRST(w) is in FIRST(X)
To compute FIRST (2)
3. In general, FIRST(X1X2X3...XK)=
* FIRST(X1)
* ∪ FIRST(X2) if λ is in
FIRST(X1)
* ∪ FIRST(X3) if λ is in
FIRST(X1)
* ∪ FIRST(XK) if λ is in
FIRST(X1)
and λ is in FIRST(X2)
… and λ is in FIRST(XK−1)
* − {λ} if λ∉FIRST(XJ)
for all J
(where XI represents a terminal or a variable)
To compute FIRST (3)
We will be computing FIRST(w) where w is the
right hand side of a rule.
Thus, we will need to compute FIRST(X) for each
symbol X (either terminal or variable) that appears in the
right hand side of a rule.
Example (1)
L={anbmcn:n≥0,0≤m≤1}
FIRST(B)={b,λ}
Using B→b gives that b is in
FIRST(B).
Using B→λ gives that λ is
in FIRST(B).
FIRST(S)={a,b,λ}
Using S→aSc gives that a is in
FIRST(S).
Using S→B and λ is in
FIRST(B) gives that everything in
FIRST(B) is in FIRST(S), so b
and λ are in FIRST(S).
FIRST(Sc)={a,b,c}
Example (2a)
S→BCD∣aD
A→CEB∣aA
B→b∣λ
C→dB∣λ
D→cA∣λ
E→e∣fE
Example (2b)
FIRST(B)=
FIRST(C)=
FIRST(D)=
FIRST(E)=
FIRST(A)=
FIRST(S)=
The function FOLLOW
Definition: FOLLOW(X)= set of terminals that can
appear to the right of X in some derivation.
(We only compute FOLLOW for variables.)
If S∗⇒wAav then
(where w and v are strings of terminals and
variables, a is a terminal, and A is a variable)
Computing FOLLOW
$ is in FOLLOW(S)
If A→wBv and v≠λ then
FIRST(v)−{λ} is in FOLLOW(B)
If A→wB or
A→wBv and λ is in
FIRST(v) then
FOLLOW(A) is in FOLLOW(B)
λ is never in FOLLOW
Example (1)
Reminder: λ is never in a FOLLOW set.
FOLLOW(S)={$,c}
$ goes into FOLLOW(S) by rule 1.
Then c goes into FOLLOW(S) by rule 2 since
S→aSc and FIRST(c)={c}.
FOLLOW(B)={$,c}
By rule 3 and S→B, FOLLOW(S) is
added to FOLLOW(B).
Example (2a)
S→BCD∣aD
A→CEB∣aA
B→b∣λ
C→dB∣λ
D→cA∣λ
E→e∣fE
Example (2b)
FOLLOW(S)=
FOLLOW(A)=
FOLLOW(B)=
FOLLOW(C)=
FOLLOW(D)=
FOLLOW(E)=
LL(k) Parsing
We discussed this in principle before. Now we want to operationalize it.
Note: A language is not LL(k), a grammar is.
L={aiabci∣i>0}
G1=S→aSc{aaa}
G2=S→aA
G3=S→aaAc
LL parsing process
First, convert CFG to PDA
L={anbbn:n≥0}
The PDA is nondeterministic.
Use lookahead to make it deterministic: determine which rewrite rule to use.
Parsing routine (for this grammar)
symbol
is the lookahead symbol and $ is the end-of-string marker:
state = s
push(S)
state = q
read(symbol) obtain the lookahead symbol
while top-of-stack <> z do while stack is not empty
case top-of-stack of
S: if symbol == a then cases for variables
{ pop(); push(aSb) } replace S by aSb
else if symbol == b then
{ pop(); push(b) } replace S by b
else error
a: if symbol <> a, then error cases for terminals
else { pop(); read(symbol) } pop a, get next lookahead
b: if symbol <> b, then error
else { pop(); read(symbol) } pop b, get next lookahead
end case
end while
pop() pop z from the stack
if symbol <> $ then error
state = f
LL Parse Table
When the grammar is large, the parsing routine will have many cases.
Alternatively, store the information for which rule to apply in
a table.
Rows: variables
Columns: terminals, $ (end of string marker)
LL[i,j]
contains the right-hand-side of a rule.
This right-hand-side is pushed onto the stack when the
left-hand-side of the rule is the variable representing the
i th row and the lookahead is the symbol representing the
j th column.
For any CFG that we can specify by this type of parse table,
we can use a generic parser to determine if strings
are in this language.
Gets rid of use of states
Parse Table Example
Parse table for
ab$SaSbberror
A generic parsing routine
(LL[,]
is the parse table.):
push(S)
read(symbol) obtain the lookahead symbol
while stack not empty do
case top-of-stack of
terminal:
if top-of-stack == symbol
then { pop(); read(symbol) } pop terminal and get next lookahead
else
error
variable:
if LL[top-of-stack, symbol] <> error
then { pop(), pop the lhs
push(LL[top-of-stack,symbol]) } push the rhs
else
error
end case
end while
if symbol <> $, then error
Example
abc$SaSberrorcerror
When S is on the stack and a is the lookahead,
replace S by aSb
When S is on the stack and b is the lookahead,
there is an error (there must be a c between the
a ‘s and b ‘s)
When S is on the stack and $ is the lookahead,
then there is an error (S must be replaced by at
least one terminal)
When S is on the stack, and c is the lookahead,
then S should be replaced by c.
Example
When the grammar has a λ-rule, it
can be difficult to compute parse tables.
In this example,
A can disappear (due to A→λ).
So when S is on the stack, it can be replaced by Ac
if either “a” or “c” are the lookahead, or it can be replaced
by Bc if “b” is the lookahead.
Constructing an LL parse table
1. For each rule A→w
a. For each a in FIRST(w)
b. If λ is in FIRST(w)
add w to LL[A,b] for each b in FOLLOW(A)
where b∈T∪{$}
2. Each undefined entry is an error.
Example (1): Need FIRST and FOLLOW
We have already calculated FIRST and FOLLOW for this Grammar:
FIRSTFOLLOWSa,b,λ$,cBb,λ$,c
Example (2): Compute Parse Table
For S→aSc,
FIRST(aSc)={a}, so add aSc to
LL[S,a]
by step 1a.
For S→B,
FIRST(B)={b,λ}
FOLLOW(S)={$,c}
By step 1a, add B to LL[S,b]
By step 1b, add B to LL[S,c]
and LL[S,$]
For B→b,
FIRST(b)={b}, so by step 1a add b to LL[B,b]
For B→λ
FIRST(λ)={λ} and
FOLLOW(B)={$,c},
so by step 1b
add λ to LL[B,c]
and add λ
to LL[B,$]
.
Example (3): Sample Trace
abc$SaScBBBBerrorbλλ
Parse string: aacc
aaSSBSSccccStack:S_c_c_c_c_c_c_c_symbol:aaa′a′cccc′
where a′ is the second a in the string and symbol
is
the lookahead symbol.
This table is an LL(1) table because only 1 symbol of lookahead is needed.
Example (4): Sample Trace
Trace aabcc
aaSSBbSScccccStack:S_c_c_c_c_c_c_c_c_symbol:aaa′a′bbbcc′
where a′ is the second a in the string and symbol
is the lookahead symbol.
This table is an LL(1) table because only 1 symbol of lookahead is needed.
LL(k) Can’t Parse All CFGs
L={an:n≥0}∪{anbn:n≥0}
S→A
S→B
A→aA
A→λ
B→aBb
B→λ
This grammar cannot be recognized by an LL(k) parser for any
k.
Consider the string aabb.
Need 3 lookahead to realize that we want S→B.
For aaabbb, we need 4 lookahead.
For anbn, we need n lookahead.
Conclusion
There are some CFL’s that have no LL(k) Parser
There are some languages for which some grammars have
LL(k) parsers and some don’t.