6.3. Writing Compilers¶

Part of the definition for a programming language is its grammar. It is critical that the definition for any programming language be unambiguous, otherwise two different compilers can “correctly” compile and execute a program, but give two different results. This is very bad if our program behaves differently just because we compile it with a different compiler!

Traditionally, grammars have been partly (though not entirely) defined using a grammar. Of course, all aspects of the definition must be unambiguous. Sometimes, two compilers might give different outcomes for a program for the simple reason that one of them has a bug. But it sometimes happens that the problem is that the language defintion has an ambiguity. We have already seen that one possible source of ambiguity is the grammar used as part of the definition.

6.3.1. Backus-Naur Form¶

Traditionally, compiler writers have used grammars to define as much of the language as possible. This is because there are very good tools available that can take a grammar and automaticall build a parser and part of the code generator for the language, which saves a lot of programming effort and removes one source of bugs (human error in the programming).

Backus-Naur form is a popular notation to use for writing context-free grammars. It has been around since the late 1950’s and early 1960’s when it was first used in the definition of the grammar for the early programming language ALGOL.

Backus-Naur Form of a grammar:

Nonterminals are enclosed in brackets \(<>\)

For “\(\rightarrow\)” use instead “\(::=\)”

Sample C++ Program::

main () {
  int a;     int b;   int sum;
  a = 40;    b = 6;   sum = a + b;
  cout << "sum is "<< sum << endl;
}

“Attempt” to write a CFG for C++ in BNF (Note: \(<\mbox{program}>\) is start symbol of grammar.)

\[\begin{split}\begin{eqnarray*} <\mbox{program}> &::=& \mbox{main} ()\ <\mbox{block}>\\ <\mbox{block}> &::=& \{\ <\mbox{stmt-list}>\ \}\\ <\mbox{stmt-list}> &::=& <\mbox{stmt}>\ |\ <\mbox{stmt}>\ <\mbox{stmt-list}>\ |\ <\mbox{decl}>\ |\ <\mbox{decl}> <\mbox{stmt-list}> \\ <\mbox{decl}> &::=& \mbox{int}\ <\mbox{id}>\ ;\ |\ \mbox{double}\ <\mbox{id}>\ ; \\ <\mbox{stmt}> &::=& <\mbox{asgn-stmt}>\ |\ <\mbox{cout-stmt}>\\ <\mbox{asgn-stmt}> &::=& <\mbox{id}>\ =\ <\mbox{expr}>\ ;\\ <\mbox{expr}> &::=& <\mbox{expr}>\ +\ <\mbox{expr}>\ |\ <\mbox{expr}>\ *\ <\mbox{expr}>\ |\ (\ <\mbox{expr}>\ )\ |\ <\mbox{id}>\\ <\mbox{cout-stmt}> &::=& \mbox{cout}\ <\mbox{out-list}>\\ \end{eqnarray*}\end{split}\]

etc., Must expand all nonterminals!

So a derivation of the program test would look like:

\[\begin{split}<\mbox{program}> &\Rightarrow&\ \mbox{main} ()\ <\mbox{block}> \\ &\Rightarrow&\ \mbox{main} ()\ \{\ <\mbox{stmt-list}>\ \} \\ &\Rightarrow&\ \mbox{main} ()\ \{\ <\mbox{decl}>\ <\mbox{stmt-list}>\ \} \\ &\Rightarrow&\ \mbox{main} ()\ \{\ \mbox{int}\ <\mbox{id}>\ <\mbox{stmt-list}>\ \} \\ &\Rightarrow&\ \mbox{main} ()\ \{\ \mbox{int}\ \mbox{a}\ <\mbox{stmt-list}>\ \} \\ &\stackrel{*}{\Rightarrow}&\ \mbox{complete C++ program}\end{split}\]

This problem asks you to provide a characterization in English of the language generated by a BNF grammar. After you finish it, there is one more problem about Extended Backus-Naur Form, which is described before the problem.

6.3.2. Extended BNF¶

The symbols we have used in our representation of grammars collectively comprise what is known as Backus-Naur Form (BNF). In Extended Backus-Naur Form (EBNF) we add five meta-symbols to those already used in BNF notation:

Kleene closure operator \(*\), which means “zero or more”. Hence if \(<fn\_name>\) were a non-terminal representing a valid function name and \(<argument>\) were a non-terminal representing a valid argument, then the EBNF notation for function calls with zero or more arguments (with no commas between them) would be

\[<fn\_name> "(" <argument>* ")"\]

Positive closure operator \(+\). The EBNF notation for function calls that must have at least one argument would be

\[<fn\_name> "(" <argument>+ ")"\]

The two paired parenthesis symbols \(( \; )\), which are used for grouping. For example, if \(<positive\_number>\) were the non-terminal denoting a valid positive number, then the following EBNF would dictate that we must have a plus or minus sign preceding a number

\[(+ | -) <positive\_number>\]

The “optional operator” \(?\), which specifies that you can have zero or one of whatever grammatical structure precedes the operator. For example, if our language allowed an optional plus or minus sign in front of a number, we would use the EBNF

\[(+ | -)? <positive\_number>\]

EBNF is used to reduce the number of productions a grammar needs to specify a language. However, it does not increase the expressive power of grammars, that is, any grammatical structure that can be expressed in EBNF can also be expressed in BNF if one is willing to use more productions.

This last problem is about the equivalence between a given BNF grammar (the same one as in part 4 above) and a smaller EBNF grammar.

More on CFG for C++

Last time we “attempted” to write a CFG for C++, it is possible to write a CFG that recognizes all syntactically correct C++ programs, but there is a problem that the CFG also accepts incorrect programs. For example, it can’t recognize that it is an error to declare the same variable twice, once as an integer and once as a char.

We can write a CFG \(G\) such that \(L(G) = \{ \mbox{syntactically correct C++ programs} \}\).

But note that \(\{ \mbox{semantically correct C++ programs} \} \subset L(G)\).

Another example: Can’t recognize if formal parameters match actual parameters in number and type:

declare: int Sum(int a, int b, int c) …

call: newsum = Sum(x,y);

CS4114 Formal Languages and Automata, Spring 2023

Chapter 6 Context-Free Grammars and Languages

6.3. Writing Compilers¶

6.3.1. Backus-Naur Form¶

6.3.2. Extended BNF¶