Context Free Grammars and Parsing
Recall the primary task of the parser: construct a parse tree
for an input string (program). There are two ways to construct a parse tree:
- Directly. In this case an actual parse tree is
constructed as a data structure. This is the method we will normally use in class for
describing how a parse proceeds. An actual tree is also maintained by more complex
compilers (usually those requiring multiple passes).
- Indirectly. In this case we don't actually construct the
tree as a data structure with nodes and links; instead we do the parsing using other
methods, such as the use of a stack (recall that for every context free language there is
a pushdown automaton that can parse---accept or reject---any string submitted to it).
This method is used most often in one-pass compilers, which don't need to keep the
entire parse tree around for later passes on the way to producing target machine code.
In our class, then, we will be using actual parse trees to describe the
parsing process and indirect methods to describe how a program can be designed to parse.
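As a small illustration of the indirect approach, here is a stack-based recognizer for
the balanced-parenthesis language, whose grammar is S --> ( S ) S | lambda. This is a
sketch (the language choice is ours, not part of the course definition): it accepts or
rejects its input in the spirit of a pushdown automaton, without ever building a tree.

```python
def accepts(s):
    """Stack-based (PDA-style) recognizer for balanced parentheses.

    Accepts or rejects the string without building a parse tree,
    in the spirit of the indirect method described above.
    """
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)       # push an open paren
        elif ch == ")":
            if not stack:          # a close paren with nothing to match
                return False
            stack.pop()            # match the most recent open paren
        else:
            return False           # not in the alphabet
    return not stack               # accept only if every paren was matched
```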
Starting with Standard EBNF
As usual, we start the task of designing a parser with the formal
definition of a language, normally in EBNF. Note the following things:
- EBNF is generally for human consumption. It is easier to read
than a standard context free grammar.
- EBNF is often not in the desired form for parsing.
- EBNF must often be converted to an equivalent context free grammar form in
order to use it for a hand written compiler, although there are techniques
used in recursive descent compilers that closely follow the EBNF rules.
- Once the EBNF is converted to an equivalent context free grammar, the
grammar must usually be modified for use with a particular parsing
technique.
- EBNF-like notation is accepted as input by some parser generator tools,
although the well-known tool yacc actually takes its grammar in a plain
BNF style. Even then, the grammar must often be
manipulated to make it usable for a particular parsing strategy.
For example, the EBNF rule
IfStatement = "if" BooleanExpression "then" Statement [ "else" Statement ]
is equivalent to the two context free grammar rules
<IfStatement> --> if <BooleanExpression> then <Statement>
<IfStatement> --> if <BooleanExpression> then <Statement> else <Statement>
Here, we put angle brackets (< and >) around nonterminals and make terminals (tokens) bold.
One could also use the following context free grammar rules, which are also equivalent to the above
single EBNF rule:
<IfStatement> --> if <BooleanExpression> then <Statement> <IfTail>
<IfTail> --> else <Statement>
<IfTail> --> lambda
In other words, there is no unique set of context free grammar rules that correspond to a set of
EBNF rules, but every EBNF grammar can be expressed as a context free grammar.
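A hand-written parser can follow the second rule set almost literally. The sketch below
is ours, not part of the language definition: to keep it short, we reduce boolean
expressions and statements to single hypothetical tokens and have each routine return
the index of the next unconsumed token. Note how <IfTail> decides between its two rules
by inspecting a single token.

```python
def parse_if(tokens, i):
    # <IfStatement> --> if <BooleanExpression> then <Statement> <IfTail>
    assert tokens[i] == "if"
    i += 1              # consume "if"
    i += 1              # <BooleanExpression>: a single token in this sketch
    assert tokens[i] == "then"
    i += 1              # consume "then"
    i += 1              # <Statement>: a single token in this sketch
    return parse_if_tail(tokens, i)

def parse_if_tail(tokens, i):
    # <IfTail> --> else <Statement>   |   lambda
    if i < len(tokens) and tokens[i] == "else":
        i += 1          # consume "else"
        i += 1          # <Statement>: a single token in this sketch
    return i            # the lambda alternative consumes nothing
```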
Similarly, the EBNF rule
StatementSequence = Statement { ";" Statement }
can be expressed by the following two context free grammar rules:
<StatementSequence> --> <Statement> <MoreStatements>
<MoreStatements> --> ; <Statement> <MoreStatements>
<MoreStatements> --> lambda
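The repetition rules translate just as directly. In this sketch (again ours, with
statements reduced to the single hypothetical token stmt), <MoreStatements> keeps
consuming "; <Statement>" pairs until no semicolon remains, which realizes the
lambda alternative.

```python
def parse_statement_sequence(tokens, i):
    # <StatementSequence> --> <Statement> <MoreStatements>
    i = parse_statement(tokens, i)
    return parse_more_statements(tokens, i)

def parse_more_statements(tokens, i):
    # <MoreStatements> --> ; <Statement> <MoreStatements>   |   lambda
    if i < len(tokens) and tokens[i] == ";":
        i = parse_statement(tokens, i + 1)   # consume ";" then a statement
        return parse_more_statements(tokens, i)
    return i                                  # the lambda alternative consumes nothing

def parse_statement(tokens, i):
    # a statement is a single token in this sketch
    assert tokens[i] == "stmt"
    return i + 1
```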
This means that the team designing the parser must often fiddle with the
EBNF to turn it into a CFG, and then further fiddle with the CFG to ensure that it has the
properties necessary for the chosen parsing scheme.
For example, we don't want ambiguity in our grammar, so we modify ambiguous
grammars to make them unambiguous (if possible). We do similar things with
other undesirable grammar properties
that we may encounter (to be discussed later). There are a few hurdles, however.
- Some context free languages are inherently ambiguous. This just
means that there are no unambiguous grammars for these languages.
- If some small parts of a programming language are inherently ambiguous,
as is usual when practice meets theory, we just find a workaround for those small pieces.
- One aspect of programming languages that cannot be captured in
any context free grammar at all is the requirement that variables be used only in the
scope of their declarations. Since a declaration may appear arbitrarily far before
a use of the variable, this information is context sensitive and cannot be captured
in a context free grammar. The symbol table is the workaround for this problem.
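A minimal sketch of the symbol-table workaround (the event encoding here is an
assumption for illustration): the compiler records each declared name in the table and
checks every use against it, enforcing the declare-before-use rule that no context free
grammar can express.

```python
def check_declared(events):
    """events is a list of ("declare", name) / ("use", name) pairs in
    source order; raises NameError on a use with no earlier declaration."""
    symtab = set()                    # the symbol table: declared names so far
    for kind, name in events:
        if kind == "declare":
            symtab.add(name)
        elif name not in symtab:
            raise NameError(f"{name} used before declaration")

check_declared([("declare", "x"), ("use", "x")])   # fine
# check_declared([("use", "y")])                   # would raise NameError
```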
Approaches to Constructing a Parse Tree
The changes we must make to a grammar to get it into a form ready for
parsing depend on the parsing method chosen.
There are two major methods:
Top Down Parsing
Top-down. In this case the parse proceeds from
the root (the start symbol of the grammar). At each step in the parse, the leaves of
the current parse tree are scanned from left to right, and the leftmost nonterminal is the
next to be expanded using a rule from the grammar.
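One step of a top-down parse can be mechanized as follows. The sketch replaces the
leftmost nonterminal (here, any symbol wrapped in angle brackets) in a sentential form
with a chosen right hand side; the two-rule toy fragment in the derivation is an
assumption, echoing the expression grammars later in these notes.

```python
def expand_leftmost(form, lhs, rhs):
    """Replace the leftmost nonterminal in `form` using the rule lhs --> rhs.
    Nonterminals are exactly the symbols wrapped in angle brackets."""
    i = next(k for k, sym in enumerate(form) if sym.startswith("<"))
    assert form[i] == lhs, f"leftmost nonterminal is {form[i]}, not {lhs}"
    return form[:i] + rhs + form[i + 1:]

# A leftmost derivation of  identifier + identifier  in a toy fragment:
form = ["<expression>"]
form = expand_leftmost(form, "<expression>", ["<expression>", "+", "<term>"])
form = expand_leftmost(form, "<expression>", ["<term>"])
form = expand_leftmost(form, "<term>", ["identifier"])
form = expand_leftmost(form, "<term>", ["identifier"])
```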
Bottom Up Parsing
Bottom-up. In this case the parse proceeds from
the tokens back toward the root. At each step in the parse, the unattached nodes
(nodes with no parent) are examined to see whether some contiguous run of them matches
the right hand side of a rule in the grammar. When the correct right hand side is found,
a node for its left hand side is inserted into the tree, and the matched nodes (the
right hand side elements of the rule) are attached to it as children.
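This is often sketched as shift-reduce parsing: shift tokens onto a stack, and whenever
the top of the stack matches a right hand side, reduce it to the rule's left hand side.
The toy grammar below is an assumption chosen so that greedy reduction works; real
bottom-up parsers use a state machine and the lookahead token to decide when to reduce.

```python
def shift_reduce(tokens):
    """Shift-reduce recognizer for the toy grammar
         <E> --> <E> + id
         <E> --> id
    Returns True if the whole input reduces to <E>."""
    stack = []
    for tok in tokens:
        stack.append(tok)                   # shift the next token
        while True:                         # reduce while a handle is on top
            if stack[-3:] == ["<E>", "+", "id"]:
                stack[-3:] = ["<E>"]        # reduce by  <E> --> <E> + id
            elif stack[-1:] == ["id"]:
                stack[-1:] = ["<E>"]        # reduce by  <E> --> id
            else:
                break
    return stack == ["<E>"]
```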
Practical Parsing Considerations
As with all programming tasks, however, we want to make the task of parsing as
efficient as possible. In particular, we want to be able to build parse trees under
the following constraints:
- No backtracking! We do not want to build a parser
that tries to apply certain rules in constructing the parse tree, only to discover later
that different rules should have been applied, causing the parser to backtrack in order to
deconstruct the tree built so far and reconstruct it with different rules. For
example, there may be three rules with <expression> on the left. We want to be
able to choose the correct rule to use in a particular parse without being required to try
all three to see which one (if any) works.
- One lookahead token! We want to be able to
proceed as far as possible in a non-backtracking parse knowing just the next token seen by
the scanner. The parser should not be required to examine many tokens at one time in
order to determine which rule to apply in the parse tree.
Meeting the Objectives of Practical Parsing
Remember, there are many different grammars for a single programming language.
Not all of them will fit our objectives of parsing with no backtracking and one
lookahead token. This means that care must be taken in constructing a grammar that
meets the objectives. The process proceeds as follows:
- Start with the formal EBNF definition of the programming language.
- Turn the EBNF into context free grammar form.
- Ensure that the rules of the grammar fit one of two different objectives:
- Top-down parsing with no backtracking and one token of lookahead (a grammar that has
these characteristics is called an LL(1) grammar).
- Bottom-up parsing with no backtracking and one token of lookahead (a grammar that
has these characteristics is called an LR(1) grammar).
Examples
Try some parses with the following three grammars.
Grammar 1:
<expression> --> <expression> - <expression>
<expression> --> <expression> + <expression>
<expression> --> <expression> * <expression>
<expression> --> <expression> / <expression>
<expression> --> <expression> ^ <expression>
<expression> --> identifier
<expression> --> integer_literal
<expression> --> ( <expression> )
Try to parse a + b * c and a * b + c using this grammar. Notice that
the grammar is ambiguous. There is more than one way to do each of these
parses. The result is that operator precedence is not properly handled
here.
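The ambiguity can be made concrete by encoding the two parse trees for a + b * c as
nested (op, left, right) tuples (an encoding we choose here for illustration): both
distinct trees have exactly the tokens of a + b * c as their fringe, so the grammar
gives the string two different parses.

```python
def fringe(tree):
    """Left-to-right leaves of a parse tree encoded as (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return fringe(left) + [op] + fringe(right)
    return [tree]

tree1 = ("+", "a", ("*", "b", "c"))   # * applied first: a + (b * c)
tree2 = ("*", ("+", "a", "b"), "c")   # + applied first: (a + b) * c

# Two different trees, one and the same token string:
assert tree1 != tree2
assert fringe(tree1) == fringe(tree2) == ["a", "+", "b", "*", "c"]
```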
Grammar 2:
<expression> --> <expression> + <term>
<expression> --> <expression> - <term>
<expression> --> <term>
<term> --> <term> * <factor>
<term> --> <term> / <factor>
<term> --> <factor>
<factor> --> <factor> ^ <primary>
<factor> --> <primary>
<primary> --> identifier
<primary> --> integer_literal
<primary> --> ( <expression> )
Try parsing a + b * c and a * b + c with this grammar as well. The
grammar is certainly uglier in terms of readability than the first grammar,
but there is only one way to do each of these parses. In fact, this
grammar is unambiguous. So, this grammar is better for programmed
parsers, even though it is less clear on first glance for a human.
Grammar 2 also has the advantage that it captures operator precedence
correctly. It is also LR(1), which means that it can be parsed bottom up with no
backtracking and one token of lookahead.
Grammar 3:
<expression> --> <term> <expression_tail>
<expression_tail> --> + <term> <expression_tail>
<expression_tail> --> - <term> <expression_tail>
<expression_tail> --> lambda
<term> --> <factor> <term_tail>
<term_tail> --> * <factor> <term_tail>
<term_tail> --> / <factor> <term_tail>
<term_tail> --> lambda
<factor> --> <primary> <factor_tail>
<factor_tail> --> ^ <primary> <factor_tail>
<factor_tail> --> lambda
<primary> --> identifier
<primary> --> integer_literal
<primary> --> ( <expression> )
Try parsing a + b * c and a * b + c with this grammar as well. It is
even harder to read than the first two grammars, but again there is only one
way to do each of these parses: this grammar, too, is unambiguous, making it
better for programmed parsers even though it is less clear on first glance
for a human.
Grammar 3 also has the advantage that it captures operator precedence
correctly. It is also LL(1), which means that it can be parsed top down with no backtracking
and one token of lookahead.
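Because Grammar 3 is LL(1), each tail nonterminal can choose its rule by examining
just the one next token. The sketch below is a recursive-descent parser that follows
the grammar rule for rule, returning the parse as nested (op, left, right) tuples;
the tokenization (one symbol per list element) is an assumption made for brevity.

```python
class Parser:
    """Recursive-descent parser for Grammar 3, one routine per nonterminal."""

    def __init__(self, tokens):
        self.toks = list(tokens) + ["$"]   # "$" marks end of input
        self.i = 0

    def peek(self):
        return self.toks[self.i]           # the single lookahead token

    def eat(self, tok):
        assert self.peek() == tok, f"expected {tok}, got {self.peek()}"
        self.i += 1

    def expression(self):
        # <expression> --> <term> <expression_tail>
        return self.expression_tail(self.term())

    def expression_tail(self, left):
        # <expression_tail> --> + <term> <expression_tail> | - ... | lambda
        if self.peek() in ("+", "-"):
            op = self.peek()
            self.eat(op)
            return self.expression_tail((op, left, self.term()))
        return left                        # lambda

    def term(self):
        # <term> --> <factor> <term_tail>
        return self.term_tail(self.factor())

    def term_tail(self, left):
        # <term_tail> --> * <factor> <term_tail> | / ... | lambda
        if self.peek() in ("*", "/"):
            op = self.peek()
            self.eat(op)
            return self.term_tail((op, left, self.factor()))
        return left                        # lambda

    def factor(self):
        # <factor> --> <primary> <factor_tail>
        return self.factor_tail(self.primary())

    def factor_tail(self, left):
        # <factor_tail> --> ^ <primary> <factor_tail> | lambda
        if self.peek() == "^":
            self.eat("^")
            return self.factor_tail(("^", left, self.primary()))
        return left                        # lambda

    def primary(self):
        # <primary> --> identifier | integer_literal | ( <expression> )
        if self.peek() == "(":
            self.eat("(")
            e = self.expression()
            self.eat(")")
            return e
        t = self.peek()                    # identifier or integer literal
        self.eat(t)
        return t

p = Parser(["a", "+", "b", "*", "c"])
print(p.expression())   # ('+', 'a', ('*', 'b', 'c'))  -- * binds tighter than +
```

Note that each tail routine inspects only peek() before committing to a rule, which is
exactly the no-backtracking, one-lookahead behavior the LL(1) property guarantees.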