March 26

Parser Error Handling and Recovery

Keeping Things Straight

As mentioned last time, there are three things we need to know at least a little bit about when we begin doing semantic analysis in our compiler project. If we were not simultaneously learning about compiling and doing a project, we could perhaps study these things in succession. However, since we are writing a compiler at the same time as we are learning about compiling, we need to get at least an overview of the main issues to proceed.

The three things we have already mentioned are

symbol table construction
run time memory management
translation of source code to IR

There is one more thing we will now add to this list

Error Handling

We give a quick review here before proceeding.

The symbol table (section 6.3 in the textbook) is needed to hold semantic information about identifiers encountered during the program, including their lexeme, type (e.g., integer, real, array), kind (e.g., variable, constant, etc.), and other attributes (e.g., parameter). Scoping is represented by, for example, having a different symbol table for each scope and maintaining these as a stack of symbol tables as parsing progresses, pushing the new symbol table when a scope is entered and popping the symbol table when the scope is exited. One thing that is needed for eventual translation is the memory location where each identifier represented in the symbol table will eventually be stored when the translated program is run on the computer.

Run time memory management (chapter 7 in the textbook) gives us a model for memory that allows us to finish construction of the symbol table. In other words, this model lets us envision where each identifier will be eventually represented in memory. So, with this model in mind, we can include an entry in the symbol table for each identifier that indicates where that identifier will be in memory, so that we can generate the right addresses during translation. The most common memory management model uses a run time stack for managing variables. Each scope has its own activation record, which is pushed onto the stack when the scope is entered, and popped off when the scope is exited. Variables are located at fixed offsets from the start of the activation record for the scope.

In translating the source code to IR (covered differently in the book and not in depth), the semantic analyzer uses the symbol table to do type checking (to see that identifiers have been declared and used properly), and does the translation using the run time memory address information stored in the symbol table to generate correct addresses for the identifiers.

Error handling (not clearly identified in separate sections in the textbook) is the topic of this lecture.

Error Handling

Error handling refers to two things: detection of an error and recovering from an error. There are three points at which errors can be encountered and resolved:

During Scanning
During Parsing
During Semantic Analysis

Once semantic analysis is complete, errors should not occur, because an entire translation of the original source program into equivalent IR form has been accomplished without error. There should be no possibility of encountering errors when translating IR into machine code. This does not, of course, mean that the translated program is correct. That is still the responsibility of the programmer.

Recovering from Scanner Detected Errors

The method we have applied for recovering from scanner errors is to raise an exception that is noted by the parser and then restart scanning with the next input character following the one that started the string in error when the scanner is called again. Another approach discussed was to restart the scanning with the character immediately following the token that the scanner got stuck on (i.e., skip to the next white space character). The first method seems better, because we may be able to scan more valid tokens. In any case, we

encounter an error
report the error
recover from the error using one of the methods
set a flag to turn off IR generation

The last point is important. There is no reason to be generating a translation of a program that has an error in it.

Recovering from Parser Detected Errors

In general, there are three approaches:

Catastrophic Mode: In this case, when an error is detected, an exception is raised, a message printed about where the error occurred and what it might be, and the entire compilation process is terminated.

Panic Mode: In this case, when an error is detected, the parser raises an exception that causes an error message to be printed about where the error occurred and what it might be, and also causes scanning to proceed to some delimiter, such as a semicolon, at which point parsing continues with the new construct. Just one delimiter can be chosen (e.g., the semicolon) as the synchronization token, or for more sophisticated approaches, a set of synchronizing tokens can be established based on the construct that is being parsed at the moment. For example, if an error is detected during the parse of an assignment statement, parsing would begin again with the statement following the next encountered semicolon if the semicolon is the synchronizing token. In this case, exceptions should be raised (thrown) and handled (caught) in the procedure where the synchronizing token is found so that parsing can be restarted at that point.

Non-Panic Mode: In this mode, the parser pretends not to panic quite so much and attempts to restart the parse at some point sooner than the next delimiter (after raising an exception and posting an appropriate error message).

Error Repair Mode: In this instance, the parser actually tries to repair the code so that the construct in which the error was found can continue to be parsed. Of course, the repair is likely not correct. The point is not to produce a correct program (although there were early attempts to do this) but to be able to continue the parsing process and be able to parse as much of the program as possible. So, for example, a common repair technique is to insert a semicolon wherever it might appear that one is missing.

Example of Panic Mode Error Recovery

Suppose you are parsing the statement

a := b * 4 d * (a + b);

In this case there appears to be an operator missing between c and 4. In this instance, our parse would look something like

Notice that we are trying to expand <factor_tail>, but the lookahead is an id. This, according to our LL(1) table, is impossible, so we have an error.

In panic mode error recovery, we basically throw this statement away and try to parse the next statement instead. In other words, we back up to the most recent <statement> nonterminal in our tree, skip past all tokens until we see the semicolon, and then restart the parse from <statement> according to the rule

<statement_tail> --> ; <statement> <statement_tail>

(Of course, this only applies to this grammar.)

In our parser, this same thing can be accomplished by rasing a Syntax_Error exception any time while parsing from procedure Statement. This exception is then propagated back up until procedure Statement is reached. At that point, the exception is handled by printing a proper error message, such as

Expected an operator or delimiter, but found an identifier at line 35 column 33. Skipping until after next semicolon.

Then the exception handler would repeatedly call the scanner until the next token after the next semicolon is reached. Finally, the exception handler would then make sure that procedure Statement begins afresh at this point with this new lookahead. So, if we were parsing the three statements

     .
     .
     .
read (a, b, d);
a := b * 4 d * (a + b);
write(a);
     .
     .
     .

the read statement would parse properly. The parser would then encounter an error in processing the assignment statement (as shown above). If the panic mode error recovery is used and the synchronizing token is the semicolon, the tokens d, *, (, a, +, b, and ) would be scanned and skipped. Parsing would then restart at the semicolon in the above rule for statement tail, and the call to <statement> would then result in a proper parse for the write statement.

Cascading Errors

All of the error recovery modes except the catastrophic mode can lead to a phenomenon called "error cascading." For example, in attempting to restart a parse at some point after an error is encountered may lead to yet more errors, because a proper restarting point was not chosen. Error repair is often not correct, either, yielding more errors as an incorrectly repaired construct is further parsed. As an example, consider an assignment statement with a missing operator:

Malt := Oranges + Grapes  Sugar * Yeast;

In this case, catastrophic mode would terminate parsing after the scanner returned the ID Sugar. Panic mode with scanning restarting after the semicolon would work pretty well in this case. Non-panic mode might restart parsing with Sugar, assuming that Sugar is the start of a new statement, which would lead to cascading errors (because then it might try to restart with Yeast after the next error is encountered, and so on). Error repair might try to fix the construct by inserting a semicolon after Grapes and treating Sugar as the start of a new assignment statement, again leading to cascading errors.

Semantic Errors

Semantic errors are those encountered when a semantic action routine is running. The most common of these are errors related to the symbol table such as attempts to declare an identifier twice, or referring to an identifier that has not been declared. Probably the best way to handle such errors is to raise an exception that is propagated all the way out to the level in the parser where the semantic information of this semantic action (that now is in error) is no longer needed. This can also be understood by thinking of semantic records. Suppose that in processing an an arithmetic expression an undeclared variable is encountered. In this case, instead of passing back a semantic record with the lexeme corresponding to the variable, an error semantic record is returned. This error record means that IR cannot be produced for the operation that this variable was a part of, so instead of an intermediate value being returned as the semantics of the operation, an error record is further returned. If this expression is part of an assignment statement, then the error record continues to be passed back until the parser moves on to parsing the next statement (where no semantic record is needed anyway). Consider the example

X := A + B * C;

where C is an undeclared variable. In this case, the action #process_id(ID_Rec) would discover this fact and return an error in ID_Rec (that is, it would return the variant of the record that is an error record, or it would accomplish the same thing by raising an exception). The semantic action for gen_infix should at this point be called by the parser to do B*C. However, with one valid semantic record and one error record, gen_infix would just return an error record itself, representing the intermediate value for B*C. Similarly, gen_infix for A+T, where T is the semantic record for the temporary value (in this case, an error semantic record) would return an error record. Finally, gen_assign would have a semantic record for X and the error semantic record for the right hand side, and so it, too, would not generate code. In this case, no semantic record is returned from gen_assign, so the parser just continues, by parsing the next statement normally. Similar things can be accomplished with exceptions, in which the exception is passed back until the parser starts on the next statement.

Turning Off IR Generation

Regardless of where an error is encountered, in the scanner, the parser, or the semantic analyzer, it is best to turn off IR generation even though the parsing process continues (as well as static semantic checking, such as building the symbol table and checking whether variables are declared before use), because the generated IR would likely be incorrect. Other approaches include continuing to generate IR, but deleting the IR file before it can be run through an interpreter, or not allowing the code generation phase to proceed.