Lemma 1: For each regular expression r, there is a finite state automaton M such that L(M) = L(r).
Proof: By construction (that is, an algorithm is given that converts any instance of a regular expression into a finite state automaton that recognizes exactly the set that the regular expression denotes, or stands for):
                       conversion algorithm
regular expression r ----------------------> finite state automaton M
The implication of this lemma is that finite state automata are at least as powerful as regular expressions, because for each regular expression we can construct a finite state automaton that recognizes the set of strings the regular expression stands for. One might wonder whether finite state automata can be constructed that recognize languages that cannot be expressed by regular expressions. The next lemma gives the answer.
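The construction behind Lemma 1 can be sketched in code. The following is a minimal illustration of Thompson's construction, one standard way to carry out the proof; it is not a full regular-expression parser, and all class and method names are illustrative. Each operator (single symbol, concatenation, union, Kleene star) maps to a small NFA fragment with one start state and one accept state, glued together with epsilon moves.

```java
import java.util.*;

// Sketch of the Lemma 1 construction (Thompson's construction).
// Each operator yields an NFA fragment with one start and one accept
// state; '\0' is used here as the label for an epsilon move.
public class Thompson {
    static final char EPS = '\0';   // epsilon label
    static int next = 0;            // fresh-state counter

    record Edge(int from, char label, int to) {}
    record Frag(int start, int accept, List<Edge> edges) {}

    static Frag symbol(char c) {
        int s = next++, a = next++;
        List<Edge> e = new ArrayList<>();
        e.add(new Edge(s, c, a));
        return new Frag(s, a, e);
    }

    static Frag concat(Frag f, Frag g) {             // rs
        List<Edge> e = new ArrayList<>(f.edges());
        e.addAll(g.edges());
        e.add(new Edge(f.accept(), EPS, g.start())); // glue f's accept to g's start
        return new Frag(f.start(), g.accept(), e);
    }

    static Frag union(Frag f, Frag g) {              // r|s
        int s = next++, a = next++;
        List<Edge> e = new ArrayList<>(f.edges());
        e.addAll(g.edges());
        e.add(new Edge(s, EPS, f.start()));
        e.add(new Edge(s, EPS, g.start()));
        e.add(new Edge(f.accept(), EPS, a));
        e.add(new Edge(g.accept(), EPS, a));
        return new Frag(s, a, e);
    }

    static Frag star(Frag f) {                       // r*
        int s = next++, a = next++;
        List<Edge> e = new ArrayList<>(f.edges());
        e.add(new Edge(s, EPS, f.start()));          // enter the loop
        e.add(new Edge(f.accept(), EPS, f.start())); // repeat
        e.add(new Edge(s, EPS, a));                  // skip entirely
        e.add(new Edge(f.accept(), EPS, a));         // leave the loop
        return new Frag(s, a, e);
    }

    // Standard NFA simulation: track the set of states reachable so far.
    static Set<Integer> closure(Set<Integer> states, List<Edge> edges) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> seen = new HashSet<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (Edge e : edges)
                if (e.from() == q && e.label() == EPS && seen.add(e.to()))
                    work.push(e.to());
        }
        return seen;
    }

    static boolean accepts(Frag m, String input) {
        Set<Integer> cur = closure(Set.of(m.start()), m.edges());
        for (char c : input.toCharArray()) {
            Set<Integer> step = new HashSet<>();
            for (Edge e : m.edges())
                if (cur.contains(e.from()) && e.label() == c) step.add(e.to());
            cur = closure(step, m.edges());
        }
        return cur.contains(m.accept());
    }
}
```

For example, the fragment built by star(union(symbol('a'), symbol('b'))) recognizes exactly the strings of a's and b's, including the empty string, which is the language of (a|b)*.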
Lemma 2: For each finite state automaton M there is a regular expression r such that L(r) = L(M).
Proof: By construction (that is, an algorithm is given that converts any instance of a finite state automaton into a regular expression that denotes exactly the set of strings that the finite state automaton recognizes):
                          conversion algorithm
finite state automaton M ----------------------> regular expression r
Together, Lemmas 1 and 2 imply that regular expressions and finite state automata are equivalent in expressive power.
Lemma 3: For each nondeterministic finite state automaton (nfsa) M there is a deterministic finite state automaton (dfsa) M' such that L(M') = L(M).
Proof: By construction.
         nfsa-to-dfsa conversion algorithm
nfsa M -----------------------------------> dfsa M'
Again, the theoreticians have not only given us a theorem stating that deterministic and nondeterministic finite state automata are equivalent; as part of proving it, they have given us an actual algorithm for converting any nfsa into an equivalent dfsa.
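The nfsa-to-dfsa conversion is usually carried out with the subset construction: each dfsa state is the *set* of nfsa states the machine could currently be in. The sketch below is illustrative, not a complete tool; the encoding of the nfsa as nested maps is an assumption of this example, and for brevity it assumes an nfsa without epsilon moves.

```java
import java.util.*;

// Sketch of the subset construction. Each dfsa state is a set of nfsa
// states; nfa.get(q).get(c) = the set of states reachable from q on c.
public class SubsetConstruction {
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            Map<Integer, Map<Character, Set<Integer>>> nfa,
            Set<Integer> start, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            Set<Integer> state = work.pop();
            if (dfa.containsKey(state)) continue;   // already expanded
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char c : alphabet) {
                Set<Integer> target = new TreeSet<>();
                for (int q : state)
                    target.addAll(nfa.getOrDefault(q, Map.of())
                                     .getOrDefault(c, Set.of()));
                row.put(c, target);
                work.push(target);                  // expand later if new
            }
            dfa.put(state, row);
        }
        return dfa;
    }
}
```

Run on the classic epsilon-free nfsa for (a|b)*abb (start state 0, accept state 3), this produces a dfsa with four subset-states: {0}, {0,1}, {0,2}, and {0,3}.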
Lemma 4. For each deterministic finite state automaton M there is a minimal-state deterministic finite state automaton M' with L(M') = L(M).
Proof: By construction. Once again, the theoreticians have not only proved the existence of a minimal-state finite state automaton, but they have also given us the algorithm for making such an fsa.
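One common way to carry out the Lemma 4 construction is partition refinement: start from the split into accepting and non-accepting states, then keep splitting any block whose members disagree on which blocks their transitions lead to. The sketch below is illustrative; it assumes a DFA encoded as delta[state][symbol] -> state with all states reachable and all transitions defined, and it returns only the number of states in the minimal dfsa rather than building the machine itself.

```java
import java.util.*;

// Sketch of DFA minimization by partition refinement.
public class Minimize {
    static int countMinStates(int[][] delta, boolean[] accepting) {
        int n = delta.length;
        int[] block = new int[n];                  // block id of each state
        for (int q = 0; q < n; q++) block[q] = accepting[q] ? 1 : 0;
        while (true) {
            // a state's signature: its block plus the blocks it moves to
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int q = 0; q < n; q++) {
                List<Integer> sig = new ArrayList<>();
                sig.add(block[q]);
                for (int target : delta[q]) sig.add(block[target]);
                Integer id = ids.get(sig);
                if (id == null) { id = ids.size(); ids.put(sig, id); }
                refined[q] = id;
            }
            if (Arrays.equals(block, refined)) break;  // partition is stable
            block = refined;
        }
        return (int) Arrays.stream(block).distinct().count();
    }
}
```

For example, a four-state machine made of two identical copies of an "even number of a's" recognizer collapses to two states, since each state in one copy is indistinguishable from its twin in the other.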
                              Scanner Generator
formal definition of tokens -------------------> Automatic Scanner in Java
In other words, we want to be able to write down all of the regular expressions for the tokens in the programming language we are trying to scan and have a program produce an actual scanner (say, in Java). As usual, when practice meets theory, we have some additional work to do. For example, the minimal finite state automata needed for scanning can be generated by the algorithms above. What the scanner generator can't do without some help from us is decide what to do when it finds a token. So the formal definition of tokens given as input to the scanner generator must have a special form that not only gives the regular expressions for the tokens but also gives directions on what to do when each token is scanned. Input to a scanner generator might look like the following:
** Part 1, Auxiliary Expressions
Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Letter ::= a | A | b | B | c | C ... | z | Z
.
.
.
** Part 2, Tokens
'(' return(l_paren)
')' return(r_paren)
':=' return(assign_op)
letter {letter | digit} return(id)
digit {digit} return(integer_literal)
"'" {^("'" | "''" | eol)} "'" Strip_Quotes;
return(string_literal)
.
.
.
** Part 3, Special Procedures
procedure Strip_Quotes is
begin -- Strip_Quotes
-- Strip the beginning and ending quotes
-- and replace all double apostrophes in
-- the string with a single apostrophe.
-- Be sure to build up the lexeme in the
-- process.
end Strip_Quotes;
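The notes above give Strip_Quotes only as a commented skeleton. Below is one possible body, written in Java since the generated scanner is assumed to be in Java; the class and method names are illustrative.

```java
// One possible body for Strip_Quotes. Given a scanned string literal
// such as 'don''t', drop the outer quotes, then collapse each doubled
// apostrophe into a single one, building up the final lexeme.
public class StripQuotes {
    static String stripQuotes(String lexeme) {
        String inner = lexeme.substring(1, lexeme.length() - 1); // drop quotes
        return inner.replace("''", "'");                         // '' -> '
    }
}
```

So the lexeme 'don''t' becomes don't, which is the value the rest of the compiler should see.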
Note that the generated scanner automatically keeps track of the lexeme scanned, and perhaps other information such as the line and column numbers of the start of the lexeme. These may be kept in public attributes that can be accessed by whatever program uses the scanner. They are different from the actions shown above, such as Strip_Quotes, which specify extra work to be done when a string is successfully scanned.

Here we have given all of the items necessary for a scanner to be generated. You already know a standard format for building a scanner given all of the fsa's for the tokens in the language. All you have to do is add the parts that finish the scanner. For example, after you insert the fsa for an id in its proper place in the code, you simply add the line "return(id)" to the code. After the fsa for string literals, you add the code to call the procedure Strip_Quotes (and you include procedure Strip_Quotes at the top of your program), followed by "return(string_literal)", and so forth. You must also make publicly available values for lexeme, line_number, column_number, and error_string.
There are three aspects to scanner errors: detection, recovery, and reporting.
Scanner errors are detected when the scanner cannot find a token when the get_token method is called. If we consider the scanner to be one giant FSA that starts with the dispatcher, then a scanning error occurs when no accept state is encountered before scanning for the current token is terminated (by an "others" branch on the dispatcher or one of the fsa's). In other words, during the current scan, the string read contains no prefix that is a valid token. At this point, the scanner must set the token to error, so that the parser realizes that a scanning error has occurred. One thing the parser will need to do is turn off the translation part of the compiler, because there is no reason to continue to produce a translation if the source code contains errors.
Most compilers do not quit when the first error is encountered. Rather, the compiler "recovers" from the error in some fashion and tries to continue scanning and parsing (but not generating intermediate code) in order to give the programmer as much information as possible about mistakes in the program. For scanning, recovery means that the scanner must
set the error token
leave the file pointer in a location that allows the next call to get_token to function in a reasonable fashion.
set the row and column variables properly
What options are there?
1. Once the error has been detected on one of the "others" branches, set the file pointer to the character that took the fsa down that "others" branch. That way, all of the characters leading up to this one are skipped and scanning will resume with this character.
2. Once the error has been detected on one of the "others" branches, set the file pointer to the second character of the current string in error. In other words, set the file pointer to the character after the one used by the dispatcher. Perhaps only that first character is in error and the second starts a valid token.
3. Skip all remaining characters from the end of the error string to the first white space character. This approach assumes that the most likely place to find a valid token after an error is after the next sequence of white space characters.
Of these options, number 2 is the one least likely to miss a valid token.
There are various places that scanner errors can be reported. One easy way to report most scanner errors is to leave the reporting to the parser, if it is acting as the driver for the scanner. In this case:
set the token to a special error token, such as runon_comment
be sure the lexeme contains the offending string that scanned improperly
be sure the row and column numbers give the starting point of the offending lexeme in the source code
If these things are done in the scanner, then the parser, when it receives an error token, can print an appropriate error message before calling the scanner again to get the next token.
It is possible to try to correct the code during error recovery. (This is actually more of a possibility in the parser, where we know what kind of tokens we are looking for.) Error correction is not a problem if the intent is merely to recover from the error and continue scanning and parsing in some reasonable fashion, in order to give the programmer as much information about errors as possible. It is a problem if the intent is to actually generate a running translation of the program based on the compiler's attempts to correct the source program. That could lead to a situation where a program translates and executes, but the executable translation is not what the programmer intended. Even if the translation executes the way the programmer intended, the fact that the source program has errors in it is still unsatisfactory.
Attempts at error correction and translations based on the corrected source code have been made. PL/I had such a compiler at one point.
This part actually belongs earlier with the sections on determining how to scan for reserved words or identifiers.
As an example of how programming language design can make the job of the compiler writer difficult, consider the for loop in Fortran. Such a loop is written with the Fortran DO statement, as in
      DO 10 I = 1,100,1
      .
      .
      .
   10 Continue
The meaning of this construct is that the statement labeled 10 is the last statement in the loop (the Continue statement acts as a "no operation" statement). I starts at 1, is incremented by 1 after each pass through the loop, and goes up to 100.
Fortran was the first large-scale high-level language, and so was the first to be compiled (before much was known about compiling). The first thing the compiler does is remove all blanks and white space from the program, so the DO line looks like
DO10I=1,100,1
Thus, when scanning, the file pointer has to actually reach the first comma
DO10I=1,100,1
       ^
to distinguish whether this statement is a DO statement, or an assignment statement with the variable DO10I being assigned a value, as in
DO10I=1
(the = is the assignment operator in Fortran, another unfortunate fact). At the point where it can be properly determined which is the correct interpretation, the file pointer has to be moved back the appropriate number of characters and the proper token returned.
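The disambiguation just described can be sketched as a small lookahead check. This helper is illustrative, not part of any real Fortran compiler, and it ignores complications such as commas nested inside parentheses: after blank removal, a comma found after the "=" signals a DO statement, while its absence signals an assignment to a variable such as DO10I.

```java
// Sketch of the DO-versus-assignment lookahead after blank removal.
public class FortranDo {
    static boolean isDoStatement(String stmt) {
        if (!stmt.startsWith("DO")) return false;
        int eq = stmt.indexOf('=');
        if (eq < 0) return false;              // no "=": not this ambiguity
        return stmt.indexOf(',', eq + 1) >= 0; // comma after "=": a DO loop
    }
}
```

So "DO10I=1,100,1" is classified as a DO statement, while "DO10I=1" is classified as an assignment, exactly the distinction the scanner must back up and re-tokenize for.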
The Turing Award is the most prestigious award for computer scientists (there is no computer science category for the Nobel Prize). One Turing Award was given to Fred Brooks, best known for his book The Mythical Man-Month. He wrote this book after working as the chief designer of IBM's first large operating system, OS/360. Among the "rules of thumb" he presented in the book that apply to the development of large software systems (like the compiler you are writing) are:
Adding personnel to a late project makes it later.
As you can imagine even with your compiler project, if your project was going to be late, it would hurt, rather than help, if I suddenly assigned a new programmer (of your same level of expertise) to help you. You would need to bring the new person up to speed on the project before he or she could even begin to contribute, and by that time the deadline would have passed.
Build one to throw away.
Hardly anyone has this luxury, but after completing a large software project, you actually understand enough about it to "do it right," or at least do it much better.