Lemma 1: For each regular expression r, there is a finite state automaton M such that L(M) = L(r).
Proof: By construction (that is, an algorithm is given that converts any instance of a regular expression into a finite state automaton that recognizes exactly the set that the regular expression denotes, or stands for):
                       conversion algorithm
regular expression r ----------------------> finite state automaton M
The implication of this lemma is that finite state automata are at least as powerful as regular expressions, because for each regular expression we can construct a finite state automaton that recognizes the set of strings the regular expression stands for. One might wonder whether finite state automata can be constructed that recognize languages that cannot be expressed by regular expressions. The next lemma gives the answer.
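The construction behind Lemma 1 can be sketched in code. The following is a minimal illustration of Thompson's construction, one standard way to carry out the proof; it is not a full regular-expression parser, and all class and method names are illustrative. Each operator (single symbol, concatenation, union, Kleene star) maps to a small NFA fragment with one start state and one accept state, glued together with epsilon moves.

```java
import java.util.*;

// Sketch of the Lemma 1 construction (Thompson's construction).
// Each operator yields an NFA fragment with one start and one accept
// state; '\0' is used here as the label for an epsilon move.
public class Thompson {
    static final char EPS = '\0';   // epsilon label
    static int next = 0;            // fresh-state counter

    record Edge(int from, char label, int to) {}
    record Frag(int start, int accept, List<Edge> edges) {}

    static Frag symbol(char c) {
        int s = next++, a = next++;
        List<Edge> e = new ArrayList<>();
        e.add(new Edge(s, c, a));
        return new Frag(s, a, e);
    }

    static Frag concat(Frag f, Frag g) {             // rs
        List<Edge> e = new ArrayList<>(f.edges());
        e.addAll(g.edges());
        e.add(new Edge(f.accept(), EPS, g.start())); // glue f's accept to g's start
        return new Frag(f.start(), g.accept(), e);
    }

    static Frag union(Frag f, Frag g) {              // r|s
        int s = next++, a = next++;
        List<Edge> e = new ArrayList<>(f.edges());
        e.addAll(g.edges());
        e.add(new Edge(s, EPS, f.start()));
        e.add(new Edge(s, EPS, g.start()));
        e.add(new Edge(f.accept(), EPS, a));
        e.add(new Edge(g.accept(), EPS, a));
        return new Frag(s, a, e);
    }

    static Frag star(Frag f) {                       // r*
        int s = next++, a = next++;
        List<Edge> e = new ArrayList<>(f.edges());
        e.add(new Edge(s, EPS, f.start()));          // enter the loop
        e.add(new Edge(f.accept(), EPS, f.start())); // repeat
        e.add(new Edge(s, EPS, a));                  // skip entirely
        e.add(new Edge(f.accept(), EPS, a));         // leave the loop
        return new Frag(s, a, e);
    }

    // Standard NFA simulation: track the set of states reachable so far.
    static Set<Integer> closure(Set<Integer> states, List<Edge> edges) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> seen = new HashSet<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (Edge e : edges)
                if (e.from() == q && e.label() == EPS && seen.add(e.to()))
                    work.push(e.to());
        }
        return seen;
    }

    static boolean accepts(Frag m, String input) {
        Set<Integer> cur = closure(Set.of(m.start()), m.edges());
        for (char c : input.toCharArray()) {
            Set<Integer> step = new HashSet<>();
            for (Edge e : m.edges())
                if (cur.contains(e.from()) && e.label() == c) step.add(e.to());
            cur = closure(step, m.edges());
        }
        return cur.contains(m.accept());
    }
}
```

For example, the fragment built by star(union(symbol('a'), symbol('b'))) recognizes exactly the strings of a's and b's, including the empty string, which is the language of (a|b)*.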
Lemma 2: For each finite state automaton M there is a regular expression r such that L(r) = L(M).
Proof: By construction (that is, an algorithm is given that converts any instance of a finite state automaton into a regular expression that denotes exactly the set of strings that the finite state automaton recognizes):
                          conversion algorithm
finite state automaton M ----------------------> regular expression r
Together, Lemmas 1 and 2 imply that regular expressions and finite state automata are equivalent in expressive power.
Lemma 3: For each nondeterministic finite state automaton (nfsa) M there is a deterministic finite state automaton (dfsa) M' such that L(M') = L(M).
Proof: By construction.
         nfsa-to-dfsa conversion algorithm
nfsa M -----------------------------------> dfsa M'
Again, the theoreticians have not only given us a theorem stating that deterministic and nondeterministic finite state automata are equivalent; as part of proving it, they have given us an actual algorithm for converting any nfsa into an equivalent dfsa.
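The nfsa-to-dfsa conversion is usually carried out with the subset construction: each dfsa state is the *set* of nfsa states the machine could currently be in. The sketch below is illustrative, not a complete tool; the encoding of the nfsa as nested maps is an assumption of this example, and for brevity it assumes an nfsa without epsilon moves.

```java
import java.util.*;

// Sketch of the subset construction. Each dfsa state is a set of nfsa
// states; nfa.get(q).get(c) = the set of states reachable from q on c.
public class SubsetConstruction {
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            Map<Integer, Map<Character, Set<Integer>>> nfa,
            Set<Integer> start, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            Set<Integer> state = work.pop();
            if (dfa.containsKey(state)) continue;   // already expanded
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char c : alphabet) {
                Set<Integer> target = new TreeSet<>();
                for (int q : state)
                    target.addAll(nfa.getOrDefault(q, Map.of())
                                     .getOrDefault(c, Set.of()));
                row.put(c, target);
                work.push(target);                  // expand later if new
            }
            dfa.put(state, row);
        }
        return dfa;
    }
}
```

Run on the classic epsilon-free nfsa for (a|b)*abb (start state 0, accept state 3), this produces a dfsa with four subset-states: {0}, {0,1}, {0,2}, and {0,3}.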
Lemma 4. For each deterministic finite state automaton M there is a minimal-state deterministic finite state automaton M' with L(M') = L(M).
Proof: By construction. Once again, the theoreticians have not only proved the existence of a minimal-state finite state automaton, but they have also given us the algorithm for making such an fsa.
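One common way to carry out the Lemma 4 construction is partition refinement: start from the split into accepting and non-accepting states, then keep splitting any block whose members disagree on which blocks their transitions lead to. The sketch below is illustrative; it assumes a DFA encoded as delta[state][symbol] -> state with all states reachable and all transitions defined, and it returns only the number of states in the minimal dfsa rather than building the machine itself.

```java
import java.util.*;

// Sketch of DFA minimization by partition refinement.
public class Minimize {
    static int countMinStates(int[][] delta, boolean[] accepting) {
        int n = delta.length;
        int[] block = new int[n];                  // block id of each state
        for (int q = 0; q < n; q++) block[q] = accepting[q] ? 1 : 0;
        while (true) {
            // a state's signature: its block plus the blocks it moves to
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int q = 0; q < n; q++) {
                List<Integer> sig = new ArrayList<>();
                sig.add(block[q]);
                for (int target : delta[q]) sig.add(block[target]);
                Integer id = ids.get(sig);
                if (id == null) { id = ids.size(); ids.put(sig, id); }
                refined[q] = id;
            }
            if (Arrays.equals(block, refined)) break;  // partition is stable
            block = refined;
        }
        return (int) Arrays.stream(block).distinct().count();
    }
}
```

For example, a four-state machine made of two identical copies of an "even number of a's" recognizer collapses to two states, since each state in one copy is indistinguishable from its twin in the other.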
                              Scanner Generator
formal definition of tokens -------------------> Automatic Scanner in Java
In other words, we want to be able to write down all of the regular expressions for the tokens in the programming language we are trying to scan and have a program produce an actual scanner (say, in Java). As usual, when practice meets theory, we have some additional work to do. For example, the minimal finite state automata needed for scanning can be generated by the algorithms above. What the scanner generator can't do without some help from us is decide what to do when it finds a token. So the formal definition of tokens given as input to the scanner generator must have a special form that not only gives the regular expressions for the tokens but also gives directions on what to do when each token is scanned. Input to a scanner generator might look like the following:
** Part 1, Auxiliary Expressions
Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Letter ::= a | A | b | B | c | C ... | z | Z
.
.
.
** Part 2, Tokens
'(' return(l_paren)
')' return(r_paren)
':=' return(assign_op)
letter {letter | digit} return(id)
digit {digit} return(integer_literal)
"'" {^("'" | "''" | eol)} "'" Strip_Quotes;
return(string_literal)
.
.
.
** Part 3, Special Procedures
procedure Strip_Quotes is
begin -- Strip_Quotes
-- Strip the beginning and ending quotes
-- and replace all double apostrophes in
-- the string with a single apostrophe.
-- Be sure to build up the lexeme in the
-- process.
end Strip_Quotes;
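The notes above give Strip_Quotes only as a commented skeleton. Below is one possible body, written in Java since the generated scanner is assumed to be in Java; the class and method names are illustrative.

```java
// One possible body for Strip_Quotes. Given a scanned string literal
// such as 'don''t', drop the outer quotes, then collapse each doubled
// apostrophe into a single one, building up the final lexeme.
public class StripQuotes {
    static String stripQuotes(String lexeme) {
        String inner = lexeme.substring(1, lexeme.length() - 1); // drop quotes
        return inner.replace("''", "'");                         // '' -> '
    }
}
```

So the lexeme 'don''t' becomes don't, which is the value the rest of the compiler should see.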
Note that the generated scanner automatically keeps track of the lexeme scanned, and perhaps other information such as the line and column numbers of the start of the lexeme. These may be kept in public attributes that can be accessed by whatever program uses the scanner. They are different from the actions shown above, such as Strip_Quotes, which specify extra work to be done when a string is successfully scanned.

Here we have given all of the items necessary for a scanner to be generated. You already know a standard format for building a scanner given all of the fsa's for the tokens in the language. All you have to do is add the parts that finish the scanner. For example, after you insert the fsa for an id in its proper place in the code, you simply add the line "return(id)" to the code. After the fsa for string literals, you add the code to call the procedure Strip_Quotes (and you include procedure Strip_Quotes at the top of your program), followed by "return(string_literal)", and so forth. You must also make publicly available values for lexeme, line_number, column_number, and error_string.
There are three aspects to scanner errors: detection, recovery, and reporting.
Scanner errors are detected when the scanner cannot find a token when the get_token method is called. If we consider the scanner to be one giant FSA that starts with the dispatcher, then a scanning error occurs when no accept state is encountered before scanning for the current token is terminated (by an "others" branch on the dispatcher or one of the fsa's). In other words, during the current scan, the string read contains no prefix that is a valid token. At this point, the scanner must set the token to error, so that the parser realizes that a scanning error has occurred. One thing the parser will need to do is turn off the translation part of the compiler, because there is no reason to continue to produce a translation if the source code contains errors.
Most compilers do not quit when the first error is encountered. Rather, the compiler "recovers" from the error in some fashion and tries to continue scanning and parsing (but not generating intermediate code) in order to give the programmer as much information as possible about mistakes in the program. For scanning, recovery means that the scanner must
set the error token
leave the file pointer in a location that allows the next call to get_token to function in a reasonable fashion.
set the row and column variables properly
What options are there?
1. Once the error has been detected on one of the "others" branches, set the file pointer to the character that took the fsa down that "others" branch. That way, all of the characters leading up to this one are skipped and scanning will resume with this character.
2. Once the error has been detected on one of the "others" branches, set the file pointer to the second character of the current string in error. In other words, set the file pointer to the character after the one used by the dispatcher. Perhaps only that first character is in error and the second starts a valid token.
3. Skip all remaining characters from the end of the error string to the first white space character. This approach assumes that the most likely place to find a valid token after an error is after the next sequence of white space characters.
Of these options, number 2 is the one least likely to miss a valid token.
There are various places that scanner errors can be reported. One easy way to report most scanner errors is to leave the reporting to the parser, if it is acting as the driver for the scanner. In this case:
set the token to a special error token, such as runon_comment
be sure the lexeme contains the offending string that scanned improperly
be sure the row and column numbers give the starting point of the offending lexeme in the source code
If these things are done in the scanner, then the parser, when it receives an error token, can print an appropriate error message before calling the scanner again to get the next token.
It is possible to try to correct the code during error recovery. (This is actually more of a possibility in the parser, where we know what kind of tokens we are looking for.) Error correction is not a problem if the intent is merely to recover from the error and continue scanning and parsing in some reasonable fashion, in order to give the programmer as much information about errors as possible. It is a problem if the intent is to actually generate a running translation of the program based on the compiler's attempts to correct the source program. That could lead to a situation where a program translates and executes, but the executable translation is not what the programmer intended. Even if the translation executes the way the programmer intended, the fact that the source program has errors in it is still unsatisfactory.
Attempts at error correction and translations based on the corrected source code have been made. PL/I had such a compiler at one point.
This part actually belongs earlier with the sections on determining how to scan for reserved words or identifiers.
As an example of how programming language design can make the job of the compiler writer difficult, consider the for loop in Fortran. Such a loop is written with the Fortran DO statement, as in
      DO 10 I = 1,100,1
      .
      .
      .
   10 Continue
The meaning of this construct is that the statement labeled 10 is the last statement in the loop (the Continue statement acts as a "no operation" statement). I starts at 1, is incremented by 1 after each pass through the loop, and goes up to 100.
Fortran was the first large-scale high-level language, and so was the first to be compiled (before much was known about compiling). The first thing the compiler does is remove all blanks and white space from the program, so the DO line looks like
DO10I=1,100,1
Thus, when scanning, the file pointer has to actually reach the first comma
DO10I=1,100,1
       ^
to distinguish whether this statement is a DO statement, or an assignment statement with the variable DO10I being assigned a value, as in
DO10I=1
(the = is the assignment operator in Fortran, another unfortunate fact). At the point where it can be properly determined which is the correct interpretation, the file pointer has to be moved back the appropriate number of characters and the proper token returned.
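The disambiguation just described can be sketched as a small lookahead check. This helper is illustrative, not part of any real Fortran compiler, and it ignores complications such as commas nested inside parentheses: after blank removal, a comma found after the "=" signals a DO statement, while its absence signals an assignment to a variable such as DO10I.

```java
// Sketch of the DO-versus-assignment lookahead after blank removal.
public class FortranDo {
    static boolean isDoStatement(String stmt) {
        if (!stmt.startsWith("DO")) return false;
        int eq = stmt.indexOf('=');
        if (eq < 0) return false;              // no "=": not this ambiguity
        return stmt.indexOf(',', eq + 1) >= 0; // comma after "=": a DO loop
    }
}
```

So "DO10I=1,100,1" is classified as a DO statement, while "DO10I=1" is classified as an assignment, exactly the distinction the scanner must back up and re-tokenize for.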
The Turing Award is the most prestigious award for computer scientists (there is no computer science category for the Nobel Prize). One Turing Award was given to Fred Brooks, best known for his book The Mythical Man-Month. He wrote this book after working as the chief designer of IBM's first large operating system, OS/360. Among the "rules of thumb" he presented in the book that apply to the development of large software systems (like the compiler you are writing) are:
Adding personnel to a late project makes it later.
As you can imagine even with your compiler project, if your project was going to be late, it would hurt, rather than help, if I suddenly assigned a new programmer (of your same level of expertise) to help you. You would need to bring the new person up to speed on the project before he or she could even begin to contribute, and by that time the deadline would have passed.
Build one to throw away.
Hardly anyone has this luxury, but after completing a large software project, you actually understand enough about it to "do it right," or at least do it much better.