Wednesday, January 28
Odds and Ends of Scanning


By now, we should have a good idea of how a scanner can be designed and implemented.  There are still a few details to learn about scanning, however.

The Forest

There is an old saying: "I can't see the forest for the trees."  We have been looking at the trees, or individual details, of the scanner for a while.  Don't forget the big picture.  The scanner is really just a black box to the driver (parser) with a few interface methods:  get_token, get_lexeme, get_column, get_line, and open_file.  (Some teams may implement the scanner to provide a token object that itself contains the token name, lexeme, row number, and column number.)  The dispatcher that we have talked about, as well as the individual fsa's for scanning, are all just internal, or private, routines inside the scanner (they could even be part of another package or object that the parser doesn't know about).  There are no restrictions on how you design the scanner in general, except that it must follow the specifications for the various components (pre- and post-assertions) given in the previous lectures, and the algorithms implementing the fsa's for each token class must follow the case structure described in class.  Of course, we expect you to apply the principles of good program design and implementation.
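As a sketch of this black-box view, here is one possible shape for the scanner in Python.  The public method names (get_token, get_lexeme, get_line, get_column, open_file) and the token names (identifier, assign_op) come from the lecture; everything else, including handling only identifiers and := for brevity, is an illustrative assumption, not a prescribed design.

```python
# Sketch of the scanner as a black box: a few public methods for the
# parser, with the dispatcher and per-token fsa's kept private.
# Only identifiers and := are recognized, to keep the sketch short.
class Scanner:
    def __init__(self):
        self._text = ""
        self._pos = 0
        self._line = self._col = 1
        self._token = None
        self._lexeme = ""
        self._tok_line = self._tok_col = 1

    def open_file(self, source):
        # For the sketch we accept the source as a string.
        self._text = source
        self._pos = 0
        self._line = self._col = 1

    def get_token(self):
        # Dispatcher: skip whitespace, then pick the private routine
        # for whichever token class the next character suggests.
        self._skip_whitespace()
        self._tok_line, self._tok_col = self._line, self._col
        if self._pos >= len(self._text):
            self._token, self._lexeme = "eof_token", ""
        elif self._text[self._pos].isalpha():
            self._scan_identifier()
        elif self._text.startswith(":=", self._pos):
            self._token, self._lexeme = "assign_op", ":="
            self._advance(2)
        else:
            self._token, self._lexeme = "unknown", self._text[self._pos]
            self._advance(1)
        return self._token

    def get_lexeme(self):
        return self._lexeme

    def get_line(self):
        return self._tok_line          # line where the lexeme starts

    def get_column(self):
        return self._tok_col           # column where the lexeme starts

    # --- private routines; the parser never sees these ---
    def _scan_identifier(self):
        start = self._pos
        while self._pos < len(self._text) and self._text[self._pos].isalnum():
            self._advance(1)
        self._token, self._lexeme = "identifier", self._text[start:self._pos]

    def _skip_whitespace(self):
        while self._pos < len(self._text) and self._text[self._pos].isspace():
            self._advance(1)

    def _advance(self, n):
        for _ in range(n):
            if self._text[self._pos] == "\n":
                self._line += 1
                self._col = 1
            else:
                self._col += 1
            self._pos += 1
```

The parser calls only the five public methods; whether the internals are one dispatcher with case structure, as described in class, or something else, is invisible from the outside.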

Combining Theory with Practice

Think about doing the scanner on your own.  Suppose that all you knew were the specifications, namely, that your scanner was to have methods for returning the next token, the lexeme, the line number, and the column number (the latter two for the start of the lexeme).  In this case you would have to figure out how to identify tokens on your own.  There is no doubt that you could come up with something that worked, but it would be a laborious process and probably pretty messy.  Instead, we have a large body of both theory and practice at our fingertips.  We can be grateful that there were theorists who studied the problems of scanning and recorded solutions that are easy to apply and mathematically provable correct.  We can also thank the practitioners who took the time to record ways of combining theory with practice, so that the process of constructing a scanner is now quite clear.  Today's lecture will help us take care of a few more of the practical details of scanning.

One of the aspects of a course like this, then, is that we depend quite directly on theoretical underpinnings.  So, for example, you might be asked in an exercise to give a regular expression for some set of strings.  When the term "regular expression" is used, it refers to something very specific.  This is not a request for you to come up with some plausible way of your own for describing the set of strings.  "Regular expression" is a term that, if you have forgotten it, you need to look up in your CS 350 textbook.  Having such formalisms is a great help.  When you go to a standards document, for example, it is very nice that commonly understood terms and methods (e.g., regular expressions and EBNF) are used, as this makes it much easier to use the document.  You need to become comfortable with these terms.
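As a small illustration (not part of the lecture), the formal regular expression  letter (letter | digit)*  describes identifiers that start with a letter and continue with letters or digits.  The same language can be written in Python's re notation; the notation differs from the textbook's, but the set of strings described is the same.

```python
# The formal regular expression  letter (letter | digit)*
# written in Python's re notation.
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(identifier.fullmatch("begin")))   # reserved words match too!
print(bool(identifier.fullmatch("x27")))
print(bool(identifier.fullmatch("27x")))     # may not start with a digit
```

Note that reserved words such as "begin" match this expression as well, which is exactly the observation the next section builds on.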

Resolving Reserved Words

Reserved words could be scanned directly by way of finite state automata, just as we do for other tokens that have a single string that matches them.  For example, the string := (token assign_op) is a multicharacter string that is the only match for assign_op.  Similarly, the reserved word "begin" is the only string that matches the token reserved_begin (if that's what we want to call it).  In other words, the theory tells us that it is possible to scan for reserved words just as we scan for all other tokens.  However, since every reserved word also looks like an identifier, trying to fit reserved words into our standard approach of one fsa for each token would make for a very messy construction.  Practice tells us that it is much cleaner and easier to understand to scan everything that looks like an identifier as an identifier, then look the corresponding lexeme up in a table to see whether it is really a reserved word rather than an identifier.

Once this is determined, a question needs to be answered:  "Which part of the compiler should be responsible for resolving reserved words?"  We could plan to have the scanner just return the token id and the lexeme, and require the parser to look the lexeme up in a list of reserved words.  Or we could decide to have the scanner first look the lexeme up in a list of reserved words and return either id as the token or the reserved-word token (e.g., reserved_begin) if the lexeme turns out to be a reserved word.  Arguments could be made for either choice.  However, it may be more uniform to think of the scanner as being responsible for precisely identifying each token and not leaving that task to the parser.  That is, in this line of thinking, the scanner should return the proper token for each reserved word.  In the fsa that scans for identifiers, the line that sets the variable Token would thus be something like

Token := Resolve(Lexeme);

where Resolve is a function or method that looks the string in Lexeme up in the table of reserved words and returns the proper token as a result (either identifier or the reserved word token, such as reserved_begin).
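In Python, Resolve might be sketched as follows.  The table shown lists only a few sample reserved words, and the case-insensitive lookup is an assumption about the source language, not something the lecture specifies.

```python
# Sketch of Resolve: look the identifier's lexeme up in a table of
# reserved words; return the reserved-word token if found, otherwise
# the generic identifier token.  The word list is only a sample.
RESERVED = {
    "begin": "reserved_begin",
    "end":   "reserved_end",
    "if":    "reserved_if",
    "then":  "reserved_then",
    "while": "reserved_while",
}

def resolve(lexeme):
    # Case sensitivity is a language-design decision; this sketch
    # assumes a case-insensitive language, so we normalize first.
    return RESERVED.get(lexeme.lower(), "identifier")
```

With this in place, the identifier fsa ends by calling Resolve and never needs separate states for reserved words.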

An ADT for Reserved Words

In the above example, Resolve is a function or method that looks Lexeme up in a table of reserved words.  This means that we need an abstract data type (ADT) for dealing with reserved words, with the single method Resolve.  Notice that this ADT differs from other list or table ADTs you have encountered in the past: the table of reserved words is fixed before compilation begins, it is never modified, and the only operation needed is lookup.  Given these properties, how should we design our ADT for reserved words?  Since the table is constant in size, any reasonable implementation gives a constant maximum lookup time.  Does this mean that the choice makes no difference?  Consider how many times the table will be accessed when a program is compiled: once for every identifier and reserved word in the program!  This means that even though the lookup time for a single identifier is a constant, we want to choose the ADT that has the lowest constant lookup time given the size of our table.
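To make the "which constant?" point concrete, here is a sketch (not a prescribed design) of two ways to back the reserved-word ADT.  Both are constant time because the table never grows, but the constants differ: the linear scan may compare against every entry, while a hash lookup touches roughly one.

```python
# Two backings for the reserved-word ADT.  The word list is a sample.
WORDS = ("begin", "end", "if", "then", "else", "while", "do")

def is_reserved_linear(lexeme):
    for w in WORDS:              # up to len(WORDS) comparisons per call
        if w == lexeme:
            return True
    return False

WORD_SET = frozenset(WORDS)      # built once, before scanning starts

def is_reserved_hashed(lexeme):
    return lexeme in WORD_SET    # about one probe on average
```

For a table of half a dozen words the difference is negligible; for a language with dozens of reserved words and a large program full of identifiers, the lower-constant structure is the better choice.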

Here is a rule of thumb for choosing an ADT that will work well for you nearly all of the time:

Other things to note:

Historical Note

At least one programming language (PL/I) tried the approach of not having any reserved words.  The word "if," for example, could be used as a variable name.  The ideas behind this were

While a noble venture, this approach has not been used in any recent programming language.  The reasons are that

if if then if := not if;

(The first "if" is the keyword; the other three occurrences are references to a Boolean variable named "if.")

In Fortran, there were reserved words, but Fortran had its own unique set of problems.  For one, spaces were not treated as important to the syntax of the program.  So, before a Fortran program was compiled, all of the spaces were squeezed out.  A loop statement, which normally would be written as

DO 10 I = 1, 15, 1

would look to the scanner like

DO10I=1,15,1

The scanner could not tell until it reached the first comma that it was dealing with the reserved word DO and not a variable named DO10I!
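A tiny sketch of this lookahead problem (illustrative only, and not how any real Fortran compiler was structured): after the spaces are squeezed out, the statement can only be classified by looking ahead for a comma after the "=".

```python
# Illustrative sketch of the Fortran ambiguity: once spaces are gone,
# "DO10I=1,15,1" (a DO loop) and "DO10I=1.15" (an assignment to the
# variable DO10I) differ only in whether a comma follows the "=".
def classify(stmt):
    squeezed = stmt.replace(" ", "")
    if squeezed.startswith("DO") and "=" in squeezed:
        after_equals = squeezed.split("=", 1)[1]
        if "," in after_equals:
            return "do_loop"
    return "assignment"
```

The scanner must defer its decision about the first token until the comma is, or is not, found.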