Wednesday, January 28
Odds and Ends of Scanning
By now, we should have a good idea of how a scanner can be designed and
implemented. There are still a few details to learn about scanning,
however.
The Forest
There is an old saying, "I can't see the forest for the
trees." We have been looking at the trees, or individual details, of
the scanner for a while. Don't forget the big picture. The scanner
is really just a black box to the driver (parser) with a few interface
methods: get_token, get_lexeme, get_column, get_line, and open_file. (Some
teams may implement the scanner to provide a token object that itself contains
the token name, lexeme, row number, and column number.) The dispatcher that we have talked about, as well as the
individual fsa's for scanning, are all just internal, or private, routines
inside the scanner (they could even be part of another package or object that
the parser doesn't know about). There are no restrictions on how you
design the scanner in general, except that it must follow the specifications for the
various components (pre and post assertions) given in the previous lectures, and
the algorithms for implementing the fsa's for each token class must follow the
case structure described in class. Of course, we expect you to apply the principles of good program
design and implementation.
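To make the black-box view concrete, here is a minimal sketch of that interface, written in Python purely for illustration. The method names come from the list above; the Token class shows the alternative token-object design some teams may use, and the bodies are deliberately left unfilled.

    from dataclasses import dataclass

    @dataclass
    class Token:
        name: str    # e.g., "identifier", "assign_op", "reserved_begin"
        lexeme: str  # the matched source text
        line: int    # line where the lexeme starts
        column: int  # column where the lexeme starts

    class Scanner:
        """Black box to the parser: only these methods are visible."""

        def open_file(self, path: str) -> None:
            ...  # open the source file and reset position bookkeeping

        def get_token(self) -> str:
            ...  # dispatch to the matching internal FSA, return the token name

        def get_lexeme(self) -> str:
            ...  # the text matched for the most recent token

        def get_line(self) -> int:
            ...  # line number of the start of the most recent lexeme

        def get_column(self) -> int:
            ...  # column number of the start of the most recent lexeme

Everything else (the dispatcher, the individual fsa's) would live behind this interface as private helpers.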
Combining Theory with Practice
Think about doing the scanner on your own. Suppose that all you knew were the
specifications, namely, that your scanner was to have methods for returning the next
token, the lexeme, the line number and the column number (the latter two for the start of
the lexeme). In this case you would have to figure out how to identify
tokens on your own. There is no doubt that you could come up with something that
worked, but it would be a laborious process and probably pretty messy.
Instead, we have a large body of both theory and practice at our fingertips. We
can be grateful that there were theorists who studied the problems of scanning and
recorded solutions that are easy to apply and mathematically provably correct.
We can also thank the practitioners who took the time to record ways of
combining theory with practice so that the process of constructing a scanner is quite
clear now.
Today's lecture will help us take care of a few more of the practical details of
scanning.
One of the aspects of a course like this, then, is that we actually do depend
quite a bit directly on theoretical underpinnings. So, for example, you
might be asked in an exercise to give a regular expression for some set of
strings. When the term "regular expression" is used, it refers
to something very specific. This is not a request for you to come up with
some plausible way of your own for describing the set of strings.
"Regular expression" is a term that, if you have forgotten, you need
to look up in your CS 350 textbook. Having such formalisms is a great
help. When you go to a standards document, for example, it is very nice
that commonly understood terms and methods (e.g., regular expressions and EBNF)
are used, as this makes it much easier to use the document. You need to
become comfortable with these terms.
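For example, identifiers in many languages are described by the regular expression

    letter (letter | digit)*

that is, a letter followed by zero or more letters or digits. (The names letter and digit stand for the obvious character classes; your language definition gives the precise alphabet.)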
Resolving Reserved Words
Reserved words could be scanned directly by way of finite state automata just like we
do for other tokens that have a single string that matches them. For example, the
string := (token assign_op) is a multicharacter string that is the only match for
assign_op. Similarly, the reserved word "begin" is the only string that
matches the token reserved_begin (if that's what we want to call it).
In other words, the theory tells us that it is possible to scan for reserved words
just like we scan for all other tokens. However, since all reserved words also look
like identifiers, trying to fit them into our standard approach of one fsa for each token
would make for a very messy construction. Practice tells us that it is much cleaner
and easier to understand to just scan everything that looks like an identifier as an
identifier, then look the corresponding lexeme up in a table to see whether it is really a
reserved word instead of an identifier.
With that approach settled, a question remains: "Which
part of the compiler should be responsible for resolving reserved words?" We
could plan to have the scanner just return the token id and the lexeme and require the
parser to look the lexeme up in a list of reserved words. Or we could decide to have
the scanner first look the lexeme up in a list of reserved words and return either id as
the token or the reserved word (e.g., reserved_begin) as the token if it is determined
that the lexeme is a reserved word.
Arguments could be made for either choice. However, it may be more uniform to
think of the scanner as being responsible for precisely identifying each token and not
leaving that task to the parser. That is, in this line of thinking, the scanner
should return the proper token for each reserved word. In the fsa that
scans for identifiers, the line that sets the variable token would thus be
something like
Token := Resolve(Lexeme);
where Resolve is a function or method that looks the string in
Lexeme up in the table of reserved words and returns the proper token as a
result (either identifier or the reserved word token, such as reserved_begin).
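A minimal sketch of what Resolve might look like, again in Python for illustration (the table contents and token names here are assumptions; use the reserved words and token names your language specification gives you):

    # Hypothetical reserved-word table; the real contents come from the
    # language specification. A dictionary is used here for brevity; the
    # next section weighs the choice of data structure.
    RESERVED = {
        "begin": "reserved_begin",
        "end":   "reserved_end",
        "if":    "reserved_if",
        "while": "reserved_while",
    }

    def resolve(lexeme: str) -> str:
        """Return the reserved-word token if the lexeme is reserved,
        otherwise the identifier token."""
        return RESERVED.get(lexeme, "identifier")

If the language is case-insensitive, the lexeme would be normalized (e.g., lowercased) before the lookup.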
An ADT for Reserved Words
In the above example, Resolve is a function or method that looks Lexeme up in a
table of reserved words. This means that we
need an abstract data type (ADT) for dealing with reserved words with the single
method Resolve. Notice that there are a number of differences between this ADT and other
list or table ADTs you have encountered in the past.
- This list of keywords has a constant size
- The contents of this ADT are known from the outset; the table can be
constructed directly in the program rather than read in from a file, and there is
no need for insert or delete operations
- This ADT is relatively small
Given these properties, how should we design our ADT for reserved words?
- Unordered list
  - Sequential search requires O(n) lookup time, where n is the number of
    reserved words in the table.
  - Since n is constant, this really implies O(1).
  - Probably the simplest to implement if you are in a hurry.
- Ordered list
  - O(log2 n) lookup with binary search (see the sketch after this list)
  - Since n is constant, however, this implies O(1)
  - Can easily be ordered at the start, since the list is constant
  - Requires only a simple array implementation with no pointers
- Binary search tree
  - O(log2 n) lookup, where n is the number of reserved words
  - Since n is constant, however, this is actually O(1)
  - Can be constructed to be balanced at the start
  - Requires a pointer structure
- Hash table
  - O(1) average lookup time
  - Since n is constant, this would be O(1) in any case
  - Since the list is constant, we can build a good hash table from the start
    (e.g., one with no or few collisions).
  - Requires a more complex implementation than some other methods.
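As a sketch of the ordered-list option (the word list below is a made-up subset; substitute your language's reserved words), the table sits in one sorted array and a lexeme is located with binary search:

    import bisect

    # Sorted array of reserved words (hypothetical subset).
    WORDS = ["begin", "do", "else", "end", "if", "then", "while"]

    def is_reserved(lexeme: str) -> bool:
        """Binary search: at most about log2(n) comparisons."""
        i = bisect.bisect_left(WORDS, lexeme)
        return i < len(WORDS) and WORDS[i] == lexeme

Since the list never changes, it can be written down already sorted; no insert or delete code is ever needed.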
So there we have it. All methods, since the table of reserved words is constant,
have a constant maximum lookup time. Does this mean that it makes no
difference? How many times will the table be accessed when a program is
compiled? Once for every identifier and reserved word in the program! This
means that even though the lookup time for a single identifier is a constant, we
want to choose the ADT that has the lowest constant lookup time given the size
of our table.
Here is a rule of thumb for choosing an ADT that will work well for you nearly all of the time:
- Design the ADT well, so that its operations are fixed.
- Choose the method for implementing the ADT that is the easiest
to do. For example, the unordered list with sequential search might well be the
quickest and least error-prone to implement.
- Later, as time permits, go back, analyze the time
complexity of various ADTs given the fixed size of the table, choose the
most efficient ADT, and change the internal
implementation of the original ADT (e.g., from an unordered list using sequential search to a hash
table). If the interface for the ADT has been done properly the rest of the program
will not be affected (operation resolve will still work, just faster); a sketch of this idea follows the list.
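Here is a sketch of that rule of thumb, assuming a simple class-based design (the names are illustrative): the rest of the compiler sees only resolve, so the internal representation can change later without touching any caller.

    class ReservedWords:
        def __init__(self, words):
            # First, easiest version: a plain unordered list searched
            # sequentially. Later this internal representation can become
            # a sorted array, a search tree, or a hash-based set without
            # changing resolve() for any caller.
            self._words = list(words)

        def resolve(self, lexeme):
            # Sequential search via the list membership test.
            if lexeme in self._words:
                return "reserved_" + lexeme
            return "identifier"

The scanner would hold one ReservedWords instance and call resolve on every lexeme that looks like an identifier.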
Other things to note:
- The list of reserved words is so small that you would not be able to detect a
difference for a single lookup regardless of which implementation you used for the
ADT.
- The table will be accessed for every identifier and keyword encountered in the
program. So, the time complexities should be multiplied by m, where m is the number
of identifier and keyword instances in the program. In this case, m is really the
variable, not n (the number of reserved words in the table). Thus, you need to
decide which method gives the best lookup time (even though they are all constants, they
are all different constants). For example, binary search applied to a list of 40
reserved words would take only about log2(40), or 6, comparisons to resolve. With
sequential search, an average of 20 (half the table) would be needed. With a binary
search tree, again only 6 comparisons would be needed, but the time required to
follow the pointer links would add to the actual time. A hash table would require
only 1 probe, but computing the hash function itself takes time, so it might
actually cost more than the 6 comparisons of binary search. Sometimes these things
can only be resolved through experimentation (see the sketch after this list).
- It feels good to know how to analyze a problem and come up with the best approach.
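A crude experiment along those lines might look like the sketch below (the word list and the lexeme stream are made up; real measurements should use your own table and representative source programs):

    import timeit

    WORDS = sorted(["begin", "end", "if", "then", "else", "while",
                    "do", "for", "not", "and", "or", "var"])
    AS_SET = frozenset(WORDS)

    # A made-up stream of lexemes mixing identifiers and reserved words.
    STREAM = ["count", "if", "x1", "begin", "total", "while", "temp"] * 1000

    def with_list():  # sequential search over a list
        return sum(1 for w in STREAM if w in WORDS)

    def with_set():   # hashed lookup
        return sum(1 for w in STREAM if w in AS_SET)

    print("list lookups:", timeit.timeit(with_list, number=100))
    print("set lookups: ", timeit.timeit(with_set, number=100))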
Historical Note
At least one programming language (PL/I) tried the approach of not having any
reserved words. The word "if," for example, could be used as a
variable name. The ideas behind this were
- it was possible for the compiler to tell from the context whether such a
word was a variable name or a keyword
- it was difficult for humans to remember the entire list of keywords and
avoid their use as variable names.
While a noble venture, this approach has not been used in any recent
programming language. The reasons are that
- it really isn't difficult for programmers to avoid the use of reserved
words, because most of the reserved words would not be chosen as variable
names anyway, and if one is, the compiler can flag the problem, which can
then easily be remedied by the programmer.
- it is very poor programming practice to include variable names that are
keywords in other contexts, as it makes the program difficult to read.
- some truly awful programs can be written if there are no reserved
words. Consider the following if statement, where if is also used as a
Boolean variable:
if if then if := not if;
In Fortran, there were reserved words, but Fortran had its own unique set of
problems. For one, spaces were not treated as important to the syntax of
the program. So, before a Fortran program was compiled, all of the spaces
were squeezed out. A loop statement, which normally would be written as
DO 10 I = 1, 15, 1
would look to the scanner like
DO10I=1,15,1
The scanner could not tell until it reached the first comma that it was dealing with the reserved word DO and not a variable named DO10I!