Wednesday, January 28
Odds and Ends of Scanning
By now, we should have a good idea of how a scanner can be designed and
implemented. There are still a few details to learn about scanning,
however.
The Forest
There is an old saying, "I can't see the forest for the
trees." We have been looking at the trees, or individual details, of
the scanner for a while. Don't forget the big picture. The scanner
is really just a black box to the driver (parser) with a few interface
methods: get_token, get_lexeme, get_column, get_line, and open_file. (Some
teams may implement the scanner to provide a token object that itself contains
the token name, lexeme, row number, and column number.) The dispatcher that we have talked about, as well as the
individual fsa's for scanning, are all just internal, or private, routines
inside the scanner (they could even be part of another package or object that
the parser doesn't know about). There are no restrictions on how you
design the scanner in general, except that it must follow the specifications for the
various components (pre and post assertions) given in the previous lectures, and
the algorithms for implementing the fsa's for each token class must follow the
case structure described in class. Of course, we expect you to apply the principles of good program
design and implementation.
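To make the black-box view concrete, here is a minimal sketch of that interface, written in Python purely for illustration. The method names come from the list above; the Token class shows the alternative token-object design some teams may use, and the bodies are deliberately left unfilled.

    from dataclasses import dataclass

    @dataclass
    class Token:
        name: str    # e.g., "identifier", "assign_op", "reserved_begin"
        lexeme: str  # the matched source text
        line: int    # line where the lexeme starts
        column: int  # column where the lexeme starts

    class Scanner:
        """Black box to the parser: only these methods are visible."""

        def open_file(self, path: str) -> None:
            ...  # open the source file and reset position bookkeeping

        def get_token(self) -> str:
            ...  # dispatch to the matching internal FSA, return the token name

        def get_lexeme(self) -> str:
            ...  # the text matched for the most recent token

        def get_line(self) -> int:
            ...  # line number of the start of the most recent lexeme

        def get_column(self) -> int:
            ...  # column number of the start of the most recent lexeme

Everything else (the dispatcher, the individual fsa's) would live behind this interface as private helpers.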
Combining Theory with Practice
Think about doing the scanner on your own. Suppose that all you knew were the
specifications, namely, that your scanner was to have methods for returning the next
token, the lexeme, the line number and the column number (the latter two for the start of
the lexeme). In this case you would have to figure out how to identify
tokens on your own. There is no doubt that you could come up with something that
worked, but it would be a laborious process and probably pretty messy.
Instead, we have a large body of both theory and practice at our fingertips. We
can be grateful that there were theorists who studied the problems of scanning and
recorded solutions that are easy to apply and mathematically provably correct.
We can also thank the practitioners who took the time to record ways of
combining theory with practice so that the process of constructing a scanner is quite
clear now.
Today's lecture will help us take care of a few more of the practical details of
scanning.
One of the aspects of a course like this, then, is that we actually do depend
quite a bit directly on theoretical underpinnings. So, for example, you
might be asked in an exercise to give a regular expression for some set of
strings. When the term "regular expression" is used, it refers
to something very specific. This is not a request for you to come up with
some plausible way of your own for describing the set of strings.
"Regular expression" is a term that, if you have forgotten, you need
to look up in your CS 350 textbook. Having such formalisms is a great
help. When you go to a standards document, for example, it is very nice
that commonly understood terms and methods (e.g., regular expressions and EBNF)
are used, as this makes it much easier to use the document. You need to
become comfortable with these terms.
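For example, identifiers in many languages are described by the regular expression

    letter (letter | digit)*

that is, a letter followed by zero or more letters or digits. (The names letter and digit stand for the obvious character classes; your language definition gives the precise alphabet.)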
Resolving Reserved Words
Reserved words could be scanned directly by way of finite state automata just like we
do for other tokens that have a single string that matches them. For example, the
string := (token assign_op) is a multicharacter string that is the only match for
assign_op. Similarly, the reserved word "begin" is the only string that
matches the token reserved_begin (if that's what we want to call it).
In other words, the theory tells us that it is possible to scan for reserved words
just like we scan for all other tokens. However, since all reserved words also look
like identifiers, trying to fit them into our standard approach of one fsa for each token
would make for a very messy construction. Practice tells us that it is much cleaner
and easier to understand to just scan everything that looks like an identifier as an
identifier, then look the corresponding lexeme up in a table to see whether it is really a
reserved word instead of an identifier.
With that approach settled, a question remains: "Which
part of the compiler should be responsible for resolving reserved words?" We
could plan to have the scanner just return the token id and the lexeme and require the
parser to look the lexeme up in a list of reserved words. Or we could decide to have
the scanner first look the lexeme up in a list of reserved words and return either id as
the token or the reserved word (e.g., reserved_begin) as the token if it is determined
that the lexeme is a reserved word.
Arguments could be made for either choice. However, it may be more uniform to
think of the scanner as being responsible for precisely identifying each token and not
leaving that task to the parser. That is, in this line of thinking, the scanner
should return the proper token for each reserved word. In the fsa that
scans for identifiers, the line that sets the variable token would thus be
something like
Token := Resolve(Lexeme);
where Resolve is a function or method that looks the string in
Lexeme up in the table of reserved words and returns the proper token as a
result (either identifier or the reserved word token, such as reserved_begin).
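A minimal sketch of what Resolve might look like, again in Python for illustration (the table contents and token names here are assumptions; use the reserved words and token names your language specification gives you):

    # Hypothetical reserved-word table; the real contents come from the
    # language specification. A dictionary is used here for brevity; the
    # next section weighs the choice of data structure.
    RESERVED = {
        "begin": "reserved_begin",
        "end":   "reserved_end",
        "if":    "reserved_if",
        "while": "reserved_while",
    }

    def resolve(lexeme: str) -> str:
        """Return the reserved-word token if the lexeme is reserved,
        otherwise the identifier token."""
        return RESERVED.get(lexeme, "identifier")

If the language is case-insensitive, the lexeme would be normalized (e.g., lowercased) before the lookup.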
An ADT for Reserved Words
In the above example, Resolve is a function or method that looks Lexeme up in a
table of reserved words. This means that we
need an abstract data type (ADT) for dealing with reserved words with the single
method Resolve. Notice that there are a number of differences between this ADT and other
list or table ADTs you have encountered in the past.
- This list of keywords has a constant size
- The contents of this ADT are known from the outset; the table can be
constructed directly in the program rather than read in from a file, and there is
no need for insert or delete operations
- This ADT is relatively small
Given these properties, how should we design our ADT for reserved words?
- Unordered list
  - Sequential search requires O(n) lookup time, where n is the number of
    reserved words in the table.
  - Since n is constant, this really implies O(1).
  - Probably the simplest to implement if you are in a hurry.
- Ordered list
  - O(log2 n) lookup with binary search (see the sketch after this list)
  - Since n is constant, however, this implies O(1)
  - Can easily be ordered at the start, since the list is constant
  - Requires only a simple array implementation with no pointers
- Binary search tree
  - O(log2 n) lookup, where n is the number of reserved words
  - Since n is constant, however, this is actually O(1)
  - Can be constructed to be balanced at the start
  - Requires a pointer structure
- Hash table
  - O(1) average lookup time
  - Since n is constant, this would be O(1) in any case
  - Since the list is constant, we can build a good hash table from the start
    (e.g., one with no or few collisions).
  - Requires a more complex implementation than some other methods.
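As a sketch of the ordered-list option (the word list below is a made-up subset; substitute your language's reserved words), the table sits in one sorted array and a lexeme is located with binary search:

    import bisect

    # Sorted array of reserved words (hypothetical subset).
    WORDS = ["begin", "do", "else", "end", "if", "then", "while"]

    def is_reserved(lexeme: str) -> bool:
        """Binary search: at most about log2(n) comparisons."""
        i = bisect.bisect_left(WORDS, lexeme)
        return i < len(WORDS) and WORDS[i] == lexeme

Since the list never changes, it can be written down already sorted; no insert or delete code is ever needed.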
So there we have it. All methods, since the table of reserved words is constant,
have a constant maximum lookup time. Does this mean that it makes no
difference? How many times will the table be accessed when a program is
compiled? Once for every identifier and reserved word in the program! This
means that even though the lookup time for a single identifier is a constant, we
want to choose the ADT that has the lowest constant lookup time given the size
of our table.
Here is a rule of thumb for choosing an ADT that will work well for you nearly all of the time:
- Design the ADT well, so that its operations are fixed.
- Choose the method for implementing the ADT that is the easiest
to do. For example, the unordered list with sequential search might well be the
quickest and least error-prone to implement.
- Later, as time permits, go back, analyze the time
complexity of various ADTs given the fixed size of the table, choose the
most efficient ADT, and change the internal
implementation of the original ADT (e.g., from an unordered list using sequential search to a hash
table). If the interface for the ADT has been done properly the rest of the program
will not be affected (operation resolve will still work, just faster); a sketch of this idea follows the list.
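Here is a sketch of that rule of thumb, assuming a simple class-based design (the names are illustrative): the rest of the compiler sees only resolve, so the internal representation can change later without touching any caller.

    class ReservedWords:
        def __init__(self, words):
            # First, easiest version: a plain unordered list searched
            # sequentially. Later this internal representation can become
            # a sorted array, a search tree, or a hash-based set without
            # changing resolve() for any caller.
            self._words = list(words)

        def resolve(self, lexeme):
            # Sequential search via the list membership test.
            if lexeme in self._words:
                return "reserved_" + lexeme
            return "identifier"

The scanner would hold one ReservedWords instance and call resolve on every lexeme that looks like an identifier.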
Other things to note:
- The list of reserved words is so small that you would not be able to detect a
difference for a single lookup regardless of which implementation you used for the
ADT.
- The table will be accessed for every identifier and keyword encountered in the
program. So, the time complexities should be multiplied by m, where m is the number
of identifier and keyword instances in the program. In this case, m is really the
variable, not n (the number of reserved words in the table). Thus, you need to
decide which method gives the best lookup time (even though they are all constants, they
are all different constants). For example, binary search applied to a list of 40
reserved words would take only about log2(40), or 6, comparisons to resolve. With
sequential search, an average of 20 (half the table) would be needed. With a binary
search tree, again only 6 comparisons would be needed, but the time required to
follow the pointer links would add to the actual time. A hash table would require
only 1 probe, but computing the hash function itself takes time, so it might
actually cost more than the 6 comparisons of binary search. Sometimes these things
can only be resolved through experimentation (see the sketch after this list).
- It feels good to know how to analyze a problem and come up with the best approach.
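A crude experiment along those lines might look like the sketch below (the word list and the lexeme stream are made up; real measurements should use your own table and representative source programs):

    import timeit

    WORDS = sorted(["begin", "end", "if", "then", "else", "while",
                    "do", "for", "not", "and", "or", "var"])
    AS_SET = frozenset(WORDS)

    # A made-up stream of lexemes mixing identifiers and reserved words.
    STREAM = ["count", "if", "x1", "begin", "total", "while", "temp"] * 1000

    def with_list():  # sequential search over a list
        return sum(1 for w in STREAM if w in WORDS)

    def with_set():   # hashed lookup
        return sum(1 for w in STREAM if w in AS_SET)

    print("list lookups:", timeit.timeit(with_list, number=100))
    print("set lookups: ", timeit.timeit(with_set, number=100))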
Historical Note
At least one programming language (PL/I) tried the approach of not having any
reserved words. The word "if," for example, could be used as a
variable name. The ideas behind this were
- it was possible for the compiler to tell from the context whether such a
word was a variable name or a keyword
- it was difficult for humans to remember the entire list of keywords and
avoid their use as variable names.
While a noble venture, this approach has not been used in any recent
programming language. The reasons are that
- it really isn't difficult for programmers to avoid the use of reserved
words, because most of the reserved words would not be chosen as variable
names anyway, and if one is, the compiler can flag the problem, which can
then easily be remedied by the programmer.
- it is very poor programming practice to include variable names that are
keywords in other contexts, as it makes the program difficult to read.
- some truly awful programs can be written if there are no reserved
words. Consider the following if statement, where if is also used as a
Boolean variable:
if if then if := not if;
In Fortran, there were reserved words, but Fortran had its own unique set of
problems. For one, spaces were not treated as important to the syntax of
the program. So, before a Fortran program was compiled, all of the spaces
were squeezed out. A loop statement, which normally would be written as
DO 10 I = 1, 15, 1
would look to the scanner like
DO10I=1,15,1
The scanner could not tell until it reached the first comma that it was dealing with the reserved word DO and not a variable named DO10I!