Friday, January 30
More Odds and Ends of Scanning
In this lecture we cover most of the remaining aspects of scanning that we have not
yet encountered.
Commenting your Compiler
Our requirements for commenting a program are simple. As far as possible,
the program you write should be self documenting. This simply means
that you should choose very clear identifier names, whether for variables, functions,
procedures, packages, classes, or anything else that is named, and that you should adhere
to a very clear structured style of programming that is easy to follow and understand.
Tricky programming is justifiable only if it leads to a substantial gain in
efficiency (such as replacing an O(n^2) algorithm with an O(n log_2 n) algorithm). If
the program is self documenting, then each module (class, method, procedure, function,
package, etc.) needs only to have a brief comment at the beginning describing what that
module does. We require these comments to be in the form of pre- and
post-conditions. The only other comments needed are those necessary in the body of
the module to describe obscure or "tricky" code (again, such programming should
be avoided wherever possible).
Other comments may be required for archival reasons, such as the name(s) of the
author(s) of the module, the date completed, and so forth. These always go at the
very start of the module and are policy dependent.
Incremental Implementation via Stubs
The use of stubs to implement a large program in stages is a common and useful
practice. Recall that the word "stub" implies something that is short and
only partially complete. So, for example, if a module is to be written that is to
provide a number of operations to other parts of the program, the operations (e.g.,
procedures, functions, methods, etc.) can be stubbed so that they can be called and so
that fake values are returned where necessary. That way, other members of the team,
or even an individual programmer, can work on the parts of the program that call these
operations. They can compile and run the program to be sure that their components do
indeed call the operations properly (i.e., they can ensure that the interface is correct),
and thereby debug and complete those aspects of the program. Later, as time permits,
each stubbed operation can be completed without requiring any changes to the calling
components.
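A stubbed module might look something like the following Python sketch. The class SymbolTableStub, its two operations, and the fake values they return are all hypothetical; the point is only that the calling code can be written, run, and debugged before the real operations exist.

```python
class SymbolTableStub:
    """Stubbed symbol table: callable now, to be completed later."""

    def insert(self, name, kind):
        # Stub: accept the call and pretend the insertion succeeded.
        return True

    def lookup(self, name):
        # Stub: return a fixed fake entry so that callers can proceed.
        return {"name": name, "kind": "variable", "offset": 0}


def declare_and_use(table, name):
    """Calling component that exercises the interface of the table."""
    table.insert(name, "variable")
    return table.lookup(name)["kind"]
```

When the real symbol table is written later with the same operation names and parameters, declare_and_use does not need to change.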
Upper-Lower Case Letter Considerations
In many programming languages, including Pascal and Ada, no distinction is made between upper and lower case
letters in program identifiers and reserved words. There is a good reason for this.
It would be very confusing to have two different variable names that differ only in
their case, such as fred and Fred. Java shows what happens when the distinction is made (Java syntax is
described as "case sensitive"). While Java has some conventions that
programmers are encouraged to follow, such as beginning the names of attributes
and methods with lower case letters and using upper case letters to begin class
names, these rules are not enforced. Programs can be difficult to read and maintain
when case distinctions are made, and programmers are discouraged (but not
prohibited) from using names that differ only in their use of capital
letters to stand for different things in the same program.
If a programming language is not case sensitive, there are some issues to deal with. When the scanner scans an identifier,
it must (in our approach) give the scanned lexeme to the ADT for resolving
reserved words, which must in turn calculate the proper token -- either the token
identifier or the proper reserved word token, such as mp_begin.
As an example, notice that the reserved word "begin" could be written as
BEGIN, begin, Begin, or even BeGin (no, we aren't advocating actually doing this
in a program, but pointing out that it could be done and must be compiled
properly in any case). The way to handle this is to keep the reserved
word abstract data type list in one case, say lower case, and then to convert each lexeme
to be compared to that case before the comparison is made. There are two things to
note:
- The original lexeme must be left in its original form in case a source listing is
needed and for the parser, which is going to be responsible for some symbol table actions
(to be discussed later in the course).
- The programming language you are using to write your compiler may well have a
"built in" routine for case conversion. This is a commonly required
operation and is therefore often available in a programming language.
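The approach just described can be sketched in a few lines of Python. The table contents and the token names other than mp_begin (for example mp_identifier) are assumptions made for this illustration. The built-in string method lower() does the case conversion.

```python
# Reserved words are stored in lower case only (a partial, illustrative table).
RESERVED_WORDS = {
    "begin": "mp_begin",
    "end": "mp_end",
    "if": "mp_if",
    "then": "mp_then",
}


def resolve_token(lexeme):
    """Decide whether a scanned identifier is really a reserved word.

    Pre:  lexeme is a non-empty identifier as it appeared in the source.
    Post: returns (token, lexeme), where token is the reserved word token
          if the lower-case form of lexeme is reserved and "mp_identifier"
          otherwise; lexeme is returned in its original form.
    """
    token = RESERVED_WORDS.get(lexeme.lower(), "mp_identifier")
    return token, lexeme
```

For instance, resolve_token("BeGin") yields the token mp_begin while the lexeme stays "BeGin", so a source listing can still show exactly what the programmer typed.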
Scanning Comments
What should be done with comments encountered during a scan? Comments are for
the program reader, not for the program. Therefore, comments can be skipped, being
treated much like white space. Here are some points to notice:
- Be sure to update line and column numbers while scanning comments.
- Handle runaway comments as a scanner error if the end of file is encountered before
the end of a comment is reached.
This last point raises an interesting problem about comments. In early
programming languages, such as Fortran, comments ended at the end of a line (if column 1
of a line had a C in it, the rest of the line was considered to be a comment). Then,
since it was clear that multi-line comments were often needed, newer languages
incorporated opening and closing comment markers. For example:
(* this is a Pascal
comment *)
{So is this. In the early days of card punch machines
curly braces were not available, so comments were given
in Pascal as in the previous example. Later, Pascal
was extended to allow the curly braces to delimit
comments}
A := {you can also embed comments like this} 5;
/* similarly, this is a comment in C
*/
Interestingly, some modern languages, such as Ada, have reverted to comments that have
beginning markers and are automatically terminated at the end of the line, as in
-- This is an Ada comment. If you want it to extend
-- to other lines, you must treat the next lines
-- as individual comments with leading -- symbols.
At first it seems like this is a step backwards. Really, though, it is the
result of experience with a programming language feature that seemed fine but has proven
in practice to be more trouble than it is worth. For example, consider the following
procedure in which Pascal-like comments are used.
Procedure Shutdown(Temperature : in Integer) is
{This program checks to see whether the parameter Temperature
is out of range. If it is, procedure Shutdown_Reactor is
called to initiate shutting down the reactor before the
reactor core melts}
begin
{Make the check here
if Temperature > Maximum_Allowed_Value then
Shutdown_Reactor;
{If Shutdown_Reactor is called, program control will not
return to this point.}
end Shutdown;
Notice that the programmer forgot to terminate the first comment after begin
with a }. Thus the entire if statement is treated as a comment, up until the next } is found. This
procedure would compile and execute without error. However, if the
Temperature were ever out of range, procedure Shutdown_Reactor would not be
called!
It is often difficult for a programmer to locate such errors, because the programmer
focuses on the if statement, which looks
fine, not noticing that it is part of a comment. Such errors can be
fatal. Now consider the same procedure with single line comments:
Procedure Shutdown(Temperature : in Integer) is
-- This program checks to see whether the parameter Temperature
-- is out of range. If it is, procedure Shutdown_Reactor is
-- called to initiate shutting down the reactor before the
-- reactor core melts
begin
-- Make the check here
if Temperature > Maximum_Allowed_Value then
Shutdown_Reactor;
-- If Shutdown_Reactor is called, program control will not
-- return to this point.
end Shutdown;
The problems described above go away. The language is safer. The comments are
also more readable, because it is clear on each line whether that
line is part of the comment or not. Thus, a little more work for the programmer is
beneficial in the big picture.
Points to consider for scanning Pascal comments:
- If the end of file is reached before a closing brace is found, a scanner error
should be noted as usual.
- If a second opening brace is found before the awaited closing brace is encountered,
a warning should be issued to alert the programmer to a possible runaway comment.
The opening brace does not mean that an error has occurred for sure, because the
opening brace could be part of the comment.
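The points above might be handled along the following lines. This is a Python sketch under several assumptions of our own: the source is held in a single string, line and column numbers start at 1, and the names ScannerError and skip_brace_comment are hypothetical.

```python
class ScannerError(Exception):
    """Raised for errors detected by the scanner (hypothetical name)."""


def skip_brace_comment(text, pos, line, col):
    """Skip a Pascal { ... } comment.

    Pre:  text[pos] == "{"; line and col locate that opening brace.
    Post: returns (pos, line, col, warnings) for the character just after
          the closing "}"; raises ScannerError if the end of file is
          reached first.
    """
    start_line, start_col = line, col
    warnings = []
    pos += 1
    col += 1
    while pos < len(text):
        ch = text[pos]
        if ch == "}":
            return pos + 1, line, col + 1, warnings
        if ch == "{":
            # Not necessarily an error: the brace may be part of the comment.
            warnings.append(
                f"line {line}, column {col}: '{{' inside comment opened at "
                f"line {start_line}, column {start_col}; possible runaway comment"
            )
        if ch == "\n":
            # Keep line and column numbers correct while inside the comment.
            line += 1
            col = 1
        else:
            col += 1
        pos += 1
    raise ScannerError(
        f"comment opened at line {start_line}, column {start_col} "
        "is not closed before end of file"
    )
```

Run on the reactor example above, a routine like this would skip the whole if statement but would also issue a warning at the second opening brace, which is the hint the programmer needs.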
Scanning String Literals
Another situation that requires potentially long scans involves string literals.
For example, the statement
Write('Hello World');
requires scanning from the first apostrophe to the second, maintaining the string
found in the lexeme and returning something like string_literal as the token. Two
points to note are:
- In most programming languages (particularly
imperative programming languages) string literals are not allowed to continue past end
of line, so the scanner can signal an error if an end of line character is found before
the closing apostrophe is encountered. If an end of line character is
supposed to be part of a string, it must be preceded by a special escape
character, such as \, which can be recognized by the scanner.
- In Pascal and a number of other languages, if one wants to insert an apostrophe in a string, two
apostrophes in a row must be used for this purpose, as in
Write('I Don''t know how to scan this');
This means that a string consisting of a single apostrophe is written as '''' (four
apostrophes in a row), whereas the null string is just '' (two apostrophes in a row
without a following third apostrophe).
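Here is a minimal Python sketch of such a scan, assuming Pascal-style literals with doubled apostrophes and no escape character. The names scan_string_literal and ScannerError, and the decision to return both the lexeme and the decoded string value, are our own assumptions.

```python
class ScannerError(Exception):
    """Raised for errors detected by the scanner (hypothetical name)."""


def scan_string_literal(text, pos):
    """Scan a string literal delimited by apostrophes.

    Pre:  text[pos] == "'".
    Post: returns (token, lexeme, value, next_pos), where lexeme is the
          literal exactly as written, value has each doubled apostrophe
          reduced to one, and next_pos indexes the character after the
          closing apostrophe; raises ScannerError if the line or the
          file ends first.
    """
    start = pos
    pos += 1
    chars = []
    while pos < len(text) and text[pos] != "\n":
        ch = text[pos]
        if ch == "'":
            if pos + 1 < len(text) and text[pos + 1] == "'":
                # Two apostrophes in a row stand for one apostrophe.
                chars.append("'")
                pos += 2
                continue
            # A lone apostrophe closes the literal.
            return "string_literal", text[start:pos + 1], "".join(chars), pos + 1
        chars.append(ch)
        pos += 1
    raise ScannerError("string literal is not closed before end of line")
```

With this sketch, '''' scans to a value of one apostrophe and '' scans to the null string, matching the two cases described above.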
Compiler Directives in the Source Code
Many compilers allow the programmer to insert special instructions to the compiler,
such as instructions to insert a breakpoint, or to include a procedure or function in
line, or to turn debugging on or off (note that if debugging is to be done the compiler
must generate a lot more target machine code to allow this to happen). Sometimes
these are inserted as special comments. For example, a Pascal comment starting with
{? might denote that a compiler directive has been encountered.
Other languages have special instructions for compiler directives. In Ada, the
pragma is used for this purpose.
In either case, the scanner cannot simply ignore the compiler directives. It
must scan the following directive and pass it on to the parser so that proper action can
be taken.
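For the special comment form, the scanner's decision could be sketched as follows in Python. The {? marker is the one suggested above; the token name compiler_directive, the function name, and the directive text used in the usage note are hypothetical.

```python
def classify_comment(comment_text):
    """Separate compiler directives from ordinary comments.

    Pre:  comment_text is a complete comment, including both braces.
    Post: returns ("compiler_directive", body) if the comment begins
          with "{?", where body is the text between the marker and the
          closing brace; returns None for an ordinary comment, which
          the scanner may skip.
    """
    if comment_text.startswith("{?"):
        return "compiler_directive", comment_text[2:-1].strip()
    return None
```

For example, classify_comment("{? debug on}") produces a compiler_directive token carrying the text "debug on" for the parser, while classify_comment("{just a note}") produces None.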
Creating a Source Listing File
The scanner may also be called upon to generate a source code listing, which is a copy
of the input file with perhaps some other information made available, such as where errors
were encountered. More elaborate listings that show procedure nesting and so forth
require advanced techniques that we usually leave for the parser.
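A simple listing of this kind can be sketched in Python as below. The layout (a four-digit line number followed by two spaces, with a caret under the offending column) and the name make_listing are assumptions made for the example. Errors are assumed to arrive as (line, column, message) triples with 1-based positions.

```python
def make_listing(source_text, errors):
    """Build a source listing annotated with error messages.

    Pre:  errors is a list of (line, column, message) triples that use
          1-based line and column numbers within source_text.
    Post: returns the listing as one string: every source line prefixed
          by its line number, followed by one marker line per error
          reported on that line.
    """
    by_line = {}
    for line_no, col, message in errors:
        by_line.setdefault(line_no, []).append((col, message))

    out = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        out.append(f"{line_no:4d}  {line}")
        for col, message in sorted(by_line.get(line_no, [])):
            # The prefix is 6 characters wide, so column 1 sits at index 6.
            out.append(" " * (6 + col - 1) + "^ " + message)
    return "\n".join(out)
```

The scanner only has to record the line and column of each error as it goes; the listing itself can be produced at the end of the scan.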
Creating a "Pretty Print" File
Sometimes a program development environment also provides a way to reformat the source
code into a standard form with standard indentation, standard case for identifiers, and
even color coded phrases. This is usually a separate program that simply converts
the original source program into the formatted form and is not involved with scanning at
all.