Friday, January 30

More Odds and Ends of Scanning

In this lecture we cover most of the remaining aspects of scanning that we have not yet encountered.

Commenting your Compiler

Our requirements for commenting a program are simple. As far as possible, the program you write should be self-documenting. This means that you should choose very clear identifier names, whether for variables, functions, procedures, packages, classes, or anything else that is named, and that you should adhere to a clear, structured style of programming that is easy to follow and understand. Tricky programming is justifiable only if it truly leads to a substantial gain in efficiency (such as replacing an O(n²) algorithm with an O(n log₂ n) algorithm). If the program is self-documenting, then each module (class, method, procedure, function, package, etc.) needs only a brief comment at the beginning describing what that module does. We require these comments to be in the form of pre- and post-conditions. The only other comments needed are those in the body of the module that describe obscure or "tricky" code (again, such programming should be avoided wherever possible). Other comments may be required for archival reasons, such as the name(s) of the author(s) of the module, the date completed, and so forth. These always go at the very start of the module and are policy dependent.

Incremental Implementation via Stubs

The use of stubs to implement a large program in stages is a common and useful practice. Recall that the word "stub" implies something that is short and only partially complete. So, for example, if a module is to be written that is to provide a number of operations to other parts of the program, the operations (e.g., procedures, functions, methods, etc.) can be stubbed so that they can be called and so that fake values are returned where necessary. That way, other members of the team, or even an individual programmer, can work on the parts of the program that call these operations. They can compile and run the program to be sure that their components do indeed call the operations properly (i.e., they can ensure that the interface is correct), and thereby debug and complete those aspects of the program. Later, as time permits, each stubbed operation can be completed without requiring any changes to the calling components.
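As a minimal sketch of this idea, written here in Python, the operation below is stubbed so that callers can already be compiled, run, and debugged against it. The names resolve_reserved_word, classify, and mp_identifier are assumptions made for illustration and are not prescribed by the course.

```python
# Hypothetical interface; all names here are illustrative only.

def resolve_reserved_word(lexeme):
    """Stub for the reserved-word lookup.

    pre:  lexeme is an identifier-like string
    post: returns a token name (for now, always 'mp_identifier')
    """
    return "mp_identifier"  # fake value until the real lookup is written


def classify(lexemes):
    """A caller that can be developed and tested against the stub."""
    return [(resolve_reserved_word(lexeme), lexeme) for lexeme in lexemes]


print(classify(["begin", "fred"]))
```

When the stub is later replaced by a real lookup with the same name and parameters, classify does not need to change.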

Upper-Lower Case Letter Considerations

In many programming languages, including Pascal and Ada, no distinction is made between upper and lower case letters in program identifiers and reserved words. There is a good reason for this. It would be very confusing to have two different variable names that differ only in their case, such as fred and Fred. Java shows what can happen otherwise (Java syntax is described as "case sensitive"). While Java has some conventions that programmers are encouraged to follow, such as beginning the names of attributes and methods with lower case letters and beginning class names with upper case letters, these conventions are not enforced by the compiler. Programs can be difficult to read and maintain when case distinctions are made, and programmers are discouraged (but not prohibited) from using names that differ only in their use of capital letters to stand for different things in the same program.

If a programming language is not case sensitive there are some issues to deal with. When the scanner scans an identifier, it must (in our approach) give the scanned lexeme to the ADT for resolving reserved words, which must in turn calculate the proper token -- either the token identifier or the proper reserved word token, such as mp_begin. As an example, notice that the reserved word "begin" could be written as BEGIN, begin, Begin, or even BeGin (no, we aren't advocating actually doing this in a program, but pointing out that it could be done and must be compiled properly in any case). The way to handle this is to keep the reserved word abstract data type list in one case, say lower case, and then to convert each lexeme to be compared to that case before the comparison is made. There are two things to note:

The original lexeme must be left in its original form in case a source listing is needed and for the parser, which is going to be responsible for some symbol table actions (to be discussed later in the course).
The programming language you are using to write your compiler may well have a "built in" routine for case conversion. This is a commonly required operation and is therefore often available in a programming language.
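The approach described above can be sketched as follows, assuming the compiler is written in Python and the reserved word list is a dictionary keyed by lower case spellings. Only mp_begin comes from these notes; the other token names and the function name resolve_token are assumptions for this sketch.

```python
# Reserved-word table kept entirely in lower case.
# Token names other than mp_begin are assumed for illustration.
RESERVED_WORDS = {
    "begin": "mp_begin",
    "end": "mp_end",
    "if": "mp_if",
    "then": "mp_then",
}


def resolve_token(lexeme):
    """Return the token for an identifier-like lexeme.

    pre:  lexeme is the text exactly as it was scanned
    post: returns the reserved-word token, or 'mp_identifier';
          lexeme itself is left unchanged
    """
    # Only a lower-case copy is compared; the original spelling
    # stays available for the listing and the parser.
    return RESERVED_WORDS.get(lexeme.lower(), "mp_identifier")


for lexeme in ["BEGIN", "begin", "BeGin", "Fred"]:
    print(lexeme, resolve_token(lexeme))
```

Here str.lower plays the role of the "built in" case conversion routine, and the scanner would return both the token and the untouched lexeme.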

Scanning Comments

What should be done with comments encountered during a scan? Comments are for the program reader, not for the program. Therefore, comments can be skipped, being treated much like white space. Here are some points to notice:

Be sure to update line and column numbers while scanning comments.
Handle runaway comments as a scanner error if the end of file is encountered before the end of a comment is reached.

This last point raises an interesting problem about comments. In early programming languages, such as Fortran, comments ended at the end of a line (if column 1 of a line had a C in it, the rest of the line was considered to be a comment). Then, since it was clear that multi-line comments were often needed, newer languages incorporated opening and closing comment markers. For example:

     (* this is a Pascal
        comment *)

     {So is this.  In the early days of card punch machines
      curly braces were not available, so comments were given
      in Pascal as in the previous example.  Later, Pascal
      was extended to allow the curly braces to delimit
      comments}

      A := {you can also embed comments like this} 5;

      /* similarly, this is a comment in C
      */

Interestingly, some modern languages, such as Ada, have reverted to comments that have beginning markers and are automatically terminated at the end of the line, as in

      -- This is an Ada comment.  If you want it to extend
      -- to other lines, you must treat the next lines
      -- as individual comments with leading -- symbols.

At first this seems like a step backwards. Really, though, it is the result of experience with a programming language feature that seemed fine but has proven in practice to be more trouble than it is worth. For example, consider the following procedure in which Pascal-like comments are used.

     Procedure Shutdown(Temperature : in Integer) is

       {This program checks to see whether the parameter Temperature
        is out of range.  If it is, procedure Shutdown_Reactor is
        called to initiate shutting down the reactor before the 
        reactor core melts}

       begin
         {Make the check here
         if Temperature > Maximum_Allowed_Value then
           Shutdown_Reactor;
         {If Shutdown_Reactor is called, program control will not
          return to this point.}
       end Shutdown;

Notice that the programmer forgot to terminate the first comment after begin with a }. Thus the entire if statement is treated as a comment, up until the next } is found. This procedure would compile and execute without error. However, if the Temperature were ever out of range, procedure Shutdown_Reactor would not be called!

It is often difficult for a programmer to locate such errors, because the programmer focuses on the if statement, which looks fine, without noticing that it is part of a comment. Such errors can be fatal. Now consider the same procedure with single line comments:

     Procedure Shutdown(Temperature : in Integer) is

       -- This program checks to see whether the parameter Temperature
       -- is out of range.  If it is, procedure Shutdown_Reactor is
       -- called to initiate shutting down the reactor before the 
       -- reactor core melts

       begin
         -- Make the check here
         if Temperature > Maximum_Allowed_Value then
           Shutdown_Reactor;
         -- If Shutdown_Reactor is called, program control will not
         -- return to this point.
       end Shutdown;

The problems described above go away. The language is safer. And really, the comments are actually more readable, because it is clear on each line whether that line is part of the comment or not. Thus, a little more work for the programmer is beneficial in the big picture. Points to consider for scanning Pascal comments:

If the end of file is reached before a closing brace is found, a scanner error should be reported as usual.
If a second opening brace is found before the awaited closing brace is encountered, a warning should be issued to alert the programmer to a possible runaway comment. The opening brace does not mean that an error has occurred for sure, because the opening brace could be part of the comment.
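The points above can be sketched as a small routine, assuming the whole source file is held in a Python string and that line and column numbers are 1-based. The function name skip_brace_comment and the message wording are assumptions for this sketch, not part of the course's scanner.

```python
def skip_brace_comment(source, pos, line, col):
    """Skip a Pascal-style { ... } comment.

    pre:  source[pos] == '{'; line and col give its 1-based position
    post: returns (pos, line, col, messages), where pos is just past
          the closing '}' (or at end of file) and messages holds any
          warnings or errors that were noted
    """
    start_line, start_col = line, col
    messages = []
    pos += 1                      # step over the opening brace
    col += 1
    while pos < len(source):
        ch = source[pos]
        pos += 1
        if ch == "\n":            # keep line and column numbers current
            line += 1
            col = 1
            continue
        col += 1
        if ch == "}":
            return pos, line, col, messages
        if ch == "{":             # possibly a runaway comment: warn only
            messages.append(
                f"warning: '{{' inside comment at line {line}, column {col - 1}"
            )
    # End of file reached without a closing brace.
    messages.append(
        f"error: runaway comment starting at line {start_line}, column {start_col}"
    )
    return pos, line, col, messages
```

Run on a comment like the unterminated one in the Shutdown example, this sketch skips the if statement as comment text but issues a warning at the second opening brace, which is what points the programmer to the real problem.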

Scanning String Literals

Another situation that requires potentially long scans involves string literals. For example, the statement

    Write('Hello World');

requires scanning from the first apostrophe to the second, maintaining the string found in the lexeme and returning something like string_literal as the token. Two points to note are:

In most programming languages (particularly imperative programming languages), string literals are not allowed to continue past the end of a line, so the scanner can signal an error if an end of line character is found before the closing apostrophe is encountered. If an end of line character is supposed to be part of a string, the language typically provides an escape sequence, introduced by a special escape character such as \, which can be recognized by the scanner.
In Pascal and a number of other languages, an apostrophe is inserted in a string by writing two apostrophes in a row, as in

     Write('I Don''t know how to scan this');

This means that a string literal consisting of just a single apostrophe is written as '''' (four apostrophes in a row), whereas the null string is just '' (two apostrophes in a row without a following third apostrophe).
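A minimal sketch of such a scan is given below, assuming the source is held in a Python string. The function name scan_string_literal and the 'error' token are assumptions for this sketch, and escape characters such as \ are deliberately not handled.

```python
def scan_string_literal(source, pos):
    """Scan a Pascal-style string literal.

    pre:  source[pos] == "'"
    post: returns (token, lexeme, value, next_pos); token is
          'string_literal' or 'error', lexeme is the source text as
          written, and value is the string with each '' collapsed to '
    """
    start = pos
    pos += 1                       # step over the opening apostrophe
    chars = []
    while pos < len(source):
        ch = source[pos]
        if ch == "\n":
            break                  # a literal may not cross end of line
        if ch == "'":
            if source[pos + 1:pos + 2] == "'":
                chars.append("'")  # doubled apostrophe: one literal '
                pos += 2
                continue
            pos += 1               # closing apostrophe
            return "string_literal", source[start:pos], "".join(chars), pos
        chars.append(ch)
        pos += 1
    # End of line or end of file reached before the closing apostrophe.
    return "error", source[start:pos], "".join(chars), pos


print(scan_string_literal("'I Don''t know how to scan this');", 0))
```

The one-character lookahead after an apostrophe is what separates the doubled apostrophe from the closing apostrophe, so '''' yields a one-character string and '' yields the null string.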

Compiler Directives in the Source Code

Many compilers allow the programmer to insert special instructions to the compiler, such as instructions to insert a breakpoint, or to include a procedure or function in line, or to turn debugging on or off (note that if debugging is to be done the compiler must generate a lot more target machine code to allow this to happen). Sometimes these are inserted as special comments. For example, a Pascal comment starting with {? might denote that a compiler directive has been encountered. Other languages have special instructions for compiler directives. In Ada, the pragma is used for this purpose. In either case, the scanner cannot simply ignore the compiler directives. It must scan the following directive and pass it on to the parser so that proper action can be taken.
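As a rough sketch of the special-comment approach, the routine below looks at a comment that has already been scanned and decides whether it is a directive to hand to the parser or an ordinary comment to skip. The token name mp_directive, the function name classify_comment, and the directive text "debug on" are all hypothetical.

```python
def classify_comment(lexeme):
    """Decide whether a scanned brace comment is a compiler directive.

    pre:  lexeme is a complete brace comment, including both braces
    post: returns ('mp_directive', text) for a comment starting with
          '{?', or None if the comment should simply be skipped
    """
    if lexeme.startswith("{?") and lexeme.endswith("}"):
        # Strip the '{?' marker and the closing brace; keep the rest
        # for the parser to interpret.
        return ("mp_directive", lexeme[2:-1].strip())
    return None


print(classify_comment("{? debug on}"))
print(classify_comment("{an ordinary comment}"))
```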

Creating a Source Listing File

The scanner may also be called upon to generate a source code listing, which is a copy of the input file with perhaps some other information made available, such as where errors were encountered. More elaborate listings that show procedure nesting and so forth require advanced techniques that we usually leave for the parser.
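A simple listing of this kind might be produced as sketched below, assuming the scanner has collected its errors as (line, column, message) tuples with 1-based positions. The function name make_listing and the listing layout are assumptions for this sketch.

```python
def make_listing(source, errors):
    """Build a source listing with error markers.

    pre:  errors is a list of (line, column, message), 1-based
    post: returns the listing as one string: each source line is
          prefixed by its line number and followed by one marker
          line per error reported on it
    """
    by_line = {}
    for line_no, col, message in errors:
        by_line.setdefault(line_no, []).append((col, message))

    out = []
    for line_no, text in enumerate(source.splitlines(), start=1):
        out.append(f"{line_no:4d}  {text}")       # 6-character prefix
        for col, message in sorted(by_line.get(line_no, [])):
            # Place the caret under the offending column.
            out.append(" " * (6 + col - 1) + "^ " + message)
    return "\n".join(out)


print(make_listing("x := 'abc\ny := 1;",
                   [(1, 6, "unterminated string literal")]))
```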

Creating a "Pretty Print" File

Sometimes a program development environment also provides a way to reformat the source code into a standard form with standard indentation, standard case for identifiers, and even color coded phrases. This is usually a separate program that simply converts the original source program into the formatted form and is not involved with scanning at all.