One way to organize a compiler project is seen in the diagram from the last lecture. Here just the front end is shown.
Source Program File ----> Scanner ----> Tokens ----> Parser ----> Parse Tree ----> Semantic Analyzer ----> Intermediate Code (IR)
In this diagram, it appears that each phase of the compiler generates a new output file from its input file and is then finished. This approach works fine, and it is used in practice.
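As a sketch of this pass-by-pass organization (a Java skeleton; the file names and empty phase bodies are placeholders, not from any particular compiler):

    import java.io.IOException;

    // Each phase reads the previous phase's output file and writes a new
    // file for the next phase.  The bodies are stubs; a real compiler
    // would do the actual work in each.
    public class FrontEnd {
        static void scan(String src, String tok)    throws IOException { /* source -> token file */ }
        static void parse(String tok, String tree)  throws IOException { /* tokens -> parse tree file */ }
        static void analyze(String tree, String ir) throws IOException { /* tree -> IR file */ }

        public static void main(String[] args) throws IOException {
            scan("prog.src", "prog.tok");
            parse("prog.tok", "prog.tree");
            analyze("prog.tree", "prog.ir");
        }
    }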
A second way is to have the various phases work together. In this approach, the parser acts as the driver: it requests a token from the scanner only when one is needed to continue the parse, and when it has completed enough of the parse that IR can be generated, it calls the semantic analyzer at that point to do so. As an example, suppose the compiler has reached the following line in a program:
.
.
.
X := X + 1;
.
.
.
The compiler really doesn't need to look at the rest of the program to know how to scan, parse, and translate this line into IR. The parser just retrieves each token in this line as needed from the scanner, then asks the semantic analyzer to create the proper IR to represent adding 1 to X and storing the result into X.
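As a concrete sketch, here is a self-contained Java toy of this on-demand style; the class names and the three-address IR printed at the end are illustrative only, not the design of any real compiler:

    import java.util.Arrays;
    import java.util.List;

    public class OnDemandDemo {
        // Toy scanner: hands the parser one token per call.
        static class ToyScanner {
            private final List<String> tokens;
            private int pos = 0;
            ToyScanner(String source) { tokens = Arrays.asList(source.trim().split("\\s+")); }
            String nextToken() { return pos < tokens.size() ? tokens.get(pos++) : "<eof>"; }
        }

        // Toy parser for "id := id + num ;".  It pulls each token only when
        // needed, then immediately has IR generated for the statement.
        static void parseAssignment(ToyScanner sc) {
            String target = sc.nextToken();   // "X"
            sc.nextToken();                   // ":="
            String left  = sc.nextToken();    // "X"
            String op    = sc.nextToken();    // "+"
            String right = sc.nextToken();    // "1"
            sc.nextToken();                   // ";"
            // Enough has been parsed; emit three-address IR for it now.
            System.out.println("t1 := " + left + " " + op + " " + right);
            System.out.println(target + " := t1");
        }

        public static void main(String[] args) {
            parseAssignment(new ToyScanner("X := X + 1 ;"));
        }
    }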
In this scheme, the front end of the compiler can be organized as shown below:
                 Parser (Driver)
                /               \
  get next token                 partial parse tree
              /                     \
        Scanner                  Semantic Analyzer
           ^                          |
           |                          |
           |                          v
      Source Code                    IR

In this diagram, the parser acts as the driver for the compiler. As it builds the parse tree, it gets tokens as needed from the scanner. The scanner is responsible for maintaining its position in the input source file and for making the next token available through a public variable or method when the parser requests it. As the parser completes a piece of the program, such as an arithmetic expression, it passes this information to the semantic analyzer, which can then create the IR for just that piece of the program (e.g., the IR that represents the arithmetic expression just parsed).
Because of the way the scanner, parser, and semantic analyzer are separated from each other, it would be possible to write each as a separate concurrent task (Ada tasks, Java threads, or pthreads in C/C++). Notice that the interface between the parser and the scanner, for example, is a producer/consumer problem. The resulting program would not be any faster (likely a bit slower) on a single processor, but it would be faster on a multiprocessor if the operating system is clever enough to schedule the tasks (threads) on different processors. The difficulty is that the operating system sees the compiler as a single process while it is running, and not every OS recognizes that a single process has, say, three threads running inside it; such an OS cannot schedule those threads onto separate processors. I would recommend this approach to anyone who would like to try it, as I think you would learn some valuable lessons from it.
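As a sketch of that producer/consumer structure (Java threads and a bounded buffer; the token stream is hard-coded here just for illustration):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ThreadedFrontEnd {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> buffer = new ArrayBlockingQueue<>(16);

            // Scanner thread: the producer.  Blocks when the buffer is full.
            Thread scanner = new Thread(() -> {
                for (String t : new String[] {"X", ":=", "X", "+", "1", ";", "<eof>"}) {
                    try { buffer.put(t); } catch (InterruptedException e) { return; }
                }
            });

            // Parser thread: the consumer.  Blocks when no token is ready yet.
            Thread parser = new Thread(() -> {
                while (true) {
                    try {
                        String t = buffer.take();
                        if (t.equals("<eof>")) return;
                        System.out.println("parser consumed: " + t);
                    } catch (InterruptedException e) { return; }
                }
            });

            scanner.start();
            parser.start();
            scanner.join();
            parser.join();
        }
    }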
In real life, a project like this could be broken into portions so that team members build different components of the compiler, such as the scanner, the parser, and so on. In this class, we will be learning about each of these components in succession, so team members must work on parts of each module as we learn about them.
Where does one start when beginning to write a compiler? We need the best available definition of the language. Ideally, this would be an ISO standard definition manual for the language. That is, we:
Obtain the ISO standard (or closest thing to it) for the language. For example, consider the ISO standard for Ada.
Use this standard to identify all of the tokens in the language. Sometimes these tokens are all listed nicely for you in the standard (they should be); other times they are not.
In the case that the tokens are not listed for you in the standard, you need to make a list of all tokens and their regular expressions as determined from the language definition. (Be aware that in EBNF notation the curly braces, { and }, mean "zero or more times"; enclosing a regular expression in curly braces is therefore the same as applying the Kleene star, *, from the theory of computing.) Examples are:
    Token              Regular Expression
    -----              ------------------
    identifier         letter {letter | digit}
    l_paren            (
    r_paren            )
    integer_literal    digit {digit}
    float_literal      integer_literal . integer_literal
    etc.

where  letter ::= a | b | c | ... | z | A | B | C | ... | Z
and    digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
One trick to be aware of is that some of the regular expressions you need are "auxiliary," or "helper," regular expressions that do not themselves stand for tokens. In the table above, letter and digit are examples of auxiliary regular expressions; they simply make the tokens easier to define.
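To see the correspondence with ordinary regular-expression notation, here is a quick check in Java (java.util.regex); the character classes [A-Za-z] and [A-Za-z0-9] stand in for the auxiliary letter and digit definitions above:

    import java.util.regex.Pattern;

    public class IdentifierRegex {
        public static void main(String[] args) {
            // letter {letter | digit} in EBNF  ==  letter(letter|digit)* with the Kleene star.
            Pattern identifier = Pattern.compile("[A-Za-z][A-Za-z0-9]*");
            for (String s : new String[] {"x", "count1", "9lives"}) {
                System.out.println(s + " -> " + identifier.matcher(s).matches());
            }
        }
    }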
Incorporate matching algorithms for each token into the scanner. This is done by converting the regular expression for each token into a finite state automaton that recognizes that token.
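For example, the fsa for the identifier token might be hand-coded as follows (a Java sketch, shaped like the private matching methods described below):

    public class IdentifierFsa {
        // letter {letter | digit}: state 0 is the start state, state 1 accepts,
        // and -1 is the dead state (no match possible).
        static boolean matchIdentifier(String s) {
            int state = 0;
            for (char c : s.toCharArray()) {
                if (state == 0) {
                    state = Character.isLetter(c) ? 1 : -1;        // must begin with a letter
                } else {
                    state = Character.isLetterOrDigit(c) ? 1 : -1; // letters or digits may follow
                }
                if (state == -1) return false;
            }
            return state == 1;  // accept only if at least one letter was seen
        }

        public static void main(String[] args) {
            System.out.println(matchIdentifier("count1")); // true
            System.out.println(matchIdentifier("9lives")); // false
        }
    }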
Modify each finite state automaton to handle the practical aspects of scanning. For example, when writing fsa's for the theory course, we assumed that the input being checked ended at the end of the input string. In scanning, we are matching substrings of the long program that constitutes the "input string" for our scanner. Thus, we will often read one or two characters farther than we should, and we will need to put those characters back into the input in preparation for the next call to the scanner.
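In Java, one way to get this "put back" behavior is java.io.PushbackReader, which lets the scanner unread the character it should not have consumed (a sketch; a real scanner would also track line and column numbers):

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    public class PushbackDemo {
        public static void main(String[] args) throws IOException {
            PushbackReader in = new PushbackReader(new StringReader("123+4"));
            StringBuilder lexeme = new StringBuilder();
            int c;
            while ((c = in.read()) != -1 && Character.isDigit((char) c)) {
                lexeme.append((char) c);       // consume the digits of the literal
            }
            if (c != -1) in.unread(c);         // read one character too far; put it back
            System.out.println("integer_literal: " + lexeme);       // 123
            System.out.println("next char: " + (char) in.read());   // '+'
        }
    }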
Implement the scanner by writing the fsa for each token as a separate private method in the scanner.
For an on-demand compiler, the scanner would be a module (i.e., an object or package) with an interface for the driver (the parser) to access. What would be the interface for the scanner (i.e., what public methods, variables, procedures, and/or functions are needed)?
Next token (the name of the token, e.g., id)
Lexeme (the string that matched the token, e.g., "r2_d2")
Line number of start of lexeme (for error reporting purposes)
Column number of start of lexeme (for error reporting purposes)
Error message (as needed)
There would also be a number of internal, or private, methods (a sketch combining the public interface and these internals appears after this list):
The matching algorithm for each token would be a private method
Source code file operations would be internal to the scanner
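Putting the public interface and the private internals together, the scanner module might be organized as below (a Java sketch with illustrative names; the method bodies are stubs):

    public class LexicalScanner {
        private java.io.PushbackReader source;   // source file operations stay internal
        private int line = 1, column = 1;        // current position in the source

        private String tokenName;     // e.g., "id"
        private String lexeme;        // e.g., "r2_d2"
        private int tokenLine, tokenColumn;
        private String errorMessage;  // set when no token matches

        // ---- public interface used by the driver (the parser) ----
        public void   getNextToken() { /* run the private fsa's to match the next token */ }
        public String tokenName()    { return tokenName; }
        public String lexeme()       { return lexeme; }
        public int    tokenLine()    { return tokenLine; }
        public int    tokenColumn()  { return tokenColumn; }
        public String errorMessage() { return errorMessage; }

        // ---- private matching methods: one fsa per token ----
        private boolean matchIdentifier()     { return false; /* stub */ }
        private boolean matchIntegerLiteral() { return false; /* stub */ }
        // ... one private method per token ...
    }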