Wednesday, January 21


Organizing a Compiler:  Two Approaches

One way to organize a compiler project is seen in the diagram from the last lecture.  Here just the front end is shown.

Source                Token              Parse                        Intermediate
-------> Scanner ---------> Parser ---------> Semantic Analyzer -------------->
Program               File               Tree                         Code (IR)

In this diagram, it appears that each phase of the compiler generates a new output file from its input file and is then finished.  This approach works fine, and it is used in practice.

A second way is to have the various phases work together.  In this approach, the parser acts as the driver.  It only requests a token from the scanner when one is needed to continue with the parse.  Also, when it has enough of a parse completed so that IR can be generated, it calls the semantic analyzer at that point to accomplish this.  As an example, suppose compiling were at the point of processing the following line in a program:

     .
     .
     .
X := X + 1;
     .
     .
     .

The compiler really doesn't need to look at the rest of the program to know how to scan, parse, and translate this line into IR.  The parser just retrieves each token in this line as needed from the scanner, then asks the semantic analyzer to create the proper IR to represent adding 1 to X and storing the result into X.
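The on-demand idea can be sketched in a few lines of Python.  Here a generator plays the role of the scanner: it produces one token only when the parser asks for the next one.  The token names and patterns below are illustrative assumptions, not the course's official token set.

```python
import re

# Illustrative token patterns for a line like "X := X + 1;".
TOKEN_SPEC = [
    ("identifier",      r"[A-Za-z][A-Za-z0-9]*"),
    ("integer_literal", r"[0-9]+"),
    ("assign",          r":="),
    ("plus",            r"\+"),
    ("semicolon",       r";"),
    ("skip",            r"\s+"),        # whitespace: scanned but not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(source):
    """Yield (kind, text) pairs one at a time -- the caller (the parser)
    pulls a token only when it needs one to continue the parse."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())

scan = tokens("X := X + 1;")
print(next(scan))   # ('identifier', 'X')
print(next(scan))   # ('assign', ':=')
```

Because `tokens` is a generator, no token is produced until `next` is called, which is exactly the "request a token only when needed" behavior described above.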

In this scheme, the front end of the compiler can be organized as shown below:

 
                  Parser
                 (Driver)
                 /      \
   get next token        partial parse tree
               /          \
          Scanner        Semantic
             ^           Analyzer
             |               |
             |               |
          Source             V
           Code             IR
In this diagram, the parser acts as the driver for the compiler.  As it builds the parse tree, it just gets tokens as needed from the scanner.  The scanner is responsible for maintaining its position in the input source file and making the next token available through a public variable or method when requested by the parser.  As the parser is able to parse a complete piece of the program, such as an arithmetic expression, it passes this information to the semantic analyzer, which can then create the IR for just that piece of the program (e.g., it can create the IR that represents the arithmetic expression just parsed).
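A minimal sketch of this parser-as-driver organization follows, for just the one statement form `X := X + 1;`.  The class and method names, and the quadruple-style IR, are assumptions made for illustration; a real parser would of course handle a full grammar.

```python
# Sketch: the parser pulls tokens as needed and, once a complete
# statement is parsed, performs the semantic action that emits IR
# for just that statement.
class Parser:
    def __init__(self, token_iter):
        self.tokens = token_iter
        self.current = next(self.tokens)          # one-token lookahead
        self.ir = []                              # IR emitted so far

    def advance(self):
        self.current = next(self.tokens, ("eof", ""))

    def expect(self, kind):
        assert self.current[0] == kind, f"expected {kind}, got {self.current}"
        text = self.current[1]
        self.advance()
        return text

    def parse_assignment(self):
        # assignment ::= identifier ":=" identifier "+" integer_literal ";"
        target = self.expect("identifier")
        self.expect("assign")
        left = self.expect("identifier")
        self.expect("plus")
        right = self.expect("integer_literal")
        self.expect("semicolon")
        # Semantic action: create IR for this piece of the program only.
        self.ir.append(("add", left, right, "t1"))
        self.ir.append(("store", "t1", target))

toks = iter([("identifier", "X"), ("assign", ":="), ("identifier", "X"),
             ("plus", "+"), ("integer_literal", "1"), ("semicolon", ";")])
p = Parser(toks)
p.parse_assignment()
print(p.ir)   # [('add', 'X', '1', 't1'), ('store', 't1', 'X')]
```

Note that the parser never sees the whole token stream at once: it consumes one token per `expect` call, and IR for the assignment is emitted as soon as the statement is fully parsed.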

Possible Extension

Because of the way the scanner, parser, and semantic analyzer are separated from each other, it would be possible to write each as a separate concurrent process (Ada tasks, Java threads, or POSIX threads (pthreads) in C or C++).  You will notice that the interface between the parser and the scanner, for example, is a producer/consumer problem.  The resulting program would not be any faster (likely a bit slower) on a single processor, but faster on a multiprocessor if the operating system is clever enough to schedule the tasks (threads) on different processors.  The problem for the operating system is that it knows about the single process that is the compiler when it is running, but not every OS can recognize that a single process has, say, three threads running inside it, and hence such an OS cannot schedule these threads separately.  I would recommend this approach to anyone who would like to try it, as I think you would learn some valuable lessons from it.
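The producer/consumer relationship between scanner and parser can be sketched with two threads and a bounded queue.  The token values and the end-of-input sentinel below are assumptions made for illustration.

```python
import queue
import threading

# The scanner runs as a producer thread, the parser as a consumer,
# connected by a bounded queue -- the classic producer/consumer setup.
def scanner(source_tokens, q):
    for tok in source_tokens:
        q.put(tok)           # blocks if the parser falls behind
    q.put(None)              # sentinel: end of input

def parser(q, out):
    while True:
        tok = q.get()        # blocks until the scanner produces a token
        if tok is None:
            break
        out.append(tok)      # a real parser would build the tree here

q = queue.Queue(maxsize=4)   # small buffer between the two phases
consumed = []
t1 = threading.Thread(target=scanner, args=(["X", ":=", "X", "+", "1", ";"], q))
t2 = threading.Thread(target=parser, args=(q, consumed))
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)              # ['X', ':=', 'X', '+', '1', ';']
```

On a single processor the queue adds overhead, as the notes say; the payoff would come only when the two threads can actually run on different processors.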

Starting with the Scanner

In real life, it would be possible to break the project into portions so that team members did different components of the compiler, such as the scanner, the parser, and so on.  In this class, we will be learning about each of these components in succession, so team members must work on parts of each module as we learn about them.  

Where does one start when beginning to write a compiler?  We need to get the best definition of the language available.  Ideally, this would be an ISO standard definition manual for the language.  From that definition, we begin by writing down a regular expression for each kind of token:

       Token                      Regular Expression

       identifier       letter {letter | digit}
       l_paren          (
       r_paren          )
       integer_literal  digit{digit}
       float_literal    integer_literal.integer_literal
      
       etc.

     where letter ::= a | b | c | ... | z | A | B | C | ... | Z
     and   digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

One trick you need to be aware of is that some of the regular expressions you need are "auxiliary," or "helper" regular expressions and do not themselves stand for tokens.  In the above, letter and digit are examples of auxiliary regular expressions that make it easier to define the tokens.
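The table's regular expressions can be rendered directly in a practical notation.  Here they are in Python's `re` syntax; note how letter and digit serve only as helpers and are inlined into the token patterns rather than standing for tokens themselves.

```python
import re

# Helper expressions: not tokens themselves, just building blocks.
letter = "[A-Za-z]"
digit  = "[0-9]"

# Token patterns built from the helpers, following the table above.
identifier      = re.compile(f"{letter}({letter}|{digit})*")
integer_literal = re.compile(f"{digit}{digit}*")
float_literal   = re.compile(f"{digit}{digit}*\\.{digit}{digit}*")

print(bool(identifier.fullmatch("count2")))   # True
print(bool(float_literal.fullmatch("3.14")))  # True
print(bool(identifier.fullmatch("2count")))   # False: must start with a letter
```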

For an on-demand compiler, the scanner would be a module (i.e., object or package) with an interface for the driver (parser) to access.  What would be the interface for the scanner (i.e., what public methods, variables, procedures, and/or functions are needed)?
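One possible answer, sketched as a Python class: the public interface can be as small as a single get_next_token method, with the scanner keeping its position in the source privately.  The method names and the tiny token rules here are assumptions for illustration only.

```python
# A hand-written scanner module sketch.  Public interface: get_next_token
# and the current_token attribute.  _take is an internal helper.
class Scanner:
    def __init__(self, source):
        self.source = source
        self.pos = 0              # scanner maintains its own position
        self.current_token = None

    def get_next_token(self):
        """Public: skip whitespace, then return the next (kind, text) token."""
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.source):
            self.current_token = ("eof", "")
        elif self.source[self.pos].isalpha():
            self.current_token = ("identifier", self._take(str.isalnum))
        elif self.source[self.pos].isdigit():
            self.current_token = ("integer_literal", self._take(str.isdigit))
        else:
            ch = self.source[self.pos]
            self.pos += 1
            self.current_token = ("symbol", ch)
        return self.current_token

    def _take(self, pred):
        """Private helper: consume a run of characters satisfying pred."""
        start = self.pos
        while self.pos < len(self.source) and pred(self.source[self.pos]):
            self.pos += 1
        return self.source[start:self.pos]

s = Scanner("X + 12")
print(s.get_next_token())   # ('identifier', 'X')
print(s.get_next_token())   # ('symbol', '+')
print(s.get_next_token())   # ('integer_literal', '12')
```

The parser never touches `pos` or `_take`; it only calls `get_next_token` (or reads `current_token`), which is exactly the separation the diagram above describes.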

There would also be a number of internal, or private, methods: