January 18

Friday, January 16

Reasons for compilers and high level languages

Computers today are similar in one respect to all but the very earliest computers: they all have a machine language, a set of instructions expressed in binary (usually written in hexadecimal for brevity). An instruction might look something like

7C0300000010

where 7C is the operation code for the instruction (perhaps meaning LOAD), 03 is the register, and 00000010 is the address. In this case the instruction might mean that the hardware is to load register 3 with the contents of memory location 00000010.
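To make the field layout concrete, here is a minimal sketch that splits the example instruction into its three fields. The layout (2 hex digits of operation code, 2 of register, 8 of address) is taken from the example above; the function name decode is only illustrative.

```python
def decode(instruction_hex: str) -> dict:
    """Split a 12-hex-digit instruction into opcode, register, and address."""
    if len(instruction_hex) != 12:
        raise ValueError("expected 12 hexadecimal digits")
    return {
        "opcode": int(instruction_hex[0:2], 16),    # first 2 hex digits
        "register": int(instruction_hex[2:4], 16),  # next 2 hex digits
        "address": int(instruction_hex[4:12], 16),  # last 8 hex digits
    }


fields = decode("7C0300000010")
print(fields)  # opcode 0x7C, register 3, address 0x10
```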

The very earliest computers were essentially "rewired" when they were to execute a different program. Patch panels were used for this purpose. Once the idea of the stored program was formulated (usually attributed to John von Neumann, although it is said that Ada Augusta, Countess of Lovelace, daughter of the English poet Lord Byron, actually proposed the idea to Charles Babbage a century earlier), computers no longer had to be rewired to be reprogrammed. Instructions were stored in memory just like data. To execute them, the processor brought them in from memory one at a time, then decoded and executed each one.
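The fetch, decode, and execute cycle described above can be sketched as a short loop. This is only an illustration under stated assumptions: the memory is modeled as a Python dictionary, LOAD uses the operation code 7C from the earlier example, and the HALT operation code 00 is invented here so that the loop can stop.

```python
LOAD, HALT = 0x7C, 0x00  # LOAD is from the example above; HALT is an assumption


def run_stored_program(memory: dict) -> list:
    """Fetch, decode, and execute instructions held in the same memory as data."""
    registers = [0] * 16
    pc = 0  # program counter: address of the next instruction
    while True:
        word = memory[pc]               # fetch the 48-bit instruction
        pc += 1
        opcode = word >> 40             # decode: top 8 bits
        register = (word >> 32) & 0xFF  # next 8 bits
        address = word & 0xFFFFFFFF     # low 32 bits
        if opcode == HALT:
            return registers
        if opcode == LOAD:              # execute
            registers[register] = memory[address]
        else:
            raise ValueError(f"unknown operation code {opcode:#04x}")


# Instructions at addresses 0 and 1, data at address 0x10, all in one memory.
memory = {0: 0x7C0300000010, 1: 0x000000000000, 0x10: 42}
registers = run_stored_program(memory)
print(registers[3])
```

Changing the program means changing the contents of memory, not the wiring.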

The problem was writing a program using numeric codes. It was difficult and error prone. It wasn't long before assembly languages were invented as a way to circumvent this problem. An assembly language instruction for the above might be

LOAD R3, X

where X is a symbolic name for the address 00000010. Note that assembly language

is much easier to write than numeric codes
is much easier to understand
has a one-to-one correspondence with the numeric machine instructions

In order for assembly language to work, though, a program called an assembler is needed. The assembler translates the mnemonic assembly instructions into the binary number form of the actual hardware instructions.
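Because of the one-to-one correspondence, the core of an assembler is little more than table lookup. The sketch below handles only the single instruction form used above; the tables OPCODES and SYMBOLS and the function assemble are hypothetical and hold just the values from the running example.

```python
OPCODES = {"LOAD": 0x7C}     # mnemonic -> operation code (from the example)
SYMBOLS = {"X": 0x00000010}  # symbolic name -> address (from the example)


def assemble(line: str) -> str:
    """Translate one 'MNEMONIC Rn, SYMBOL' line into 12 hex digits."""
    mnemonic, operands = line.split(maxsplit=1)
    reg_text, symbol = [part.strip() for part in operands.split(",")]
    register = int(reg_text.lstrip("Rr"))  # "R3" -> 3
    return f"{OPCODES[mnemonic]:02X}{register:02X}{SYMBOLS[symbol]:08X}"


machine_code = assemble("LOAD R3, X")
print(machine_code)  # 7C0300000010
```

A real assembler also has to build the symbol table itself, typically in a first pass over the program.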

Still there are serious drawbacks to assembly language.

It is still difficult to write large programs in assembly language
It is still difficult to read assembly language programs
There is no structure to assembly language that enforces good program style
It is platform dependent

High level languages were thus developed. This was quite a leap, as it was widely believed that the translators for high level languages would produce such inefficient code as to be nearly useless. The first high level language was Fortran (for FORmula TRANslator), which proved that the worries about inefficiency were largely unfounded. New high level languages proliferated, and with them was born the study of compilers, needed to effect the translations of the high level languages into the languages of the computers that were targeted. Notice that high level programming languages really are abstract languages. No computer "understands" them. To get a concrete program that a computer can execute, a compiler must be written to make this translation from the abstract to the concrete.

The objective of a compiler

Recall the objective of a compiler: given a program in a source programming language, translate the program into the machine language of some target computer and operating system.

Source Program              Target Machine
-----------------> Compiler ------------------->
File                        Code File

There are a number of modules to the compiler.

Source            Token            Parse                      Intermediate
--------> Scanner --------> Parser -------> Semantic Analyzer --------------+
Program           File             Tree                       Code (IR)     |
                                                                            |
+---------------------------------------------------------------------------+
|
|              Optimized                     Target Machine
+--> Optimizer -------------> Code Generator ---------------->
               IR                            Code

The scanner, parser, and semantic analyzer are usually referred to as the front end of the compiler, whereas the optimizer and code generator are referred to as the back end.
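As a rough illustration of how the front-end modules hand data to one another, here is a toy pipeline for arithmetic expressions with + and *. It is a sketch only: the grammar, the tuple form of the parse tree, and the three-address form of the intermediate code are all assumptions made for this example, not the design of any particular compiler.

```python
import re


def scan(source: str) -> list:
    """Scanner: characters -> tokens (unrecognized characters are skipped)."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[+*()]", source)


def parse(tokens: list):
    """Parser: tokens -> parse tree built from nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def factor():  # factor := NUMBER | NAME | '(' expr ')'
        token = advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            tree = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if token in ("+", "*", ")"):
            raise SyntaxError(f"unexpected token {token!r}")
        return token

    def term():  # term := factor ('*' factor)*
        tree = factor()
        while peek() == "*":
            advance()
            tree = ("*", tree, factor())
        return tree

    def expr():  # expr := term ('+' term)*
        tree = term()
        while peek() == "+":
            advance()
            tree = ("+", tree, term())
        return tree

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


def generate_ir(tree) -> list:
    """Walk the tree and emit three-address intermediate code."""
    code = []

    def walk(node):
        if isinstance(node, str):  # a leaf: number or name
            return node
        op, left, right = node
        left_name, right_name = walk(left), walk(right)
        temp = f"t{len(code) + 1}"
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    walk(tree)
    return code


tokens = scan("a + b * 2")
tree = parse(tokens)
ir = generate_ir(tree)
print(tokens, tree, ir, sep="\n")
```

Each stage consumes only the output of the stage before it, which is what makes it possible to draw the compiler as a chain of arrows.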

Reasons for a front end and a back end

As discussed in the last lecture, the front end of a compiler serves a number of purposes:

The front end is target platform independent. That is, the process of finding tokens in a generic ASCII file (scanning), determining whether the source program token stream represents a syntactically correct program in the source language (parsing), and conversion of the source language program into some intermediate form that is closer to the machine language of most extant computers (semantic analysis) can all be done without concern about which real platform is the eventual target.
The same front end can thus be used for a particular high level programming language regardless of how many real computer platforms might be targeted. This is an efficient way to design compilers.
The intermediate language program that is produced by the front end represents a program for a virtual (i.e., nonexistent) computer. Thus, the potential exists for constructing a virtual machine (a program that interprets the intermediate program) that can be compiled and run on many different computers. This is how the Java Virtual Machine (JVM) works. The JVM is a program that must be compiled on each different real computer platform in order to make it possible to run "byte code" programs on those computers.
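The idea of a virtual machine can be sketched in a few lines. The interpreter below runs a program for an invented stack machine with three instructions (PUSH, ADD, MUL). These are not real JVM byte codes; they are assumptions chosen only to show that the same intermediate program runs wherever the interpreter itself runs.

```python
def interpret(program: list) -> int:
    """Execute a list of (operation, argument...) tuples on a stack machine."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return stack.pop()


# Intermediate program for 2 + 3 * 4
program = [("PUSH", 2), ("PUSH", 3), ("PUSH", 4), ("MUL",), ("ADD",)]
result = interpret(program)
print(result)  # 14
```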

Similarly, the back end serves a number of purposes.

Since the back end translates an intermediate form program into an executable program on some particular hardware/operating system platform, it needs no knowledge of the original source program. Thus, it works to complete the compiler for any front end for any source language.
Having an existing back end for a particular intermediate code makes the development of compilers for this target machine much quicker, as only a new front end for the new language needs to be developed.
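The division of labor can be illustrated with one intermediate program fed to two different back ends. Everything here is hypothetical: the tuple form of the intermediate code and both target instruction sets (a one-register machine and a stack machine) are invented for the sketch.

```python
# Intermediate code for a + b * c, as (destination, left, operator, right).
IR = [("t1", "b", "*", "c"), ("t2", "a", "+", "t1")]

MNEMONICS = {"+": "ADD", "*": "MUL"}


def backend_register_machine(ir: list) -> list:
    """Emit code for an imagined machine with a single working register R1."""
    lines = []
    for dest, left, op, right in ir:
        lines += [f"LOAD R1, {left}",
                  f"{MNEMONICS[op]} R1, {right}",
                  f"STORE R1, {dest}"]
    return lines


def backend_stack_machine(ir: list) -> list:
    """Emit code for an imagined stack machine."""
    lines = []
    for dest, left, op, right in ir:
        lines += [f"PUSH {left}", f"PUSH {right}", MNEMONICS[op], f"POP {dest}"]
    return lines


register_code = backend_register_machine(IR)
stack_code = backend_stack_machine(IR)
print("\n".join(register_code))
print("\n".join(stack_code))
```

Neither back end looks at the source language, so with m front ends and n back ends written against the same intermediate code, m + n components give m x n compilers.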

An Example of a Complete Translation

An example of a Pascal program that has been translated into the machine code of the DEC Alpha (esus) is given here.

The Pascal source program
The intermediate form output by the semantic analyzer is often in a form that is not easily printable, such as an "abstract syntax tree." For debugging purposes the pc compiler does allow you to see an assembly language form of the program (which is technically not an intermediate form, but is not yet in executable form).
The hexadecimal version of the executable (translated) program can be viewed to see what it looks like. Note that the numbers you see here do not represent ASCII characters. This type of file is called a "binary file" (even though all files are binary) because the information is stored as binary numbers, not as binary codes for characters. That is, you really cannot make sense of this file unless you know what the binary numbers for the various op codes, registers, and so forth are in this machine language.
An ASCII character set table is provided for making conversions between ASCII characters and hexadecimal codes and vice versa. You can use it to read the contents of the hexadecimal version of the source program.
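The table lookup between characters and hexadecimal codes can also be done mechanically. The short sketch below converts a piece of text to its ASCII codes in hexadecimal and back; the sample word "begin" is chosen only as an example and is not taken from the Pascal program itself. It assumes Python 3.8 or later for the separator argument of bytes.hex.

```python
text = "begin"

# Characters -> hexadecimal ASCII codes, one two-digit code per character.
hex_codes = text.encode("ascii").hex(" ").upper()
print(hex_codes)  # 62 65 67 69 6E

# Hexadecimal codes -> characters.
round_trip = bytes.fromhex(hex_codes).decode("ascii")
print(round_trip)  # begin
```

Applying the second conversion to the executable file would produce mostly gibberish, since its bytes are numbers rather than character codes.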