arrays and run time stack

Arrays and Loop Optimizations

Optimizing Loops

Suppose that we are compiling a program called Main (for simplicity's sake, a program that has no functions or procedures) in which the following variables have been declared:

variables
N, I, J, X : integer;
A : array[-5..100,0..1000] of integer;
Z : float;

Then the symbol table for this program would look like:

Name	Nest	Size	Link
Main	0		//

lexeme	kind	type	size (bytes)	offset
N	var	int	4	4
I	var	int	4	8
J	var	int	4	12
X	var	int	4	16
A	array	int	424424	20	indices --> -5,100 --> 0,1000 --> //
Z	var	real	8	424444
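The sizes and offsets in the table can be reproduced with a few lines of arithmetic. The following Python sketch is only an illustration: the names `array_size` and `assign_offsets` are hypothetical, and it assumes 4-byte integers, an 8-byte float, and that offset 0 of the activation record holds a 4-byte old D0, so the first variable starts at offset 4.

```python
# Assumed sizes, taken from the symbol table in the text.
INT_SIZE = 4
FLOAT_SIZE = 8


def array_size(bounds, elem_size):
    """Return the size in bytes of an array with the given (low, high) bounds."""
    count = 1
    for low, high in bounds:
        count *= high - low + 1  # number of index values in this dimension
    return count * elem_size


def assign_offsets(decls, first_offset=4):
    """Assign consecutive offsets from D0; offset 0 is assumed to hold old D0."""
    table = {}
    offset = first_offset
    for name, size in decls:
        table[name] = (size, offset)
        offset += size
    return table


A_BOUNDS = [(-5, 100), (0, 1000)]  # 106 rows of 1001 elements each
decls = [
    ("N", INT_SIZE),
    ("I", INT_SIZE),
    ("J", INT_SIZE),
    ("X", INT_SIZE),
    ("A", array_size(A_BOUNDS, INT_SIZE)),
    ("Z", FLOAT_SIZE),
]
symbol_table = assign_offsets(decls)

for name, (size, offset) in symbol_table.items():
    print(name, size, offset)
```

The array size comes from 106 rows times 1001 columns times 4 bytes per element, and Z is placed immediately after the last byte of A.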

As the translated code for Main begins executing, the run time stack will look like:

space for Z

space for A

space for X

space for J

space for I

space for N

space for old D0

program code

Remember that at this point during run time, D0 points to the start of the activation record (AR) for Main (i.e., it points to the location that now holds the old D0), and SP points to the first free location on top of the run time stack (just above the space for Z).

Consider the following code inside Main:

Read(X);
Read(N);
for I in 1..N loop
  for J in 1..N loop
    A(I,J) := A(I+3,J) * X / 5 * I;
  end loop;
end loop;

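As discussed before, one might be tempted to try to make this program faster by moving invariant calculations outside the loop in which they are invariant. For example, the expression X / 5 above does not change in either loop. Treating the right-hand side as A(I+3,J) * (X / 5) * I (with integer operands this grouping can give a different result from strict left-to-right evaluation, so the rewrite is valid only when that regrouping is acceptable), it could be moved to just outside the outer loop: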

Read(X);
Read(N);
Temp1 := X / 5;
for I in 1..N loop
  for J in 1..N loop
    A(I,J) := A(I+3,J) * Temp1 * I;
  end loop;
end loop;

We see that there are two invariant expressions with respect to the J loop: I+3 and Temp1*I. We could move these to just outside the J loop.

Read(X);
Read(N);
Temp1 := X / 5;
for I in 1..N loop
  Temp2 := Temp1 * I;
  Temp3 := I + 3;
  for J in 1..N loop
    A(I,J) := A(Temp3,J) * Temp2;
  end loop;
end loop;
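The last step (moving I+3 and Temp1 * I out of the J loop) can be checked on a small example. The Python sketch below is an illustration, not part of any compiler: the array is modeled as a dictionary keyed by (I, J), the function names are hypothetical, and `//` stands in for integer division under the assumption that X is non-negative.

```python
def make_array(n):
    """Build a small test array covering rows 1..n+3 and columns 1..n."""
    return {(i, j): i * 10 + j
            for i in range(1, n + 4)
            for j in range(1, n + 1)}


def fill_before(a, n, x):
    """Loop with only X / 5 hoisted (the second version in the text)."""
    temp1 = x // 5
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[(i, j)] = a[(i + 3, j)] * temp1 * i


def fill_after(a, n, x):
    """Same loop with the J-invariant expressions hoisted (third version)."""
    temp1 = x // 5
    for i in range(1, n + 1):
        temp2 = temp1 * i   # invariant in the J loop
        temp3 = i + 3       # invariant in the J loop
        for j in range(1, n + 1):
            a[(i, j)] = a[(temp3, j)] * temp2


a1 = make_array(5)
a2 = make_array(5)
fill_before(a1, 5, 37)
fill_after(a2, 5, 37)
print(a1 == a2)
```

Both versions compute the same products because integer multiplication can be regrouped freely (ignoring overflow), which is what makes hoisting Temp1 * I safe.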

The problem with this approach is that the code becomes much harder to read and, therefore, more prone to error. It defeats the purpose of high-level language programming to force the programmer through such contortions to make a program run more efficiently. In fact, a programmer who uses a compiler with good optimization capabilities doesn't need to do this, because:

A good optimizer can locate and move invariant code outside of loops automatically.

What this means is:

A programmer should always program in the clearest manner and leave code improvement up to the optimizer.

Of course, if the programmer can find a new algorithm for accomplishing the same task, an algorithm with a better time complexity, then the programmer should implement the better algorithm.

Good optimizers are even better than one might expect when writing in a high level language. For example, consider the original statement:

A(I,J) := A(I+3,J) * X / 5 * I;

The first section of code that we need to generate calculates the address (location) on the run time stack of A(I+3,J) so that its value can be pushed. Notice that the symbol table places I at 8(D0) and J at 12(D0). Notice also that the first index of A has lower bound -5 and upper bound 100, and that the second index has lower bound 0 and upper bound 1000. We might then generate code similar to the following to compute this location.

-- Check that the first subscript value is at least as large as its lower bound
Push 8(D0)
Push 3
Adds
Push -5
CompareGE
BranchFalse BoundsError
-- Check that the first subscript is no larger than its upper bound
Push 8(D0)
Push 3
Adds
Push 100
CompareLE
BranchFalse BoundsError
-- Check that the second subscript is at least as large as its lower bound
Push 12(D0)
Push 0
CompareGE
BranchFalse BoundsError
-- Check that the second subscript is no larger than its upper bound
Push 12(D0)
Push 1000
CompareLE
BranchFalse BoundsError
-- Compute where the I+3rd Row is on the run time stack
Push 8(D0) -- push I
Push 3     -- push 3
Adds       -- I+3 on top of stack
Push -5    -- push lower bound of first index
Subs       -- stack top is now the row of A normalized to 0
Push 1001  -- push the number of elements in each row of A
Muls       -- stack top now contains the current row offset into A
Push 4     -- 4 bytes per integer location of A
Muls       -- compute byte offset of current row of A
-- the stack top now contains the offset in bytes from the start of A
-- to the I+3rd row of A
-- Now compute the offset to the Jth column of A in this row
Push 12(D0)-- push J
Push 0     -- push the lower bound for J
Subs       -- normalize this value to zero
Push 4     -- push number of bytes per element of A
Muls       -- compute the offset in bytes of J in a row
--  The second stack element now contains the byte offset to the I+3rd row
--  normalized to zero, and the stack top contains the byte offset to the
--  Jth element (in any row) normalized to zero.
--  Now add these two offsets to get the offset into A of
--  element A(I+3,J)
Adds
-- At this point, the top of the stack contains the offset on
-- the run time stack to A(I+3,J) from the start of A
-- To get to the actual location of A(I+3,J) in the activation
-- record for this procedure on the run time stack, the starting
-- offset of array A must also be added to this value.
Push 20     -- A starts at offset 20 from D0
Adds        -- stack top contains offset from AR start to A(I+3,J)
Pop T1      -- pop the offset to A(I+3,J) into register T1
Push T1(D0) -- push the value at A(I+3,J) onto the stack
-- Whew! At this point the value in A(I+3,J) has finally been pushed
-- onto the stack!
Push 16(D0) -- push X
Muls        -- A(I+3,J)*X on top of stack
Push 5
Divs        -- A(I+3,J)*X/5 on top of stack
Push 8(D0)  -- push I
Muls        -- A(I+3,J)*X/5*I on top of stack
-- at this point, we need to generate code to
-- pop the top of the stack (the result of the expression
-- evaluation) into A(I,J).  This in turn will require
-- that we generate code that will check whether I and J
-- are within range and code to calculate the location
-- of A(I,J) on the run time stack.

Notice that all of this code is buried inside the inner nested J loop in the translation. Whew!
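Stripped of the stack-machine detail, the listing above performs the following arithmetic. This Python sketch is an illustration under the symbol table's assumptions (A starts at offset 20 from D0, 4 bytes per element, bounds -5..100 and 0..1000); `element_offset` is a hypothetical name, and `IndexError` stands in for the branch to BoundsError.

```python
A_OFFSET = 20                 # offset of A from D0 (from the symbol table)
ROW_LOW, ROW_HIGH = -5, 100   # bounds of the first index
COL_LOW, COL_HIGH = 0, 1000   # bounds of the second index
ELEM_SIZE = 4                 # bytes per integer element
ROW_LENGTH = COL_HIGH - COL_LOW + 1   # 1001 elements in each row


def element_offset(i, j):
    """Return the byte offset from D0 of A(i, j), checking both subscripts."""
    if not (ROW_LOW <= i <= ROW_HIGH):
        raise IndexError("first subscript out of bounds")
    if not (COL_LOW <= j <= COL_HIGH):
        raise IndexError("second subscript out of bounds")
    row_bytes = (i - ROW_LOW) * ROW_LENGTH * ELEM_SIZE   # offset to the row
    col_bytes = (j - COL_LOW) * ELEM_SIZE                # offset within the row
    return row_bytes + col_bytes + A_OFFSET


# With I = 1 and J = 1, the element read is A(I+3, J) = A(4, 1).
print(element_offset(1 + 3, 1))
```

Only `col_bytes` and the second bounds check depend on J; everything else in the function depends on the row alone.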

A good optimizer will notice that the code checking that I is within bounds is invariant in the inner J loop and move it out of that loop. It will also notice that the calculation of the displacement to the Ith row is invariant in the J loop and move that code out as well. Since A is accessed twice in the loop (A(I+3,J) and A(I,J)), about 20 lines of code are moved out of the J loop in this case. If we suppose that the outer loop runs 1000 times and the inner loop 1000 times, the moved code executes 1,000 times instead of 1,000,000 times, a savings of about 20 * 999,000 = 19,980,000 instruction executions for the calculation of the addresses of A(I+3,J) and A(I,J) alone!
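The size of the savings can be estimated directly. The sketch below is rough arithmetic in Python, assuming (as the text does) about 20 hoisted instructions that run once per outer iteration after hoisting instead of once per inner iteration; `instruction_savings` is a hypothetical name.

```python
HOISTED_INSTRUCTIONS = 20  # the text's rough count of instructions moved


def instruction_savings(outer_iters, inner_iters, hoisted=HOISTED_INSTRUCTIONS):
    """Estimate instruction executions saved by hoisting out of the inner loop."""
    before = hoisted * outer_iters * inner_iters  # executed on every inner iteration
    after = hoisted * outer_iters                 # executed once per outer iteration
    return before - after


print(instruction_savings(1000, 1000))
```

The hoisted code does not disappear: it still runs once for each iteration of the outer loop, which is why the estimate subtracts that cost.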

The optimizing portion of a compiler will also take care of moving the X / 5 out of both loops and moving the * I part out of the J loop. Some substantial savings will occur.

You can see why some compilers don't do range testing on array indices. It's expensive. But it is also unsafe not to do it. So the Ada way of allowing the programmer to turn range checking code generation on or off in the compiler is a nice compromise.
