Start Lecture #14

We want to record the flow of information from instructions that compute a value to those that use the value. One advantage we will achieve is that if we find a value has no subsequent uses, then it is dead and the register holding that value can be used for another value.

Assume that a quad p assigns a value to x (some would call this
a *def* of x).

**Definition**:
A quad q **uses** the value computed at p (uses
the def) and x is **live** at q
if q has x as an operand and there is a possible execution path from
p to q that does not pass any other def of x.

Since the flow of control is trivial inside a basic block, we are able to compute the live/dead status and next use information at the block leader by a simple backwards scan of the quads (algorithm below).

Note that if x is dead (i.e., defined before used) on entrance to B the register containing x can be reused in B.

Our goal is to determine whether a block uses a value and, if so, in which statement it is first used. The following algorithm for computing uses is quite simple.

- Initialize all variables in B as being live.
- Examine the quads q of the block in reverse order. For each q, assume it computes x and reads y and z:
  - Mark x as dead.
  - Mark y and z as live and used at q.

When the loop finishes those values that are read before being written are marked as live and their first use is noted. The locations x that are written before being read are marked dead meaning that the value of x on entrance is not used.

Those values that are neither read nor written remain simply live.
They are **not** dead since the dynamically next basic block might use them.
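The backward scan can be sketched in a few lines of Python. This is only an illustration: the tuple layout `(dest, src1, src2)` and the name `scan_block` are assumptions made for the sketch, not notation from the lecture. Variables that never appear in the block are simply absent from the result, which corresponds to "live, with no recorded use".

```python
def scan_block(quads):
    """Backward scan of one basic block.

    quads: list of (dest, src1, src2) tuples in program order.
    Returns a dict mapping each variable mentioned in the block to
    ("live", index of its first use) or ("dead", None) on block entry.
    """
    status = {}
    for index in range(len(quads) - 1, -1, -1):
        dest, src1, src2 = quads[index]
        # The quad writes dest, so the value dest held before is not needed ...
        status[dest] = ("dead", None)
        # ... but it reads its operands, so they are live and used here.
        # Operands are marked after dest so that x = x + y leaves x live.
        for operand in (src1, src2):
            status[operand] = ("live", index)
    return status


# Hypothetical block:  t = a op b ;  a = t op c ;  d = a op t
block = [("t", "a", "b"), ("a", "t", "c"), ("d", "a", "t")]
entry = scan_block(block)
```

Here `a`, `b`, and `c` come out live on entry (first used at quads 0, 0, and 1), while `t` and `d` are dead on entry because they are written before being read.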

Note that we have determined whether values are live/dead on entrance to the basic block. We would like to know as well if they are live/dead on exit, but that requires global flow analysis, which we have not studied.

The nodes of the flow graph are the basic blocks, and there is an edge from P (predecessor) to S (successor) if S might follow P. More formally, such an edge is added if the last statement of P

- is a jump to S (it must be to the leader of S) or
- is **NOT** an **UNCONDITIONAL** jump and S immediately follows P. (Note that the figure on the right satisfies this condition for every basic block. Do **NOT** assume that is always the case.)

Two nodes are added: entry and exit.
An edge is added from entry to the first basic block, i.e. the block
that has the first statement of the program as leader.

Edges to the exit are added from any block that could be the last block executed. Specifically, edges are added to exit from

- the last block if it doesn't end in an unconditional jump.
- any block that ends in a jump to outside the program.
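The edge rules can be sketched as follows. The block encoding is an assumption made for this illustration: each block is a tuple `(name, kind, target)` where `kind` is `"fall"` (no jump), `"cond"` (conditional jump), or `"goto"` (unconditional jump), and a target of `"outside"` stands for a jump out of the program.

```python
def flow_edges(blocks):
    """Return the flow-graph edges for blocks given in program order.

    blocks: list of (name, kind, target); see the lead-in for the encoding,
    which is hypothetical and chosen only for this sketch.
    """
    edges = [("entry", blocks[0][0])]          # entry -> first block
    for pos, (name, kind, target) in enumerate(blocks):
        is_last = pos == len(blocks) - 1
        if kind in ("cond", "goto"):
            # A jump to S gives an edge to S; a jump outside goes to exit.
            edges.append((name, "exit" if target == "outside" else target))
        if kind != "goto":
            # Not an unconditional jump: control may fall through to the
            # next block, or to exit if this is the last block.
            edges.append((name, "exit" if is_last else blocks[pos + 1][0]))
    return edges


# Hypothetical program of four blocks.
program = [
    ("B1", "fall", None),
    ("B2", "cond", "B2"),
    ("B3", "cond", "B1"),
    ("B4", "fall", None),
]
edges = flow_edges(program)
```

The sketch does not remove duplicate edges, which would appear if a conditional jump targeted the block that immediately follows it.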

The flow graph for our example is shown on the right.

Note that jump targets are no longer quads but blocks. The reason is that various optimizations within blocks will change the instructions and we would have to change the jump to reflect this.

For most programs the bulk of the execution time is within loops so we want to identify these. A loop is a set L of blocks, one of which is the loop entry E, satisfying the following.

- No block in L other than E has a predecessor outside L.
- All blocks in L have a path to E completely inside L.

The flow graph on the right has three loops.

- {B_{3}}, i.e., B_{3} by itself.
- {B_{6}}.
- {B_{2}, B_{3}, B_{4}}
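The two loop conditions can be checked mechanically. The sketch below makes two assumptions. First, the figure is not reproduced here, so the edge list is a guess chosen to be consistent with the three loops listed. Second, the path to E is required to be nonempty, so that a single block only counts as a loop if it has an edge to itself.

```python
def is_loop(blocks, entry, edges):
    """Check the two loop conditions for the set `blocks` with loop entry `entry`."""
    blocks = set(blocks)
    # Condition 1: no block other than the entry has a predecessor outside.
    for src, dst in edges:
        if dst in blocks and dst != entry and src not in blocks:
            return False
    # Condition 2: every block reaches the entry by a nonempty path that
    # stays inside the set.  Grow the set of blocks known to reach it.
    reaches = set()
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if src in blocks and dst in blocks and src not in reaches:
                if dst == entry or dst in reaches:
                    reaches.add(src)
                    changed = True
    return reaches == blocks


# Assumed edges (entry and exit nodes omitted).
edges = [("B1", "B2"), ("B2", "B3"), ("B3", "B3"), ("B3", "B4"),
         ("B4", "B2"), ("B4", "B5"), ("B5", "B6"), ("B6", "B6")]

loops_found = [
    is_loop({"B3"}, "B3", edges),
    is_loop({"B6"}, "B6", edges),
    is_loop({"B2", "B3", "B4"}, "B2", edges),
]
```

With these edges, {B_{2}, B_{3}} is rejected because neither block can get back to B_{2} without leaving the set.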

**Homework**: 1.

Remark: Nothing beyond here will be on the final.

We are not covering global flow analysis; it is a key component of
optimization and would be a natural topic in a follow-on course.
Nonetheless there is something we can say just by examining the flow
graphs we have constructed.
For this discussion I am *ignoring* tricky and important
issues concerning arrays and pointer references (specifically,
disambiguation).
You may wish to assume that the program contains no arrays or
pointers.

We have seen that a simple backwards scan of the statements in a
basic block enables us to determine the variables that are
live-on-entry (and their first use) and those variables that are
dead-on-entry.
Those variables that do not occur in the block are considered live
but with no next use; perhaps it would be better to call
them ignored by the block.

We shall see below that it would be lovely to know which variables
are live/dead-on-**exit**.
This means which variables hold values at the end of the block that
will / will not be used.
To determine the status of v on exit of a block B, we need to trace
all possible execution paths beginning at the end of B.
If all these paths reach a block where v is dead-on-entry before
they reach a block where v is live-on-entry, then v is dead on exit
for block B.

The goal is to obtain a visual picture of how information flows
through the block.
The leaves will show the values entering the block and as we
proceed *up* the DAG we encounter uses of these values, defs
(and redefs) of values, and uses of the new values.

Formally, this is defined as follows.

- Create a leaf for the initial value of each variable appearing in the block. (We do not know what the value is, not even if the variable has ever been given a value.)
- Create a node N for each statement s in the block.
- Label N with the operator of s. This label is drawn inside the node.
- Attach to N those variables for which N is the last def in the block. These additional labels are drawn alongside N.
- Draw edges from N to each statement that is the last def of an operand used by N.

- Designate as *output nodes* those N whose values are live on exit, an officially-mysterious term meaning values possibly used in another block. (Determining the live on exit values requires global, i.e., **inter**-block, flow analysis.)

As we shall see in the next few sections various basic-block optimizations are facilitated by using the DAG.

As we create nodes for each statement, proceeding in the static order of the statements, we might notice that a new node is just like one already in the DAG in which case we don't need a new node and can use the old node to compute the new value in addition to the one it already was computing.

Specifically, we do not construct a new node if an existing node has the same children in the same order and is labeled with the same operation.

Consider computing the DAG for the following block of code.

    a = b + c
    c = a + x
    d = b + c
    b = a + x

The DAG construction proceeds as follows (the movie on the right accompanies the explanation).

- First we construct leaves with the initial values.
- Next we process a = b + c. This produces a node labeled + with a attached and having b_{0} and c_{0} as children.
- Next we process c = a + x.
- Next we process d = b + c. Although we have already computed b + c in the first statement, the c's are not the same, so we produce a new node.
- Then we process b = a + x. Since we have already computed a + x in statement 2, we do not produce a new node, but instead attach b to the old node.
- Finally, we tidy up and erase the unused initial values.

You might think that with only three computation nodes in the DAG,
the block could be reduced to three statements (dropping the
computation of b).
However, this is **wrong**.
Only if b is dead on exit can we omit the computation of b.
We can, however, replace the last statement with the simpler

b = c.
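The construction can be sketched in Python as below. The representation is an assumption made for the sketch: each node is a list `[op, left, right, attached_vars]`, leaves have `op` equal to `None`, and the dictionary `known` finds an existing node with the same operator and the same children in the same order.

```python
def build_dag(block):
    """block: list of (dest, left, op, right) in program order.

    Returns (nodes, current): nodes is a list of [op, left, right, attached];
    current maps each variable to the node holding its latest value.
    """
    nodes, current, known = [], {}, {}

    def node_for(var):
        if var not in current:                  # first mention: leaf var_0
            nodes.append([None, var + "0", None, []])
            current[var] = len(nodes) - 1
        return current[var]

    for dest, left, op, right in block:
        # Operands are looked up before dest is re-attached, so a statement
        # such as c = a + c would still see the old value of c.
        key = (op, node_for(left), node_for(right))
        if key not in known:                    # no matching node: create one
            nodes.append([op, key[1], key[2], []])
            known[key] = len(nodes) - 1
        if dest in current and dest in nodes[current[dest]][3]:
            nodes[current[dest]][3].remove(dest)   # old node loses the label
        nodes[known[key]][3].append(dest)
        current[dest] = known[key]
    return nodes, current


block = [("a", "b", "+", "c"), ("c", "a", "+", "x"),
         ("d", "b", "+", "c"), ("b", "a", "+", "x")]
nodes, current = build_dag(block)
interior = [n for n in nodes if n[0] is not None]
```

For the block above, `interior` has three nodes, and the node for a + x carries both c and b, which is exactly why the last statement can become the copy b = c.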

Sometimes a combination of techniques finds improvements that no single technique would find. For example if a-b is computed, then both a and b are incremented by one, and then a-b is computed again, it will not be recognized as a common subexpression even though the value has not changed. However, when combined with various algebraic transformations, the common value can be recognized.

Assume we are told (by global flow analysis) that certain values are dead on exit. We examine each root (node with no ancestor) and delete any for which all attached variables are dead on exit. This process is repeated since new roots may have appeared.

For example, if we are told, for the picture on the right, that c and d are dead on exit, then the root d can be removed since d is dead. Then the rightmost node becomes a root, which also can be removed (since c is dead).
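The repeated root deletion can be sketched as follows. The picture referred to above is not available here, so the DAG in the example is hypothetical (it corresponds to the three statements a = b + c, c = a + x, d = b + c). The dictionary layout, mapping each node to its children and attached variables, is likewise an assumption for illustration.

```python
def eliminate_dead(dag, dead):
    """dag: dict node -> (children, attached variables).

    Repeatedly deletes interior roots whose attached variables are all
    dead on exit, and returns the pruned dictionary.
    """
    dag = dict(dag)
    changed = True
    while changed:
        changed = False
        used = {child for children, _ in dag.values() for child in children}
        for node, (children, attached) in list(dag.items()):
            is_root = node not in used
            # Leaves (no children) are initial values, not computations.
            if is_root and children and all(v in dead for v in attached):
                del dag[node]
                changed = True
                break            # recompute the roots before continuing
    return dag


# Hypothetical DAG: n1 computes a, n2 computes c, n3 computes d.
dag = {
    "b0": ((), []), "c0": ((), []), "x0": ((), []),
    "n1": (("b0", "c0"), ["a"]),
    "n2": (("n1", "x0"), ["c"]),
    "n3": (("b0", "n2"), ["d"]),
}
pruned = eliminate_dead(dag, dead={"c", "d"})
```

Removing the root for d exposes the node for c as a new root, which is then removed as well; the node for a survives because a is not known to be dead.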

Some of these are quite clear. We can of course replace x+0 or 0+x by simply x. Similar considerations apply to 1*x, x*1, x-0, and x/1.

Another class of simplifications is *strength reduction*,
where we replace one operation by a cheaper one.
A simple example is replacing 2*x by x+x on architectures where
addition is cheaper than multiplication.

A more sophisticated strength reduction is applied by compilers that
recognize induction variables (loop indices).
Inside a

for i from 1 to N

loop, the expression 4*i can be strength reduced to j=j+4 and 2^i
can be strength reduced to j=2*j (with suitable initializations of j
just before the loop).
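A small Python check of these two reductions, assuming the initializations j = 0 (for 4*i) and k = 1 (for 2^i) just before the loop; a second variable k is used here only so both reductions can run in the same loop.

```python
def direct(n):
    """Compute 4*i and 2^i from scratch on every iteration."""
    return [(4 * i, 2 ** i) for i in range(1, n + 1)]


def reduced(n):
    """Same values, using an addition and a doubling per iteration."""
    results = []
    j, k = 0, 1                  # suitable initializations before the loop
    for i in range(1, n + 1):
        j = j + 4                # replaces the multiplication 4*i
        k = 2 * k                # replaces the exponentiation 2^i
        results.append((j, k))
    return results


same = direct(10) == reduced(10)
```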

Other uses of algebraic identities are possible; many require a
careful reading of the language reference manual to ensure their
legality.
For example, even though it might be advantageous to convert

((a + b) * f(x)) * a

to

((a + b) * a) * f(x)

it is illegal in Fortran since the programmer's use of parentheses
to specify the order of operations can not be violated.

Does

    a = b + c
    x = y + c + b + r

contain a common subexpression of b+c that need be evaluated only once?

The answer depends on whether the language permits the use of the associative and commutative law for addition. (Note that the associative law is invalid for floating point numbers.)
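The floating point remark is easy to observe directly. The following Python fragment (Python floats are IEEE double precision) adds the same three numbers with the two possible groupings.

```python
left = (0.1 + 0.2) + 0.3     # group the first two operands
right = 0.1 + (0.2 + 0.3)    # group the last two operands
differ = left != right       # the two groupings round differently
```

Because the results differ, a compiler that regroups y + c + b + r to expose b + c may change the answer when the operands are floating point.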

Arrays are tricky. Question: Does

    x = a[i]
    a[j] = y
    z = a[i]

contain a common subexpression of a[i] that need be evaluated only once?

The answer depends on whether i=j. Without some form of disambiguation, we can not be assured that the values of i and j are distinct. Thus we must support the worst case condition that i=j and hence the two evaluations of a[i] must each be performed.

A statement of the form x = a[i] generates a node labeled with the
operator =[] and the variable x, and having children a_{0},
the initial value of a, and the value of i.

A statement of the form a[j] = y generates a node labeled with
operator []= and three children a_{0}, j, and y, but with no
variable as label.
The new feature is that this node kills all existing nodes depending
on a_{0}.
A killed node cannot receive any future labels so cannot become a
common subexpression.

Returning to our example

    x = a[i]
    a[j] = y
    z = a[i]

We obtain the top figure to the right.

Sometimes it is not children but grandchildren (or other descendants) that are arrays. For example we might have

    b = a + 8 // b[i] is 8 bytes past a[i]
    x = b[i]
    b[j] = y

We are using C-like semantics, where an array reference is thought to be a pointer to a[0], the first element of the array. Hence b is a pointer to 8 bytes past the first element of a, which is a[2] for an integer array and a[1] for a real array. Again we need to have the third statement kill the second node even though the actual array (a) is a grandchild. This is shown in the bottom figure.

Pointers are even trickier than arrays.
Together they have spawned a mini-industry in disambiguation,
i.e., when can we tell whether two array or pointer references refer
to the same or different locations.
A trivial case of disambiguation occurs with

    p = &x
    *p = y

In this case we know precisely the value of p so the second statement kills only nodes with x attached.

With no disambiguation information, we must assume
that a pointer can refer to *any* location.
Consider

    x = *p
    *q = y

We must treat the first statement as a use of every variable; pictorially the =* operator takes all current nodes with identifiers as arguments. This impacts dead code elimination.

We must treat the second statement as writing every variable. That is all existing nodes are killed, which impacts common subexpression elimination.

In our basic-block level approach, a procedure call has properties similar to a pointer reference: For all x in the scope of P, we must treat a call of P as using all nodes with x attached and also killing those same nodes.

Now that we have improved the DAG for a basic block, we need to regenerate the quads. That is, we need to obtain the sequence of quads corresponding to the new DAG.

We need to construct a quad for every node that has a variable attached. If there are several variables attached we choose a live-on-exit variable (assuming we have done the necessary global flow analysis to determine such variables).

If there are several live-on-exit variables we need to compute one and make a copy so that we have both. An optimization pass may eliminate the copy if it is able to assure that one such variable may be used whenever the other is referenced.

Recall the example from our movie

    a = b + c
    c = a + x
    d = b + c
    b = a + x

If b is dead on exit, the first three instructions suffice. If not we produce instead

    a = b + c
    c = a + x
    d = b + c
    b = c

which is still an improvement as the copy instruction is less expensive than the addition on most architectures.

If global analysis shows that, whenever this definition of b is used, c contains the same value, we can eliminate the copy and use c in place of b.

Note that of the following 5 rules, 2 are due to arrays, and 2 due to pointers.

- The DAG order must be respected (defs before uses).
- Assignment to an array must follow all assignments to or uses of the same array that preceded it in the original block (no reordering of array assignments).
- Uses of an array must follow all (preceding according to the original block) assignments to it; so the only transformation possible is reordering uses.
- All variable references must follow all (preceding ...) procedure calls or assignment through a pointer.
- A procedure call or assignment through a pointer must follow all (preceding ...) variable references.

**Homework**: 1, 2,

A big issue is proper use of the registers, which are often in short supply, and which are used/required for several purposes.

- Some operands *must* be in registers.
- Holding temporaries (i.e., values used in only one basic block) thereby avoiding expensive memory ops.
- Holding int**er**-basic-block values (loop index).
- Storage management (e.g., stack pointer).

For this section we assume a RISC architecture. Specifically, we assume only loads and stores touch memory; that is, the instruction set consists of

    LD reg, mem
    ST mem, reg
    OP reg, reg, reg

where there is one OP for each operation type used in the three address code. We will not consider the use of constants so we need not consider if constants can be used in place of registers.

A major simplification is we assume that, for each three address operation, there is precisely one machine instruction that accomplishes the task. This eliminates the question of instruction selection.

We do, however, consider register usage. Although we have not done global flow analysis (part of optimization), we will point out places where live-on-exit information would help us make better use of the available registers.

Recall that the mem operand in the load LD and store ST instructions can use any of the previously discussed addressing modes.

These are the primary data structures used by the code generator. They keep track of what values are in each register as well as where a given value resides.

- Each register has a *register descriptor* containing the list of variables currently stored in this register. At the start of the basic block all register descriptors are empty.
- Each variable has an *address descriptor* containing the list of locations where this variable is currently stored. Possibilities are its memory location and one or more registers. The memory location might be in the static area, the stack, or presumably the heap (but not mentioned in the text).

The register descriptors could be omitted since you can compute them from the address descriptors.

There are basically three parts to (this simple algorithm for) code generation.

- Choosing registers
- Generating instructions
- Managing descriptors

We will isolate register allocation in a function getReg(Instruction), which is presented later. First presented is the algorithm to generate instructions. This algorithm uses getReg() and the descriptors. Then we learn how to manage the descriptors and finally we study getReg() itself.

Given a quad OP x, y, z (i.e., x = y OP z), proceed as follows.

1. Call getReg(OP x, y, z) to get R_{x}, R_{y}, and R_{z}, the registers to be used for x, y, and z respectively.

   Note that getReg merely selects the registers, it does *not* guarantee that the desired values are present in these registers.

2. Check the register descriptor for R_{y}. If y is not present in R_{y}, check the address descriptor for y and issue

       LD R_{y}, y'

   where y' is *some* location containing y. Perhaps y is in a register other than R_{y}.

3. Similar treatment for R_{z}.

4. Generate the instruction

       OP R_{x}, R_{y}, R_{z}

   Note that x now is *not* in its memory location.

When processing

x = y

steps 1 and 2 are analogous to the above, step 3 is vacuous, and
step 4 is omitted since getReg() will set
R_{x}=R_{y}.

Note that if y was already in a register before the copy
instruction, *no* code is generated at this point
(getReg will choose R_{y} to be a register containing y).
Also note that since the value of x is now **not** in
its memory location, we may need to store this value into x at block
exit.

You may have noticed that we have not yet generated any store instructions. They occur here (and during spill code in getReg()). We need to ensure that all variables needed by (dynamically) subsequent blocks (i.e., those live-on-exit) have their current values in their memory locations.

- Temporaries are never live beyond a basic block so can be ignored.
- Variables dead on exit (thank you global flow analysis for determining such variables) are also ignored.
- All live on exit variables (for all non-temporaries) need
to be in their memory location on exit from the block.
Therefore, for any live on exit variable whose own memory
location is not listed in its address descriptor, generate
`ST x, R` where `R` is a register listed in the address descriptor.

This is fairly clear. We just have to think through what happens when we do a load, a store, an (assembler) OP, or a copy. For R a register, let Desc(R) be its register descriptor. For x a program variable, let Desc(x) be its address descriptor.

- Load: LD R, x
  - Desc(R) = x (removing everything else from Desc(R))
  - Add R to Desc(x) (leaving alone everything else in Desc(x))
  - Remove R from Desc(w) for all w ≠ x (not in 2e **please check**)
- Store: ST x, R
  - Add the memory location of x to Desc(x)
- Operation: OP R_{x}, R_{y}, R_{z} implementing the quad OP x, y, z
  - Desc(R_{x}) = x
  - Desc(x) = R_{x} (Now Desc(x) does *not* contain `x`'s memory location!)
  - Remove R_{x} from Desc(w) for all w ≠ x
- Copy: For x = y after processing the load (if needed)
  - Add x to Desc(R_{y}) (recall that R_{y}=R_{x}).
  - Desc(x) = R_{y}.
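These descriptor updates can be sketched as a small Python class. The class name, the method names, and the string `"mem"` (standing for a variable's own memory location) are all assumptions made for the sketch. One step goes beyond the rules listed above and is marked in the code: when x is redefined, it is also removed from the register descriptors of any other register, so that the two kinds of descriptor stay consistent.

```python
class Descriptors:
    MEM = "mem"   # stands for a variable's own memory location

    def __init__(self, program_vars):
        self.reg = {}                                        # register -> variables
        self.addr = {v: {self.MEM} for v in program_vars}    # variable -> locations

    def _take_over(self, r, x):
        """Make register r hold x and nothing else."""
        for w, locations in self.addr.items():
            if w != x:
                locations.discard(r)     # remove r from Desc(w) for all w != x
        self.reg[r] = {x}

    def _leave_registers(self, x, except_reg):
        # Assumption (not in the rules above): a redefined x no longer
        # lives in any other register.
        for r, held in self.reg.items():
            if r != except_reg:
                held.discard(x)

    def load(self, r, x):                # LD r, x
        self._take_over(r, x)
        self.addr.setdefault(x, set()).add(r)

    def store(self, x, r):               # ST x, r
        self.addr[x].add(self.MEM)

    def op(self, rx, x):                 # OP rx, ry, rz for the quad x = y OP z
        self._take_over(rx, x)
        self._leave_registers(x, rx)
        self.addr[x] = {rx}              # x's memory location is now stale

    def copy(self, x, ry):               # x = y, with y already in ry
        self._leave_registers(x, ry)
        self.reg[ry].add(x)
        self.addr[x] = {ry}

    def exit_stores(self, live_on_exit):
        """(variable, register) pairs that need an ST at block exit."""
        return [(x, sorted(self.addr[x])[0]) for x in live_on_exit
                if self.MEM not in self.addr[x]]


# Quads: t = a - b; u = a - c; v = t + u; a = d; d = v + u
desc = Descriptors(["a", "b", "c", "d"])
desc.load("R1", "a"); desc.load("R2", "b"); desc.op("R2", "t")
desc.load("R3", "c"); desc.op("R1", "u")
desc.op("R3", "v")
desc.load("R2", "d"); desc.copy("a", "R2")
desc.op("R1", "d")
stores = desc.exit_stores(["a", "b", "c", "d"])
```

For this sequence of quads and register choices, `stores` asks for a to be stored from R2 and d from R1, and nothing for b and c, whose memory copies are still current.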

Since we haven't specified getReg() yet, we will assume there are an unlimited number of registers so we do not need to generate any spill code (saving the register's value in memory). One of getReg()'s jobs is to generate spill code when a register needs to be used for another purpose and the current value is not presently in memory.

Despite having ample registers and thus not generating spill code, we will not be wasteful of registers.

- When a register holds a temporary value and there are no subsequent uses of this value, we reuse that register.
- When a register holds the value of a program variable and there are no subsequent uses of this value, we reuse that register **providing** this value is also in the memory location for the variable.
- When a register holds the value of a program variable and all subsequent uses of this value are preceded by a redefinition, we could reuse this register. But to know about all subsequent uses may require live/dead-on-exit knowledge.

This example is from the book. I give another example after presenting getReg(), that I believe justifies my claim that the book is missing an action for load instructions, as indicated above.

Assume a, b, c, and d are program variables and t, u, v are compiler generated temporaries (I would call these t$1, t$2, and t$3). The intermediate language program is in the middle with the generated code for each quad shown. To the right is shown the contents of all the descriptors. The code generation is explained on the left.

    t = a - b
        LD  R1, a
        LD  R2, b
        SUB R2, R1, R2
    u = a - c
        LD  R3, c
        SUB R1, R1, R3
    v = t + u
        ADD R3, R2, R1
    a = d
        LD  R2, d
    d = v + u
        ADD R1, R3, R1
    exit
        ST  a, R2
        ST  d, R1

- For the first quad, we need all three instructions since nothing is register resident on block entry. Since b is not used again, we can reuse its register. (Note that the current value of b is in its memory location.)
- We do not load a again since its value is in R1, which we can reuse for u since a is not used below.
- We again reuse a register for the result; this time because c is not used again.
- The copy instruction required a load since d was not in a register. As the descriptor shows, a was assigned to the same register, but no machine instruction was required.
- The last instruction uses values already in registers. We can reuse R1 since u is a temporary.
- At block exit, lacking global flow analysis, we must assume all program variables are live and hence must store back to memory any values located only in registers.

Consider

x = y OP z

Picking a register for y and picking one for z are done the same way; we just describe y.
Choosing a register for x is a little different.

A copy instruction

x = y

is easier.

Similar to demand paging, where the goal is to produce an available
frame, our objective here is to produce an available register we can
use for R_{y}.
We apply the following steps in order until one succeeds.
(Step 2 is a special case of step 3.)

1. If Desc(y) contains a register, use one of these for R_{y}.
2. If Desc(R) is empty for some registers, pick one of these.
3. Pick a register for which the cleaning procedure generates a minimal number of store instructions. To clean an in-use register R do the following for each v in Desc(R).
   - If Desc(v) includes something besides R, no store is needed for v.
   - If v is x and x is not z, no store is needed since x is being overwritten.
   - No store is needed if there is no further use of v prior to a redefinition. This is easy to check for further uses within the block. If v is live on exit (e.g., we have no global flow analysis), we need a redefinition later in this block.
   - Otherwise a *spill* ST v, R is generated.

As stated above choosing R_{z} is the same as choosing
R_{y}.

Choosing R_{x} has the following differences.

- Since R_{x} will be written it is not enough for Desc(x) to contain a register R as in 1. above; instead, Desc(R) must contain only x.
- If there is no further use of y prior to a redefinition (as described above for v) and if R_{y} contains only y (or will do so after it is loaded), then R_{y} can be used for R_{x}. Similarly, R_{z} might be usable for R_{x}.

getReg(x=y) chooses R_{y} as above and chooses
R_{x}=R_{y}.
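The choice of R_{y} can be sketched as a cost function plus a minimum. This is a simplified illustration, not a full getReg: the function names and the `"mem"` marker are assumptions, the set `needed_later` (variables used again before being redefined) is taken as given, and the sketch does not exclude a register already chosen for another operand of the same quad.

```python
def stores_to_clean(r, reg_desc, addr_desc, x, z, needed_later):
    """Number of spill stores needed before register r can be reused.

    reg_desc: register -> set of variables; addr_desc: variable -> locations.
    x and z are the destination and the other operand of x = y OP z.
    """
    stores = 0
    for v in reg_desc.get(r, set()):
        if addr_desc[v] - {r}:
            continue                 # v is also stored somewhere else
        if v == x and x != z:
            continue                 # x is about to be overwritten anyway
        if v not in needed_later:
            continue                 # no further use before a redefinition
        stores += 1                  # otherwise a spill ST v, r is needed
    return stores


def choose_ry(y, registers, reg_desc, addr_desc, x, z, needed_later):
    for r in registers:              # step 1: y is already in a register
        if r in addr_desc[y]:
            return r
    # Steps 2 and 3: an empty register costs zero stores, so taking the
    # minimum covers both.
    return min(registers, key=lambda r: stores_to_clean(
        r, reg_desc, addr_desc, x, z, needed_later))


# State after a = b + c, about to process d = a + e.
regs = ["R1", "R2", "R3"]
reg_desc = {"R1": {"b"}, "R2": {"c"}, "R3": {"a"}}
addr_desc = {"a": {"R3"}, "b": {"mem", "R1"}, "c": {"mem", "R2"},
             "d": {"mem"}, "e": {"mem"}}
all_live = {"a", "b", "c", "d", "e"}     # no global flow analysis
reg_for_e = choose_ry("e", regs, reg_desc, addr_desc, "d", "a", all_live)
reg_for_a = choose_ry("a", regs, reg_desc, addr_desc, "d", "e", all_live)
```

In this state R1 and R2 can be cleaned for free, since b and c are also in memory, whereas reusing R3 would cost a store of a.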

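| Quad and generated code | R1 | R2 | R3 | a | b | c | d | e |
|---|---|---|---|---|---|---|---|---|
| (block entry) | | | | a | b | c | d | e |
| a = b + c: LD R1, b; LD R2, c; ADD R3, R1, R2 | b | c | a | R3 | b, R1 | c, R2 | d | e |
| d = a + e: LD R1, e; ADD R2, R3, R1 (2e →) | e | d | a | R3 | b, R1 | c | R2 | e, R1 |
| same quad and code (me →) | e | d | a | R3 | b | c | R2 | e, R1 |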

We needed registers for d and e; none were free. getReg() first chose R2 for d since R2's current contents, the value of c, was also located in memory. getReg() then chose R1 for e for the same reason.

Using the 2e algorithm, b *might* appear to be in R1
(depends if you look in the address or register descriptors).

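| Quad and generated code | R1 | R2 | R3 | a | b | c | d | e |
|---|---|---|---|---|---|---|---|---|
| a = e + d: ADD R3, R1, R2 (descriptors unchanged) | | | | | | | | |
| e = a + b: ADD R1, R3, R1 ← possible wrong answer from 2e | e | d | a | R3 | b, R1 | c | R2 | R1 |
| e = a + b: LD R1, b; ADD R1, R3, R1 | e | d | a | R3 | b | c | R2 | R1 |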

The 2e *might* think R1 has b (address descriptor) and also
conclude R1 has only e (register descriptor) so might generate the
erroneous code shown.

Really b is not in a register so must be loaded. R3 has the value of a so was already chosen for a. R2 or R1 could be chosen. If R2 was chosen, we would need to spill d (we must assume live-on-exit, since we have no global flow analysis). We choose R1 since no spill is needed: the value of e (the current occupant of R1) is also in its memory location.

    exit
        ST a, R3
        ST d, R2
        ST e, R1

What if a given quad needs several OPs and we have choices?

We would like to be able to describe the machine OPs in a way that enables us to find a sequence of OPs (and LDs and STs) to do the job.

The idea is that you express the quad as a tree and express each OP as a (sub-)tree simplification, i.e. the op replaces a subtree by a simpler subtree. In fact the simpler subtree is just a single node.

The diagram on the right represents x[i] = y[a] + 9, where x and y are on the stack and a is in the static area. M's are values in memory; C's are constants; and R's are registers. The weird ind (presumably short for indirect) treats its argument as a memory location.

Compare this to grammars: A production replaces the RHS by the LHS. We consider context free grammars where the LHS is a single nonterminal.

For example, a LD replaces a Memory node with a Register node.

Another example is that
ADD R_{i}, R_{i}, R_{j}
replaces a subtree consisting of a + with both children registers (i
and j) with a Register node (i).

As you do the pattern matching and reductions (apply the productions), you emit the corresponding code (semantic actions). So to support a new processor, you need to supply the tree transformations corresponding to every instruction in the instruction set.

This is quite cute.

We assume all operators are binary and label the instruction tree with something like the height. This gives the minimum number of registers needed so that no spill code is required. A few details follow.

- Draw the expression tree, the abstract syntax tree for an expression.
- Label the leaves with 1.
- Label interior nodes with L:
  - If the children have the same label x, L=x+1. This looks like height.
  - If the children have different labels, x and y, L=max(x,y).

- Recursive algorithm starting at the root. Each node puts its answer in the highest number register it is assigned. The idea is that a node uses (mostly) the same registers as its sibling.
  - If the labels on the children are equal to L, the parent's label is L+1.
    - Give one child L regs ending in highest assigned to parent. Note that the lowest reg assigned to the parent is not used by this child. The answer appears in top reg assigned to the child, which is the top reg assigned to the parent.
    - Give other child L regs, ending one below the top reg assigned to the parent. This child does use the bottom reg assigned to the parent. The answer appears in top reg assigned to the child, i.e., the penultimate parent reg.
    - Parent uses a two address OP to compute its answer in the same reg used by first child, which is the top reg assigned to the parent.
  - If the labels on the children are M<L, the parent is labeled L.
    - Give bigger child all L parent regs.
    - Give other child M regs ending one below bigger child.
    - Parent uses 2-addr OP computing answer in L
  - If at a leaf (operand), load it into assigned reg.
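The labeling and the recursive register assignment can be sketched together in Python. The tree encoding (a string for a leaf, a tuple `(op, left, right)` for an interior node) is an assumption for the sketch, as is the choice to hand the right child the top register when the labels are equal. The emitted instructions use the three operand form OP dst, left, right, with dst equal to one of the operand registers, rather than a literal two address form.

```python
def label(tree):
    """Leaves get 1; equal child labels x give x+1; otherwise the max."""
    if isinstance(tree, str):
        return 1
    _, left, right = tree
    l, r = label(left), label(right)
    return l + 1 if l == r else max(l, r)


def generate(tree, base=1, code=None):
    """Emit code for tree using registers R[base] .. R[base + label - 1].

    Returns (code, number of the register holding the result), which is
    always the top register of the assigned range.
    """
    code = [] if code is None else code
    top = base + label(tree) - 1
    if isinstance(tree, str):
        code.append(f"LD R{top}, {tree}")
        return code, top
    op, left, right = tree
    l, r = label(left), label(right)
    if l == r:
        # Equal labels: one child ends at the parent's top register,
        # the other ends one register below it.
        _, rr = generate(right, base + 1, code)
        _, lr = generate(left, base, code)
    elif l > r:
        _, lr = generate(left, base, code)       # bigger child: all registers
        _, rr = generate(right, top - r, code)   # other child ends at top - 1
    else:
        _, rr = generate(right, base, code)
        _, lr = generate(left, top - l, code)
    code.append(f"{op} R{top}, R{lr}, R{rr}")
    return code, top


# (a - b) + e * (c + d)
expr = ("ADD", ("SUB", "a", "b"), ("MUL", "e", ("ADD", "c", "d")))
code, result = generate(expr)
```

For this expression the root label is 3, and the nine emitted instructions (five loads and four operations) touch only R1, R2, and R3, with the final answer in R3.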

Can see this is optimal (assuming you have enough registers).

- Loads each operand only once.
- Performs each operation only once.
- Does no stores.
- Minimal number of registers having the above three properties.
  - Show need L registers to produce a result with label L.
  - Must compute one side and not use the register containing its answer before finishing the other side.
  - Apply this argument recursively.

Rough idea is to apply the above recursive algorithm, but at each recursive step, if the number of regs is not enough, store the result of the first child computed before starting the second.