# Compilers Lecture 33: Instruction Scheduling (2)

>>So this is the

data dependence graph and last time we computed

for each instruction in the graph what we call

the critical path distance. So the critical path

of this is four. The path for this zero. For this, it’s one,

seven, and this is five. This is four and this is two. Okay? And with the critical path

distance, we computed a schedule that is — So let’s

put the schedule here. In cycle one, we put E

because it has the highest critical path. In cycle two, we put D. And

in cycle three, we put B or F. Is F ready in cycle three?>>No.>>No. F is not ready

in cycle three, so we cannot put

it in cycle three. So in fact we put B. And

in cycle four, we put F. And in cycle five, we

put A. In cycle six, we put C. In cycle seven?>>It was a stall.>>Yeah, we couldn’t

put anything because nothing was ready here. So in cycle seven, the

ready list was empty. And in cycle eight, we put

G. And this is a schedule that uses eight cycles as

opposed to how many cycles for the original program order?>>Fourteen.>>Fourteen, yeah,

compared to 14. So this saves a number

of cycles. And again, this is a simulation that the compiler

does at compiled time. And the accuracy of

this simulation depends on the accuracy of

the machine model to how accurately the compiler

is modeling the processor. This is clearly a

simple machine model, we’re assuming a single issue

machine and we are assuming that we know exactly

what the latencies are but real machines are

going to be multiple issue. You can issue multiple

instructions in each cycle. And the latencies are

not known precisely. Usually there is a, you know,

a range of latencies for some of the more complex

instructions. But given this model,

this is what we have. An important note about this

is that we are assuming here that we are doing instruction

scheduling within a basic block. So instruction scheduling is

done within a basic block. So using the same

terminology that we used for register allocation,

would this be local or global?>>Local.>>Local. So this is local

instruction scheduling. Local within the basic block. Now global instruction

scheduling — In fact we will not cover

global instruction scheduling in this course but we will

just talk about it now for just to point out some

of the complexities in global instruction

scheduling. In global instruction

scheduling, there are complexities. For example, if you have this

basic block and this basic block and here’s an instruction

I or in fact, it will be — So let’s do this like an if. So if I have an instruction

I in this basic block, in order to move

it, so if we were to do global instruction

scheduling, we would like to have the

flexibility of moving I to this basic block,

for example. But this will not always be —

This will not always be legal. Right? Why is that? Moving I here from

this basic block to this basic block will

not always be legal. It will not always

generate correct code.>>Yeah, because you don’t know, because what I might’ve

[inaudible] considering it’s in the one before it splits,

you’d probably have add it to the other one, too.>>Yes and it may be — It

depends — This is a condition. Right? So with this

basic block, this is, you’re checking some

condition, conditional branch.>>Yeah, because

it was before –>>If some condition, then

do this, else do this. So this is the then

and this is the else.>>Right. Right. If it needs to do

something that’s checked for by the condition, you

move it after the condition and the condition is not going

to have the correct check.>>Yeah, exactly. So it may be that, you

know, I, when you do I here, when you move it

here, I will have, assuming that the data

dependencies are honored. So we are assuming that all the

data dependencies are captured by the data dependence graph,

then if we move it here, now this basic block, the

then will be executing the I but the else will not

be executing the I. So here, the I will not

be executed all the time. So with a motion like

this, you may want to copy I in both basic blocks. So this is one complexity. And now if you do the motion

in the other direction, so suppose that I have

an instruction J here, so moving J from

this basic block to this basic block again

will not always be legal because J should be executed

only if the else is true. And if you move it here, you will be executing it anyway

whether the condition is true or false. And this may not be legal

unless the J is going to be harmless for this task. So we’re not here

trying to understand or describe an algorithm for

global instruction scheduling, but I’m just trying to point

out some of the complexities that arise when we do global

instruction scheduling. And that’s why, you know, some production compilers

do not implement global instruction scheduling. They only implement local. And one of them is

the LLVM compiler. The LLVM compiler does

not implement global instruction scheduling. It implements local

instruction scheduling. So — So now we are 100% focused on

local instruction scheduling. So we did this last time and we started discussing

the algorithmic details. So now we understand the concept

behind this critical path list scheduling, critical path list

scheduling, and now we would like to understand the details. And one of the details is the

notion of the dependent count for each instruction or for

each node in this graph. In fact, let’s try

to think about, to think of the data structure that we need for

each instruction. So if we have class instruction,

what do we need to keep in this? So one of the entries — One of

the data fields in this is going to be the depth count. What else do you think I need

to store for each instruction to track for each

instruction in this algorithm?>>The cost for the priority.>>Okay, so the dependent count,

the critical path distance, yes. [ Inaudible Student Comment ] And what else?>>What nodes are

connected to it.>>Well, yeah. So that’s part of the

graph representation. So the graph representation, remember in your data

structures class, graphs, there are multiple ways

of representing graphs. Can you remember one of them? How we represent the graphs,

there are two representations, at least two common

representations of graphs. There are adjacency lists. Do you remember adjacency lists?>>Matrix.>>Yeah, matrix.>>So adjacency list. So this graph will

be represented as an adjacency list. In fact, if we draw the

adjacency list representation of this — The whole point

in today’s lecture is to think algorithmically

about this so we understand the concept

but in today’s lecture, we will trying to understand the

algorithm, the actual algorithm. So the adjacency list is going to have here A and the

adjacency list there will be an array of nodes A, B, C, D, E, F, G. And what will we

have for each node here? We will have the

list of neighbors. So for A, it’s going to

be C. For B, the neighbors or the dependence in

this context are C. For C, what do we have here? G. For D?>>F.>>F. And for E, F.

And for F, we have G and G doesn’t have

an adjacency list. So this is the adjacency list. Of course, this is

a simple example. So the adjacency list of each

node has only one item in it but in general it will

have multiple items. Okay? So now — Okay,

so how does this, the list scheduling

algorithm start? What’s the first step? So it’s based on the

notion of the ready list. So, okay, let’s write

the code here. List scheduling and first

create a ready list R. And what was the other

list that we used in the example last time?>>The active list.>>The active list. Yes. The ready list R and

an active list A. Okay. So initially what does

the active list have and what does the

ready list have?>>The ready list has nodes

that don’t have anything, any other nodes pointing

to them.>>Okay. So remember the

depth count equals zero, depth count equals zero. So the nodes that have

a depth count or zero. Well first of all, we have

to compute the depth counts. So this step, compute

depth count for each node. Now this — We are writing

this as a single line but there is an underlying

algorithm here. How would you compute the depth

count given this adjacency list representation of the graph? So this is an adjacency

list representation, how would you compute

the depth count for each node given this — So

look at this adjacency list. So, you know, the depth count

is — Let’s write it in red. So here it’s zero for this. It’s one for this. It’s — Sorry. For this, it’s zero, zero, zero. For C, the depth count is?>>One.>>No. Two. It’s dependent on

two other nodes.>>Oh, okay. Yeah.>>Okay, so depth

count equals two. Depth count equals two.>>And that one’s two.>>Depth count equals two. Okay. Now how can we

calculate the depth count?>>Count the occurrences

in the list.>>Count what?>>The occurrences

of those in the list, of each node in the list.>>Yeah, exactly. So it’s the number of

occurrences of each node in the adjacency list. So we’ll have to go

through the adjacency list and count how many times

each node appears on the, in one item in one of the lists. So C appears twice. F appears twice. And G appears twice. And other nodes do not appear

at all in the adjacency list. So these are the

adjacency lists. Okay, so there is an

underlying algorithm. In fact, what would be

the running time for this? Compute the dependent

count for each node. [ Inaudible Student Comment ] No.>>Oh, wait. N squared.>>Well, this is N. So assuming that N is the number

of instructions.>>So it would have

to be N squared.>>It’s not that bad. No. So it’s N. What’s

the total number of items in all adjacency lists?>>The number of edges.>>The number of edges. Yeah, exactly. The number of edges. So in fact, the running time

for doing this is going to be — Well, if V is the

number of vertices, the number of vertices,

E is number of edges. So in fact compute the

dependents count for each node, this is going to

be O of V plus E. Okay, now we compute

the depth count. Then what’s the next step? For each node whose

depth count is zero, we add it to the ready list. So for each instruction,

let’s call it op, if op.depth count equals

zero, then R.add op R, add op to the ready list. Okay? Now we have the ready

list, the initial ready list. Then we will have to go through

the ready list to keep looping and we pick one instruction

from the ready list. So we will have a big

while loop, while. What should be the

termination condition here? What should be the

termination condition? So we should keep

looping as long as?>>The ready list

and the active list.>>Yeah, exactly. So in fact, if the

ready list is empty, that’s not a termination

condition because here, we have the ready list empty

but we are not done yet. In fact, you can just count

the number of instructions that you have scheduled. So one way of doing it is

counting how many instructions you have scheduled and

when that number is equal to the total number of

instructions, you are done. Another way of doing

it is looking at the ready list

and the active list. And you terminate

when both are empty. If one of them is not

empty, you keep going. So while R or A is not empty. One of them is not

empty, we loop. Now here I’m using indentation to indicate what’s

inside the loop. So I’m not going to use braces. I will use indentation to

indicate what’s inside the loop, which is typical pseudocode,

pseudocode convention. Okay, now what do we — We

look at the ready list, right? So if R is not empty, we pick an

instruction from the ready list. Now what do you think would

be the natural implementation of a ready list here? What would be the

natural implementation?>>Priority queue.>>Yeah, priority

queue, exactly. So this — You know,

conceptually, this is clearly a priority queue because we have these ready

nodes and each one has a number that indicates their priority. So this is exactly what a

priority queue was designed for. So we’ll assume that R is

then R is a priority queue, so R is a priority queue while

A, we don’t have the notion of, we don’t have the notion of an

order or a priority within A or a ranking within A, so A

can be just a simple list, just a list, unordered list.>>It could also be a

small hash table, too.>>No. Why complicate things? It can be. Well, it can be — You can make

it the most sophisticated data structure ever but

we don’t need it. You know, here what

I’m trying to say is that this is an unordered list. We don’t have the notion of

a priority or an ordering, so we use the simplest

that works. And always when you

design data structures, you use the simplest

data structure that can get the job done. Okay, and if R is not empty, then we extract the

maximum from the ready list. So we do op equal R.extract max. So extract the node with

the maximum priority number. So in this case, the key

for our priority list, for our priority queue is going

to be these numbers in blue. The numbers in blue are

going to be the keys for the priority queue. So if R is not empty,

extract the max. And then what should

we do with it? We extract it, we schedule it. So we schedule it. So we should have here the

cycle for each instruction. So we set the cycle. We say, okay, op.cycle

equals current cycle. Now we discover that we

should track the cycle number. So we will add this — Yeah,

let’s put it in the beginning, so here, or we can put it here. Cycle number equals one. So op.cycle equals cycle number

or let’s call it current cycle. Current cycle would

be a clearer name. Current cycle. So op.cycle equals

current cycle. Okay? And then what

should we do with it? Now we’re scheduling this.>>We want to add it

to the active list.>>Exactly, yeah. Yeah, so A.add. It doesn’t matter where we

add it, A.add of this op. Okay, so now we have

scheduled this. Now in order to make

progress in this algorithm, what do we need to do? [ Inaudible Student Comment ] Okay, so now after we do this,

so here current cycle plus plus. Okay. And after incrementing

the cycle, what do you think we

should be doing next?>>Check if any instructions

are done.>>Yeah, exactly, exactly. So we check if any

instruction is done. And an instruction is

done, so for each — So now to check if any

instruction is done, which instructions

should we be checking?>>The ones in the active list.>>Exactly. The ones in the active list. So for each instruction op

in A, in the active list. So what’s the condition

for done? What’s the condition? Well, how do we know

that it’s done? So for example, E is going

to be scheduled in cycle one. And it will be done in cycle one

plus three which is four, right?>>If the current cycle equals

the cycle done for the –>>Okay, so for the

current cycle, which is — When the current

cycle becomes four. And what is four? Four is equal to the cycle in which we put E,

which is one, plus?>>Latency.>>Latency. So now we realize that we

need to add the latency. Okay, so I will just put the

latency before the cycle, latency and then cycle. Okay, latency. So if op.cycle plus op.latency, the cycle for op plus the

latency for op equals what?>>The current cycle.>>Current cycle, exactly, equals current cycle,

then this is done. So we remove op from —

We can say that A.remove, remove it from the active list. And then what’s the important

thing that we need to do?>>We need to see if

there’s anything else ready.>>Okay, yeah, if it makes

something else ready. So the completion of one instruction may

decrement the depth count of its dependence. So we have to go

through its dependence. So now for each successor

or neighbor — No, let’s call it successor. For each successor,

op successor S of op, for each successor S

of op, we do S. what? Depth count minus minus. Okay, so now when we

get to cycle four, we decrement the depth count of

F by one, which means that one of the instructions on which

F depends has just completed. So the dependency has

gotten reduced by one. So we look at the successors and

S.depth count is decremented. Then?>>Add S to the ready list.>>If it depth count is zero.>>Exactly. Only if the depth count

is zero because here, if the depth count becomes

one, it’s not ready yet. So an instruction may depend

on many other instructions. In general, you know,

this is the S. This is op. This is op and this

is S, the successor. And the depth count

may be a big number. So it could be five and there

are four other dependencies. So this op is going to decrement

it to four, then it will be — When all four are done,

S will become ready. So for each successor S of

op, S.depth count minus minus. If S.depth count equals

zero, then R.add S or op? Should we add S or op?>>S.>>S. Okay? So that should do it. So this is the — This is the algorithm. So this is what — This

will allow us to, you know, keep updating the ready list. Okay, so now let’s try to

analyze this algorithm. Now the running time of

this algorithm, now assuming that the data dependence graph

is already there has been computed — Well in fact,

yeah, we didn’t talk about the critical

path in this case. So in fact, you know, what is

missing here is computing the critical paths for each,

compute dependence count and critical path. And critical path. And in fact, even if we

compute the critical path, computing the critical

path, as we said last time, is going to be a DFS and it

will be another V plus E. So dependent count and critical

path, each one of them is going to be V plus E. So the

total asymptotic complexity for this is V plus E. Okay? So it’s not going to add to the

asymptotic complexity, right, because we’ll multiply

it by a constant. Now for each instruction

op, if — So what will be the asymptotic

complexity for this loop?>>It would be V.>>V. But adding to

the priority list? So this depends on

the implementation of the priority queue. So if we assume that

the priority — What’s a typical implementation

of a priority queue?>>Heap.>>Heap, yeah. Heap is probably the most

common implementation of a priority queue. By the way, sometimes other

implementations may be more efficient. You know, for example,

in Dijkstra’s algorithm, if the graph is dense, an unsorted array would

work better than a heap. But let’s assume that the

priority queue is a heap, is a heap, is implemented as

a heap, for the heap, the add and extract operations

will take — What would be the running

time for add and extract? So in fact, add or

insert and extract max this is the [inaudible],

extract max or dequeue. Let’s, yeah, let’s just

call it extract max. So what’s the running time for

each of these heap operations?>>Log N.>>Yeah, it’s log N. It’s log

N. So each one of them is log N. This means that what

we have here is because this is a heap,

this is of V log V. Okay? Because we do it for

every op, there are V ops and this is log V for

each in the worst case, yeah, in the worst case. But in general, it will

be much less because — Yeah, it will be much less because these root instructions

typically will be a minority, a small subset of

the instructions, but in the worst case, it

could be that all of them are. So this is a worst

case estimate. Now the loop itself, the

number of executions of — The number of times this loop

is executed will be what?>>V. It’s V, right?>>What’s that?>>It’s V, V times.>>It could be more. So here in this example,

it’s more than V.>>Because it stalls.>>Yeah, there are stalls. Yeah, so in fact it’s V — If we’re going to come up with

an upper bound on the number of executions of this loop, how

would you compute an upper bound on the number of

executions of this loop?>>Find the biggest

latency and multiply it.>>Yeah, exactly. The biggest latency multiplied

by the number of operations. But asymptotically,

what will that be? So if the maximum latency — So

if it’s V times maximum latency, what would that be

asymptotically?>>It’s still V.>>It’s still V, why? Because maximum latency

is a constant. And why it is a constant? Let’s make sure that we

understand why it’s a constant. What does it mean for

a certain parameter to be a constant

in an algorithm? The maximum latency is

a constant in the sense that it does not depend

on the size of the input. In other words, it does

not depend on the size of the data dependence graph. The input to this algorithm

is this data dependence graph. And the size of this data

dependence graph depends on the number of nodes

and the number of edges and it has nothing to do

with the maximum latency. The maximum latency

is a property of what?>>The machine model.>>The machine model. Yes, the target machine. So it depends on what

instructions that machine has. What are the instructions and what’s the maximum

latency instruction, you know, for example typically a divide

instruction will have a latency of six or eight or ten. So the maximum — The latency, this maximum latency is not

a function of the input size. It’s a function of the

— It’s a parameter, it’s a target machine

parameter which means that we can think

of it as a constant. No matter how big

the input graph is, the maximum latency is

not going to change. So we can think of

it as a constant, which means that this loop

will get executed O of V times. Okay? Now if — Let’s

look at what’s inside. So here we have — Every time

we look at the ready list, if the ready list is not

empty, we will do this, we will do the extract max. So this extract max will take

log V, right, so this is — This by itself is log V. And

throughout the — So this is — Each one of them, each extract

max is log V but overall, an aggregate throughout the

execution, the whole lifetime of the algorithm is going

to be V log V. Right? So this thing will

get executed V times. Okay, now A.add,

this is constant because the active

list is unordered, so we can insert in any order. It doesn’t matter. Now here when we look

at the active list, how many times will we be doing

this check for each instruction? This checking for each

instruction, if it’s active or not or sorry, if

it’s done or not, how many times will this

checking be done for each?>>V times the size of A.>>V times the size of A.

Well the size of A is dynamic. It keeps changing. Let’s — Okay, the question

is for any given instruction, any given instruction will be in

the active list at some point. How long, for how many cycles

will an instruction stay in the active list?>>For our model, four.>>It depends — Yeah, for the

latency of that instruction. So the time that an

instruction stays in the active list is dependent on the latency of

that instruction. So for an upper bound,

an instruction will stay for a number of cycles

that is less than or equal to the maximum latency. And the maximum latency

is a constant. So this means that this

checking will be performed O of maximum latency times V.>>Which is still V.>>Yeah, exactly,

which is still O of V. So let’s put the aggregate

here then for this, let’s put the aggregate

which is V log V. So we are putting the

aggregate, number of executions. Aggregate means the number of times it’s executed

throughout the entire algorithm. Okay, now for each successor S

of V, now this particular line, for each successor going

through the successors, how many times will

this get executed? What is this dependent on? So throughout the

entire algorithm, how many times will this get

executed, this particular line?>>Every time something gets

removed from the active list.>>Okay, well not –>>Dynamic.>>Well, when we remove

it, when we remove it, we go through all successors. So for E, we go through this. For D — So how many times

will this line get executed?>>However many elements

are in the adjacency list.>>Yeah, exactly, which

is the number of edges. This is the number of

edges in the graph. This is E, the number

of edges in the graph. So this is going to be O of E. Okay? Alright. So this is E. But if

this is depth count equals zero, when depth count becomes

zero, we will add it to the, we will add it to

the ready list. Right? Now how many times

will this get executed? So this is going to

get executed E times. But the addition

to the ready list, how many times will this get

executed for each instruction?>>Just once.>>Once. An instruction

becomes ready once. So this is going to be — What’s the aggregate

running time for this line?>>V log V.>>Yeah, V log V, exactly. So this is going to be V log V. So then what would be

the overall running time? So obviously, we can ignore the

Vs. We have lots of Vs log Vs. But are these Vs log Vs that

we have, are they additive? Should we add them

or multiply them? Yeah, so this V log V — Is

this V log V within this V log V or should we add it to it?>>Add.>>It’s add.>>So in fact the

relation between the Vs log Vs is additive. And when you add a V

log V to a V log V, an asymptotic notation gives

you a V log V. Asymptotically, this is V log V. So the Vs log

Vs will give you one V log V, but you still have an E.

Now is the E greater than or smaller than a V log V? Will the E be greater than

or smaller than V log V? [ Inaudible Student Comment ] Well, in fact it depends. It depends on what?>>How many — The

ratio of edges to nodes.>>Yeah, it depends on

how dense the graph is. It depends on the

density of the graph. So if the graph is dense,

E is going to be greater than V log V. If the graph is —

What’s the other kind of graphs? Sparse. If it’s sparse, then

E may be smaller than V log V. So which one is greater,

E or V log V, depends on whether the

graph is dense or sparse. So that’s why in

general we just add them. So the overall asymptotic

complexity is O of E plus V log V because it’s

not clear which one is greater. Which one is greater depends on whether the graph

is dense or sparse. Okay, so I, you know, I

wanted to go through all of these algorithmic

details because the point that I’m trying to make here

is that a compiler is a program that consists of a

number of algorithms and these algorithms have

to be efficient in order for the compiler to compile

the code efficiently, within reasonable time. And so, algorithms is the

foundation of computer science. So everything, everything

in computer science is built on algorithms and

theory of computation. In fact, you know, the front

end of the compiler is built on theory of computation and

the back end of the compiler, all of these graph algorithms,

they are built on algorithms. So the foundation of computer

science is algorithms, data structures, and

theory of computation and everything else is

built on top of these, you know, including compilers. Okay? Yeah, so any

questions on this analysis? Okay. No questions? Okay, so now all of this, this

minimizes the number of cycles. But what about register

pressure? So do you remember the

notion of register pressure? So is this — This

schedule, is it good or bad from register pressure

point of view?>>Bad.>>Bad.>>Why is it bad?>>Because well one, two, three requires three

registers right away.>>Yeah, exactly. So they require three registers. But we can do two

registers only. But in fact, you

know, in this case, can we do this with

two registers?>>Not without [inaudible].>>Well with two registers

— If we have two registers, can we do this whole

thing with two registers?>>Yeah.>>I don’t think so.>>No. Well, you can use two

— Suppose that we do these, so we need the register for

this and the register for this. Then we need the register for

this and the register for this. But in order to do these two,

we have to store the result of this in a register. So we need at least

three registers. Right? So we can do

this computation. This is one register, one

register, then we have to store the result in some

register before we can do the other branch, you know, these — We need the register for this

again and the register for this and then we need to store the

result of this in a register. So it’s — You cannot do this

in less than three registers. But again, this schedule was not

looking at register pressure. So this — I’m sorry, this

algorithm was only looking at minimizing the

number of cycles. It wasn’t looking at

register pressure. Now if we were to take

register pressure into account, how should we do this and where? What should we modify?>>I would assume it would

be in the while loop.>>Do we just change

the critical path like what we represent

with the critical path?>>Yeah, so we changed

the priority scheme. So our priority scheme —