Compilers Lecture 33: Instruction Scheduling (2)


>>So this is the data dependence graph, and last time we computed for each instruction in the graph what we call the critical path distance. So the critical path distance of this node is four, this one is zero, this is one, this is seven, this is five, this is four, and this is two. Okay? And with the critical path distances, we computed a schedule. So let's put the schedule here. In cycle one, we put E, because it has the highest critical path distance. In cycle two, we put D. And in cycle three, we put B or F. Is F ready in cycle three?>>No.>>No. F is not ready in cycle three, so we cannot put it there. So in fact we put B. In cycle four, we put F. In cycle five, we put A. In cycle six, we put C. In cycle seven?>>It was a stall.>>Yeah, we couldn't put anything, because nothing was ready; in cycle seven, the ready list was empty. And in cycle eight, we put G. So this is a schedule that uses eight cycles, as opposed to how many cycles for the original program order?>>Fourteen.>>Fourteen, yeah, compared to 14. So this saves six cycles.
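For reference, here is the schedule we just built, laid out cycle by cycle (the stall in cycle seven is where the ready list was empty; the 14-cycle figure is the original program order under the same latency assumptions):

    Cycle:       1   2   3   4   5   6   7        8
    Instruction: E   D   B   F   A   C   (stall)  G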
And again, this is a simulation that the compiler does at compile time, and the accuracy of this simulation depends on how accurately the compiler's machine model captures the processor. This is clearly a simple machine model: we're assuming a single-issue machine, and we're assuming that we know exactly what the latencies are. But real machines are multiple-issue, meaning you can issue multiple instructions in each cycle, and the latencies are not known precisely; usually there is a range of latencies for some of the more complex instructions. But given this model, this is what we have. An important note is that we are assuming here that we are doing instruction scheduling within a basic block. So, using the same terminology that we used for register allocation, would this be local or global?>>Local.>>Local. So this is local instruction scheduling: local within the basic block. Now, global instruction scheduling: we will not cover it in this course, but we will talk about it briefly now, just to point out some of its complexities.
So, for example, suppose you have two basic blocks after a branch, and here's an instruction I. In fact, let's make it an if. If I have an instruction I in the basic block before the branch, and we were to do global instruction scheduling, we would like the flexibility of moving I down into one of the following basic blocks, for example. But this will not always be legal. Right? Why is that? Why will moving I from the block before the branch into one of the blocks after it not always generate correct code?>>Yeah, because you don't know; since it's in the one before it splits, you'd probably have to add it to the other one, too.>>Yes. This is a condition, right? At the end of this basic block you're checking some condition; it's a conditional branch.>>Yeah, because it was before it.>>If some condition, then do this, else do this. So this is the then and this is the else.>>Right. If it needs to do something that's checked for by the condition, and you move it after the condition, the condition is not going to have the correct check.>>Yeah, exactly. So, assuming that all the data dependencies are captured by the data dependence graph and are honored, if we move I into the then block, the then path will execute I but the else path will not. So I will no longer be executed all the time. With a motion like this, you may need to copy I into both basic blocks. So this is one complexity. Now, if you do the motion in the other direction: suppose that I have an instruction J in the else block. Moving J from the else block up into the block before the branch will not always be legal either, because J should be executed only if the else is taken. If you move it up, you will be executing it whether the condition is true or false, and that may not be legal unless J is harmless anyway. So we're not trying here to describe an algorithm for global instruction scheduling; I'm just trying to point out some of the complexities that arise when we do global instruction scheduling.
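As a schematic sketch of these two hazards (this example is mine, not from the board; the variables and operations are placeholders):

    def use(v): print(v)
    a, b, x, y, cond = 1, 2, 3, 4, True

    # Downward motion: i is computed before the branch and used on both paths.
    i = a + b          # moving this into only the 'then' block would lose it
    if cond:           # on the 'else' path; a legal motion must copy it into
        use(i)         # BOTH successor blocks
    else:
        use(i)

    # Upward motion: j belongs to the 'else' path only.
    if cond:
        pass
    else:
        j = x / y      # hoisting this above the branch would execute it even
                       # when cond is true; illegal unless it is harmless
                       # (no side effects, no possible faults like divide by zero)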
And that's why some production compilers do not implement global instruction scheduling; they only implement local. One of them is the LLVM compiler: LLVM implements local instruction scheduling, not global. So now we are 100% focused on
local instruction scheduling. So we did this last time, and we started discussing the algorithmic details. We understand the concept behind critical path list scheduling; now we would like to understand the details. One of those details is the notion of the dependence count, or dep count, for each instruction, that is, for each node in this graph. In fact, let's think about the data structure that we need for each instruction. So if we have a class Instruction, what do we need to keep in it? One of the data fields is going to be the dep count. What else do you think I need to store for each instruction in this algorithm?>>The cost for the priority.>>Okay, so the dep count, the critical path distance, yes. [ Inaudible Student Comment ] And what else?>>What nodes are connected to it.>>Well, yeah. That's part of the graph representation. So, the graph representation: remember, in your data structures class, there are multiple ways of representing graphs. Can you remember one of them? There are at least two common representations. There are adjacency lists. Do you remember adjacency lists?>>Matrix.>>Yeah, the adjacency matrix. So, adjacency lists: this graph will be represented as an adjacency list. In fact, let's draw the adjacency list representation of this graph. The whole point in today's lecture is to think algorithmically: we understand the concept, but today we will try to understand the actual algorithm. So the adjacency list structure is an array of nodes A, B, C, D, E, F, G, and for each node we have the list of its neighbors, which in this context are its dependents. So for A, it's going to be C. For B, C. For C, what do we have? G. For D?>>F.>>F. And for E, F. And for F, we have G. G doesn't have an adjacency list. So these are the adjacency lists. Of course, this is a simple example, so the adjacency list of each node has only one item in it, but in general it will have multiple items.
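Here is a small sketch in Python of the per-instruction record and the adjacency lists for the board example (the field names, and treating the latency as a stored field, are my choices; the critical path distances and latencies get filled in by the steps discussed below):

    class Instruction:
        def __init__(self, name, latency):
            self.name = name
            self.latency = latency       # cycles, from the machine model
            self.dep_count = 0           # number of not-yet-completed predecessors
            self.critical_path = 0       # critical path distance (the priority)
            self.cycle = None            # cycle in which the op gets issued

    # Adjacency lists: each node maps to the list of its dependents.
    succs = {
        'A': ['C'], 'B': ['C'], 'C': ['G'],
        'D': ['F'], 'E': ['F'], 'F': ['G'],
        'G': [],
    }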
Okay, so now: how does the list scheduling algorithm start? What's the first step? It's based on the notion of the ready list. So let's write the code here. List scheduling: first, create a ready list R. And what was the other list that we used in the example last time?>>The active list.>>The active list, yes. The ready list R and an active list A. Okay. So initially, what does the active list have, and what does the ready list have?>>The ready list has nodes that don't have any other nodes pointing to them.>>Okay. So, the nodes whose dep count equals zero. Well, first of all, we have to compute the dep counts. So this step: compute the dep count for each node. Now, we are writing this as a single line, but there is an underlying algorithm here. How would you compute the dep count given this adjacency list representation of the graph? So look at the adjacency lists, and let's write the dep counts in red. For A and B it's zero, and for D and E it's zero. For C, the dep count is?>>One.>>No, two. It depends on two other nodes.>>Oh, okay. Yeah.>>Okay, so dep count equals two; same for F, and for G: dep count equals two. Okay. Now how can we calculate the dep count?>>Count the occurrences in the list.>>Count what?>>The occurrences of each node in the lists.>>Yeah, exactly. It's the number of occurrences of each node in the adjacency lists. So we go through the adjacency lists and count how many times each node appears as an item in one of the lists. C appears twice, F appears twice, G appears twice, and the other nodes do not appear at all. Okay, so there is an underlying algorithm. In fact, what would be the running time for computing the dep count for each node? [ Inaudible Student Comment ] No.>>Oh, wait. N squared.>>Well, this is N, assuming that N is the number of instructions.>>So it would have to be N squared.>>It's not that bad. No. This part is N. What's the total number of items in all the adjacency lists?>>The number of edges.>>The number of edges, yeah, exactly. So in fact, if V is the number of vertices and E is the number of edges, then computing the dep count for each node is going to be O(V + E).
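As a sketch, this computation is one pass over the adjacency lists built above (the latency values here are placeholders):

    # Build the instruction records; real latencies come from the machine model.
    nodes = {name: Instruction(name, latency=1) for name in succs}

    # A node's dep count is the number of times it appears in some adjacency
    # list, i.e., its number of predecessors. One pass over V lists containing
    # E entries in total: O(V + E).
    for n in succs:
        for s in succs[n]:
            nodes[s].dep_count += 1

    # Board example: C, F, and G end up with dep count 2; A, B, D, E with 0.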
Okay. Now that we have the dep counts, what's the next step? For each node whose dep count is zero, we add it to the ready list. So for each instruction, let's call it op: if op.dep_count equals zero, then R.add(op); add op to the ready list. Okay? Now we have the initial ready list. Then we have to keep looping, and each time we pick one instruction from the ready list. So we will have a big while loop. What should be the termination condition here? We should keep looping as long as?>>The ready list and the active list.>>Yeah, exactly. In fact, the ready list being empty is not a termination condition, because as we saw at cycle seven, the ready list was empty but we were not done yet. One way of doing it is counting how many instructions you have scheduled; when that number equals the total number of instructions, you are done. Another way is looking at the ready list and the active list, and terminating when both are empty. If one of them is not empty, you keep going. So: while R or A is not empty, we loop. Now, here I'm using indentation to indicate what's inside the loop, so I'm not going to use braces; indentation is the typical pseudocode convention. Okay, now we look at the ready list, right? If R is not empty, we pick an instruction from it. Now, what do you think would be the natural implementation of the ready list here?>>Priority queue.>>Yeah, priority queue, exactly.
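In Python, for instance, the ready list could be sketched with the standard heapq module; heapq is a min-heap, so we push the negated priority to pop the maximum first (this is an illustrative choice, not the lecture's code):

    import heapq

    ready = []
    for op in nodes.values():
        if op.dep_count == 0:
            # (-priority, name, op): negation turns the min-heap into a
            # max-queue; the name breaks ties without comparing objects.
            heapq.heappush(ready, (-op.critical_path, op.name, op))

    _, _, op = heapq.heappop(ready)   # extract-max in O(log n)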
conceptually, this is clearly a priority queue because we have these ready
nodes and each one has a number that indicates their priority. So this is exactly what a
priority queue was designed for. So we’ll assume that R is
then R is a priority queue, so R is a priority queue while
A, we don’t have the notion of, we don’t have the notion of an
order or a priority within A or a ranking within A, so A
can be just a simple list, just a list, unordered list.>>It could also be a
small hash table, too.>>No. Why complicate things? It can be. Well, it can be — You can make
it the most sophisticated data structure ever but
we don’t need it. You know, here what
I’m trying to say is that this is an unordered list. We don’t have the notion of
a priority or an ordering, so we use the simplest
that works. And always when you
design data structures, you use the simplest
data structure that can get the job done. Okay, and if R is not empty, then we extract the
maximum from the ready list. So we do op equal R.extract max. So extract the node with
the maximum priority number. So in this case, the key
for our priority list, for our priority queue is going
to be these numbers in blue. The numbers in blue are
going to be the keys for the priority queue. So if R is not empty,
extract the max. And then what should
we do with it? We extract it, we schedule it. So we schedule it. So we should have here the
cycle for each instruction. So we set the cycle. We say, okay, op.cycle
equals current cycle. Now we discover that we
should track the cycle number. So we will add this — Yeah,
let’s put it in the beginning, so here, or we can put it here. Cycle number equals one. So op.cycle equals cycle number
or let’s call it current cycle. Current cycle would
be a clearer name. Current cycle. So op.cycle equals
current cycle. Okay? And then what
should we do with it? Now we’re scheduling this.>>We want to add it
to the active list.>>Exactly, yeah. Yeah, so A.add. It doesn’t matter where we
add it, A.add of this op. Okay, so now we have
scheduled this. Now in order to make
progress in this algorithm, what do we need to do? [ Inaudible Student Comment ] Okay, so now after we do this,
we increment: current_cycle++. Okay. And after incrementing the cycle, what do you think we should be doing next?>>Check if any instructions are done.>>Yeah, exactly. We check whether any instruction is done. Now, to check whether any instruction is done, which instructions should we be checking?>>The ones in the active list.>>Exactly, the ones in the active list. So: for each instruction op in A, the active list. And what's the condition for being done? How do we know that it's done? For example, E is going to be scheduled in cycle one, and it will be done in cycle one plus three, which is four, right?>>If the current cycle equals the cycle done for the –>>Okay, so when the current cycle becomes four. And what is four? Four is equal to the cycle in which we put E, which is one, plus?>>Latency.>>Latency. So now we realize that we also need to store the latency; I will just put the latency field before the cycle field. So: if op.cycle plus op.latency, the cycle for op plus the latency for op, equals what?>>The current cycle.>>The current cycle, exactly. If op.cycle + op.latency == current_cycle, then this op is done.
So we can say A.remove(op): remove it from the active list. And then what's the important thing that we need to do?>>We need to see if there's anything else ready.>>Yeah: check whether it makes something else ready. The completion of one instruction may decrement the dep count of its dependents, so we have to go through them; let's call them successors. So: for each successor S of op, we do S.dep_count--. For example, when we get to cycle four, we decrement the dep count of F by one, which means that one of the instructions on which F depends has just completed; its dependence count has gone down by one. So we look at the successors and decrement S.dep_count. Then?>>Add S to the ready list.>>If its dep count is zero.>>Exactly. Only if the dep count is zero, because if the dep count becomes one, it's not ready yet. An instruction may depend on many other instructions; in general, the dep count of S may be a big number. It could be five, with four other dependencies outstanding, so this op decrements it to four, and only when all of them are done does S become ready. So: for each successor S of op, S.dep_count--; if S.dep_count == 0, then R.add of... S or op? Should we add S or op?>>S.>>S. Okay? So that should do it. This is the algorithm; this is what allows us to keep updating the ready list.
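Putting the pieces together, here is a runnable sketch of the whole algorithm in Python (the function name and the heapq details are my choices; the lecture develops this as board pseudocode, and the dep counts and critical paths are assumed to be computed as above):

    import heapq

    def list_schedule(nodes, succs):
        # Local (single basic block) critical-path list scheduling for a
        # single-issue machine: at most one instruction issued per cycle.
        ready = []       # R: max-priority queue keyed on critical path distance
        active = []      # A: unordered list of instructions still executing
        for op in nodes.values():
            if op.dep_count == 0:
                heapq.heappush(ready, (-op.critical_path, op.name, op))

        current_cycle = 1
        while ready or active:              # terminate when BOTH are empty
            if ready:                       # a stall cycle if R is empty
                _, _, op = heapq.heappop(ready)         # extract-max
                op.cycle = current_cycle
                active.append(op)
            current_cycle += 1
            for op in active[:]:            # check for completed instructions
                if op.cycle + op.latency == current_cycle:
                    active.remove(op)
                    for name in succs[op.name]:         # successors of op
                        s = nodes[name]
                        s.dep_count -= 1
                        if s.dep_count == 0:            # all inputs complete
                            heapq.heappush(
                                ready, (-s.critical_path, s.name, s))
        return {op.name: op.cycle for op in nodes.values()}

With latencies and critical path distances consistent with the board example, this reproduces the eight-cycle schedule from the start of the lecture: E in cycle 1, D in 2, B in 3, F in 4, A in 5, C in 6, a stall in cycle 7, and G in cycle 8.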
Okay, so now let's try to analyze this algorithm, assuming that the data dependence graph itself has already been computed. Well, in fact, we didn't talk about the critical paths here. What's missing at the top is a first step: compute the dep count and the critical path distance for each node. Computing the critical paths, as we said last time, is a DFS over the graph, which is another O(V + E). So the dep count and the critical path are each O(V + E), and the total for this step is O(V + E). It's not going to change the asymptotic complexity, right, because we're just multiplying by a constant.
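For completeness, a sketch of that DFS (the exact convention for the critical path distance is from last lecture; here I assume a leaf's distance is 0 and each step adds the instruction's own latency, which is one common convention):

    def compute_critical_paths(nodes, succs):
        # Longest latency-weighted path from each node down to a leaf.
        # Each node is computed once and memoized: O(V + E) overall.
        def cp(name):
            op = nodes[name]
            if op.critical_path == 0 and succs[name]:   # not yet computed
                op.critical_path = op.latency + max(cp(s) for s in succs[name])
            return op.critical_path
        for name in nodes:
            cp(name)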
Now, the initialization loop over each instruction op: what will be the asymptotic complexity for this loop?>>It would be V.>>V. But adding to
the priority list? So this depends on
the implementation of the priority queue. So if we assume that
the priority — What’s a typical implementation
of a priority queue?>>Heap.>>Heap, yeah. Heap is probably the most
common implementation of a priority queue. By the way, sometimes other
implementations may be more efficient. You know, for example,
in Dijkstra’s algorithm, if the graph is dense, an unsorted array would
work better than a heap. But let's assume that the priority queue is implemented as a heap. For a heap, what will the add and extract operations take? So we have add, or insert, and extract max, or dequeue; let's just call it extract max. So what's the running time for
each of these heap operations?>>Log N.>>Yeah, it's log N. Each one of them is log N. This means that what we have here, because this is a heap, is O(V log V). Okay? Because we do it for every op: there are V ops, and it's log V for each, in the worst case. But in general, it will be much less, because these root instructions
typically will be a minority, a small subset of
the instructions, but in the worst case, it
could be that all of them are. So this is a worst
case estimate. Now the loop itself, the
number of executions of — The number of times this loop
is executed will be what?>>V. It’s V, right?>>What’s that?>>It’s V, V times.>>It could be more. So here in this example,
it’s more than V.>>Because it stalls.>>Yeah, there are stalls. Yeah, so in fact it’s V — If we’re going to come up with
an upper bound on the number of executions of this loop, how
would you compute an upper bound on the number of
executions of this loop?>>Find the biggest
latency and multiply it.>>Yeah, exactly. The biggest latency multiplied
by the number of operations. But asymptotically,
what will that be? So if the maximum latency — So
if it’s V times maximum latency, what would that be
asymptotically?>>It’s still V.>>It’s still V, why? Because maximum latency
is a constant. And why is it a constant? Let's make sure that we
understand why it’s a constant. What does it mean for
a certain parameter to be a constant
in an algorithm? The maximum latency is
a constant in the sense that it does not depend
on the size of the input. In other words, it does
not depend on the size of the data dependence graph. The input to this algorithm
is this data dependence graph. And the size of this data
dependence graph depends on the number of nodes
and the number of edges and it has nothing to do
with the maximum latency. The maximum latency
is a property of what?>>The machine model.>>The machine model. Yes, the target machine. So it depends on what
instructions that machine has. What are the instructions and what’s the maximum
latency instruction, you know, for example typically a divide
instruction will have a latency of six or eight or ten. So the maximum — The latency, this maximum latency is not
a function of the input size. It’s a function of the
— It’s a parameter, it’s a target machine
parameter which means that we can think
of it as a constant. No matter how big
the input graph is, the maximum latency is
not going to change. So we can think of
it as a constant, which means that this loop
will get executed O(V) times. Okay? Now let's look at what's inside. Every time we look at the ready list, if the ready list is not empty, we do the extract max. Each extract max is log V by itself, but in aggregate, throughout the whole lifetime of the algorithm, it is going to be V log V, because this line gets executed V times. Okay, now A.add,
this is constant because the active
list is unordered, so we can insert in any order. It doesn’t matter. Now here when we look
at the active list, how many times will we be doing
this check for each instruction? This checking for each instruction, whether it's done or not: how many times will this
checking be done for each?>>V times the size of A.>>V times the size of A.
Well the size of A is dynamic. It keeps changing. Let’s — Okay, the question
is for any given instruction, any given instruction will be in
the active list at some point. How long, for how many cycles
will an instruction stay in the active list?>>For our model, four.>>It depends — Yeah, for the
latency of that instruction. So the time that an
instruction stays in the active list is dependent on the latency of
that instruction. So for an upper bound,
an instruction will stay for a number of cycles
that is less than or equal to the maximum latency. And the maximum latency
is a constant. So this means that this
checking will be performed O of maximum latency times V.>>Which is still V.>>Yeah, exactly,
which is still O of V. So let’s put the aggregate
here then for this, let’s put the aggregate
which is V log V. So we are putting the
aggregate, number of executions. Aggregate means the number of times it’s executed
throughout the entire algorithm. Okay, now this particular line, for each successor S of op: going
through the successors, how many times will
this get executed? What is this dependent on? So throughout the
entire algorithm, how many times will this get
executed, this particular line?>>Every time something gets
removed from the active list.>>Okay, well not –>>Dynamic.>>Well, when we remove
it, when we remove it, we go through all successors. So for E, we go through this. For D — So how many times
will this line get executed?>>However many elements
are in the adjacency list.>>Yeah, exactly, which
is the number of edges. This is the number of
edges in the graph. This is E, the number
of edges in the graph. So this is going to be O(E). Okay? Alright. So this is E. But the next line, if dep count equals zero: when the dep count becomes zero, we add the node to the ready list. Right? Now how many times
will this get executed? So this is going to
get executed E times. But the addition
to the ready list, how many times will this get
executed for each instruction?>>Just once.>>Once. An instruction
becomes ready once. So this is going to be — What’s the aggregate
running time for this line?>>V log V.>>Yeah, V log V, exactly. So this is going to be V log V. So then what would be
the overall running time? So obviously, we can ignore the
Vs. We have lots of Vs log Vs. But are these Vs log Vs that
we have, are they additive? Should we add them
or multiply them? Yeah, so this V log V — Is
this V log V within this V log V or should we add it to it?>>Add.>>It’s add.>>So in fact the
relation between the Vs log Vs is additive. And when you add a V
log V to a V log V, an asymptotic notation gives
you a V log V. Asymptotically, this is V log V. So the Vs log
Vs will give you one V log V, but you still have an E.
Now is the E greater than or smaller than a V log V? Will the E be greater than
or smaller than V log V? [ Inaudible Student Comment ] Well, in fact it depends. It depends on what?>>How many — The
ratio of edges to nodes.>>Yeah, it depends on
how dense the graph is. It depends on the
density of the graph. So if the graph is dense,
E is going to be greater than V log V. If the graph is —
What’s the other kind of graphs? Sparse. If it’s sparse, then
E may be smaller than V log V. So which one is greater,
E or V log V, depends on whether the
graph is dense or sparse. So that's why, in general, we just add them: the overall asymptotic complexity is O(E + V log V), because it's not clear which one is greater; that depends on whether the graph is dense or sparse.
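To summarize the aggregate costs derived above (V = number of instructions, E = number of edges, and the maximum latency is a machine-model constant):

    dep counts and critical paths (DFS):   O(V + E)
    initial ready-list insertions:         O(V log V)
    while-loop iterations:                 O(V * max latency) = O(V)
    all extract-max operations:            O(V log V)
    all active-list completion checks:     O(V * max latency) = O(V)
    all dep-count decrements:              O(E)
    all ready-list insertions:             O(V log V)
    ---------------------------------------------------------------
    total:                                 O(E + V log V)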
Okay, so I wanted to go through all of these algorithmic
details because the point that I’m trying to make here
is that a compiler is a program that consists of a
number of algorithms and these algorithms have
to be efficient in order for the compiler to compile
the code efficiently, within reasonable time. And so, algorithms is the
foundation of computer science. So everything, everything
in computer science is built on algorithms and
theory of computation. In fact, you know, the front
end of the compiler is built on theory of computation and
the back end of the compiler, all of these graph algorithms,
they are built on algorithms. So the foundation of computer
science is algorithms, data structures, and
theory of computation and everything else is
built on top of these, you know, including compilers. Okay? Yeah, so any
questions on this analysis? Okay. No questions? Okay. So all of this minimizes the number of cycles, but what about register pressure? Do you remember the notion of register pressure? Is this schedule good or bad from a register pressure point of view?>>Bad.>>Bad. Why is it bad?>>Because cycles one, two, three require three registers right away.>>Yeah, exactly. So they require three registers, but maybe we can only afford two. In fact, in this case, can we do this with two registers?>>Not without [inaudible].>>If we have two registers, can we do this whole thing with two registers?>>Yeah.>>I don't think so.>>No. Suppose we compute the first pair: we need a register for this operand and a register for that operand, and we have to store the result in a register. Then, to compute the other pair, we again need a register for each operand, while that first result is still live in its register. So you cannot do this in less than three registers.
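To see the pressure under the schedule we built, here is a sketch of which values are live after each issue cycle, assuming each result stays live until its single use (an assumption that matches this DAG):

    Cycle 1: issue E                 live: E
    Cycle 2: issue D                 live: E, D
    Cycle 3: issue B                 live: E, D, B    <- three live values
    Cycle 4: issue F (uses D, E)     live: B, F
    Cycle 5: issue A                 live: B, F, A    <- three again
    Cycle 6: issue C (uses A, B)     live: F, C
    Cycle 8: issue G (uses C, F)     live: G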
But again, this schedule was not looking at register pressure; this algorithm was only minimizing the number of cycles. Now, if we were to take register pressure into account, how should we do this, and what should we modify?>>I would assume it would be in the while loop.>>Do we just change the critical path, like what we represent with the critical path?>>Yeah, so we change the priority scheme. So our priority scheme —
