9 Oct

Compilers Lecture 11: Scanning (8): Implementation, Part II

>>So today’s going to be
the last lecture on scanning. So we’ll be talking about
some implementation details about scanning. So now we know the concept. You know, we know the
concept of scanning. We write regular expressions, then we convert them into nondeterministic finite automata, then into deterministic finite automata. Then, we will need
to generate the code. We’ll need to generate
actual code that, you know, simulates that deterministic
finite automaton or implements it. And that code, you know,
we can generate it, or we can write it manually. You can do the whole scanning process by hand. And in some compilers, it's done by hand, meaning it's manually coded. And in other compilers, a
scanner generator is used. And like we said last time, with
a scanner generator, you know, the scanner generator
like lex or flex, you just feed it the
regular expressions and it gives you
the scanner code. And internally, it converts the regular expressions to an NFA, then it converts that into a DFA. Then, it minimizes the DFA,
then it generates the code. And today’s lecture is going to
be about generating the code, you know, this code
that will get generated, you know, given the DFA. We are given a DFA,
a minimized DFA. And we would like to
generate the code. And there are two
schemes for this. We can do it either
using tables. So it can be either table driven
or it can be direct coded. So let's talk about the table driven first. So in a table-driven scanner, you will have a DFA like this. A DFA will be a two
dimensional table or a two dimensional array,
which will look like this. You know, these are your states. So this is S0, S1, to SN. And these are your symbols. So your symbols could
be 0, 1, 2, through 9. And then you have the
letters a, b, through z. And then you have all
the different characters like operators, for example,
all the different operators and all the different characters
that you can use in a program. So there will be a
column for each one, and there will be a state. And the DFA is just a
two dimensional table. So if we are in S0 and we get a 1, we go to S2, for example, and so forth. Then, the code will be
based on this table. So the code will just say, okay, it will keep track of the state. And it will do a lookup based on the current state and the current character that it's reading from the input. So given the current state,
which is initially S0. You know, this is
our start state. And based on the current
character that we are seeing in the input, we look
it up in the table and see what the next state is
and we move to the next state. And among these states, there will be a state that represents an error, you know, the SE or the error state. So the error state will be explicit in a DFA because in a DFA, there will always be a transition on every symbol. What an error state in the DFA corresponds to in the NFA is just no transition. So, for example, if you
have an NFA with, you know, this goes on A or B.
And in this final state, if you get an A, you cycle. Then, this implicitly
means that if you get a B when you are here, then
you go to some error state. So this is an NFA and in
the corresponding DFA. Then, if you get an A, you loop. And if you get a B, you go
to an error state explicitly. So in the DFA, the
error state is explicit. And we need this because we need to check whether the string that we are scanning still matches or has stopped matching.
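To put the contrast in code: the NFA's transition map is partial, while the DFA's is total, with an explicit error state. A minimal sketch, with illustrative state names:

```python
# The same little machine, as transition maps. In the NFA, a missing
# entry means "no transition"; in the DFA, every (state, symbol) pair
# is defined, with "se" as the explicit error state.

nfa_delta = {
    ("s1", "a"): {"s1"},      # on a, loop
    # no entry for ("s1", "b"): the NFA simply stops matching
}

dfa_delta = {
    ("s1", "a"): "s1",        # on a, loop
    ("s1", "b"): "se",        # on b, go to the explicit error state
    ("se", "a"): "se",        # the error state traps everything
    ("se", "b"): "se",
}

def dfa_step(state, symbol):
    # Total function: always defined, never a missing entry.
    return dfa_delta[(state, symbol)]
```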
So remember the example from last time, you know, if you have something like xyz12 = abc. So when you have
something like this, you know, you keep matching. This matches with an identifier,
matches, matches, matches. You are fine until
you get to this. When you get to this,
you no longer match. So it means that you have a
transition to an error state. So to make this concrete — So for this, you know, you will have a start state. And then for, you know, A
through Z, you go to this state. And if you have A through Z or
0 through 9, you keep cycling. But if you get anything else, you go to an error state. So when you are scanning
this, you know, the first x will take you to
this, which is an accept state. Then, the y will keep you there, and so will the z, the 1, and the 2. You stay in this accept state,
until you see this equal. Then, you go to an error state. And when you get to this equal,
you know that now you have — You know, this is the lexeme
that you have matched. So this lexeme has been
identified as an identifier. So this accept state, you
know, for every accept state, you need to label
that accept state with a corresponding
syntactic category. And the syntactic
category, in this case, is — What’s the syntactic
category for? Identifier, exactly. So this is an accept state. So for each accept state,
there is a syntactic category, which is, in this
case, identifier. So the idea is that we use
this error state to detect that we no longer match. So after you reach this, now
you need to, when you get to the error state,
you need to backtrack to the last accept
state that you visited. So in this case, you
backtrack and you go to this last accept state and you take this as
the longest match. Remember last time, we would go
for the longest match, right. We don’t go for the first match. So in this case, you know,
this matches, this matches, this matches, but the longest
match, the longest lexeme that matches one of the
syntactic categories in the language here is
this whole variable name. Okay. So once you hit the error, you go to the last accept
state that you have found. And then, once you do this
and you recognize the lexeme and the corresponding syntactic
category, you start over. Now, you start over scanning from the equal sign,
and you match it. Then, you start over here, and the longest lexeme
is going to be this. This will match an
identifier and so forth. Okay. So now, the point is that
when you reach this error state, you know that you want to roll
back to the latest accept state. The latest accept state may not
necessarily be the last state. So sometimes you may want to go
a few states back in general. Generally speaking, you may
want to go a few states back. So that’s why, in a
typical implementation, you put the states on a stack. And then in order to backtrack, to go to the latest accept
state, what do you do? Pop, you keep popping until
you get to an accept state. So you keep pushing all of these states until you get to the last one. So in this case, you know,
for example, this is S1. This is S2, and this is SE. So on your stack, you will have
S1, then S2, then S2 again, then S2 again for
one, two, three, four. So you have four S2s then an SE. So when you get to the SE, you take the last accept
state, and it’s an S2. And you have to have a mechanism to determine whether a given
state is an accept state or not. So in an implementation, you know, each state may be represented by an object. And in this object, you have a field that indicates whether it's an accept state or not.
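As a sketch, such a state object might look like this (the class and field names are my own, not from the lecture):

```python
# Sketch: a DFA state as an object carrying an accept flag and, for
# accept states, the corresponding syntactic category.

from dataclasses import dataclass

@dataclass
class State:
    name: str
    is_accept: bool = False
    category: str = ""          # syntactic category, set on accept states

s1 = State("s1")                                          # not accepting
s2 = State("s2", is_accept=True, category="identifier")   # accepting
```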
Yeah.>>Wouldn't there be five S2s? You initially enter S2 on the x and then keep coming back to it.>>Okay. Okay.>>That's five.>>Yes, yeah. Yes, you are right. So there will be five
of them in this case. Yeah. And S1 is not
an accept state. Yeah. Okay. So the idea is that you want
to pop things off the stack and you get to the last
latest accept state. And as we said last time,
there may be multiple matches, which means that
the accept state that you have may correspond to
multiple syntactic categories. And in this case, you need a
way for resolving this conflict. And one simple way of
resolving this conflict is to give priorities to the
different syntactic categories. And a simple way of specifying the priorities is just to go by the order in which
they are listed in the regular expression file,
the file of regular expressions that you are inputting
to the scanner generator or that you are using to
generate the scanner manually.
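That priority-by-listing-order rule can be sketched like this (the rule names are made up; in lex or flex, the order of the rules in the file plays this role):

```python
# Sketch: resolving a lexeme that matches several syntactic categories
# by giving earlier-listed rules higher priority.

rules = ["keyword_if", "identifier"]        # listed order = priority

def resolve(categories):
    """Pick the highest-priority (earliest-listed) matching category."""
    return min(categories, key=rules.index)
```

So a lexeme like "if", which matches both rules, resolves to the earlier-listed keyword rule.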
Okay. So now, having said that, there is one more thing that we need to say about the table-driven scanner before we write some pseudocode for a table-driven scanner. Here, there is some
redundancy in this table. So where do you see
the redundancy? So this table, you know, is bigger than what it should
be or what it could be. We can make some compaction. So what do you think
the redundancy will be? So can you detect the redundancy
in this table or something that is taking unnecessarily
more space than it should? Yeah.>>Imagine, like, in all the states, all the numbers would have the same destination state. So maybe you could –>>Exactly.>>Have just one column for all of the numbers.>>Yeah. So all the numbers
will be treated equally in most cases. And we will detect this
by looking at these columns. So if the states in one column are, for example, S3, S5, S9, and the states in another column are also S3, S5, S9, then they are two identical columns. So most likely, you know, all of these characters will
have the same transitions because they are
treated equally. So if you have multiple
columns that lead to the same transitions, then you can categorize the
symbols of the alphabet. So you can categorize all of these as, you know, letters; they belong to the letter category. And then you have one transition
for the letter category and similarly for the digit. So you can do two tables,
one table that takes a symbol and gives you the
corresponding category. And then your actual
transition table is based on the categories. So it could be something
like this. So the categorization table,
cat for categorization. So you may have something like,
you know, for example, 0, 1, and these are all — Or maybe
let me make it a little bit bigger to write digit. Zero, 1, 2. These are digits. Okay. And, you know, A,
b. So these are letters. And then, your actual transition
table is going to be like this. So this is the transition
table, which corresponds to the transition function. It’s going to be like
this: S0, S1, S2. And then here, you
have digit, letter. And there are things that you cannot categorize with others; something like this is a category of its own. So it will have its
own category. So it will appear in
the category table, but its category is like, you
know, semicolon or something. So here, it’s the
semicolon category. And so, there are symbols that
will have their own categories, and there are symbols,
multiple symbols that will get mapped
to the same category. So this will obviously
minimize the number of columns in the table. Yes.>>What happens if we encounter a situation like with integers, where we don't want leading zeros, so zero behaves differently from the other digits?>>Okay. So if the zero
has a different behavior, then the column for zero
will not match the columns for the other digits. So it will have a
different transition. So in this case, the zero
will have its own category. So, you know, this will be,
you know, the non-zero digits, and then zero will
have its own category. So you can put symbols in the same category only if they have identical transitions in all cases, for all states: the corresponding columns are identical. Okay. So that's a good question.
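The compaction itself can be sketched by building the category table from the full table's identical columns (the table contents here are invented for illustration):

```python
# Sketch: merge identical columns of a full transition table into
# character categories, giving the two-table scheme from the lecture.

import string

# One column (tuple of next-states) per character; digits share one
# column, letters another, and ";" has its own behavior.
full = {ch: ("s3", "s5", "s9") for ch in string.digits}
full.update({ch: ("s2", "s4", "s8") for ch in string.ascii_lowercase})
full[";"] = ("se", "se", "se")

cat_of = {}    # character -> category number (the categorization table)
tx_of = {}     # category number -> column (the compacted transition table)
seen = {}
for ch, column in sorted(full.items()):
    if column not in seen:          # a genuinely new column
        seen[column] = len(seen)
        tx_of[seen[column]] = column
    cat_of[ch] = seen[column]
# A lookup is now two steps: category first, then the shared column.
```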
Now, this is about having similar columns: we can get around this, we can avoid having this redundancy in the table. Now what's the price
that we are paying here?>>Two table lookups.>>Yeah, exactly. Two table lookups and
tables are in memory. So in this case, you are
accessing memory twice. You are doing two table lookups. And this can potentially
be slower. But this is more
space efficient. But sometimes, space
can affect speed. So using a lot of space can slow
down your program sometimes. Why?>>Caches.>>Because of caching,
yeah, exactly. You know, sometimes using a lot
of space means that, you know, your dataset does
not fit in the cache, which means a lot of slow down. So sometimes space can
make things slower. So it depends on, you know, the size of your cache and whether this is going to fit in your cache or not. And not only this table: it depends on the whole dataset that your program is accessing at the given point or in the given phase. Okay. So let's now take this. Now before we erase
this, we just said that we can have
identical columns. Can we have identical
states, identical rows I mean?>>I mean, wouldn’t the DFA
minimization get rid of that?>>Exactly. Yeah, that’s exactly the answer. So if you don’t do
DFA minimization, then you may have
identical rows. But if you do DFA minimization,
DFA minimization is about merging states that
have identical transitions into one state. If there are multiple states
that have identical transitions on all symbols, you merge
them into one state. You identify them as equivalent
states and you merge them. So if you have done DFA minimization, if this is the minimal DFA, then you will not have identical rows, but you can have identical columns. Okay.
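A quick sketch of that invariant: in a minimal DFA, no two rows agree on both their transitions and their accept status (the table here is illustrative):

```python
# Sketch: rows of a minimal DFA's table. Each row pairs the state's
# transitions (one entry per category) with its accept status.

rows = {
    "s0": (("s1", "se"), False),
    "s1": (("s1", "s1"), True),
    "se": (("se", "se"), False),
}

# Minimization guarantees all rows are distinct.
all_distinct = len(set(rows.values())) == len(rows)
```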
So let's now adopt the two-table lookup. So our table-driven scanner is going to be — So this is the pseudocode. So initially, you know, our
state, so we have a variable that keeps track of the state. Our state is initially S0. So we’ll always call
our start state S0. Then our lexeme is initially
empty or this looks like blank. Let’s make it empty,
or just call it empty. Our lexeme, which is the string that we are building. Take x123 = abc. So in this case, you know, our lexeme is initially empty. Then, we will add x to the lexeme, then we will add 1. Our lexeme becomes x1, then it becomes x12, then x123. And the lexeme that we will end up matching in this case is going
to be x123. Okay. So now, this is the state. Now, we have to remember
that step. We have to push things
on the stack. So we’ll have a stack, and
we will clear it first. Stack.clear. Then, stack.push. We will push something
on the stack to indicate that this is the start. So in this case, you know, on
that example, before we put S1, S2, S2, we can put
some artificial state that indicates that, you
know, when we get to this, then we have reached nothing. You know, so you can
call it a null state. So we push a null
state on the –>>Why not use S0?>>Okay. Because you
may get to S0 again. So you may go back to
— So this could be S0.>>Oh, okay.>>S1, S2. Then you can go to S0 again. So when you pop and
you get to this S0, this does not necessarily
mean that you are right at the beginning of the lexeme or that this corresponds
to an empty lexeme.>>Right.>>Okay. Then, we will loop. Now our loop will be looping until we get to that
error state. So while state is not equal to SE. So SE is the error state. Okay. We will read
the character. Character equals
get next character. We read the next character. And then we add this
character to the lexeme. So our lexeme equals
lexeme plus character. Okay. Then, we haven't pushed the state yet. So we push the state: stack.push state. There is something missing here that I will talk
about in a minute. Then, what should we do? So we have the character. What should we do next? So every time, you are checking
a character, every iteration of the loop is checking
a character.>>Before we put
it in the lexeme, shouldn’t we check it first?>>Yeah. We put it, and then if
it turns out that it will lead to an error state,
we will roll back. So yeah, we put it
to the lexeme now. And we will check it. We will check the
categorization table. So our category is going to come from the cat table
of the character. So it will tell us
what the category is. Then, what should we do next?>>Look it up in the transition.>>Look up the category
in the transition table. So our new state is going to be,
you know, the transition table. Okay. So cat table is the categorization table. And tx table is the
transition table. So the transition table —
In fact, this is going to be because the cat table
is an array, so we will put it
using this notation. It’s an array. Now the transition table
is a two-dimensional array. The row is? What’s the row and
what's the column?>>Row is state. Column is the –>>Category. Okay. Now we know the new state. And we keep looping. Okay. Now one thing that we need to do is — Because when we get out of this loop, we will be in the error state, SE. So we're going to pop and roll back to the last accept state. So that's why here we need to
clear the stack or clear it if we have an accept state. Sorry. Yeah, clear it
if we have an accept. So whenever we have an accept
state, so this is your stack. You know, this is an accept
state, accept, accept. And this is SE and accept. So when you roll back, you
want to roll back to this. So you don’t want any of these. So once you get to an accept
state, you no longer care about the accept
states before it. You always, you know, pop
the states until you get to the last accept state. So you don’t need these. So that’s why you can pop
the stack, clear the stack. So if state is an accept state, belongs to the set of accept states, then stack.clear. And SA here is the set
of accept states. You can think of this as
checking if it’s in the set of accept states or
checking if, you know, checking a certain flag. Yes.>>Why do you have the
statement there instead of right after reset the state?>>Oh, not here?>>Yeah.>>Because we are checking the
state that is coming from here. So we have never checked S0 yet. So the first iteration of
the loop, you check S0. All right?>>Okay.>>Iteration number
one, you check S0. It hasn’t been checked yet. So you want to do this for S0. Okay. Now after the loop, after the loop, you
need to — Yes.>>If we clear the
stack, won’t we get rid of that start state we
were in that said –>>Yeah. It’s a good question. So we are assuming that, yeah, the clear will not
get rid of that. Yeah. Yeah, you are right. So we are assuming that this
clear is a special kind of clear that will not get —
Yeah, you are right. So yeah. So maybe we
can move this here. And the assumption here is that
this is a very good observation. Yes. So the clear does
not delete this null state that we have added.>>Or you could just push the
[inaudible] back up onto –>>Yes, or you can — Yeah. Yeah, exactly. After the clear, you can add another null. Now after we are done with
this, we have to roll back. So we have to go back. So we have to keep
looping until we get to the latest accept state. So we have to keep
looping until we get to either an accept state or to?>>A null state.>>Null state. Exactly. So while state does not belong to SA, is not an accept state, and
state is not equal to null. We keep looping until we get to such a state. And in each step, what we do
is state equals stack.pop. And we do truncate the lexeme, which means remove the last
character from the lexeme, remove last character,
remove last character. And roll back, which
means, again, you know, move the pointer in your lexeme
to the previous character. So take x123 = abc, back to this example: when you get to this,
you need to roll back. So you remove the last
character, which is the equal. And you move the
pointer to this. So roll back, it means
moving your pointer. Okay. So then, eventually,
you will get to a state that is either an accept
state or a null state. And after the loop, if
state belongs to SA, then this is an accept. So we return type of state. So in this case, you know, the type of state
is the corresponding syntactic category. So type is the corresponding
syntactic category. And if we have multiple types,
we have to resolve this. We have to resolve this conflict
based on certain priorities. So, for example, you know, this type could be
identifier or some keyword. And we have to resolve. So what we are implying here is that we are resolving
this conflict if the state has multiple
syntactic categories, else return error that
we couldn't match. Yeah.>>How do we return the lexeme so that, say, the parser can tell if it makes sense or not?>>Yeah. You know,
this is pseudocode. So in order to truncate
the lexeme, it means that your
lexeme is a string that you are saving
in a certain buffer. And to truncate it,
you just go back to the previous index
in the buffer.>>No. I mean when we're doing the returning.>>Oh, how do you return it?>>Yes. We only have one return in the code.>>Okay. So that's
a good question. So, in fact, you will see this
when you write an actual scanner in the next assignment,
not assignment one, in the next assignment. So it will be saved
in a certain variable. So there will be a certain
variable that saves the lexeme. And you will see this in the
assignment when you use flex. So there is a return value,
which is the syntactic category, and the lexeme itself will be
saved in a certain variable for holding the lexeme. Because you need both, right. You need both. You need to know that
it’s an identifier, but knowing that it’s an
identifier is not enough. You need to know the
actual name as a compiler because the compiler needs to create a variable
with that name. Okay? Yeah.>>Well, in actual code, something I would do is create a structure, put it in a parameter list [inaudible] pointer to it, and then the return value is zero or one indicating success or error.>>Yeah. So that depends
on the design of your code and the language
that you are using. Yes, yeah. Okay. But again, you know, this is pseudocode, so it's very conceptual. But you will get to write an actual scanner. All right. So this is, you know, the table-driven scheme.
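Putting the pieces together, the table-driven pseudocode above can be sketched in Python. This is only a sketch: the tables here recognize just identifiers, letter (letter | digit)*, and all state and category names are illustrative, not from the lecture.

```python
def char_cat(c):
    """Categorization table: map a character to its category."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"                      # catch-all category for this sketch

# Transition table: tx_table[state][category] -> next state.
# "se" is the explicit error state.
tx_table = {
    "s0": {"letter": "s1", "digit": "se", "other": "se"},
    "s1": {"letter": "s1", "digit": "s1", "other": "se"},
}
accepting = {"s1": "identifier"}        # accept state -> syntactic category

def next_token(text, pos):
    """Scan one token starting at pos; return (category, lexeme, new_pos)."""
    state, lexeme = "s0", ""
    stack = [None]                      # None marks "no accept state yet"
    while state != "se" and pos < len(text):
        c = text[pos]
        pos += 1
        lexeme += c
        if state in accepting:          # older accept states are irrelevant
            stack.clear()
            stack.append(None)
        stack.append(state)
        state = tx_table[state][char_cat(c)]
    # Roll back to the most recent accept state (longest match).
    while state not in accepting and state is not None:
        state = stack.pop()
        if state is not None:           # each popped state undoes one char
            lexeme = lexeme[:-1]
            pos -= 1
    if state in accepting:
        return accepting[state], lexeme, pos
    return "error", "", pos
```

Scanning "x123=abc" from position 0 matches the lexeme x123 as an identifier and leaves the position at the equal sign, just as in the lecture's example.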
A direct coded scanner is a scanner that does not use tables. Instead, in each state, the transitions are directly coded. So, you know, there will not be a single table-lookup loop. And this is going to be our
start or initialization code. Then, instead of having a
loop, there will be states. You know, this is, for
example, state one. And in state one, you
read the character. You get the lexeme. You push it, but you
don’t have the tables. So instead of doing the tables, you just directly
code the transition by going to the next state. So you say if character
equals A, then go to S2, else if character
equals something else, go to S5 and so on. Then the last else, which corresponds to a character that you are not recognizing, a character for which there is no transition from S1: where will you go? You would go to SE, or let's call it go to end. And the end is just this: the roll back
to the last accept state. And then you just
do the same thing. So the idea here, the difference
between a table driven and a direct coded is that,
here, we don’t have a table. We don’t look up the next
transition in a table. We just directly code it. So we just say if
the state is S1 and the character is
this, go to state S2. If the character is that,
then go to S5 and so forth. Now, you know, advantages
and disadvantages. This will look ugly, right. This looks ugly and
messy, but it –>>A lot of code.>>Yeah. But it’s ugly
and messy if this is code that you would write manually. This is not code that you
would normally write manually. This is code that
will get generated by the scanner generator. So this is code that you
will never have to debug. So you will not need
to debug this code. This is code that will get
automatically generated by the scanner generator. You feed the regular expressions
to the scanner generator, and the scanner generator
will generate this for you, with all of these
ugly go-to statements. Yes.>>What do scanners
generally favor, the table method or this method?>>Yeah. So each one has
advantages and disadvantages. And I’m not sure. You know, for example, I’m not
sure flex, which scheme it uses. It probably uses direct. I’m not sure. But each one has
advantages and disadvantages. And in spite of the
disadvantages of this, the advantage is what?>>No memory access.>>Yeah. No memory access. So you are avoiding the
memory access, you know, especially the double
table lookup. So to avoid memory
accesses, you are doing this. And this will be okay if it gets automatically generated, if it's not code that you will have to debug.>>One thing I want
to point out, though, is that the direct coded, it’s
only for the one language. But a table driven, just
by changing the tables, you can use the same code
to scan any language.>>Yes. Yeah, that’s true. So you are encoding the
language within the code. But again, it’s automatically
generated.>>Yeah.>>So in practice, you will
change the regular expressions. So that’s all what you change. You change the regular
expressions. You rerun the scanner
generator program, and the scanner generator
program will generate new code for you. So you will not have
to do it by hand. And like I said, from a software engineering point of view, this is terrible if you were to write this by hand.
were to write this by hand. But it’s automatically
generated.
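For comparison, a direct coded version of the same identifier recognizer might be emitted like this. Again a sketch with illustrative names: no tables, just the transitions written out as code, the way a scanner generator would emit them (with gotos in C; here, plain control flow).

```python
def next_token(text, pos):
    lexeme = ""
    # State s0: the only transition is on a letter, to s1.
    if pos < len(text) and text[pos].isalpha():
        lexeme += text[pos]
        pos += 1
    else:
        return "error", "", pos         # no transition from s0: error
    # State s1 (accepting): loop on letters and digits, directly coded.
    while pos < len(text):
        c = text[pos]
        if c.isalpha() or c.isdigit():  # the "goto s1" self-transition
            lexeme += c
            pos += 1
        else:                           # no transition: s1 accepts, stop
            break
    return "identifier", lexeme, pos
```

Since every non-start state of this particular DFA accepts, the rollback stack from the table-driven version is not needed here; a generator would emit one for machines where it is.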