Compilers Lecture 10: Scanning (7): Implementation, Part I

>>Okay, so now we have learned that we can write regular expressions, that the regular expressions get translated or converted into an NFA, and that the NFA gets converted into a DFA. The way we will be doing this is that we will be using a tool that takes regular expressions. So we will write a file that has regular expressions in it, and that file will have rules for the different syntactic categories in the language. So you will have an RE for integers, an RE for identifiers, an RE for floats, for example.>>Expressions.>>No, so expressions are not; the REs are for the words of the language, the syntactic categories. When we get to parsing, we will be using these words to build sentences or statements. Then there will be REs for the different keywords: an RE for int, which is a trivial RE, an RE for for, an RE for while. For a keyword, the RE is the keyword itself, because it is a language that has a single string in it, which is that keyword. Then the union of all of these REs is going to be the set of REs for that programming language; if you take the union of these REs, you get one big RE for the programming language. So when we use a
scanner generator, that scanner generator takes REs and gives us C code for the scanner. So let's put down the scanner code, regardless of the language: scanner code. The generator takes the REs (we write a file that has the REs for the different syntactic categories in the language) and it generates the scanner for us. Internally, it does that by converting the REs into NFAs. Then it does the subset construction and converts this into a DFA; the input to the subset construction is really one NFA, because it is the union of many NFAs. Then it does some DFA minimization. In fact we will not be studying this part, but briefly: there are algorithms for minimizing the DFA, minimizing the number of states in the DFA, and DFA minimization gives you a minimal DFA. Was DFA minimization covered in 135?>>No.>>Was it covered?>>I think so.>>Not in great detail.>>Okay, but we will not cover it here, because it is not essential in theory. In practice it is important, but in theory it is not essential. What is essential is converting the REs to NFAs and then the NFAs to DFAs; the minimization is an optimization. Then you generate code for the DFA. All of this is done by the generator, and remember that this is not the compiler's code generator; all of this is only generating the scanner for us. This is not the code generation phase of the compiler; this is only generating the scanner code, and the scanner code is going to be one of the many source files that constitute the compiler's source code. So the scanner is only one
stage in that compiler. So in the assignment, the first programming assignment, you will basically be writing REs for some programming language, a basic programming language. And then you will be using a tool, which is flex, to take your REs and generate the scanner code for you by doing this. So internally it does this. Now we understand what it does internally, except for this part; in fact that part is the one we will talk about today, even though it is very straightforward. At least we know this and this. Okay. Now, the point is that this subset construction runs while generating code for one part of the compiler. We are generating one source file of the compiler; that is what we are doing. So this is not done when we are compiling applications; it is done when you are building the compiler itself. Okay, so now let's see the implementation issues that we need to understand when it comes to building a scanner. So the NFAs that get built
will look like this. So this is a start state, and then, for example for identifiers, you will have a to z; let's ignore the underscore for now. And then you will have this accept state, and on this accept state you will loop, assuming that an identifier starts with a letter, a to z, followed by any combination of letters and digits. So on the loop here we will have a to z or zero to nine. Okay, so this NFA will have an accept state for identifiers; this is the identifier accept state. And it will have an accept state for numbers: for an integer it takes zero to nine, and you have to have at least one digit, and then zero to nine repeated, assuming these are integers with leading zeros allowed. So this is an accept state for integers, positive integers. And then you will have states for the keywords. So you will have, for example, f, o, r for the for keyword, and this is the accept state for the for keyword. Each keyword can have its own branch of the NFA, or you can think of this as the union of multiple NFAs. And you can have, for example, new, another keyword. And you will have a bunch of them, a bunch of REs for the different syntactic categories. So identifier is a syntactic category, integer is a syntactic category, and each keyword is a syntactic category by itself. Now suppose that this is our
input x123 = 59. This is an NFA, by the way; we will explain this assuming an NFA, keeping in mind that it will get converted into a DFA. The NFA is easier to understand; it is probably easier for humans because there are fewer states, while the corresponding DFA is going to have more states. So for us as humans it is easier to work with the NFA, but we know that we can always convert it into a DFA. So in this case, the scanner will try to match this. With x123 it is going to take this path, and in fact the x by itself is going to lead to an accept state. So now, should the scanner just take this x as a valid token, or keep scanning?>>Keep scanning.>>Keep scanning.>>Yeah, keep scanning. So in this case it will
keep scanning until when?>>It hits a different token or a delimiter character.>>Yeah, but how will you know?>>It's not part of the regular expression.>>Yeah, so if you think about it from a language-recognition point of view, it keeps going until it gets to this. And now this lexeme does not match anything in this NFA, right? There is nothing in this NFA that matches this string with the equal sign. But there is something that matches part of it: up to this point, this matches our identifier. When we look at the next symbol, though, there is no transition. So basically, when we get here with the x123, we are in this accept state, and for the next symbol, which is the equal, there is no transition; or rather, the equal takes you to an error state. If there is no transition shown explicitly, that means the symbol takes you to an error state, a trap state. So the scanner will keep scanning until it hits an error like this, and at that point it knows it has to go back, roll back, to the last accept. In this case the last accept happened here, so it will take x123 as the first lexeme that it could match with a syntactic category. It will say: okay, this lexeme is an identifier; and then it will start over. So we just flush this and start scanning from this point. And this equal is going to match some NFA here for equal (there will be a trivial NFA for the equal sign), and this will identify it as an assignment. Then, once it matches the assignment, there is no transition on a five, on a number: = 5 does not match anything, right? So when it gets to the five, it knows this is not going to match, so it takes the last match, which is the equal. Now it starts over and tries to match the five: the five matches here, with the nine we stay here, and if there were more digits it would stay in this accept state until we get to something that does not match, which is the semicolon here, so it will stop. So this is an assignment, this is an integer, and then it will match the semicolon as end of statement. So basically, in order for this to work, the scanner has to go for the longest match, not the first match: the longest match.
Now this leads to a question: what will happen if we have multiple matches? Given this example, can you think of a string that has multiple matches?>>Yeah, two equals in succession.>>Two equals, equal equal. Well, let's stick to what we have in this example.>>For.>>Yeah, for. What if you have for? If you have for, then it will match here and it will match here, so there will be two accept states for for.>>Oh okay.>>Okay, so again we are looking at this as an NFA, but keep in mind that this will get converted to a DFA. So the corresponding DFA for
this, let me just verify it. Let's call the states on the f-o-r branch n1, n2, n3, and let's call the identifier accept state n4. The corresponding DFA for just the identifier and for, ignoring everything else, is going to be like this. With an f we go to a state that corresponds to both n1 and n4, so this is going to be {n1, n4}. Will this be an accept state? Yes, it will be an accept state, because n4 is an accept state even though n1 is not. So this will be an accept state, and with the o we go to state {n2, n4}, which is also an accept state. And then with the r we go to state {n3, n4}. Now {n1, n4} has only one accept state in it, so if we stop here this is only an identifier. But if we get to {n3, n4}, then it is both an identifier and the for keyword. Okay, so in this case we have multiple matches. In fact, it is up to the person who is defining the regular expressions: the scanner generator allows you to specify priorities, that is, if you have multiple matches, which one should be taken. And of course the logical thing here is what?>>For.>>It is to take the keyword. The way you specify this is in how you write the regular expressions: in that file of regular expressions, you write the regular expression for the keyword before you write the regular expression for the identifier, because the flex scanner generator assigns priorities based on the order of the rules in the regular-expression file, the file that has the regular expressions in it. If you mention the keyword first, then you are saying that if there are multiple matches, take the first one, the one that appears first in the list of regular expressions.
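As a sketch, the rules section of such a flex file might look like the following. The token names (KW_FOR, IDENT, and so on) are made up for illustration; a real file would define them and cover the full language:

```lex
%%
"for"                  { return KW_FOR;  /* keyword rules listed first */ }
"new"                  { return KW_NEW;  }
[a-zA-Z][a-zA-Z0-9]*   { return IDENT;   /* equal-length ties go to the earlier rule */ }
[0-9]+                 { return INT_LIT; }
"="                    { return ASSIGN;  }
";"                    { return SEMI;    }
[ \t\n]+               { /* skip whitespace */ }
%%
```

Note that flex still prefers the longest match overall; the rule order only breaks ties between matches of the same length.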
you know, a string like forecast: forecast = 7. In this case you will not have multiple matches, because you go for the longest match. When you go for the longest match, you do not stop here; you keep going until you get to this point, and this is the longest match. For the longest match there is only one matching regular expression, so this will get taken as an identifier, a variable name. Okay. So now we have been
assuming that, in order to represent keywords and identifiers, you should have a regular expression for each keyword. So this is one way of doing it: each keyword has its own regular expression, and that regular expression corresponds to a branch of the NFA that gets built. As we have on the board, this is a branch that corresponds to one regular expression, this is another branch, and so forth. Now, there is another approach to recognizing keywords: instead of implementing them as separate regular expressions, we can recognize them as identifiers, and then whenever we recognize an identifier, we check whether that identifier
is a keyword or not. So the idea here is to get rid of all of these keyword states and just recognize the keywords as identifiers. But for every identifier you do some kind of checking: you have to look it up in a certain table, and that table will have to have what? It will have to have all the keywords in it, so there will be a keyword table. That keyword table has all the keywords in it: if, else, new, for, while, etcetera. And remember, when you match an identifier you check it in the keyword table and see if it is there. If it is there, it is a keyword; if it is not there, it is a regular identifier. So now, in order for this to be efficient, how do you think we should implement this keyword table?>>Hash table.>>Hash table, yeah, exactly. The hash table is the perfect data structure for implementing this keyword table. In fact, you can easily do perfect hashing; this is an example where you can do perfect hashing. Why can you do perfect hashing easily here?>>Because you know exactly what's going to be in it.>>You know exactly the number of entries that you will have in this table, and there are not very many of them. In a typical language the number of keywords is very limited, and you know exactly how many keywords there are. So if the language has 50 keywords, then 50 entries are all you need in this table. So a hash table is the best implementation for this, and you can do perfect hashing. And because you can do perfect hashing, since the number of entries is limited, searching for the lexeme that you have matched, searching for it in the keyword table, is going to take what? What will be the asymptotic complexity of the search operation in this case?>>Constant.>>Constant, yeah. It is going to be O(1), constant-time search.
But now let's try to understand the trade-offs here, the advantages and disadvantages of implementing keywords as separate REs, as separate branches in the NFA, versus implementing keywords as identifiers and then checking the keyword table. Now, from a
space point of view, obviously we have a disadvantage when we implement keywords as separate REs: we will have a lot more states in the NFA, and in the corresponding DFA that the NFA gets converted to. The resulting DFA will be much bigger when keywords are represented as REs, so there will be a lot more states and a lot more space used. Now, in terms of speed: with the hash-table implementation we will have to perform the checking. For every lexeme that matches an identifier, we will have to do the lookup in the table. And in most cases, will the result be found or not found?>>Not found.>>Not found, because in a typical program you have more variable names than you have keywords, right? Most lexemes are going to be variable names or function names, so they are going to be identifiers, not keywords. It is like, for every 100 checks that you do, only 5 of them are going to hit in the keyword table. So you will be doing the checking, and that checking will be an overhead; even though it is O(1), it still takes time, it is not zero time. And the question is: will this checking of the keyword table be done at compile time or while building the compiler?>>Compile time.>>So we will have to do it when we are actually compiling a program. So this is something that will affect compile time, because it is not done here, when we are building the compiler; it is done when we are compiling an actual million-line application. And in those millions of lines there will be hundreds of thousands of variables, and we will have to do this check for each of them, and most of them are not keywords. So there is an overhead here, although it is not too bad, because after all it is constant time, O(1). But it is an overhead that we pay only to reduce the number of states. Any questions on this? Is it clear
what the trade-offs are and what the two
different approaches are? Okay, so is it clear now how this works? In fact, all of this is done for you; there are tools that do this. But there are people who write scanners by hand. The scanner is easy enough to write by hand, unlike the parser. As we will see when we get to parsing, parsing is much more complicated than scanning. So even though there are tools that do this for you, if you have to do all of this by hand, it is not that bad. It is not trivial, but it is not as difficult as building a parser. And next time
we will again go through more details about how this code generation is done, how the scanner gets generated: given a DFA, how do you generate code for that DFA? This is something we will cover next time, and there will be two approaches, a table-driven approach and a direct-coded approach. In the table-driven approach you can just put the DFA in a table, and the table here means a two-dimensional array. Always keep in mind that the DFA is just a two-dimensional array in which the rows are the states and the columns are what?>>The symbols.>>The symbols of the alphabet: a, b, c, and so on.
So we will talk about this in greater detail tomorrow, next time I mean. And I think that will be the last lecture on scanning.
