8
Oct

Compilers Lecture 6: Scanning (3): Converting a Regular Expression into an NFA

>>So, now the procedure for
converting a regular expression to an NFA is called
Thompson construction. Okay. Thompson construction.So
this is not the physicist Thomson, but there
is a physicist with the last name Thompson. I think without the P
so that’s the physicist, this is the computer scientist. Okay. So, the idea here —
so let’s ask this question. Suppose that we have — that we
have the regular expression how to construct a regular
expression, and NFA for a [inaudible]
regular expression. You know, for example if you
have a regular expression that consists of a single
letter in the alphabet, so suppose our alphabet is A,B,
so this is an NFA, a trivial NFA that will accept A.
So, this is an NFA for the regular expression
A. So, A is a regular expression
it’s — that consists of
a single symbol. Right. In general, what’s
a regular expression? A regular expression consists
of the symbols of the alphabet that related using
three operations. What are the three
allowable operations in a regular expression? [inaudible comment]. Three operations?>>Zero or more. One or more.>>But what are the operations
in the regular expression. What are the operations
that we apply to the–>>Yes>>Union, catenation,
and [inaudible].>>Union, catenation, and
star or the Kleene closure. It’s called Kleene closure but let’s call it star
which is repeating. So these are the
three operations which we call regular
operations, by the way. So, a standard regular
expression can have only one — can have any combination
of these operations but it cannot have
any other operations. Anything else is an
extension of that notation of regular expressions. Now, if we know how to
construct NFAs for, you know, the individual symbols just
like this, it’s trivial. And we know how to construct
given two different NFAs for — two NFAs for two
different languages. We know how to construct an
NFA for the union and how to construct an NFA for
the concatenation and how to construct an NFA for a star. Then we are done. Then we can build an
NFA for any language. So, these are the
building blocks and these are the
operations that we apply to these building blocks
then we can build an NFA for any general regular
expression as we will see. You know, at the end we
will give a complete example and we will — things will not
become 100% clear until we get to the complete example. Okay. So now, suppose
that I have two NFAs. So, this is, you know, NFA
one and this is NFA two. So, I have two NFAs for two
different languages and I would like to construct an
NFA for the union. Now, the question is, can
I assume, without any loss of generality, that an NFA
has a single accept state. Is this a valid assumption or
not, that every NFA consists of or has a single accept state? [Inaudible comment]
Well, it may sound — you know, on the
surface it may look like we are losing generality
by making this assumption but the fact we are not
losing generality, why? I can assume without
any loss of generality that every NFA has a
single accept state. Why? Yes.>>You can just map every accept
state into a new I accept state.>>Exactly. Exactly because if you have an
NFA with 1000 accept state — with many accept state,
you can just do this. You can just have an epsilon
transition to one accept state and then you make
these regular states. So these — change this
to a regular state, change this to a regular state, so it’s an NFA has many accept
states, you can a map this to an NFA or you can
convert it easily to an NFA that as a single accept state by
doing these epsilon transitions from the existing accept states to a new accept state
that you introduced. And then, you change these from
accept states to regular states. Okay. So now, given this, we
can assume without any loss of generality that an NFA
has a single accept state. Now, to build an NFA for the
union, what should we do? The union of these — this
is an NFA for language one, this is an NFA for language two. So, this is for language one. This is for language too. So, we need an NFA
for L one union L two. Yes?>>Use a branch.>>Yes. So we branch. We — so — we have
a start state. Then we branch. Okay. So, we branch. And this is the — you
know, this is NFA one and we have this accept state. And this NFA two with
this accept state and then what should I do? I don’t have accept states now. Well, I can leave it
like this by the way. I can leave it to like this
but to stick to our standard of having one accept state, we can just map these
one accept state. We don’t have to, by the way. So, this is correct. This is correct. We don’t have to map
them to one accept state or introduce a transition
to one accept state. But, let’s stick to our standard
of unifying the accept state. So, this is an NFA
for the union, right? Okay. So, this is for the union. For the concatenation, NFA for the concatenation,
let’s do it here. NFA for the concatenation. So, what should we do? For L one concatenated
with L two? Yes. Exactly. So, put L two after L one. So we just do this. I will do something wrong and
then we figure out what’s wrong. So, NFA one, NFA two. Okay. So, this is an NFA
for the concatenation, L one concatenated with L two. So, what’s wrong with
what I have on the board?>>[Inaudible] NFA one
[inaudible] the accept state>>Exactly. So, this accept state, you know
I just, did this to emphasize that when we do the
concatenation, the accept state of the first NFA will no
longer be an accept state. Just to emphasize this. Okay. We’re doing
the abstract and now but in a minute we will see
a concrete example for this. Okay. So, this is
the concatenation. Now let’s do the star. The star is a little bit
tricky so let’s do, you know, and NFA for L one star. So, an NFA for L one star. Let me do the — make the first
attempt that will be wrong and then we will see why it’s
wrong and how to correct it. So, if I do this,
this is NFA one. So, to construct an
NFA for the star, what would be tempting
would be to do what? To do epsilon from
accept to start, right? We’ll be tempted to do
this but, in general, this will not work
Because, first of all, for a star you have to accept
the empty string all the time. Because star, you can repeat
zero or any number of times and if you repeat zero times,
this means the empty string. So, you have to accept
the empty string. So, this will not necessarily
accept the empty string but if we try to accept the
empty string, by doing this, then this will be a disaster. [Laughter]. Why will this be a disaster? [Inaudible comment]
Oh no, we are assuming that you are not changing
what’s in between. So, NFA one, we are
assuming, that — you know, all the transitions
that are in between. — between the start state
and the accept state. They are the same. So, in this case we can have
a million states between them and they are still the same. So, all the transitions in
the state between the start and the accept are still there. We did not — we
don’t touch them. This is what I mean. So, it’s good that
you’re brought this up because I’m not implying here that my NFA consists
of two states. It consists — this is
the accept — sorry — this is the start and
this is the accept and between them you can have
as many states as you want and as many transitions as you
want but we are only looking at the start and the accept. Okay. So, yes, I am glad
that you brought this up. So, okay. So, now why
is this a disaster? Yes.>>You have in the NFA if
there is a return to start, that would be intended
to be an error. [Inaudible] accept it but
since you just turned your star into an accept state –>>Yes. It will accept
it many things that shouldn’t be accepted in the easiest —
the simplest example. What if the state has
something like this? Zero, one or sigma? Yes, assuming that
sigma equals zero, one. So, if it has this, if
the original NFA has this, then what are we doing here? What’s the language that
this NFA recognizes? What does this NFA
recognize now with these — with everything in red? [Inaudible comment] It’s going to accept anything
which is sigma star. So, it’s going to
accept any string. Any sting will be accepted
because you are already in an accept state and
you have a transition. So, let’s give a
concrete example. Just — let’s give
a concrete example. This is abstract. So, what if we have
this language L — the set of strings W such
that W ends with one. So, ends with one
is going to be — the NFA is going
to be like this. Zero, one, and then you have
a one then you have an accept state, right? So, this is an NFA for this
language W ends with one. Now, if you try to
construct a stage for the star for L star, by doing this — Right? Then, this is no longer
— this is not the star of this, this is an NFA for sigma star. This is an NFA that
can accept any string because this is already
an accept state. Okay. So, in conclusion, I can’t
use the existing start state as an accept state that will
allow me to accept epsilon so I’m going to use — I am
going to introduce a new state, a new start state, that
I will make accept state. So, now this is — So, I will introduce
a new star state. This will become an accept state and I’m doing this
only to accept epsilon. But this will not accept
anything and then the transition from — I have to have
this backward transition from the accept to the start to implement the loop
[inaudible] repeating. And then, just stick
to my standard of having one accept
state, I will just — let me just draw a new one. So, I will just do this. So this is NFA one and
then epsi — sorry. And then, both of
these are going to go to one accept state
and, of course, these are epsilon transitions. Okay. So, this is how I
construct an NFA for the star. I introduce a new state and
then I add an epsilon transition from the accept state
to the start state. And, both of these, the
existing accept state and the new accept state, I
introduce a transition from them to a new accept state. Yes.>>Will we be able to
use the new initial state as the accepting state? I don’t understand why you
created a second new state?>>You can but it is
just to be systematic. You know, systematically
apply the same procedure. Maybe to make it — emphasize that these were the two original
accept states and we met both of them to one accept state. Okay. Yes. You are suggesting using
this as one accept state and then mapping — introducing
an epsilon transition from this to this? You have to think about this. I can’t verify it now. It is — I will have to
think about it for a minute. It may be something wrong. So, I will need a minute or
two to think about to verify that this would work but
I am sure that this works. Okay. So, now this
looks abstract. Now let’s look at
the concrete example and the concrete example will
be — will look much easier. Okay. So, let’s do a concrete
example from the book. A, B union C, star. So, this is a regular
expression and who would like to construct an NFA
for this regular expression. Okay. So, we first construct
a regular expressions for the individual
symbols which is trivial. So this is an, you know, an R — a regular expression for
A. It’s going to be A. And, a regular expression
for B is going to be B. And a regular expression — sorry NFA not a regular
expression. Sorry, I meant an NFA. NFA four C is C. Okay. So this looks trivial so far. Now what should be
the next step? Constructing an NFA for — so if you were evaluating
this expression, what would you evaluate next? [inaudible comment]
Yes, the union. So, the next step is to
construct an NFA for the union. So, an NFA for B union C. So,
using the rule that we just had on the board, constructing
an NFA for the union we
will just branch. All right. So we will do this. Epsilon, epsilon,
and then this is B and then this is C. Then what? [inaudible comment] one
common accept state. And again, you know, having
one common accept state is not necessary but let’s stick to
this convention or standard. It’s not necessary
for correctness, but let’s stick to
this convention. Okay. So, now we have
an NFA for B union C, now we need to [inaudible]
an NFA for what? Did you have a question? So, and NFA for the
star of the union. So we have to star
this whole thing. All right. So this is going to
be the main step here. Starring this whole
thing and starring, we will have an epsilon
transition to the start state and we will introduce
a new state. So let’s draw a new one. So, NFA for B union C, star. So, to star something,
we add a new state. let’s do it in a
different color. And then — We copy this whole thing. We just copy. I am copying right now. Okay. What else do I need to do? What else do I need to do? What’s the most important
step in the star?>>You have to go
back to the beginning.>>Yes. You have to go
back to the beginning.Okay. So, I did this and now my
accept states are this and this. So, I will just add this
or make it look better. Okay. So this is an NFA for
B union C star or the star of B union C. And I
forgot [inaudible]. Okay. Now, what remains is what? What is the last step?>>Adding A?>>So, concatenating
A with this. And concatenating means just
you stick the A before this so I’m not going
to draw this again, I will just put the A
— So, this is the A — And, so the a is
going to get connected to this using a star,
and epsilon. So I connect them with epsilon
transition and the accept state for the A will no longer
be an accept state. And remember the accept state for the first one will
no longer be accept. And now, the obvious
observation here is that this procedure
introduces too many what?>>Epsilons?>>Too many epsilon transitions. So we will do it
systematically so — because this is something
that we — we will program. Program, process,
input systematically. Now, there are ways of
minimizing the number of epsilons but you
have to think of it — of a systematic procedure for minimizing the
number of epsilons. So, it’s not always obvious
where there are certain star — certain epsilon transitions
can be eliminated or not. You know, sometimes —
some of them are redundant and can be eliminated
and some of them can’t. So, it’s not that trivial
task to determine, you know, which of these starts
are redundant and which ones can
be eliminated. So, we will stick to
the systematic procedure that will — that based
on [inaudible] introduced, you know, ridiculously
large number of epsilons. But we will stick to the
systematic procedure. The Thompson procedure. Okay. Questions? Okay. And, in fact, this
is one of the topics that usually students
find very easy. In the exams, you know, in
[inaudible] I will not ask about this in the exam but in [inaudible] I will
always have a question like this and students usually
find it very easy because it’s very systematic. Yes.>>[Inaudible] these
questions are asked expected to have these epsilon
transitions?>>Yes. So just do
it systematically. In fact, you know in
this course we care more about doing things
systematically because we are building a
program and the program it’s — the program may not
be as smart as you are when you look at
this on the board. You know, when you look at the
small example on the board, we can identify redundant or
epsilons or we can, you know, find a way minimizing
the number of states. But to do this in
general, is not trivial. To do it for, you know, imagine
a state with a thousand — an NFA with a thousand states. It will not be easy.