# Compilers Lecture 6: Scanning (3): Converting a Regular Expression into an NFA

>>So, now the procedure for

converting a regular expression to an NFA is called

Thompson construction. Okay. Thompson construction.So

this is not the physicist Thomson, but there

is a physicist with the last name Thompson. I think without the P

so that’s the physicist, this is the computer scientist. Okay. So, the idea here —

so let’s ask this question. Suppose that we have — that we

have the regular expression how to construct a regular

expression, and NFA for a [inaudible]

regular expression. You know, for example if you

have a regular expression that consists of a single

letter in the alphabet, so suppose our alphabet is A,B,

so this is an NFA, a trivial NFA that will accept A.

So, this is an NFA for the regular expression

A. So, A is a regular expression

it’s — that consists of

a single symbol. Right. In general, what’s

a regular expression? A regular expression consists

of the symbols of the alphabet that related using

three operations. What are the three

allowable operations in a regular expression? [inaudible comment]. Three operations?>>Zero or more. One or more.>>But what are the operations

in the regular expression. What are the operations

that we apply to the–>>Yes>>Union, catenation,

and [inaudible].>>Union, catenation, and

star or the Kleene closure. It’s called Kleene closure but let’s call it star

which is repeating. So these are the

three operations which we call regular

operations, by the way. So, a standard regular

expression can have only one — can have any combination

of these operations but it cannot have

any other operations. Anything else is an

extension of that notation of regular expressions. Now, if we know how to

construct NFAs for, you know, the individual symbols just

like this, it’s trivial. And we know how to construct

given two different NFAs for — two NFAs for two

different languages. We know how to construct an

NFA for the union and how to construct an NFA for

the concatenation and how to construct an NFA for a star. Then we are done. Then we can build an

NFA for any language. So, these are the

building blocks and these are the

operations that we apply to these building blocks

then we can build an NFA for any general regular

expression as we will see. You know, at the end we

will give a complete example and we will — things will not

become 100% clear until we get to the complete example. Okay. So now, suppose

that I have two NFAs. So, this is, you know, NFA

one and this is NFA two. So, I have two NFAs for two

different languages and I would like to construct an

NFA for the union. Now, the question is, can

I assume, without any loss of generality, that an NFA

has a single accept state. Is this a valid assumption or

not, that every NFA consists of or has a single accept state? [Inaudible comment]

Well, it may sound — you know, on the

surface it may look like we are losing generality

by making this assumption but the fact we are not

losing generality, why? I can assume without

any loss of generality that every NFA has a

single accept state. Why? Yes.>>You can just map every accept

state into a new I accept state.>>Exactly. Exactly because if you have an

NFA with 1000 accept state — with many accept state,

you can just do this. You can just have an epsilon

transition to one accept state and then you make

these regular states. So these — change this

to a regular state, change this to a regular state, so it’s an NFA has many accept

states, you can a map this to an NFA or you can

convert it easily to an NFA that as a single accept state by

doing these epsilon transitions from the existing accept states to a new accept state

that you introduced. And then, you change these from

accept states to regular states. Okay. So now, given this, we

can assume without any loss of generality that an NFA

has a single accept state. Now, to build an NFA for the

union, what should we do? The union of these — this

is an NFA for language one, this is an NFA for language two. So, this is for language one. This is for language too. So, we need an NFA

for L one union L two. Yes?>>Use a branch.>>Yes. So we branch. We — so — we have

a start state. Then we branch. Okay. So, we branch. And this is the — you

know, this is NFA one and we have this accept state. And this NFA two with

this accept state and then what should I do? I don’t have accept states now. Well, I can leave it

like this by the way. I can leave it to like this

but to stick to our standard of having one accept state, we can just map these

one accept state. We don’t have to, by the way. So, this is correct. This is correct. We don’t have to map

them to one accept state or introduce a transition

to one accept state. But, let’s stick to our standard

of unifying the accept state. So, this is an NFA

for the union, right? Okay. So, this is for the union. For the concatenation, NFA for the concatenation,

let’s do it here. NFA for the concatenation. So, what should we do? For L one concatenated

with L two? Yes. Exactly. So, put L two after L one. So we just do this. I will do something wrong and

then we figure out what’s wrong. So, NFA one, NFA two. Okay. So, this is an NFA

for the concatenation, L one concatenated with L two. So, what’s wrong with

what I have on the board?>>[Inaudible] NFA one

[inaudible] the accept state>>Exactly. So, this accept state, you know

I just, did this to emphasize that when we do the

concatenation, the accept state of the first NFA will no

longer be an accept state. Just to emphasize this. Okay. We’re doing

the abstract and now but in a minute we will see

a concrete example for this. Okay. So, this is

the concatenation. Now let’s do the star. The star is a little bit

tricky so let’s do, you know, and NFA for L one star. So, an NFA for L one star. Let me do the — make the first

attempt that will be wrong and then we will see why it’s

wrong and how to correct it. So, if I do this,

this is NFA one. So, to construct an

NFA for the star, what would be tempting

would be to do what? To do epsilon from

accept to start, right? We’ll be tempted to do

this but, in general, this will not work

Because, first of all, for a star you have to accept

the empty string all the time. Because star, you can repeat

zero or any number of times and if you repeat zero times,

this means the empty string. So, you have to accept

the empty string. So, this will not necessarily

accept the empty string but if we try to accept the

empty string, by doing this, then this will be a disaster. [Laughter]. Why will this be a disaster? [Inaudible comment]

Oh no, we are assuming that you are not changing

what’s in between. So, NFA one, we are

assuming, that — you know, all the transitions

that are in between. — between the start state

and the accept state. They are the same. So, in this case we can have

a million states between them and they are still the same. So, all the transitions in

the state between the start and the accept are still there. We did not — we

don’t touch them. This is what I mean. So, it’s good that

you’re brought this up because I’m not implying here that my NFA consists

of two states. It consists — this is

the accept — sorry — this is the start and

this is the accept and between them you can have

as many states as you want and as many transitions as you

want but we are only looking at the start and the accept. Okay. So, yes, I am glad

that you brought this up. So, okay. So, now why

is this a disaster? Yes.>>You have in the NFA if

there is a return to start, that would be intended

to be an error. [Inaudible] accept it but

since you just turned your star into an accept state –>>Yes. It will accept

it many things that shouldn’t be accepted in the easiest —

the simplest example. What if the state has

something like this? Zero, one or sigma? Yes, assuming that

sigma equals zero, one. So, if it has this, if

the original NFA has this, then what are we doing here? What’s the language that

this NFA recognizes? What does this NFA

recognize now with these — with everything in red? [Inaudible comment] It’s going to accept anything

which is sigma star. So, it’s going to

accept any string. Any sting will be accepted

because you are already in an accept state and

you have a transition. So, let’s give a

concrete example. Just — let’s give

a concrete example. This is abstract. So, what if we have

this language L — the set of strings W such

that W ends with one. So, ends with one

is going to be — the NFA is going

to be like this. Zero, one, and then you have

a one then you have an accept state, right? So, this is an NFA for this

language W ends with one. Now, if you try to

construct a stage for the star for L star, by doing this — Right? Then, this is no longer

— this is not the star of this, this is an NFA for sigma star. This is an NFA that

can accept any string because this is already

an accept state. Okay. So, in conclusion, I can’t

use the existing start state as an accept state that will

allow me to accept epsilon so I’m going to use — I am

going to introduce a new state, a new start state, that

I will make accept state. So, now this is — So, I will introduce

a new star state. This will become an accept state and I’m doing this

only to accept epsilon. But this will not accept

anything and then the transition from — I have to have

this backward transition from the accept to the start to implement the loop

[inaudible] repeating. And then, just stick

to my standard of having one accept

state, I will just — let me just draw a new one. So, I will just do this. So this is NFA one and

then epsi — sorry. And then, both of

these are going to go to one accept state

and, of course, these are epsilon transitions. Okay. So, this is how I

construct an NFA for the star. I introduce a new state and

then I add an epsilon transition from the accept state

to the start state. And, both of these, the

existing accept state and the new accept state, I

introduce a transition from them to a new accept state. Yes.>>Will we be able to

use the new initial state as the accepting state? I don’t understand why you

created a second new state?>>You can but it is

just to be systematic. You know, systematically

apply the same procedure. Maybe to make it — emphasize that these were the two original

accept states and we met both of them to one accept state. Okay. Yes. You are suggesting using

this as one accept state and then mapping — introducing

an epsilon transition from this to this? You have to think about this. I can’t verify it now. It is — I will have to

think about it for a minute. It may be something wrong. So, I will need a minute or

two to think about to verify that this would work but

I am sure that this works. Okay. So, now this

looks abstract. Now let’s look at

the concrete example and the concrete example will

be — will look much easier. Okay. So, let’s do a concrete

example from the book. A, B union C, star. So, this is a regular

expression and who would like to construct an NFA

for this regular expression. Okay. So, we first construct

a regular expressions for the individual

symbols which is trivial. So this is an, you know, an R — a regular expression for

A. It’s going to be A. And, a regular expression

for B is going to be B. And a regular expression — sorry NFA not a regular

expression. Sorry, I meant an NFA. NFA four C is C. Okay. So this looks trivial so far. Now what should be

the next step? Constructing an NFA for — so if you were evaluating

this expression, what would you evaluate next? [inaudible comment]

Yes, the union. So, the next step is to

construct an NFA for the union. So, an NFA for B union C. So,

using the rule that we just had on the board, constructing

an NFA for the union we

will just branch. All right. So we will do this. Epsilon, epsilon,

and then this is B and then this is C. Then what? [inaudible comment] one

common accept state. And again, you know, having

one common accept state is not necessary but let’s stick to

this convention or standard. It’s not necessary

for correctness, but let’s stick to

this convention. Okay. So, now we have

an NFA for B union C, now we need to [inaudible]

an NFA for what? Did you have a question? So, and NFA for the

star of the union. So we have to star

this whole thing. All right. So this is going to

be the main step here. Starring this whole

thing and starring, we will have an epsilon

transition to the start state and we will introduce

a new state. So let’s draw a new one. So, NFA for B union C, star. So, to star something,

we add a new state. let’s do it in a

different color. And then — We copy this whole thing. We just copy. I am copying right now. Okay. What else do I need to do? What else do I need to do? What’s the most important

step in the star?>>You have to go

back to the beginning.>>Yes. You have to go

back to the beginning.Okay. So, I did this and now my

accept states are this and this. So, I will just add this

or make it look better. Okay. So this is an NFA for

B union C star or the star of B union C. And I

forgot [inaudible]. Okay. Now, what remains is what? What is the last step?>>Adding A?>>So, concatenating

A with this. And concatenating means just

you stick the A before this so I’m not going

to draw this again, I will just put the A

— So, this is the A — And, so the a is

going to get connected to this using a star,

and epsilon. So I connect them with epsilon

transition and the accept state for the A will no longer

be an accept state. And remember the accept state for the first one will

no longer be accept. And now, the obvious

observation here is that this procedure

introduces too many what?>>Epsilons?>>Too many epsilon transitions. So we will do it

systematically so — because this is something

that we — we will program. Program, process,

input systematically. Now, there are ways of

minimizing the number of epsilons but you

have to think of it — of a systematic procedure for minimizing the

number of epsilons. So, it’s not always obvious

where there are certain star — certain epsilon transitions

can be eliminated or not. You know, sometimes —

some of them are redundant and can be eliminated

and some of them can’t. So, it’s not that trivial

task to determine, you know, which of these starts

are redundant and which ones can

be eliminated. So, we will stick to

the systematic procedure that will — that based

on [inaudible] introduced, you know, ridiculously

large number of epsilons. But we will stick to the

systematic procedure. The Thompson procedure. Okay. Questions? Okay. And, in fact, this

is one of the topics that usually students

find very easy. In the exams, you know, in

[inaudible] I will not ask about this in the exam but in [inaudible] I will

always have a question like this and students usually

find it very easy because it’s very systematic. Yes.>>[Inaudible] these

questions are asked expected to have these epsilon

transitions?>>Yes. So just do

it systematically. In fact, you know in

this course we care more about doing things

systematically because we are building a

program and the program it’s — the program may not

be as smart as you are when you look at

this on the board. You know, when you look at the

small example on the board, we can identify redundant or

epsilons or we can, you know, find a way minimizing

the number of states. But to do this in

general, is not trivial. To do it for, you know, imagine

a state with a thousand — an NFA with a thousand states. It will not be easy.