22
Oct

# Lec 15 | MIT 6.00SC Introduction to Computer Science and Programming, Spring 2011

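The following content is provided under a Creative Commons license. Your support will help MIT
OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view
additional materials from hundreds of MIT courses, visit
MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: Good morning. This is the second of two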
lectures that I’m retaping in the summer because we had
technical difficulties with the lectures that were taped
during the academic term. I feel I need to tell you
this for two reasons. One, as I said before,
the room is empty. And so when I say something
hilarious and there’s no laughter, it’s because
the room is empty. And if I’m not asking the
students questions during the lecture, it’s because there
are no students. The other important thing I want
you to understand is that I do own more than one
shirt and more than one pair of pants. And the reason I’m dressed
exactly the way I was for lecture 13 is I gave lecture
13 five minutes ago, even though this is lecture 15. So again, here I am. And I apologize for the
uniformity in my clothing. OK, on lecture 14, which came
between 13 and 15, at the end of it I was talking about
flipping coins and ended up with the question how can we
know when it’s safe to assume that the average result of some
finite number of flips is representative of what we would
get if we flipped the same coin many more times. In principle, an infinite
number of times. Well, we might flip a coin
twice, get 1 heads and 1 tails and conclude that the true
probability of getting a head is exactly 0.5. Turns out– assume I have a fair coin– this would have been the
right conclusion. Just because we have the right
answer it doesn’t mean our thinking is any good. And in fact, in this case our
reasoning would have been completely faulty because if
I flipped it twice and had gotten 2 heads, you might have
said oh, it’s always heads. But we know that wouldn’t
have been right. So the question I want to pose
at the start of today’s lecture is quite simply how
many samples do we need to believe the answer? So how many samples do we need
to look at before we can have confidence in the result. Fortunately, there’s a very
solid set of mathematics that lets us answer this question
in a good way. At the root of all of it is
the notion of variance. Variance is a measure of how
much spread there is in the possible outcomes. Now in order to talk about
variance, given this definition, we need to have
different outcomes, which is why we always want to run
multiple trials rather than say one trial with many flips. In fact, you may have wondered
why am I not– if I could end up flipping the coin a million
times, why would I do multiple trials adding up to a
million rather than 1 trial of a million? And the reason is by having
multiple trials, each of which give me a different outcome, I
can then look at how different the outcomes of the different
trials are and get a measure of variance. If I do 10 trials and I get the
same answer each time, I can begin to believe that really
is the correct answer. If I do 10 trials and get 10
wildly different answers, then I probably shouldn’t believe any
one of those answers, and I probably shouldn’t even think
I can average those answers and believe the mean is
a real answer because if I run an 11th trial maybe I’ll
get something totally different yet again. We can formalize this notion
of variance in a way that should be familiar
to many of you. And that’s the concept of
a standard deviation. Something I, in fact, already
showed you when we looked at the spread of grades on the
first quiz this semester. Informally, what the standard
deviation is measuring is the fraction of values that
are close to the mean. If many values are close to
the mean, the standard deviation is small. If many values are relatively
far from the mean, the standard deviation is
relatively large. If all values are the same,
then the standard deviation is 0. In the real world that
essentially never happens. We can write a formula
for this. Fortunately, it’s not
all about words. And we can say the standard
deviation of x, where x is a set of trials– sigma is usually used
to talk about that– is equal to the square root of
1 over the absolute value of the length of x. So that’s 1 over the number of
trials times the summation of the value of each trial, little
x and big X, of x minus mu squared, where
mu is the mean. And as I said, that’s the
cardinality of x. Well, so that’s a formula. And those of you are majoring
in math are going to love that. But for those of you who are
more computationally oriented, I recommend you just take
a look at the code. So here’s an implementation
of the standard deviation. So the standard deviation
of x is equal– start by getting the mean of x,
which is by summing x and dividing it by the
length of x. Then I’m just going to sum all
the values in x and do the computation. So that code and that formula
are the same thing. All right, now we know what
standard deviation means. What are we going
to do with it? We’re going to use it to look
at the relationship between the number of samples we’ve
looked at and how much confidence we should
have in the answer. So we’ll do this again looking
at a bunch of code. So I’ve got this function flip
plot, which doesn’t quite fit in the screen, but that’s OK. It’s not very interesting
in the details. What it does is it runs multiple
trials of some number of coin flips and plots a bunch
of values about the relative frequency of heads
and tails and also the standard deviation of each. So again, nothing very exciting
in the code. I’m just going to keep track of things for
all these trials. It takes the minimum and the
maximum exponent; I’m using that so I can run
a lot of trials quickly. It keeps the mean ratios, the
differences, and the standard deviations. Then, for exponent
in range from minimum exponent to maximum
exponent plus 1, I’m going to build an x-axis. So this is going to be
the number of flips. And then for the number of flips
I’m going to run a bunch of tests and get the ratios
of heads to tails and the absolute difference between
heads and tails. And then, I’m going to do
a bunch of plotting. And again, what I want you to
notice is when I’m doing the plotting, I’m going to
label the axes and put some titles on. And I’m also going to use
semilog because given that I’m looking at different powers, it
would compress everything on the left if I
just used linear. All right. Let’s run it. Actually, let’s uncomment the
code we need to run it. So I’m going to call flip plot
with a minimum exponent of 4, a maximum exponent of 20. That’s pretty high. And I’m going to
run 20 trials. This could take a little while
to run, but not too long. And it’ll give us some pretty
pictures to look at. Give me a chance to have
a drink of water. I know the suspense is killing
you as to what these plots are going to look like. Here they are. All right. So if we look at plot
one, that’s the ratio of heads to tails. And as you can see, it bounces
around in the beginning. When we have a small number of
flips, the ratio moves a bit. But as I get to a lot of flips
out here, 10 to the 5th, 10 to the 6th, what we’re seeing is
it begins to stabilize. We don’t get much difference. Kind of interesting where
it’s stabilizing. Maybe not where we’d
expect it. I would have guessed it would
stabilize a little closer to one than it did as
I got out here. And maybe I have
an unfair coin. That’s the problem with running
these experiments in real time that I can’t
necessarily get the answer I want. But for the most part, actually,
it looks much better on my screen than it does
on your screen. In fact, in my screen,
it looks like it’s very close to 1. I don’t know. I guess there’s some
distortion here. Think 1. And if we look at the standard
deviation of the ratio of heads to tails, what we’re
seeing is that’s also dropping from somewhere up here around
10 to the 0 down to 10 to the minus 3. And it’s dropping pretty
steadily as I increase the number of trials. That’s really what you would
hope to see and expect to see. Sorry, not the number of trials,
the number of flips. As I flip more coins, the
variance between trials should get smaller because in some
sense, randomness is playing a less important role. The more random trials you do,
the more likely you are to get something that’s actually
representative of the truth. And therefore, you would
expect the standard deviation to drop. All right. Now, what we’re saying here is
because the standard deviation is dropping, not only are we
getting something closer to the right answer but perhaps
more importantly, we have better reason to believe we’re
seeing the right answer. That’s very important. That’s where I started
this lecture. It’s not good enough
to get lucky and get the correct answer. You have to have evidence that
can convince somebody that really is the correct answer. And the evidence here is the
small standard deviation. Let’s look at a couple
of the other figures. So here’s Figure (3). This is the mean of the absolute
difference between heads and tails. Not too surprisingly– we saw this in the
last lecture– as I flip more coins, the
mean difference is going to get bigger. That’s right. We expect the ratio to get
smaller, but we expected the mean difference to get bigger. On the other hand, let’s
look at Figure (4). What we’re looking at here is
the standard deviation of this difference. And interestingly, what we’re
seeing is the more coins I flip, the bigger the standard
deviation is. Well, this is kind
of interesting. I look at it, and I sort of said
that when the standard deviation is small, we think
that the variance is small. And therefore, the results
are credible. When the standard deviation
is large, we think the variance is large. And therefore, the results
are maybe incredible. Well, I said that a
little bit wrong. I tried to say it right
the first time. What I have to ask is not is the
standard deviation large or small, but is it relatively
large or relatively small? Relative to what? Relative to the mean. If the mean is a million, and
the standard deviation is 20, it’s a relatively small
standard deviation. If the mean is 10, and the
standard deviation is 20, then it’s enormous. So it doesn’t make sense. And we saw this. We looked at quizzes. If the mean score on Quiz 1
were 70 and the standard deviation were 5, we’d
say OK, it’s pretty packed around the mean. If the mean score were 10, which
maybe is closer to the truth, and the standard
deviation were 5, then we’d say it’s not really packed
around the mean. So we have to always look at it
relative or think about it relative to that. Now the good news is we have,
again, a mathematical formula that lets us do that. Get rid of all those figures
for the moment. And that formula is called the
coefficient of variation. For reasons I don’t fully
understand, this is typically not used. People always talk about
the standard deviation. But in many cases, it’s the
coefficient of variation that really is a more
useful measure. And it’s simply the standard
deviation divided by the mean. So that lets us talk
about the relative variance, if you will. The nice thing about doing
this is it lets us relate different datasets with
different means and think about how much they vary
relative to each other. So if we think about it, usually we argue that if
it’s less than 1, we think about that as low variance. Now there should be some
warnings that come with the coefficient of variation. And these are some of the
reasons people don’t use it as often because they don’t
want to bother giving the warning labels. If the mean is near 0, small
changes in the mean are going to lead to large changes in the
coefficient of variation. They’re not necessarily
very meaningful. So when the mean is near 0, the
coefficient of variation is something you need
to think about with several grains of salt. Makes sense. You’re dividing by something
near 0, a small change is going to produce
something big. Perhaps more importantly,
or equally importantly– and this is something we’re
going to talk about later– is that unlike the standard
deviation, the coefficient of variation cannot be used to construct confidence intervals. I know we haven’t talked about
confidence intervals yet, but we will shortly. All right. By now, you’ve got to be
tremendously bored with flipping coins. Nevertheless, I’m going to ask
you to look at one more coin flipping simulation. And then, I promise we’ll
change the topic. And this is to show you some
more aspects of the plotting facilities in PyLab. So I’m going to just flip
a bunch of coins, run a simulation. You’ve seen this a
zillion times. And then, we’ll make
some plots. And this is really kind of
the interesting part. What I want you to notice
is that up until now we’ve been plotting curves. Here we’re going to
plot a histogram. So I’m going to give a set of
values, a set of y values. In this case the fraction
of heads. And a number of bins in which
to do the histogram. So let’s look at a little
example first here, independent of this program. Oops. Wrong way. So I’m going to set l, a list,
equal to [1, 2, 3, 3, 3, 4]. And then, I’m just going to plot
a histogram with 6 bins. And then show it. I’ve done something I’m
not supposed to do. There’s no title. There’s no x-label. No y-label. That’s because this is
totally meaningless. I just wanted to show you
how histograms work. And what you’ll see here is that
it’s shown that I’ve got three instances of this
value, of 3, and one of everything else. And it’s just giving me
essentially a bar chart, if you will. Again many, many plotting
capabilities you’ll see on the website. This is just a simple one. One I like to use and use
fairly frequently. Some other things I want to
show you here is I’m using xlim and ylim. So what this does here
is set the limits of the x and y-axis. Rather than
using defaults, it’s saying the lowest value should be this
thing, the variable called xmin, which I’ve computed
up here, and the highest xmax. What you’ll see if we
go up a little bit– so I’m getting the fraction of
heads1 and computing the mean 1, and the standard
deviation 1. Then I’m going to plot a
histogram the way we looked at it. And then what I’m going to do
is say xmin, xmax = pylab.xlim(). If you call xlim with no
arguments, what it will return is the minimum x value and the
maximum x value of the current plot, the current figure. So now I’ve stored the minimum
x value and the maximum x value of the current one. And I did the same
thing for y. And then I’m going to plot
the figure here. Then I’m going to
run it again. I’m going to run another
simulation, getting fracHeads 2, mean 2, standard
deviation 2. Going to plot the histograms. But then I’m going to set, for
the new one, the x limit of this to the previous ones
that I saved from the previous figure. Why am I doing that? Because I want to be able to
compare the two figures. As we’ll see when we have our
lecture on how to lie with data, a great way to fool people
with figures is to subtly change the range
of one of the axes. And then you look at things
and wow, that’s really different or they’re
really the same. When in fact, neither
conclusion is true. It’s just that they’ve been
normalized to either look the same or look different. So it’s kind of cheating. And then we’ll do it. So now let’s run it and
see what we get. I don’t need this little
silly thing first. Let’s see. It’s going to take a
long time, maybe. That’s one way to fill
up a lecture. Just run simulations that
take a long time to run. Much easier to prepare
than actual material. But nevertheless, shouldn’t
take forever. I may have said this before. I have two computers. I have a fast one that sits at
my desk that I use to prepare my lectures and a slower
one that I use to give the lectures. I should probably be testing
all these things out on the slow computer before
making you wait. But really, it’s going to stop. I promise. Ah. All right. So we’ll look at these. So Figure (1) has got 100,000
trials of 100 flips each. And Figure (2), 100,000 trials
of 1,000 flips each. And let’s look at the two
figures side by side. Make them a little smaller so
we can squeeze them both in. So what have we got here? Notice if we look at these
two plots, the means are about the same. 0.5 and 0.499. Not much difference. The standard deviations
are quite different. And again, you would
expect that. 100 flips should have a lot
higher standard deviation than 1,000 flips. And indeed, it certainly does. 0.015 is a lot smaller
than 0.05. So that tells us
something good. It says, as we’ve discussed,
that these results are more credible than these results. Not to say that they’re
more accurate because they’re not really. But they’re more believable. And that’s what’s important. Notice also the spread of
outcomes is much tighter here than it is here. Now, that’s why I played
with xlim. If I used the default values, it
would not have looked much tighter when I put this up on
the screen because it would have said well, we don’t have
any values out here. I don’t need to display
all of this. And it would have then had about the
same visual width as this, and therefore been potentially very
deceptive when you just stared at it, if you didn’t look
carefully at the units on the x-axis. So what I did, since I knew
I wanted to show you these things side by side, is make the
axes run the same length, and therefore produce
comparable figures. I also, by the way, used xlim
and ylim if you look at the code, which you will have in
your handout, to put this text box in a place where it
would be easy to see. You can also use the default of
best, which often puts it in the right place. But not always. The distribution of the results
in both cases is close to something called the
normal distribution. And as we talk about things like
standard deviation or a coefficient of variation, we are
talking about not just the average value but the
distribution of values in these trials. The normal distribution, which
is very common, has some interesting properties. It always peaks at the mean and
falls off symmetrically. The shape of the normal
distribution, so I’m told, looks something like this. And there are people who imagine
it looks like a bell. And therefore, the normal
distribution is often also called the bell curve. That’s a terrible picture. I’m going to get rid of it. And indeed, mathematicians
will always call it this. This is often what people
use in the non-technical literature. There was, for example, a very
controversial book called “The Bell Curve,” which I don’t
recommend reading. OK. So this is not a perfect
normal distribution. It’s not really exactly
symmetric. We could zoom in on this one
and see if it’s better. In fact, let me make
that larger. And then we’ll zoom in on it. Now that we’re not comparing the
two, we can just zoom in on the part we care about. And you can see again it’s
not perfectly symmetric. But it’s getting there. And in fact, the trials
are not very big. Only 1,000 flips. If I did 100,000 trials of
100,000 flips each, we wouldn’t finish the lecture. It’d take too long. But we’d get a very pretty
looking curve. And in fact, I have done that
in the quiet of my office. And it works very nicely. And so in fact, we would be
converging here on the normal distribution. Normal distributions are
frequently used in constructing probabilistic
models for two reasons. Reason one, is they have nice
mathematical properties. They’re easy to reason about for
reasons we’ll see shortly. That’s not good enough. The curve where every value
is the same has even nicer mathematical properties
but isn’t very useful. But the other nice thing about normal
distributions, reason two, is that they have many naturally occurring
instances. So let’s first look at what
makes them nice mathematically and then let’s look at
where they occur. So the nice thing about them
mathematically is they can be completely characterized by two
parameters, the mean and the standard deviation. Knowing these is the equivalent
to knowing the entire distribution. Furthermore, if we have a normal
distribution, the mean and the standard deviation
can be used to compute confidence intervals. So this is a very important
concept. One that you see all the time in
the popular press but maybe don’t know what it actually
means when you see it. So instead of estimating
an unknown parameter– and that’s, of course, all we’ve
been doing with this whole probability business. So you get some unknown
parameter like the probability of getting a head or a tail, and
we’ve been estimating it using the various techniques. And typically, you’ve been
estimating it by a single value, the mean of
a set of trials. A confidence interval instead
allows us to estimate the unknown parameter by providing
a range that is likely to contain the unknown value. And a confidence that
the unknown value lies within that range. It’s called the confidence
level. So for example, when you look at
political polls, you might see something that would say the
candidate is likely to get 52% of the vote plus
or minus 4%. So what does that mean? Well, if somebody doesn’t
specify the confidence level, they usually mean 95%. So what this says is that 95
percent of the time (a 95% confidence interval), if the election were actually
conducted, the candidate would get somewhere between 48%
and 56% of the vote. So 95 percent of the time, in 95
percent of the elections, the candidate would
get between 48% and 56% of the votes. So we have two things, the range
and our confidence that the value will lie within
that range. When they make those
assumptions, when you see something like that in the
press, they are assuming that elections are random trials
that have a normal distribution. That’s an implicit assumption
in the calculation that tells us this. The nice thing here is that
there is something called the empirical rule, which is used
for normal distributions. It gives us a handy way to
estimate confidence intervals given the mean and the
standard deviation. If we have a true normal
distribution, then roughly speaking, 68% of the data are
within one standard deviation of the mean. And 95% within two
standard deviations. And almost all, 99.7%, will
fall within three. These values are
approximations. They’re not exactly right. It’s not exactly 68 and 95. But they’re good enough
for government work. So we can see this here. And this is what people
use when they think about these things. Now this may raise
an interesting question in your mind. How do the pollsters
go about finding the standard deviation? Do they go out and conduct
100 separate polls and then do some math? Sort of what we’ve been doing. You might hope so, but that’s
not what they do because it’s expensive. And nobody wants to do that. So they use another
trick to estimate the standard deviation. Now, you’re beginning to
understand why these polls aren’t always right. And the trick they use for that
is something called the standard error, which
is an estimate of the standard deviation. And you can only do this under
the assumption that the errors are normally distributed and
also that the sample population is small. And I mean small, not large. It’s small relative to the
actual population. So this gets us to one of the
things we like about the normal distribution that in
fact, it’s often an accurate model of reality. And when people have done polls
over and over again, they do discover that, indeed,
the results are typically normally distributed. So this is not a
bad assumption. Actually, it’s a pretty
good assumption. So if we have p, which
is equal to the percentage sampled, and we have n, which is equal to
the sample size, we can say that the standard error, which
I’ll write SE, is equal to the formula p times the quantity 100 (because we’re dealing
with percentages) minus p, all divided by n, to
the 1/2, the square root of all of this: $SE = \sqrt{\frac{p\,(100 - p)}{n}}$. So if for example, a pollster
were to sample 1,000 voters, and 46% of them said
that they’ll vote for Abraham Lincoln– we should be so lucky that
Abraham Lincoln were running for office today– the standard error would
be roughly 1.58%. We would interpret this to mean
that 95% of the time, the true percentage
of votes that Lincoln would get is within two standard
errors of 46%. I know that’s a lot to
swallow quickly. So as always, we’ll try and
make sense of it by looking at some code. By now, you’ve probably all
figured out that I’m much more comfortable with code than
I am with formulas. So we’re going to conduct
a poll here. Not really, we’re going
to pretend we’re conducting a poll, given n and p. We’ll start with no votes. And for i in range n, if
random.random() is less than p over 100, the number of votes
will be increased by 1. Otherwise, it will stay where
it was. And we’ll return the number of votes. Nothing very dramatic. And then, we’ll test
the error here. So n equals 1,000, p equals 46,
the percentage of votes that we think Abraham Lincoln
is going to get. We’ll run 1,000 trials. Results equal that. For t in range number of trials
results.append, I’ll run the poll. And we’ll look at the standard
deviation, and we’ll look at the results. And we’ll print the fraction
of votes and the number of polls. All right, let’s see what
we get when we do this. Well, pretty darn close to
a normal distribution. Kind of what we’d expect. The fraction of votes
peaks at 46%. Again what we’d expect. But every once in a while, it gets
all the way out here to 50 and looks like Abe might
actually win an election. Highly unlikely in our
modern society. And over here, he would
lose a lot of them. If we look here, we’ll
see that the standard deviation is 1.6. So it turns out that the
standard error, which you’ll remember we computed using
that formula to be 1.58– you may not remember it because
I said it and didn’t write it down– is pretty darn close to 1.6. So remember the standard error
is an attempt to just use a formula to estimate
what the standard deviation is going to be. And in fact, we use this
formula, very simple formula, to guess what it would be. We then ran a simulation and
actually measured the standard deviation, no longer a guess. And it came out to be 1.6. And I hope that most of you
would agree that that was a pretty good guess. And so, because, if you
will, the differences are normally distributed, because the
distribution is normal, it turns out the standard
error is a very good approximation to the actual
standard deviation. And that’s what pollsters rely
on and why polls are actually pretty good. So now the next time you read a
poll, you’ll understand the math behind it. In a subsequent lecture, we’ll
talk about some ways they go wrong that have nothing to do
with getting the math wrong. Now, of course, finding a nice
tractable mathematical model, the normal distribution, is of
no use if it provides an inaccurate model of the data
that you care about. Fortunately, many random
variables have an approximately normal
distribution. So if for example, I were doing
a real lecture and I had 100 students in this room, and
I were to look at the heights of the students, we would find
that they are roughly normally distributed. Any time you take a population
of people and you look at it, it’s quite striking that you
do end up getting a normal distribution of the heights. You get a normal distribution
of the weights. Same thing will be true if you
look at plants, all sorts of things like that. I don’t know why this is true. It just is. What I do know is that– and probably this is
more important– many experimental setups, and
this is what we’re going to be talking about going forward,
have normally distributed measurement errors. This assumption was used first
in the early 1800s by the German mathematician and
physicist Carl Gauss. You’ve probably heard of Gauss,
who assumed a normal distribution of measurement
errors in his analysis of astronomical data. So he was measuring various
things in the heavens. He knew his measurements of
where something was were not 100% accurate. And he said, well, I’ll bet it’s
equally likely it’s to the left of where I think it
is or the right as where I think it is. And I’ll bet the further I get
from its true value, the less likely I am to guess
that’s where it is. And he invented at that time
what we now call the normal distribution. Physicists insist today still
on calling it a Gaussian distribution. And it turned out to be a very
accurate model of the measurement errors
he would make. If you guys are in a chemistry
lab, or a physics lab, or a bio lab, mechanical engineering
lab, any lab where you’re measuring things, it’s
pretty likely that the mistakes you will make will
be normally distributed. And it’s not just because you
were sloppy in the lab. Actually, if you were sloppy
in the lab, they may not be normally distributed. If you’re not sloppy in the
lab, they’ll be normally distributed. It’s true of almost
all measurements. And in fact, most of science
assumes normal distributions of measurement errors in
reaching conclusions about the validity of their data. And we’ll see some examples
of that as we go forward. Thanks a lot. See you next time.
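
The quantities discussed in this lecture can be sketched in a few lines of Python. This is a minimal illustration, not the code shown on screen: the function names (`std_dev`, `coeff_of_variation`, `standard_error`, `poll`, `simulate_polls`) and the fixed random seed are assumptions made here. The standard deviation divides by the number of trials, as in the formula described in the lecture, and the standard error uses the lecture's formula with p given as a percentage.

```python
import math
import random


def std_dev(values):
    """Square root of the mean squared distance from the mean (divides by len)."""
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    return math.sqrt(total / len(values))


def coeff_of_variation(values):
    """Standard deviation divided by the mean; not meaningful when the mean is near 0."""
    mean = sum(values) / len(values)
    return std_dev(values) / mean


def standard_error(p, n):
    """Standard error as stated in the lecture: p is a percentage, n the sample size."""
    return math.sqrt(p * (100 - p) / n)


def poll(n, p, rng):
    """Pretend to ask n voters; each one says yes with probability p/100."""
    votes = 0
    for _ in range(n):
        if rng.random() < p / 100:
            votes += 1
    return votes


def simulate_polls(n, p, num_trials, seed=0):
    """Run num_trials pretend polls and return the percentage of yes votes in each."""
    rng = random.Random(seed)  # seeded (an assumption) so the sketch is repeatable
    return [100 * poll(n, p, rng) / n for _ in range(num_trials)]


if __name__ == "__main__":
    results = simulate_polls(n=1000, p=46, num_trials=1000)
    print("standard error from the formula:", round(standard_error(46, 1000), 2))
    print("standard deviation of simulated polls:", round(std_dev(results), 2))
    print("coefficient of variation:", round(coeff_of_variation(results), 3))
```

For n = 1,000 and p = 46 the formula gives roughly 1.58, the figure quoted in the lecture, and the measured standard deviation of the simulated polls should land close to it. The exact simulated value depends on the random seed.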

• Tawinan Cheiwchanchamnangij says:

Thank you again for lecturing in the empty room. Very great effort!

• Tamás Sarkadi says:

😀 Very entertaining lecture! Guttag is just fun to watch and easy to understand. I hope he and MIT is aware of how much we non-MITers appreciate these free lectures. Great job!

• Luthress says:

I don't mind waiting for the simulations to run, and I like your drawings (i.e. of the bell curve). Thanks for all the great OCW!

• Pan Fayang says:

T.T thank you MIT!

• Orbital808 says:

Thank you MIT!

• Kata Lune says:

thank you for your lectures =)

• Wen Jie Lee says:

Thank you MIT for putting in the effort to re-record your videos. Very much appreciated.

• Michael Boratko says:

Impressed and appreciative of the time taken by all involved to re-record the lectures which had technical difficulties! Thank you!

• kheffah says:

I love MIT OCW. They made me first love science in 2008 (before joining medical school) and they are helping me get accustomed to courses that are disparate from my major now that I'm joining an inter-disciplinary PhD program in biomedical informatics. Thank you!!! Dr Guttag is awesome!

• S. Sawhney says:

Thanks Prof Guttag. Great lectures.

• Somerandomdude4.2526 says:

can someone explain why standard error only works for a small sample size

• eathenbad says:

is it just me… or did he actually say 'it's a lot swallow… quickly…'. yes he did… @44:00

• Maeda Toshiie says:

For finding the sample deviation, shouldn't tot be divided by (len(x)-1) or N-1 instead of N which is in the code presented at 08:35, to get an unbiased estimate?

• Alex says:

Where did that formula for the standard error, 43:20, come from? I can't find it elsewhere

• Yuri Aps says:

Thank Prof. Guttag !!!! You are THE MAN !!!!! Thankz a lot !!!! Here from Brazil