14
Nov

Spreadsheets – Coding – 2.1


As we begin our discussion of Coding and Data
Science, I actually want to begin with something that’s not coding. I want to talk about applications or programs
that are already created that allow you to manipulate data. And we are going to begin with the most basic
of these, spreadsheets. We’re going to do the rows and columns and
cells of Excel. And the reason for this is you need spreadsheets. Now, you may be saying to yourself, “no
no no not me, because you know what I’m fancy, I’m working in my big set of servers, I’ve
got fancy things going on.” But, you know what, you too fancy people,
you need spreadsheets as well. There’s a few reasons for this. Most importantly, spreadsheets can be the
right tool for data science in a lot of circumstances; there are a few reasons for that. Number one, spreadsheets, they’re everywhere,
they’re ubiquitous, they’re installed on a billion machines around the world and everybody
uses them. They probably have more data sets in spreadsheets
than anything else, and so it’s a very common format. Importantly, it’s probably your client’s format;
a lot of your clients are going to be using spreadsheets for their own data. I’ve worked with billion dollar companies
that keep all of their data in spreadsheets. So, when you’re working with them, you need
to know how to manipulate that and how to work with it. Also, regardless of what you’re doing, spreadsheets
are specifically csv – comma separated value files – are sort of the linguafranc or the
universal interchange format for data transfer, to allow you to take it from one program to
another. And then, truthfully, in a lot of situations
they’re really easy to use. And if you want a second opinion on this,
let’s take a look at this ranking. There’s a survey of data mining experts, it’s
the kdnuggets data mining poll, and these are the tools they most use in their own work. And look at this: lowly Excel is fifth on
the list, and in fact, what’s interesting about it is it’s above Hadoop and Spark, two
of the major big data fancy tools. And so, Excel really does have place of pride
in a toolkit for data analyst. Now, since we’re going to sort of the low
tech end of things, let’s talk about some of the things you can do with a spreadsheet. Number one, they are really good for data
browsing. You really get to see all of the data in front
of you, which isn’t true if you are doing something like R or Python. They’re really good for sorting data, sort
by this column then this column then this column. They’re really good for rearranging columns
and cells and moving things around. They’re good for finding and replacing and
seeing what happens so you know that it worked right. Some more uses they’re really good for formatting,
especially conditional formatting. They’re good for transposing data, switching
the rows and the columns, they make that really easy. They’re good for tracking changes. Now it’s true if you’re a big fancy data scientist
you’re probably using GitHub, but for everybody else in the world spreadsheets and the tracking
changes is a wonderful way to do it. You can make pivot tables, that allows you
to explore the data in a very hands-on way, in a very intuitive way. And they’re also really good for arranging
the output for consumption. Now, when you’re working with spreadsheets,
however, there’s one thing you need to be aware of: they are really flexible, but that
flexibility can be a problem in that when you are working in data science, you specifically
want to be concerned about something called Tidy Data. That’s a term I borrowed from Hadley Wickham,
a very well-known developer in the R world. Tidy Data is for transferring data and making
it work well. There’s a few rules here that undo some
of the flexibility inherent in spreadsheets. Number one, what you want to do is have a
column be equivalent to the same thing as a variable; columns, variables, they are the
same thing. And then, rows are equal – exactly the same
thing as cases. That you have one sheet per file, and that
you have one level of measurement, say, individual, then organization, then state per file. Again, this is undoing some of the flexibility
that’s inherent in spreadsheets, but it makes it really easy to move the data from one program
to another. Let me show you how all this works. You can try this in Excel. If you have downloaded the files for this
course, we simply want to open up this spreadsheet. Let me go to Excel and show you how it works. So, when you open up this spreadsheet, what
you get is totally fictional data here that I made up, but it is showing sales over time
of several products at two locations, like if you’re selling stuff at a baseball field. And this is the way spreadsheets often appear;
we’ve got blank rows and columns, we’ve got stuff arranged in a way that makes it
easy for the person to process it. And we have got totals here, with formulas
putting them all together. And that’s fine, that works well for the person
who made it. And then, that’s for one month and then we
have another month right here and then we have another month right here and then we
combine them all for first quarter of 2014. We have got some headers here, we’ve got
some conditional formatting and changes and if we come to the bottom, we have got a very
busy line graphic that eventually loads; it’s not a good graphic, by the way. But, similar to what you will often find. So, this is the stuff that, while it may be
useful for the client’s own personal use, you can’t feed this into R or Python, it
will just choke and it won’t know what to do with it. And so, you need to go through a process of
tidying up the data. And what this involves is undoing some of
the stuff. So, for instance, here’s data that is almost
tidy. Here we have a single column for date, a single
column for the day, a column for the site, so we have two locations A and B, and then
we have six columns for the six different things that are sold and how many were sold
on each day. Now, in certain situations, you would want
the data laid out exactly like this if you are doing, for instance, a time series, you
will do something vaguely similar to this. But, for true tidy stuff, we are going to
collapse it even further. Let me come here to the tidy data. And now what I have done is, I have created
a new column that says what is the item being sold. And so, by the way, what this means is that
we have got a really long data set now, it has got over a thousand rows. Come back up to the top here. But, what that shows you is that now it’s
in a format that’s really easy to import from one program to another, that makes it tidy
and you can re-manipulate it however you want once you get to each of those. So, let’s sum up our little presentation here,
in a few lines. Number one, no matter who you are, no matter
what you are doing in data science you need spreadsheets. And the reason for that is that spreadsheets
are often the right tool for data science. Keep one thing in mind though, that is as
you are moving back and forth from one language to another, tidy data or well-formatted data
is going to be important for exporting data into your analytical programmer language of
choice.

Tags: , , , , , , , ,

There are no comments yet

Why not be the first

Leave a Reply

Your email address will not be published. Required fields are marked *