Statistical Aspects of Data Mining (Stats 202) Day 1


Uploaded by GoogleTechTalks on 08.10.2007

Transcript:
>> MEASE: So, I'll try and give you guys some background about what's going on here if you
haven't figured it out from the e-mail. So, this is a class called, "Statistical Aspects
of Data Mining." I'm teaching it at Stanford this summer. Today was the first day at Stanford
just like today is the first day here. Basically, I just had the idea that I'm teaching the
course at Stanford anyway, it seems like something that might be of interest to people here at
Google. So, I work here at Google and I just teach the class at Stanford. So, I thought,
well, since I'm going to come to work every day after I teach the class, why don't I just
go ahead and teach it here? And so people that are interested can sort of sit here at
Google and take the class here. Because of that, sort of the slides and everything you
see is basically the exact same as the slides that I'm presenting when I'm at Stanford.
So, some of the things like, for example, number four up here, the fourth thing we did
at Stanford today was I took pictures of the students so I could have pictures of them.
Obviously, I'm not going to take pictures of you because your pictures are on MoMA,
et cetera. So there's going to be certain things that obviously don't apply to you
and you just sort of have to be a little bit patient with that. But for the most part,
I'm going to follow exactly the script from what I do at Stanford. And to that end, the
outline for today is basically to go over the information on the course webpage. Run
through chapter one of the textbook--I'll talk about the textbook--and then talk about
the software that we're going to be using for this class and how you can go about getting
it. So, if there--unless there's any sort of pressing questions before we begin, I'm
going to start going through these slides. Question in the back?
>> [INDISTINCT] hear you back here.
>> MEASE: Okay. So, is that better?
>> Yes.
>> MEASE: Okay. I'll try and stay close to
the microphone and I'll just try and speak loudly when I walk over to the board. If you
can't hear me, let me know.
>> You have a lapel mic. Maybe...
>> MEASE: Yes, but they said this is just for recording, for the video conferencing.
So maybe we can figure out how to get that to work over the speaker. Yes?
>> Textbook?
>> MEASE: Textbook, yes, we'll talk about
that right off the bat. Okay. So, textbook, okay. This is available on Amazon for about
80 bucks. It's called--well, I'll show you on the webpage. It's called "Introduction
to Data Mining." Authors are Tan, Steinbach, Kumar. We're trying to order some copies for
you guys but there's no way we're going to have enough. So, if you come next time, I
may have some to give you via some lottery system. But you can go ahead and order it
online if you haven't already. It's a good book to get, or, if you want, you can just sort of, you
know, hope your officemate will order one and, you know, you guys can share. And a good--it's
a good book to get and we are going to be following it quite closely and I'll talk more
about it in a second.
>> Is it first edition or second edition?
>> MEASE: There's only one edition and I've been told there may be a paperback available
if you're trying to save money. But there's only one edition as far as I know.
>> What are the authors' names?
>> MEASE: Pardon me?
>> Can you repeat the authors' names?
>> MEASE: Yes, the authors are Tan, Steinbach,
Kumar. And I'll show you where that's up on the webpage. Okay, so the first thing to talk
about is the webpage. It's www.stats202.com. If you ever forget this, if you remember my last
name and look up my last name on Google, you'll find my homepage and then there's a link from
my homepage. But basically, stats202.com will have all the information. Now, again, this
is the information for the students at Stanford but it will be relevant to you. And so, if
you just go see www.stats202.com and so, this is what the webpage looks like. Now, some
things are not relevant to you. You don't care about the current grades, right? Those
are the grades for students at Stanford. Homework and exam solutions you might care about, right?
Because I'm going to give homework assignments, if you want to play along and do them, you
know, the solutions then will be up there. Obviously, we're not going to grade them but
there might be, you know, something extra you can do. And then the homework assignments,
like I said, they'll be linked from here, if you want to do them, it's up to you and
I will be posting solutions for the students at Stanford. Lecture 1 is linked here; these
are just the PowerPoint slides I have, so those will be up there. And then probably
the most important thing on the stats202.com webpage is the course information. If you
click there, you go to this pink page and it has--so, you know, don't use this e-mail,
use my Google e-mail, of course. Also, you know the e-mail. Let me--in case any of you
don't know it, let me write it down. This is datamining, one word, no underscore, 07@google.com.
That's the e-mail I set up. You should have got an e-mail to that already if you signed--I
signed up basically everyone on the Trix spreadsheet, so you should have got an e-mail to that already.
If you're not on it, it's public; you can go to mailman and add yourself to that. Phone, that's
my cell phone, it's in [INDISTINCT] office hours don't pertain to you. TA, don't bother
the TA, webpage, stats202.com, okay. Okay. Yeah, he would get really confused. Okay.
So, the textbook we were talking about, there are the authors' names right there, Tan, Steinbach,
Kumar, "Introduction to Data Mining." Like I said, I think it retails somewhere between
$60 and $80. So, go ahead and get yourself a copy of that or find someone else who's
going to have a copy of it and agree to share it. So we are going to try and get some. But
it will have to be a lottery because there's no way we have enough for everyone that's
going to be in the room. Course description. Okay. So this is the Stanford generic catalog
description, so "Data mining is used to discover patterns and relationships in data. Emphasis
is on large, complex data sets such as those in very large databases or through web mining."
Topics are going to be decision trees. We will talk about neural networks. We'll talk
about association rules which if you're coming from a stats background like I am, that's
something new. We will talk about it. Clustering, you've seen before, no doubt. Case-based methods
and data visualization. And then we're going to basically follow the textbook pretty closely.
So, first chapter is introduction, just sort of a soft introduction, I'm going to go over
that today. Second chapter is on data, basically types of data, importing data, caveats about
data. We'll talk about that for about two lectures. Chapter three is exploring data
and for those of you who know me, I love to make plots of data and so I think that's very
important even though a lot of people think it's trivial. So, we'll spend, I think, at
least three lectures on chapter three talking about different ways of summarizing data through
graphs and tables. And then chapter six, association analysis, basic concepts and algorithms. I
have a break right there because that's when the students at Stanford are going to be taking
a midterm. If you want I can, you know, e-mail you guys the midterm if you want, you know,
to sort of quiz yourself. What it might mean practically for us is before chapter--between
chapter six and four, we might have a day where we don't--where we don't meet or we
might use it as a catch up if, for some reason, we don't get through everything because we
are only meeting for an hour whereas, at Stanford, they're meeting for an hour and 15 minutes.
Chapters four and five are both on classification. That's sort of one of my favorite areas so
we're going to spend, you know, good amount of time on chapters four and five. And then
finally we'll finish with chapter eight which is the cluster analysis. Evaluation, you don't
care about either. The late assignments, you don't care about; technology, you do care
about. So, basically we're going to be using R and Excel. Okay? So, if you have a PC that
sort of makes your life easy because Excel is probably installed on your PC and R is,
of course, a free download that's available for PC, it's available for Mac. And, you know,
there is an R user's e-mail list; maybe I'll send that around to you with a link for
how to install R depending on what your Linux platform is. I don't really keep up
with it because I tend to use R more on my Windows machine. But I know that we have installations
for Linux and I just--I haven't really kept up with it. So, maybe I'll try and send around
a pointer to you guys for that. I'll run through today briefly how to install R on Windows,
and then maybe from there, you can sort of extrapolate and figure out how to install
it on Linux. But mainly, we're going to be using R with a little bit of Excel, and
Excel, for those of you who aren't familiar, is just a real simple spreadsheet application
that works for small data. Academic honesty, you don't care about. So that's
all the--that's all the information on the webpage. So, go ahead and use, you know, stats202.com
as your reference for things in this class. Just remember that the webpage is designed
for the students of Stanford so the obvious things, you know, don't pertain to you. And,
you know, for example, right, don't e-mail me at stanford.edu, e-mail me at google.com.
I think that's all I wanted to say about the webpage. Are there any questions about anything
I said so far about the webpage? Yes.
>> This is an undergraduate class?
>> MEASE: It's a master's level class but it's an intro class and there's upper--there's
a higher level class. There's a 300 class for those of you who are familiar with the
Stanford curriculum. So, it sort of necessarily keeps this at an intro level, which is--which
I think is good for us because a lot of us are sort of, you know, this is our first time
seeing some of this stuff. If this isn't your first time seeing some of this stuff,
that might, you know, you might think, "Well, this might be too basic for me." So sort of
pick and choose when you come or what lectures you watch. The lectures are being videotaped;
they are available on Fish. So, those of you who don't want to sit here, would rather
just sit on their PC and watch it there, sit on your machine and watch it there, then you
know they are going to be up on Fish. Any other questions about anything I said so far?
Okay. So moving on, so the textbook, again, we'll start with chapter one, it's just a
real soft introduction to what we're going to be doing in this class. Well, this is sort
of interesting. I--when I said I was going to teach this class on data mining, the first
thing my officemate asked me, you know, he said, "Well, what is data mining?" I said,
"Well, I'll be able to tell you that by the time I'm done teaching the class." Well, hopefully,
you know, by the end of today, we'll be able to say something intelligent about what is
data mining. So, the definition in your textbook, it says, "The process of automatically discovering
useful information in large data repositories," and there's many other definitions. So, let
me just sort of dissect this a little bit and sort of, you know--I come from stats.
The question is how is data mining different from statistics? Well, I think the easiest
one, right, is the notion of large, right? The fact that the data set is large. So, one
way you could define data mining is, well, it's statistics with large data sets. Okay.
But there's more than that, right? There's this idea that I'm automatically discovering
useful information. You know, again, what does automatically mean? Well, you know, you're
not going to write a script that's going to do all the analysis for you and tell you,
"Hey, I looked at your data, and you know, you should be aware that, you know, there
is a problem with this variable," or "There's something strange going on here," right? It's
not going to be completely automatic, but it's sort of more automatic than statistics,
right? So in statistics, you might say, you know what? I really want to analyze these
two variables and see, you know, what the correlation is between them, blah, blah, blah.
In data mining you might say, look, I have a thousand predictors in this data set; I
want to look at all pairwise correlations and I want to get an automatic e-mail every
time one of those correlations goes above a certain value, for example. So, on some
level, it's more automated than stats but, of course, it's not like a magic thing
that does all the work for you. And then the final aspect I was just going to mention is
discovering useful information, right? I mean, obviously, there's a lot of data out there
and the goal of data mining is to see if there's anything useful there. And actually, one last
aspect to this definition I want to mention is the last part where it says, "The data
is in large data repositories," right? So, it doesn't just say large data sets, it says
large data repositories. So, you think of a repository as some place where data just
accumulates, right? You didn't necessarily collect it. It's just there, right? So, web
logs are an example, right? I mean, the data is just there; whether or not you're going
to get any use out of it, it's up to you. A whole bunch of other examples we'll
have are credit card transactions, supermarket data. The data isn't really being collected
for any specific reason, but it's sort of hard to not collect it. The data just sort
of accumulates naturally. So, the question is, given that all this data is there, can
we find any useful information in it? And that's quite different from statistics where,
in statistics, you often say, you know what? I'm going to go out and collect data specifically
to answer a specific question. Whereas in data mining, you're accumulating the data
and the question is, "Can I find anything useful in this--in this data?" Okay. So then
I say there are many other definitions. And on the next slide I say, so find a different
definition and see how it compares to the previous slide. So, this is sort of a fun
exercise to sort of look through and see what other people say is data mining. And the first
thing that you'll notice is that, you know, the authority, right, Wikipedia, it doesn't
give one definition; it gives two, which already suggests to you that there's some, you know,
non-uniform standard for what is data mining. So, the first definition is "Nontrivial extraction
of implicit, previously unknown, and potentially useful information from data," and the other
is "The science of extracting useful information from large data sets or databases." So, that
second definition which--I think that's a stat reference, right? Yeah, David Hand. So,
that's very similar to what we had. The idea is that you're looking to see if there's anything
useful. The data set is large and, you know, basically, it's the art or the science of
extracting that. The first definition isn't too different. It just says potentially useful
information from the data, a little bit of an admission that sometimes we're not going
to find anything useful. It does say here, the first one, nontrivial, and I'll talk about
that in a second. There's a lot of tasks that you could say, look, I'm extracting useful
information from data in an automated way, but it's sort of trivial, right? So data mining
deals with what we'll call nontrivial. And I'll give you some examples, in a second,
of what I would consider trivial and nontrivial, and your textbook talks about those. Other
definitions, so you can sort of see--I think there's a few I clicked on earlier. What is
data mining? It says here, "Generally, data mining, sometimes called data or knowledge
discovery, is the process of analyzing data from different perspectives and summarizing
it into useful information," sort of not that good of a definition. I think this one was
pretty close to what we were talking about. Let's see. There's a "What is data mining?"
somewhere down here. Yeah, "Data mining or knowledge discovery is the computer-assisted
process of digging through and analyzing enormous sets of data and then extracting the meaning
of the data." So you see, the digging through is sort of carrying on the mining analogy.
There are a couple more. Maybe I'll show you one more of these that I thought was pretty
good. What does this one say? Data mining is what? "Analytic process designed to explore
data, usually large amounts of data, typically business or market related, in search of consistent
patterns and/or systematic relationships between variables." I think this little parenthetical
statement, typically business or market-related is telling. I mean, we're looking at it from
a point of view of, you know, we're--most of us are computer scientists and so, we're
looking at it from more of a science point of view. But it's really things in industry
and the market that have driven data mining. That's really where the phrase comes from
and it's one of these, you know, catchy, trendy words, like, "Oh, that company,
you know, my competitor is doing data mining and I'm not, so they're going to beat me."
So that's, you know, that's where a lot of it comes from. You know, if you're cynical,
it really is just, "Well, it's statistical techniques or it's machine-learning techniques
or it's, you know, these techniques with a new word put on them." But, you know, it's
sort of--that's how things in business and market get popularity; someone attaches a
word to them. And so this is basically the word that's been attached to--again, as your
textbook says--process of automatically discovering useful information in large data repositories.
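
The earlier example of checking the correlation between every pair of predictors and raising an alert when one gets large gives a feel for what "automatically" can mean in that definition. What follows is only a minimal sketch and is not from the lecture: the column names, the toy numbers, and the 0.9 threshold are assumptions, and printing a line stands in for the automatic e-mail.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return sxy / sqrt(sxx * syy)


def flag_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical toy data: three predictors instead of a thousand.
predictors = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],  # exactly 2 * x1
    "x3": [5, 3, 4, 1, 2],
}

alerts = flag_correlated_pairs(predictors, threshold=0.9)
for name_a, name_b, r in alerts:
    print(f"ALERT: {name_a} and {name_b} have correlation {r}")
```

With a thousand predictors the same loop would visit roughly half a million pairs, which is the sense in which the process is more automated than picking two variables by hand.
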
Now, I'll say this on the side. So, you know, like I said, I come from statistics and my
officemate has the same background as me, and he said, "You know, all you really have
here is two ingredients, you know, to make a disaster, right? You have a lot of data
and you don't know what you're looking for." He said, "You're only--you're only going to
get yourself in trouble." Well, you know, you can and you cannot, so we'll talk about
some caveats that you have to be careful about. But generally, this is the feel of it, you
have a large data set and you're just looking to see if you can find some useful information
there. And what he warns about, getting yourself in trouble, is that you need to make sure it really
is useful and you're not just telling yourself some story that's completely artificial. So
I mentioned to you some data mining tasks aren't really data mining tasks, right? Sometimes
you think you're extracting useful information from a large data repository but it's not
really considered data mining, and that's because it's too trivial, right? So here on
the left side of the screen, you have some data mining tasks. On the right, you have
some non-examples, right? So, for example, looking up a phone number in a phone directory;
well, that's extracting useful information. If you want the phone number, it's useful
to you. The phone directory is a large data set, so you're extracting useful information
from a large data set, but it's not considered data mining; it's too trivial, right? An example
of something that would be data mining would be suppose you have the phonebook and you
start to look for relationships that you previously didn't know. So, for example, it says here
you see names like O'Brien, O'Rourke, O'Reilly occurring more in the Boston area. And you
say, "Oh, I didn't really, you know, know that but it makes sense to me now because
I sort of know, you know, how the different, you know, people settled in the United States
and I know there's a lot of people, you know, of certain descent in this area so it makes
sense to me." And if you say, "Well, that's--you know, I knew that already," imagine, you know,
being given the phone book from India or from, you know, Brazil or a country that you're
not familiar with and you don't know any of these names and you start to see how they
cluster in the different regions, you know, you're learning something about the data
without really knowing what you're looking for going in; you start to see this grouping.
So, that would be an example of data mining. On the right here, the second one, "Query
a web search engine for information about Amazon." Okay. I'm getting useful information
from a large data set, right? The web is a large data set. But again, that's not data mining;
something that would be data mining, here on the left: "Group together similar documents
returned by a search engine according to their context." For example, if you thought about,
you know, drawing a picture of all the web pages that come back for the query Amazon, and--sorry,
if you can't hear me--you start to see two groups, right? You start to see--here's a
group over here, and here's a group over here. Okay? And you say, you know, how are these
groups--how are these groups, right? Well, maybe, you know, users that query Amazon go
to these pages or they go to these pages, but there's very few users who query Amazon
that go to a page here and a page here, right? So these are connected and these are connected
but they're very split. So what have you learned? Well, you've learned that maybe Amazon has
two different dominant interpretations. So presumably one is the retail site and the
other one is the river. And you say, "Well, I knew that already. Hang on a second. I knew
that already," you know. But imagine doing it in a language that you didn't know already
or imagine having some automated process that would tell you whether a query has two dominant
interpretations or only has one main interpretation. What was your question?
>> Just when you write on the board, if you had a black marker, it would be easier.
>> MEASE: Yes, if someone can toss me one. I don't really have one.
>> There's one right under the podium.
>> MEASE: Where? See? I have to search for
it. Okay. Okay. Not that ornery. All right. So imagine those are black. Okay. So, that
is--that would be an example of data mining. And that's actually clustering, and we'll
talk about that specifically as an example of clustering. Okay. So why mine data? So there's
the scientific point of view. Now, I'm going to talk about the scientific point of view and
the commercial point of view. Both of these basically have this flavor like I'm collecting
the data anyway, so there might be some useful information in it. And from a scientific point
of view, you're collecting lots of data. Examples would be a satellite that has sensors on it,
telescopes that look across the sky, microarrays. You know, gene expression data
was sort of trendy a few years back. Generally, you know, any simulation you do, you can generate
lots and lots of data. I don't need to tell, you know, you guys about collecting lots of
data. Traditional techniques are infeasible, and so data mining might be helpful in sort
of classifying and segmenting data or in forming hypotheses. So that's sort of the scientific
point of view. The commercial point of view, you know, again, the commercial point of view
is really sort of what's driving data mining on some level. Data there is being collected
from web data, from e-commerce, right, any time you use a search engine, I don't have
to tell you, any time you buy something from a site online, any time you go to a department
or a grocery store, any time you use your--a bank or a credit card. So the data is just
there. Computers are cheap, so you can't say, "Oh, we can't afford to store that much
data." No, storage is cheap. "Oh, we can't afford, you know, to analyze that." No, the,
you know, processing is cheap. And then your competitor is doing it, right? If you don't
want to do data mining, well, you can be certain that your competitor is--and if it's giving
them any edge, well, you know, you're going to get beat out eventually. Now, the one thing
I wanted to talk about on this slide which I always thought was interesting, the grocery
store example is sort of the very--it's like the classic defining example of data mining,
which is that when you go to the grocery store, they have a record of, you know, you bought
eggs and you also bought diapers or you bought milk and you also bought beer, you bought
chips and you also bought salsa. So they have a record of that. Now, the funny thing is--you
should think about is how do they have a record of that? Right? So if I go to the grocery
store today and I buy chips and I pay cash, right? Suppose I pay cash, I don't use my
credit card; I'm trying to sort of be off the grid, right? So I pay cash and I get the
chips. Then tomorrow, I go, "You know, I forgot the salsa," so tomorrow I go and I buy salsa
and I pay cash again. They don't want to know just that one person bought chips
yesterday and one person bought salsa today. They want to know that it's me. They don't
just want to know what I bought. They want to know who I am. So, the question I'll ask
you is how do they know who I am if I don't pay with my credit card?
>> [INDISTINCT]
>> MEASE: Yes. You have the little, you know,
your Safeway, save six cents on gas, right? You have your little Safeway cards. So, you
can think about it, you know, that sure, they don't get the data for free but all they have
to do is give you a little card and let you save three cents on every purchase or whatever
it is, and now they get all the data in the world, right? And, you know, you can opt in
or opt out. You don't have to use the card. If you really want to--you know, don't want
people spying on you, you can just not use the card. But it's not hard for them to get
the data. And once they do something like that, they have the data. So that type of
supermarket data where they know each customer, at least their ID, and what they bought is
sort of one of the classic examples of data mining data sets, you know, where they use
that data to discover relationships between, you know, people who buy this product usually
buy this product. Now, what does that mean for the grocery store? Well, you know, use
your imagination. If they often buy these two products at the same time, maybe they
should put them in the same aisle. Better yet, maybe they can say, "Look, let's close
down the whole supermarket and just sell these two products because we know, you know, if
we just stock those, we can make this much money," things like that. So, anyway, they
have--they have that data because they give you a little discount card. And they give
you a discount card for other reasons too. So, this is sort of a fun exercise. I knew I was
going to give this one today so I started thinking about it as soon as I woke up. So
I'll give you four examples. It says here, give an example of something you did yesterday
or today which resulted in data which could potentially be mined to discover useful information.
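
The supermarket discount card just discussed is one source of exactly this kind of data. As a rough illustration, not from the lecture, the sketch below shows why the card ID matters: purchases made on different days get linked to the same customer, so chips on one day and salsa on the next still count as a pair. The card IDs, days, and items are made up, and `items_by_customer` and `pair_counts` are hypothetical helper names.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical purchase log: (card_id, day, item). Paying cash does not matter
# here, because the discount card still identifies the customer.
purchases = [
    ("card_17", "Mon", "chips"),
    ("card_17", "Tue", "salsa"),
    ("card_42", "Mon", "milk"),
    ("card_42", "Mon", "beer"),
    ("card_99", "Tue", "chips"),
    ("card_99", "Wed", "salsa"),
]


def items_by_customer(rows):
    """Collect every item each card has bought, across all days."""
    baskets = defaultdict(set)
    for card_id, _day, item in rows:
        baskets[card_id].add(item)
    return baskets


def pair_counts(baskets):
    """Count how many customers bought each pair of items."""
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(items_by_customer(purchases))
for pair, n in counts.most_common():
    print(pair, n)
```

Grouping by transaction instead of by card would miss the chips and salsa pair entirely in this toy log, since no single trip contains both.
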
Okay, so in black, I will write here the four things that I thought of and see if I can
get you guys to give me some others that I haven't thought of. So, I just literally went
from the time I woke up--actually, I went from the time I woke up to the time that I started
teaching and came up with some examples. So, the first thing when I wake up, the door on
my apartment doesn't have a key lock. It has this card, right? Little light card and it
goes beep when you--when you open it. So you think, "Well, they're not going to keep that
data, right?" What do they want--why would they want to know that data? Why would they
want to spy on you that much? Well, actually when I moved in, they told me. They said,
"Don't use your card to try and open someone else's apartment door because we'll let--we'll
come after you, you know. We'll get mad at you." And I thought right away, I thought,
"That's kind of weird, right? I mean, what if I just take the elevator to the wrong floor
and, you know, I'm half awake, right?" But, you know, presumably, they're keeping that
data around or at least they have some sort of alert system. So, you wouldn't think it,
but--you know, I'll call that my apartment door. You know, presumably, that data is sitting
around somewhere where they know what tenants tried to open what doors at what time. And
if there's any useful information there, you know, they can use it. What would you use
that information for? I don't know. Maybe they want to hire a security guard and they
want to know what sort of traffic, whether people are coming and going. Maybe they do
really want to spy on you. I mean, they can use the data for whatever they want, right?
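
To make the door-card example concrete, here is a small sketch of the two uses mentioned: spotting cards tried on someone else's door, and counting traffic by hour. It is not from the lecture. The log layout, the tenant and door names, and the helper functions are all assumptions about what such a system might record.

```python
from collections import Counter

# Hypothetical swipe log: which card was used on which door, and at what hour.
swipes = [
    {"card": "tenant_A", "door": "apt_A", "hour": 7},
    {"card": "tenant_B", "door": "apt_B", "hour": 7},
    {"card": "tenant_A", "door": "apt_C", "hour": 23},  # wrong floor, half awake
    {"card": "tenant_C", "door": "apt_C", "hour": 18},
    {"card": "tenant_B", "door": "apt_B", "hour": 18},
]

# Hypothetical mapping from each card to the door it is supposed to open.
card_home = {"tenant_A": "apt_A", "tenant_B": "apt_B", "tenant_C": "apt_C"}


def wrong_door_attempts(log, homes):
    """Return the swipes where a card was used on a door other than its own."""
    return [s for s in log if homes.get(s["card"]) != s["door"]]


def swipes_per_hour(log):
    """Count swipes by hour of day, e.g. to decide when a guard is needed."""
    return Counter(s["hour"] for s in log)


mismatches = wrong_door_attempts(swipes, card_home)
traffic = swipes_per_hour(swipes)
print(mismatches)
print(traffic.most_common())
```
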
And you're consenting to it because you're the one using the card to open and shut your
door. You're the one living in their apartment. Okay. So then after I open the door, what
do I do? Well, I go and I, you know, I hit the elevator button. Now, that one, I'm not
really sure if they're keeping that data around. But I kind of--you know, I wish they would
because maybe if they had some intelligent system, I wouldn't have to wait so long for
the elevator because, you know, what's it doing down there in the basement when everyone's
sleeping? They know that, you know, it should be sitting up at the top. Okay, after I go
in the elevator, then I--the parking garage has another thing but that's the same as the
apartment door. As soon as I get on Guadalupe Expressway, there's metering lights. And I
wish that they would use the traffic sensors, you know, to do something better about the
traffic, right? So presumably, they could know that they don't need to turn the metering
lights on 87 when 101 is moving so quickly. So, you know, they could mine that data too.
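
One way to picture mining the traffic-sensor data is to average the measured speeds by hour and only meter during the hours when the highway is actually slow. This is a sketch under stated assumptions, not a description of any real traffic system: the readings, the 45 mph cutoff, and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sensor readings on the downstream highway: (hour, speed in mph).
readings = [
    (7, 62.0), (7, 58.0),
    (8, 31.0), (8, 27.0),
    (9, 44.0), (9, 50.0),
]


def mean_speed_by_hour(rows):
    """Average the speed readings within each hour."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of speeds, count]
    for hour, mph in rows:
        totals[hour][0] += mph
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in totals.items()}


def hours_needing_metering(rows, slow_mph=45.0):
    """Hours whose mean speed falls below the (assumed) slow-traffic cutoff."""
    means = mean_speed_by_hour(rows)
    return sorted(hour for hour, mph in means.items() if mph < slow_mph)


means = mean_speed_by_hour(readings)
metered_hours = hours_needing_metering(readings)
print(means)
print(metered_hours)
```
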
They know who's driving on the highway--well, they don't know who's driving. They know
how many cars are driving on the highway at what time. They don't really know who you
are, although if you had, like, the FasTrak going over the bridge, they would know who
you are, right, because it's your FasTrak. And then finally, when I get to Stanford,
they don't give me, like, a nice parking pass, so I have to put money in the--in the
pay parking machine. And how do they know who I am? I use my credit card and I use the
same credit card every time. So, these are--you know, none of these are related to Internet
applications. I'm trying to, you know, be a little bit creative. These are all cases where I'm
producing data that someone could be using to do data mining, you know, and they're not
trying to spy on me; it's just I'm giving them the data. It's freely available for them
to use for whatever purpose they want. So, this is what? In-class exercise number two,
I call it. So I gave you four. How about you guys give me four? Yeah, Charles?
>> [INDISTINCT] stuff down Micro Kitchen.
>> MEASE: The Micro Kitchen. They run out
of data, right? So they have to restock the Micro Kitchen. They run out, so they know
what we're eating and what building we live in, right? Okay. So, that's--yeah. So presumably,
someone is looking at this data in the Micro Kitchen. Okay. One more. In green. Sorry.
>> Yeah, the government's tracking wherever you go through your cell phone.
>> MEASE: Cell phone, right? Not only--not only--right? Yes. So you can turn that off,
right? But not only do they know who you are, who you called, what time you called, they
also now know where you are because they have that little, you know, GPS location sensor
in there. And, you know, I don't know. If you're a paranoid person, this isn't a good
exercise for you. But, you know, the data is there. You know, they could ignore it if
they want, but it's there and they might find information in it. Okay. Another one in the
front.
>> Google badge [INDISTINCT]
>> MEASE: My badge, right? So someone asked me this one time. So let me--let me just say
this is my badge, right, which is--oops, B-A-D-G-E, which is similar to--similar to my apartment
door but this is, you know, an employer, right, who might have a little more interest in who
I am and where I'm going at what time. And I have--when people ask me--I don't know
if they ask you this, when you tell them you work at Google, they say, "Well, what time
do you start work?" And you say, "Well, it depends what time I wake up," and things like
that. And they say, "Well, what time does your boss tell you you have to be there?" Well,
you know, whatever. And then they say, "But certainly, you know, they know when you scan
your badge in and they keep track of that," and I'll go, "I guess they could," right?
But, you know, knowing Google, they're likely to use that data but not, you know, to spy
on us; more so to sort of just keep statistics and just know when they should stock the Micro
Kitchens, right, or know when they should serve breakfast. Okay. So let's get one more,
one more. Yeah. >> That's [INDISTINCT] probably know, we know...
>> MEASE: Yes. >> ...where you are and all [INDISTINCT]
>> MEASE: Yes. >> [INDISTINCT]
>> MEASE: Yes. So... >> What data you are transferring.
>> MEASE: Laptop. Yes, one time--well, that doesn't say P. Laptop. One time I was using
a computer somewhere in an office at the university, and the guy called me and he said, "Why are
you VPN? Why are you using a VPN connection?" And I thought, "Who are you to ask me why
am I..." but he was the administrator of the network, right? So, yes, any time you use
a computer, people are getting lots of information about you. And you know we have web logs or
Google--Stats202.com, we have web logs from that. So one of the things we're going to
be doing is playing with the logs for that and it'll be cute because we can see certain
spikes when certain events happen and I can see what webpage you go to. I don't know
who you are but I know your IP address. So anyway, you know, you can think of loads of
examples here of different cases where you're producing data that, if someone wants to,
they can mine, they can use it to get information and help them make different decisions.
Okay. So where does data mining come from? So, you know, this--you can tell this book
is sort of a statistical book because you see statistics and then you see everything
else. What's everything else? Well, you have artificial intelligence, you have machine
learning, you have pattern recognition, and some people sort of put databases and things
like that and information retrieval there too. But, you know, it is sort of like we're
teaching--or I'm teaching this course from a statistics point of view, but it's not just
statistics, of course. It's borrowing ideas from artificial intelligence, machine learning,
pattern recognition and all those. Traditional techniques, when we say traditional
techniques in the second bullet, it's that traditional statistical techniques would be unsuitable.
Why? The data is large, not just large like a lot of observations, but large, it's high
dimensional and heterogeneous and distributed, right? So, there are sort of new challenges
for statistics. We coined this phrase "data mining" but we're borrowing information from
or we're borrowing ideas from all these other areas too. Okay. So, the book breaks down
into two types of data mining tasks, and this dichotomy is a little bit forced in some cases
but I'll walk through it. So, they differentiate between predictive methods and descriptive
methods. And let me sort of write the shorthand version of these. The one thing to remember
is I guess descriptive methods don't really have one right answer. You sort of know if
you found something useful because it's useful but you really never know exactly what you're
going after whereas predictive methods, you're going to look at your classification accuracy
or your precision and your recall, so those are straightforward. So predictive methods,
this is predictive. What do we have here? It says "Some variables to predict unknown
or future values of other variables," right? So basically we're trying to predict the future
in some sense, right? We're trying to use some inputs to predict the future or classification
of some output. Whereas, descriptive methods, descriptive, for that, we're just basically
trying to find patterns in the data. Okay. Find patterns. Okay. So, you know, the way
to remember this right is sort of this is the supervised learning and the unsupervised
learning, if you will, right? So the example I would--I would give you if you think about
the Amazon, right, with Amazon I found a pattern, right? There were two distinct types of pages
about Amazon. There is like the commercial Amazon and there was the river Amazon, so
I found the pattern. Okay. Suppose, alternatively, that I already--so that would be descriptive.
I described the pattern, I described that there were two groups of the Amazon webpages.
Predictive would be more like, I know that there's two types of Amazon webpages and I
know there's like the--one's about commercial site and I know there's one about the river.
Okay. I know that there's two groups, but can I predict given a new one, given a new
webpage, right, can I have a computer algorithm that will predict which one of these two classes
it falls in? And the way I am going to measure success there is how accurate am I going to
be able to predict this. Right? What is my misclassification rate? Am I going to get
90% of them correct, 95% of them correct, et cetera? Now, it's really easy for a human
to read the webpage and say, "Oh, this is about, you know, Amazon the retailer or this
is about Amazon the rainforest." But, you know, can a computer use the human labeled
observations to get a pretty accurate rule? That's predictive data mining, whereas, again,
I told you descriptive data mining was just finding the fact that there's two groups in
the first place. So the topics that we're going to cover fall into these two categories
as follows. So, the book talks about classification and regression as both being predictive. So
let me make this--so we'll put here classification and regression. Both of these as being predictive.
Now, classification, we're going to cover in chapters four and five. Regression, we're
not going to cover in this course. However, if you take a regress--oh, sorry, if you take
a stats course, they're going to cover regression. Really, the main difference between these
is classification, you're trying to say, you know, I said I'm 90% accurate, right? I have
two classes; it's either Amazon the rainforest or Amazon the retailer, right? Or you could have
three classes or four classes or any number of classes and you're trying to see how accurate
you are. Regression is analogous to that but instead of trying to predict the class, you're
generally trying to predict a continuous attribute, right? So let me give you an example, right?
So, just to change from a web application. This is a book, this is an eraser. You can
tell these apart, right? Suppose you send like sonar signals, right, and bounce the sonar
signals off the book and the sonar signals off the eraser. Well, they're going to look
different, right? And so classification would be to use those sonar signals and some labeled
instances--some labeled instances of the book, some labeled instances of the eraser, and
predict for new cases whether it's a book or an eraser, right? That's classification.
I'm trying to predict is it the book class or is it the eraser class, right? Just like,
is it the Amazon rainforest class or the Amazon web--commercial class? Regression would be
more like, can I use the sonar to predict the size of the book? Right? So you don't
just either get it right or wrong. If you say the book is 11 inches tall and it's really
10.5 inches tall, you're off by exactly .5. So, in classification, you're basically trying
to predict what class it is, whereas regression, you're trying to predict some continuous attribute.
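As a rough illustration of the distinction just described, the Python sketch below scores a classifier by how often it names the right class and a regression by how far off its numbers are. It is not from the lecture: the book/eraser labels and the heights are made-up values, and the function names are hypothetical. Misclassification rate, precision, recall, squared error and L1 (absolute) error follow their standard definitions.

```python
# Made-up example data: the true class of six objects and a classifier's guesses.
actual = ["book", "book", "eraser", "book", "eraser", "eraser"]
predicted = ["book", "eraser", "eraser", "book", "book", "eraser"]


def misclassification_rate(actual, predicted):
    """Fraction of objects whose predicted class differs from the true class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def precision_recall(actual, predicted, positive):
    """Precision and recall for one class treated as the 'positive' class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up regression data: true book heights in inches and predicted heights.
actual_heights = [10.5, 8.0, 9.0]
predicted_heights = [11.0, 8.0, 8.5]


def mean_squared_error(actual, predicted):
    """Average squared distance between the true and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute (L1) distance between the true and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


print(misclassification_rate(actual, predicted))
print(precision_recall(actual, predicted, "book"))
print(mean_squared_error(actual_heights, predicted_heights))
print(mean_absolute_error(actual_heights, predicted_heights))
```

On these made-up values the classifier gets 2 of the 6 objects wrong, which is a right-or-wrong count. The regression scores are distances instead: predicting 11 inches for a 10.5-inch book contributes an error of 0.5.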
And so you would measure your performance a little bit differently. Classification,
you might use recall, precision or misclassification rate. Regression, you might use like squared
error loss, L1 loss, some sort of loss function like that that measures how close you are
to the target. Where--again, we're not going to cover regression. Classification is more
common in data mining, but regression gets a lot of attention in classical stats courses
and also a lot of the classification techniques you can extend to regression; we're just not
going to get into them. Okay. Then for descriptive, the visualization we're going to cover in
chapter three, association analysis in six, clustering in eight and anomaly detection
is in chapter ten, although we're not going to get to it. So, let me just say a few words
about these. So, visualization is in chapter three. Visualization, as I said before, it's
one of the most important things you're going to do. If you think about writing a report
or doing some study, people are going to remember the picture, right? If you can't tell it with
one simple picture, you probably haven't really said anything interesting. And there's
sort of an art in making good pictures and making pictures that people can see clearly. And
so, we're going to spend a fair amount of time in chapter three just talking about different
ways to visualize data. And visualization can serve two purposes, right? One is to present
to someone. Okay. You know what you want to say and this is just a good way to present
it and the other is to learn something yourself. You don't know what you're looking for. If
you're just going to look at a bunch of pictures or if you only look at one picture that's
going to tell you what's going on so you can discover. Both of those are visualization
tasks. They're both descriptive, and we'll talk about those in chapter three. Association
analysis. Association analysis, this is something that doesn't really make it into mainstream
stats too much. We're going to talk about that in chapter six. This one is the classic
supermarket one, right? The people that bought--the people that bought diapers often bought beer,
right? The people that bought chips often bought salsa. This is a type of association
analysis and we're going to talk about that in chapter six, in particular, that market-basket
example I just talked about. Clustering, chapter eight. Clustering. Okay. The clustering example--the
canonical example there would be like the Amazon search engine versus the Amazon Rainforest,
right? You see two distinct groups emerge, you know something is going on and, of course,
that example is trivial, but suppose I give you a query in a language you don't know.
Can you tell me, you know, what sort of patterns there are in those web pages? Are there two
main interpretations? Is there one dominant interpretation and one slightly less common
interpretation? So you're just sort of looking for patterns in the data, and grouping is
one pattern, and that type of grouping is called clustering. News stories, right? Can
you--can you group together different news stories? These are about sports, these are
about politics. Can you see different groups emerging in the data even without having labels
on them? So, it's unsupervised. It's clustering. And then finally, anomaly detection, we're
not going to have time to get to but you might want to read about it in--only one L, right?
You might want to read about it in chapter 10. When we do chapter three, we'll do some
of it because when we make pictures of data, sometimes that's exactly what we're looking
for, things that are strange. Anomaly detection is probably, you know, as I say, association
analysis is one of the classic examples of data mining. Anomaly detection is one of
the ones that gets all the press because--shoot, I had a news story, I don't know if I can
find it. That, you know, you always see data mining in the news because they're using it
for purposes of, you know, credit card fraud detection and they're using it to find terrorists,
right? And both of these things are anomalies, right? So, what is credit card fraud, right?
Someone--you know, you have your credit card and all of a sudden, you spent a whole bunch
of money in a place that you've never been before. That's an anomaly. Their credit card
flags it. The sooner they can flag it, the more money they can save. So, that's anomaly
detection through credit cards. With respect to terrorists, what are they looking for?
Strange behavior, right? Something that's indicative of a terrorist. Now, you could
argue, well, in that case, maybe this should be up here because you're trying to see how
accurately you can, you know, spot the terrorist. But, you know, that's why I said this line
is a little bit blurry. But your textbook tends to classify anomaly detection as descriptive
because you don't really know exactly what you're looking for. Okay. That's all chapter
one notes I want to talk about. Now I'm going to talk a little bit about the software. But
let me stop and see if there's questions. Question?
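The market-basket idea from the association analysis discussion above can be sketched in a few lines of Python. This is only an illustration, not something shown in the lecture: the baskets are invented, the function names are hypothetical, and support and confidence are the usual textbook measures for a rule such as "diapers -> beer".

```python
# Invented shopping baskets, one set of items per customer visit.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"chips", "salsa"},
    {"diapers", "milk"},
    {"chips", "salsa", "beer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, baskets) / base


print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
print(confidence({"chips"}, {"salsa"}, baskets))
```

In these invented baskets, diapers and beer appear together in 2 of the 5 baskets, and 2 of the 3 baskets that contain diapers also contain beer.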
>> So, we end up giving yet more data to the data mining mill because every time we're planning
a trip, we have to inform every credit card company that we will be spending stuff abroad.
>> MEASE: Right. Yeah. So the question is, if you don't want your credit card to sort
of, you know, call you and cancel your card because they see something weird, some people
will call the credit card company ahead of time and tell them, "Look, I'm going to be traveling
overseas," but then, the point is that they can also use that data to, you know, feed
into the data mining framework. Yeah, some people will do that. Some people, every time
they're going to travel, they'll let their credit card company know ahead of time because
they don't want any problems. Actually, I have a friend who is a pilot who uses cash
only which is surprising in this day and age. But for that very reason, he doesn't want
to call the credit card company every time he goes somewhere. And, you know, he does
flights that look as though they're anomalies, right? So, anyway, are there any other questions
on anything I said so far before I talk about software? Question?
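The credit card example can be sketched as a very simple outlier check in Python. The rule used here (flag a charge that is more than three standard deviations away from the mean of past charges) is an assumption chosen for illustration, not the method card companies use and not one given in the lecture. The amounts and the name `is_anomalous` are made up.

```python
from statistics import mean, stdev

# Made-up history of one cardholder's past charges, in dollars.
past_charges = [23.5, 41.0, 18.2, 36.9, 29.4, 52.0, 31.7]


def is_anomalous(amount, history, threshold=3.0):
    """Flag a charge that lies far from the bulk of the past charges.

    Distance is measured in standard deviations of the history (a z-score).
    """
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # All past charges identical: anything different stands out.
        return amount != center
    return abs(amount - center) / spread > threshold


for charge in [38.0, 940.0]:
    print(charge, is_anomalous(charge, past_charges))
```

With this made-up history, a 38-dollar charge is not flagged and a 940-dollar charge is. A traveler whose spending really does jump around, like the pilot mentioned above, would trip a rule like this often, which is why a fixed threshold is only a starting point.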
>> So what is the difference with clustering, classification [INDISTINCT]
>> MEASE: Okay. So what--your question is what's the difference between clustering and
classification? Okay. So, they're very similar. Clustering is unsupervised. Classification
is supervised. So the thing is--let's see. Let's go with the web page example, right,
with the Amazon, Amazon, all right. In one case, I have all the label--all the two instances
labeled. This one is about the rainforest. This one is about the e-commerce company.
And I'm trying to predict for a new observation which one it is and I'm going to measure how
accurately I'm doing. That's classification, that's predictive. I want to see how well
I can predict a new observation into these two classes. Okay. Clustering, on the other
hand, is the act of actually discovering that there are two classes, because I didn't know
that ahead of time. I was just looking at a bunch of different queries, looking at how
things grouped together, and I saw two distinct groups emerge for Amazon. Now, with clustering,
you don't really know you're right. You know, are there really two groups? Well, in
this case, you do. But you're just sort of trying to discover a relationship. So, does--is
that good? Is there anyone else that can give a better definition than I just gave? Because
there are a lot of people in this room that are experts on this, machine learning, and they can tell
you supervised learning, unsupervised learning and--but that's sort of my take on it. It's
a little--it can be a little blurry especially after you do the clustering if you say, "Oh,
I really did learn something that was right." That, you know, it tends to have a little
bit of a classification feel, but that should help. Okay. Other questions? Yes.
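The answer above can be made concrete with a small Python sketch. It is an illustration only: the single feature (share of shopping-related words on a page), the values, and the function names are all made up, and the nearest-centroid classifier and two-group k-means used here are simple stand-ins rather than methods named in the lecture. Classification starts from labeled pages and predicts the class of a new one; clustering sees only the numbers and has to discover the two groups.

```python
# Made-up feature: share of shopping-related words on each page about "Amazon".
labeled_pages = [
    (0.82, "retailer"), (0.75, "retailer"), (0.91, "retailer"),
    (0.05, "rainforest"), (0.12, "rainforest"), (0.08, "rainforest"),
]


def mean(values):
    return sum(values) / len(values)


# Classification (supervised): the labels are known ahead of time.
def fit_centroids(labeled):
    """Average feature value per class, computed from labeled examples."""
    groups = {}
    for x, label in labeled:
        groups.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}


def classify(x, centroids):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Clustering (unsupervised): no labels, just look for two groups.
def two_means(values, iterations=10):
    """One-dimensional k-means with k=2, started from the min and max."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for x in values:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


centroids = fit_centroids(labeled_pages)
print(classify(0.7, centroids))   # a new page, scored against known classes

unlabeled = [x for x, _ in labeled_pages]
clusters = two_means(unlabeled)
print(clusters)                   # two groups found without any labels
```

The classifier can be scored against the known labels, so it has a misclassification rate. The clustering output is just two unnamed groups; deciding that one of them means "retailer" is a human interpretation added afterwards.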
>> This anomaly, it's actually the same as the [INDISTINCT]
>> MEASE: Yes, to a large degree. To a large degree. And there are some subtle differences
and you can--you can read about that. But, yeah, generally an anomaly is an outlier and
vice versa, generally speaking. Yeah. An outlier, you can sort of see the word outlier, something
that lies out from--away from the rest of the data, an outlier. So in some space,
an anomaly is an outlier, but the key might be to find that space. Other questions about
things that I have said so far? >> Is an anomaly like unsupervised learning
an outlier? An outlier pertaining any actual [INDISTINCT] cluster.
>> MEASE: Right. Right. So Charles was making a point about the relationship between outliers
and anomalies. I don't want to get too much into that distinction, but yeah, there's a
relationship and some subtle differences. Other questions on what I've said, anything I've
said so far? Okay. So, let me see. How are we doing here? So, okay, we're doing good
on time. So, like I said, Stanford, this is an hour and 15 minutes, but here, we're trying
to stay under an hour, obviously, so I go a little bit faster and skip a few things
that don't matter for you guys. Okay. So, what are we using in here? We're using Excel
and we're using R. Now Excel, if you have a PC, you're in good shape. If you don't--you
know, I don't know. Trix probably won't give you everything you need. You know, no offense,
you know, but it's just not the same product. Open Office might give you everything you need
but if you have a PC, you're in good shape. If you have a Mac, I'm sure Excel is installed
on there. If you don't have either of these, if you're sort of a--just a strict Linux user,
I don't know. I could--I can't really speak to Open Office but it might--it might get
you through most of it, but we are going to be using Excel, not primarily, right. Excel isn't
very powerful, it doesn't handle large data, it's very slow, it's--you know. But for some
purposes, it's good. And it's--it is good to sort of have a spreadsheet application
sort of that you're comfortable with because sometimes you can do things very quickly that,
you know, you don't really want to take the time to script up. So we're doing some things
in Excel, and so you should have access to that. But then primarily, we're going to be
using R, which is free. If you have Windows, I'll go through how to install it on a Windows
machine right now. The same installation instructions will generally hold for Mac. And for Linux,
like I said, I'm going to try and send you guys a link to something on the--something
I can get from the R users which talks about installation. But different people do different
installations depending on what they're doing and so, I haven't really kept up with it.
But let me take you through R and give you a little preview of that and show you how
to get it installed on your Windows machine. I have a Windows machine right here, obviously.
So, I'm going to--as I go through examples, I'll be doing it on the Windows machine. So,
you know, you might, if you have a Windows laptop, just install R on that and use that
for going through examples. Also, it's easy, you can bring it with you and you can sort
of play along as you're sitting here. Okay. So, how do I download R? So you go to this
web page, right? It's sort of a little bit tricky. For a while on Google if you just,
you know, queried R, you wouldn't get it. I think you can now, but let's just go to
CRAN. Let's see. Here we go. Okay. So, we're going to be good with--you know, I have a
Windows machine, so I just hit Windows 95 or later and then, Base. So, it's open source
and different people contribute different packages. If we need anything special, I'll
let you know. But for now, Base is going to do the trick. And then if I go down to this
one here, this is a self-extracting EXE file. It says I have to get it from a [INDISTINCT].
But it turns out if I click on this right now, the behavior is it's just going to give
me one and I can just save it. Then you double click, go through all the defaults; everything
as it is is going to be fine and it will get you R. And once you do all that, you can see
what it looks like. Here. Let me--let me just run through those screenshots again. So these
are up on the PowerPoint slides if you're--you sort of forget what I said. So, go to cran.r-project.org.
And for me, I would click on Windows 95 and later. Just click here on Base, and then 2.5.0.
I think I have a 2.4 version. The version shouldn't matter too much if it's, you know,
within the last year. You just save this to your machine and then it will just install
itself and all the defaults are pretty good. Once you do that, okay, you get something
that looks like this. So here I have actually 2.4.1 on my machine. And it's sort of command
line, right? It's not a spreadsheet app; it's sort of command line. Let me see if I
can make this a little bit bigger for you so you can see. Let's change the font size
from 10 to--let's try 20. That's too big probably, right? It looks like a cartoon. Okay. So,
you know, it's sort of command line, right? I can do 10 + 1 and figure out that that's
11. Okay. You have functions in here, right? So, let's see. Let's think about an easy function.
exp is exponential; e to the zero is one. Okay. So, your functions. You can look for help
on the functions. So, I want to like help on the exp function. If I type question mark
and the function, it brings up a window and tells me, you know, that this log computes
natural logarithms, log10 computes--okay. And it gives you some examples. So, the help
is pretty good. And you can look things up online because there's a lot of documentation.
So, it sort of has command line. You can write your own functions, you can, you know, sort
of use it as a little bit of a scripting language. But also, it's really good for plotting, right?
So, if you type like, okay, well, seq(1:10) is the integers 1 through 10. So, if I made
a plot of--well, here. Here, here. I'll show you. So, let's--let x be seq(1:10). And if
I plot, suppose, like, x, let's say, x+10, then I get a plot, right? And you get to change
sort of almost everything you can change on this plot. You can change the plotting symbol,
you can change the font. You can change the color. You can make it clickable, you can
label things. So, the plotting in R is really good and it gives you a lot--a nice tool for
making plots very quickly. And we'll go through a lot of that when we get through chapter
three. But I think in the meantime, go ahead and make sure that you have a machine where
you can--you can use R and get it working. And next time, when I get into chapter two,
I'm going to go through some example datasets and we'll talk about, you know, manipulating
them in R and Excel a little bit. But let me just take, you know, the last minute here
and see if there's sort of any questions. So, I'll just run through in case you missed it at
the beginning. The whole point here is that this class, you know, I'm teaching it at Stanford,
so it's not too much extra effort for me to come here and teach it here. It is being videotaped,
so you can watch all these, you know, on your--on your machine at your desk. It's an hour--even
though a Stanford class is an hour, 15 minutes. And, you know, you can sign up on Mailman;
you're welcome to come to any lectures you want. Everyone is invited. This is the textbook.
We'll try and, you know, distribute some of these next time, but there won't be enough.
We'll have to do it by lottery. So, go ahead and buy one. Go to www.stats202.com for all
the information about the course and make sure you're subscribed to the datamining07@google.com.
And so, I think, that's all the organizational information. We're going to meet Tuesdays
and Fridays from 1:00 to 2:00. I can't think of anything else that I might have said. Let
me just stop and ask if there's any organization questions or anything. Yes, question.
>> Can you get a larger room? >> MEASE: No, actually. I mean, I asked about
this and they said like, "Well, you can, but you have to go through another process." So
this is the biggest the Google EDU folks had and so...
>> The machine learning EDU talks had a lot of attrition like for the first lectures [INDISTINCT]
>> MEASE: Yes. So, we hope that a lot of you--we hope that...
>> Why are you looking at me when you say that?
>> MEASE: Because you're--because you're sitting on the floor.
>> [INDISTINCT] to that second group? >> MEASE: That's--I can talk to the Google
EDU folks about that. I mean, they've been extremely helpful. It was sort of a challenge
to estimate the attendance and we knew the Trix was an overestimate and didn't know
with the video conferencing how many people would actually want to come. And I think I
still don't know how many people actually want to come. I think we'll know more on Friday.
I think if it's still overflowing on Friday, then we can--we can really try and see if
we can do something better than having people sit on the floor. Other organizational questions?
No other organizational questions? Okay. There's free lunch in the cafeteria.