Google Python Class Day 1 Part 3

Uploaded by GoogleDevelopers on 01.03.2010

>> PARLANTE: All right, hey, everybody. All right [INDISTINCT] a lot of people working
away on this one. It's our requirement that you finish like all the problems on this one,
I'm just hoping that if you could get like, some of them. What I'd like to do, is I want
to do another quick lecture section then I want to break our time for, you know, the
nice long exercise at the end. And you're welcome to keep working on the list problems
if you want, but this last exercise, once we have file reading, then could begin to
look like a real program. And it's going to involve--it's going to sum up all the material
we've talked about, lists and strings also, put it all together and that'll be the last
thing for today. So, the last day I was sure to show you is the HashTable or also called
the dictionary. It's a very useful data structure, it's built-in to Python; it's pretty easy
to use. So, the delimiter character for a dictionary is the curly brace. So here, I'll
create one. I'll say D is equal to left curly brace, right curly brace. I'll just create
an empty one. And the way the dictionary works--and I've got a little piece of art there from
the hand out at the top--is you can think of it as a--it's said to contain key value
bindings. For each key, it can look up that value very quickly. So, the way that works
in Python is I'll say, under the key A, I would like to store the string alpha, and
under the string O--I'm just going to make the same example I have in that little picture
there--I'll store omega, and under the key G, I'll store gamma, put it there on my [INDISTINCT].
So, what HashTable does is you always talk to it in terms of keys. You say, "Dictionary,
please store this key," and it will store whatever. And then later on, when you try
and retrieve by key--this is the one thing that HashTables are fast at--it can retrieve
via key in constant time. It's just as fast as you can imagine, it's very quick at key
retrieval. So, for example, if I say, okay dictionary--and I--in Python, the way it works
is we just use the square brackets. When I use the square bracket on the left, I was
storing a value in. So, that series of three assignments, that builds exactly this picture
of the dictionary. So, you can think of the dictionary as what has a set of keys and each--for
each key it points to some value. If I want to look--do a look up, I would say, well instead
of--just don't put it on the left hand side of an equal. So, if I just say D of A, it
just returns the value, oh alpha. Now, not to belay with the point too much, that's the
one thing that dictionaries are fast at. You could put 10 million keys in this dictionary,
and yet for a particular one, you could call it up and it would call it up in just a few
cycles on the machine. It's very, very fast at key look up, that's the one thing that
it does. So, let me show you, you know, a few more features you could do with dictionaries.
So, obviously, you can store stuff in, you could use the square right to get stuff out.
If I refer to a key that is not in there, I say, "All right dictionary, what do you
have for key value X?" What I get is actually an error, a key error so that's a--it's a
long puzzle. I think Python is trying to be sort of consistent with the list here, that
if I said list, square bracket like a thousand and there is no thousand--you know, it would've
given an error. So, I think there's sort of an attempt to be kind of consistent with what
square bracket means. And square bracket means, yeah, it should've been there and if you refer
to something that's not there, it's an error. Fortunately, there's an alternative if you
use--there's a dot get method on the dictionary. And so, if you do dot get, in the simplest
form, what it does is it returns the value none, if it's not in there and otherwise returns
the value. So, if I say dot get of A, then I get alpha, but of X then I just get back
nothing. It happens in the interpreter, none is sort of a special case, it just prints
nothing; it just comes back. So, if you want to distinguish, you know if you want--if you
care where the nothing is in there, then you could use get and you're more protected. How
do you suppose you test if a value is in a dictionary, someone say? Something I wrote
I think over there, yeah, all right. While waiting for his in, right? Dictionary is just
another composite thing just like a list. So, I can just say A in the dictionary it's
like yup, it is. Like, all right, I could say, was x in there? No, it's not. So, can
you kind of live, live a little higher on the food chain, use those built-ins. So, I'm
going to--well here, I'll show you some more features first. So, the dictionary as itself
is--it's just good at this sort of key value storage, that's all it does. One thing that's
kind of common is you open a file and you read a bunch of data; you store it all on
a dictionary to kind of organize it. And then, once you're done with it, you want to loop
over the dictionary and kind of look at all the contents. So, the simplest way to do that
is the dictionary has a method called dot keys and what does is it returns a list of
just the keys. There's also a list of--oh, and notice the keys are in kind of random
order. Now, in this case I guess they appear to be in alphabetical order, but as the dictionary
gets bigger and bigger, these will be in kind of a random order. This is the hashing strategy
that the dictionary uses causes the keys to be in a scrambled order. I'm sure I'm probably
switching between the word dictionary and the word HashTable, so better yet those are
for the same thing. And hashing is this general strategy that's very powerful, you'll see
it just crop up in a lot of languages. So, the--likewise, there is a dot values, oops
what's the problem there? I spelled it wrong. So, this just pulls out the linear list of
values. So, it's actually, it's in a kind of a random order, but it's in the same random
order that the dot keys was. So, I'm going to show you what I would regard as kind of
the most common case of looping over a dictionary. You've built the dictionary, it's got all
this, you know, social security numbers or URLs or whatever in it and now you want to
loop over and look at them. I would write it this way, I'd say, "Well, for K in dictionary
dot keys." All right, so, what that means is go to the dictionary, pull out the linear
list of keys and now, I want to loop over that. But, it's just sort of as a nicety;
the keys are going to come out in kind of random order. So, let's loop over the sorted
of that. Pull out the keys, sort them, now loop over them and write this all on my mind,
I'm going to say like, well what I could print is--I mean, let's say I want to print a little
thing with like an arrow, like this key goes to this value. I could say print. First of
all, the dictionary, square bracket K so just--that's the--I'm sorry, don't do it (ph). I'll do
it, I'll say key colon, comma K, and I'll say comma, and I'll draw like a little arrow
and I'll say dictionary square bracket K. So, if I run that without so many errors,
so that's going to pull out the keys, it's going to loop through all the keys. And then,
for each one, it's doing this little line up like, here's what the key is and here is
what--here is the value that is stored for that key. So, that is the most--I think dot
keys and dot values are the most common case. There is this one other case which is called
dot items. And what dot items does isn't--rather than just pulling out the key or just pulling
out the value, it pulls them both out simultaneously and it puts each one in a tuple-sized two.
So, what you're seeing here is a list of tuples. Each tuple is length two and each tuple represents
one--what I would call a binding from the dictionary. And so, what you could do is you
could--if you really want to just get the key value all in one step, then you could
use dot items. So, for example, I could say for tuple in d dot items colon, you know,
print the tuple or whatever. All right, check it out, I left off the prints. So, that's
another way of kind of dealing with all of the data of the dictionary, you know, once
you've built it up. You know, okay, I think that's--I'm not going to show you the other
thing, it seems messy enough. So, now I'm going to sort of step outside for saying--I
mean, what is the dictionary good for? Suppose you've got an Apache log file of like, 20
million get requests and in a sense, the Apache log file there, it's just random right? This
IP address requests something, and then this one and that one, whatever--it's all just
kind of jumbled together. A very common form of the use of the dictionary in solving your
problem is that it takes incoherent data and it makes it coherent, it organizes it. So,
for example, suppose I wanted to count how many times each IP address had been hitting
my server by looking at this log file. What I could do is I'd say, okay, I'm going to
rip through the entire file--which I'll show you in a second--and for each IP address,
I'm going to use the IP address as, let's say, the key in the dictionary. And as the
value, I'm just going to count how many times I've seen that key. So, the first time it
shows up, I'm going to put a one in there. Maybe I could use in to kind of check if it's
there or not, to kind of boot strap this like, the first time I put it in. And then the next
time I see that IP address, I could go to the dictionary, use square bracket, I'd say,
"Oh, is this in there?" If it is, I'll do, you know, plus equals one or something to
kind of like pump the value up, and so by the end, I've gone through this unordered
data and I've used the dictionary to kind of associate by IP address. So, anything where
you've got this random piece of data in one place and you want to associate it with other
occurrences of that random piece of data, the dictionary is like, hands down the good
way to do that. And it relies on this fact that it's very fast at this one operation,
which is, given any random key, it can look up that value very, very quickly. Now, stepping
even farther outside of myself, Google really runs as a big hash table and that's basically
what it is, right? It's a big hash table by word, right? And each word, kind of, points
to a list of URLs or scores, or something. So as a user, when you type in a word--you
know, whatever that, you know, Britney Spears, that's two words, I guess. You know, really
it's going to this gigantic hash table that is Google, it's looking up Britney and like
there's this big list of URLs. And it's looking up Spears, and it does officially and then,
you know, you just kind of intersect those lists. I mean, that's Google in that show
(ph). To show that's very clarifying, all right.
>> Where is Yahoo now? >> PARLANTE: Yeah, no, Yahoo it's a linear--anyway,
yes. I'm--they all have--they also have hash table technology so that's not, that's not
the problem. All right. So, predictably the big exercise for today which I'll demo in
a second. Yes, I mean, why don't you use the dictionary that's totally appropriate thing.
And in fact, dictionaries are--I mean, you can sort of see--the subtext I was saying
you--telling you is, dictionaries are an extremely useful, and powerful construct. And so, I'm
going to want you push to use them like it's a good thing to get practiced with. All right.
So let me show you the last thing which is files. And I think--I'm going to go back to
editing for this. I'm going to show you some files stuff here. All right, so here's,
hello, all right, so there's just, you know--I'm going to--there's--In UNIX, there's this utility
"cat" and what it does is it--just, you know, prints out the concepts of a file. So, I'm
just going to go ahead and kind of recreate cat and Python just to kind of show you how
to how to open files. So, let's say this thing takes a filename--I'm going to write a function
called "cat" and it takes a filename argument and what I want it to do is just print the
content to that file. So, the way to do this is there's a built-in called "open," and it's
first argument is filename and the second argument to this is kind of resembles other
languages. I can specify an "r" here to mean, I want to open this for reading or I could
specify "w" to mean, I want to open it for writing, whatever. This case will just--we'll
open this for reading. And there's actually a variant, if you put a "ru" there, it opens
it in a way where it's going to--it'll ignore DOS line endings versus UNIX line endings
or whatever. I will kind of fix that under the hood, so that's nice. So, probably the
simplest way to deal with--let's say there's a text file--to deal with online and the files,
you can actually write a for-loop over a file; sort of like it was a list. So I can say "for
line in f." So, I loop over "f" and why is it great, what it does is it actually reads
the file in line by line and just processes one line at a time. So, if I were to say "print
line"--I mean this is why--I want to just echo the file. So, I'll just print each line,
and then, here I'll just--I'll fix-up my "main" to call this thing. So, "main" just protects
whatever command line argument I have here and it's just going to call "cat" and just
say, you print this out. So, I'll--I'm here, right? Okay, so let's see. In day one--so,
I've got this file, a small.txt, I'm going to take that up here. Okay, so if we look
at what's in small.txt, this is a beautiful poem I wrote for the class. So now if I'd
say, """ of small.txt--well, it's to me it said, so, I open that file, it's
reading it, and it's printing it but I'm getting this kind of line doubling; I'm not improving
my poem. So, the problem is this line string, it includes the new line at the end and then,
when I print it, I'm kind of putting another new line there. So to fix this--Python obscures
some text. If you put--if there's a trailing comma at the end that inhibits that last new
line. So, I'll save that. I'll just run it again. Ah, excellent. Okay. All right, so
that is one way to read a file. You open it into this thing "f" which is kind of a file
handle kind of thing, and then you just loop over it. This technique, where you loop over
line by line, it has this one virtue--sort of going back to my apache. I could have an
apache log that's like 47 gigabytes of just texts, and this loop hole has run through
it but it will not require 47 gigabytes of RAM. This uses a tiny bit of RAM because it
deals with each line independently, that's an efficient case. So, if you can go line
by line, this is an efficient way to do it. Also, just for completeness, at the end there's
a dot close--I should not, you know, close the file, you know--just to be plain (ph).
If you omit it, it closes it on when the process exits but, you know, probably more correct
it and close it. All right, so that's one way of doing it. I'm going to make aside here,
just a stylistic aside that you're going to need badly in about 20 minutes which is--if
we were in Java or C++, we would know what variables are strings and what variables are
lists because there's all these declarative syntax telling you; what's a string, what's
a list, or whatever. I want to put out--in Python there'd be all these cues. Sort of,
when you look at your own code there'd be all these cues about what's going on. In Python,
there are no cues. You just have to keep in your head, what's a filename, what's a file,
what's a string, what's a list? Now, there's something you can do about this. The one thing
you can do, the one thing you get in Python is your variable names. And you knew, you
were taught or whatever, well, you knew, oh, yeah, having variable names is kind of a good
idea. But in Python, it's 10 times more important. So, I want to point out in my little garbagey
(ph) example here. And now, what I'm typing the stuff in the interpreter. A lot of times
I use one variable names but I'm a little more disappointed here in the editor. This
thing I spelled out--yes, that's a filename not a file, not a list string, it's a filename.
It's very useful in Python that you name the variables to kind of keep track of what is
that. And very often when I'm walking around standing behind people, many of the bugs I
see is where you kinds off by one. You sort of lost the scent. You're like, "Oh, this--you
know that's the filename." "Oh, no, it turns out that's a line from the file or it's list
of all the file in there." You kind of get this off by one. Of like, "Wait what does
that variable contain?" because you've got this natural progression of filename, to file,
to line of text, to word, or whatever. It's very easy to get off by one of where you are
in that progression. And Python, the only thing you have is the variable names. So,
I've worked with a lot people, you know, debugging, basic Python programs and I'll tell you, it
is--it pays off very well to be conscious of giving your variables good names of like
what is that? Is that a string? Is that a list? Is it a dictionary; whatever. And certainly
in my solutions, whatever, you know, I fall that, I mean, that's just good style anyway,
but in Python it happens to pay off especially well because you've got so what else is going
on. All right, so on our bigger exercise today this will, this will certainly come up. The
other way that this comes up is whenever I have a list of things. I'll try and give it
a variable name ending in s words, or lines, or something like that to sort of like cue
me like, "All right, that's the lists of things." Okay, end of editorial; it's actually good
advice. So, my second favorite Python file primitive is there's this function I think--the
capitalization I remember, I think it's "readLines" like that. I guess we just try it. And what
that does, is it reads the entire file into memory as a Python list of lines where each
element of the list is one line from the file. So, then here--so for cat, I could just say
"print lines" I guess and it just be done. All right, I'll try this to just check if
I have the name right there. Has no attribute "readLines" okay, apparently not. Maybe, it's
lowercase L, it must be. Oh, excellent. All right, so now you can see just--so now it's
a Python list. So, now you could loop over that list, do something or whatever. So, if
you had a file with like, you know, 10,000 or something rather is that you want you process
as a list, you know, this is a--one easy way, you just pull it and ran in one step. Notice,
if the thing is 28 gigs of data, now this does at least 28 gigs of RAM to store like
I really am just drawing the whole thing around. But, you know, I get, I get very--you know
give you an access to it. My absolute favorite--I know you've been very curious. What is my
favorite Python file primitive? Oops, and it's just--there's this one just called "read"
and what is does, is it reads the entire file into one string. And I know that's it, but
for some problems that is just so convenient. There's no notion of like looping this line
versus that line I just, like suppose you wanted to do big, some big suture place operation
on a file or say some regular expression processing. Yeah, just having the entire thing is like
one big string is phenomenally convenient. So, that is, I think that'll just work. Yes,
check it out. So, we'll build on that one a little bit more later on. I think for today
just the very--the first one I showed you. Just for line in "f." Now, that's probably
the most common one. Also going back to my ranting editorial. By variable, noticed? When
I--yes, read into variable called text like I'm trying to remind myself like, oh, right,
you know, that's the whole text, whatever. All right. So we've got dictionaries. We've
got files; these are the last two elements that you needed to know right--there's kind
of an inflection point, where like with all the stuff I've showed you now, you can write
just like real Python programs. And so that, that's why I like, last exercise is going
to look at--look like. I already gave you the advice about variable names. One other
piece of advice I should give you. I'm about to demo some larger programs for you that
I want you to implement and it is the--sometimes--or at Stanford, right? If I'm talking some 17
year old--what they will do is they'll type in the entire program, like, well, here's
my vision of how the entire program is going to work, and then once they type in the entire
thing in, then they try running it. And then they like, you know, of course, it crashes
on my client too early. That is not the way to do it, especially not in Python. Python
is so good and you make a little change and then you can run, and you can see immediately
kind of what's going on. So you want to sort to keep that focus right at the kind of growing
edge of your code. And there's two things that are going to help you out here; let's
say--so for your next problem, you're gonna do these things that involve like reading
a file, then processing lines and for each line you're going to want to kind of do something.
What you should do is to move very incrementally, like re-read in the lines and well the first
thing I just want to print each line. So you just print each line and then after you print
it and, you know, when you're done, you could call like sys.exit or something, just like,
okay we'll just bail out. And just get that working through like, no now, it's pretty
[INDISTINCT]. So it's okay, well, what's my next goal? My next goal is, you know, I want
to find all the words, so I'll just print each word or whatever. And so just have the
program print what your kind of next goal is and keep iterating until it's printing
the right thing. And Python--because Python is very easy to edit and it's built-in printing
facility is actually quiet good, right? You can print a dictionary, you can print a list,
all that stuff; there's no code in your carbon card. It just prints, it just--it'll adopt
something kind of reasonable to the console. And so you can use that to kind of iterate
very tightly and certainly as I walk around, ask and answer questions, the same thing I'll
do. So, that's why--have a goal, print what you have, [INDISTINCT], some of it--it's kind
of nice if you print what your data structure has kind of a step in of solving this thing,
seeing the pixels of like, "Oh, I have a list of strings or whatever," can help your imagination
think of like, "Okay, well, what's my next step? Oh, now I need to make a list of, you
know, a dictionary or something." So it's so nice to be all just to see your data structure
right there. And I guess more broadly, what—you know, why is it that Python program was given
this zone of like, "Wow, they're very productive, they can just get stuff quickly." It's partially
because not just that they know the Python language but that their program in a style
that fits the things Python is good at, so this is the right stuff for today. Okay, enough
kind of heavy device, let's talk about actual work. Okay, so I'm going back down to the
day 1 directory, and there are—I'm gonna show you two programs, all right? As before,
this program--so the first one I will show you is a word count. It's the first exercise
for today, this is what I want you to do today, and I'm going to demonstrate very quickly
a second exercise which you could do if you just don't like, you wanted more practice
or whatever. But word count's the one I wanted to work on. So I'm going down to my solution
directory, so I can actually demo working. So what word count does, is if you--oops,
ignore that hey there thing, that's left over for something else--so what it does is you're
given a file and what I want you to--I'll just run it here, and I'll run it on that
small file--oops, I'm sorry--it takes a command line argument, in this case a dash, dash count,
and what this does is I want you to go through the file, I want you to find all the words,
and I want you to just count and I want you to convert them with all to lower case, so
we won't count upper case, smaller case different. I want you to just count how many times each
one appears. Now to find the words, there's this Python built-in--which I showed you before,
I'll show you again--so I have like aaa bbb ccc, there's dot split and I'll show you this
dot split before I would like a corn or something. It turns out if you call dot split with no
arguments then it splits on white space, and that's going to turn out to be a pretty reasonable
behavior for this program. So for word count, I'm just looking for split with no arguments
so you just get the word and it's just automatic--it knows what it's have is, what a new line is,
it'll do something kind of original. Tomorrow with regular expressions, we'll do a better
job but for today this is [INDISTINCT]. All right, so what I want you to do is process
the whole file and like a fair with the words are, and then you can debug on small.text,
and that's sort of more fun. Then we also just have the entire text of "Alice in Wonderland"
and so you could just process that as well. Something a little dopey is happening here
well, like the word while appears 20 times, but the word while followed by a comma appears
four times. This is just because of the split strategies just on white space and it's not
smart enough to trim off anything else. Okay, I apologize this just, this just to get us
through today. Tomorrow, I will show you how to deal with the punctuation as well, but
for today, I think just deal--we've got enough problems just with the dictionary. Well, I
want you to learn the dictionary in the file. So one thing's that's need to out this is,
Python is not known for being especially fast, it's known for being quick with your time
as a programmer, it's you know, it's a great instrument and you can get stuff done very
quickly. But you know memory in anywhere else is not that quick. That's just you know, you
should know, that's one of the things you know the truth. And yet, I just process the
entire "Alice in Wonderland" and like, whatever, if your time that says like it's about 28th
of a second and this code is just written in the most obvious convenient way--the laziest
way I can think of, which is of course the way [INDISTINCT]. So on the one hand, you
know, Python is not the quickest language; on the other hand, most things are just kind
of I/O bound like probably saving that last inter second CPU time, maybe not the best
usage of your time or you just want stuff to work so that will sort of show up here.
All, right, so that is part A, counting the words. Then—and I should mention the parsing
of the command line arguments and the other kind of boilerplate, I provide for you. So
when you look in that word count file, you'll see there's a boilerplate that's done for
you and then you'd figure out what just the part of you need to add. So then I want to
have top count--oops, what's two top count on "Alice in Wonderland?" Do you guys have
a 28th of a second to spare? Sure, all right. All right, so what does that do? Yes, it's
sort of--it's by the frequency. Yes, so--what appear most often. Now, just in terms of writing
code, well, this both use a dictionary which has the word as the key and the count as the
value. I don't want to have one copy of the code that builds that dictionary for dash,
dash count and another copy of the code that builds it for dash, dash top count. Really,
you should have one copy of the code that builds the dictionary, either by putting into
a separate function or, you know, or you could [INDISTINCT] but I want one copy of that code
that builds the dictionary and then we have these two uses of the dictionary, either you
could loop to the dictionary and you just print all the words in alphabetical order
or you could go to the dictionary and so the trick here is, it's like, "Oh, I don't want
all the words, I want to pick out the words that have the big counts and I want to print
those and in sort of order by count." So for that problem, you kind of want to use the
dot items on the dictionary, where you can pull out all those two poles and then made
this some custom sorting, right where you want to sort the two poles now at the big
second number, anyway, that's part B. So the first problem is part A. All right, I have
one more thing to show you, this is whole other program, the mimic program. And I'll
just run this on "Alice in Wonderland." Let's see what does this say? "Alice's shoulder,
and birds with some minutes, the only grinned a stop to death quote. You might do something
splashing about at the Caterpillar." What is this? Yes, you could--well, so actually
you can run this--so, all right. So I'll tell you what this program does. It's a fun program,
it uses actually in terms of coding technique, it's very similar to word count. Same level
of obstruction, it just does something different. What this program does and also it shows how
Python is actually quite fast sometimes. It looks at every word that appears on the text
and for every word it just create keeps a Python list of all the words that ever appear
after it in the entire text. You can sort of imagine building a Python dictionary that's
just sort of attracts this. So for the word sister, I have a list of all seven words that
ever appeared after the word sister in the text. So that once you have that data structure
you could just pick a random words, so I'm just going to start with Alice's, and I look
at the list of the words that ever appeared after that and I just pick one at random.
So apparently in this case I got shoulder comma and then for shoulder comma, I look
at the words that ever appeared after that and I just--it's just totally random. And
yet, you can--it's sort of weird, you can I mean it's not completely believable but
it's sort of weird for one unbelievably primitive strategy this is, I mean, you can end up mimicking
some text. So you can have fun if you run it on Google design docs. The other thing
that you can do of this--only a computer sciences can prove it is you can write it on Python
text, and it sort of like it's some random sense like my variables and comments and stuff.
I mean the line breaks are completely messed up. So, that is--so if you have time or extra
energy to kill, you could take a look at the mimic program but really what I want you to
work on is the word count program. And that is the lesson for today. So, please go ahead.
Get start on that, I mean, if you want to keep working on the list once earlier or once--I'm
happy to answer questions about that. But this is a nice kind of a threshold like, oh
now, this is really all the pieces of a real Python program you put together, so it's a
nice exercise to work on.