Google Python Class Day 2 Part 3


Uploaded by GoogleDevelopers on 01.03.2010

Transcript:
>> PARLANTE: All right, so I'll draw you back. So, now, you may want to keep working on that
a little bit. But I just want to--going to cut you short. I just want to show a few last
things and then open up more time to [INDISTINCT] where you can keep working on this one, you
can work in that corner or come what everyone. So--yeah, question.
>> I have a general question.
>> PARLANTE: Okay, question.
>> How do you find things? And here's what I mean, all right. I didn't realize we've
had in the handout additional information about this topic. So now I'm trying to find
information about the OS file.
>> PARLANTE: Yeah. So the question, I'm going
to say, was very much like, well, how do you find information about Python modules?
So, probably, I think with Google Search, it's like--for example, I'm about to show you the
urllib module. And I suspect, if you do a Google Search for like Python URL Library
or Python URL Module, the--you might say the hits are pretty good. So that's the free-form
answer. The other answer is that python.org is the sort of official, maintained sort of [INDISTINCT]
and there's, you know, it's very browsable, there's a lot of organization modules there.
So I think those are two pretty good answers.
>> But, here is the problem. All right, just
search under OS and they don't show you that there is os.path in there that you should
actually be looking at. You get like...
>> PARLANTE: Yes. So the problem is if you
just do "dir" on "os", it's hard to know that you need to look in os.path especially because
[INDISTINCT]. So I think "dir", it's kind of a nice quick-and-dirty solution, but maybe
doing a real Google Search and really looking at python.org, if you really want to browse,
is the more rich experience because "dir" is like pretty [INDISTINCT], but it's attractive
for like quick-and-dirty. So, I don't have--there's no--yeah, I don't have [INDISTINCT]. But--I
mean why--this is progress in software, right, that it used to be in C. I mean you just had
to write everything yourself. And now, with like Python and Perl and Java, whatever like,
in a sense, you do have these libraries, I mean, it is a tremendous improvement. And,
now--but there's this cost to like, "Well, we have to kind of dig around and find them."
So, oh well, progress. Okay. So what I want to show here is, you know, a couple--just
a smattering of things and then one more module and then get to this last exercise. So first
off, I want to show you--this is just totally optional. Okay. You do not need to know this,
but I just feel like it's something that's useful to know. I'm going to go back to my--there's
"hello.py"--okay, yes. Here, you see the beautiful blue screen showing nothing. So earlier, I
had written "hello.py" to be kind of a Cat utility. And so I've recreated that. And I
want to just talk about this a little bit in particular to talk about exceptions. The
need--you do not need to know exceptions to do the exercise today, but I just think it's
something that's useful to kind of have seen and so I want to spend a few minutes just
mentioning how these work. So what this code does is--here's this function called Cat and
it takes the filename argument. And all it does is it just, you know, it tries to read
it and then it prints out--here, I had a print like "---" and the filename and then it prints
the text from the file. And then down here at the main, well, it's very much like the
main, you guys have--it just takes any number of command line arguments and it just tries
to Cat them all. So I guess I'll go with my two. I'll leave that text up there. I go to the
other one. There's the other one. It just goes down here. All right, so if I go up here,
so I could run hello and I could say, "Oh, please, Cat"--if I say "small.text", it prints
that one. But I could list multiple files. So I'll have it print itself. So I'd list
multiple files and then there's the--so if I just scroll up, it printed--here, it printed
"small.text", and then here, it printed "hello.py." So any number I give, it's just going to--it's
going to give. So here's the problem I want to deal with, which is errors at runtime.
What if I give this thing a file first in line here of like, you know, "nosuch.text".
So now, what's going to happen is, it's going to run the loop. And it's going to try and
Cat that first one. And what's going to happen is--and of course, yeah, you may have seen
one or two of these over the last couple of days. Right, this is a stack trace exception.
And what I said yesterday is like, well, you want to kind of care about the stuff at the
bottom. It's sort of the most interesting things. So what it says is there was an "IOError,
there's no such file or directory, nosuch"--you know. Hey, that's almost English. I think
that's pretty helpful. And then here, it's giving us, "Well, here's the line where that
happened." So you can see it was my open call on that file name, it failed. And then up
above, these are the lines that--this is the call sequence. So if I read this from top
to bottom, well, it started at main line 24 and that called Cat on line 20 and then that
went down to line 10. So it's kind of going from the past up to the present and then the
failure. Now, the mechanism--and I'm going to give you just a little bit of detail here.
What's going on here is called an exception. And what's happened is that this line, "f=open(filename)"
is said to have thrown an exception object representing a runtime failure like, "Okay,
this is not going to work." And the default behavior of an exception is that it goes all
the way out to main and exits the program. It just unwinds the whole thing, and like,
"Okay, we're done." And you know what? That's actually kind of a fine default, right? You
might use list wrong or make all sorts of errors and, you know, kind of what you want
to do is exit the program, print some kind of error message. So like as a default, that's
fine. But what I want to show here is how might you intercept the error, print a message,
and keep going. In particular, I think it would be really nice if that first one being
wrong didn't crash the whole thing. It would be nice that it would print a little error
message but actually continue to print "small.text". So if I could just hold your questions,
all right, let me fix this. Now, exceptions are kind of, you know, a deep topic, and so
I'm--by no means am I telling you everything, there's [INDISTINCT]. I'm just showing you
like the most simple thing possible. All right, so the way to do this is I'm going to say
try. And so that's a little--it's called the--a try block, and I'll say except, and then I'm
going to say "IOError." So if I go back over here, the exception that was thrown to here
is an "IOError", so I'm going to catch that "IOError". And I'll say--I print, you know,
something--how about IOError, but at least, I'll mention the file. So save that, comma--what
was the filename here? It's filename. All right, so here's what the--here's the dynamic
here. What this try-except does is very dynamic. It's going to--in the try, it's going to try
and do each one of these lines. So it tries to do that one, tries to do that one, tries
to do that one. And any one of them, at runtime, could throw an error, and that's going
to interrupt the usual series of execution, and it's going to jump from there and it's
going to pop to down here. So it's kind of a--normally, code is very linear, right? It
just goes top to bottom. This interrupts--it interrupts it and jumps down here to the--if
the exception matches, it jumps to here. And so what's going to happen is either this will
read--well, okay, let's see if I did it right. My intention here is that if this executes
normally, we'll see the regular output. But if it's bad, it'll just print this line. But
in either case, the function will then just exit normally and we're back at main and the
loop could just continue. So rather than the default kind of kill the whole program behavior,
I'm at least scratching the surface of exceptions to kind of control it, print something and
keep going. All right, so now let's see--you know, other languages have a--have the same
feature, it's called exceptions in other languages too, Python just has its own syntax for it.
So if I do this one, all right, that is actually good. So there's the first one failing, says
IOError and then the loop continued and did the other two. So there's a question over
here.
>> Does Python run through the entire program
before outputting anything?
>> PARLANTE: So the question is, does Python
run through the entire program before outputting anything? Python does when you--that's [INDISTINCT],
I'm about to do a thing which a little bit shows this, but when you first load the module,
Python does a linear pass to kind of tokenize and read in the code, but it does almost no
analysis. It's really superficial. So it is then, when the execution actually runs over
the lines that more semantic stuff actually happens.
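
The try/except version of the cat utility described above might look roughly like the following. This is a sketch rather than the code shown on screen: the name `cat`, the message text, and the use of `with` are assumptions, and it is written for Python 3 (where `IOError` is an alias of `OSError`), while the class itself used Python 2.

```python
import sys


def cat(filename):
    """Print the contents of filename, or a short message if it cannot be read."""
    try:
        # Any line inside the try block may raise. If one does, control
        # jumps straight to the matching except clause below.
        with open(filename) as f:
            text = f.read()
        print('---', filename)
        print(text)
    except IOError:
        # Reached only when opening or reading the file failed.
        print('IO Error', filename)


def main(filenames):
    # cat() returns normally either way, so one bad file does not stop the loop.
    for filename in filenames:
        cat(filename)


if __name__ == '__main__':
    main(sys.argv[1:])
```

Run with a missing file first and a real file second, the missing one should produce only the short message and the loop should carry on to the next file.
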
>> [INDISTINCT] a non-variable error like without even--I mean you just have like print
hello and then print like blah, blah, and you didn't define blah, blah, it wouldn't
print hello.
>> PARLANTE: That's right. That's right. And
so to sort of summarize it, for example, if I had--oh, this is maybe an excellent example.
So let's say here, I called some foo function like handle error or whatever--and
I forgot to define it, like, there is no such function, this code will compile and run fine
so long as the exception doesn't happen. I mean, this is kind of what you're saying. So [INDISTINCT]
still this, when you load a Python file, pretty much nothing is checked, errors, you know,
typos, wrong variable names, wrong functions, just almost all errors are not checked.
It is only when the code actually executes over those lines; that's when it's checked.
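
A small illustration of that point, under stated assumptions: `read_or_report` and `handle_error` are hypothetical names, and `handle_error` is deliberately never defined. The file loads and runs without complaint until the except branch is actually executed.

```python
import os
import tempfile


def read_or_report(filename):
    """Return the file's text; on failure call a helper that was never defined."""
    try:
        with open(filename) as f:
            return f.read()
    except IOError:
        # handle_error is deliberately not defined anywhere in this file.
        # Python does not notice until this exact line executes.
        return handle_error(filename)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'small.txt')
    with open(good, 'w') as f:
        f.write('hello\n')

    # The file exists, so the except branch never runs and nothing complains.
    print(read_or_report(good), end='')

    # The file is missing, so the except branch runs and the mistake surfaces.
    try:
        read_or_report(os.path.join(tmp, 'nosuch.txt'))
    except NameError as err:
        print('NameError raised only now:', err)
```
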
And that, you know, I'm sure you've noticed that for the last couple of days. All right,
so let me get back the main theme here. All right, so that's just--exceptions is a deep
topic. I just felt like I should show you just, you know, six minutes for just a little
of how they work. Yes.
>> I think what he's asking is reverse from
the concept because in this case, I think it's saying that he's actually still on error
without [INDISTINCT].
>> PARLANTE: Well, let me--I'm not really
following, but let me just [INDISTINCT] your head. And there's more stuff I want to show
you and we can talk after class. All right, so let me--all right, so that was the basic
exception stuff. So I want to show you this other thing. I want to talk about modularity
a little bit. So we go down to the baby name structure here. So modularity refers to this
idea that you have code, you know, any project starts off, at first, it's just some engineer,
some person working on something for their personal, like, headache, right? They're sort
of solving it. And then pretty soon, their teammates hear about it, "Oh, like, hey, can
I use that?" And, you know, then, pretty soon, it's like, you know, [INDISTINCT]. So just
for good design using Python, you want to think a little bit about modularity. And by
that, I mean, like, well, something was built to solve one problem, can it grow or be reused
by other people over time so like--so they can use it as well. So just code reuse just
within a project. And we'll talk about this at two levels. The simplest level is just
having a program take command line flags. This is just totally primitive technique and
yet it's actually very effective, and it's used within Google a ton. So you'll notice
in our programs we write today, like, I didn't assume, "Well, it's going to write to this directory.
It's always going to read that file," or whatever. Instead, I was conscious of always trying
to feed in what you read and what you write into the program through command line flags.
And so, you know, clearly Python supports that. We've been doing sort of primitive command
line parsing. You know, I sort of do it by hand, that's fine for simple things. There
are also command line flag parsing libraries, you know. You can go find the module for it
if you wanted like more rich flags for it. So that's a very basic thing to get right,
but certainly, you know, step one for modularity, you would want to get that right. The next
layer up would be, if you want to reuse someone else's module and instead of calling their
program as like a program and passing a flag, you would want to call their Python functions.
So someone, you know, some officemate of yours wrote a function, and you want to
call it. I'm just going to scratch the surface a little bit of how that would
work because Python has pretty good support for this case. So what I want to do is I'm
going to look at my baby names program here and, you know, let's get this back on screen.
All right, I'm sorry. Is this the solution? No, I'm going to go and hit my solution for
baby names. It's a little more interesting. And I'm going to do something that you should
never do, which is, I'm going to put a print statement. I'm going to say, "Hi, there,"
right at the outer level. Now, I haven't talked about this a lot before, but when you run--when
you load a Python module, really what it does--what Python does is it executes it from top to
bottom, and this print statement, what that's going to do is it's going to print whatever. And
in executing this def, what that does is it sort of looks at its code and then, you know,
it binds it to the symbol ExtractNames. And then it just [INDISTINCT]. So--and it's only
when we get down to the bottom--oops, get down, oops, get down, uh-oh, what did I just
do? Here we go. All right, let me undo out of this. All right, only when I get down here
to the bottom, then, when it gets to this "if" statement, then it actually calls main,
and that [INDISTINCT] thing off. All right, so I'm just going to save. Uh-oh, did I undo
my--no, my print's still there. Okay. So I'm just going to CTRL+Z out of there. So
that's--I edited it, and I'm going to fire up the Python interpreter. Now, we've done "import
re" and "import os", right, so this idea of modules you've seen a lot. Now Python, the
word module is really [INDISTINCT] in Python with just a ".py" file, right. And a ".py"
file, you could think of as a namespace that just has some stuff defined in it. So the
kind of crazy thing I want to show you is I can say "import babynames". And what that did is
that loaded the file, took all of its defs. And then that sort of print that I put in
that you should never do, that--it ran it. So, when you load a Python file, it sort of
executes it. And then because I put that print in there, it kind of shows that that's what
was going on. So, now this will kind of--now, I'm going to kind of connect code you've written
with like the way we've used the "os" module, so it's, like, I can do a "dir(babynames)".
And what I'm going to see is there are these "__" things, those are kind of like internals
that you probably don't want to mess with too much. But then like there's "main", there's
my main function, and check it out, "ExtractNames." There's like that function I wrote. And in
fact, if I say "help(babynames.ExtractNames)", it's like I get this little man page. And
I'm going to quit right here. I'm going to CTRL+Z, so I'm going to get a little fancy.
I'm going to go back to the other real quick--oops! Oops! Oops! Wrong editor. All right. Okay,
let's just look at baby names. So here, I've been talking about the slot for your solutions,
but here's my def in babynames, and then there's this big string, all right? So with
the triple-quoted string, so it--it just starts with three quotes, it's just a string
constant but it's allowed to span lines, all right. So it's just a way of having a
big string constant. And then this is a little bit like Javadoc, if you've seen that before.
So it's--what's happening is it understands, well, this big string, that's the first line,
that's probably the documentation for this function. So when you call help, that's how
help works, all right. So, I mean, it's really trivial system. But that's how--it's, you
know, pulling up little bits of docs of like what this function does. All right. So, now,
where's my Python interpreter, this one, all right. All right, so here I am in Python and
I've imported babynames. So, now, what I want to show you is I can call--what is it
called, ExtractNames? I can call "ExtractNames" here from the Python interpreter. So, I could
say "babynames.ExtractNames", and what does it take? The name of a baby file, it's going to
be like "../baby1998.html", of course--I think that's right, so let's just try it. So, here,
what I--this returns a list, right? Yeah, there it is. All right. So, this--I mean,
the stuff I've kind of said, but like--so, for example, if ExtractNames, like, printed
to standard out directly, I couldn't reuse it here, right? But because, I mean, the way
the functions work, it took its arguments in as inputs, it computes whatever it returns,
and it returns to the caller, you know, whatever [INDISTINCT] it means that now I can just
reuse it. I can sort of pluck it out of this program and reuse it for like who knows what?
Some other purpose. So, obviously, this is a deep topic, but I'm kind of pointing you
in the direction a little bit of like what it would mean to have kind of a well-designed
program, well-designed function and how modules could share code to build sort of bigger systems.
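
The ideas in this section (a docstring that `help` can show, a function that returns its result instead of printing it, and a `main` that only runs when the file is executed directly) can be sketched in one small file. This is an illustration under assumptions, not the class solution: `extract_names` is a hypothetical stand-in for ExtractNames, it takes text rather than a filename so the sketch runs without the data files, and the HTML row format is invented.

```python
import re


def extract_names(html_text):
    """Return a sorted list of the names found in html_text.

    The function takes its input as an argument and hands back a value,
    so any caller (another module, or the interpreter) can reuse it.
    """
    # Assumed row format, for illustration only: <td>rank</td><td>Name</td>
    names = re.findall(r'<td>\d+</td><td>(\w+)</td>', html_text)
    return sorted(names)


def main():
    sample = '<tr><td>1</td><td>Michael</td></tr><tr><td>2</td><td>Jacob</td></tr>'
    # Printing happens only here, at the outer layer of the program.
    print('\n'.join(extract_names(sample)))


# Importing this file runs the defs above but does not call main();
# main() runs only when the file is executed as a program.
if __name__ == '__main__':
    main()
```

If this were saved as a ".py" file and imported, `dir()` on the module would list `extract_names` and `main`, and `help()` on the function would display the triple-quoted string.
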
And that's, you know. Yeah, that's what big software systems look like. All right. So
let me show you--so that's--that's our--that's the style discussion for this section. So
the other thing I want to show you--all right, because I need the Python interpreter back,
because I want to show you another module. All right, I mean, in every lecture section
today, I just keep showing you more modules of built-in code that you might want to work with.
The one I want to talk about today in this section is called "urllib" and this one
has nice support for messing with URLs, you know, you'll never guess what it does. All
right, so the--what--first thing I'm going to do here is there's a "urllib.urlopen",
and I'll just give a URL like, say, "http://google.com". And what "urllib" does, let's see, it takes
a URL and it tries to make it look like a file. And so when I say "urllib.urlopen",
it's trying to kind of look like that open command that we've used to open files and
it returns--I've named my variable "uf" there, it returns this thing that's--it's kind of
like a file object like the "F", but really it's pointing over the network to this thing.
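
A sketch of that file-like behavior, with two assumptions worth flagging. The talk uses Python 2's `urllib.urlopen`; in Python 3 the same function lives at `urllib.request.urlopen`, which is what this uses. And so that the example runs offline, it opens a local `file://` URL instead of "http://google.com"; the helper name `fetch` is hypothetical.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch(url):
    """Open url like a file and return everything it contains, as bytes."""
    with urllib.request.urlopen(url) as uf:
        # uf behaves like an open file: read() returns the whole body.
        return uf.read()


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / 'page.html'
    page.write_text('<html>hello</html>')
    # as_uri() turns the absolute path into a file:///... URL.
    data = fetch(page.as_uri())

print(data.decode('utf-8'))
```

With a real "http://" URL the call is the same; the bytes simply come over the network instead of from disk.
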
So, in particular, I can say "uf.read", and what that's going to do is it does the networking
and, like, gets all the data and, like, here it is. And so, I could, you know, say, for
example, I don't know, regular expressions or something like you could do on this. So,
"urllib", it has like a lot of features. You can set cookies and all sorts of that kind
of stuff. And I'm just doing like the most simple sort of URL retrieval here. So the
other thing I want to show you here, man, look at all this. Does anyone ever look at
the source code of our homepage? All right, so I'm going to look for ".gif" in here. There
you go, okay. So there, "/intl/m" blah, blah, okay, so that--I'm going to copy that. That's
the URL of the GIF that's on today's homepage currently. So I'm going to show you another
"urllib" function which is "urlretrieve". What that takes is a URL, so I'm going to
say "google.com" and I'm going to paste in that. So, I believe that is the full URL path
of today's GIF. I'm going to say "," and what "urlretrieve" does is it does a download.
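
A sketch of the download call, again under assumptions: in Python 3 the function is `urllib.request.urlretrieve` (the talk's `urllib.urlretrieve` is Python 2), it takes the URL plus a local filename to write to, and a local `file://` URL with placeholder bytes stands in for the real GIF so the example runs offline. The helper name `download` is hypothetical.

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url, dest_path):
    """Copy whatever url points at into the local file dest_path."""
    # urlretrieve returns a (filename, headers) pair; only the name is kept.
    filename, _headers = urllib.request.urlretrieve(url, str(dest_path))
    return filename


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.gif'
    source.write_bytes(b'placeholder bytes, not a real image')
    dest = Path(tmp) / 'blah.gif'
    download(source.as_uri(), dest)
    print(dest.name, 'exists:', dest.exists())
```
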
So there's a GIF and I'm going to say let's call this like "blah.gif". And if I run that,
okay. So, now, if I CTRL+Z out of here, I do an "ls". Check it out, "blah.gif". This
may prove to be handy in our next exercise. So, I've just shown you two functions there,
yes, I mean, in fact, there's a huge amount of stuff there for parsing URLs, I mean, you
can do all sorts of stuff with URLs and HTTP, whatever. Yeah, of course, there's like
tons of built-in behaviors that I'm just--I've just shown you like my favorites. But if
you want to do some URL thing, you certainly want to look at that. All right, so let me
go up and show you our next--our last exercise, the best exercise ever. All right, so this
exercise is actually in the form of a puzzle. When you solve this coding exercise, you will
know the name of the puzzle or know the name of the puzzle but I'm just not going to show
you the answer, like, you just have to figure it out. There's two parts, there's Part A
and Part B. If you just solve Part A, you don't--that's kind of good enough, you do
Part B if you wanted. That's okay. Okay, so here's the idea, I had to think of some highly
motivated and incredibly realistic puzzle situation. So what I've done, I'm going to
look in this file "happy_www.corp.google.com". This is an Apache log file. And I'm going
to look inside of there. I'm sorry. I have to tell you the back story. The back story
is that there's this image. The solution to this puzzle is an image, and it's an image
for the first part of something or someone that is very happy. And in order to solve
the puzzle, I want you to tell me what--what is happy? What is that an image of? Now, what's
happened is some evil person has taken that image and they've shattered it like pinstripes
into a bunch of little vertical stripes of images. And so if you just have one stripe,
you can't really tell what it is. You got to really put it all together. And they've
taken these stripes and they have scattered them over the internet, somewhere. And each
stripe has a URL that points to it. And what's happened is if we look in this "happy_..."
blah, blah, Apache log file, we started--there's just all of this junk in here. But--and I
should mention you for scary reasons, this is not a real Apache log file. I took some
Apache log files and I anonymized them and I wrote a program that kind of put the pieces
together, kind of like some Frankenstein thing. So, if you look right--it doesn't really make
a ton of sense, but syntactically it's accurate, it looks like one, all right. So, if we look
inside here, some--most of these URLs, so you don't have to know anything about Apache
log files, but this is--that's the client URL, this is when it happened, this is the
GET request that the client sent. So there is a client asking for just "/" which is like
a very common thing to want. And I'm going to search for the word puzzle here. Here we
go. Some of the GET requests have the word puzzle in them. See where--that's "~nparlante/puzzle
[INDISTINCT] whatever. So, the ones with "/puzzle" in them, those are the ones that point to
the image slices. And so your first job is process the entire log file, find all the
puzzle URLs; and there's duplicates, so you've got to eliminate duplicates and then sort
them alphabetically. And I want to just see the output nice and neat. Now, just for kind
of regular expression purposes, you know, without getting into a lot of detail about HTTP,
the word "GET" is going to be here and then there's going to be a space, and then there's
going to be a bunch of characters of whatever it is that they've requested, and then there's
going to be a space. And then the characters that go inside of the URL are like kind of
a mess. There's the "~", "%" or whatever. So, when you're writing the regular expression
for this, I'd say look--look for the two spaces, like the two spaces are for sure, and then
just try and collect, [INDISTINCT] all the garbage that's in between them. Okay, so let
me, you know, for example, back--I'm just going to say "\S+" is like the nice way to
grab that. Okay, so let me try running those things, so I'll go down to my search directory.
So, I'm going to say log puzzle and I'm going to give it that happy file, and so in the
simplest case what it's going to do is it pulls out all the puzzle URLs, it eliminates
duplicates, and it just prints them one per line in alphabetical order.
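
The first part of the exercise, as described above, might be sketched like this. It is one possible approach rather than the reference solution: the function name `read_puzzle_paths` is hypothetical, the log lines are invented to resemble the Apache format, and the sketch stops at the requested paths without turning them into full URLs.

```python
import re

# Invented sample lines in roughly Apache access-log shape.
SAMPLE_LOG = '''\
10.0.0.1 - - [06/Aug/2007:00:13:48 -0700] "GET /~foo/puzzle-bar-aaab.jpg HTTP/1.0" 302 528 "-" "Mozilla/5.0"
10.0.0.2 - - [06/Aug/2007:00:13:49 -0700] "GET / HTTP/1.0" 200 1024 "-" "Mozilla/5.0"
10.0.0.3 - - [06/Aug/2007:00:13:50 -0700] "GET /~foo/puzzle-bar-aaaa.jpg HTTP/1.0" 200 2048 "-" "Mozilla/5.0"
10.0.0.1 - - [06/Aug/2007:00:13:51 -0700] "GET /~foo/puzzle-bar-aaab.jpg HTTP/1.0" 200 2048 "-" "Mozilla/5.0"
'''


def read_puzzle_paths(log_text):
    """Return the distinct requested paths containing 'puzzle', sorted."""
    # "GET", a space, a run of non-space characters, then a space.
    paths = re.findall(r'GET (\S+) ', log_text)
    # A set drops the duplicates; sorted() gives alphabetical order.
    return sorted({path for path in paths if 'puzzle' in path})


for path in read_puzzle_paths(SAMPLE_LOG):
    print(path)
```

On the sample text this prints the "aaaa" path and then the "aaab" path, one per line, with the repeated request collapsed.
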
>> But you appear to have stolen all of them.
>> PARLANTE: Excuse me?
>> You personally appear to have stolen all of them.
>> PARLANTE: Stolen? No, no, no, no, I created it. Anyway, all right, so, oh, yes, yes, what--who
is the evil genius behind this--who's on this exercise? All right, no one knows. All right.
Well, also, yeah, so that's the first part, you got them. Now, what I want you to do is
[INDISTINCT] I'm going to say "--todir", I'm going to say "output". So what I want it to
do is if the "--todir" option is specified, I want you to find all the URLs and I want you
to download them all. I want you to retrieve all the little slices and I want you to write
them into this output directory, right? Each URL has a slice, I showed you "urllib.urlretrieve"
so just grab them, just pull them. So I'm going to go look into the output directory
here and I want you to just give them names, like discard their original names and just
give them the name: image 0, image 1, image 2, image 3 and so on, all right? Now, we've
got this problem, each one of those, if I show it to you, is like--it's a vertical slice.
If you just look at it in isolation, you can't tell what it is. And so what we need to do
is put the slices together, right? So, like reform the original image, so then you can
look at it, you can say, "Ah, I see what that is. I've solved the puzzle." Now, the hard
way to do this, the way I first thought about it is like, "Oh, I guess I could get a Python
imaging library," which, of course, there is a bunch. And in the Python imaging library
I can kind of composite these things together, oh, in alphabetical order, that's correct.
So, alphabetical order will do the slices left to right the way you want for Part A.
So that would be--but then I thought of an easier way, the totally easy way to do this
is to just have Firefox put the images together for you. And the way you're going to do that
is let me just show you the contents of this "index.html", let's look inside of there,
and what I've done, what you're going to do is just put a bunch of image tags together,
just--and just with no space between whatever, just lay it together, and then Firefox will
just put those things together, you're all set. So that will actually solve the thing.
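
The download step and the "index.html" trick could be sketched together as below. Several details are assumptions: the function name `download_images`, the local names "img0", "img1", and so on (the talk only says "image 0, image 1"), the exact HTML wrapper, and the `retrieve` parameter, which exists only so the function can be tried without a network. It defaults to Python 3's `urllib.request.urlretrieve`.

```python
import os
import urllib.request


def download_images(img_urls, dest_dir, retrieve=urllib.request.urlretrieve):
    """Save each URL into dest_dir as img0, img1, ... and write index.html.

    img_urls is expected to be already sorted into left-to-right order.
    """
    os.makedirs(dest_dir, exist_ok=True)
    names = []
    for i, url in enumerate(img_urls):
        # Discard the original name; number the slices in order instead.
        name = 'img%d' % i
        retrieve(url, os.path.join(dest_dir, name))
        names.append(name)

    # Image tags written back to back, with nothing between them, so the
    # browser lays the slices out next to each other.
    tags = ''.join('<img src="%s">' % name for name in names)
    with open(os.path.join(dest_dir, 'index.html'), 'w') as f:
        f.write('<html><body>\n' + tags + '\n</body></html>\n')
    return names
```

Opening the generated "index.html" in a browser should then show the slices joined into one picture, provided the URLs were passed in the right order.
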
So that is part--I guess, that's Parts A and B, that's just solving the first puzzle. And
if you can get through that, that's great. There is a later part, that's a little more
complicated where you can't just sort images alphabetically. There's a slightly more complicated
scheme how to descramble the left to right order and so that's, you know, for slight
additional difficulty that's the inaccessible account. So what I'd like you to--guys to
do is work on this for like a nice long time and then some time, I mean around 3:30 or
so I'll interrupt you one last time for a few closing remarks and show you like a little
bit of, you know, slightly advanced optional things you might want to think about, tech stuff.
And then you're welcome to stay and work on any of today's exercises, kind of as long as
you want and I--you know, some time after four you can kind of wander off and do your
regular job. Okay, so off to the code.