Google Python Class Day 2 Part 1

Uploaded by GoogleDevelopers on 01.03.2010

>> PARLANTE: Back to the PyQuick Python class. So, the -- so, this morning I'm going to talk
about regular expressions a little bit, in particular, regular expressions in Python.
You may or may not know regular expressions beforehand. That's okay. And I'm not going
to show you all of regular expressions. I will show you, like, just enough for us to
get some useful stuff done. But regular expressions are a very powerful combination with Python.
There's a nice integration and so on. I will show you that. Also, the exercises later today
will, of course, you know, have little elements which are solved nicely with Python regular
expressions. Just as a--regular expressions are sort of a good news, bad news situation.
They--regular expressions are very--I mean, I could use the word powerful but also, like,
I could use the word very dense. Like, if you sort of measured the amount of thought
and cups of coffee that get poured into, like, per character, like the regular expressions
are, like, the most dense language possible. You could possibly pore for hours over, like, one
line of text trying to get, like, all the parts and backticks and whatever correct. We're
not going to get into that scary [INDISTINCT] but they are--we're going to sort of touch
on a little bit of that part. So, one word of warning, when messing with regular expressions,
it's the--I tend--I try to move a little slowly, like, they are very powerful, they are a little
tricky, so I'm going to try to be careful. And for, you know, today's discussion, like,
yeah, I'm going to show you just sort of basic stuff. And if you're extremely familiar with
regular expressions, well, you know, just bear with me for a little bit. We're not going
to do this for too long. And, obviously, really I'm going to emphasize the Python. That's the
bad news. The good news is also on all the exercises we're going to do later today, in
case I forget to mention it, if there is a regular expression component, at the very end printed
in little tiny print, I put what the regular expression solution is. So, it's kind of like
you can sort of flip to the back and get the answer if you get--if you're struggling with
that part of it, because really, you know, it is a Python class. So, I don't want you
to block on regular expressions too much. All right, so with that introduction, let
me--I'm going to start talking about how these things work. But first, I have to tell you
a joke which [INDISTINCT]. What do you call a pig with three eyes? Piiig. All right, now,
that's been covered. The necessity of that will become clear in a little bit. So let me
fire up the interpreter here. So, regular expressions in Python are supported by a module
called "re." So, I'm just going to import that. I'm going to do a lot of stuff here
in the interpreter. I'm going to sort of build this up. So, the basic idea with regular expressions
is they're a way of searching for a pattern inside of a larger text. So very much like, you know,
search in Microsoft Word where you have the little pattern you're looking for, and it's
going to look over this huge text and find the first instance of that pattern. But it's
this whole language where the patterns can be very powerful. So the way this works in
Python, the simplest way, is there is a function inside of re called "search." And I'll sort
of spec this out. It's going to work basically this way where the first argument to search
is the pattern, which I'm going to talk about a lot, the second argument is just kind of whatever
the text I want to search. And what it returns is actually not a Boolean, not text, but a
match object. So, here, I'll write this as "match." And then the match object will indicate--it
will show us a bunch of things about the found text. So, let me do an example. So, for the
text, I'll use our punch line. We'll say, you know, it's called piiig. All right, and
let's say for the pattern, we're looking for--and we'll just search--start talking about patterns
here. Maybe I'll just look for "iiig." I'll just put the simplest possible case. So, I
run it. Now, it returns this match object. So, for this type, match--I mean, it's not
really going to print, but it will say, well, that's, you know, that's some kind of Python
object. So it turns out for the first 20 minutes this morning, the only thing you need to know
about match is that it has this--it responds to this method called "group." If I call group
on it, it shows you, here is what the matching text was. So, this is our first
example of the regular expression. And the simplest case in a regular expression,
like iiig here, is that a character like I or G or something like that matches itself.
So the lower case I matches the lower case I. Now, I'm going to build up the vocabulary
to have a lot more complicated matches. But that's just characters matching themselves
is the simplest case. All right, another thing to point out here. So, this match was successful.
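A minimal sketch of that first successful search, written to run under Python 3. The sample text 'called piiig' is an assumption based on the punch line; the exact string typed in the session is not recoverable from the transcript.

```python
import re

# Assumed sample text built from the joke's punch line.
text = 'called piiig'

# re.search(pattern, text) returns a match object when the pattern is found.
match = re.search(r'iiig', text)

# group() with no arguments returns the text that matched.
matched_text = match.group()
```

Here each plain character in the pattern ('i', 'i', 'i', 'g') matches itself, so `matched_text` is the substring 'iiig'.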
Now, I'm going to do one that's not successful. So, like what, say, we're looking for the
pattern "igs" and that pattern just doesn't appear in there. So, if I run that and I look
at the match object, it's None. If the value in the interpreter is None, it just prints nothing. But--it's
just not there. So, if I were to try and say, oh, "," it's a very common error,
it's not going to work. Like, because match doesn't point to an object that has a group
behavior. match just points to nothing. So, the absolutely standard way to use
is--I'll sort of do an interpreter--first you do and then the next line is
something like "if match:." Like, if the match is there, then we found it. We can look at
the group. Otherwise, it's not there. Now, what I'm going to do--I'm actually going to
write--just--I'm going to def a little find function just here in the interpreter. I'm
going to do so many regular expression searches today. I just want to encapsulate that behavior.
So that what I'm going to show you here are some of the prototypical use of
and then I'll just use it for half an hour. So I'm going to say--I'm just going to call
this thing a "find" and it will take a pattern and some text. This is a little weird. I'm
doing this in the interpreter, but this works. I'm going to type a colon and I hit return,
and now the interpreter is saying, "Okay, what's the next line?" And so, I'll say space,
space and I'll say if match. Now, I'm relying on the fact that--I talked a little about
this yesterday, the rules for true and false. Now, there's a bunch of things that kind of
count as false: zero counts as false, the empty string counts as false. It happens that
the value None also counts as false. So, what this if statement is sort of saying is, like,
yeah, if that match is not None essentially. If it's there, it's true. So, if match
is there, I'm going to say "print" Okay, return again, two spaces, I'll say otherwise--I
want to say what happens. So, essentially, yeah, not found. All right, those are the
two cases. The question is, do you always need the .group? I'm going to always use it today.
In reality, the match object has--you could read the docs--it has, you know, in what
character position did it start, where is that, and all sorts of other stuff about
the match you might want to know, all sort of composited in there. All right, I'm going
to--here, return. All right, so now I have defined my find function, so now I can sort
of use this for--well, I'll just--I'll use it for my earlier example. So, now, if I say
find on that, I get "not found." And if I say ig--why didn't that work? I did what?
Oh, I didn't do the match function. Yeah, you're right. All right. All right, here,
I'll just do it really quickly. So, there's def and then I'll say "match ="
Oh, sure. Now, you guys tell me. Sure, where were you five minutes ago? Okay. So, there
is the match and I say "if match:," well, okay, everyone is going to know this code
by heart. If match print "", else: print not found. Okay. So, now, what will
we say, this time for sure--excellent. Okay. So, now, I'm just going to do--you know, I'm
going to build the vocabulary of regular expressions. So, the simplest rule, rule number one is
that simple characters just match themselves. Rule number two is that--and I'm actually
going to--a little hi-tech here. I'm going to make a little space. Special characters,
I made a little table up here. The dot is special: dot matches any character, it
means anything, except it does not match a newline. So, I could have said, "Well, I'm looking
for, let's say, any three characters and then a G." That's the pattern I'm looking for.
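The find helper and the dot pattern can be sketched roughly as follows. This is a Python 3 variant that returns the matched text (or None) instead of printing it, which is a deliberate departure from the helper typed in the session; the sample text is also an assumption.

```python
import re

def find(pat, text):
    """Return the text matched by pat, or None when pat is not found."""
    match = re.search(pat, text)
    # A failed search returns None, which counts as false in an if test.
    if match:
        return match.group()
    return None

text = 'called piiig'                 # assumed sample text

found_literal = find('iiig', text)    # plain characters match themselves
found_missing = find('igs', text)     # 'igs' never appears in the text
found_dots = find('...g', text)       # any three characters, then a g
```

With this text, the only g is the last character, so the three dots have to cover the three characters just before it.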
And so, in this case, that's going to find pig. So you can get, you know, a little bit of
a sense of, like, how this is going to be, you know, more powerful than just regular
Microsoft Word search. So, the--another rule that's going on here is--so, for example, if I
were to say "...G," you know--or I don't know, X. There. It's not going to find that. Well,
maybe I'll do it the way I did it before to make it better, put the S. So, if I say ","
that's not found. There's an asymmetry here where in order to succeed, all of the
pattern must match. So, in this case, I've got four characters or whatever all--you know,
I can't say, like, "Oh, well, three out of four," no. A hundred percent of the pattern
has to be kind of consumed and matched, but that's not true about the text, right? In
all of my examples, I'm not matching the whole text. Whatever, I can use a little bit of
this at the end of this word or whatever, so that's, say, a fundamental asymmetry. The
other thing that's going to happen here is that the search is going to go left to
right, and it's going to--it's satisfied as soon as it finds a solution. So, we could
make a case whether it's maybe--say for example I'm looking for "...g" and then I here like--and
then I'll make this like, "Oh-oh, there's a much better solution." You know, this one
is eg. What's going to happen is it's not finding that second one, right? It's not just
getting into it. But, how can I make it find that one? What if I said, "Well, really what
I'm looking for is an X and then two characters I don't know about, and then the G. So then,
it's like passes over the first one and finds the second one. Now, the regular expression
engine, I'm not getting too much detail, is--you know, I mean, it finds all the things it's
supposed to and it's smart. It understands. It does--it will backtrack. So, for example--this
is--what if I said, well, I'm looking for "...g" and then I insist that there's an S.
And here, I'll go fix this one. It's like kind of an S here. So you can imagine--so
that succeeds. So you can imagine it like it maybe sort of tries to make this one work
and it doesn't work, and so it says, "Well, it didn't work." It will keep going. So, the
other thing is it is going--it is left to right. So, it finds the first one. Okay, yes.
So the question is why didn't it match the first one. The trick here is I added an S to the
end of the pattern. If you read the pattern, it says--that says, any character, any character
gs. And the problem is it could--this one? One more, this one. That will succeed.
Oh, why didn't I find that? Okay, yeah. So what it does is it--sorry, yeah--it goes left
to right. And once it finds a solution, it's like, "Okay, I'm done. Don't try it anymore."
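The left-to-right behavior and the rule that the whole pattern must match can be illustrated with a small sketch. The texts below are made up for illustration and are not the strings used in the session.

```python
import re

# Hypothetical text with two places where "three characters, then g" could match.
text = 'called piiig much better: xyzg'

# The search goes left to right and stops at the first solution.
first_hit = re.search(r'...g', text).group()

# Making the pattern more specific skips the first candidate.
second_hit = re.search(r'x..g', text).group()

# All of the pattern must match: there is no 'gX' anywhere in the text.
no_hit = re.search(r'...gX', text)

# If an early candidate fails partway, the engine moves on and keeps trying.
text_with_s = 'called piiig much better: xyzgs'
later_hit = re.search(r'..gs', text_with_s).group()
```

In the last search the first g is followed by a space rather than an s, so that attempt fails and the engine continues until it reaches the g near the end.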
Yeah, question. >> For the three partners.
>> PARLANTE: Oh, yes. Okay. Thanks for the question.
>> Statement with a... >> PARLANTE: Yeah, so if you are actually
looking for the period characters, so like what, say, you know, I don't know. There's
a dot there. But you were right, it is C. And you can always put a backslash in and
that inhibits the specialness of a character. So, I could look for c.g. You know, I can
look for c.--"\.l" there. Now, I'm going to introduce a slight extra for the syntax here
which Python has, which is--where it's a little troubling like the backslash, it could be
interpreted at different levels like maybe Python or like a Java--or--it might get taken
out by the language. So, without talking too much on this, I'm just going to say--Python has
an option called a raw string where you put a lower case
R to the left of the leading quote. And what the lower case r means is, it says, "Do not do any
special processing with backslashes. Whatever I type, just send it through absolutely raw
and un-interpreted." This feature--I mean, it's a little bit obscure, but it happens
to be very useful for writing regular expressions because it frees us from having to worry about
layers of backslash processing. So, in fact--even though I've done my examples so far without
the R, I'm just going to use the lower case R for all of my examples from here on out.
So, I just don't have to think about it. So, in this case, let's just try it. Yes, so then
it's able to find this--you know, so it's matching now. So, that's an L with the dot.
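A short sketch of the escaped dot and the raw-string prefix. The sample strings are hypothetical; only the backslash-dot idea and the r prefix come from the talk.

```python
import re

# In a raw string the backslash is passed through to the regex engine untouched.
raw_len = len(r'\n')     # two characters: a backslash and an n
plain_len = len('\n')    # one character: a newline

# A backslash before the dot removes its special meaning.
escaped_hit = re.search(r'c\.l', 'abc.lmn').group()

# Without the backslash, the dot matches any character.
unescaped_hit = re.search(r'c.l', 'call').group()

# With the backslash, 'call' no longer matches because there is no literal dot.
escaped_miss = re.search(r'c\.l', 'call')
```

Using the r prefix on every pattern, as the talk suggests, means the pattern you type is the pattern the re module sees.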
So that, you know, that will enable you to talk about a dot specifically. All right, so the--where
we got so far, so it goes left to right, dot matches any character, and all of the pattern
has to be matched but the text, you know, we don't care. You don't have to get all the
text. So, let me show you some slightly more scary example. So let's say in my text here
I've got--you know, there are some text that I don't care about and I'm going to say, you
know, ":cat" and then there's more texts. So let's say I want to pick that part out.
So, the next sort of regular expression code I'm going to talk about is "\w." So "\w,"
which is actually what I have up here, "\w" matches what you called a word character.
So that means a letter or a digit and I think it also includes underbar. So in this case,
I'm going to say, well, let's say I'm looking for a colon followed by three word characters.
So that's going to work okay. So, now, I could say--like, similar to the "\w," there's "\d," which matches
a digit. So, for example, if I say cat and I'll say like ":123."
>> Between the "\w\w\w" and "...." >> PARLANTE: So, "..."-- it's an excellent
question. The question is what's the difference between "..." and, you know, three "\w?" The "..."
is just any character, like, it could be a space, a colon, anything. The "\w" matches
a character so long as it looks like a word character, like, A through Z, zero through
nine, underbar. Now, if this was a Unicode string, the word, it's smarter about being--there
is a basic notion of an alphabetic character versus, essentially, punctuation. Oh, yeah,
question. >> Zero through nine, is that true, the word
character? >> PARLANTE: Yeah, a word character includes
digits zero through nine. So, I mean, you know what it's a little bit like is usernames,
right? You know, blah-vadi-blah 123, you know, the username. That will show up in the later
example. Okay. So, digit is little bit similar. Where is my cat example going? Oh, well, suppose
we'll do it again. So, if I were looking for, you know, have blah, you know, ":123xxx,"
let's see. So I could look for--actually, here, I'll just look for--I'm just looking
for three digits in a row. I can write that as--I'm going to put the r also here. So,
that pulls out the 123. Now, it happens there's--so you can see there are sort of these different
regular expression codes: "\w," "\d." You know, they kind of represent sort of common cases.
I'm not going to show you all of them, but I'll show you the ones that I think are
the most useful and, you know, ones that kind of show up in the problems we're going to do
later. All right, let's say--last one I'm going to show here is whitespace. So, suppose
I'm looking for like this pattern. It's like, well, I want some digits and they're separated
by spaces. So the simplest way you do that is a "\s" represents a whitespace character.
And the "\s" is smart that it knows about space, tab, new line, those all count as a
whitespace character. It knows about the whole sort of space, the whitespace characters.
So, hopefully, that will work so that finds it.
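The three codes covered so far can be sketched together. The sample texts loosely follow the ones described in the talk but are assumptions, since the exact strings are not in the transcript.

```python
import re

# \w matches one word character: a letter, a digit, or an underscore.
word_hit = re.search(r':\w\w\w', 'blah :cat blah blah').group()

# \d matches one digit.
digit_hit = re.search(r'\d\d\d', 'blah :123xxx ').group()

# \s matches one whitespace character (space, tab, newline, and so on).
space_hit = re.search(r'\d\s\d\s\d', 'xx 1 2 3 xx').group()

# A tab counts as whitespace too.
tab_hit = re.search(r'\d\s\d', 'xx 1\t2 xx').group()

# Exactly one \s between digits does not match two spaces in a row.
two_space_miss = re.search(r'\d\s\d', 'xx 1  2 xx')
```

The last line shows the limitation raised in the question that follows: a single `\s` stands for exactly one whitespace character.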
>> One, two... >> PARLANTE: Yeah, so look--so the question
is if you had two spaces. Just hold that thought for a second because I'm about to--we're about
to get there. So, so far, I haven't done any repetition. I've just had, like, you know,
fixed numbers of things. So the--probably, the most powerful part of regular expressions
is that there are these modifiers, plus and star. So plus to the right of something means
one or more of that, and star means zero or more. So, I'm just going to do that
with my digit example. So the question is, like, what if I got these three digits--whatever.
There's just some amount of space in between them. So the way you would say that in regular
expression is, like, put a plus to the right of that "\s." And that means--yeah, one or
more--that element repeats. There's just one or more of those. And I'll do it with this
one as well. So, here, turn there. So now that matches. So, adding the plus and the
star, and I'll do a bunch of examples with these, that's exactly what we want to start
matching more complicated patterns. Also, remember how I was saying how per character
regular expressions, I think, are pretty much the densest language that any normal person
would use. And, like, look at that little--okay, look at that little bit of code, right? That
really means something, right? Every character in the order--I mean, it's all really pretty
significant. So, it's getting--I was about to say it's going to get worse. But what I
meant to say is it's about to get even more interesting. Whenever a professor uses the
word interesting, you always know you're in big trouble. All right, so let me do--I want
to use those plusses a little bit. So, let's say I'm looking for--I've got this random
text and I'm looking for a colon and then, you know, let's say, a word character. You
know, I'll use kitten instead and just more texts here. Now earlier, I've said, oh, you
know, I kind of knew how long the word was, but that was kind of a ridiculous assumption.
The more typical way to do this would be to say, well, there's a colon and I'll say, "And
then, there's just some number of word characters." So, I would write that as "\w+." That's the
much more typical way, right? Like, you have some, you know, colon or something that sort
of starts, and then you're like, "Yeah, whatever," then just take all the word characters from
there. So, if I write it that way, then it will like--it just picks up the kitten part.
So, that is a--beginning to look a little more the way these things actually work.
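A brief sketch of plus and star as described above, using made-up sample texts.

```python
import re

# \s+ means one or more whitespace characters between the digits.
spaced_digits = re.search(r'\d\s+\d\s+\d', 'xx 1   2      3 xx').group()

# \s* means zero or more, so adjacent digits also match.
adjacent_digits = re.search(r'\d\s*\d', '12 3').group()

# A colon followed by one or more word characters; the space ends the match.
kitten = re.search(r':\w+', 'blah blah :kitten blah blah').group()
```

The `:\w+` form does not need to know the word's length in advance, which is why it is the more typical way to write this kind of pattern.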
>> A word character... >> PARLANTE: Yeah, so the space is not a word
character. That's what's making it stop there. So, all right--and actually there was a question
before, like, does it include digits and so forth? What if there was like "kitten123?"
That still works. But if it's like "kitten123" and I--except I add a character, like,
let's say ampersand, then it stops at the ampersand. So that's the thing, so what the plus
does is it--the plus is greedy. It goes as far as it can and then it stops. So, there's
just kind of the mnemonic for regular expressions: it finds the leftmost solution, the first
one, and the largest solution. So, with the pattern, when there's a plus or a star,
it will just go as far as it can. Yeah, question. >> So period-plus will take you all the way
to the end of the line? >> PARLANTE: Oh, yeah, so the question is
period-plus would take you--well, it's not a question. It's actually a suggestion. So,
I'm going to rephrase it as a question. What if I said period-plus? And the answer is as you
said. Like, yeah, it just goes all the way to the end, right, so period matches dot,
ampersand, everything, okay, except for [INDISTINCT]. All right, yeah.
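The greedy, leftmost-and-largest behavior can be sketched as below. The sample texts are assumptions modeled on the kitten example.

```python
import re

text = 'blah :kitten123& yatta'    # assumed sample text

# \w+ is greedy: it takes letters and digits, then stops at the ampersand.
word_run = re.search(r':\w+', text).group()

# .+ is greedy too, and the dot matches almost anything, so it runs to the end.
to_end = re.search(r':.+', text).group()

# By default the dot does not match a newline, so .+ stops at the line break.
one_line_only = re.search(r':.+', 'blah :kitten\nnext line').group()
```

The third search shows why period-plus goes to the end of the line rather than the end of the whole text.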
>> When you say largest, do you mean that if you say kitten123, 123, you will find the
whole thing? >> PARLANTE: So, if I say--you mean here if
I say kitten123, 123? And I'll go back. So, this is "\w+." Okay. What do you think it's
going to do? All right, yeah. So it will go through both 123s, and then it will stop because
the space is not a word character but digits zero through nine are word characters. I mean,
it's a bit of a misuse of the word "word," right? I mean, to a normal person, a digit is not
really a word character. But compared to, like, ampersand, it's a word character. All right,
so one more code I'm going to show, which I'll just type in here, is backslash uppercase
S is a non-whitespace character. It's kind of like the opposite. And I--I'm a little
saddened that whoever designed regular expressions chose to have uppercase and lowercase mean
something different because it just makes it a little bit confusing. But backslash upper
case S is really pretty darned handy. So, let's say, for example, I knew that it was
"kitten123&a=123&," you know, yatta, whatever. It's all this junk and then there's a space.
And I want to write a regular expression that picks up all of that and the ease--but I don't
know. It's not just word characters, you know what I mean. It includes all sorts of stuff
but at least there's not spaces in it. And a common pattern is if I write that as there's
a colon and then there's just some series of non-whitespace characters, and then they'll
just sort of terminate with the first whitespace character. So that'll--that just catches the
whole thing even though, like, the Lord knows what sort of characters are in there. So, just
as a practical matter, that backslash upper case S is potentially a handy way to catch
that stuff. All right, so let me show you--so those are all the--so the plus, the
star, and those backslash codes, those are the ones I want to build on. Now, let me--so
far my examples have been like, you know, a little bit limited. I want to--now, I'm
going to do an example with e-mails, I mean, how to build it up and hopefully I'll show
you pretty practical patterns you can use. All right, so I'm going to make up some texts
here. I'll keep the "blah." So, let's say we are looking for, you know,
and then there's more junky text and let me just add an @ sign just by itself. I'll just
leave it that for now. All right, so the problem I want to solve is pulling e-mail. I want
to imagine. I've got this big body of text and I want to pull e-mail addresses out of
it using regular expressions. So, the--I'm going to try--first I'm going to try to write
this as "\w+" and then there's an "@" sign and then there's "\w+." It's kind of, you
know, a plausible first shot at this. So, if we run that, so what's happened here?
So what's happened is, well, you know, it finds--there's the "@" sign. It's the "p,"
but it can't go further left in that because the dot does not count as a "\w" character.
And then likewise, it gets Gmail over here but then it's confounded by the dot. So what I
want to say with the pattern is I want to sort of expand--it's not just word characters,
really it's word characters plus some other stuff. So, in regular expressions, there's this
very old syntax for indicating a set of characters and it's going to use the square brackets.
So, inside of those square brackets, I can put "Well, here's the set of characters that
I'm going to allow here." And actually the "\w" works inside of the square brackets because
it's just a common case. So, what I want to say here is, well, "\w" or let's say dot,
and then let's say--well, just leave it at that. So, the question is--yes, that's a very natural
question. That dot--it happens, you don't have to backslash that one, that it understands
that the dot inside of the square bracket is just the dot.
>> "\w" is included in that? >> PARLANTE: No, so "\w" word character is
just A through Z, zero through nine. It just not--oh I'm sorry, I'm sorry. That, no, that
dot there means literally a dot. It doesn't match any character.
>> Because it's in the square bracket. >> PARLANTE: Because it's in the square bracket,
it means literally a dot. I mean, it's--I mean it's sort of what's going on here. I
mean, you could do work on your Ph.D. and the text or something that it's kind of levels
of quoting. It's kind of what's going on here. And it's--it is a necessary complexity to
talk about this kind of thing. Let's all just see what dot does. All right, so you can see,
you know, you can sort of see that--so that picks up its word character and then it stops
at the space essentially, right? So, it is--it's not really a dot. I'm sorry. It's not a regular
expression dot. It's just a regular dot. All right, so I'm going to fix the other side
as well. So, I'll put a square bracket over here and I'll put the dot in there, oops,
and the plus goes outside, right, I'm saying that whole set repeats. All right, so that's
kind of fixed it. So, the square bracket is probably the most convenient way if there
are some set of characters that you're looking for, you know, you kind of build, let's say,
well, yeah, here's I'm looking for. What if I had a dot for Nick? I'm sorry, dot, you
mean, like a he... >> Text.
>> PARLANTE: Oh, it would just--it would pick it up. I mean, we've said, you know, we've
said to the left of the @ sign just, you know, as many of these as possible so it'll--now,
suppose we wanted to say that the first character can't be a dot, it must be a word character,
can you think of a way in the pattern we could say that? Yeah, I could have a single "\w."
I would say there must be a word character and then it's followed by one or more of the
thing that includes a dot. That'll be it. Although, then to be super--I think, really
correct, then we should change that one to a star, right? Then there must be one word
character and then zero or more of these. >> [INDISTINCT] mostly inside the brackets
the order doesn't matter in the... >> PARLANTE: Yes, once it--exactly, yes. So
once it's inside the bracket, the order doesn't matter or, I mean, try to use the word set,
it's a set of characters. All right, so let's try that. Yeah, so then that refines them.
All right, so anyway, I mean--yes, it is a sort of a bottomless topic. I mean, you know,
there's--but, I mean, hopefully, I'll show you stuff that are useful. All right, so I've
got my e-mails example, so that's the first thing. So that's just using group, right?
All I've been doing there is just using group. So now, what I'd like to show you is I'm going
to stop using my find function. I'm going to start doing this raw here. And what I'd
like to do is I want to imagine that I want to pick out the username and the hostname
separately. I want to sort of pick those out. And so, just go back here. I'll just change
this to "" so I'm just doing it manually again. And you can do this--I'm going
to change this back to just the regular way here. By putting parenthesis in the regular
expression around the parts that you care about. Now, the way I'm doing it here, the
parentheses are not changing what it's going to match. I'm just kind of putting in those
mark up of saying, "Well, these are the two parts that I care about." And here I'll get
rid of this dot. So, if I do that, right, so I put parentheses around the part that
matches the username and parentheses around the part that matches the host and the @
sign I've just--I've done, I don't care about. So, now, I've done this. If we look at "m,"
it's a match object. If I say, "," it's the whole thing, like, just like we've
always been doing. But there's also a form of the group where you pass it a number.
So, if I say, "," that's now just the username part. And the "1" refers to the
first set of parentheses. So if you count the parentheses--and it goes by the left parenthesis
because you could actually nest them. So the "group(1)" refers to the leftmost parenthesis,
which if you look up here, there's, like, yeah, that's that guy. And then, ""--oops,
you can just guess, "group(2)," that's now the hostname. So, a more--the way this is
going to work for, you know, some problems with regular expressions is a lot of times you'll
write a regular expression for the thing you're kind of looking for, right? Well, I'm looking
for a URL, or an I.P. address, or something, and then you'll maybe put parentheses and
then say, "And here's the part that I want to extract." Then you'll call ","
you'll get this match object and then you'll use "group(1)" and "group(2)," whatever, to
just kind of--it's already, you know, parsed it for you. You'll just pull out the parts
that you want as text and then process from there. Yeah, question.
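A minimal sketch of group extraction with parentheses, again using a hypothetical address in place of the one from the session.

```python
import re

text = 'blah alice.b@example.com yatta @ '    # hypothetical sample text

# The parentheses do not change what matches; they mark the parts to extract.
m = re.search(r'([\w.]+)@([\w.]+)', text)

whole = m.group()      # the entire match, same as before
user = m.group(1)      # text matched by the first left parenthesis
host = m.group(2)      # text matched by the second left parenthesis
```

Groups are numbered by counting left parentheses in the pattern from left to right.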
>> If I put a plus or a star after one of these [INDISTINCT] the first parenthesis and
it matches that twice, is that "group(1)" and then "group(2)" or is it still a group...?
>> PARLANTE: Yeah, so the question is if there's a plus after the parenthesis, you know, does
that change how the group number works? The answer is no. The group numbering is based
on just statically looking at the pattern as an unchanging thing and just counting the
left parenthesis from--going from left to right. So that is the shortest answer I can
give there. All right, so I've got--I want to--so "" is my second favorite Python regular
expression function. My absolute favorite one--actually, let me make my data
a little more complicated here. I guess I'll also add a "foo@bar." My absolute favorite
regular expression function is called "findall." So, I'll just say "findall" here. And what
"findall" is going to do is I've just still got a pattern, and now I've just changed my
text to--you know, I put a second e-mail address in there. What "findall" does is it just takes
the pattern and rather than just stopping at the first match, it just continues and
it just finds all of the matches and it returns them to you--it returns to you essentially
the ".group," right, the whole text, just as a Python list of strings. So, for
example, we talked about for a file how you could just say "" to get the entire
text as one string. So a pattern I really enjoy, because it just saves me so much work,
is I just call "re."--I just call "" and I pass that in as the second argument
to a "findall." I just feed the entire file into an "re.findall." I have a pattern, I
just let it rip through the entire text, skip new lines, whatever, all that stuff it
just handles, and it just pulls out the things I want and just returns them to me as a Python
list. And then, I can--you can write a for loop, you know, stuff we were doing yesterday.
Now, you can just process this list. So that is--that's really how you use this stuff.
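A sketch of findall on a string, plus a small helper that feeds a whole file through it. The addresses, the constant name EMAIL_PAT, and the helper name find_emails_in_file are all hypothetical and chosen only for illustration.

```python
import re

EMAIL_PAT = r'[\w.]+@[\w.]+'    # hypothetical name for the pattern built above

# Hypothetical text with two addresses and a lone @ sign.
text = 'blah alice.b@example.com yatta foo@bar @ '

# findall keeps going past the first match and returns a list of strings.
emails = re.findall(EMAIL_PAT, text)

def find_emails_in_file(path):
    """Read the whole file as one string and return every match in it."""
    with open(path) as f:
        return re.findall(EMAIL_PAT, f.read())

# The result is an ordinary list, so it can be processed with a for loop.
hosts = []
for email in emails:
    hosts.append(email.split('@')[1])
```

The lone @ sign is skipped because the pattern requires at least one allowed character on each side.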
So, in this case--so notice I took the parentheses out. So I just left this as a simple pattern.
So I just got, you know, I just got the simple matches. There's this one other variation
I can do here. What do you suppose is going to happen if I put the parentheses back in?
I'm, like, well, this is, you know, it's not really--it's a pattern but it has this grouping
in it. And, yeah, we have tuples. What it's going to do is if there are parentheses in
there, instead of just returning the whole match, it says, "Oh, well, there's two parens.
I'm going to return tuples of length two." So each tuple represents a single match and then
the tuple just has the groups in there. So, that--yeah, you can see where this can be
pretty handy. If you got some big file, you just want to kind of--there is some part
of it you care about, you just want to rip it out as lazily as possible, so "re.findall."
>> You'll lose the format. >> PARLANTE: Excuse me?
>> You'll lose the format. >> PARLANTE: Yeah, I mean, I would say--so
let's say you lose the format. Let's say, well, the regular expression is narrowing.
You get to say what you want to keep. And so, if you want to keep more, you know, write
the regular expression bigger, you know, to keep one. All right, so that's, you know,
not hard to imagine how we're--it's going to be easy for you to work that in to--doing
stuff later on. So, I'll just mention, there are some optional arguments that you can add
sort of here as a third option to the regular expression. And what I'm going to actually
do, I'm going to do a "dir" on re. So that's the re module. I can say, "Oh, yeah, hey,
what are these symbols in here?" So these are some constants. So, if you add the constant,
"IGNORECASE," to your--this works on ".search" or ".findall." That means that it'll consider
upper and lowercase the same. So a lowercase "l" matches an uppercase "L" and vice-versa.
You can do the "DOTALL." I had said that the dot matches any character except for newline
and that's kind of historical thing because the processing tended to go line by line.
If you add the "DOTALL" flag, then the dot will match newline as well. And so, you could--because
right now, if you use dot, your pattern can't span more than one line. Although,
if you use "\s" where you think there's a newline, that'll span the line but the
dot will not go over one. So, if you mean--if you add "DOTALL," you can turn off that behavior
and that will just truly match anything. So, if you were to say ".*" with nothing
else, it would just go to the end of the file. So those are, I think, the most common ones
to use there. >> [INDISTINCT] at the end of, like, [INDISTINCT].
>> PARLANTE: Yeah, so let me--I'll give you an example. So the way you would use those
constants is as a third argument. So you would say, "re.IGNORECASE" or whatever.
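A short sketch of both flags passed as the third argument (Python 3; the sample string is made up for illustration):

```python
import re

text = "Hello\nWORLD hello"

# Without a flag, only the exact lowercase spelling matches.
plain = re.findall(r'hello', text)
# With re.IGNORECASE, upper and lowercase are treated the same.
any_case = re.findall(r'hello', text, re.IGNORECASE)

# By default the dot stops at a newline, so '.*' ends at the line break.
one_line = re.search(r'Hello.*', text).group()
# With re.DOTALL the dot also matches newline, so '.*' runs to the end.
whole = re.search(r'Hello.*', text, re.DOTALL).group()
```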
That's the last argument there. All right, and so the, you know, a couple--the handouts
for today, the first one--if you didn't get one, you can get one in a second. The, you
know, there's a nice, you know, an explanation of regular expressions and it shows the syntax
and, you know, a lot of kind of stuff there or whatever we're doing here. All right, so
I think we're ready for an exercise. So the exercises today are going to be a little bigger.
There's three of them I'd like to do today and, you know, kind of incorporate all those
sort of stuff. So let me demo this first one. So, this first one is going to involve a brief
foray into a little understood part of the government called the Social Security Administration.
Now, the Social Security Administration in my life experience is in-charge of putting
certain fields on everyone's paycheck that no normal person understands and it just caused
you [INDISTINCT] just not that was going on, but they do this other thing. If you do a
Google search for Social Security Administration baby names, they do this thing where they
keep track of what the popular baby names are for babies born in that year. And they'd
been doing that actually for a hundred years. So you could look at 1900, 1950, whatever,
you can just see--it turns out for baby names, there's sort of a--there's a popularity of
it. The names sort of kind of ebb and flow. So, I look at this and I
see an assignment idea. So, let's go to--oh, I don't know, 1980. I'm not even going to
try and think about when you guys were all born. Do you know what I'm saying? And let's
just go with, like, I don't know the top thousand. So we get those, like--you'll look at this
and you're just thinking like, "tr," "td," that kind of thing, you know, good thinking.
All right, so, here's just for 1980. The list of baby names, and what this is saying
is that for boy children, Michael was the most popular name. And then, next most popular
was Christopher, Jason, David, and so on. And over here, we have Jennifer, number one;
Amanda, number two; and it just goes down to, like, you know, here we have, you know,
Bobbie, Emil, Jermain, Kraig with the K, down to the, you know, the less popular names.
All right, so what I would like to do--going back to Python here, let's see what I have.
Okay. So I'm going to go into "day2" here, and there's a directory, baby names. So, if
I look inside here, let me look at "baby1990," I've pulled this sort of--I just sort of copied
and cleaned up just a tiny bit the text from Social Security Administration site. I put
this very [INDISTINCT]. Okay, well, there's some partly written CSS and whatever. And
then eventually, there's--here's the "h1" and then here's a table blah, blah, blah,
and at some point it's going to say--all right, here we go. So here's the "h3." This is popularity
in 1990, or as I like to think of it, popularity in "\d\d\d\d." And then, there's some whatever
junk you want to skip and then here's a "tr." And then here, there's the "tr." And if you
don't know HTML but whatever, that's the HTML for that first row. So it says, "tr," "td,"
and then there's the number one. And then there's more "td" stuff and there's Michael,
and there's Jessica, and then here's row two, and so on. And it just goes on, like, there's
all the data. It's beginning to look like an actual problem. All right, so, the first
thing I want your baby names program to do is given a file, like, "baby1990.html"
and I'm going to pipe this into more. What I want you to do is I want you to rip through
that entire file. I want you to figure out what year it represents. I want you to pull
out all the names and all the ranks. I want you to organize it so that you can--then produce
a printout that's just in alphabetical order by name. So just as I've shown here. So you say--so
the first--first you print the year and then I want to see Aaron 34, Abbey 42, and so on.
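One possible shape for that extraction, as a hedged sketch in Python 3 rather than the official solution. The HTML snippet below is an assumption about the markup (only the year heading and the Michael/Jessica row come from the file shown; rows 2 and 3 are invented), and "extract_names" is a hypothetical function name. If a name shows up twice, this sketch keeps the smaller rank number.

```python
import re

# Assumed markup; rows 2 and 3 are invented to illustrate a repeated name.
sample_html = """
<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Jordan</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Aaron</td><td>Jordan</td>
"""

def extract_names(text):
    """Return [year, 'Name rank', ...] with names in alphabetical order."""
    year_match = re.search(r'Popularity in (\d\d\d\d)', text)
    if not year_match:
        raise ValueError('no year found')
    # Each row yields a (rank, first name, second name) tuple.
    rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)
    ranks = {}
    for rank, first, second in rows:
        for name in (first, second):
            # Keep the smaller (more popular) rank for a repeated name.
            if name not in ranks or int(rank) < ranks[name]:
                ranks[name] = int(rank)
    return [year_match.group(1)] + ['%s %d' % (n, ranks[n]) for n in sorted(ranks)]

result = extract_names(sample_html)
print('\n'.join(result))
```

Returning a list instead of printing inside the function leaves the caller free to decide where the output goes.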
So you just show an alphabetical list, here's what all the names are. So that will get
us through the... >> Combine male and female?
>> PARLANTE: Yeah, so what's going to happen is there's a strange case but sometimes a
name will appear as both male and female. And I'm not making any distinction male from female.
So in that case, I want you to give it the more popular, essentially the smaller number,
whatever the smaller number is, all right? So, the--oh, let me talk--I'm going to go
back to Python for just a second. There's something I mentioned I think maybe very briefly
yesterday but it's about to come up, which is we did file opening, right? So, "f=open(filename,"
so I'll remind you, if you want to write a file, if you want to do it for writing, then
for that second argument you pass a "w." So, yesterday we just did "r." We just read the
file, so that's fine. So, you put a "w" there for writing. And in that case, then probably
the simplest way to write to the file is then it has a ".write" and then you could just
have, you know, whatever kind of text you want in there. And so, you've got to be careful,
you can zero out a file here, but it'll write that text. And so, this is [INDISTINCT] as
well. That's about to come up. All right, so that is--so part A is I want you to just
pull out all the names, you know, use a regular expression, findall, maybe a dictionary, I
mean, just total regular work. So for part B, what I'm going to do is there's an option
called "--summaryfile." And I'm going to run this on as star or I'm going to say "baby*.html."
In that case, I want you to produce no output. What did that just do? When the "--summaryfile"
option is given, what I want you to do--oh, notice in this case, I ran it on "baby*."
So I ran it on all of the baby files, right? So the shell just expands out, so in that
case "argv" is going to be all of them. If "--summaryfile" is given, like I've just done, what
I want you to do is for each file I want you to read it and I want you to create a new
file with the same filename but ending in ".summary," and then I want you to take that
output that we were printing to the screen earlier and I want you to write it to that
file. Now, there's a little bit of a trick here which is I've shown. So, for example,
when you have a low level function, you don't necessarily want it to print to standard
out directly. You want to have a function, like, given the file, returns to you, say,
a Python list or a dictionary or something. And then the code that got that data structure
can choose what to do with it. It can either print it to standard out, or it could print
it to a file. So you've got that [INDISTINCT]. I'm sorry. Certainly that technique will come
up trying to solve this. All right, so once you've got that, then you can do something
kind of neat. So, when I've got these ".summary" files, and what's happened is because I've done them
over a decade, you can sort of see patterns, right? So this is going in increasing order
by year. So my name--the "Nick" format list was, like, not looking so good and it's getting
worse. Now, well, there's a lot of interesting data in here. So you can, you know, interesting
data makes for fun as what I'm trying to think. Here's probably the funniest part of this
thing. There's a "Trinity" and the question is, "In what year did the movie the Matrix
come out?" And, yeah, there's another few things that you could do here. So, like, well,
maybe the Matrix was reacting to a social phenomenon, or it's the other way around.
It's all very complicated. >> PARLANTE: Yeah, yeah, [INDISTINCT] there's
a New York Times Magazine [INDISTINCT]. Anyway, it just turns out this entire topic of baby
name popularity is sort of very interesting and at least now you're--and you get to do
the, like, the nitty-gritty work of actually pulling out that data. All right, so here's
what I'd like you to do, work on this and then--that will get us to--and then have lunch
and I'd like you back here at, let's say, 12:45, all right? So a little bit of coding,
a little bit of lunch, and then back here, all right, go.
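For reference while working on part B, here is a rough sketch (Python 3) of the print-or-write split described above. The function names "names_in" and "report", the simplified pattern, and the one-line sample file are all assumptions for illustration; the demo writes into a temporary directory so it does not touch real files.

```python
import os
import re
import tempfile

def names_in(text):
    """Hypothetical low-level function: returns a sorted list, prints nothing."""
    return sorted(re.findall(r'<td>([A-Za-z]+)</td>', text))

def report(filename, summary):
    """Print the extracted names, or write them to filename + '.summary'."""
    with open(filename) as f:
        output = '\n'.join(names_in(f.read())) + '\n'
    if summary:
        # 'w' creates the file, or empties an existing one before writing.
        with open(filename + '.summary', 'w') as out:
            out.write(output)
    else:
        print(output, end='')

# Demo in a throwaway directory with a made-up one-row file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'baby1990.html')
with open(path, 'w') as f:
    f.write('<td>1</td><td>Michael</td><td>Jessica</td>\n')
report(path, summary=True)
with open(path + '.summary') as f:
    written = f.read()
```

The caller, not the extraction function, decides between the screen and the ".summary" file, which is what makes the same data usable both ways.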