Search 101


Uploaded by Google on 25.10.2007

Transcript:

MALE SPEAKER: Good afternoon, my dear
colleagues, dear friends.
It's my real privilege and pleasure to welcome here in
the Czech Technical University Mr. Douglas Merrill, who is
currently a vice president of Google.
You may see that Google is just rolling over the Czech
Technical University, because we have got an excellent
opportunity in April when Vinton Cerf was here and he
was speaking about the internet on Mars, while
Douglas is going to contribute on the
search side of the Universe.
So it means his lecture is about the search
possibilities, and I think he is going to be quite excellent
in this field, because he's not far away from your age who
are sitting over here, and he's of your mentality.
And I think the main point is, please, this lecture will be
around 45 minutes.
Then, of course, it is expected at least hundreds of
questions rising.
Please write them down on the paper to smooth the process,
and send them to the girls who will be going down and up.
So please do it in this way.
And as well, if there is some presence paper, please sign
it, because it is fine to know who is really interested in
such a field.
And in fact, I'm not really to be upon your time, Douglas.
It's your floor, and possibly even your microphone.
You have got it.
So the floor is yours.
DOUGLAS MERRILL: Thank you very much.

Hi, thanks for coming.
It's a great honor to get to come to talk to a university
that's 300 years old about a little tiny company that was
founded eight years ago, nine years ago by two crazy
graduate students.
And Stanford University, where Larry and Sergey were
students, has a bunch of classrooms that
look just like this.
And so I guess my deepest hope is that the next Larry and
Sergey are sitting in the audience right now, and will
be inspired by something stupid that I say during this
lecture to go out and prove me wrong.
So that's my challenge to all of you.
Find what I say that's wrong and fix it.
Thank you so much.
My name is Douglas Merrill.
I'm a Vice President of Engineering at Google.
So just for those of you who are in the front, I recommend
that you loosen up your neck a little bit, just relax.
I pace a lot.
And so down here, you guys, you're going to
get a little seasick.
It's OK.
If you feel seasick, close your eyes and
breathe, it's OK.
Up there, you guys are going to forget who I look like.
It's all fine.
You cannot see me up there anyway, so it's irrelevant.
This is Alex.
Alex is right now having a nightmare that he has a test.
Do you guys all have that nightmare where you're in the
front of class, and you have a test,
and you haven't prepared?
Alex is unprepared.
Next slide.

Well done, sir.
So Larry and Sergey met at Stanford in the computer
science school.
They were both students in an information theory class in
about 1998.
And they didn't like each other.
Larry thought that Sergey was argumentative, and Sergey
thought that Larry was arrogant.
They were probably both right.
However in their class project, they came up with the
idea to try and apply some basic principles of
information theory to unstructured web search.
Now, it's 1998.
Keep in mind at the time, web search is a solved problem.
Everybody knows how to do search.
There's no questions left to be worked on.
So these guys said, oh, but wait, there is.
And they're really kind of interesting questions.
And they set themselves a goal to organize all the world's
information and make it universally
accessible and useful.
That's a kind of a small goal.
These guys didn't shoot high.
All the world's information, universally
accessible and useful.
What I want to talk about today is a little bit of what
the web looked like in 1998, before most of you were born,
and what it looks like today, and what we think it's going
to look like in the next 10 years.
Next slide.

In 1998, there were a couple of dominant search engines.
Neither one of them exists anymore, don't worry about
their names.
And they knew how to do search.
Here's what they did.
They had these people who sat in these big rooms, kind of
like this, with computers in front of them, kind of like
all of you.
And they were surfing the web-- kind of like all of you.

And what they would do is they would find a page and they
would read the page.
And they would say, oh, you know what?
This particular page is about soccer.
I'm American, I know you guys call the game something else.
Sorry.
And they would have this little toolbar that they would
pull down and they label "soccer." And then that page
would be indexed.
And they knew that this was going to work.
They were wrong.
They were wrong because the world changes too fast. On
average, 10% of the web changes every month.
Here's the interactive portion, boys and girls.
If 10% changes every month, 10 times 12,
carry the one, right.
Likely everything changes every year, which means that
these poor horrible people with this awful job of surfing
the web and marking down each page have to look at every
page every year.
Additionally, the web is doubling at this point in
history about every four or five months.
So twice a year or so you've doubled the size, everything
that's already existed has changed at least once, and
keep in mind that it turns out that the web's
not entirely in English.
Who knew?

So now you have to have rooms full of people who speak all
these languages.
Not a scalable model.
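
EDITOR'S NOTE: The arithmetic behind the scaling argument can be made explicit. The sketch below is not from the talk; it assumes each page independently has a 10% chance of changing in any given month, and that the web doubles every four to five months, as stated above. The function names are made up for illustration.

```python
# Assumption: every page independently has a 10% chance of changing per month.
MONTHLY_CHANGE = 0.10


def fraction_changed(months: int, monthly_change: float = MONTHLY_CHANGE) -> float:
    """Share of pages that changed at least once within `months` months."""
    return 1.0 - (1.0 - monthly_change) ** months


def growth_factor(months: int, doubling_months: float) -> float:
    """How many times larger the web becomes after `months` months."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    print(f"changed within a year: {fraction_changed(12):.0%}")
    for doubling in (4, 5):
        factor = growth_factor(12, doubling)
        print(f"doubling every {doubling} months: x{factor:.1f} per year")
```

Under the independence assumption a bit more than two thirds of pages change within a year rather than literally all of them, but the conclusion holds: human labelers would have to revisit most of a web that is also several times larger than it was twelve months earlier.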
Next slide, please.
Really not scalable today.

So this slide is more shocking to Americans, because it turns
out that Americans think that no one else
in the world exists.
You guys have all heard the joke, right?
If you speak three languages, you're trilingual, if you
speak two, you're bilingual, if you
speak one, you're American.
[LAUGHTER]
It turns out most of us aren't American.
So the approach used by the search engines in 1998 would
not have gotten us today.
They would not have gotten us here.
What got us here was an insight that Larry--
mostly Larry had, but Larry and Sergey had together--
called Page Rank.
So what does Page Rank do?
Page Rank allows you to figure out whether a particular web
page is interesting or not.
That makes sense.
Is this particular page useful?
So it's called Page Rank.
Obviously it's named because you're ranking web pages.
No, Larry named it after himself.
Larry's last name is Page.
What is today's lesson, boys and girls?
Computer scientists are not funny.
Next slide, please.

In a second, I'm going to talk about how the
stuff actually works.
I'm hoping that's more interesting to you.
But first, I want to pull back a little bit, and I want to
talk about the context.
So I mentioned that web search is about more than web pages
in English.
It matters a lot to actually understand the context within
which you are working.
So for example, if you do a search for BMW on google.cz,
you ought to get different results than if you do a
search for BMW on google.com.
And indeed you will.
We'll recognize that probably you want to go to
the .cz site instead.
Part of our ranking signals are more
than just Page Rank.
It also is about the context from which you come.
Next slide, please.

Google publishes--
[LAUGHTER]
I'll give you a second to enjoy the list. The previous
slide was called "Being Local Matters." It turns out it only
matters in certain regards.
So we publish a list called the Zeitgeist. The Zeitgeist
captures the most actively growing and most popular
queries, and we do it by country and by language and a
bunch of things.
And there's a couple of truths.
Apparently, they are universal.
The most popular search in every country
is a beautiful woman.

And apparently game shows and television are also pretty
popular with everyone.
Prison Break, for those of you who don't know, is a really
bad American television show.
So fundamentally, it matters, if you're going to do search
right, you need to understand that the web is growing too
fast, it's changing too fast, and it's not all in English.
So the lesson that I want to talk about is,
how do we do that?
And I'm hoping, again, to reiterate that you guys-- one
of you, or two of you, or 10 of you-- are going to hear
something that I get wrong.
And you're going to say, hey, I have a better
idea and go try it.
OK, how does it work?
All right, let's build a search engine.
This is the first of the interactive portions of
today's talk, boys and girls.
How many of you have had to build a search engine in a
computer science class?

How many of them were any good?
Oh, good.
OK, so back in the day, when the web was first created, the
terms were all coined by Tim Berners-Lee.
And he talked about the fact that these pages were all
inter-linked like a world wide web.
What goes on webs?
Spiders.
What do spiders do?
They crawl.
Hence the term of art for finding information in web
search is called crawling.
This would be my second instantiation of how computer
scientists are not funny.
That is supposed to be a joke.
So we're going to start out by crawling information.
How does a crawler work?
Simple kinds of crawlers start from a known web page like
aol.com or pick your favorite portal.
And they go through each link, and they essentially click on
each link, and that expands to more web pages.
Each of those pages has links.
You click on each link from there, and you keep doing
depth-first work recursion until you run out of time,
space, or the Universe ends.
Crawling sounds easy, right?
That's probably, what, 10 lines of Python.
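
EDITOR'S NOTE: The loop just described really is about that short. The sketch below is an illustration, not Google's crawler: it follows links depth-first from a seed, remembers what it has visited, and stops at a page budget. To keep it runnable without a network, the function that returns a page's links is passed in, and the toy web and its URLs are hypothetical.

```python
from typing import Callable, Iterable


def crawl(seed: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> list[str]:
    """Depth-first crawl from `seed`; returns URLs in the order visited."""
    visited: set[str] = set()
    order: list[str] = []
    stack = [seed]
    while stack and len(order) < max_pages:
        url = stack.pop()
        if url in visited:
            continue  # already seen this exact URL
        visited.add(url)
        order.append(url)
        # Push links in reverse so the first link on a page is followed first.
        stack.extend(reversed(list(fetch_links(url))))
    return order


# Toy in-memory "web" (hypothetical URLs) standing in for real HTTP fetches.
TOY_WEB = {
    "portal.example": ["a.example", "b.example"],
    "a.example": ["c.example", "portal.example"],
    "b.example": ["c.example"],
    "c.example": [],
}


def toy_fetch_links(url: str) -> list[str]:
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    print(crawl("portal.example", toy_fetch_links))
```

The `visited` set only recognizes identical URLs, so it does nothing about the same content living at two addresses.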
What's hard?
Remember, everything changes.
So you've got to recrawl a lot.
How often do you have to recrawl?
If 10% of it changes every month, you have to recrawl the
natural log of 10 times the number of months since the
last time you completed a crawl.
That's a very big number.
You've got to crawl a lot, is the answer.
Second thing that's hard.
How do you know if you've already seen a page?
Oh, that's easy, right?
Take a hash of the URL.
That would work, wouldn't it?
What happens if they change the title of the page?

What happens if it's a copy of the page?
Oh right, the hash of the URL won't work.
OK, still no problem.
I'll take a hash of all the content of the page.
That will work, right?
Won't it work?
What happens if they've got a space?
What happens if they misspelled a
word in their copy?
What happens if they inserted a picture in different spots?
Naive crawlers get roughly 25 to 40 percent of their
content is content they have already seen before.
Which means that on average, you're
wasting one byte in four.
If you're crawling to the end of the world, you want those
bytes back.
Crawls are hard.
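
EDITOR'S NOTE: The failure of exact hashing is easy to demonstrate. The sketch below contrasts a plain content hash with word shingling and Jaccard similarity, a well-known near-duplicate technique. The talk does not say which technique Google uses, so the shingling approach, the window size of three words and the 0.5 threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def exact_fingerprint(text: str) -> str:
    """SHA-256 of the raw text; a single extra space changes the result."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word windows over the lowercased words of `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.5) -> bool:
    """True if the two documents share at least `threshold` of their shingles."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


if __name__ == "__main__":
    page = "the quick brown fox jumps over the lazy dog near the river bank"
    spaced = "the quick  brown fox jumps over the lazy dog near the river bank"
    typo = "the quick brown fox jumsp over the lazy dog near the river bank"
    print(exact_fingerprint(page) == exact_fingerprint(spaced))
    print(near_duplicate(page, spaced), near_duplicate(page, typo))
```

Comparing every pair of pages this way would not scale to a full crawl; the point is only that similarity has to be measured on content features rather than on an exact hash.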
Additionally, crawls are hard because how do you store the
data once you've got it?

How many of you have had a database class?
Come on, guys, I know you're out there.
I hear you breathing.
Come on!

How would you store a page in a database?
It's hard work.
Databases aren't optimized for this.
And you're not going to need to do joins.
There's no concept of query structures here.
So we ended up having to build a file system called the
Google File System.
And there's a technology called Bigtable that I'll
talk about in a second.
If you're interested, all the papers are hung off of
google.com, and they're publicly available.
Precisely to let us grab a piece of information, take
what's called a hashmap of it-- which is a hash that has
an error code in it, so that if you add spaces or move
words around, I notice it--
and then store them in a way which is redundant.
Because then you always have the operational side as well.
What happens if you lose a machine?
Crawling seems easy.
It's hard.
And it's the easiest thing on this slide.
After I crawl, remember we've got the crawl that's running
until the end of time, until you run out of space--
I can't remember the joke I made before, but rewind a
little bit.
For those of you who are surfing the web, just go find
a crawl paper.
Then you have to index everything you just crawled so
you can find it later.
What's the right index structure?
Come on, this is easy.
Come on.
It's not easy?
What's the right index structure?
You could index every single word on the page.

Easier, you could index every character on the page.
Pop quiz--
what's the most common character
in the English language?
Space.

What's the second most common character
in the English language?

You're going to end up with a lot of index entries for
space, aren't you?

OK, so you can index every character.
It's not very useful.
Why is it not very useful?
Because every single time you get a query, you're going to
have to go through and reassemble all those
characters into words and then map against all the documents.
Probably the wrong index
structure, but pretty flexible.

You could index trigrams, index three words at a time.
Douglas C. Merrill, you can index that, right?
Would that be better or worse?
Well, different.
What happens if you do a search for Douglas Merrill?
Or worse, you shorten my name, which annoys the hell out of
me, and do a search for Doug Merrill.
A trigram index is going to break because you're not going
to have that entry.
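
EDITOR'S NOTE: The trade-off between index granularities can be seen on a one-document corpus. The sketch below builds a word-level inverted index and a three-word (trigram) index and runs the queries from the example. It is a toy under stated assumptions: the document text, the function names and the exact-match lookup rule for trigrams are invented here, and nothing in it reflects Google's actual index.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def build_word_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document ids containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def build_trigram_index(docs: dict[str, str]) -> dict[tuple[str, ...], set[str]]:
    """Map each run of three consecutive words to the documents containing it."""
    index: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        words = tokenize(text)
        for i in range(len(words) - 2):
            index[tuple(words[i:i + 3])].add(doc_id)
    return index


def search_words(index: dict[str, set[str]], query: str) -> set[str]:
    """Documents containing every query word, in any order."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def search_trigram(index: dict[tuple[str, ...], set[str]], query: str) -> set[str]:
    """Exact lookup: only a three-word query can match an index entry."""
    words = tuple(tokenize(query))
    return set(index.get(words, set())) if len(words) == 3 else set()


if __name__ == "__main__":
    docs = {"bio": "Douglas C. Merrill is a vice president of engineering"}
    words, trigrams = build_word_index(docs), build_trigram_index(docs)
    print(search_trigram(trigrams, "Douglas C. Merrill"))
    print(search_trigram(trigrams, "Douglas Merrill"))
    print(search_words(words, "Douglas Merrill"))
```

The word index finds the two-word query that the trigram index misses, at the cost of losing word order; neither finds "Doug Merrill", since no entry for "doug" exists.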
If you look at all the search engines in the world today,
they all have one or more of these index structures.
No, I'm not going to tell you what ours is.
But it's in the space of somewhere between characters
and trigrams. And the index structure is going to have
huge implications on the stuff I'm going to
talk about in a second.
And so far, we're still in the easy stuff.
Then you get a query.
So you go to google.cz, you key some words into the box,
you hit enter, you get a bunch of results back.
Simple, right?
No problem.
On average, we return 10 results in 400 milliseconds,
half a second.
That's not too bad.
What's the speed of light, latency, from a query served
here, if it's served from, say, Northern California?

About 2/3 of that time.
So clearly we can't serve everything
from one data center.
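
EDITOR'S NOTE: A rough physical floor for that latency can be computed directly. The sketch below is not from the talk. It assumes the user is in Prague, the data center is near San Francisco, the signal follows the great-circle path, and light in optical fiber travels at roughly two thirds of its vacuum speed (about 200,000 km/s). The coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0
LIGHT_IN_FIBER_KM_S = 200_000.0  # assumption: roughly 2/3 of c in vacuum

# Approximate coordinates (latitude, longitude) in degrees.
PRAGUE = (50.08, 14.44)
SAN_FRANCISCO = (37.77, -122.42)


def great_circle_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Haversine distance between two (lat, lon) points in kilometers."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def round_trip_ms(distance_km: float,
                  speed_km_s: float = LIGHT_IN_FIBER_KM_S) -> float:
    """Time for a signal to go there and back, in milliseconds."""
    return 2 * distance_km / speed_km_s * 1000.0


if __name__ == "__main__":
    d = great_circle_km(PRAGUE, SAN_FRANCISCO)
    rtt = round_trip_ms(d)
    print(f"distance {d:.0f} km, one round trip at least {rtt:.0f} ms")
    print(f"share of a 400 ms budget: {rtt / 400:.0%}")
```

This is only a lower bound for a single round trip. Real cable routes are longer than the great circle, routers add delay, and a query typically needs more than one round trip, so the share of the 400 ms budget spent on distance is considerably larger in practice.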
Leave aside the storage and power, et cetera,
et cetera, et cetera.
And then there's all the fun of actually doing the ranking
and picking out which result goes to the top, et cetera.
It turns out it's harder to build a search engine than it
seems. Next slide, please.

We want to give you the right answer at the top every time.
So there's a lip right here.
I'm taking bets about the odds that I go head over heels over
the lip at some point during this talk.
I'm currently giving 5:1 that I end up on my face, just FYI.
Any takers?
OK, we want to give the right answer every time at the top.
This is the art of ranking.
How do you know what the right result is?
Larry and Sergey came up with the concept of Page Rank.
So have any of you read the Page Rank paper?

Wow.
What classes have assigned it, or are you guys just
over-achievers?

There's a lot of you.
That's creepy.
OK, usually like one person raises their hand, and it's
the person you don't like.
There are, like, 30 of you.
Wow.
This is cool.
Core concept of Page Rank.
How many of you have met me?

Come on, you guys have met me?
Stephanie--
the Google people should raise their hands, geez!
And so none of you have any idea who I am.
Why are you here?

Oh, right.
You're here because somebody you trust--
or, well--
[LAUGHTER]
OK, let's just pretend.
You're here because somebody you trust said that I was
worth listening to.
You're here to listen to me-- and I make it--
you're here to listen to me because somebody else
suggested that I would have content--
how are you--
twice, I made it-- that I would have content worth
listening to.
Fundamentally, you're trusting that I have useful content
because someone you trust said so.
Page Rank is the same idea.
Some arbitrary page on the web is most likely garbage.

However, if someone you like links to that page, basically
saying this page isn't garbage, it's more likely that
page is useful.
Page Rank is simply a sum of the vertices
of a directed graph.
Start from a top page, make a graph downward of links.
Edges are links, nodes are pages.
Take a sum of the weights across those links,
you get Page Rank.
Thus, the more linked something is, the
higher its Page Rank.
Thus, the more a page is connected across the web, the
more likely that page is good.
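
EDITOR'S NOTE: The description above is a simplification. In the published formulation, a page's score is fed by the scores of the pages linking to it, each divided by that page's number of outgoing links, and the scores are computed iteratively. The sketch below follows that formulation with the commonly cited damping factor of 0.85. The toy graph, the iteration count and the handling of pages without outgoing links are assumptions made here.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Iteratively score pages; `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its score everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

In the toy graph, page "c" collects links from three pages and ends up with the highest score, and "a" outranks "b" and "d" only because the well-linked "c" points to it.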
What's wrong with this algorithm?

What if the links are garbage?
So say for example, you have a blog, and your
blog has open comments.
And I write a bot that goes and finds your blog with all
of its open comments and inserts a comment which is a
link back to this page.
Page Rank will see that as a link, and thus will think, oh,
this page is better.
Do you think that link is a useful signal?
Probably not.
So Page Rank was our first ranking algorithm designed to
get the right results at the top every time.
We now use something more than 200.
Spam is an arms race.
Every day, we have hundreds of engineers that work on trying
to figure out what the person who's trying to game the
system is going to do next.
Now there's a fun job.
Every day, you get to go to battle with the bad guys.
Next slide, please.

And then you start thinking about, in addition to crawling
the web, indexing the web, ranking the pages, maybe you
ought to be nice to your users.
Those pesky users.
Some languages, like English, are relatively easy to enter
search terms on.
English doesn't have accents, I don't think.
Do we have any?
I don't think so.
English doesn't have diacriticals.

So my English keyboard has one mode.
Full stop.
Not true for you guys.
But as search engines get better and better coverage,
they can get smarter and smarter, and they can start
noticing things.
For example, we can notice errors in user entries,
specifically like you dropped the diacriticals, and we know
it, so we can just add them back for you.
How do we do that?

Come on, somebody guess.
There's an obvious guess.
Come on.

OK, I'll come out here and I'll guess.
OK.
OK, I'm going to come sit right next to you, and I'm
going to guess.
OK.
I think you do it by having a bunch of
people who speak Czech.

Four times, I made it without falling.
AUDIENCE: [INAUDIBLE]

DOUGLAS MERRILL: That's a great guess, much
better than my guess.
Not right, but much better.
Much better.
So my guess is dumb.
Why is my guess dumb?
Because it doesn't scale.
Your guess makes a lot of sense.
Except it means that I have to teach the crawler and the
indexer what is a diacritical.
AUDIENCE: Is that hard?
DOUGLAS MERRILL: Not as hard as doing it by hand.
But you know what's easier still?
What's easier still is watching your users.
You take anonymized search traffic, and I can see people
who start with that entry up top, and then go, ugh, and
retype the entry below.
And I can do statistical machine learning that says, oh
right, these two are probably actually the same word.
And then I don't have to teach it about diacriticals, I don't
have to teach it about language, I just have to watch
anonymized user traffic.
AUDIENCE: Are there any users that [INAUDIBLE]
DOUGLAS MERRILL: Say again?
AUDIENCE: Are there any users that use diacriticals
[UNINTELLIGIBLE] when searching?
Because I never do.
I always type it out, whatever it is.
DOUGLAS MERRILL: Thank you for helping to
improve our search quality.
The answer oddly enough is yes, but fewer and fewer
because we did the right thing.
Next slide, please.
But the next slide's the same--
this is even better.
This is the same problem only done from the other side.

We can do the same thing I just talked about, about
diacritics and provide spell checking.
How do we do it?
The same way I just talked about.
You see people starting at the top, which is the word for
gym, right?
For gymnasium.
Apparently they're tired because they skipped a letter.
So there's some sort of weird--
But we can notice that you typed that word in, you
probably will get a few results.
In general, the other grand truth of the internet-- so
grand truth number one was that the top-rated search is
always about some woman.
Grand truth two is no matter how badly you misspell a word,
somebody's got a page that spelled it that way.

Anyway, it's never the right page.
And so we always find that a couple minutes later, or a
couple seconds later, more often, you redo the search.
And so by doing statistical machine learning, I can learn
how to spell in almost every language on the planet without
having any notion of morphology, without having any
generative grammar, without having any of the stuff that
Steven Pinker talks about.
All I've got is spell correction,
which is pretty useful.
In fact, it's so useful in English that I use it to
actually spell check my words, because there are all these
words I can't figure out how to spell, so it will teach me.
And all done simply with statistical machine learning.
So how many of you have had a statistical machine learning
class, or has [UNINTELLIGIBLE] a topic in a class?
Pay attention next time.
It's important.
Next slide.

OK, however to your question, it was in there someplace, I
lost where.
I apologize.
Who actually does searches with diacriticals?
Good point.
We do, however, have more sources of
data than just search.
And those sources are the local products we've released
in the market.
The more content that gets created, the better off the
internet is.
But what's the interesting story of the internet?
It's not actually Google or Seznam or Yahoo.
That's not the interesting part.
The interesting part of the story is the democratization
of information creation.
History has always been written by the winners.
400 years ago, about 2% of the people could read or write.

And apparently all of them went to this university.

Now 200 years ago, between 10 and 20% of the people in the world
could read or write, depending on your perspective.
Nowadays, it's more than that.
I hope a lot more, but I don't actually--
have you ever read an American newspaper?
It might surprise you.
Anyway, leaving that aside, what the internet and tools
like that have let us do is they have let everyone tell
their story.
So instead of history being written only by the winners,
it's written by everyone.
Everyone gets to tell their story, which is cool.
Pop quiz, what's the difference between a
revolution and a civil war?
Who won.
Because if the reigning government won,
it's a civil war.
If the reigning government lost, it's a revolution.

We built a bunch of tools to help people tell their story.
We built a bunch of tools that help people tell their story
in Czech, which allows me to improve my search quality even
if, in fact, no one searches with diacriticals, because I'm
getting content created that I can index.
Next slide, please.

I don't really have anything to say on this slide, but it's
a pretty picture.

So pretty?
Yes?
Anyone have any comments on this slide?
Me neither.
Next slide, please.

So the next time you have a class assignment to build a
search engine, you know what you have to figure out.
You have to figure out how to do a crawl and recognize that
you've seen a page before and find an efficient way to store
the page, find an efficient way to figure out if you've
seen it before.
And then you have to decide on an indexing scheme.
You have to index characters, or maybe words, or maybe
bigrams.
You have to figure out a ranking system.
Maybe you'll use Page Rank.
Or maybe you'll be like us and you'll do hundreds of
different things, some of which are fascinating computer
science, and some of which are funny little hats.
But all of the things will then ultimately result in a
search which works well in one context.
Here's the place where I hope all of you are
actually paying attention.
So everyone who's asleep, please wake up.
The last 10 years have been fascinating.
We've done such great things worldwide in search.
Seznam's done great things.
We've done interesting stuff.
There have been great companies doing
great work for 10 years.
The future's much harder, and much more interesting.
Next slide, please.

So our mission was all the world's information
universally accessible and useful.
All the world's information universally
accessible and useful.
There are at least four huge computer science problems to
solve in that context.
For those of you who are interested in winning Turing
Awards, pay attention.
There's at least 30 of them on the next couple of slides.
Next slide.
Audience participation part number whatever--
three, four, five, whatever number I'm on.
What is this?
AUDIENCE: The world.
DOUGLAS MERRILL: OK.

OK, fair point.
Yes, it's the world.
I did actually give this talk once and I showed this slide,
and someone said it's a photograph of the Earth.

And I was sort of intrigued by this.
So how do you take a picture of the Earth and
have it all be dark?
But let's ignore that for now.
Fair enough.
It's not a photograph of the Earth, but it is
a map of the world.
What are the spots on it?
What's changing?
AUDIENCE: The number of searches conducted?
DOUGLAS MERRILL: How did you know that?
Nobody gets that right.
Hey, you get out of here.

Well done.
So pretend he's not here.
Everybody says, hey look, it's city lights at night.
It's not.
OK, what we did is we took our query traffic for a day, and
we put a little white dot every place that
a query came from.
So we geo-located the source of a query and we plotted it
on the map over time.
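
EDITOR'S NOTE: The aggregation behind such a map can be sketched as binning. The code below assumes each query has already been geo-located to a latitude and longitude and tagged with an hour of the day; it counts queries per hour and per grid cell, which is the data an animation like this would draw. The event format and the one-degree cell size are assumptions, and the talk does not describe how the actual visualization was produced.

```python
import math
from collections import Counter

# An event is (hour of day, latitude in degrees, longitude in degrees).
Event = tuple[int, float, float]


def bin_queries(events: list[Event], cell_deg: float = 1.0) -> Counter:
    """Count events per (hour, grid row, grid column) cell."""
    grid: Counter = Counter()
    for hour, lat, lon in events:
        row = math.floor((lat + 90.0) / cell_deg)   # 0 at the South Pole
        col = math.floor((lon + 180.0) / cell_deg)  # 0 at longitude -180
        grid[(hour, row, col)] += 1
    return grid


if __name__ == "__main__":
    sample = [
        (9, 50.08, 14.44),     # around Prague, morning
        (9, 50.5, 14.9),       # same one-degree cell, same hour
        (21, 37.77, -122.42),  # around San Francisco, evening
    ]
    for cell, count in sorted(bin_queries(sample).items()):
        print(cell, count)
```

Drawing one frame per hour, with a dot wherever a cell's count is non-zero, gives the kind of time-lapse shown on the slide.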
And you see some things, like you can see the United States
pretty clearly.
You can see Western Europe pretty clearly.
You can see Tokyo over there, it's [UNINTELLIGIBLE], a
little bit of China.
And you can see it's clearly temporal, because remember,
time is flowing in this diagram.
And although I've taken the scale off it, it turns out the
people seem to search a lot in the morning and the night,
which makes sense because we all work for a living, except
all of you.
But anyway what else is interesting about this slide?
Where is Africa?

I flew over it a couple of days ago.
It was there.
Really.
So what's going on?
What's going on is it turns out that the continent of
Africa is served by basically two very large
wired internet cables.
Two.
One runs down the east coast, one runs down the west coast.
Remarkable how that works.
Each of those internet cables is connected to the ground by
things called points of presence.
Those points of presence, there are about 10 of them,
land in governmentally controlled centers.
What is true about the internet
everywhere in the world?
One, it destabilizes authoritarian governments, and
two, it's a great source of tax revenue.
So what does that suggest is going to be the case for the
wired internet in Africa?

AUDIENCE: [INAUDIBLE]
controlled by government.
DOUGLAS MERRILL: Oh, well done, sir.
It's going to be controlled by the government.
It's going to be really, really darn spendy.
In fact, in some parts of sub-Saharan Africa, the cost
of an hour's internet time in an internet cafe is about the
same as one month's total salary on average.
That suggests there ain't going to be a whole lot of
wired internet use, right?
So there are about 100,000 plus/minus wired internet
connections in Africa.
But you know what else there are?
10 million internet-enabled mobile phones.
Let's say your mission is all the world's information
universally accessible and useful.
What would you be working on?
Search on mobile devices.
Next slide, please.

So how many of you are carrying a laptop?
It should be almost all of you, right?
OK, how many of you are carrying a phone?

Even in a classroom, there are probably 50%
more phones than laptops.
Imagine what it's like in places that aren't schools.
How should search work on a phone?
What should you do?
Should it be exactly the same experience as on your Mac?
Probably not, right?
Because the keyboard's kind of stinky and tiny.
I don't know.
Maybe you should get to actually talk into the phone
and have an answer.
That would be useful, wouldn't it?
Or maybe it should text you your search results.
Factoid.
How many of you have used Google Maps?
The rest of you, what are you doing?

OK, please use it.
Thank you.
For those of you who have used it, how many of you have
printed driving directions?
Really?
Nobody?
Wow.
That's all I do.
So it turns out that about 50% of all driving directions ever
printed out from Google Maps are left
sitting on the printer.
That makes them not very helpful.
Wouldn't it make more sense, for example, for Google Maps
to automatically text your phone with the driving
directions?
That would make more sense.
The future of search on a mobile phone is not
replicating your desktop or your laptop.
It's doing something unique to that channel.
Your phone knows who you are, it knows where you are, it
knows what you're doing, it knows what you like.
Your phone knows you probably better than your parents do.

God, is that creepy.
Search on a mobile phone is the next big hurdle that
nobody's gotten right yet.
Next slide, please.

Here's Turing Award number two.
I said all the world's information, which is to say
more than just web pages in English.

However, translation between languages is a very
challenging problem.
It's not scalable to have humans translate everything.
See earlier story about why it's not scalable to have
humans index everything.
And yet think how important it is for people in the world to
get to hear stories told by folks who aren't native
speakers of their language.
Think what great perspective it would be.
You could actually understand someone who's very
different from you.
What a great social goal.
So Google is doing a bunch of research into what's called
Automatic Machine Translation.
Machine translation is one of the classic tests of AI.
Build a machine that can understand a document.
My doctorate is in AI.
None of us have yet figured out how to build a machine
that understands a document.
However, recently, we won-- we, Google--
we won an award for Automatic Machine Translation from
English to Chinese and back, and English
to Arabic and back.
How did we do it?

Statistical machine learning.
People who build the same page in multiple languages,
followed by some sort of dirty tricks around recognizing
synonyms from queries.
We are working on automated machine translation from all
of the hundred languages that we speak.
It's not going very fast. We want it to go faster.
It's not clear this is the right approach, although, as I
said, we're winning a lot of awards for it.
There is a Turing Award on this page, folks.
One of you guys, please go to work on it.
Next slide, please.

Wow.

OK, couldn't we have chosen the other query?

All the world's information universally
accessible and useful.
One of the things we've announced in English is a
thing we call Universal Search.
Universal Search means you can do searches for Darth Vader.
OK, this is going to be funny.
How many of you have heard of the movie Star Wars, the
original one, by the way.
Yeah, that's what I thought.
Anyway it's this movie.
It was released in 1977.
It had a big tall guy, black cloak.
Forget it.
You do a search on google.com for Darth Vader, you will get
a bunch of web pages, including a Wikipedia article
talking about it.
You'll get a site called imdb.com, which is the
Internet Movie Database.
You'll get a bunch of YouTube videos.
You'll get a bunch of pictures.
You'll get a bunch of stuff.
You'll get some news.
Kind of cool.
Because Google doesn't know what kind of information you
want, what's the cool technical problem?
What order do you put them in?
How do you rank a web page versus a news
result versus a video?
I don't know.
It's kind of a cool problem, though.
How do you know what the user wanted?
I haven't a clue.
One of you guys, figure it out.
Next slide.
OK, next slide.

So we'll pretend the blank slide didn't exist. Pay no
attention to the man behind the curtain.
OK, so why, you may ask, am I here at a university, other
than to put you all to sleep?
Well, I'm here, first of all, because I have a bet as to
whether or not I can cross this threshold a couple more
times without falling.
So far, for those of you who took my bet, I'm at four.
One more and I win.

So I've spent the last four years of my life at Google.
It is a place where great people get to solve great
problems. Why don't you guys all come join us?
We have engineering centers all over the world, and I'm
very excited to make sure that all of you--
there are a lot of women in the audience, that's a great
thing to see.
Computer science is radically under-represented by women.
So all of you who are in the audience who are female,
particularly pay attention now.
The Anita Borg fellowship.
The nice woman in the back of the room is going to raise her
hands and wave. Thank you.
Everyone who's female-- actually, everyone--
look at my friend.
The Anita Borg fellowship is a scholarship administered by
the Association for Computing Machinery, the leading computer
science organization in the world, funded by Google, to
fund women's studies in computer science.
Please keep studying.
We need more computer scientists.
We are hiring.
Next slide, please.

Not to reiterate the point-- we are hiring.
Here's a URL.
Feel free to go to that URL.
All of you guys who are surfing the web, pay attention
long enough, surf to that link, please.
Next slide, please.
But the most important thing to pay attention to, this
crass recruiting pitch notwithstanding, is we have
internships.
The internships are available at Google in lots of different
offices, where you get to come in, spend a summer working on
some ridiculously hard problem, do something really
cool for the world, and have some fun.
And with that, I believe, I am done.
Thank you so much for listening to me.
I've really appreciated your participation.
[APPLAUSE]
42 minutes.
Nicely done.
So you have a whole stack full of questions.
Wow, OK.
How do we do this?
Wow, there's more questions coming in all the time.
Yes, my shirt really is purple.

So are you going to read these questions?
What am I going to do here?
Holy-- there are a lot of them.
FEMALE SPEAKER: Or you can read them.
DOUGLAS MERRILL: I'll read them.

OK.

All right, nice.
OK, what about penalties and banning pages?
Are there permanent bans?
If a webmaster caused a penalty for a web, how could
he fix the problem?
OK, so context.
So the question is about web pages getting banned from
Google, et cetera.
Just sort of stepping back for context, if you go to
google.cz and do a search, on the left-hand side, you'll see
a bunch of links.
On the right-hand side, you'll see a bunch of blue stuff.
The blue stuff are advertisements.
Feel free to click on them.
The left-hand side are search results.
Search results are developed in an automated
and objective fashion.
Humans do not touch them.
It does not matter whether you advertise with Google or not.
Those are just what the world is.
On the right-hand side are the advertisements.
That's a business transaction.
They're complicated.
Sometimes people who are advertisers end up breaking
one of the rules that we have for our advertising program.
And so their ad stops serving.
And that means those people, their ads
aren't around anymore.
And sometimes that gets reported in a confusing way as
Google kicked them out of the index.
It's not true.
That's the advertising side of things.
It's very different.
On the search results side, the search results are
automated and objective.
They're not touched by human hands.
There are a set of guidelines which you can find off of the
website under webmaster tools for making it more likely that
your site will be indexed correctly, and there are a set
of tips, including something called site map-- which I
recommend for the [UNINTELLIGIBLE]
webmasters look at--
which is a way of trying to make sure that we notice and
index your content more quickly.
But obviously we have to follow local laws.
And given that we serve basically every country on
Earth, we face a patchwork of local laws.
Some things have to be removed from the index in one country,
but not another, et cetera.
We do that.
But beyond that, most of the coverage that you may have
read about things getting banned from our index is
mostly the advertising side.
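[Editor's sketch. The site map mentioned in the answer above is, in its public form, an XML file listing a site's URLs, as defined by the Sitemaps protocol at sitemaps.org. A minimal way to generate one is shown here using only the Python standard library. The helper name build_sitemap and the example.com URLs are hypothetical, and how the file is submitted is not covered.]

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a 'YYYY-MM-DD' string, or None to omit it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages for illustration only.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2007-10-25"),
    ("https://www.example.com/about", None),
])
print(sitemap_xml)
```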

How do you want to hunt new developers
in the Czech Republic?
See previous slide.
Please come work for us.
We're fun.

Do you use some ontology languages or semantic web
technologies?
Good question.
There's been a lot of discussion over the years
about what's the right way to index the web.
And lately Tim Berners-Lee and others are talking about the
semantic web.
The idea that the machine should understand content.
And in fact, right before this, we were having a really
long, great conversation about whether you need meaning,
whether you need ontology to do indexing.
This is a great question.
It is an unsolved problem as to whether anyone needs
meaning or not.
Our techniques are not based on meaning.
They are based on syntactic rules, declension rules, and
user traffic.
My machines don't understand anything.
However, there are some competitors of ours that have
asserted that you need understanding, you need
ontology to do search well.
They might be right.
If you believe they're right, build a search
engine and let's see.
Let's find out.
Try it.
Come work for us and try it.
What do I think about Seznam, the most popular website in
the Czech Republic?
Why do you think it's more popular than Google?
Because it's great.
It's a really terrific site.
They've done a really great job providing
search and other services.
I think they're awesome.
What else should I think about them?
Competition is good.
All kidding aside.
It is always better to have competition.
We're better in markets when there is local competition.
We're losing here.
We're not in first place.
That's so motivating.

So go, work hard.

Is Yahoo better at measuring performance quality results in
some things than Google?

I'm sure.
We track our search quality pretty carefully across all
the languages that Google speaks.

In general, our quality is the highest. Our speed of
performance is the best.
But Yahoo is a great company.
They've done a tremendously good job of indexing multiple
kinds of content.
And they've done a really great set of
deals with media companies.
I think they're a terrific company.
And I think we're better, but they're good competition.
So I'm sure they're better at something.
Just nothing comes to mind.
Is it planned to use computational linguistics in
search despite what is spoken about query corrections, or is
it used in some fields already?
Computational linguistics is a fascinating field.
In fact, Steven Pinker was at Google last week.
I think we just posted his video on YouTube.
So if you're interested in the psychology side of
computational linguistics, I highly recommend you take a
peek at his talk.
He's a psycholinguist as opposed to a computational
linguist, but the concepts are the same.
So we are not at present using computational linguistics
anywhere that I can think of.

Although that probably just means that I've forgotten.
Most of our core techniques are statistical machinery.

What was my previous place of work before Google?
I was an academic when I started my career.
I taught university for several years.
I am a lot older than I look.
Sorry, I'm a lot older than all of you.
I was an academic for several years, then I worked at a
place called the Rand Corporation, which is a think
tank, and then I worked at a consultancy, and then
I worked at a bank.
Can you imagine me at a bank?
Yeah, me neither.
And then I worked at Google.
Are there other questions?
Oh, there are more questions.
Vanna White, ladies and gentlemen.

Wow.
OK, this is like seven questions.
And they're conjoined by negatives, so I'm not sure
which one to answer.
I'll answer the first one.
Google bought a lot of start-ups recently.
Can you integrate all of them with the Google services?

Sure.
We hope so.
That's why we bought them.
We buy companies for two reasons.
Reason number one is they have a product which has
interesting user traffic and has interesting
sorts of user behaviors.
Reason number two is they have great engineers.
And so in general, every acquisition we've done has
either been because we're really interested in their
product and their users, or their engineers, or both.
And so our goal, depending on which one of those things it
is, our goal might be to release their product as a
Google product.
And for example, Google Earth was an acquisition, and Google
Analytics was based on an acquisition.
Or it might be the case that we're buying them for their
engineers and some thoughts they have, and there are a
variety of those.
So it's hard for me to say in the abstract whether we
integrate them all, because in some cases, we're just looking
for engineers.
Did I mention we're hiring?

That's Vanna being a pain in the neck.
Is there a real question there?
No, OK.

Can you tell us about indexing images?
I'd be interested if Google plans
non-text-based photo indexing.
That's really hard.
Indexing a photo is tricky.
What's the right thing to index?
So there's a lot of cheap tricks.
You can index the text around the picture, you can index the
name, you can index metadata if there is any.
But it's pretty interesting to figure out how
to index the photo.
So we're investigating some really cool stuff, like image
mapping and image correlations, and trying to
index images that are similar.
That's a really cool problem space.
Figuring out how to index an image is a
really fascinating problem.
It's a fascinating conceptual problem, leaving aside just
the computational elements.
Because the human body has a lot of computational power
devoted to being able to recognize images, and being
able to recognize that two images are kind of similar--
a lot of computational power, lots and lots of neurons.
So what's the right way to do that in computer science?
I don't know.
We're trying a lot of things, and I'm not sure we're trying
the right stuff.
Did I mention we're hiring?
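[Editor's sketch. The "cheap tricks" in the answer above (index the text around the picture, the name, and any metadata) amount to an ordinary inverted index built from text associated with the image, without looking at the pixels. The sketch assumes each image is described by a small dictionary; the field names filename, alt and nearby_text and the sample data are hypothetical. It does not attempt the hard problem the speaker describes, which is indexing the image content itself.]

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_images(images):
    """Map each token to the set of image filenames it is associated with."""
    index = defaultdict(set)
    for img in images:
        # The three text sources named in the answer: name, metadata, nearby text.
        fields = [img["filename"], img.get("alt", ""), img.get("nearby_text", "")]
        for field in fields:
            for token in tokenize(field):
                index[token].add(img["filename"])
    return index


def search(index, query):
    """Return images associated with every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)


# Made-up sample data.
images = [
    {
        "filename": "darth_vader.jpg",
        "alt": "Darth Vader in a black cloak",
        "nearby_text": "Star Wars was released in 1977.",
    },
    {
        "filename": "prague_castle.png",
        "alt": "",
        "nearby_text": "A view of the castle in Prague.",
    },
]

index = index_images(images)
print(search(index, "darth vader"))
print(search(index, "prague"))
```

[An image with no name, metadata, or surrounding text is invisible to this index, which is one reason the answer calls these tricks cheap.]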

How many computers does Google have as we speak?
Four.
One of them is here.
You may notice the website's slow.
Are there more questions coming down?
Wow, they're piling on.

I'm getting fairly frightened, actually.

Question number one, Google hacking, do I think it's a
problem or not?
So since we started this game a long time ago, there have
been people who have tried to manipulate the system.
They have tried to get their results at the top for a lot
of reasons.
In many cases, these are just spam pages that are trying to
get people to click on them and they want to download
malware or something else.
And we've attacked that problem in a variety of ways,
including the Google toolbar which tries to tell you if
something is a bad site, and things like
stopbadware.org, et cetera.
If you're building anything which involves users
traversing the web, I want to make sure you've paid
attention--
we have a malware API, which you can programmatically call
us and ask us if a particular site is likely to be hosting
malware or phishing, [UNINTELLIGIBLE]
I encourage you to use that API.

Spam, in all of its forms, is an arms race, as I said.
Sometimes you end up with funny spam, like for a while
there was a time if you did a search for complete failure,
you got the website for George Bush, our current president.
No, I wasn't involved.

Sometimes they're funny, sometimes it's spam.
It's clearly an issue.
We want the right result at the top every time.
And the right result is the one the users really want as
opposed to the one which is just funny.

How do you get the list of banned URLs in China, and are
they paired to key words?
So as many of you may know, the Chinese government has a
set of search terms that you cannot search for.
And all the search engines had to decide what to do.
Basically the choices were do nothing and be blocked and
provide no information, or omit the results, including
those key words--
basically, the banned URLs--
from your results and get to serve some traffic.
We chose to serve 99% of the queries, and give the Chinese
people access to 99% of the world's information.
And then on those times when a search would have returned a
banned URL to flag at the bottom of the page, some
results have been removed by government policy.
We thought it was better to engage and provide 99% of the
world's information and tell users that we did something,
than not engage.
We do not publish the list of URLs or key words, but
although we struggled a lot with the final choice, we
think it was the right choice for the users in China.
It looks like Google employees suffer from recruiting an army
of smart enough people.
How do I feel about it?

How's recruiting going, you think?
We're not suffering from recruiting an army of smart
enough people.
We're hiring another army of smart enough people.
We have two armies.
We have the left army and the right army.

Recruiting is an important part of what we do.
Google is a talent play.
At the end of the day, we will continue to be a great company
as long as we hire brilliant people and give
them room to innovate.
We've built a culture and a set of processes to help
people innovate.
It's a great place to be an engineer.

We have lots of hard problems to solve.
Eric Schmidt, our CEO, likes to say, we have a 300-year
mission in front of us.
And for a 300-year mission, you want as many smart
engineers as you can get.
Did I mention that we're hiring?

What's the possibility to get a Google internship abroad for
an average Czech computer science student?
Note, the average Czech computer science student
browses the internet during the lecture.
Well, possibilities are good.
Submit your application.
Remember to mention something that I said, if you were
actually listening at all during the lecture.
Yes, we have lots of internship opportunities.
And we think that's a great way for students to get some
practical work experience, and to meet us, and for us to
engage with the universities.
We do lots and lots of work engaging with the universities
on the topics.

Am I currently using contents of my Gmail account to provide
me with more personalized search results?
That's a great question.
No, we are not.

To use Gmail, obviously, you have to log in.
You have to tell me who you are.
That makes sense, right?
Otherwise, I'm going to serve your random email.
But when you're logged in, Google offers the possibility
to do things like give you more
personalized search results.
We may give you slightly different results based on the
history of searches you've done and things you've clicked
on, et cetera.
But we do not use the content of Gmail to
change your search results.
We do, of course, use the content of Gmail to serve ads.
They're marked in blue on the right-hand side.
But those two data streams are not mixed.
How many developers work for Google?
Not enough.
Did I mention we're hiring?

Why does Google cooperate with Mozilla Firefox?
We believe that competition is good.
Competition is good in the search engine, i.e., there
should be multiple search engines.
Competition is good in the operating system layer, i.e.,
there should be multiple operating systems, and there
should be multiple browsers.
IE is a perfectly good browser in many ways.
Firefox is a good browser in many ways.
And in general, having the two together competing with each
other has led both to improve.
Competition is good, and there's nothing on this slide
anyway, so we'll just ignore the dark screen and the man
behind the curtain.

I have heard that in South Korea,
Google is not very popular.
The reason is South Koreans use domestic search engines
based on a completely different
approach than Google.
Do you know any details about this?
Thank you.
In Korea, there is a very great local competitor, much
like in the Czech Republic there's a great local
competitor.
We continue to work on the same kinds of things
that I've said here.
If I were giving this lecture in Korea, I would be talking
about a thing called IME which makes it easier to enter
Korean characters.
I would be talking about transliteration, which is much
like the diacritical point I made earlier.
I'd be talking about exactly the same kinds of things I'm
talking about here, and I would be saying there is a
local competitor which is currently beating us, and how
motivating is that?
Just like I said here.

Cisco Microsystems has an education system for students
in schools.
Has Google got a similar system for students?
We do a lot of work with universities.
We have a university outreach program.
And in fact, if you're interested in understanding
what we do with universities--
I guess, Lucas, is that the right person?
Lucas?
MALE SPEAKER: [INAUDIBLE]
DOUGLAS MERRILL: OK.
This guy right here.
So we believe that computer science and mathematics
education are one of the most important things that we are
doing in the world today.
And we are very, very, very motivated to help computer
science and mathematics departments in a lot of
different ways.
So please, yes.
The answer to your question is yes.
And I think I might actually be out of questions.
Oh, there's one more.
It's not very anonymous.
Maybe you should pass it around.

How many queries does Google process per minute, per hour?
And when is traffic maximal?
We don't release traffic numbers.
The answer is a lot.

And the traffic is maximal right now because you guys are
all doing searches trying to figure out
what I'm talking about.
When does Google release Google Desktop for Linux?
I'm already running the search.
That's good, thank you.

We've released the search version.
So the Google Desktop for Windows has widgets in it.
I'm not sure if the desktop version does.
I don't know the answer to that question.
I will find out.
Give me your email address.
This is you, right?
Give me your email address.
I'll find out.
Thank you guys so much.
It's been a great pleasure talking to you.
Have a wonderful day.