I'm up in the Kirkland office today, up here for an offsite,
a little bit of planning, and they
said, you know what?
Why don't we see if we can throw together a video in 10
minutes or less.
So we said, all right, let's give it a shot.
So we're thinking about some of the common things that you
do with the webmaster console, or some topics that webmasters
want to hear about.
People like to go to the webmaster console and check
their backlinks.
They also like to find out if they have any penalties.
There's a lot of really good stats in the webmaster console,
but one thing that I've been hearing some questions about
is how do I remove my URLs from Google?
So why would you want to do this?
Well, suppose you're a school and you accidentally left the
Social Security numbers of all your students up on the web.
Or you're a store and you left up credit card numbers.
Or maybe you run a forum and suddenly you've gotten spammed
by Ukrainian porn spammers, which happened to a friend of
mine recently.
So whatever the reason, you want to get some URLs out of
Google instead of getting URLs into Google.
Well, let's look at some of the possible approaches, some
of the different ways you can do it.
And what I'll do is I'll talk through each one of these and
draw a happy face by one or two that I think are
especially good as far as getting the content out of
Google, or preventing it from getting into Google in the
first place.
So the first thing that a lot of people say is, OK, I just
won't link to a page.
It'll be my secret server page, Google will never find
it, that way I don't have to worry about it showing up in
the search engine.
This is not a great approach, and I'll give you a very
simple reason why.
We actually see this so often: people surf to a page and then
surf to another web server, and that causes your browser to
send a Referer in the HTTP headers, so that the page you
were at before shows up to
that other web server.
Now if that other web server says, oh, these are the top
referrers to my page, and maybe that's a clickable
hyperlink, then Google can crawl that other web server
and find a link to your so-called secret page.
So it's very weak to try to say, oh, you know what?
I'm just not going to link to it, I'll keep it a secret,
nobody will ever find out about it.
Because for whatever reason somebody will surf from that
page and have the referrer set, or somebody will
accidentally link to that page, and if there's a link on
the web to that page there's a reasonable chance
we might find it.
So I don't recommend using that; it's a
relatively weak approach.
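To make that concrete, here's roughly what the browser sends when somebody follows a link from your so-called secret page over to another site (just a sketch; the hostnames and paths are made-up placeholders):

    GET /some-page.html HTTP/1.1
    Host: www.example.org
    Referer: http://www.example.com/secret/hidden-page.html

If that other site publishes its top referrers as clickable links, your secret URL is now sitting on a public page where a crawler can find it.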
Now another thing you can do is something called .htaccess.
So that sounds scary, but let me just tell you very simply
this is a very simple file that lets you do things like
redirect from one URL to another URL.
The thing that I'm specifically talking about is
you can password-protect a subdirectory, or
even your entire site.
Now, I don't think we provide an .htaccess tool on the
webmaster console, but that's OK.
There's a lot of them out on the web, and if you just
search for something like .htaccess tool or wizard or something
like that, you'll find ones where you can say, I'd like to
password protect a directory, and you can even tell it the
directory and it will generate one for you.
Then you can just copy and paste that into your web site.
So this is very good.
Why is this strong?
Why am I going to draw a happy face here?
Well, if you have a password on that directory, Googlebot's
not going to guess that password, and so we're not
going to be able to crawl that directory at all.
And if we can't get to it, it will never
show up in our index.
This is very strong, it's very robust, it will work for every
search engine because someone has to know the password to
get into that directory.
So this is one of the two ways that I
really, really recommend.
It's a preventative measure, so it doesn't help if
Google has already gotten into a particular area of your site,
but if you're planning ahead and you know what the
sensitive areas are going to be, just slap a password on
there and it will work really well.
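If you're curious what that looks like, here's a minimal sketch of an .htaccess that password-protects a directory, assuming an Apache server with basic authentication; the file path and user name are placeholders, not anything a particular tool will generate for you:

    # .htaccess placed inside the directory you want to protect
    AuthType Basic
    AuthName "Private area"
    AuthUserFile /home/example/.htpasswd
    Require valid-user

    # Create the password file once on the server ('alice' is a placeholder user):
    #   htpasswd -c /home/example/.htpasswd alice

It's a good idea to keep the .htpasswd file outside your public web root so it can't be downloaded.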
Alright, so here's another way, one that a lot of people
know about.
It's called robots.txt.
The standard has been around for over a decade, since at
least 1996, and essentially it's like an electronic no
trespassing sign.
It says, here are areas of your site that Google, or
other search engines, are not allowed to crawl.
We do provide a robots.txt tool on the webmaster console,
so given a web site you can test out URLs and see whether
Googlebot's allowed to get to them.
You can test out whether different variants of Googlebot,
like the image Googlebot, are allowed to get to it,
and you can take new robots.txt
files out for a test drive.
So you say, OK, suppose I try this as my robots.txt: could
you crawl this URL?
Could you crawl this URL?
And you can just try it out and make sure
that it works OK.
That's nice, because otherwise you might shoot
yourself in the foot.
Suppose you just made that robots.txt live and it had a
syntax error that let everybody in or
kept everybody out?
Well, that could cause problems. So I recommend you
take that tool out for a test drive, get a robots.txt that
you like, and then put it live.
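As a rough sketch, here's the kind of robots.txt you might take for a test drive; the directory names are placeholders, and the second rule just shows how you could address Google's image crawler separately:

    User-agent: *
    Disallow: /private/

    User-agent: Googlebot-Image
    Disallow: /photos/

Then you can paste that into the tool and ask whether a given URL would be allowed or blocked.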
Now robots.txt is interesting.
Different search engines can have slightly different
policies about uncrawled URLs.
I'll give you a very simple example.
Way back in the day, ebay.com and nytimes.com didn't allow
anyone to crawl their sites.
And so they had a robots.txt that said user-agent star,
disallow everybody.
Nobody is allowed to crawl the site if you're a well-behaved
search engine.
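That keep-everybody-out robots.txt looks like this:

    User-agent: *
    Disallow: /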
So that's problematic, because if you are a search engine and
somebody types in eBay and you don't return
ebay.com, you look dumb.
And so what we decided, and what our policy still is, is
that while we won't crawl this page because of robots.txt, we
can show an uncrawled reference.
And sometimes we can be pretty good about it.
For example, if there's an entry for nytimes.com in the
Open Directory Project then we can use that snippet from the
ODP and we'll show it with nytimes.com being an uncrawled
reference, and it can almost look as good as if we crawled
it, even though we weren't allowed to crawl it and
didn't crawl it.
So use robots.txt to prevent crawling, but it won't
completely prevent that reference from
showing up in Google.
So there are other ways to do that.
Let's move on to the noindex meta tag.
What that essentially says, for Google at least, is don't
show my page at all in the search engines.
So if we see noindex we will completely drop it from Google
search results.
We'll still crawl it, but we won't actually show it in our
list of search results if somebody does a
query for that page.
So that's pretty powerful, it works very well.
It's very simple to understand.
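For reference, the tag goes in the head of the page, and in its simplest form it looks like this:

    <meta name="robots" content="noindex">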
There are a couple complicating factors.
Yahoo and Microsoft, even if you use the noindex meta tag,
can still return the reference to that page.
They won't return the snippet and stuff like that, but you
might see the link.
We do see some people having problems with that.
For example, if you're a webmaster and you're rolling out
a new site, you might put up that noindex meta tag and be
shifting around and developing the site, and then you might
forget and you might not take that noindex meta tag down.
So a very simple example: the Hungarian, I think, .hr?
No?
One of the country versions of BMW has done this.
Ben Harper, who's a musician you've probably heard of, has
had a noindex meta tag on benharper.net for a long time,
and I think it may still be there.
So if you're the webmaster for that site we'd love it if you
would take that down.
So there are various people within Google that have said,
well maybe we should go to this policy where we won't
show the full snippet, but maybe we'll just show a
reference to that URL.
There's one other interesting corner case on noindex, which
is we can only abide by that noindex meta tag if we've
actually crawled that page.
If we haven't crawled that page we haven't seen the meta
tag, so we don't even know it's there.
So in theory it's possible that you link to a page and we
don't get a chance to crawl that page, and so we don't see
that there's a noindex and we don't drop it out completely.
So there are a couple cases in which you can at least see the
reference show up within Google, and Yahoo and
Microsoft, I believe, will pretty much always be willing
to return the reference, even if you use the
noindex meta tag.
So here is another approach that you can use.
You can use nofollow on individual links.
This is another kind of weak approach, because inevitably
you say, OK, there's 20 links to this page, I'm going to put
a nofollow on all of them.
Maybe it's a sign in page.
Expedia.com, for example, has a nofollow on My
Itineraries, which makes perfect sense.
Why would you want Googlebot to crawl into itineraries?
That's a personalized kind of thing.
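A nofollowed link is just an ordinary link with a rel attribute on it; the URL here is a made-up placeholder:

    <a href="/my-itineraries" rel="nofollow">My Itineraries</a>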
But inevitably somebody links to that page, or you forget
and you don't have every single link with a nofollow.
And so it's very common. Let me draw
a very simple example.
So suppose we have a page A and we have a nofollowed link
to page B. Well, we won't follow that link.
We drop it out of our link graph, we just drop it
completely.
So we won't discover page B because of this link.
But now, suppose there's some other guy on page C and he
does link to page B. Well, we might follow that link and as
a result end up indexing B.
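In terms of markup, that situation is just something like this (hypothetical pages and URLs):

    <!-- on page A: the link is nofollowed, so we drop it from the link graph -->
    <a href="http://www.example.com/pageB.html" rel="nofollow">page B</a>

    <!-- on page C: a plain link, so Googlebot can still discover page B this way -->
    <a href="http://www.example.com/pageB.html">page B</a>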
So you can try to make sure that every link to a page is
nofollowed, but it's sometimes hard to make sure
that every single one is correctly handled.
And so this, like noindex, does have these weird corner
cases where you could very easily see a page get crawled
just because not every single link was nofollowed.
In the noindex case it could happen because we hadn't
actually gotten around to crawling that page, and so we
didn't see the noindex meta tag.
So let's move on to another really powerful way.
I helped a friend use this when her forum got spammed
by porn spammers recently, and that's the URL removal tool.
So .htaccess is great as a preventative measure.
You've put a password on it, nobody can guess what it is,
no search engine's going to get in there,
it won't get indexed.
The other thing you can use, if you do let the search
engines in and then you want to take it out later, is our
URL removal tool.
We've offered the URL removal tool for at least five years,
probably more.
And for a long time it sat on services.google.com and it was
completely self service.
It would run 24/7.
But just recently the webmaster console team has
integrated the URL removal tool into
the webmaster console.
And so it's much, much simpler to use, the UI is much better.
The way that it used to work is it would remove the URL for
six months.
And if that was a mistake, suppose you removed your
entire domain and you didn't really mean to, then you'd
have to email Google's user support and say, I'm sorry, I
didn't mean to obliterate my entire site.
Can you revoke that?
And someone at Google would have to do that.
Now you can do that yourself.
And so it's very powerful, but it also gives you a safety
net, because at any time you can go in and you can say, oh,
I didn't mean to remove my entire site.
Revoke that request, and that gets revoked very quickly.
So to use the Google webmaster console, it's not that hard
to prove that you own a site.
You can either make a little page at the root of your
domain to say, yes, this is me, or you can even add a
simple meta tag and say, here's a little signature to
prove that it's my site.
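As a rough sketch, the meta tag version looked something like this at the time; the exact tag name and the content token are whatever the webmaster console hands you when you add your site, so treat these values as placeholders:

    <meta name="verify-v1" content="PASTE-THE-TOKEN-FROM-THE-CONSOLE=" />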
Once you've proven this is my domain then we give you a lot
more stats and this wonderful little URL removal tool.
It can remove at a very nice level of granularity.
You can remove the whole domain, you can remove just a
subdirectory, I think you can even remove
just individual URLs.
And we show you the status the whole time.
We'll show you all the pending requests that you've done,
so a request is pending until it's gone live, and then once
it's live you'll see an option to revoke it.
And you can say, you know what?
I've gotten all the social security numbers or credit
card numbers or whatever down, so revoke that.
And now it's safe to start crawling again.
So, of the ways to remove or prevent your URLs from showing
up in Google, there's a lot of different options.
Some of them are very strong, like robots.txt and the noindex
meta tag, but they do have these weird corner cases where we
might show the reference to the URL in various situations.
So the ones that I definitely recommend are .htaccess, which
will prevent people from getting in in the first place,
and at least for Google, the URL removal tool.
So if you have URLs crawled that you don't want crawled
you can still get them out and get them
out relatively quickly.
Thanks very much, and I hope that was helpful.