Monday June 16th, 2008 | News, SEM, SEO

Google Improves on Robots Exclusion Protocol

You may remember when Google took a step toward transparency in early 2007, when they started laying out tips on the ‘how’ & ‘why’ of using a robots.txt file in their January post “Controlling how search engines access and index your website”, followed by “The Robots Exclusion Protocol” in February, and another mid-year follow-up that included some improvements, “Robots Exclusion Protocol: now with even more flexibility”.

Along that same timeline we saw collaborative improvements to the protocols for sitemaps, including the unprecedented coordination between the major engines that was the culmination of a Google experiment started back in 2005. The establishment of a single, industry-adopted sitemaps protocol changed the way we looked at things. Up to that point it was the ‘wild wild west’, with every engine operating absolutely independent of any protocols or set parameters. While the move was a bit shocking, it was very much welcomed in the industry.

So a few weeks ago, when Google opened the door and talked about how they are Improving on Robots Exclusion Protocol by working toward some central, industry-adoptable protocols, the SEO channels started to buzz again. Google’s post is pretty transparent: it sheds light on the way they treat certain robots.txt declarations and introduces some new syntax.

A standardized robots.txt protocol on the horizon?

The Robots Exclusion Protocol (or REP for short) has been around since the ’90s, and all of the engines recognize the robots.txt file and follow the basic REP to some degree, but there are still some dissimilarities in the way each engine reads and uses the contents of the robots.txt file. So, in an effort to get the ball rolling on a cross-engine standard REP, Google has laid their cards on the table and made the following statement.

Over the last couple of years, we have worked with Microsoft and Yahoo! to bring forward standards such as Sitemaps and offer additional tools for webmasters…in that same spirit of making the lives of webmasters simpler, we’re releasing detailed documentation about how we implement REP. This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP directives will be handled by three major search providers — making REP more intuitive and friendly to even more publishers on the web. – Google

Google’s Robots.txt Directives

Disallow
Tells a crawler not to index your site; your site’s robots.txt file still needs to be crawled to find this directive, but disallowed pages will not be crawled.
‘No Crawl’ pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.

Allow
Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow.
This is particularly useful in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.
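The Disallow/Allow combination can be exercised from code. As a sketch, Python’s standard-library REP parser handles the basic directives (though not the wildcard extensions), and note that it applies rules in file order rather than by longest match, so the more specific Allow is listed first here; the hostname and paths are illustrative.

```python
# Sketch: checking Disallow/Allow rules with Python's standard-library
# REP parser. Caveat: this parser matches rules in file order, not by
# longest match, so the specific Allow must precede the broad Disallow.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The carved-out page is crawlable; the rest of /private/ is not.
print(rp.can_fetch("Googlebot", "http://example.com/private/public-page.html"))
print(rp.can_fetch("Googlebot", "http://example.com/private/secret.html"))
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))
```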

$ Wildcard Support
Tells a crawler to match everything from the end of a URL, covering a large number of directories without specifying specific pages.
‘No Crawl’ files with specific patterns: for example, files of a filetype that always has a certain extension, say .pdf.

* Wildcard Support
Tells a crawler to match a sequence of characters
‘No Crawl’ URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters
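Taken together, the two wildcard forms can be sketched in a robots.txt like this (the paths and user-agent are illustrative, and remember that wildcard support is an extension not every crawler honors):

```
User-agent: Googlebot
# Block every PDF, wherever it lives ($ anchors the match to the end of the URL)
Disallow: /*.pdf$
# Block any URL carrying a session-id parameter (* matches any sequence of characters)
Disallow: /*sessionid=
```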

Sitemaps Location
Tells a crawler where it can find your Sitemaps
Point to other locations where feeds exist to help crawlers find URLs on a site
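In robots.txt syntax this is a single Sitemap line, which can appear anywhere in the file and can be repeated to point at multiple feeds (the URLs below are illustrative):

```
Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/news-feed.xml
```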

Google’s HTML META Directives

NOINDEX
Tells a crawler not to index a given page.
This allows pages that are crawled to be kept out of the index.

NOFOLLOW
Tells a crawler not to follow a link to other content on a given page.
Prevents publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page.

NOSNIPPET
Tells a crawler not to display snippets in the search results for a given page.
Presents no snippet for the page in the search results.

NOARCHIVE
Tells a search engine not to show a “cached” link for a given page.
Does not make a copy of the page available to users from the search engine cache.

NOODP
Tells a crawler not to use a title and snippet from the Open Directory Project for a given page.
Does not use the ODP (Open Directory Project) title and snippet for this page.
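The META directives above all share the same syntax; a page combining several of them might look like the following sketch (multiple values can also be comma-separated in a single tag):

```html
<head>
  <!-- Keep this page out of the index and pass no credit to its links -->
  <meta name="robots" content="noindex, nofollow">
  <!-- Suppress the snippet, the cached link, and the ODP title/description -->
  <meta name="robots" content="nosnippet, noarchive, noodp">
</head>
```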

Regarding their recognized HTML meta directives, Google includes this note…

These directives are applicable for all forms of content. They can be placed in either the HTML of a page or in the HTTP header for non-HTML content, e.g., PDF, video, etc. using an X-Robots-Tag. You can read more about it here: X-Robots-Tag Post, or in our series of posts about using robots and Meta Tags.
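For non-HTML files, the same directives travel in the HTTP response headers rather than in markup. A sketch of a server response for a PDF might include (headers other than X-Robots-Tag are illustrative):

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive
```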

Google’s Other REP Directives

The directives listed above are used by Microsoft, Google and Yahoo!, but may not be implemented by all other search engines. In addition, the following directives are supported by Google but, unlike those above, are not supported by all three:

UNAVAILABLE_AFTER Meta Tag – Tells a crawler when a page should “expire”, i.e., after which date it should not show up in search results.

NOIMAGEINDEX Meta Tag – Tells a crawler not to index images for a given page in search results.

NOTRANSLATE Meta Tag – Tells a crawler not to translate the content on a page into different languages for search results.
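In META syntax these Google-specific directives might look like the following sketch (targeting the googlebot user agent since the other engines don’t support them; the date shown is illustrative, in the RFC 850 format Google’s unavailable_after announcement used):

```html
<meta name="googlebot" content="unavailable_after: 23-Jul-2008 18:00:00 EST">
<meta name="googlebot" content="noimageindex">
<meta name="googlebot" content="notranslate">
```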

Hopefully we’ll start to see some collaboration from the other engines pretty soon, and eventually another standardized protocol!

Thanks for reading.
