Removing duplicate search engine content using robots.txt

This content is 18 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

Here’s something that no webmaster wants to see:

Screenshot showing that Google cannot access the homepage due to a robots.txt restriction

It’s part of a screenshot from Google Webmaster Tools that says “[Google] can’t currently access your home page because of a robots.txt restriction”. Arghh!

This came about because, a couple of nights back, I made some changes to the website in order to remove the duplicate content in Google. Google (and other search engines) don’t like duplicate content, so by removing the archive pages, categories, feeds, etc. from their indexes, I ought to be able to reduce the overall number of pages from this site that are listed and at the same time increase the quality of the results (and hopefully my position in the index). Ideally, I can direct the major search engines to only index the home page and individual item pages.

I based my changes on some information from the web that caused me a few issues, so these notes describe what I did and will hopefully stop others from repeating my mistakes. There is a caveat though: use this advice with care – I’m not responsible for other people’s sites dropping out of the Google index (or other such catastrophes).

Firstly, I made some changes to the <head> section of my WordPress template.
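
The code is along these lines (a sketch using the standard WordPress conditional tags is_single(), is_page() and is_home(), rather than the exact markup – the meta values reflect the explanation that follows):

<?php if (is_single() || is_page() || is_home()) { ?>
<meta name="robots" content="index,follow" />
<?php } else { ?>
<meta name="googlebot" content="noindex,noarchive,follow,noodp" />
<meta name="msnbot" content="noindex,noarchive,follow" />
<meta name="robots" content="noindex,noarchive,follow" />
<?php } ?>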

Because WordPress content is generated dynamically, this tells the search engines which pages should be indexed and which should not, based on the type of page. So, basically, if this is a post page, another single page or the home page then go for it; otherwise follow the appropriate rule for Google, MSN or other spiders (Yahoo! and Ask will follow the standard robots directive), telling them not to index or archive the page but to follow any links and, additionally, for Google not to include any open directory information. This was based on advice from askapache.com but amended because the default indexing behaviour for spiders is index, follow (or all), so I didn’t need to specify separate rules for Google and MSN as in the original example (although I did need something in that branch, otherwise the logic would read “if condition is met do nothing, else do something” and the “do nothing” could be problematic).

Next, following fiLi’s advice for using robots.txt to avoid content duplication, I started to edit my robots.txt file. I won’t list the full file contents here – suffice to say that the final result is visible on my web server. For those who think that publishing the location of robots.txt is a bad idea (because its contents are effectively a list of places that I don’t want people to go), think of it this way: robots.txt is a standard file on many web servers which, by necessity, needs to be readable and therefore should not be relied upon for security – that’s what file permissions are for (one useful analogy describes robots.txt as a “no entry” sign, not a locked door)!

The main changes that I made were to block certain folders:

Disallow: /blog/page
Disallow: /blog/tags
Disallow: /blog/wp-admin
Disallow: /blog/wp-content
Disallow: /blog/wp-includes
Disallow: /*/feed
Disallow: /*/trackback

(the trailing slash is significant: robots.txt rules are simple prefix matches, so without a trailing slash the rule blocks the directory URL itself as well as everything beneath it, whereas with a trailing slash only the content within the directory – including subdirectories – is affected, not the bare directory URL).
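
As an illustration (using one of the paths above; the commented matches are based on standard robots.txt prefix matching rather than anything specific to this site):

# Without the trailing slash: matches /blog/tags, /blog/tags/ and /blog/tags/wordpress/
Disallow: /blog/tags

# With the trailing slash: matches /blog/tags/wordpress/ but not /blog/tags itself
Disallow: /blog/tags/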

I also blocked certain file extensions:

Disallow: /*.css$
Disallow: /*.html$
Disallow: /*.js$
Disallow: /*.ico$
Disallow: /*.opml$
Disallow: /*.php$
Disallow: /*.shtml$
Disallow: /*.xml$

Then, I blocked URLs that include a ?, except those that end with a ?:

Allow: /*?$
Disallow: /*?
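
Taken together (and relying on the engines that honour Allow and wildcards picking the most specific match), the effect is, for example:

http://www.markwilson.co.uk/blog/?p=123 – blocked (it matches Disallow: /*?)
http://www.markwilson.co.uk/blog/? – allowed (the longer match, Allow: /*?$, wins)

The ?p=123 query string is just a made-up example of a URL containing a ? – any parameterised URL would be treated the same way.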

The problem at the head of this post came about because I blocked all .php files using:

Disallow: /*.php$

Because http://www.markwilson.co.uk/blog/ is equivalent to http://www.markwilson.co.uk/blog/index.php, I was effectively stopping spiders from accessing the home page. I’m not sure how to stop both URLs from serving the same content but, in a site of about 1500 URLs at the time of writing, I’m not particularly worried about a single duplicate instance (although I would like to know how to work around the issue). I resolved the access problem by explicitly allowing access to index.php (and another important file – sitemap.xml) using:

Allow: /blog/index.php
Allow: /sitemap.xml

It’s also worth noting that neither wildcards (*, ?) nor Allow are part of the original robots.txt standard, so the file will fail validation against it. After a bit of research, I found that the major search engines have each added support for their own enhancements to the robots.txt specification:

  • Google (Googlebot), Yahoo! (Slurp) and Ask (Teoma) support allow directives.
  • Googlebot, MSNbot and Slurp support wildcards.
  • Teoma, MSNbot and Slurp support crawl delays.

For that reason, I created multiple code blocks – one for each of the major search engines and a catch-all for other spiders, so the basic structure is:

# Google
User-agent: Googlebot
# Add directives below here

# MSN
User-agent: msnbot
# Add directives below here

# Yahoo!
User-agent: Slurp
# Add directives below here

# Ask
User-agent: Teoma
# Add directives below here

# Catch-all for other agents
User-agent: *
# Add directives below here

Just for good measure, I added a couple more directives for the Alexa archiver (do not archive the site) and Google AdSense (read everything to determine what my site is about and work out which ads to serve).

# Alexa archiver
User-agent: ia_archiver
Disallow: /

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

Finally, I discovered that Google, Yahoo!, Ask and Microsoft now all support sitemap autodiscovery via robots.txt:

Sitemap: http://www.markwilson.co.uk/sitemap.xml

This can be placed anywhere in the file, although Microsoft don’t actually do anything with it yet!

Having learned from my initial experience of locking Googlebot out of the site, I checked the file using the Google robots.txt analysis tool and found that Googlebot was ignoring the directives under User-agent: * (no matter whether that section was first or last in the file). Thankfully, posts to the crawling, indexing and ranking help group and the Google Webmaster Tools group indicated that Googlebot ignores the generic settings whenever there is a specific section for User-agent: Googlebot. The workaround, shown below, is to include all of the generic exclusions in each of the agent-specific sections – not exactly elegant, but workable.
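
In practice that means repeating the same rules in each per-agent block – a shortened illustration using a couple of the exclusions from earlier:

# Google
User-agent: Googlebot
Disallow: /blog/wp-admin
Disallow: /blog/wp-includes

# Catch-all for other agents
User-agent: *
Disallow: /blog/wp-admin
Disallow: /blog/wp-includes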

I have to wait now for Google to re-read my robots.txt file, after which it will be able to access the updated sitemap.xml file which reflects the exclusions. Shortly afterwards, I should start to see the relevance of the site:www.markwilson.co.uk results improve and hopefully soon after that my PageRank will reach the elusive 6.

Links

Google webmaster help center.
Yahoo! search resources for webmasters (Yahoo! Slurp).
About Ask.com: Webmasters.
Windows Live Search site owner help: guidelines for successful indexing and controlling which pages are indexed.
The web robots pages.

28 thoughts on “Removing duplicate search engine content using robots.txt”

  1. Mark

    quite helpful… trying to figure all this out myself…

    but isn’t this backwards:

    # Block URLs that include ? except those that end with ?
    Allow: /*?$
    Disallow: /*?

  2. Anthony, my wording may be a little confusing (a weird double-negative thing around allow, block and except – combined with me not being too hot on regular expressions) but if you click through the link I provided, you should see that:

    The Disallow: /*? line will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

    The Allow: /*?$ line will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

    Google Webmaster Help Center

    Mark

  3. This post truly is a showcase of your unmatched skills. I am glad to know about you and this blog. Hope I’ll be having good time reading this blog.

    Can you please let me know where we have to add that code … which code? The one you wrote for adding to the <head> section – where exactly should it be added?

    Hope to see your reply soon.

  4. Hi Adnan,
    Thank you for your kind comments.

    The meta tags to control search engine behaviour go into the <head> section of my website template (my site runs on PHP, yours may use a different language).

    All of the other code that I mentioned was inserted into a text file called robots.txt that is placed in the web site root (so, in my case, http://www.markwilson.co.uk/robots.txt).

    Mark

  5. Well, you deserve the appreciation.

    I am going to use WordPress 2.5, so how do I add it in WordPress?

    Adnan

  6. I use an older version of WordPress but 2.5 should be similar – just add the first block of code (the PHP bit) to the <head> section of your WordPress template and put the other bits into a robots.txt file as mentioned previously – robots.txt is a standard file regardless of the technology being used to create the website.

  7. Hi Mark,

    Great post. I was wondering if I am doing right in adding

    Disallow: /?s=

    Basically, it seems that Google has indexed tons of search pages, which I do not want due to duplicate content. I want to block them out of Google so I have entered the above in the robots.txt file.

    Is this correct?

    Also, I want to remove these pages from the results, can I do this using the removal tool in webmaster console?

  8. It will be so kind of you if you help me remove my duplicate content from Google. My site is located at the address provided. Please check my site index with a site: search – I’ve got 75 posts and the results shown are 150-something. So kindly help. I’ve done some geeking but I’m just a baby in this, so please tell me what’s wrong. It’s a request. I’ve already spent a lot of time on it…

  9. Wow Mark! I just stumbled upon your post looking for a way to “explicitly allow Technorati to access my blog” since they say my robots.txt file is “explicitly blocking” them, and although I don’t see the answer to my question (your post came up tops for the search), you have given me some new homework to do to ensure that I further comply with the wishes of the search gods. Thanks for a very insightful post. The things we have to do to appease Google… I tell ya;-)

    Have a great day.
    John

  10. The programming aspect in the above is a little beyond me, but I kept coming back to the same question – why do we have to try to stop Google from seeing duplicate content? The answer is simple: don’t write duplicate content in the first place – or am I missing something here?

  11. @JayJay – the reason there is duplicate content is simple… many sites (including this one) are accessible via a number of URLs but we’d like to focus the Google Juice in one direction. In addition, the data (and hence valid page links) can be sliced many ways – by category, tag, date, etc. I only want each page to appear using the URL that takes me to that single blog post.

    Then again, I suspect you already knew that and are just looking for a trackback for “strategic niche marketing” purposes!

  12. Yo Mark!!!

    Very well written information – we really can avoid duplicate content using robots.txt files on our servers. I have tried your method on one of my websites and got very good output.

    Thanks a lot for a quality post.
