Here’s something that no webmaster wants to see:
It’s part of a screenshot from Google Webmaster Tools that says “[Google] can’t currently access your home page because of a robots.txt restriction”. Arghh!
This came about because, a couple of nights back, I made some changes to the website in order to remove the duplicate content in Google. Google (and other search engines) don’t like duplicate content, so by removing the archive pages, categories, feeds, etc. from their indexes, I ought to be able to reduce the overall number of pages from this site that are listed and at the same time increase the quality of the results (and hopefully my position in the index). Ideally, I can direct the major search engines to only index the home page and individual item pages.
I based my changes on some information on the web that caused me a few issues, so this is what I did; by following these notes, hopefully others won’t repeat my mistakes. There is a caveat though: use this advice with care, as I’m not responsible for other people’s sites dropping out of the Google index (or other such catastrophes).
Firstly, I made some changes to the <head> section in my WordPress template:
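A minimal sketch of the kind of conditional involved, using standard WordPress conditional tags, looks something like this (the exact meta values are explained below):

<?php if (is_single() || is_page() || is_home()) { ?>
<!-- single posts, pages and the home page: index as normal -->
<meta name="robots" content="index,follow" />
<?php } else { ?>
<!-- everything else (archives, categories, feeds, etc.): follow links but don't index or archive -->
<meta name="robots" content="noindex,noarchive,follow" />
<meta name="googlebot" content="noindex,noarchive,follow,noodp" />
<meta name="msnbot" content="noindex,noarchive,follow" />
<?php } ?>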
Because WordPress content is generated dynamically, this tells the search engines which pages should be in, and which should be out, based on the type of page. So, basically, if this is a post page, another single page, or the home page then go for it; otherwise follow the appropriate rule for Google, MSN or other spiders (Yahoo! and Ask will follow the standard robots directive), telling them not to index or archive the page but to follow any links and, additionally, for Google not to include any open directory information. This was based on advice from askapache.com but amended, because the default indexing behaviour for spiders is index,follow (or all), so I didn’t need to specify specific rules for Google and MSN as in the original example (but did need something there, otherwise the logic reads “if condition is met do nothing, else do something” and the do nothing could be problematic).
Next, following fiLi’s advice for using robots.txt to avoid content duplication, I started to edit my robots.txt file. I won’t list the full file contents here; suffice to say that the final result is visible on my web server. For those who think that publishing the location of robots.txt is a bad idea (because the contents are effectively a list of places that I don’t want people to go), think of it this way: robots.txt is a standard file on many web servers which, by necessity, needs to be readable and therefore should not be used for security purposes; that’s what file permissions are for (one useful analogy refers to robots.txt as a “no entry” sign, not a locked door)!
The main changes that I made were to block certain folders:
Disallow: /blog/page
Disallow: /blog/tags
Disallow: /blog/wp-admin
Disallow: /blog/wp-content
Disallow: /blog/wp-includes
Disallow: /*/feed
Disallow: /*/trackback
(the trailing slash is significant: if it is missing then the directory itself is blocked as well as everything beneath it, but if it is present then only the content within the directory, including subdirectories, is affected).
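For example (/blog/page/2/ is just an illustrative URL):

Disallow: /blog/page    # blocks /blog/page itself and /blog/page/2/
Disallow: /blog/page/   # blocks /blog/page/2/ but leaves /blog/page accessible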
I also blocked certain file extensions:
Disallow: /*.css$
Disallow: /*.html$
Disallow: /*.js$
Disallow: /*.ico$
Disallow: /*.opml$
Disallow: /*.php$
Disallow: /*.shtml$
Disallow: /*.xml$
Then, I blocked URLs that include ? except those that end with ?:
Allow: /*?$
Disallow: /*?
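In practice, that pair of rules works like this (the query string is a made-up example):

# /blog/?            allowed  (matches Allow: /*?$ because it ends with ?)
# /blog/?page_id=2   blocked  (only Disallow: /*? matches)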
The problem at the head of this post came about because I blocked all .php files using
Disallow: /*.php$
Because http://www.markwilson.co.uk/blog/ is equivalent to http://www.markwilson.co.uk/blog/index.php, I was effectively stopping spiders from accessing the home page. I’m not sure how to avoid the underlying duplication, as both URLs serve the same content, but in a site of about 1500 URLs at the time of writing I’m not particularly worried about a single duplicate instance (although I would like to know how to work around the issue). I resolved the blocked home page by explicitly allowing access to index.php (and another important file, sitemap.xml) using:
Allow: /blog/index.php
Allow: /sitemap.xml
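As I understand it, the engines that support Allow apply the most specific (longest) matching rule, which is why this works; for example (about.php is a made-up URL):

# /blog/index.php   allowed  (Allow: /blog/index.php is more specific than Disallow: /*.php$)
# /blog/about.php   blocked  (only Disallow: /*.php$ matches)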
It’s also worth noting that neither wildcards (* and $) nor the Allow directive are part of the standard robots.txt specification, so the file will fail validation. After a bit of research, I found that the major search engines have each added support for their own enhancements to the robots.txt specification:
- Google (Googlebot), Yahoo! (Slurp) and Ask (Teoma) support Allow directives.
- Googlebot, MSNbot and Slurp support wildcards.
- Teoma, MSNbot and Slurp support crawl delays.
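For reference, a crawl delay directive takes this form (the 10-second value is just an example; I don’t use one in my file):

User-agent: Slurp
Crawl-delay: 10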
For that reason, I created multiple code blocks – one for each of the major search engines and a catch-all for other spiders, so the basic structure is:
# Google
User-agent: Googlebot
# Add directives below here
# MSN
User-agent: msnbot
# Add directives below here
# Yahoo!
User-agent: Slurp
# Add directives below here
# Ask
User-agent: Teoma
# Add directives below here
# Catch-all for other agents
User-agent: *
# Add directives below here
Just for good measure, I added a couple more directives for the Alexa archiver (do not archive the site) and Google AdSense (read everything to determine what my site is about and work out which ads to serve).
# Alexa archiver
User-agent: ia_archiver
Disallow: /
# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
Finally, I discovered that Google, Yahoo!, Ask and Microsoft now all support sitemap autodiscovery via robots.txt:
Sitemap: http://www.markwilson.co.uk/sitemap.xml
This can be placed anywhere in the file, although Microsoft don’t actually do anything with it yet!
Having learned from my initial experiences of locking Googlebot out of the site, I checked the file using the Google robots.txt analysis tool and found that Googlebot was ignoring the directives under User-agent: * (no matter whether that section was first or last in the file). Thankfully, posts to the help groups for crawling, indexing and ranking and for Google Webmaster Tools indicated that Googlebot will ignore generic settings if there is a specific section for User-agent: Googlebot. The workaround is to include all of the generic exclusions in each of the agent-specific sections – not exactly elegant but workable.
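In practice that means each agent-specific section repeats the generic exclusions, so the Googlebot block ends up looking something like this (abbreviated):

# Google
User-agent: Googlebot
Disallow: /blog/wp-admin
Disallow: /*/feed
Disallow: /*/trackback
Disallow: /*.php$
Allow: /blog/index.php
Allow: /sitemap.xml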
I now have to wait for Google to re-read my robots.txt file, after which it will be able to access the updated sitemap.xml file, which reflects the exclusions. Shortly afterwards, I should start to see the relevance of the site:www.markwilson.co.uk results improve and, hopefully soon after that, my PageRank will reach the elusive 6.
Links
Google webmaster help center.
Yahoo! search resources for webmasters (Yahoo! Slurp).
About Ask.com: Webmasters.
Windows Live Search site owner help: guidelines for successful indexing and controlling which pages are indexed.
The web robots pages.