A couple of weeks ago, I wrote about a WordPress plugin called Redirection. I mentioned that I’ve been using it to highlight HTTP 404 errors on my site, but I’ve also been using the crawl errors logged by Google’s Webmaster Tools to track down a number of issues resulting from the various changes made to the site over the years, then creating HTTP 301 redirects to patch them.
Redirections as a result of other people’s mistakes
One thing that struck me was how other people’s content can affect my site – for example, many forums seem to abbreviate long URLs with … in the middle. That’s fine until the HTML anchor gets lost (e.g. in a cut/paste operation) and so I was seeing 404 errors from incomplete URLs like http://www.markwilson.co.uk/blog/2008/12/netboo…-file-systems.htm. These were relatively easy for me to track down and create a redirect to the correct target.
Unfortunately, there is still one inbound link that includes an errant apostrophe that I’ve not been able to trap – even using %27 in the redirect rule seems to fail. I guess that one will just have to remain.
Locating Post IDs
Some 404s needed a little more detective work – for example, http://www.markwilson.co.uk/blog/2012/05/3899.htm is a post where I forgot to add a title before publishing and, even though I updated the WordPress slug afterwards, someone is linking to the old URL. I used phpMyAdmin to search for post ID 3899 in the wp_posts table of the database, from which I could identify the post and create a redirect.
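For reference, phpMyAdmin is just running a SQL query behind the scenes – this rough Python sketch shows an equivalent lookup (the pymysql client, hostname and credentials are placeholder assumptions for a local WordPress database):

```python
import pymysql  # assumption: any MySQL client would do

# Rough sketch: find the title and current slug of the post whose numeric ID
# was left behind in the old URL. Connection details are placeholders.
connection = pymysql.connect(host="localhost", user="wp_user",
                             password="secret", database="wordpress")
try:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT ID, post_title, post_name FROM wp_posts WHERE ID = %s",
            (3899,),
        )
        print(cursor.fetchone())  # e.g. (3899, 'Post title', 'current-slug')
finally:
    connection.close()
```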
Pattern matching with regular expressions
Many of the 404s were being generated based on old URL structures from either the Blogger version of this site (which I left behind several years ago) or changes in the WordPress configuration (mostly after last year’s website crash). For these I needed to do some pattern matching, which meant an encounter with regular expressions, which I find immensely powerful, fascinating and intimidating all at once.
Many of my tag URLs were invalid as, at some point, I had obviously changed them from /blog/tags/tagname to /blog/tag/tagname, but I also had a hierarchy of tags in the past (possibly when I was still mis-using categories), which was creating some invalid URLs (like http://www.markwilson.co.uk/blog/tag/apple/ipad). The hierarchy had to be dealt with on a case-by-case basis, but the RegEx for dealing with the change in the tag URLs was fairly simple:
- Source RegEx: (\/tags\/)
- Target RegEx: (\/tag\/)
Using the Rubular Ruby RegEx Editor (thanks to Kristian Brimble for the suggestion – there were other tools suggested but this was one I could actually understand), I was able to test the RegEx on an example URL and, once I was happy with it, that was another redirection created. Similarly, I redirected (\/category\/) to (\/topic\/).
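As a rough illustration of what these two rules do, here’s a Python sketch rather than the Redirection plugin itself – the example URLs are hypothetical, and Python’s replacement is a plain string rather than a pattern:

```python
import re

# Rough sketch: any URL containing /tags/ is rewritten to /tag/, and
# /category/ to /topic/. The example URLs below are hypothetical.
rules = [
    (re.compile(r"(\/tags\/)"), "/tag/"),
    (re.compile(r"(\/category\/)"), "/topic/"),
]

old_urls = [
    "http://www.markwilson.co.uk/blog/tags/apple",      # hypothetical
    "http://www.markwilson.co.uk/blog/category/apple",  # hypothetical
]

for url in old_urls:
    new_url = url
    for pattern, replacement in rules:
        new_url = pattern.sub(replacement, new_url)
    print(url, "->", new_url)
```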
I also created a redirection for legacy .html extensions, rewriting them to .htm:
- Source RegEx: (.*).html
- Target RegEx: $1.htm
Unfortunately, my use of a “greedy” wildcard meant this also substituted html in the middle of a URL (e.g. http://www.markwilson.co.uk/blog/2008/09/creating-html-signatures-in-apple-mail.htm became http://www.markwilson.co.uk/blog/2008/09/creating-.htm-signatures-in-apple-mail.htm), so I edited the source RegEx to (.*).html$.
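To see the difference the anchor makes, here’s a rough Python sketch (the second URL is made up for illustration; the plugin applies its rules in PHP, but these simple patterns behave the same way):

```python
import re

# Rough sketch: the unanchored pattern also matches "html" in the middle of a
# slug, while the $ anchor limits it to URLs that actually end in .html.
greedy = re.compile(r"(.*).html")      # original rule - too permissive
anchored = re.compile(r"(.*).html$")   # revised rule - anchored to the end

urls = [
    "http://www.markwilson.co.uk/blog/2008/09/creating-html-signatures-in-apple-mail.htm",
    "http://www.markwilson.co.uk/blog/2008/12/some-old-post.html",  # hypothetical
]

for url in urls:
    print(url)
    print("  matched by (.*).html  :", bool(greedy.search(url)))
    print("  matched by (.*).html$ :", bool(anchored.search(url)))
    if anchored.search(url):
        # Python writes the backreference as \1 where the plugin's target uses $1
        print("  rewritten to          :", anchored.sub(r"\1.htm", url))
```

(Strictly speaking, the dot before html should also be escaped as \. so that it only matches a literal full stop, but it’s the $ anchor that stops the mid-URL matches.)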
More complex regular expressions
The trickiest pattern I needed to match was for archive pages using the old Blogger structure. For this, I needed some help, so I reached out to Twitter and was very grateful to receive some responses, including one from Dan Delaney that led me to create this rule:
- Source RegEx: /blog\/([a-zA-Z\/]+)([\d]+)(\D)(\d+)(\w.+)
- Target RegEx: /blog/$2/$4/
Dan’s example helped me to understand a bit more about how match groups are used, taking the second and fourth matches here to use in the target, but I later found a tutorial that might help (most RegEx tutorials are quite difficult to follow but this one is very well illustrated).
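To show how the second and fourth groups end up in the target, here’s a rough Python sketch – the archive path is a made-up guess at the old Blogger structure, so treat it as illustrative only:

```python
import re

# Rough sketch of how the match groups line up against a hypothetical
# Blogger-style archive path.
pattern = re.compile(r"/blog\/([a-zA-Z\/]+)([\d]+)(\D)(\d+)(\w.+)")

old_url = "/blog/archive/2008_12_01_archive.html"  # hypothetical example

match = pattern.search(old_url)
if match:
    print(match.groups())  # ('archive/', '2008', '_', '12', '_01_archive.html')
    # Only the second and fourth groups (year and month) are kept in the target;
    # Python writes them as \2 and \4 where the plugin's target uses $2 and $4.
    print(pattern.sub(r"/blog/\2/\4/", old_url))  # /blog/2008/12/
```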
A never-ending task
It’s an ongoing task – the presence of failing inbound links due to incorrect URLs means that I’ll have to keep an eye on Google’s crawl errors but, over time, I should see the number of 404s on my site drop. That in itself won’t improve my search placement but it will help to signpost users who would otherwise have been turned away – and every little bit of traffic helps.