What happens when search engine spiders get lost? They cannot figure out which way to go on your site.
Sometimes they get confused on their own, and sometimes other sites send them down a different path to the same page on your site.
Some servers use mod_dir, which causes additional issues by redirecting the domain without a trailing slash to the domain with a trailing slash, so domain.com redirects to domain.com/.
This very rarely causes an issue, but it is one reason to always include the trailing slash in any link you add when link building. It is the proper way to link to a site. Ever notice how the Open Directory and many other directories require their editors to add the trailing slash?
Canonical means the “Authoritative Path”.
This is how you tell the search engines which pages make up your site. Since you are essentially talking to robots, you need to take extra precautions because robots “do not think”. If a robot is caught in a loop, or sees pages that are actually the same but can be reached by 3 to 6 different paths, it will treat each path as an additional page.
So if the spider gets confused, it can duplicate your pages and place importance on an unintended form of the page. Hence, it may put priority on index.html instead of the domain itself, which is why a proper home page link is either “http://www.domain.com/” or “/” but NEVER /index.xxx.
Canonical issues can take many forms, and problems with them are becoming rare thanks to sitemap programs and increasing awareness of the factors involved. They do still exist, though, and can be caused by a webmaster who has no knowledge of the SEO factors involved in developing proper website architecture.
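As a rough illustration of the home page rule, the Python sketch below rewrites home page links into the “/” form. The function name `canonical_home_link` and the list of index file names are assumptions made for this example; adjust them to whatever default documents your server actually uses.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default document names; adjust for your server.
INDEX_FILES = {"index.html", "index.htm", "index.php", "index.asp"}


def canonical_home_link(url):
    """Replace an empty path or a root-level index file with "/"."""
    parts = urlsplit(url)
    path = parts.path
    # "" covers http://www.domain.com; the set covers /index.xxx style links.
    if path == "" or path.lstrip("/").lower() in INDEX_FILES:
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Run over the links in a template before publishing, a helper like this would turn both `http://www.domain.com` and `/index.html` into the trailing-slash form, while leaving deeper pages such as `/about/index.html` alone.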
Duplicate Content Issues
Duplicate pages are also caused by using the same contact form with different dynamic variables. So a form may be contact.asp?id=california and the same form may also be contact.asp?id=new york. This means that Google sees the exact same page with different ways to get to it and may treat it as spam.
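One way to spot this pattern in a list of crawled URLs is to group them by host and path while ignoring the query string. The sketch below is only an illustration; `group_by_path` is a name invented here, and it assumes that pages differing only in their query string really do show the same content, which you would need to confirm for your own site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_path(urls):
    """Return paths that are reachable through more than one query string."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        # The query string is deliberately left out of the key.
        key = (parts.netloc.lower(), parts.path)
        groups[key].append(url)
    # Keep only the paths that were seen under several URLs.
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Any key that comes back is a candidate for the fixes discussed in this article.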
The simple fix for this is a rel=”nofollow” attribute on the links to the form, or banning contact.asp wildcards in the robots.txt file. This is becoming a common task on many dynamic sites. I have added it here because we can consider it a potential canonical trigger, as the path becomes duplicated.
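To make the robots.txt option more concrete, here is a simplified Python sketch of how a wildcard rule might be matched against a URL path. It is not any search engine's actual matcher: `rule_matches` is a hypothetical helper that assumes `*` matches any run of characters and a trailing `$` anchors the end of the URL, and real crawlers may differ in the details.

```python
import re

# Example rule as it might appear in a robots.txt file (illustrative only):
#   User-agent: *
#   Disallow: /contact.asp?*


def rule_matches(rule, path):
    """Return True if a robots.txt path rule matches the given URL path.

    Simplified: '*' matches any run of characters, a trailing '$'
    anchors the end, and everything else is a literal prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match anything.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None
```

With the rule above, every `contact.asp?id=...` variation is blocked while the bare form page and the rest of the site stay crawlable.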
Adding an SSL (secure) certificate to a site and building the secure page with the exact same navigation as the rest of the site is a mistake if you keep the relative URL links on the https page. Oops. Any designer could do this, but an SEO who understands the canonical factors would not. By creating the page with relative links, you have given the search engine spiders access to the entire site under a new domain.
Google treats the non-www and the www forms as two separate sites in the same way, so with the https forms you now have up to four versions of the site.
Oh boy, now you’ve caused a potential trigger that can remove a website from Google’s good graces by valuing the wrong version. Search engines are more likely to consider this spam, or a decision that their automated robots must now make: is the site more important in the non-www form, the www form, the https form, or even the https://www. form? It’s a potential nightmare waiting to happen if the decision is wrong.
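To see why relative links on a secure page open up the whole site under https, here is a small Python sketch using the standard library's `urljoin`, which resolves a link against the page it sits on much as a crawler would. The page and link names are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href relative to the page it appears on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical relative navigation copied onto the secure order page.
nav = ["/", "/about.html", "services/seo.html"]

# Every relative link inherits https from the page, so a spider can
# crawl a second copy of the whole site.
secure = resolve_links("https://www.domain.com/order.html", nav)

# An absolute link keeps pointing at the one intended version.
absolute = resolve_links("https://www.domain.com/order.html",
                         ["http://www.domain.com/about.html"])
```

Here `secure` holds three https URLs, while `absolute` still points at the http page.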
Relax! There are many simple fixes:
- Always program the site to be friendly by using absolute links when developing navigation and adding links to internal pages of your site. Absolute links can also help prevent automated content stealing, where sites try to own your content and rank with it. “Theoretically”, it is entirely possible for a third-party site to take your content and rank for it while you get hit as the duplicate page and no longer rank for it.
- Use rel=”nofollow” in the href tags of links that go to a secure server and/or to dynamic forms. This tells the spiders right away not to count the pages as a link, in effect helping them understand the priority of the page from the href relevancy command. This can help increase internal page quality as well by removing the potential trigger for “Mad Lib” spam.
- Use a Canonical URL redirect fix. I have listed many here in my 301 and Canonical Redirect Tutorial. I am still looking for a Mac WebSTAR canonical version; one would be appreciated.
- Use robots.txt to block files and wildcards. Not all search engines support wildcards. Yahoo does, MSN does, and Google does as well. These rules may be best used by identifying the robot and the path.
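The canonical redirect fix from the list can be sketched in a few lines of Python. This is only the decision logic, not a server configuration: `canonical_redirect` is a hypothetical helper, and the preferred scheme and host below are assumptions for the example, so swap the constants if your site's preferred form is different.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_SCHEME = "http"           # assumption: the site's preferred scheme
CANONICAL_HOST = "www.domain.com"   # assumption: the site's preferred host


def canonical_redirect(url):
    """Return (301, target) if url is not canonical, else (200, url)."""
    parts = urlsplit(url)
    # A bare domain gets the trailing slash.
    path = parts.path or "/"
    target = urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, parts.query, ""))
    if target == url:
        return 200, url
    # A permanent redirect sends every other form to the one canonical URL.
    return 301, target
```

Whichever form a spider arrives on, non-www, www, or either https form, it ends up at a single URL for each page.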
In trying to keep this short, I may have overlooked other methods of fixing canonical issues. Please feel free to add them in the comments below.
It’s always great to learn all avenues that can help secure the proper architecture of websites.
Alan Rabinowitz is the CEO of SEO Image, a New York based SEO and Internet Marketing company which focuses on corporate branding and positioning in search engines.