I was pleased to be invited to write a guest post on Search Engine Journal, but it took me a little while to settle on what exactly I was going to talk about. Today, finally, I made my decision – tips to help diagnose problems with your website architecture. I’ve seen lots of posts covering different aspects of problems associated with website architecture for SEO, but I can’t think of any I’ve seen for a while that try to bring all of those ideas together.
So what ideas should you be looking at when you’re trying to figure out if there’s a problem with your website architecture?
How flat is your site architecture?
As Rand points out, flatter site architecture isn’t just for SEO, it’s for the benefit of your users, too. How many clicks are there between your homepage and a product page? Have you made it easy for users (and search engines) to get all the way to the bottom of your site map? Think of this from the point of view of the number of levels between your (often most authoritative) homepage and the very lowest content type on the site (e.g. job pages if you’re in recruitment and product pages if you’re in retail). The fewer of these levels there are, the more likely you are to distribute PageRank evenly throughout your site architecture.
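If you have a list of which pages link to which (from a crawl export, say), counting the clicks from the homepage is a simple breadth-first search. Here’s a minimal sketch; the `site` dictionary and its URLs are made up for illustration, and a real crawl would give you a much bigger graph.

```python
from collections import deque

def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: page -> pages it links to
site = {
    "/": ["/shoes/"],
    "/shoes/": ["/shoes/acme/"],
    "/shoes/acme/": ["/shoes/acme/runner-1/"],
}

depths = click_depths(site, "/")
deepest = max(depths.values())
```

Pages that never show up in `depths` can’t be reached from the homepage at all, which is worth knowing in its own right.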
One of the issues that can come up when you’re categorising your pages is the maximum number of links you can have on a page. Take a look at this post from Matt Cutts for some good background on how far you can go. According to data collected from Linkscape, the average number of links found on a page is 75! That means there are some pages out there with a lot of links!
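To see where your own pages sit against that kind of figure, you can count the links in a page’s HTML. This is a rough sketch using Python’s built-in HTML parser; the `sample` markup is invented, and it counts every `<a>` tag with an `href`, which may not match exactly how a search engine counts.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip named anchors and empty hrefs
                self.hrefs.append(href)

def count_links(html):
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return len(parser.hrefs)

# Invented markup for illustration
sample = '<p><a href="/shoes/">Shoes</a> <a href="/hats/">Hats</a> <a name="top">Top</a></p>'
link_total = count_links(sample)
```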
When you’re designing your site architecture, try to keep it as flat as possible, following the homepage > category > brand > product principle.
You can add cross links between categories (or products) that are related, too. A great tip is to identify your most authoritative pages using the “Top Pages On Domain” tool at SEOmoz and use those pages to pass some authority to others by linking from them to the pages that need it.
Broken internal and external links
Identifying broken internal links is pretty easy: download Xenu’s Link Sleuth and leave it to crawl your site. Get the broken links report together and you’ll be able to quickly identify whether your site has any internal broken links leading to 404 error pages. If you’ve got a really big site to crawl, Xenu can get a little overwhelmed. I really like using Web Link Validator for sites with more than 100,000 pages, because the software stays stable and can easily export its results without any problems.
For checking inbound links that may be leading to a broken URL or a 404 error page, open Google Webmaster Tools and check out the pages with external links report. The report will give you a list of pages producing 404s. Work through the list, 301 redirect each broken URL to a relevant live page, and you get those links back for free.
Server header responses
A “server header response”, like a 200 or a 404, is a status code defined by the HTTP/1.1 protocol. Sending the right status code when your page loads fully, when a URL is broken or when there’s a server problem is a vital part of the way your website communicates with search engine crawlers. Check the response codes you get by installing Live HTTP Headers for Firefox and testing a few scenarios such as a broken URL and an ordinary page load. If anything looks wrong, talk to your developers!
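If you’d rather script the same check, a small sketch along these lines will fetch the raw status code for a URL without following redirects, so a 301 shows up as a 301. The function names and the labels returned by `describe` are my own shorthand, not anything defined by the protocol.

```python
import http.client
from urllib.parse import urlsplit

def fetch_status(url):
    """Send a HEAD request and return the status code, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()

def describe(status):
    """Map a status code to a rough label (labels are illustrative)."""
    if 200 <= status < 300:
        return "ok"
    if status in (301, 308):
        return "permanent redirect"
    if 300 <= status < 400:
        return "other redirect"
    if status in (404, 410):
        return "missing"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "unexpected"
```

A typical test would be `describe(fetch_status("http://www.yourdomain.com/this-page-does-not-exist"))`: if a made-up URL comes back as anything other than "missing", that’s the sort of thing to raise with your developers.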
Do you have a duplicate content problem?
Diagnosing duplicate content and fixing any problems you find is, in my mind, a best practice element of SEO. Even though the search engines gave us the canonical tag, you should still be making sure there aren’t too many problems. Use your common sense: take a look at your total indexed pages by running a simple “site:yourdomain.com” query in Google and by using Yahoo’s Site Explorer. Does it seem like there are too many pages in the index? 75,000 pages when you know there are more like 20,000? When you’re investigating your site in a search engine index, you’re looking out for malformed URLs, query strings (like ?=sessionid or ?first_page etc) or many repeated results with the same title / description. Don’t forget that Google will only display the first 1,000 URLs with the site: operator, so you need to get creative when investigating your site.
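Once you have a list of URLs from a crawl or an index export, one way to spot query-string duplicates is to strip the parameters that don’t change the page and see which URLs collapse into the same address. This is only a sketch: the parameter names in `IGNORED_PARAMS` are assumptions based on the examples above, so swap in whatever your own platform uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names that do not change page content
IGNORED_PARAMS = {"sessionid", "first_page"}

def normalise(url):
    """Drop ignored query parameters and any fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def duplicate_groups(urls):
    """Group URLs that normalise to the same address; keep groups of two or more."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}

# Placeholder URLs for illustration
indexed = [
    "http://www.yourdomain.com/shoes/?sessionid=abc",
    "http://www.yourdomain.com/shoes/",
    "http://www.yourdomain.com/hats/?colour=red",
]
dupes = duplicate_groups(indexed)
```

Each key in `dupes` is a page that appears under more than one URL, which makes it a candidate for a canonical tag or a redirect.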
Orphaned pages on a big site can be a problem, particularly if you’ve migrated to a new platform or fundamentally changed your site design and URL structure recently. Not linking internally to a page on your site can often be a death move for the ranking position of the URL in question, particularly if there are few or no external links to the page. Checking for orphaned pages on 100,000+ page sites is enormously difficult too, especially with dynamic sites. Here are a few things you can look out for to make sure you’ve not orphaned any pages:
- Significant change in total pages in Google and Yahoo site index
- Changes in the numbers for pages with internal links from the Google WMT report
- On recrawling your site with Xenu or Web Link Validator, did you notice a change in the number of URLs the software was able to crawl? Save your first crawl, export the data and compare it to a recrawl. Where are the gaps?
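The crawl comparison in the last point is easy to script once both crawls are exported. Here’s a rough sketch, assuming each export is a CSV with a column named `url`; that column name is my assumption, so check what your crawler actually writes out.

```python
import csv
import io

def load_urls(csv_text, column="url"):
    """Read the set of URLs from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}

def crawl_gaps(before, after):
    """Return (URLs that disappeared, URLs that are new) between two crawls."""
    return sorted(before - after), sorted(after - before)

# Tiny invented exports for illustration
first_crawl = "url\nhttp://www.yourdomain.com/\nhttp://www.yourdomain.com/a/\n"
second_crawl = "url\nhttp://www.yourdomain.com/\nhttp://www.yourdomain.com/b/\n"

missing, new = crawl_gaps(load_urls(first_crawl), load_urls(second_crawl))
```

Anything in `missing` was reachable in the first crawl but not the second, so those are the URLs to check first for lost internal links.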
Check to see if your site is indexed at both http://yourdomain.com and http://www.yourdomain.com and, if it is, set up a canonical 301 redirect to sort the problem out. Don’t forget to add a trailing slash at the end of the URLs (or remove it, depending on which you prefer). There are plenty of guides to setting these redirects up in IIS or Apache – here’s a nice guide from SEObook.com
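The redirect itself belongs in your web server configuration, but it can help to write the rule down as logic first so you can test it against real URLs. The sketch below assumes you prefer the www hostname and a trailing slash on anything that doesn’t look like a file; both choices, and the function names, are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url, preferred_host="www.yourdomain.com"):
    """Return the preferred form of a URL: one hostname, trailing slash on page paths."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    # Assumption: a dot in the last segment means a file such as logo.png
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

def redirect_for(url):
    """Return (301, target) if the URL should redirect, or None if it is already canonical."""
    target = canonical_url(url)
    return None if target == url else (301, target)
```

For example, `redirect_for("http://yourdomain.com/shoes")` points at `http://www.yourdomain.com/shoes/`, while the canonical URL itself returns `None`.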
Is your development server indexed?
One of the most suicidal moves to make in SEO is to allow your development site to get indexed. There are lots of ways to remove a website from a search engine index and usually we try to avoid that happening. In the case of an indexed development server, you need to get rid of it as soon as you can. Make sure you use a robots.txt file at the root of your development server and, if at all possible, restrict traffic to the development site from outside of your network, perhaps by specifying IP ranges that are allowed or user agents that are not allowed in. It doesn’t take long to find people who have forgotten this very basic rule – take a look at this Google query.
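It’s worth checking that the robots.txt on your development server really does say what you think it says. Here’s a small sketch using Python’s standard robots.txt parser; the `dev.yourdomain.com` hostname is a placeholder, and the check only covers the rules in the file, not the network restrictions mentioned above.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay out of the whole site
DEV_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def blocks_all_crawlers(robots_txt, probe_url="http://dev.yourdomain.com/any-page"):
    """Return True if the given robots.txt text disallows the probe URL for crawlers."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", probe_url) and not parser.can_fetch("Googlebot", probe_url)

dev_is_blocked = blocks_all_crawlers(DEV_ROBOTS_TXT)
```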
Keeping a close eye on your site on a regular basis by watching your site index, crawling and performing regular checkups can really pay dividends and give you a fantastic early warning system against future issues with your site architecture. Do you have any more tips you’d like to share? Tell us about your experiences below…