
Huge Site SEO: Optimizing for the Long, Long Tail

The biggest obstacle in my search for SEO nirvana has been the lack of an industry blog focused on huge site SEO. When I hear people talk about dealing with 1,000 pages as if it is a lot, I cry a little on the inside. They don’t know how good they have it.

What is huge site SEO? In my eyes, you need at least 100,000 pages to be considered a huge site. My site, Movoto.com, comes up with more than 15 million indexed pages.

[Screenshot: site:movoto.com search results showing 15+ million indexed pages]

For the most part, small and big-ass sites overlap in their SEO strategies. Title tags, links, social shares, and XML sitemaps all play a role; huge sites just do it all on steroids. But there are certain issues that only big sites face.

Here are some of the solutions I have seen to the specific challenges that a huge site encounters. The focus becomes long, long tail rankings.

Optimize for the Long, Long Tail

On a site with more than 15 million pages, head terms can naturally start at five words long (think “Los Angeles Homes for Sale”). The long tail is where you live. The long, long tail is what you optimize for.

One shining example is keyword repetition. I know, you’re saying this isn’t 2005, or whatever you old timers say was a long time ago. But when you have 15 million pages, page number 13,200,010 has negative PageRank (well, not really, but close). How do you make it rank? You nail the point where keyword repetition maximizes the ranking potential.

My favorite example is Trulia, one of the largest real estate sites on the Web and a shining beacon of SEO strategy. Trulia deliberately repeats the property’s address (the page’s main keyword) eight times throughout the H1 and H2s.

[Screenshot: repetition of the address keyword in Trulia’s H2s]
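To make that concrete, here is a rough sketch of what a heading template that leans on keyword repetition might look like. This is not Trulia’s actual template; the function, address, and section names are made-up examples.

```python
# Minimal sketch of a listing-page heading template that deliberately
# repeats the main keyword (the property address) across the H1 and H2s.
# The address, city, and section names are hypothetical examples.

def build_headings(address: str, city: str) -> list[str]:
    """Return the H1/H2 text for a single property page."""
    return [
        f"<h1>{address}, {city}</h1>",
        f"<h2>About {address}</h2>",
        f"<h2>Price History for {address}</h2>",
        f"<h2>Schools near {address}</h2>",
        f"<h2>Similar Homes to {address}</h2>",
        f"<h2>Estimated Value of {address}</h2>",
    ]

if __name__ == "__main__":
    for tag in build_headings("123 Main St", "Los Angeles"):
        print(tag)
```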

Another tactic that has seen its credibility hit hard is computer-generated content. But who said computer-generated content doesn’t have value? If you can add it to a page and improve the user experience in the process, then it is just fine. Trulia uses it when there is no agent description. It has enough value that Google Instant finds value in it.

[Screenshot: Trulia’s computer-generated agent description]
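If you want to picture how a fallback like that could be wired up, here is a minimal sketch. The field names and the blurb format are hypothetical, not Trulia’s (or our) actual implementation.

```python
# Rough sketch of filling a missing agent description from structured data,
# so the page still gives users something useful. Field names are made up.

def agent_fallback_description(agent: dict) -> str:
    """Generate a short agent blurb when no human-written bio exists."""
    if agent.get("bio"):                      # prefer the real bio when present
        return agent["bio"]
    return (
        f"{agent['name']} is a real estate agent serving {agent['city']} "
        f"with {agent['active_listings']} active listings and "
        f"{agent['sales_last_year']} homes sold in the last 12 months."
    )

print(agent_fallback_description({
    "name": "Jane Doe",
    "city": "San Mateo",
    "active_listings": 12,
    "sales_last_year": 31,
    "bio": "",
}))
```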

Control the Link Juice Tsunami

There are two ways to maintain control over link juice flow: link to everything or maintain a deliberate link strategy. While domain authority plays an important role in which strategy to utilize, in my opinion, every site should maintain a deliberate link strategy.

Sites like Trulia and Zappos have so much domain authority that they can’t possibly spread it too thin. Have you seen Trulia’s footer? Try telling them that limiting a page to 100 links is the proper way to do SEO. Or take a look at the Zappos ocean (50+ links on every page):

[Screenshot: the Zappos ocean of footer links]

We mere mortals need to proactively, and thoughtfully, determine where to aim our limited link juice. Since we don’t have the domain authority to link to every page from every page, we have to place priority on particular categories with the aim of optimizing revenue. The tactics include, but (probably) aren’t limited to:

  • Deep homepage links. You control what products or categories you want spiders, and users, to value the most.
  • Deliberate non-linking. A good example of this practice is how we link to the off-market homes category page from only one spot in each city. With our linking practices, we are telling spiders to place significantly less value on off-market homes relative to active homes.
  • Pick “favorite” nearby cities, or similar products, where “favorite” really means revenue generating (a rough sketch of this follows the list).
  • Know your links. Don’t have any extraneous internal links.
  • Remove sitewide external links (Facebook and Twitter, I’m looking at you).
  • Breadcrumbs
  • Footer links
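To illustrate the “favorite” nearby cities idea, here is a rough sketch of how you might cap and rank those links by revenue. The data, field names, and cap are all made up for illustration.

```python
# Hedged sketch of a "deliberate link strategy": from each city page, link only
# to a capped list of nearby cities, ranked by the revenue they generate rather
# than alphabetically. Data and field names are illustrative assumptions.

def pick_favorite_cities(nearby: list[dict], max_links: int = 10) -> list[str]:
    """Return the nearby-city slugs worth spending link juice on."""
    ranked = sorted(nearby, key=lambda c: c["monthly_revenue"], reverse=True)
    return [c["slug"] for c in ranked[:max_links]]

nearby_cities = [
    {"slug": "palo-alto-ca", "monthly_revenue": 9200},
    {"slug": "menlo-park-ca", "monthly_revenue": 4100},
    {"slug": "east-palo-alto-ca", "monthly_revenue": 600},
]
print(pick_favorite_cities(nearby_cities, max_links=2))
```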

Being smart about directing the flow of link juice significantly affects traffic when you are dealing with a large number of pages.

Beware the Dangers of Duplicate and Thin Content

Fifteen-plus million pages makes dealing with WordPress duplicate content issues seem like chump change. It’s hard enough to get a long, long tail page indexed and ranking; you don’t want to make your life harder by splitting the juice between two identical pages.

There have been a ton of awesome tutorials on removing duplicate content, so I won’t bore you with the technicalities of how to remove it, but it is extremely important to deal with it preemptively. The rel="canonical" link element is your best friend in this regard. Adding rel=canonical prevents any would-be crazy links from creating duplicates at mass scale.

Here is a personal example where rel=canonical prevents a ton of duplicate content on our site. Patrick.net, a real estate forum, appends a “?source=Patrick.net” to some outbound links. That has “double trouble” written all over it, as each of those links would otherwise create a duplicate of the target page.
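Here is a hedged sketch of the kind of canonicalization logic that handles cases like this: strip known tracking parameters before writing the rel=canonical URL. The parameter list and URLs are assumptions, not our production code.

```python
# Sketch of computing a rel=canonical URL by stripping tracking parameters
# (like the "?source=Patrick.net" example above) so inbound links with query
# strings don't spawn duplicate pages. Parameter names are assumptions.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"source", "utm_source", "utm_medium", "utm_campaign"}

def canonical_url(url: str) -> str:
    """Return the URL that should go in the rel=canonical tag."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    # Rebuild the URL without the tracking params (and drop any fragment).
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://www.movoto.com/some-listing/?source=Patrick.net"))
# -> https://www.movoto.com/some-listing/
```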

Thin Content

Dealing with thin content is a moral dilemma for me: It can still be useful, even with limited information. In real estate, when a property goes off-market we are severely restricted in the amount of information we are allowed to show without a registration. (From what I’ve read, ecommerce sites face a similar issue with discontinued products.) We go from a page with the richest amount of content we offer, to one with essentially no content. People still search for these properties, so we don’t want to remove them.

We deal with it thusly (a rough sketch in code follows the list):

  1. Certain properties get a 404 as soon as they go off-market.
  2. We only maintain a set number of off-market properties in the HTML sitemap. This places higher importance on more recently off-market properties.
  3. After a set period of time, we 404 all remaining off-market properties.
  4. From a user experience perspective, we try to get people who land on off-market properties to active properties.
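As promised, here is a rough sketch of that off-market policy in code. The thresholds, field names, and the notion of “restricted” listings that 404 immediately are illustrative assumptions, not our exact rules.

```python
# Rough sketch of the off-market policy above: some listings 404 immediately,
# the rest 404 after a set window, and only the most recently off-market
# listings stay in the HTML sitemap. All thresholds and field names are made up.

from datetime import date, timedelta

OFF_MARKET_TTL = timedelta(days=180)   # assumed window before a listing 404s
SITEMAP_CAP = 500                      # assumed cap on off-market sitemap entries

def off_market_status(listing: dict, today: date) -> str:
    """Decide what happens to a single listing once it leaves the market."""
    if listing["restricted"]:                       # some properties 404 right away
        return "404"
    if today - listing["off_market_date"] > OFF_MARKET_TTL:
        return "404"                                # aged out
    return "keep"

def sitemap_entries(off_market: list[dict]) -> list[dict]:
    """Keep only the most recently off-market listings in the HTML sitemap."""
    recent_first = sorted(off_market, key=lambda l: l["off_market_date"], reverse=True)
    return recent_first[:SITEMAP_CAP]

print(off_market_status(
    {"restricted": False, "off_market_date": date(2013, 1, 15)},
    today=date(2013, 3, 1),
))
```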

Related nerd side note: We 301ed properties to the homepage until recently. It seems that after a certain number of 301s, Google will give the page that is being redirected the same content as the target URL. Who knew?

Optimize the Crawl Quota

In order for Google to rank your long tail content, Google needs to be able to find it and index it. When you create 10,000+ new pages per day, there are some things you want to optimize/maximize:

  1. Load and response times
  2. Find-ability
  3. Domain authority

Load and Response Times

The more quickly Google can crawl your pages, the more pages Google crawls. This isn’t rocket science, but the results speak for themselves. Jonathan Colman’s graph says it all.

[Graph: page load times vs. pages crawled per day]
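If you want a quick way to spot-check your own response times before Googlebot does, a few lines of Python will do it. This is a crude sketch with placeholder URLs, not a substitute for real monitoring.

```python
# Quick-and-dirty sketch for spot-checking response times on a sample of URLs,
# since slow pages eat into how many pages Googlebot will crawl per day.
# The URLs below are placeholders, not real pages to test.

import time
from urllib.request import urlopen

def time_url(url: str) -> float:
    """Return the seconds it took to fetch and fully read the URL."""
    start = time.time()
    with urlopen(url, timeout=10) as resp:
        resp.read()
    return time.time() - start

for url in ["https://www.example.com/", "https://www.example.com/some-category/"]:
    print(f"{url} -> {time_url(url):.2f}s")
```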

Find-ability

Find-ability, a concept I just made up, relates directly to the idea of flowing link juice to the most important pages with a flat site architecture. When Google crawls 500,000 pages on an off day, you want to be sure that spider is sucking in the appropriate pages. Getting down to category and product pages quickly is paramount because, again, we are optimizing for the long, long tail.

There are other ways to help lead spiders in the right direction. If you link to the category page on every relevant product page, you increase the chance Google finds the product page. It then becomes very important that Google finds the newest, or most important, products on that category page.

Live example: Zillow does something incredibly interesting in this regard. They order homes by featured listings if you search on their site for Los Angeles. BUT, start a private session and search on Google for “Los Angeles Zillow” and Zillow orders homes on the page by newest on market.

Based on the above logic that more links to a page make it more findable, you can make long-tail product pages even more findable by interlinking them. Trulia is a prime example again; they give preference to the newest and most important related products:

[Screenshot: Trulia giving preference to new pages in its related links]
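Here is a rough sketch of what that kind of interlinking rule could look like: pick a handful of related listings from the same city and put the newest first. The selection logic, field names, and counts are made up for illustration.

```python
# Sketch of interlinking long-tail product pages: from each listing, link to a
# handful of related listings, giving preference to the newest ones so fresh
# pages get crawled quickly. Data and field names are illustrative assumptions.

from datetime import date

def related_listings(current: dict, candidates: list[dict], n: int = 6) -> list[str]:
    """Pick nearby listings to link to, newest on market first."""
    same_city = [c for c in candidates if c["city"] == current["city"]]
    same_city.sort(key=lambda c: c["listed_date"], reverse=True)
    return [c["url"] for c in same_city[:n]]

candidates = [
    {"city": "Los Angeles", "listed_date": date(2013, 2, 20), "url": "/la/111-elm-st"},
    {"city": "Los Angeles", "listed_date": date(2013, 2, 28), "url": "/la/9-oak-ave"},
    {"city": "San Diego",   "listed_date": date(2013, 3, 1),  "url": "/sd/5-palm-dr"},
]
print(related_listings({"city": "Los Angeles"}, candidates, n=2))
```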

Domain Authority

The last, and probably best, way to make sure your long, long tail pages rank is to increase your domain authority. “Increase your domain authority” is just a fancy way of saying get links, a lot of links. Trulia obtained over 2K ULDs (unique linking domains) in January alone, according to Ahrefs. Yeah … 2K ULDs … in a month…

The only way you can compete on this front is to scale link building. That’s what Trulia did better than anyone for years with their set of widgets. You need to leverage any internal or external networks you have and create some kind of non-black-hat incentive for people to link to you.

Or you need to approach a boring subject from a new angle and think outside the box.
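To make the widget approach concrete, here is a sketch (not Trulia’s actual widget) of a generator for an embeddable snippet that shows something useful and carries a plain, visible attribution link back to the source page. The URL, stat, and markup are hypothetical.

```python
# Hedged sketch of the widget idea: generate an embed snippet that shows some
# useful data and carries a clearly visible, non-black-hat link back to the
# source page. The URLs and widget contents below are hypothetical examples.

def build_embed_snippet(city: str, city_url: str, stat: str) -> str:
    """Return HTML a partner site can paste in, including an attribution link."""
    return (
        '<div class="market-widget">'
        f"<p>Median list price in {city}: {stat}</p>"
        f'<p>Data from <a href="{city_url}">{city} real estate</a></p>'
        "</div>"
    )

print(build_embed_snippet(
    "San Mateo",
    "https://www.movoto.com/san-mateo-ca/",
    "$850,000",
))
```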

Welcome to My World

I hope this gave you some perspective on what keeps me up at night. You can tell there is some pretty major overlap between small and huge sites, but huge sites have a whole other set of issues and tasks to master.

Chris Kolmar is the Director of Marketing at Movoto. He leads the blog team in creating fun and engaging content for the real-estate-o-sphere.

12 thoughts on “Huge Site SEO: Optimizing for the Long, Long Tail”

  1. This is awesome Chris, well done. I have a couple of experiments that are 1,000,000+ pages and have worked with a few others. I have to say that my experience mirrors yours on basically all counts, especially when it comes to demanding creativity through machine generation of useful substance. Not easy to do, but when you pull it off, it’s amazing.

  2. Hi Chris

    Interesting article. I have worked with a few e-commerce clients that have issues related to big sites, especially around duplicate content on products and the like.

    I was hoping you could clear this up for me. When you say “We 301ed properties to the homepage until recently. It seems that after a certain number of 301s, Google will give the page that is being redirected the same content as the target URL. Who knew?” can you please explain what exactly you mean by that?

    1. I’ll second that; I didn’t quite understand what you were trying to say and would appreciate any clarification, because it sounds like you were trying to say something interesting. I thought a 301 redirect simply notified search engines that content had permanently moved to a new URL, and would display the page content of that new URL. I wasn’t aware of any cap on redirects, or that you could have too many, especially to the point where Google did anything other than simply process the request and return an ‘OK’ once the new URL was reached.

  3. I’ve found on super huge sites it’s also beneficial to control Googlebot with robots.txt. That way, as you are working on thin content, or branching out into a new section, Googlebot doesn’t necessarily have to see it until you are ready.

  4. Great points. I have been struggling for the last three months with a community college site with 75,000+ pages. You had mentioned removing sitewide external links like Facebook. However, our problem is that we have 80 different training programs that insist on having their own individual Facebook pages or Twitter accounts that they want linked from their degree/program page. If I am understanding you correctly, it looks like I might have a good argument for getting them to remove those links from their program page?

    1. I would think about the user first, or at least your specific situation first. If visitors are using those links and they benefit them, and you subsequently have 80 active FB communities, one for each program, then keep them. And those aren’t sitewide links either; a single link from the main program page to that specific program’s FB page is enough.

      1. Looks like you couldn’t find it.
        I just mentioned a German tool named strucr (.com), which is perfect for analyzing big sites… if you are not interested in good design, because it’s just filled with tables of data.