The biggest obstacle in my search for SEO nirvana has been the lack of an industry blog focused on huge site SEO. When I hear people talk about dealing with 1,000 pages as if it is a lot, I cry a little on the inside. They don’t know how good they have it.
What is huge site SEO? In my eyes, you need at least 100,000 pages to be considered a huge site. My site, Movoto.com, comes up with more than 15 million indexed pages.
For the most part, small and big ass sites overlap in their SEO strategies. Title tags, links, social shares, and XML sitemaps all play a role; huge sites just do it all on steroids. But there are certain issues that only big sites face.
Here are some of the solutions I have seen to the specific challenges that a huge site encounters. The focus becomes long, long tail rankings.
Optimize for the Long, Long Tail
On a site with more than 15 million pages, head terms can naturally start at five words long (think “Los Angeles Homes for Sale”). The long tail is where you live. The long, long tail is what you optimize for.
One shining example is keyword repetition. I know, you’re saying this isn’t 2005, or whatever you old timers say was a long time ago. But when you have 15 million pages, page number 13,200,010 has negative PageRank (well, not really, but close). How do you make it rank? You nail the point where keyword repetition maximizes the ranking potential.
My favorite example is Trulia, one of the largest real estate sites on the Web and a shining beacon of SEO strategy. Trulia deliberately repeats the property’s address (their main keyword) eight times throughout the H1 and H2s.
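If you want to check how often your own templates repeat the main keyword in headings, a few lines of Python will do it. This is only a minimal sketch using the standard library: the `HeadingCollector` class, the `count_keyword_in_headings` function, and the sample address are all made up for illustration, and none of it reflects how Trulia builds or audits its pages.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text found inside <h1> and <h2> elements."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        # Only text that sits inside an open H1/H2 is recorded.
        if self._in_heading:
            self.headings[-1] += data


def count_keyword_in_headings(html, keyword):
    """Count case-insensitive occurrences of keyword across all H1/H2 text."""
    collector = HeadingCollector()
    collector.feed(html)
    needle = keyword.lower()
    return sum(heading.lower().count(needle) for heading in collector.headings)


# Hypothetical listing page; the address is invented for this example.
sample_html = """
<h1>123 Main St, Los Angeles, CA</h1>
<h2>About 123 Main St</h2>
<p>123 Main St shows up here too, but body text is not counted.</p>
<h2>Homes near 123 Main St</h2>
"""
heading_mentions = count_keyword_in_headings(sample_html, "123 Main St")
print(heading_mentions)  # 3
```

Running the same count across a sample of templates makes it easy to see which page types fall short of the repetition you are aiming for.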
Another tactic that has seen its credibility hit hard is computer-generated content. But who said computer-generated content doesn’t have value? If you can add it to a page and improve the user experience in the process, then it is just fine. Trulia uses it when there is no agent description. It adds enough to the page that even Google Instant finds value in it.
Control the Link Juice Tsunami
There are two ways to maintain control over link juice flow: link to everything or maintain a deliberate link strategy. While domain authority plays an important role in which strategy to use, in my opinion, every site should maintain a deliberate link strategy.
Sites like Trulia and Zappos have so much domain authority they can’t possibly spread it thin enough. Have you seen Trulia’s footer? Try telling them that limiting a page to 100 links is the proper way to do SEO. Or take a look at the Zappos ocean (50+ links on every page).
We mere mortals need to proactively, and thoughtfully, determine where to aim our limited link juice. Since we don’t have the domain authority to link to every page from every page, we have to place priority on particular categories with the aim of optimizing revenue. The tactics include, but (probably) aren’t limited to:
- Deep Homepage links. You control what products or categories you want spiders, and users, to value the most
- Deliberate non-linking. A good example of this practice is how we only link to the off-market homes category page from one spot in each city. With our linking practices, we are telling spiders to place significantly less value on off-market homes relative to active homes.
- Pick “favorite” nearby cities, or similar products, where “favorite” really means revenue generating.
- Know your links. Don’t have any extraneous internal links.
- Remove sitewide external links (Facebook and Twitter, I’m looking at you).
- Footer links
Being smart about directing the flow of link juice significantly affects traffic when you are dealing with a large number of pages.
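“Know your links” starts with counting them. The sketch below pulls every link out of a page and splits the list into internal and external, so you can spot extraneous internal links and sitewide external ones. It is a rough illustration, not a crawler: `LinkCollector`, `audit_links`, and the example.com URLs are hypothetical, and anything without a matching hostname (including `mailto:` links) lands in the external bucket.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def audit_links(html, page_url):
    """Split a page's links into internal and external, by hostname."""
    collector = LinkCollector()
    collector.feed(html)
    site_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        host = urlparse(absolute).netloc.lower()
        (internal if host == site_host else external).append(absolute)
    return {"internal": internal, "external": external}


# Hypothetical city page with two internal links and one sitewide social link.
sample_html = """
<a href="/los-angeles-ca/123-main-st/">123 Main St</a>
<a href="https://www.example.com/santa-monica-ca/">Santa Monica</a>
<a href="https://twitter.com/example">Follow us</a>
"""
report = audit_links(sample_html, "https://www.example.com/los-angeles-ca/")
print(len(report["internal"]), len(report["external"]))  # 2 1
```

Run over a sample of templates, a report like this shows quickly which page types carry links you never meant to include.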
Beware the Dangers of Duplicate and Thin Content
Fifteen-plus million pages make dealing with WordPress duplicate content issues seem like chump change. It’s hard enough to get a long, long tail page indexed and ranking; you don’t want to make your life harder by splitting the juice between two identical pages.
There have been a ton of awesome tutorials on removing duplicate content, so I won’t bore you with the technicalities of how to remove it, but it is extremely important to deal with it preemptively. The rel="canonical" link element is your best friend in this regard. Adding it keeps any would-be crazy links from creating duplicates at massive scale.
Here is a personal example where rel canonical prevents a ton of duplicate content on our site. Patrick.net, a real estate forum, adds “?source=Patrick.net” to some outbound links. That has “double trouble” written all over it, as each of those links would otherwise create a duplicate of the target page.
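To make the Patrick.net case concrete, here is a minimal sketch of how a page template could compute its own canonical URL by dropping query parameters that don’t change the content. The `IGNORED_PARAMS` set, both function names, and the example.com URL are assumptions for illustration; this is not our production code, and your list of ignorable parameters will differ.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that never change what the page shows.
IGNORED_PARAMS = {"source"}


def canonical_url(url):
    """Return the URL without ignorable query parameters or a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the page's <head>."""
    return '<link rel="canonical" href="{}">'.format(escape(canonical_url(url), quote=True))


# Hypothetical inbound link carrying the forum's tracking parameter.
inbound = "https://www.example.com/los-angeles-ca/123-main-st/?source=Patrick.net"
print(canonical_link_tag(inbound))
# <link rel="canonical" href="https://www.example.com/los-angeles-ca/123-main-st/">
```

Parameters that do change the content, such as a page number, are kept, so only the tracking noise collapses onto the clean URL.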
Dealing with thin content is a moral dilemma for me: It can still be useful, even with limited information. In real estate, when a property goes off-market we are severely restricted in the amount of information we are allowed to show without a registration. (From what I’ve read, ecommerce sites face a similar issue with discontinued products.) We go from a page with the richest amount of content we offer, to one with essentially no content. People still search for these properties, so we don’t want to remove them.
We deal with it thusly:
- Certain properties get the 404 as soon as they go off market
- We only maintain a set amount of off-market properties in the html sitemap. This places higher importance on more recently off-market properties.
- After a set period of time, we 404 all properties.
- From a user experience perspective, get people who land on off-market properties to active properties.
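The rules above boil down to a small decision function. The sketch below is one way to express them, with loudly invented details: the 180-day `RETENTION_DAYS`, the `SITEMAP_LIMIT` of two, the `remove_immediately` flag, and the listing records are all placeholders, since the real thresholds and criteria aren’t spelled out here.

```python
from datetime import date

# Assumptions for illustration only; the real numbers are not given in the article.
RETENTION_DAYS = 180  # after this long off market, every listing returns 404
SITEMAP_LIMIT = 2     # how many off-market listings stay in the HTML sitemap


def status_for(listing, today):
    """Return the HTTP status a listing page should serve."""
    off_market = listing["off_market_date"]
    if off_market is None:
        return 200  # still active
    if listing["remove_immediately"]:
        return 404  # hypothetical flag for listings that are dropped right away
    if (today - off_market).days > RETENTION_DAYS:
        return 404  # retention period is over
    return 200


def sitemap_off_market(listings, today):
    """Pick the most recently off-market listings that are still being served."""
    live = [
        item for item in listings
        if item["off_market_date"] is not None and status_for(item, today) == 200
    ]
    live.sort(key=lambda item: item["off_market_date"], reverse=True)
    return [item["url"] for item in live[:SITEMAP_LIMIT]]


today = date(2024, 6, 1)
listings = [
    {"url": "/a/", "off_market_date": None, "remove_immediately": False},
    {"url": "/b/", "off_market_date": date(2024, 5, 20), "remove_immediately": False},
    {"url": "/c/", "off_market_date": date(2024, 4, 1), "remove_immediately": False},
    {"url": "/d/", "off_market_date": date(2023, 1, 1), "remove_immediately": False},
    {"url": "/e/", "off_market_date": date(2024, 5, 30), "remove_immediately": True},
    {"url": "/f/", "off_market_date": date(2024, 3, 1), "remove_immediately": False},
]
statuses = {item["url"]: status_for(item, today) for item in listings}
sitemap_urls = sitemap_off_market(listings, today)
print(sitemap_urls)  # ['/b/', '/c/']
```

Keeping the policy in one function makes it easy to adjust the retention window without touching the page templates or the sitemap generator.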
Related nerd side note: We 301ed properties to the homepage until recently. It seems that after a certain number of 301s, Google will give the page that is being redirected the same content as the target URL. Who knew?
Optimize the Crawl Quota
In order for Google to rank your long tail content, Google needs to be able to find it and index it. When you create 10,000+ new pages per day, there are some things you want to optimize/maximize:
- Load and response times
- Domain authority
Load and Response Times
The more quickly Google can crawl your pages, the more pages Google crawls. This isn’t rocket science here, but the results speak for themselves. Jonathan Colman’s graph says it all.
Find-ability, a concept I just made up, relates directly to the idea of flowing link juice to the most important pages with a flat site architecture. When Google crawls 500,000 pages on an off day, you want to be sure that spider is sucking in the appropriate pages. Getting down to category and product pages quickly is paramount because, again, we are optimizing for the long, long tail.
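One way to put a number on find-ability is click depth: how many links a spider has to follow from the homepage to reach a page. The sketch below runs a breadth-first search over an internal link graph. The graph, the URLs, and the `click_depths` function are hypothetical; on a real site you would build the graph from a crawl or from your own database.

```python
from collections import deque


def click_depths(link_graph, start):
    """Breadth-first search: fewest clicks from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link graph: page -> pages it links to.
link_graph = {
    "/": ["/los-angeles-ca/", "/san-diego-ca/"],
    "/los-angeles-ca/": ["/los-angeles-ca/123-main-st/", "/los-angeles-ca/page-2/"],
    "/los-angeles-ca/page-2/": ["/los-angeles-ca/456-oak-ave/"],
    "/san-diego-ca/": [],
    "/orphan/": ["/"],
}
depths = click_depths(link_graph, "/")
unreachable = set(link_graph) - set(depths)
print(depths["/los-angeles-ca/456-oak-ave/"])  # 3
print(unreachable)  # {'/orphan/'}
```

Pages buried several clicks deep, or not reachable at all, are the ones a flatter architecture or an extra category link would help.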
There are other ways to help lead spiders in the right direction. If you link to the category page on every relevant product page, you increase the chance Google finds the category page and, through it, the product pages it lists. It then becomes very important that Google finds the newest, or most important, products on that category page.
Live example: Zillow does something incredibly interesting in this regard. They order homes by featured listings if you search on their site for Los Angeles. BUT, start a private session and search on Google for “Los Angeles Zillow” and Zillow orders homes on the page by newest on market.
Based on the above logic that more links to a page make it more findable, you can make long tail product pages even more findable by interlinking them. Trulia is a prime example again: they give preference to the newest and most important related products.
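Here is a minimal sketch of the interlinking idea: for a given listing, pick a few related listings from the same city and put the newest first. The `related_links` function, the field names, and the sample data are invented for illustration. It only handles “newest,” since what counts as “most important” depends on your own revenue data, and it says nothing about how Trulia actually selects its related links.

```python
from datetime import date


def related_links(current, listings, limit=3):
    """Return URLs of the newest listings in the same city, excluding the current one."""
    candidates = [
        item for item in listings
        if item["city"] == current["city"] and item["url"] != current["url"]
    ]
    candidates.sort(key=lambda item: item["listed"], reverse=True)  # newest first
    return [item["url"] for item in candidates[:limit]]


# Hypothetical listings; URLs and dates are made up.
listings = [
    {"url": "/la/123-main-st/", "city": "Los Angeles", "listed": date(2024, 5, 1)},
    {"url": "/la/456-oak-ave/", "city": "Los Angeles", "listed": date(2024, 5, 20)},
    {"url": "/la/789-pine-rd/", "city": "Los Angeles", "listed": date(2024, 4, 10)},
    {"url": "/la/12-elm-ct/", "city": "Los Angeles", "listed": date(2024, 5, 28)},
    {"url": "/sd/5-bay-dr/", "city": "San Diego", "listed": date(2024, 5, 30)},
]
links = related_links(listings[0], listings, limit=2)
print(links)  # ['/la/12-elm-ct/', '/la/456-oak-ave/']
```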
The last, and probably best, way to make sure your long, long tail pages rank is to increase your domain authority. “Increase your domain authority” is just a fancy way of saying get links, a lot of links. Trulia obtained over 2K ULDs (unique linking domains) in January alone, according to ahrefs.com. Yeah … 2K ULDs … in a month…
The only way you can compete on this front is to scale link building. That’s what Trulia did better than anyone for years with their set of widgets. You need to leverage any internal or external networks you have and create some kind of non-black-hat incentive for people to link to you.
Or you need to approach a boring subject from a new angle and think outside the box.
Welcome to My World
I hope this gave you some perspective on what keeps me up at night. You can tell there is some pretty major overlap between small and huge sites, but huge sites have a whole other set of issues and tasks to master.