Duplicate content. We all know about it. Countless posts have been written on why it’s bad, how to avoid it. But maybe you’ve got a duplicate content problem and don’t even know it’s there. Or your duplicate content problem is bigger than you realize. So big, it’s epic.
That’s what I discovered recently when auditing a client site. We’re not talking about content replicated across multiple sites. Not scraper sites, or ripoff sites. One site. The original and only source. And it was by forensic tactics that I uncovered exactly how big the problem was. How epic. Orders of magnitude epic.
In this situation, we’re talking about a real estate site. Covering a wide swath of California – offices spread throughout northern and southern California. Billions of dollars in home sales in 2010.
Site: – A Key Metric
Whenever I perform an SEO audit, I run a site: check on Google as one of my first tasks, and ask the client how many pages they really have. This is just to get a feel for how well the site’s currently indexed. This site showed 86,000 pages indexed on my initial check. Except there are really only about 15,000 pages. Wow. Really? Oh boy…
Now, it’s not uncommon to run a site: check and get fewer pages showing than actually exist. The public display of pages found is only an approximation, and subject to how well a site is indexed, Google’s algorithm at any given moment, as well as fluctuations in the results due to competitive factors.
But this is the opposite indexing problem. More than five times as many pages showing as actually exist. So I went back and began to examine the site, my senses on full alert.
1999 Called & Wants Its Programming Methods Back
What set off the next bell in my “that’s not right” process was the agent pages. They’ve got over 400 of them. No, it’s not odd that a large real estate site has hundreds of agent pages. It’s that once you land on any of those pages, the next time you click on any page in the main navigation, the agent’s ID is stuck on the URL. And the home page link no longer goes to the main site home page, but instead goes back to that agent’s home page.
It’s a common programming method: passing identifiers along in the URL string. So I knew right away to check for canonical URL tags, to see whether those agent-appended URLs were being served to Google as authentic “unique” pages, or whether the site’s coded to say “don’t index this version”.
No Canonical Tags. Anywhere.
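For anyone who wants to script that check rather than view source on page after page, here is a minimal sketch in Python. It assumes you already have a page's HTML as a string; the helper names (`CanonicalFinder`, `find_canonicals`) are made up for illustration and are not something from the client's site or my audit.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, so split before matching
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html):
    """Return a list of canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals
```

An empty list back from `find_canonicals` means the page declares no canonical URL at all, which is the situation described here.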
Okay quick math time – 15,000 pages times 400 agents. That’s six million pages that could potentially be indexed. Except I was only seeing just over one percent of that. Still way too many for reality. Yet not the “OMG” disaster it could have been. Or was it?
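The quick math, spelled out as a small Python sketch. The three inputs are the rounded figures from this audit ("over 400" agents is taken as a flat 400), so treat the result as a ballpark, not an exact count.

```python
real_pages = 15_000        # approximate real page count from the client
agents = 400               # "over 400" agent pages, rounded down
indexed_estimate = 86_000  # what the site: check reported

# Every real page can exist once per agent ID appended to its URL.
potential_urls = real_pages * agents
share_indexed = indexed_estimate / potential_urls

print(f"{potential_urls:,} potential URLs")              # 6,000,000 potential URLs
print(f"{share_indexed:.1%} showing in the site: check")  # 1.4% showing in the site: check
```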
Forensic SEO Tactics
Here’s where I really got curious – do I really need to go through all of those results to try and figure out what the heck is happening? Nope – not me. No way. No how. Instead, I let my brain chew on the problem.
And thought – let’s search Google first, just to see if any of these agent appended URLs are actually showing up. Sure enough, every one I manually tried was there.
From there, I performed an advanced site: check. In these particular URLs there’s a series of letters used as the variable identifier – so everything after XYZ in the URL string is the agent’s unique ID. So my search then looked like this: Site:www.Domain.com +XYZ
And guess what I found? Not 70,000 pages (the “overage” from the real count to the “pages found” count). What I found was
509,000 pages found
Great. Just great.
So what the heck is going on?
More tests. This time, I ran it with a different chunk of code in those agent URLs. And what did I get?
1.2 million pages found
Wow. This was a complete mess. And my first thought was – how could such completely insane variations exist?
Google – “We Do The Best We Can”
What turned out to be the problem was multi-layered. At any given time, Googlebot attempts to crawl the site. At a certain point, it’s just going to get tired of exploring a site, and run away, on to the next shiny object out there. Especially when those agent pages are several layers down in the link chain. Which means all the pages linked from there are also “technically” (but not really) even further down in the link chain.
And then even if some of those pages end up in the index, at some point, Google’s going to see “Hey this content is exactly the same as all this other content.”
And even though claims have been made (Thanks Matt!) that “Google does a pretty good job of figuring things out”, this is a great example of why that’s an imperfect system. Essentially, along the way of processing all this data, the system’s going to choke. And in this particular case, may even barf a little.
But overall, considering the fact that over a million “pages” are actually in their index, they’re able to pare it down by more than an order of magnitude, down to that 86,000 (still ridiculously over-counted) page range.
Good Enough Isn’t Good Enough
So Google’s system, without further guidance, is only able to pare it down to 86,000 pages. That still leaves roughly 70,000 of those pages as duplicates. Which means there’s a BIG problem still.
How does Google know which version is the most important? Most of the results in the first dozen pages of results for various searches ARE the primary site version, without the agent appendage. But not all. And for some phrases, it’s all agent pages that show up first.
Which in turn means that the pages that matter the most are NOT being given their full value. On a massive scale.
The Fix Ain’t So Easy
So, you’re saying to yourself – just slap that canonical tag in there. Problem solved.
Well sure, that’s important. Except that’s only good for the future experience. The site’s been like this forever. Would YOU want to be the one who ensures the 301 Redirects are implemented properly for that mess? Well, if you’re a REGEX genius, maybe you would. Me, not so much.
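For a feel of what the redirect mapping involves, here is a minimal sketch in Python. It swaps URL parsing in for regex, and it leans on a big assumption: that the agent ID travels as a query parameter, here given the hypothetical name `xyz`. The real identifier and URL layout on the client site aren't shown in this post, so the names below are placeholders only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; the site's real identifier is not given here.
AGENT_PARAM = "xyz"


def redirect_target(url):
    """Return the clean URL an agent-appended URL should 301 to.

    Returns None when the URL carries no agent parameter, so clean URLs
    are never redirected to themselves.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key.lower() != AGENT_PARAM]
    if len(kept) == len(pairs):
        return None  # nothing to strip
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(redirect_target("http://www.example.com/listings?city=fresno&xyz=1234"))
```

Other query parameters are kept on purpose, so a filtered listing page redirects to the same filtered page minus the agent ID. If the ID actually lives in the path rather than the query string, this approach would need rethinking.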
Then there’s the need (yes, it’s a NEED) to get the entire site recoded to STOP passing the agent ID along in URL strings. I don’t care how much Google says all you need is canonical tags, because not every search engine or link provider (intentionally or otherwise) is on board with that.
And even to Google, it’s only “an indicator”. It’s not a guarantee.
Which means a coding nightmare for some poor code monkey.
And more QA to ensure it’s all really done properly. Across the ENTIRE site.
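A rough idea of what part of that QA pass could look like in Python. It assumes a crawl has already produced records of (URL, status code, Location header), and it again assumes the agent ID shows up as a hypothetical `xyz` query parameter. The function name and record shape are inventions for this sketch, not the output of any particular crawler.

```python
import re

# Hypothetical marker: assumes the agent ID is passed as a query parameter named "xyz".
AGENT_MARKER = re.compile(r"[?&]xyz=", re.IGNORECASE)


def qa_failures(crawl_records):
    """Return (url, reason) pairs for agent-appended URLs that are not fixed.

    Each record is a (url, status_code, location_header) tuple from a crawl.
    """
    failures = []
    for url, status, location in crawl_records:
        if not AGENT_MARKER.search(url):
            continue  # clean URLs are out of scope for this check
        if status != 301:
            failures.append((url, f"expected 301, got {status}"))
        elif not location or AGENT_MARKER.search(location):
            failures.append((url, "redirect target missing or still carries the agent ID"))
    return failures
```

An empty list means every agent-appended URL in the crawl returned a 301 to a clean address. It says nothing about whether the clean page itself is right, so it covers only one slice of the QA.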
Fortunately I’m not the one who has to code it. But I’m the one who’s got to do the QA on it. Yeah. Thanks. I’ll be over here curled up in a fetal ball. Crying. Uncontrollably. At least until I can rant about the process on Twitter.