SEO

You Think YOU Have a Duplicate Content Problem?

Duplicate content. We all know about it.  Countless posts have been written on why it’s bad, how to avoid it.  But maybe you’ve got a duplicate content problem and don’t even know it’s there.  Or your duplicate content problem is bigger than you realize.  So big, it’s epic.

That’s what I discovered recently when auditing a client site.  We’re not talking about content replicated across multiple sites. Not scraper sites, or ripoff sites.  One site.  The original and only source.  And it was by forensic tactics that I uncovered exactly how big the problem was.  How epic. Orders of magnitude epic.

In this situation, we’re talking about a real estate site.  Covering a wide swath of California – offices spread throughout northern and southern California.  Billions of  dollars in home sales in 2010.

Site: – A Key Metric

Whenever I perform an SEO audit, I run a site: check on Google as one of my first tasks, and ask the client how many pages they really have.  This is just to get a feel for how well the site's currently indexed.  This site showed 86,000 pages indexed on my initial check.  Except there are really only about 15,000 pages.  Wow. Really?  Oh boy…

Now, it’s not uncommon to run a site: check and get fewer pages showing than actually exist.  The public display of pages found is only an approximation, and subject to how well a site is indexed, Google’s algorithm at any given moment, as well as fluctuations in the results due to competitive factors.

But this was the opposite indexing problem.  More than five times as many pages showing as actually exist.  So I went back and began to examine the site, my senses on full alert.

1999 Called & Wants Its Programming Methods Back

What set off the next bell in my “that’s not right” process was that they’ve got over 400 agent pages – no – it’s not odd that a large real estate site has hundreds of agent pages.  It’s that once you reach any of those pages, every subsequent click in the main navigation keeps that agent’s ID stuck on the URL.  And the home page link no longer goes to the main site home page, but instead goes back to that agent’s home page.

It’s a common programming method – passing identifiers along in the URL string.  Except I know right away to then check for canonical URL tags – to see if those are being picked up by Google as authentic “unique” pages, or if the site’s coded to say “don’t index this version”.

No Canonical Tags.  Anywhere.
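For the record, the missing piece is a single line in each page’s <head>.  A minimal sketch of what the site should have been emitting – the domain and URL paths here are illustrative, not the client’s real ones:

```html
<!-- On the agent-appended variant, e.g. www.domain.com/listings?XYZagentid, -->
<!-- point every duplicate back at the one canonical URL: -->
<link rel="canonical" href="http://www.domain.com/listings" />
```

One line per template, and Google has at least a hint about which version counts.  This site had none of them.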

Okay, quick math time – 15,000 pages × 400 agents.  That’s six million URLs that could potentially be indexed.  Except I was only seeing just over one percent of that.  Still way too many for reality.  Yet not the “OMG” disaster it could have been. Or was it?
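Spelling out that back-of-the-envelope math, using the approximate figures from the audit:

```python
pages = 15_000      # real pages on the site (approximate)
agents = 400        # agent IDs that can get appended to any URL

# Every page can exist once per agent ID appended to it.
potential_urls = pages * agents
print(potential_urls)            # 6,000,000 potential duplicate URLs

indexed = 86_000                 # pages Google showed in the site: check
share = indexed / potential_urls * 100
print(f"{share:.1f}%")           # roughly 1.4% of the theoretical maximum
```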

Forensic SEO Tactics

Here’s where I really got curious – do I really need to go through all of those results to try and figure out what the heck is happening?  Nope – not me.  No way.  No how.  Instead, I let my brain chew on the problem.

And thought – let’s search Google first, just to see if any of these agent-appended URLs are actually showing up.  Sure enough, every one I manually tried was there.

From there, I performed an advanced site: check.  In these particular URLs there’s a series of letters used as the variable identifier – so everything after XYZ in the URL string is the agent’s unique ID.  So my search then looked like this: site:www.domain.com +XYZ

And guess what I found?  Not the roughly 70,000 pages of “overage” between the real count and the “pages found” count.  What I found was

509,000 pages found

Great.  Just great.

So what the heck is going on?

More tests.  This time, I ran it with a different chunk of code in those agent URLs.  And what did I get?

1.2 million pages found

Wow.  This was a complete mess.  And my first thought was – how could such completely insane variations exist?

Google – “We Do The Best We Can”

What turned out to be the problem was multi-layered.  At any given time, the GoogleBot attempts to crawl the site.  At a certain point, it’s just going to get tired of exploring a site, and run away, on to the next shiny object out there.  Especially when those agent pages are several layers down in the link chain.  Which means all the pages linked from there are also “technically” (but not really) even further down in the link chain.

And then even if some of those pages end up in the index, at some point, Google’s going to see “Hey this content is exactly the same as all this other content.”

And even though claims have been made (Thanks Matt!) that “Google does a pretty good job of figuring things out”, this is a great example of why that’s an imperfect system.  Essentially, along the way of processing all this data, the system’s going to choke.  And in this particular case, may even barf a little.

But overall, considering the fact that over a million “pages” are actually in their index, they’re able to pare it down by orders of magnitude, down to that 86,000 (still ridiculously over-counted) page range.

Good Enough Isn’t Good Enough

So Google’s system, without further guidance, is only able to pare it down to 86,000 pages.  That still leaves about 70,000 of those pages as duplicates.  Which means there’s a BIG problem still.

How does Google know which version is the most important?  Most of the results in the first dozen pages of results for various searches ARE the primary site version, without the agent appendage.  But not all.  And for some phrases, it’s all agent pages that show up first.

Which in turn means that the pages that matter the most are NOT being given their full value.  On a massive scale.

The Fix Ain’t So Easy

So, you’re saying to yourself – just slap that canonical tag in there.  Problem solved.

Well sure, that’s important.  Except that’s only good for the future experience.  The site’s been like this forever.  Would YOU want to be the one who ensures the 301 Redirects are implemented properly for that mess?  Well, if you’re a REGEX genius, maybe you would.  Me, not so much.
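For what it’s worth, a site-wide, rules-based redirect doesn’t require hand-writing 400 rules.  Here’s a hedged Apache mod_rewrite sketch – and note the parameter name “agent” is my invention, not the site’s real variable:

```apache
# Hypothetical .htaccess sketch: 301 any URL carrying an agent parameter
# back to the clean URL.  "agent" is illustrative, not the real identifier.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)agent=[^&]+
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The trailing “?” strips the entire query string, which is only safe when nothing else legitimate rides in it – exactly the kind of edge case that makes this mess a QA project, not a one-liner.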

Then there’s the need (yes, it’s a NEED) to get the entire site recoded to STOP USING URL strings.  Because I don’t care how much Google says all you need is canonical tags.  Because not every search engine or link provider (intentionally or otherwise) is on board with that.

And even to Google, it’s only “an indicator”.  It’s not a guarantee.

No, the ONLY proper, BEST PRACTICES tasking here is to strip out all those URL parameters.  Just use cookies instead, for cryin’ out loud.
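To illustrate the cookie approach: instead of threading the agent ID through every URL, the server reads it once from a cookie, and every page keeps a single clean URL.  A minimal sketch using Python’s standard library – the cookie name agent_id is my invention, not the site’s:

```python
from http.cookies import SimpleCookie

def agent_from_request(cookie_header):
    """Pull the agent ID out of a Cookie header instead of the URL string."""
    cookie = SimpleCookie(cookie_header)
    morsel = cookie.get("agent_id")
    return morsel.value if morsel else None

# The URL stays canonical while the agent context rides along invisibly:
print(agent_from_request("agent_id=XYZ123; session=abc"))  # XYZ123
print(agent_from_request("session=abc"))                   # None
```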

Which means a coding nightmare for some poor code monkey.

And more QA to ensure it’s all really done properly.  Across the ENTIRE site.

Fortunately I’m not the one who has to code it.  But I’m the one who’s got to do the QA on it.  Yeah. Thanks.  I’ll be over here curled up in a fetal ball. Crying.  Uncontrollably.  At least until I can rant about the process on Twitter.

Alan Bleiweiss is a Forensic SEO audit consultant with audit client sites consisting of upwards of 50 million pages and tens of millions of visitors a month. A noted industry speaker, author and blogger, his posts are quite often as much controversial as they are thought provoking.


17 thoughts on “You Think YOU Have a Duplicate Content Problem?”

  1. Interesting read! BuildDirect*com had the exact same issue ages ago and we went (you guessed it) the canonical way. It solved the problem to a good extent but it didn’t uproot it, as even still, there are issues with duplicate titles (9,000+ duplicates), messy URLs, and I am sure there is juice dilution. It uses Endeca so there is not much that can be done about it. A switch to Google Commerce 3.0 would solve it (plus the added bonus of instant search!) but that’s a lot of IT work again.

    1. Syed,

      Thanks for commenting – yes – making entire site structure changes can be very daunting – not only is it a lot of IT work, it’s a Quality Assurance issue, and especially a redirect issue – ensuring all the old content is found. It’s something to never take lightly.

  2. That’s quite a detective story. There’s no substitute for digging your way through a site audit to get to the bottom of what went wrong (though there’s no substitute for doing it right to begin with…).

  3. Hmmmm…. interesting problem. I wonder what I’d be asked to translate if requested to localize the site? 15,000 pages or 86,000? Well, perhaps I could translate 15,000 and charge for 86,000… ;)

  4. Alan,
    Has it ever happened to you when your client looked at your recommendations and said, “It’s too much work. Let’s leave the mess as it is.” If it did happen, what did you do? I know, there is not much you can do, but is there anything?

    1. Nobody’s ever said it quite that way. Typically when they go that route, it’s more like they just drop off the radar screen altogether. Sort of like they read it, feel completely overwhelmed, and proceed to bury their heads in the sand.

      This is the primary reason I now prioritize recommendations, and during the initial review discussion, communicate that work can be done in stages, that it’s progress, not perfection. Takes the edge off of things and helps a lot.

  5. You could try telling Google to ‘ignore’ the offending URL parameters in Webmaster Tools and monitor the effects while you work on a more long-term robust solution.

    1. The only way to 100% guarantee that every duplicate is eliminated is to eliminate the duplication. Robots.txt, Canonical tags, Google Webmaster Tool instructions are all subject to both the imperfection and mixed rules of all the search engines. And all it takes is one single third party to create a manual link to one of those unwanted URLs to override all of those other methods because search engines are annoying that way.

  6. As mentioned above, you could handle this in Google Webmaster Tools and just tell G not to index pages with those parameters. That shouldn’t be temporary…that should be permanent but would take a while to reach the index. You could then update the Robots.txt file to no index / no follow every page with the same URL parameter. Permanent fix. Finally, you could use a rules-based (regular expression based) rewrite / redirect app (like ISAPI_Rewrite) to just rewrite and/or redirect the offending URLs based on the same URL parameter. None of these would be manual…all site-wide rules-based code fixes that are fairly easy to implement…and they would permanently tell G Bot to not index the offending URLs. Doing all 3 might be overkill…but given the magnitude of your problem it sounds like that’s what you need.

    We had a similar problem years ago. Agent pages for the largest real estate site in Texas + logical URLs that had been used for SEO and physical files names for every page that were uploaded to PPC (and subsequently indexed) – two copies of every page!

    1. Nick,

      The multi-pronged approach, which you describe as “overkill” is the ultimate plan. I happen to be dealing with an offshore development company that in the past has required micro-managed hand-holding, repeat requests for even the most simple changes, and constant QA. That’s both why I want to cry AND why the “overkill” approach is vital here.

  7. I enjoyed reading this article. It would have been hilarious if the problem weren’t so serious. OMG, all that coding!