Massive Duplicate Content Problem- More Revealed

So last month, I posted an article here entitled “You think YOU have a duplicate content problem?” where I described a duplicate content nightmare of epic proportions.  Essentially, I found a site with hundreds of replications of every page on the site, and outlined steps I was going to need to task to the developers to fix it.

Well since that time, I did some more digging, in order to provide the full tasking plan for implementation.  And found that the problem was MUCH worse than my original assessment suggested.  Stupid worse. And it looks like it’s going to be a major battle to get it all resolved…

Back when I first found the problem, I had estimated there should only be about 15,000 pages, yet discovered that Google was displaying 86,000 pages.  And from there, I found that Google was indexing a million pages, and that their internal automated effort to show only the “right set” wasn’t doing such a great job.

Which meant, at the time, that those 15,000 pages weren’t getting all the value they really deserved.

More Digging

All of that info I’d initially uncovered and figured out was purely based on mid-level audit work.  However, when it came time for me to actually write up the findings in a clear, concise and plain English document to be provided to the site’s developers, I needed to go further. I needed to really examine, and show examples of, links where the problem existed.

And I needed to map out the fix – detailing SEO and Information Architecture best practices.  Because when you’re telling developers that “the hundreds of hours you put into building this site caused massive problems that you’ll now need to fix”, you’d better be thorough.  And prepared for push-back from someone who emotionally or psychologically may not be willing to admit or acknowledge their role in the problem.

More Details

If you recall, it’s a real estate site.  With several offices in several counties.  Over 400 agents spread throughout. Well it turns out there are actually over 25,000 homes for sale in their system.  And the site offers drill-down navigation – county to city, to neighborhood, and finally to individual properties.

If you add all those county, city, and neighborhood pages, plus the hundreds of agent pages, it turns out there’s just over 26,000 actual pages.

Oh –  Look!  Several Duplication Problems

Okay so in my last article, I explained how if you go to an agent’s bio page, and then click from there to any other page on the site, all the URLs get appended with that agent’s ID.  That was where the majority of duplication comes from.  Thousands of property pages, each one replicated with every agent’s unique ID appended to the URL.
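
To make the mechanics concrete, here is a minimal sketch in Python of how agent-tagged URLs collapse back to a single page. The parameter name `agentid` and the example.com URLs are assumptions made up for illustration; the real site's URL format isn't shown in this article.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical name of the query parameter carrying the agent's ID.
AGENT_PARAM = "agentid"

def strip_agent_id(url: str) -> str:
    """Return the URL with the agent parameter removed, other parameters kept."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() != AGENT_PARAM]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Three crawlable URLs, one actual page (all URLs here are made up).
urls = [
    "https://example.com/property/123?agentid=17",
    "https://example.com/property/123?agentid=402",
    "https://example.com/property/123",
]
unique_pages = {strip_agent_id(u) for u in urls}
print(len(urls), "URLs ->", len(unique_pages), "page")
```

At the scale described here (over 25,000 properties and over 400 agents), the same multiplication gives on the order of 25,000 × 400 = 10 million possible URL variants.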

But Wait! There’s More!

During my final write-up, I scanned the duplicate pages being indexed at Google.  And guess what?  It turns out that every county has two different URLs you can use to get to that county page.  Apparently the code allows for multiple URLs – one was the way the site was originally architected.  Another is how it was modified to work after they went live.

Except nobody realized you have to implement 301 Redirects when you do that.

Think that’s bad?  Well a similar problem exists with every city in every county.  And yes, every neighborhood within every city.

That’s over 130 pages that all have two different URLs you can use (and many of both versions are indexed at Google, thank you very much).
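
For anyone wondering what the missing piece looks like, here is a rough sketch of a 301 lookup for those location pages. Both URL patterns are invented for illustration (the real ones aren't shown here), and the function name `redirect_for` is hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical mapping: first-generation location URL -> the one URL to keep.
LEGACY_TO_CURRENT = {
    "/listings.php?county=12": "/homes/example-county/",
    "/listings.php?county=12&city=3": "/homes/example-county/example-city/",
}

def redirect_for(path_and_query: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) for a legacy URL, or None to serve the page as-is."""
    target = LEGACY_TO_CURRENT.get(path_and_query)
    if target is None:
        return None
    # 301 = moved permanently: the target is signalled as the URL to keep.
    return 301, target

print(redirect_for("/listings.php?county=12"))
print(redirect_for("/homes/example-county/"))
```

With only about 130 location pages involved, a plain lookup table like this is small enough to maintain by hand.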

But Wait! There’s More!

And guess what?  When I went into Google Webmaster Tools to see if there was anything it could reveal about this problem?  Yeah, I discovered 100,000 404 errors listed, all from the month of May!

Now I normally don’t worry too much about 404 errors.  Most big sites are bound to have some.  Sure, it’s best practice to address them all as they crop up.  Yet in most cases, if there’s only a handful, it’s a low priority task.

Except when it’s THIS massive.

Especially when most of them have LINKS POINTING TO THEM.

How To End Up with 100,000 404 Errors.  In Under A Month.

It turns out there were three primary causes for all these 404 errors. First, when this site was first redesigned and rebuilt last year, there were initially URL structure issues.  Recall how I mentioned earlier that this was how the duplicate content problem came about in the county/town/neighborhood system?

Well in the county/town system, those 1st version URLs still work.

They don’t, however, work in other sections of the site.  Those sections got a new URL structure in such a way that their 1st generation URLs now go to a dead end.  404.  Not found.

Except that’s only applicable for a handful of these.

Then there’s the fact that the old site – the one that existed before this rebuild, had some sort of site-level email linking scheme.  Don’t ask me how, or why.  All I know is there are hundreds of URLs that somehow got into the Google index at some point, where the URLs point to an email folder on that site.  And within that email folder, there were all sorts of links pointing to property pages.  Bizarre. To say the least.

The really massive 404 count however, comes from the old property URL structure on that now defunct site.  For whatever reason, when the new site was built, nobody thought – “Hey – we’re scrapping this old site.  So maybe we should 301 redirect all those property pages”.

And even before this site was rebuilt, nobody ever thought back on the old site “Hey – as properties get sold, maybe we should have an automated 301 set up for every one of those”.
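
A sketch of what that kind of automation could look like, in Python. The old and new URL patterns, the `SOLD` lookup, and the neighborhood fallback are all assumptions for illustration; this is not the site's actual code.

```python
import re
from typing import Optional, Tuple

# Hypothetical old-site pattern, e.g. /property.asp?id=98765
OLD_PROPERTY = re.compile(r"^/property\.asp\?id=(?P<listing_id>\d+)$")

# Hypothetical data: sold listing ID -> neighborhood page that replaces it.
SOLD = {"98765": "/homes/example-county/example-city/example-neighborhood/"}

def resolve(path_and_query: str) -> Optional[Tuple[int, str]]:
    """Map an old property URL to a 301 target; None means no rule matched."""
    match = OLD_PROPERTY.match(path_and_query)
    if match is None:
        return None
    listing_id = match.group("listing_id")
    if listing_id in SOLD:
        # Sold: send visitors and inbound links to the closest relevant page.
        return 301, SOLD[listing_id]
    # Still for sale: one pattern rule covers every listing, no per-URL table.
    return 301, f"/property/{listing_id}/"

print(resolve("/property.asp?id=98765"))
print(resolve("/property.asp?id=111"))
```

The point of the pattern rule is that it handles any number of old property URLs with one line of configuration, as long as the listing ID carries over from the old structure to the new one.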

Clean-Up on Aisle 3

So, as you can venture to guess, the problems on this particular site are far more chaotic, entangled, and painful than I originally thought when I first wrote this up last month.

Essentially, the entire site’s URL structure needs to be cleaned up.  Which is awesome for me.  Because in my tasking document, I not only communicated that all those duplicate county / city / neighborhood pages need to be eliminated / 301’d.  I went further.  And said “throw out BOTH versions”.  And replace them with THIS syntax.

That’s right – I went for it – truly polished, User friendly AND SEO friendly URL structure.

Because I’m a nice guy. 🙂

Not So Fast, Mister!

It turned out that the head of development for this particular site was very cooperative.  Quite willing, without push-back, to revamp the entire county/town/neighborhood system. With my preferred URLs.  That was just awesome to hear.

Until I learned that all was not so joyous.

As it turns out, the really BIG duplicate content problem?  Where they need to strip out the Agent IDs from the URLs, and replace that with a browser cookie system?

Yeah – not so much.  The answer was a resounding, emphatic, “Not Possible.”

Oh No You Didn’t!

Okay so I’m not a world class web engineer.  I don’t code complex sites in my sleep. I have, however, in the past, coded entire complex shopping cart systems, with multi-layered discounting, five variations of feature options, multiple-shipping method and pricing options, secure membership features, and much more. From scratch.

And so I know a thing or three about cookies.
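
For readers less familiar with the idea, here is a minimal sketch of the general cookie approach in Python: when a request arrives with an agent ID in the URL, remember the agent in a cookie and redirect to the clean URL. The parameter name `agentid`, the cookie name `agent_id`, and the `handle` function are all hypothetical, and this is a generic illustration rather than anything proposed for or built into this site.

```python
from http.cookies import SimpleCookie
from typing import Dict, Tuple
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

AGENT_PARAM = "agentid"    # hypothetical query parameter name
AGENT_COOKIE = "agent_id"  # hypothetical cookie name

def handle(url: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): move the agent ID from the URL into a cookie."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    agent_ids = [v for k, v in params if k == AGENT_PARAM]
    if not agent_ids:
        return 200, {}  # already a clean URL; serve the page normally
    cookie = SimpleCookie()
    cookie[AGENT_COOKIE] = agent_ids[0]
    cookie[AGENT_COOKIE]["path"] = "/"
    clean_query = urlencode([(k, v) for k, v in params if k != AGENT_PARAM])
    clean_url = urlunsplit(parts._replace(query=clean_query))
    return 301, {
        "Location": clean_url,
        "Set-Cookie": cookie[AGENT_COOKIE].OutputString(),
    }

status, headers = handle("https://example.com/property/123?agentid=17")
print(status, headers["Location"])
```

Under this approach every page has exactly one indexable URL, while a visitor who arrived through an agent's link still carries that agent's ID around the site.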

Except, unfortunately, I didn’t create THIS site.  So I wasn’t aware, until this bombshell discussion, that those Agent URLs get embedded in special email messages that go out to people who sign up for property alerts.

And they get syndicated out to national real estate sites.

Yeah, welcome to my little world.

So for now, all the other tasking I asked for is going to be worked on.  At some point in the next who knows whenever.

But that agentID thing in the URL?  They’re going to have to get back to us on that.  Because I said – think about how you can resolve this.  Because right now, it’s killing the site. And “not possible” is, well, not acceptable.

And just to cover the bases, I’m chewing on how this can be resolved.  In case they come back with a “We really thought about it and we just can’t do it”.  I’ve already come up with what I think is a solution.

However I need to chew on it and get together with a developer friend, a guy who happens to be just this side of rocket scientist.

And then, if I ever DO get this all worked out, I’ll write another follow-up article.  Because it’s good to cleanse the soul like this, yet it’s also good karma to share the love in the form of “here’s how we did it – so you don’t have to go through the pain we did…”.

Alan Bleiweiss
Alan Bleiweiss is a Forensic SEO audit consultant with audit client sites consisting of upwards of 50 million pages and tens of millions of visitors...
  • Michael Roberts

    Part One of this story was an interesting read when I first saw it. I genuinely felt bad for you with all the crazy problems of that site. I was not expecting a Part Two but now that there is one I feel as though I have a vested interest in what the outcome of your troubles will be. I can’t wait to see what will hopefully be a story of epic triumph in the face of adversity. 🙂

    • alanbleiweiss

      Michael, that’s awesome – a vested interest!  Which I suppose means I’ll need to write either another follow-up, or ideally (yet unlikely) a final wrap-up at some point 🙂

  • Anonymous

    Wow, this is a great horror story Alan, hope you are charging this client appropriately. 

    • alanbleiweiss

      Thanks Keepkalm

      Yes, as a matter of fact, I am 🙂

  • Meg

    I’m a glutton for punishment, but I love working on stuff like this.  I even love the arguments with the developers.  Cause in the end, YOU have the essential argument.

    • alanbleiweiss

      It’s a beautiful thing to live through, Meg, isn’t it?  

  • Dawn Wentzell

    Is it weird that I actually like problems like that? Sure, it’s a pain in the ass to (have someone else) fix it, but isn’t it so rewarding once it is fixed?!

    • alanbleiweiss

      Well Dawn, it is.  Except they don’t always get completely fixed.  Client budgets, and all that…  In this situation, I expect it WILL get resolved eventually.  The site owner is pretty significantly vested financially in the existing site’s code, and everyone’s at least willing to “try”.  Just not sure if I’ll still be working in SEO ten years from now when it’s finally working right 🙂

      • Dawn Wentzell

        Wait… “it is” rewarding, or it’s weird that I like it? 😛

      • alanbleiweiss

        oh – uh both?  LOL nah – not weird.  See Meg’s comment here of a similar nature, and my reply to her.  Not weird.

        Or maybe we’re all weird for liking it.  So at least you’d then not be alone in the weirdness 🙂

      • Meg

         I’m a problem solver.  I like diving into stuff like this and sorting it all out.  I suspect Dawn does too.  And I’m pretty sure (despite your rather tepid disclaimers – ork ork) that you do too.

      • alanbleiweiss

        Actually I don’t like solving problems.  Sure, I enjoy coming up with solutions.  But I much prefer other people do the heavy lifting of actually having to implement those solutions.  So like in this situation, I’ve got the solution, someone else has responsibility on how to get the developers to do the work without causing even more problems 🙂

  • Jill Whalen

    Wouldn’t the canonical link element do the trick?

  • Ewan Kennedy

    Brilliant – and entertaining. I’m dealing with a case of rampant duplicate content but on a much smaller scale (20,000 pages indexed, should be about a tenth of that) and using canonical links, meta noindex, 301’s and whatever else to prune it down. Fortunately, I have very co-operative developers whose initial scepticism abated when rankings for the client’s 6 most competitive terms moved from around position 250 to mid-twenties within 6 weeks after minimal on site SEO. Very gratifying so long as the desired result is within a reasonable timescale. Look forward to hearing how you overcome the agentID problem!

  • online coupons

    Great piece of writing you have just shared. Duplicate content is the most hot topic after changing in Google Panda Algorithm and this article is quite interesting in reading.
    Thanks for sharing it.

  • joe ryan

    This post is awesome..i’ve been reading tons of crap posts from other blogs, but shows you have a more educated reader base.
      Business Loan

  • Andy Piper

    I am glad there are people like you to take care of this stuff so I can get on with selling real estate.  If the current structure is hurting their ranking significantly the value of the improved rankings should convince them to move forward.  The value of top ranking for those neighborhoods and cities is worth a lot of $$.

  • Aluminium Kozijnen

    Very informative post.. Through this post we can  manage our content without duplicate.. thanks for this fantastic work sharing with us..

  • Damien Anderson

    So each URL has the agent identifier in it? Damn! As you say, if only they had been informed to consider SEO at build. So canonical and robots.txt exclusions seem to be a route to sanity? Even then you need to agree on a canonical master page – I guess with agent commissions inherently linked to within the URLs that is also going to be a bung fight. 

  • Judith

    Hey, Alan!  Love your story and thanks for sharing!  While I was reading it I found I had a smirk on my face because, after all, it is this kind of stuff that is a challenge and when solved exhilarating, right?  Good thing the inside developer was cooperative with your suggestions — that in and of itself tells me they are open to you helping them to find their way.  Can’t wait for the next chapter…  😉

  • liz strawford

    Awesome posts, good luck coming up with the solution, getting the devs on board and securing budget from the client! Awaiting the next installment eagerly 🙂 It’s a masterclass on lots of levels

  • remote desktop

    Something similar happened to my site. 
    I decided to SEO optimize the pages so I renamed the pages.  then i kept getting 404 pages for the original pages. 
    To fix it, i created the old set of pages and redirected them all to the main home page…
    Ugly, I know, but easiest to do FAST.

  • Karim Javed

    alanbleiweiss You are amazing 🙂 after reading this article

  • Durgesh Chaudhary

    True Duplicate content is really a nightmare and it ruins the original’s life.

  • Nancy

    Hi Alan,

    thank you so much for taking even more time to document this for us. Some great solutions are outlined, as are even greater problems. I’m learning so much from your series on this epic and legendary disaster. Please continue to share this with us, because I can certainly apply the hard lessons learned you are living through.

  • Lyena Solomon

    What would be your solution to 100K 404 pages?  301s? Obviously, regex could be used in the code. But what if they are all unique and need to be re-directed to a unique page?  How is it going to affect server load?

  • Afrin

    I enjoy this.continue it.

  • SEOshite

    Author is incorrect. Different URLs to the same page are not a problem since the canonical directive was in place.