So last month, I posted an article here entitled “You think YOU have a duplicate content problem?” where I described a duplicate content nightmare of epic proportions. Essentially, I found a site with hundreds of replications of every page on the site, and outlined steps I was going to need to task to the developers to fix it.
Well since that time, I did some more digging, in order to provide the full tasking plan for implementation. And I found that the problem was MUCH worse than my original assessment had suggested. Stupid worse. And it looks like it’s going to be a major battle to get it all resolved…
Back when I first found the problem, I had estimated there should only be about 15,000 pages, yet discovered that Google was displaying 86,000 pages. And from there, I found that Google was indexing a million pages, but its internal automated effort to only show the “right set” ended up not doing such a great job.
Which meant, at the time, that those 15,000 pages weren’t getting all the value they really deserved.
All of that info I’d initially uncovered and figured out was purely based on mid-level audit work. However when it came time for me to actually write up the findings in a clear, concise and plain English document to be provided to the site’s developers, I needed to go further. I needed to really examine, and show examples of links where the problem existed.
And I needed to map out the fix – detailing SEO and Information Architecture best practices. Because when you’re telling developers that “the hundreds of hours you put into building this site caused massive problems that you’ll now need to fix”, you better be thorough. And prepared for push-back from someone who emotionally or psychologically may not be willing to admit or acknowledge their role in the problem.
If you recall, it’s a real estate site. With several offices in several counties. Over 400 agents spread throughout. Well it turns out there are actually over 25,000 homes for sale in their system. And the site offers drill-down navigation – county to city, to neighborhood, and finally to individual properties.
If you add all those county, city, and neighborhood pages, plus the hundreds of agent pages, it turns out there’s just over 26,000 actual pages.
Oh – Look! Several Duplication Problems
Okay so in my last article, I explained how if you go to an agent’s bio page, and then click from there to any other page on the site, all the URLs get appended with that agent’s ID. That was where the majority of duplication comes from. Thousands of property pages, each one replicated with every agent’s unique ID appended to the URL.
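To make that concrete, here is a minimal Python sketch of how you might group crawled URLs by their "real" page once the agent ID is stripped out. It assumes the agent ID rides along as a query string parameter, and the name `agentid` is purely hypothetical, since I'm not showing the site's actual URLs here. Treat it as an audit helper, not the fix itself.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter name; the real site's agent parameter is not shown in this article.
AGENT_PARAM = "agentid"


def canonical_url(url: str) -> str:
    """Return the URL with the agent parameter removed; other parameters keep their order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != AGENT_PARAM
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def group_duplicates(urls):
    """Map each canonical URL to the list of crawled variants that collapse into it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return groups
```

Feed it a list of indexed URLs and any canonical URL with more than one variant is a page being replicated by agent ID.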
But Wait! There’s More!
During my final write-up, I scanned the duplicate pages being indexed at Google. And guess what? It turns out that every county has two different URLs you can use to get to that county page. Apparently the code allows for multiple URLs – one was the way the site was originally architected. Another is how it was modified to work after they went live.
Except nobody realized you have to implement 301 Redirects when you do that.
Think that’s bad? Well a similar problem exists with every city in every county. And yes, every neighborhood within every city.
That’s over 130 pages that all have two different URLs you can use (and many of both versions are indexed at Google, thank you very much).
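For anyone wondering what "implement 301 Redirects" looks like in practice, here is a rough Python sketch of the idea: a lookup table from each old-style URL to its one surviving URL, resolved in a single hop. Every path in it is made up for illustration (I'm not publishing the real ones), and a real site would more likely do this in the web server or framework routing layer.

```python
# Hypothetical old-style -> new-style paths; the real URL patterns are not shown in this article.
REDIRECTS = {
    "/county.php?id=1": "/homes/example-county/",
    "/city.php?id=12": "/homes/example-county/example-city/",
    # An even older path that points at another old path (a chain to be collapsed).
    "/counties/1.html": "/county.php?id=1",
}


def resolve(path: str):
    """Return (status, location) for a requested path.

    Old paths get one permanent redirect straight to their final target,
    so visitors and crawlers never have to follow a chain of redirects.
    """
    seen = set()
    target = path
    while target in REDIRECTS:
        if target in seen:
            raise ValueError(f"redirect loop at {target}")
        seen.add(target)
        target = REDIRECTS[target]
    if target == path:
        return 200, path  # already the surviving URL, serve it normally
    return 301, target
```

The point of the design is that each page ends up with exactly one URL that answers with a 200, and every other way in answers with a 301 to it.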
But Wait! There’s More!
And guess what? When I went into Google Webmaster Tools to see if there was anything it could reveal about this problem? Yeah, I discovered 100,000 404 Errors listed, all from the month of May!
Now I normally don’t worry too much about 404 errors. Most big sites are bound to have some. Sure, it’s best practice to address them all as they crop up. Yet in most cases, if there’s only a handful, it’s a low priority task.
Except when it’s THIS massive.
Especially when most of them have LINKS POINTING TO THEM.
How To End Up with 100,000 404 Errors. In Under A Month.
It turns out there were three primary causes for all these 404 errors. First, when this site was first redesigned and rebuilt last year, there were initially URL structure issues. Recall how I mentioned earlier that this was how the duplicate content problem came about in the county/town/neighborhood system?
Well in the county/town system, those 1st version URLs still work.
They don’t, however, work in other sections of the site. Those sections got a new URL structure, and their 1st generation URLs now go to a dead end. 404. Not found.
Except that’s only applicable for a handful of these.
Then there’s the fact that the old site – the one that existed before this rebuild, had some sort of site-level email linking scheme. Don’t ask me how, or why. All I know is there are hundreds of URLs that somehow got into the Google index at some point, where the URLs point to an email folder on that site. And within that email folder, there were all sorts of links pointing to property pages. Bizarre. To say the least.
The really massive 404 count however, comes from the old property URL structure on that now defunct site. For whatever reason, when the new site was built, nobody thought – “Hey – we’re scrapping this old site. So maybe we should 301 redirect all those property pages”.
And even before this site was rebuilt, nobody ever thought back on the old site “Hey – as properties get sold, maybe we should have an automated 301 set up for every one of those”.
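An "automated 301" for property pages could be as simple as the sketch below. This is my own illustration in Python, not anything from the actual site: the `Listing` record, its fields, and the example paths are all hypothetical. The assumption is that an old property URL should go to the new property page while the home is for sale, and to its neighborhood page once it's sold.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Listing:
    """Hypothetical record for one property carried over from the old site."""
    current_path: str        # the property's URL on the new site
    neighborhood_path: str   # where to send visitors once the home is sold
    sold: bool = False


def redirect_for_legacy(
    legacy_id: str, catalog: Dict[str, Listing]
) -> Tuple[int, Optional[str]]:
    """Decide the response for a request to an old-site property URL."""
    listing = catalog.get(legacy_id)
    if listing is None:
        return 404, None  # genuinely unknown property, nothing to redirect to
    if listing.sold:
        return 301, listing.neighborhood_path  # keep link value in the neighborhood
    return 301, listing.current_path
```

With something like this in place, a sold property stops being a dead end, and whatever links point at it keep passing value to a live page.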
Clean-Up on Aisle 3
So, as you can venture to guess, the problems on this particular site are way more chaotic, entangled and painful than I originally thought when I first wrote this up last month.
Essentially, the entire site’s URL structure needs to be cleaned up. Which is awesome for me. Because in my tasking document, I not only communicated that all those duplicate county / city / neighborhood pages need to be eliminated / 301’d. I went further. And said “throw out BOTH versions”. And replace them with THIS syntax.
That’s right – I went for it – truly polished, User friendly AND SEO friendly URL structure.
Because I’m a nice guy.
Not So Fast, Mister!
It turned out that the head of development for this particular site was very cooperative. Quite willing, without push-back, to revamp the entire county/town/neighborhood system. With my preferred URLs. That was just awesome to hear.
Until I learned that all was not so joyous.
As it turns out, the really BIG duplicate content problem? Where they need to strip out the Agent IDs from the URLs, and replace that with a browser cookie system?
Yeah – not so much. The answer was a resounding, emphatic, “Not Possible.”
Oh No You Didn’t!
Okay so I’m not a world class web engineer. I don’t code complex sites in my sleep. I have, however, in the past, coded entire complex shopping cart systems, with multi-layered discounting, five variations of feature options, multiple-shipping method and pricing options, secure membership features, and much more. From scratch.
And so I know a thing or three about cookies.
Except, unfortunately, I didn’t create THIS site. So I wasn’t aware, until this bombshell discussion, that those Agent URLs get embedded in special email messages that go out to people who sign up for property alerts.
And they get syndicated out to national real estate sites.
Yeah, welcome to my little world.
So for now, all the other tasking I asked for is going to be worked on. At some point in the next who knows whenever.
But that agentID thing in the URL? They’re going to have to get back to us on that. Because I said – think about how you can resolve this. Because right now, it’s killing the site. And “not possible” is, well, not acceptable.
And just to cover the bases, I’m chewing on how this can be resolved. In case they come back with a “We really thought about it and we just can’t do it”. I’ve already come up with what I think is a solution.
However I need to chew on it and get together with a developer friend, a guy who happens to be just this side of rocket scientist.
And then, if I ever DO get this all worked out, I’ll write another follow-up article. Because it’s good to cleanse the soul like this, yet it’s also good karma to share the love in the form of “here’s how we did it – so you don’t have to go through the pain we did…”.