Duplicate content and Buffy the Vampire Slayer. What do these have in common? They shed some light on the bizarre psyche of Google developers, and both were at the heart of the Duplicate Content session at SMX Advanced.
Duplicate content in 60 seconds:
- Determine whether your site is experiencing intentional or accidental duplicate content or both.
- If intentional, block abusive IPs, detect user agents, block specific crawlers, add copyright information to the content, request the duplicate site remove the content or take legal action.
- If accidental, control URLs through .htaccess, server-side 301 redirects, parameter or variable reduction, 404 pages and consistent linking strategies. Also, don’t duplicate pages in the secure and non-secure areas of your site.
- If you still experience a problem, communicate with the search engines. They are proactively working on a solution, but need examples and suggestions to better handle duplicate content.
After the You & A with Matt Cutts, Danny Sullivan moderated the organic session on duplicate content with the major search engines representin’ – the lovely Vanessa Fox (Product Manager from Google), Amit Kumar (Senior Engineering Manager from Yahoo! Search), Peter Linsey (Senior Product Manager for Search at Ask.com) and Eytan Seidman (Lead Program Manager of Live Search from Microsoft).
So, let’s dive in with some of the basics: What is duplicate content?
Intentional duplicate content = Content that is intentionally duplicated on either your or another website.
Accidental duplicate content = Content that is seen by the search engines as duplicate, but happens through passive or accidental methods.
Why is duplicate content an issue?
It fragments rank, anchor text and other information about the page you want to appear. It also impairs the user experience and consumes resources.
How can you combat duplicate content?
It’s difficult for the search engines to decipher the canonical page of your site, so the best way to avoid accidental duplication is by controlling your content! You can do this in a variety of ways including:
- Be consistent with your linking strategy both on-site and off (Jessica Bowman had an excellent article on this, “Should URLs in Links Use Index.html?”)
- Reduce session parameters and variable tracking
- Always deliver unique content even if the location isn’t unique
- Use server-side (301) redirects rather than client-side
- HTTP vs HTTPS – don’t duplicate the HTTP pages in a secure area
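Several of the points above (consistent linking, fewer session parameters, one scheme per page) come down to mapping every variant of a URL onto a single preferred form and answering the variants with a 301 issued by the server. Here is a minimal Python sketch of that idea. The host `www.example.com`, the `https` default and the parameter names in `TRACKING_PARAMS` are assumptions for illustration only, not anything the panel specified; substitute whatever your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names: replace with the session/tracking parameters your site emits.
TRACKING_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}
# Default documents that usually duplicate the bare directory URL.
DEFAULT_DOCUMENTS = ("index.html", "index.htm", "index.php")


def canonical_url(url, host="www.example.com", scheme="https"):
    """Return the single preferred form of `url` (assumed host and scheme)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # /folder/index.html and /folder/ should be one page, not two.
    for doc in DEFAULT_DOCUMENTS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
            break
    # Drop session/tracking parameters and sort the rest so that
    # ?a=1&b=2 and ?b=2&a=1 collapse to the same URL.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    return urlunsplit((scheme, host, path, urlencode(kept), ""))


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if `url` is already canonical."""
    target = canonical_url(url)
    return None if target == url else target
```

With these assumptions, `canonical_url("http://example.com/index.html?sid=abc&b=2&a=1")` yields `https://www.example.com/?a=1&b=2`. The same function can be used when generating internal links, so the site never links to a variant it would later have to redirect.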
As for intentional duplicate content, the options are limited but include:
- Simply asking visitors not to steal content
- Contact those that do steal your hard-earned content and ask that they remove it
- Embed copyright or a creative commons notification in your content
- Verify user-agents
- Block unknown IP addresses from crawling the site
- Block specific crawlers
- If that doesn’t work, get the lawyers involved and go for blood
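Verifying user-agents deserves a quick illustration, because a User-Agent header is trivial to fake. One common approach is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned host name. The Python sketch below assumes that approach; the `CRAWLER_DOMAINS` table is illustrative and should be checked against each search engine's own documentation before you rely on it.

```python
import socket

# Illustrative table: crawler token in the User-Agent -> host suffixes it should resolve to.
# Confirm the suffixes for each engine before relying on them.
CRAWLER_DOMAINS = {"googlebot": (".googlebot.com", ".google.com")}


def _reverse_dns(ip):
    # gethostbyaddr returns (hostname, aliases, addresses).
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host):
    # gethostbyname_ex returns (hostname, aliases, addresses).
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, user_agent,
                        reverse_lookup=_reverse_dns, forward_lookup=_forward_dns):
    """True only if the IP really belongs to the crawler the User-Agent claims to be."""
    ua = user_agent.lower()
    suffixes = next((s for name, s in CRAWLER_DOMAINS.items() if name in ua), None)
    if suffixes is None:
        return False  # the request does not claim to be a crawler we know
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(suffixes):
            return False  # claims to be a crawler, but resolves elsewhere
        # The forward lookup must lead back to the same IP, otherwise
        # the reverse record could have been forged.
        return ip in forward_lookup(host)
    except OSError:
        return False  # lookup failed: treat as unverified
```

The lookup functions are passed in as parameters so the check can be exercised without network access, and so results can be cached in production rather than resolved on every request.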
A final note for both intentional and accidental duplicate content:
If you have located the source of a problem and made every attempt to rectify the situation, but it is still not resolved, contact the search engines. File a reinclusion request explaining what happened, when, how you tried to fix the problem and where you find yourself today.
And now for some search engine specific advice:
- Consider whether duplicate content is adding value to your site
- If you’re the duplicator, be sure to give attribution
- Consider blocking local copies of pages with robots.txt
- There’s no such thing as a site-wide penalty
- Session parameter analysis occurs at crawl time
- Duplicates are also filtered when the site is crawled
- Technology exists to find near-duplicates; it ignores most mark-up and focuses on just the key concepts
- Duplicate content is not penalized.
- Templates are not considered for duplication, only the indexable content.
- Filter for high confidence, low tolerance on false positives.
- Filters duplicates at crawl-time
- Less likely to extract links from duplicate pages
- Less likely to crawl new documents with duplicate pages
- Index-time filtering
- Less representation of duplicates when choosing crawled pages to put in index
- Legitimate forms of duplication include: newspapers, multiple languages, HTML/Word/PDF documents, partial duplication from boilerplates (navigation and common site elements)
- Not found error pages should return a 404 HTTP status code when crawled (this isn’t abusive, but makes crawling difficult)
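The last point is easy to get wrong: many sites show a friendly "page not found" message but send it with a 200 status, so every bad URL looks like one more copy of the same page. Here is a minimal sketch of the correct behaviour as a Python WSGI application. The `PAGES` dictionary is a hypothetical stand-in for a real CMS or filesystem, and `status_for` is a helper added only to show the result.

```python
# Hypothetical page store standing in for a real CMS or filesystem.
PAGES = {"/": "<h1>Home</h1>", "/about": "<h1>About</h1>"}


def app(environ, start_response):
    """WSGI app: known paths get 200, everything else gets a real 404."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        # Still show a helpful page to visitors, but tell crawlers
        # via the status line that nothing lives at this URL.
        status = "404 Not Found"
        body = "<h1>Page not found</h1>"
    else:
        status = "200 OK"
    payload = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]


def status_for(path):
    """Call the app directly and return the status line it sends."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    app({"PATH_INFO": path}, start_response)
    return seen["status"]
```

Calling `status_for("/about")` returns `200 OK`, while `status_for("/no-such-page")` returns `404 Not Found` even though the visitor still receives a readable error page.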
Vanessa threw a curve ball and decided not to duplicate presentations! Instead she requested feedback from the audience, but not before alienating anyone over the age of 30 with Buffy the Vampire Slayer metaphors.
And now it’s time for SEO to meet SMM.