Very often we have to deal with site (partial) duplicate content issues created by the site CMS.
Blocking the pages via the robots.txt Disallow directive will prevent bots from crawling them, but is it perhaps worth letting bots spider those pages without indexing them? Keeping the pages crawlable may help a lot with discovering more inner pages.
There are a few possible ways to do that (especially now that we don’t have to worry much about PR leakage):
- Adding a “noindex” robots meta tag to all pages except the first / base one;
- Adding the Noindex directive to your robots.txt (which is unofficially supported by Google);
- Using rel=canonical (the mildest of the three: it won’t prevent Google from crawling, but it will at least show Google which page is the preferred version).
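For illustration, here is roughly what the first and third options look like in the `<head>` of a duplicate page (the URL is a placeholder):

```
<!-- Option 1: tell bots not to index this page, but still follow its links -->
<meta name="robots" content="noindex, follow">

<!-- Option 3: point search engines at the base / preferred version of the page -->
<link rel="canonical" href="https://www.example.com/category/">
```

Option 2 would be a line like `Noindex: /category/page-2/` under the relevant `User-agent` block in robots.txt; since that directive is unofficial, there is no guarantee Google will keep honoring it.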
So have you ever tried preventing Google from indexing pages without blocking it from spidering them? Did it solve the duplicate content issue?