The Google Cache, Caching Google in Protest
In my referrals the other day, I clicked over to a blog that linked to Search Engine Journal and a story on Google China. The blog entry was about dorks protesting Google. I’m not sure how dorky covering search engine news is, nor have I ever really considered myself much of a dork. However, when it comes to dorky protests, TheGoogleCache takes the cake (tag, now you’re the dork!).
TheGoogleCache is a site protesting the recent court ruling that Google’s cache of copyrighted materials is ‘fair use’:
“After the recent ruling that stated Google’s cache of copyrighted materials was “fair use”, I decided to put this to the test myself. This is The Google Cache. You search Google, your results get cached.”
In essence, TheGoogleCache is caching Google.
“The Google cache is absolutely ridiculous. As an individual who has had quite a bit of experience on both sides of the white hat / black hat search engine industry, the cache is NOT a webmaster’s friend.
1. The cache removes content control away from the author. For example, a site like EzineArticles.com prevents scraping by using an IP blocking method based on the speed at which pages are spidered by that IP. It is absurdly easy to circumvent this by simply spidering the Google cache of that article instead of spidering the site. Google’s IP blocking is far less restrictive, and combined with the powerful search tool, it allows for easy, anonymous contextual scraping of sites whose Terms of Service explicitly refuse it.
2. The cache extends access to removed content, often for months if not years at a time. Google rarely replaces 404 pages (perhaps it is because of their wish to have the largest number of indexed pages). I have clients who have nearly 48,000 nonexistent pages still cached in Google that have not been present in over 14 months. Despite using 404s, 301s, etc. these pages have not yet been removed. Furthermore, Google’s frequent mishandling of robots.txt, nocache, and nofollow leaves webmasters dependent upon search traffic hesitant to force removal of these pages using the supposedly standardized methods of removal.
3. The cache allows Google to serve site content anonymously. Don’t want the owner of a site to know you are looking at their goods (think of companies grepping for competitor IPs), just watch the cache instead.
The list goes on and on. But I think the point is this…
Why should a web author have to be technologically savvy to keep his or her content from being reproduced by a multi-billion dollar US company? Content control used to be as simple as “you write it, it’s yours”. It got a little more complicated with time to the point at which it might be useful to use, perhaps, a Terms of Service. Even a novice could write “No duplication allowed without expressed consent”. Now, a web author must know how to manipulate HTML meta tags and/or a robots.txt file.”
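For anyone wondering what that last complaint involves in practice, here is a rough sketch in Python of the two mechanisms mentioned. It is only an illustration, not anything taken from TheGoogleCache: the robots.txt contents, the `example.com` URLs, and the `can_crawl` helper are all made up for the example. The robots meta value that asks a search engine not to show a cached copy is `noarchive` (the quote above calls it "nocache").

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
"""

# Meta tag an author would place in a page's <head> to ask search engines
# not to offer a cached copy of that page.
NOARCHIVE_META = '<meta name="robots" content="noarchive">'


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "https://example.com/private/page.html"),
        ("Googlebot", "https://example.com/index.html"),
        ("SomeOtherBot", "https://example.com/private/page.html"),
    ]:
        print(agent, url, can_crawl(ROBOTS_TXT, agent, url))
```

Note that both mechanisms are requests that a well-behaved crawler chooses to honor; neither one technically prevents a scraper from fetching the page, which is part of the complaint quoted above.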