Google likes spam. Sounds surprising? Well, actually it does! Of all the spam that is generated, Google seems quite comfortable with 5% of the lot.
Now we all know that spam is bad, right? Wikipedia says, “In the last few years, and due to the advent of the Google AdSense Web advertising program, scraper sites have proliferated at an amazing rate for spamming search engines. Open content sites such as Wikipedia are a common source of material for scraper sites.”
It seems quite clear that scraping content is a type of spamming. Most people believe this, and Wikipedia confirms it. So even while Google wages war against 95% of the spam, why would it like even the remaining 5%?
Reasons Why Google Likes Spam
When you look closely, you’ll find that there are actually two different types of spam:
- The kind of spam that Google is unable to stop
- The kind of spam that Google is able to stop but doesn’t
It’s a well-known fact that Google does not index the entire Web. That leads to some interesting search results. Just take a look at these:
Wait a minute! What do we see here?
These sites are all obviously scraper sites, yet Google has decided to let them be in its search results. Has Google forgotten to block out the spam?
Let’s get in some background information here. Google would like to maintain its position as the No. 1 search engine in the world; so it ruthlessly pursues excellence in search. But Google doesn’t index the entire Web.
One of the reasons is that there are a number of bottom feeder websites that scrape absolute trash, and various free, unmonitored BBS sites that allow garbage on them. All of these are heavily penalized by Google rather than banned outright.
If you take a look at the screenshot above, you’ll notice that the words are all jumbled up—written one after the other without making much sense. Obviously, they have been scraped and placed automatically with Markov scripts or Mad Libs. In spite of that, Google has allowed these sites to come up in its search results.
There’s a very good reason for this:
- Google doesn’t have anything else to show for the particular search term that I have used. Sometimes it’s better to show pages that Google has indexed than nothing at all—a page that might have the answer, but is probably spam.
Had I been a Google engineer, I would have tried to make use of this opportunity. If users searched for a phrase (or a variation of it) many times, Googlebot would index the relevant sites even if they were known to be spam sites, but it would index only the relevant part of each page. Why ban a whole site when you can just ban 99% or more of its content, and keep the parts you have no search results for?
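To make the partial-indexing idea concrete, here is a minimal sketch. It only illustrates my speculation and says nothing about how Google actually works: the function name, the min_results threshold, and the shape of the data are all assumptions made up for the example.

```python
def select_pages_to_keep(spam_site_pages, result_counts, min_results=10):
    """Keep only the pages of a known spam site that cover an under-served phrase.

    spam_site_pages: dict mapping page URL -> set of search phrases the page matches
    result_counts:   dict mapping search phrase -> number of good results already indexed
    min_results:     hypothetical threshold below which a phrase counts as under-served
    """
    kept = {}
    for url, phrases in spam_site_pages.items():
        # A page earns its place only through phrases with too few existing results.
        underserved = {p for p in phrases if result_counts.get(p, 0) < min_results}
        if underserved:
            kept[url] = underserved
    return kept


# Hypothetical data: one page matches a rare phrase, the other only a crowded one.
pages = {
    "spam.example/a": {"buy shoes", "rare widget model x17 manual"},
    "spam.example/b": {"buy shoes"},
}
counts = {"buy shoes": 5_000_000, "rare widget model x17 manual": 2}

kept = select_pages_to_keep(pages, counts)
print(kept)
```

In this toy run only the first page survives, and only for the rare phrase. Everything else on the site stays out of the index.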
We’ve seen that Google allows this bottom feeder spam to exist in its index, and it’s difficult to believe that the idea of partial indexing has not yet occurred to anyone at Google. I’d imagine a lot of webmasters are having an “aha” moment right now, thinking of how to game the system using this method.
[Let me put on my blackhat]
- An automated tool that monitors news and trends, and another tool that measures search volume on those news items and trends, would be pretty useful in picking topics to build websites on.
It looks as if Google has developed a sort of understanding with the webmasters of such spam sites. Google is essentially saying, “If you can create sites that have content we don’t have anywhere else, we’ll index it and send you traffic.”
The problem for the blackhatters is that good sites may quickly follow suit, and the spam sites will then lose their rankings. But an automated blackhat site that gets 1,000 visitors in its lifetime is still a success when the blackhatter creates thousands of .info or .cn sites a day, as long as only a small amount of money was invested in each one.
So when Google has good results to show for a search, it tries to get rid of the spam. However, it looks the other way when it encounters content on search phrases for which it has nothing else to show. Google knows that garbage sites have high bounce rates; it can detect their footprints and see their poor back links. But Google refrains from banning sites completely when they may help to generate results for search terms for which Google has very few results to show.
Google Harnessing Spam Sites to Index Web?
This may be the logic Google is using to partially index certain spam sites and not ban them outright. These sites are essentially performing a filler function: they find something Google didn’t, because Google doesn’t have unlimited resources to index the entire Web.
There is plenty of material to support my argument. I’ve read enough blackhat blogs to know that most blackhatters leave footprints all over the place, host up to a thousand sites on one dedicated server (with nothing but scraper sites on it), and do other careless things, all because it’s so easy to not bother and still make money. Google rarely bans scraper sites; it just penalizes them to the point where only limited amounts of content on a limited number of pages are indexed and get traffic. I believe that in such cases Google indexes only those pages that contain comparatively unique terms or phrases, ones that appear in its own database only to a limited extent.
There’s one step further to go from here. Google should not penalize sites that it suspects are spam but cannot be sure about. If such a site has low bounce rates and its links improve, it can get out of the sandbox. This ensures that legitimate sites don’t get penalized like the blackhat ones. With more pages getting indexed, and as they gain more weight, the white hat webmasters get more traffic.
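The sandbox rule described above could be written down roughly like this. It is a sketch of my own suggestion, not of anything Google has published: the SiteSignals fields, the can_leave_sandbox name, and the 0.5 bounce-rate cutoff are hypothetical and chosen only to make the example run.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    bounce_rate: float         # fraction of visits that leave right away, 0.0 to 1.0
    quality_links_before: int  # trusted inbound links at the start of the review window
    quality_links_now: int     # trusted inbound links at the end of the review window


def can_leave_sandbox(signals, max_bounce_rate=0.5):
    """Release a suspected site only if visitors stay AND its links have improved."""
    links_improved = signals.quality_links_now > signals.quality_links_before
    return signals.bounce_rate <= max_bounce_rate and links_improved


# Hypothetical sites: a legitimate newcomer and a typical scraper.
newcomer = SiteSignals(bounce_rate=0.30, quality_links_before=2, quality_links_now=9)
scraper = SiteSignals(bounce_rate=0.92, quality_links_before=0, quality_links_now=0)

print(can_leave_sandbox(newcomer), can_leave_sandbox(scraper))
```

Requiring both signals at once is the point of the design: links alone can be bought, and a low bounce rate alone says nothing about whether other webmasters vouch for the site.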
As long as Google can stay ahead of the curve and discern the difference between the two, it will continue to win the spam war, which I believe it is winning anyway. There is still the threat of the spam Google doesn’t like and wants to get rid of at all costs. For now, Google needs to eliminate as much of the “bad” spam as it can, and allow as much of the “good” spam as it needs.
More About the Different Types of Spam
As we discussed before, there are two types of Web spam that Google indexes.
- The kind of spam that Google is unable to stop
- The kind of spam that Google is able to stop but doesn’t
The second kind of spam generally comprises long tail keywords—the ones for which Google does not have a sizeable database. As Google is unable to index the entire Web, it must pick and choose what it will index.
Google obtains links from the various websites and then determines whether or not they are important. Since good webmasters often do “bad” things to get links (mass submissions to directories, blog comments, FFAs, guestbook postings), Google can’t completely ignore them. So it has to do a balancing act, because if it ignored them it would not be indexing enough of the Web. On a mass scale, it is very difficult to accurately determine which websites used legitimate methods to build content and links, especially if they are quite new.
The first kind of spam is mostly generated when webmasters create hit and run websites that can be built quickly and easily. These use automated methods to rank for a profitable term, one that will either generate a lot of search engine traffic or bring limited traffic that is worth a lot of money. Manual or automated tools can be used to find these niches, and hordes of sites are built to rank for these profitable terms. Most don’t get past Google’s filters; but some do and they are what I call “bad” spam.
Imagine the money a Canadian pharmacy can make in the following scenario:
- It obtains an aged domain with back links
- Wipes the content
- Throws up any garbage page with the relevant keywords
- Obtains a lot of black hat links for it
- Sneaks to the No. 1 spot for just 72 hours
This is possible because the Google search engine runs on an algorithm. It may be getting better every day, but it can still be fooled. Notice how the term “buy Viagra” keeps getting gamed. As long as blackhatters can figure out parts of the Google algorithm, the system will remain susceptible to gaming.
As you can see, the two types of spam discussed above are very different.
Think of the spam you see when you search for something that Google doesn’t have a good index for. There may be 7 results, and 4 of them may even appear to have the same content. You can tell that these are spam sites just from the title and description in the search result. Google clearly doesn’t have much to show.
In such a case, you may be forced to click on a scraper site, even if you don’t like them. At least you can be content knowing that you won’t help the site owner make an undeserved profit by clicking any ads or following affiliate links on the site. You can simply go in, get what you need, and leave. I’ve often used scraper sites to find what I want. If Google doesn’t have what I’m looking for and a spam site can help me, I’ll click on the link. I’m pretty sure most readers are guilty of this as well.
Why Search Engine Spammers Will Never Go Away
The biggest flaw in Google’s algorithm is the same thing that makes it so great. It relies on a very simple system: if your site doesn’t have any external links pointing to it, it’s pretty much worthless. Every link you get helps you, especially links from sites Google deems important. You can get an idea of a site’s importance from its Google PageRank.
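For readers who have never seen the link-counting idea in code, here is a small sketch of the simplified, textbook form of PageRank. It is not Google’s production ranking, which uses many more signals; the damping value of 0.85 is the commonly cited default, and the four-page graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links: dict mapping each page -> list of pages it links to
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Invented graph: "orphan" links out, but nothing links to it.
graph = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "orphan": ["hub"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The page that nobody links to ends up with only the baseline share, which is exactly the “pretty much worthless” case described above, while the page that everyone links to comes out on top.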
Since even legitimate webmasters employ questionable means to get more back links, Google’s task becomes more difficult. Blackhat webmasters try to profit from the system by stealing content in order to get good search traffic. So why is Google so unable to detect the fraud and get rid of this spam?
It is because of Google’s reliance on the value of links. As long as blackhat webmasters can exploit this system, both long tail keyword spam (which can be useful) and brute force spam (which is pretty much never useful) are going to stick around. It’s very easy to mass produce sites that target obscure long tail keywords, and these get indexed by Google.
As long as people search for those terms and Google doesn’t have much to show in the search results, the practice will continue. You can run automated linking software on these mass produced sites and keep them up in the search engines for days, weeks, months or years. Natural sites often pick up links very rapidly too, so blackhatters can get away with this technique until enough webmasters report every instance.
Now that’s next to impossible! So until technology evolves to a point where it can beat the ingenuity of the human mind, we’ll have to co-exist with a limited amount of spam.