Google’s annual Webspam Report covering 2022 highlighted the ways its SpamBrain anti-spam system became more adept at catching multiple forms of spam. While the report is mainly about how much more spam Google caught compared to the year before, the details about how SpamBrain works seemed just as important.
Google SpamBrain Platform
SpamBrain is Google’s name for its machine learning system, which the company describes as a platform from which to launch algorithms that detect multiple forms of unwanted content.
Machine learning is a form of artificial intelligence that uses data to learn to become increasingly proficient at the task it is designed to complete.
Not much is known about SpamBrain other than it’s a machine learning platform and it’s “central” to Google’s initiatives to keep spam from ranking.
Google’s Webspam report notes this about SpamBrain:
“We also improved SpamBrain as a robust and versatile platform, launching multiple solutions to improve our coverage of different abuse types.”
Improvements to SpamBrain
The Webspam report noted that improvements to the system resulted in catching five times more spam sites than the year before.
Additional training resulted in a tenfold increase in SpamBrain’s ability to identify hacked websites.
Link Spam Detection
The report noted that special link spam training resulted in catching fifty times more sites creating link spam compared to the previous link spam update, citing SpamBrain’s ability to learn as key to its success.
“Thanks to SpamBrain’s learning capability, we detected 50 times more link spam sites compared to the previous link spam update.”
An interesting fact about SpamBrain is how it identifies spam at the time of crawling.
If a crawled page is detected as spam, it is immediately blocked, which keeps it out of Google’s search index and avoids wasting resources on unwanted webpages.
Google announced the ability to block spam at crawl time in 2021, noting that indexing is blocked not only when spam is crawled but also when it tries to sneak in through Search Console and sitemaps.
They wrote in 2021:
“…we have systems that can detect spam when we crawl pages or other content. Crawling is when our automatic systems visit content and consider it for inclusion in the index we use to provide search results. Some content detected as spam isn’t added to the index.
These systems also work for content we discover through sitemaps and Search Console.
For example, Search Console has a Request Indexing feature so creators can let us know about new pages that should be added quickly. We observed spammers hacking into vulnerable sites, pretending to be the owners of these sites, verifying themselves in the Search Console and using the tool to ask Google to crawl and index the many spammy pages they created.
Using AI, we were able to pinpoint suspicious verifications and prevented spam URLs from getting into our index this way.”
So it’s fair to say that one of the many functions of SpamBrain is to act like a gatekeeper, blocking spam before it has a chance to make it into Google’s index.
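The gatekeeper idea can be illustrated with a short sketch. Google has not published how SpamBrain scores pages, so everything here is an assumption made for illustration: the `Page` and `Index` types, the `spam_score` keyword heuristic (a toy stand-in for a trained model), the `gate` function and the threshold value are all hypothetical. The only point it shows is the order of operations described above, where a page is scored before it is added to the index and the same check applies whether the URL came from crawling, a sitemap or an indexing request.

```python
from dataclasses import dataclass, field

# Hypothetical spam markers. A real system would use a trained model,
# not a keyword list; this is only a placeholder for the scoring step.
SPAM_MARKERS = ("buy cheap", "casino bonus", "click here now")


@dataclass
class Page:
    url: str
    text: str
    source: str  # how the URL was discovered: "crawl", "sitemap" or "request_indexing"


@dataclass
class Index:
    pages: dict = field(default_factory=dict)

    def add(self, page: Page) -> None:
        self.pages[page.url] = page


def spam_score(page: Page) -> float:
    """Toy stand-in for a learned classifier: fraction of markers found in the text."""
    text = page.text.lower()
    hits = sum(1 for marker in SPAM_MARKERS if marker in text)
    return hits / len(SPAM_MARKERS)


def gate(pages, index, threshold=0.3):
    """Score each page before indexing and return the URLs that were blocked."""
    blocked = []
    for page in pages:
        # The same check runs no matter how the URL was discovered.
        if spam_score(page) >= threshold:
            blocked.append(page.url)
        else:
            index.add(page)
    return blocked


index = Index()
pages = [
    Page("https://example.com/guide", "A guide to growing tomatoes.", "crawl"),
    Page("https://example.com/spam", "Buy cheap pills, click here now!", "request_indexing"),
]
blocked = gate(pages, index)
```

In this sketch the second page is rejected before it ever reaches the index, even though it arrived through an indexing request rather than a regular crawl.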
Scam Protection Is Now Multilingual
Something new for SpamBrain is that its scam identification system is now multilingual, which the report credits with reducing clicks on scam sites by 50% compared to the year before.
What About Spammy Content?
This year’s report focused on catching link spam, identifying hacked sites and improvements in detecting spam at crawl time.
What it didn’t mention was anything to do with identifying spammy content.
Is this because the content side is handled by the Helpful Content Algorithm and not SpamBrain?
Read Google’s Webspam Report:
How we fought spam on Google Search in 2022
Featured image by Shutterstock/Asier Romero