Machine Learning Spam in Google

January 21, 2020

Roger Montti, SEJ Staff; Owner, Martinibuster.com

Spammers are exploiting Google’s Rich Results algorithms to place content at the top of Google search. They are using machine learning technologies to automatically create video content from web pages and vice versa. The same technology can be used to create text content from podcasts and podcast content from web pages.

This isn’t theory. It’s currently live on Google. Can Google keep up with machine learning spam?

New form of spam. Scrapers use Text to Speech software on your content then copy your featured image for a video, then upload it to YouTube to spam Google Organic with their auto-generated videos that rank for your topic. @JohnMu @dannysullivan

— Roger Montti (@martinibuster) December 27, 2019

Text to Video Spam

I first noticed the text to video spam when I searched on a news headline.

I won’t name the YouTube channel or the news sites whose content is being “re-purposed.” The point is to describe a spam technique that is currently live on Google. Who the spammers are is beside the point.

This is how it works:
Google ranks newsworthy videos at the top of its search results. When a news topic is trending, Google promotes videos about that topic to the top of the results.

The spammers are able to exploit this loophole in Google’s algorithm because the trending topics algorithm apparently does not check whether the audio content is a duplicate of text content.
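To illustrate what a duplicate check of this kind could look like in principle, here is a minimal Python sketch. It is not Google’s method, which is not public. It assumes a text transcript of the video’s audio is already available, and the function names and the five-word window size are hypothetical choices made for illustration. The sketch measures how much of a transcript’s wording also appears in an article.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in the text."""
    # Lowercase and keep only word-like tokens so punctuation does not matter.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_ratio(article, transcript, size=5):
    """Fraction of the transcript's shingles that also occur in the article."""
    article_set = shingles(article, size)
    transcript_set = shingles(transcript, size)
    if not transcript_set:
        return 0.0
    return len(article_set & transcript_set) / len(transcript_set)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from any real article or video.
    article = ("The city council voted on Tuesday to approve the new "
               "transit plan after months of debate")
    transcript = ("Presented by example channel. The city council voted on "
                  "Tuesday to approve the new transit plan after months of debate")
    print(f"Overlap: {overlap_ratio(article, transcript):.2f}")
```

A transcript that is mostly an article read aloud would score close to 1.0, while a transcript on an unrelated subject would score close to 0.0. A publisher could run the same kind of comparison against their own articles.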

How Is Text to Video Content Created?

There are many ways to create fake video content. One way is to download textual news content from an RSS feed, then run it through a text to audio converter. Now the spammers have an audio file.

Google Cloud Can Help Spammers Spam Google

Google Cloud has a machine learning product that will convert up to four million characters of text to speech for free.

After that, Google charges from $4 to $16 per million characters converted from text to voice.
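To put those prices in perspective, here is a rough Python sketch of the arithmetic. It uses only the figures quoted above (four million free characters, then $4 to $16 per million). It assumes the free allowance is used up before any charges apply, and the 5,000-character article length is a hypothetical figure chosen for illustration.

```python
# Figures quoted in this article for Google Cloud's text to speech pricing.
FREE_CHARACTERS = 4_000_000
LOW_RATE_PER_MILLION = 4.00    # dollars per million characters, low end
HIGH_RATE_PER_MILLION = 16.00  # dollars per million characters, high end


def tts_cost_range(total_characters):
    """Return (low, high) cost in dollars, assuming the free allowance is applied first."""
    billable = max(0, total_characters - FREE_CHARACTERS)
    millions = billable / 1_000_000
    return (millions * LOW_RATE_PER_MILLION, millions * HIGH_RATE_PER_MILLION)


def articles_in_free_allowance(characters_per_article):
    """How many whole articles of a given length fit inside the free allowance."""
    return FREE_CHARACTERS // characters_per_article


if __name__ == "__main__":
    article_length = 5_000  # hypothetical article length in characters
    print(articles_in_free_allowance(article_length), "articles fit in the free allowance")
    low, high = tts_cost_range(FREE_CHARACTERS + article_length)
    print(f"One more article costs roughly ${low:.2f} to ${high:.2f}")
```

Under those assumptions, about 800 articles of that length fit inside the free allowance, and each additional article costs roughly two to eight cents to convert.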

For the reverse direction, transcribing audio content to text, non-Google services charge anywhere from $10/hour at the low end to $1/minute at the high end.

The next step is to create an image to display in the video while the audio plays beneath it. What the spammers are doing is downloading the featured image from the news article and using that as the display image in the YouTube videos.

The YouTube video is then altered with an introduction splash screen that says “Presented By” and then the static image shows while an electronic voice speaks the words from the article.

Podcast to Text Spam

Another spam technique is to download audio and run it through audio to text software. There are many ways to do this, including Google’s own Gboard app for Android, which is free.

Gboard is a free Android keyboard app that features a transcription function. All you have to do is open a text or note app, then tap the microphone while the podcast is playing. Instant free content! Full instructions are on Google’s Gboard Help page.

Gboard isn’t the only app that can convert audio to text. There are online services that charge anywhere from $10/hour to $1/minute to transcribe audio to text. And of course, there are many free apps that use Google’s machine learning technology.

Does Google Know About this Spam?

Yes, Google knows about the text to video spam technique. I tweeted about it on December 27, 2019 (without revealing the spammers’ identity), and Gary Illyes of Google responded that he had sent the report to Google’s spam team.

Sent it over to the webspam team, thanks

— Gary "鯨理／경리" Illyes (@methode) December 28, 2019

Not only does Google know about it, but variations of these techniques are already being openly discussed in Facebook SEO groups, with specific apps and software to use in order to create spam content. Information about machine learning spam is already out in the wild.

Google knows about it.

Spammers know about it.

You should know about it too.

Publishers need to be aware of the possibility that someone may be using their audio, video or text content without their permission.