
Google Announces A New Era For Voice Search

Google announced a major update to voice search that uses AI to make it faster and more accurate, calling it a new era.

Google announced an update to its voice search, which changes how voice search queries are processed and then ranked. The new AI model uses speech as input for the search and ranking process, completely bypassing the stage where voice is converted to text.

The old system, called Cascade ASR, first converts a voice query into text using automatic speech recognition (ASR) and then puts that text through the normal ranking process. The problem with that method is that it’s prone to mistakes. The audio-to-text conversion can lose contextual cues, and any transcription error it introduces is passed along to the search stage.

The new system is called Speech-to-Retrieval (S2R). It’s a neural network-based machine-learning model trained on large datasets of paired audio queries and documents. This training enables it to process spoken search queries (without converting them into text) and match them directly to relevant documents.

Dual-Encoder Model: Two Neural Networks

The system uses two neural networks:

  1. One of the neural networks, called the audio encoder, converts spoken queries into a vector-space representation of their meaning.
  2. The second network, the document encoder, represents written information in the same kind of vector format.

The two encoders learn to map spoken queries and text documents into a shared semantic space, so that an audio query and the text documents related to it end up close together according to their semantic similarity.
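To make "close together" concrete, here is a minimal sketch of comparing vectors in a shared space. It is an illustration only: the three-dimensional vectors, the document names, and the choice of cosine similarity are all assumptions, since the announcement does not publish Google's encoders or its similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in for the audio encoder's output for "the scream painting".
query_vec = [0.9, 0.1, 0.2]

# Hypothetical stand-ins for the document encoder's output for two pages.
doc_vecs = {
    "munch_the_scream": [0.8, 0.2, 0.1],
    "screen_painting_tips": [0.1, 0.9, 0.3],
}

# Score every document against the query and pick the closest one.
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup the page about the painting scores far higher than the unrelated page, which is the behavior the shared space is meant to produce.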

Audio Encoder

Speech-to-Retrieval (S2R) takes the audio of someone’s voice query and transforms it into a vector (numbers) that represents the semantic meaning of what the person is asking for.

The announcement uses the example of the famous painting The Scream by Edvard Munch. In this example, the spoken phrase “the scream painting” becomes a point in the vector space near information about Edvard Munch’s The Scream (such as the museum where it is displayed).

Document Encoder

The document encoder does a similar thing with text documents like web pages, turning them into their own vectors that represent what those documents are about.

During model training, both encoders learn together so that vectors for matching audio queries and documents end up near each other, while unrelated ones are far apart in the vector space.
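One common way to express this kind of training objective is a contrastive loss over a batch of paired examples. The sketch below assumes an in-batch softmax loss with dot-product scores; Google's announcement does not name the exact loss it uses, so treat the function and the toy vectors as hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(query_vecs, doc_vecs):
    """Mean cross-entropy where query i should score document i above every
    other document in the batch. Lower is better."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching document.
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)

# Toy 2-dimensional vectors: document i is the correct match for query i.
docs = [[1.0, 0.0], [0.0, 1.0]]
aligned_queries = [[1.0, 0.0], [0.0, 1.0]]      # each query sits on its own document
misaligned_queries = [[0.0, 1.0], [1.0, 0.0]]   # each query sits on the wrong document

loss_aligned = contrastive_loss(aligned_queries, docs)
loss_misaligned = contrastive_loss(misaligned_queries, docs)
print(loss_aligned, loss_misaligned)
```

The aligned batch produces the lower loss. During training, gradient updates to both encoders would push real query and document vectors toward that aligned arrangement.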

Rich Vector Representation

Google’s announcement says that the encoders transform the audio and text into “rich vector representations.” A rich vector representation is an embedding that encodes meaning and context from the audio and the text. It’s called “rich” because it contains the intent and context.

For S2R, this means the system doesn’t rely on keyword matching; it “understands” conceptually what the user is asking for. So even if someone says “show me Munch’s screaming face painting,” the vector representation of that query will still end up near documents about The Scream.

According to Google’s announcement:

“The key to this model is how it is trained. Using a large dataset of paired audio queries and relevant documents, the system learns to adjust the parameters of both encoders simultaneously.

The training objective ensures that the vector for an audio query is geometrically close to the vectors of its corresponding documents in the representation space. This architecture allows the model to learn something closer to the essential intent required for retrieval directly from the audio, bypassing the fragile intermediate step of transcribing every word, which is the principal weakness of the cascade design.”

Ranking Layer

S2R has a ranking process, just like regular text-based search. When someone speaks a query, the audio is first processed by the pre-trained audio encoder, which converts it into a numerical form (vector) that captures what the person means. That vector is then compared to Google’s index to find pages whose meanings are most similar to the spoken request.

For example, if someone says “the scream painting,” the model turns that phrase into a vector that represents its meaning. The system then looks through its document index and finds pages that have vectors with a close match, such as information about Edvard Munch’s The Scream.

Once those likely matches are identified, a separate ranking stage takes over. This part of the system combines the similarity scores from the first stage with hundreds of other ranking signals for relevance and quality in order to decide which pages should be ranked first.
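The two stages described above can be sketched as a retrieve-then-rerank pipeline. This is a simplified illustration under stated assumptions: the document IDs, the single `quality` signal, and the `sim_weight` blend are hypothetical, standing in for the hundreds of signals the real ranking stage is said to combine.

```python
def retrieve(query_vec, index, k):
    """Stage 1: return the k (doc_id, similarity) pairs closest to the query."""
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vec, vec)))
        for doc_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank(candidates, quality, sim_weight=0.7):
    """Stage 2: blend similarity with one hypothetical quality signal and re-sort."""
    blended = [
        (doc_id, sim_weight * sim + (1 - sim_weight) * quality[doc_id])
        for doc_id, sim in candidates
    ]
    blended.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in blended]

# Hypothetical document vectors and quality scores.
index = {
    "museum_page": [0.9, 0.1],
    "fan_blog": [0.95, 0.05],
    "screen_repair": [0.1, 0.9],
}
quality = {"museum_page": 0.9, "fan_blog": 0.3, "screen_repair": 0.8}

query_vec = [1.0, 0.0]  # stand-in for the encoded spoken query
candidates = retrieve(query_vec, index, k=2)
results = rerank(candidates, quality)
print(candidates, results)
```

Here the fan blog is the closest vector match, but the museum page ranks first once the quality signal is blended in, which shows why similarity alone does not decide the final order.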

Benchmarking

Google tested the new system against Cascade ASR and against an idealized version of it called Cascade Groundtruth, which simulates perfect transcription by sending the correct text of each query to the same retrieval system. S2R beat Cascade ASR and very nearly matched Cascade Groundtruth. Google concluded that the performance is promising but that there is room for additional improvement.

Voice Search Is Live

Although the benchmarking revealed some room for improvement, Google announced that the new system is live and in use in multiple languages, presumably including English, and called it a new era in search.

Google explains:

“Voice Search is now powered by our new Speech-to-Retrieval engine, which gets answers straight from your spoken query without having to convert it to text first, resulting in a faster, more reliable search for everyone.”

Read more:

Speech-to-Retrieval (S2R): A new approach to voice search

Roger Montti, SEJ Staff. Owner, Martinibuster.com