Google has published a set of guidelines for its quality raters to follow when evaluating voice search results.
A similar set of guidelines exists for rating the results in Google Search, but this marks the first time guidelines have been put in place for rating results returned by Google Assistant.
More specifically, this document deals with results returned by an eyes-free voice assistant such as Google Home. It does not refer to results delivered on a device with a screen, such as the Google Assistant smartphone app.
Therefore, it’s the quality of spoken results that is being reviewed. Results are evaluated with ‘needs met’ and ‘speech quality’ ratings.
Needs Met Rating
Spoken search results are evaluated based on the following ‘needs met’ scale:
- Fully meets
- Highly meets
- Moderately meets
- Slightly meets
- Fails to meet
If a spoken response fully meets a user’s query it will receive a rating of “fully meets.” Ratings go down based on how much additional information would be needed to fully satisfy the query.
For example, if a user asks for the weekend forecast and the device responds with the current temperature, then the user’s needs would be rated as moderately to slightly met. The user received partial information, but would have to conduct another search to get all of the information they’re looking for.
Of course, if the query is not answered at all, then it would receive a failing grade.
Speech Quality Rating
In addition to the accuracy of the response, answers are rated on the following elements of speech quality:
- Length: Was the length of the response appropriate considering its complexity? Should it have been more concise or more detailed?
- Formulation: Was the response grammatically correct? Did it sound like something a native-speaking human would say?
- Elocution: Were the pronunciation, intonation, and speed of the spoken response appropriate?
All three of these elements are rated individually for each response, which produces an overall rating for speech quality.
Here is an example of what a quality rater might see when evaluating a spoken result. In this screenshot, the quality rater is evaluating two responses side-by-side.