Google’s Truth Algorithm: 5 Facts You Should Know


By now you may have read about Knowledge-Based Trust, a Google research paper that describes a method of scoring web documents according to the accuracy of facts. Knowledge-Based Trust has been referred to as the Truth Algorithm, a way to assign a Trust Score to weed out sites that contain wrong information.

According to the title of an article in New Scientist, “Google wants to rank websites based on facts not links.” The idea is to identify key facts in a web page and score them for their accuracy by assigning a trust score.

The researchers are careful to note in the paper that the algorithm does not penalize sites for a lack of facts. The study also reveals that the method could discover relevant web pages with low PageRank that would otherwise be overlooked by current technology.

In current algorithms, links are a signal of popularity that implies authority in a particular topic. But popularity does not always mean a web page contains accurate information. A good example may be celebrity gossip websites. Getting past simple popularity signals and creating an algorithm that can understand what a website is about is a direction that search technology is moving in today, underpinned by research in artificial intelligence.

Ray Kurzweil, Google’s Director of Engineering, has been tasked with creating an artificial intelligence that can understand content itself without relying on third-party signals like links. Knowledge-Based Trust, a way to determine the accuracy of facts, appears to be a part of this trend of moving away from link signals and towards understanding the content itself.

There’s only one problem: The research paper itself states that there are at least five issues to overcome before Knowledge-Based Trust is ready to be applied to billions of web pages.

Is Knowledge-Based Trust coming soon? Or will we see it integrated into current algorithms?

I asked Dr. Pete Meyers of Moz.com, and his opinion was:

“We tend to see each new ranking factor as replacing the old ones. We jump at everything as if it’s going to uproot links. I think the reality is that more and more factors are corroborating, and the system is becoming more complex.”


I agree with Dr. Meyers. Rather than seeing KBT as a replacement for current algorithms, it may be useful to view it as something that might be implemented as a corroborating factor. An important consideration about KBT is that it demonstrates Google is researching technologies that focus on understanding content, rather than relying on second-hand signals like links. Links measure popularity, but links only indirectly reflect relevance and accuracy, sometimes erroneously.
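To make the contrast concrete, here is a minimal sketch, in Python, of what scoring a page by the accuracy of its facts could look like. It is only an illustration of the general idea, not the paper’s probabilistic model: the reference knowledge base, the triple format, and the scoring function are all assumptions made for the sake of the example.

```python
def knowledge_based_trust_score(extracted_triples, reference_kb):
    """Illustrative stand-in for a fact-accuracy score (NOT the paper's
    actual model): the fraction of a page's extracted
    (subject, predicate, object) triples that agree with a reference
    knowledge base mapping (subject, predicate) -> object."""
    if not extracted_triples:
        return 0.0
    correct = sum(
        1 for subj, pred, obj in extracted_triples
        if reference_kb.get((subj, pred)) == obj
    )
    return correct / len(extracted_triples)

# Hypothetical reference facts and triples extracted from one web page.
reference_kb = {("Barack Obama", "place of birth"): "Honolulu"}
page_triples = [
    ("Barack Obama", "place of birth", "Honolulu"),  # agrees with the KB
    ("Barack Obama", "place of birth", "Chicago"),   # conflicts with the KB
]
print(knowledge_based_trust_score(page_triples, reference_kb))  # -> 0.5
```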

This research demonstrates that an accuracy score is possible and it proves that this approach can discover useful web pages with low PageRank scores. But the question remains, is Knowledge-Based Trust coming soon? The self-assessment written in the conclusion of the paper notes several achievements, but it also states five issues that need to be overcome. Let’s review these issues and you can make up your own mind.

Issue #1: Irrelevant Noise

The algorithm identifies facts by examining three components, which the paper calls “Knowledge Triples”: a subject, a predicate, and an object. A subject is a “real-world entity” such as a person, place, or thing. A predicate describes an attribute of that entity. According to the research paper, an object is “an entity, a string, a numerical value, or a date.”

Those three components together form a fact, known in the research paper as a Knowledge Triple and often referred to simply as a Triple. An example of a triple is: Barack Obama (subject) was born in (predicate) Honolulu (object). The problem with this method is that extracting triples from websites also produces irrelevant triples, triples that diverge from the topic of the web page. The research study concludes:

“To avoid evaluating KBT on topic irrelevant triples, we need to identify the main topics of a website, and filter triples whose entity or predicate is not relevant to these topics.”

The paper does not describe how difficult it would be to weed out irrelevant triples. So the difficulty and time frame for addressing this issue remain open to speculation.
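To make the triple format and the filtering idea concrete, here is a minimal sketch in Python. The Triple class, the topic list, and the keyword-matching relevance test are illustrative assumptions, not the filtering method the researchers have in mind.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A knowledge triple: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str

def filter_by_topic(triples, site_topics):
    """Keep only triples whose subject or predicate matches one of the
    site's main topics (a crude stand-in for the relevance test the
    paper says is still needed)."""
    relevant = []
    for t in triples:
        text = f"{t.subject} {t.predicate}".lower()
        if any(topic.lower() in text for topic in site_topics):
            relevant.append(t)
    return relevant

# Example: a page about U.S. presidents that also mentions an unrelated ad.
triples = [
    Triple("Barack Obama", "place of birth", "Honolulu"),
    Triple("Acme Widgets", "headquartered in", "Springfield"),
]
print(filter_by_topic(triples, site_topics=["Obama", "president"]))
# -> [Triple(subject='Barack Obama', predicate='place of birth', obj='Honolulu')]
```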

Issue #2: Trivial Facts

KBT does not adequately filter out trivial facts so that they are set aside and not used as a scoring signal. The research paper uses the example of a Bollywood site that states on nearly every page that a movie is filmed in the Hindi language. That is identified as a trivial fact that should not be used for scoring trustworthiness. This weakness lowers the accuracy of the KBT score because a web page can earn an unnaturally high trust score based on trivial facts.

As with the first issue of noise, the researchers describe possible solutions to the problem but are silent as to how difficult those solutions may be to create. The important point is that this second issue must be solved before KBT can be applied to the Internet, pushing the date of implementation back even further.
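One plausible way to set trivial facts aside, sketched below, is to flag predicate/object pairs that appear on nearly every page of a site (like the Hindi-language example), on the assumption that a near-constant fact carries little information. This heuristic is my own illustration, not a method proposed in the paper.

```python
from collections import Counter

def flag_trivial_facts(pages, min_share=0.9):
    """Flag (predicate, object) pairs that appear on at least `min_share`
    of a site's pages. `pages` is a list of pages, each a list of
    (subject, predicate, object) tuples. Near-constant facts, like
    'language: Hindi' on a Bollywood site, add little to a trust score."""
    pair_counts = Counter()
    for triples in pages:
        pair_counts.update({(pred, obj) for _, pred, obj in triples})
    n_pages = len(pages)
    return {pair for pair, count in pair_counts.items()
            if count / n_pages >= min_share}

# Example: three movie pages on the same site.
pages = [
    [("Movie A", "language", "Hindi"), ("Movie A", "director", "X")],
    [("Movie B", "language", "Hindi"), ("Movie B", "director", "Y")],
    [("Movie C", "language", "Hindi")],
]
print(flag_trivial_facts(pages))  # -> {('language', 'Hindi')}
```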

Issue #3: Extraction Technology Needs Improvement

KBT is unable to extract data in a meaningful way from websites outside of a controlled environment without being inundated with noise. The technology referred to here is called an extractor: a system that identifies triples within a web page and assigns confidence scores to those triples. This section of the document does not explicitly state what the problem with the extractors is; it only cites “limited extraction capabilities.” In order to apply KBT to the web, extractors need to be able to identify triples with a high degree of accuracy. This is an important part of the algorithm that will need to be improved if KBT is ever going to see the light of day. Here is what the research document says:

“Our extractors (and most state-of-the-art extractors) still have limited extraction capabilities and this limits our ability to estimate KBT for all websites.”

This is a significant hurdle. The limitations of current extractor technology add a third issue that must be solved before Knowledge-Based Trust can be applied to the World Wide Web.
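For a mental model of what an extractor does, here is a toy sketch that recognizes a single sentence pattern and attaches a confidence score. Real extractors, including those described in the paper, use far more sophisticated techniques; the pattern, the fixed confidence value, and the function name here are purely illustrative.

```python
import re

def extract_triples(text):
    """A toy extractor: scan text for a single 'X was born in Y' pattern
    and return (subject, predicate, object, confidence) tuples.
    Real extractors handle many relations and learn confidence scores
    from linguistic and statistical evidence; this is only an illustration."""
    pattern = re.compile(r"([A-Z][\w ]+?) was born in ([A-Z][\w ]+)")
    results = []
    for match in pattern.finditer(text):
        subject, obj = match.group(1).strip(), match.group(2).strip()
        # A fixed confidence stands in for a learned score.
        results.append((subject, "place of birth", obj, 0.8))
    return results

print(extract_triples("Barack Obama was born in Honolulu."))
# -> [('Barack Obama', 'place of birth', 'Honolulu', 0.8)]
```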

Issue #4: Duplicate Content

The KBT algorithm cannot sort out sites containing facts copied from other sites. If KBT cannot detect duplicate content, then it may be possible to spam KBT by copying facts from “trusted” sources such as Wikipedia, Freebase, and other knowledge bases. Here is what the researchers state:

“Scaling up copy detection techniques… has been attempted…, but more work is required before these methods can be applied to analyzing extracted data from billions of web sources….”

The researchers tried to apply scaled copy detection as part of the Knowledge-Based Trust algorithm, but it’s simply not ready. This is a fourth issue that will delay the deployment of KBT to Google’s search results pages.
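To illustrate why copy detection matters, here is a naive sketch that flags a suspiciously high overlap between the triples extracted from two sources. The Jaccard-overlap heuristic and the threshold are assumptions for illustration only; they are not the scaled copy-detection techniques the researchers refer to, which also have to determine who copied whom.

```python
def triple_overlap(source_a, source_b):
    """Jaccard overlap between two sources' extracted triples.
    Each source is a set of (subject, predicate, object) tuples."""
    if not source_a or not source_b:
        return 0.0
    return len(source_a & source_b) / len(source_a | source_b)

def looks_copied(source, reference, threshold=0.8):
    """Flag a source whose facts almost entirely overlap a reference
    (e.g. a site that republishes Wikipedia-derived facts). A naive
    heuristic; real copy detection must also decide the direction of
    copying and scale to billions of sources."""
    return triple_overlap(source, reference) >= threshold

site = {("Barack Obama", "place of birth", "Honolulu"),
        ("Barack Obama", "spouse", "Michelle Obama")}
wiki = {("Barack Obama", "place of birth", "Honolulu"),
        ("Barack Obama", "spouse", "Michelle Obama"),
        ("Barack Obama", "party", "Democratic")}
print(looks_copied(site, wiki, threshold=0.6))  # -> True
```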

Issue #5: Accuracy

In section 5.4.1 of the document, the researchers examined one hundred random sites with low PageRank but high Knowledge-Based Trust scores. The purpose of this examination was to determine how well Knowledge-Based Trust performs, compared with PageRank, at identifying high-quality sites, particularly low-PageRank sites that would otherwise have been overlooked.

Among the one hundred random high-trust sites picked for review, 15 of the sites (15%) were errors. Two sites were topically irrelevant, twelve scored high because of trivial triples, and one website had both kinds of errors (topically irrelevant and a high number of trivial triples). This means that in a random sample of high-trust sites with low PageRank, KBT’s false positive rate was on the order of 15%.
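To spell out the arithmetic behind that figure, here is a brief worked check using the error counts from the paper’s sample as described above:

```python
sample_size = 100
topically_irrelevant = 2   # high score, but off-topic triples
trivial_triples = 12       # high score driven by trivial facts
both_error_types = 1       # off-topic and trivial

false_positives = topically_irrelevant + trivial_triples + both_error_types
false_positive_rate = false_positives / sample_size
print(false_positive_rate)  # 0.15 -> roughly 15% of the sampled sites
```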

Research papers whose methods eventually make it into a production algorithm usually demonstrate a vast improvement over previous efforts. That is not the case with Knowledge-Based Trust. While a Truth Algorithm makes for an alarming headline, the truth is there are five important issues that need to be solved before it makes it to an algorithm near you.

What Experts Think About Knowledge-Based Trust

I asked Bill Slawski, of GoFishDigital.com about Knowledge-Based Trust and he said:

“The Knowledge-Based Trust approach is one that seems to focus upon attempting to verify the correctness of content that might be used for direct answers, knowledge panel results and other ‘answers’ to questions by using approaches uncovered during the author’s research while working upon Google’s Knowledge Vault. It doesn’t appear to be attempting to replace either link-based analysis such as PageRank or Information Retrieval scores for pages returned at Google in response to a query.”

Dr. Pete Meyers shares a similar outlook on the future of Knowledge-Based Trust:

“This will be very important for 2nd-generation answer boxes (scraped from the index), because Google has to have some way to grow the Knowledge Graph organically and still keep the data as reliable as possible. I think KBT will be critical to the growth of the Knowledge Graph, and that may start to cross over into organic rankings to some degree. This is going to be a fairly long process, though.”

Is a Truth Algorithm Coming Soon?

Knowledge-Based Trust is an exciting new approach. There are several opinions on where it will be applied, with Dr. Pete observing that it may play a role in growing the Knowledge Graph organically. But on the question of whether Knowledge-Based Trust is coming to a search results page soon, we know there are issues that need resolving. Less clear is how long it will take to resolve those issues.

Now that you have more facts, what is your opinion: is a truth algorithm coming soon?

 

Image credit: Shutterstock.com. Used under license.

Roger Montti
Roger Montti is an independent web publisher of popular websites, a consultant, and a Moderator of the Link Building Forum at WebmasterWorld.com since 2004.
  • Great article. I do think it’s coming and may just be incorporated right into the algorithms. But the Knowledge Graph and new Info Box Google is polishing seems to indicate a truth factor will come into maturity at some point.

    • Glad you enjoyed the article. 🙂 It will certainly be interesting to see what gets incorporated and where, into what product.

  • Hi Roger,
    Thanks for your great article and the new information here. I think the truth algorithm will come, and when KBT arrives it will be helpful to us, because it will help verify the correctness of content that might be used for direct answers, just like Bill Slawski said.
    Can’t wait to see this KBT. Thanks again, Roger.

    • Thank you. The scientific research paper can be difficult to understand so the challenge was to present it in plain English without losing the important facts. Thanks for your kind feedback!

  • Sometimes I wonder why Google keeps trying to introduce various algorithms, but after seeing what they are meant to do, it feels all right. I hope that the truth algo will help people in getting real results and not just some spam.

    • Exactly! I agree. That’s the goal but whether it works or not is still to be seen. Will be interesting to see if and when and how it is implemented.

  • It is interesting that the use of triples is also a key element of the semantic web. If more sites embraced semantic markup, I have to think that KBT would be better and more quickly enabled. A simple example: “macaroni and cheese” is a meal (the subject) of which “cheddar cheese” (an object) is an ingredient (predicate). I used a recipe as the example because our industry is much further along in using semantic markup with that content type, since the recipe rich snippet presentation in search results is an incentive to do so. Other content types such as biographies or white papers don’t offer any such incentive, which is a limitation for a faster KBT roll-out. Google could overcome that hurdle by incentivizing publishers to mark up more of their content.

    • That’s an interesting point. Personally, I believe that semantic markup may have been used partially for quality control for their research (to confirm the accuracy of their proposed algorithms), in addition to what we know about its current role in enhancing user experience, communicating what is on the page should they click from the SERP to the page. KBT can’t rely on Schema.org microdata to communicate the various parts of a web page as that depends on the web publishers doing it AND doing it correctly. So I agree with you, publishers must be incentivized or perhaps have it handed to them automatically. Failing that, the search engines will have to improve their extractor technology, which may be the direction they’re heading in.

  • As an Information Architect/UX designer for the past 20 years, I can definitely attest to the value of using less arbitrary criteria for search results than simply how many links a site has. However, like any artificial intelligence-leaning tactic, it is almost impossible to mimic the subtle nuances of the human brain with a computer (just try asking Siri something slightly complex, like what time my favorite bakery opens on Sunday).

    Still, an interesting article…

    • You’re right Steve, it’s very challenging. The Panda algorithm is arguably Google’s most ambitious live attempt to mimic the gut instinct of their users, their engineers, and their quality raters, by creating classifiers that rely on many aspects of the information (layout, outlinks, amount of advertisers, code to content ratio, etc.) in order to mimic our own impression of a web page when we look at it and think, “This is authoritative” or “This looks sketchy…”

      Further, on a parallel track, there are research papers from Microsoft that discuss quality-control algorithms which function like quality raters: they will actually re-rank sites if certain feedback (like CTR and time-on-page metrics) indicates the SERP listing is not satisfactory. In fact, these algorithms can question ranking factors and alter them if they are producing unsatisfactory results. Google’s use of CTR for quality control has been documented as far back as 2003, but I’m certain they’ve come a long way with those technologies since.

  • Vikram Rathore

    Great Roger ..

    Google always tries something new to serve better results to their huge audience. The truth algo sounds good, but I still want to know more about it. Ranking just on the basis of a truth factor would be a new revolution.

    • Vikram, if you enjoyed reading about KBT, then you may be interested in hearing about some other forms of artificial intelligence type algorithms related to understanding text based on a semantic understanding of words in their specific context. There’s an interesting paper that describes this algorithm and guess what? It relies on Wikipedia!

  • Rotimi

    Yes, come to think of it.

    Now I’m not sure if this KBT is indeed that truth algo, but when you really think about it: To remain at the top you need to be a perfectionist. So I doubt that Google would relent before they’ve gotten some kind of super-smart “brain” that will ultimately drive their SERPs towards the perfect truth.

  • This will be great, but difficult because:

    1) Which is the correct fact, as it can be subjective?
    2) Many times the facts are the same and you can’t write or present them in a different way.
    3) How is Google going to know what is correct, as they are not subject experts?

    I think this is far-fetched. I don’t think it will come.

    • You may be on to something. An interesting point to note is that in a randomly selected sample of 100 pages containing a high KBT score but low PageRank, 85% proved to be good pages worthy of ranking, pages that would have been ignored under any link-based algorithm. Do you think it would be useful to use this method for finding high quality but low-linked sites?

  • Hi Roger,

    It would be interesting to see how Google manages to deal with copied content. If Google starts to rank websites based on the facts they present, bloggers would just copy and re-publish them.

    Google may need another set of signals to actually judge which website presents what kind of facts and also detect whether they have been copied.

    • Hi Brian,
      That’s a good point; you’re spot on about duplicate and plagiarized content. The research paper cites copied content as a topic for additional research.

  • Tom

    In order to determine the origination of the facts, websites will have to cite aka link to the original source. This will result in websites with topical relevance linking to the same sites, which will earn those websites links (basic SEO), and ultimately “trust”. Because if most people are linking to the same source it must be true right? 🙂

    It’s obviously not that easy, but I don’t believe this is something that will be rolled out anytime soon (or ever).

    Heck, I don’t even use Wikipedia as a source when doing my research but that’s just me.

  • Roger, good post. I don’t think we’ll know whether the algorithm is coming until it is truly in play RE: Mobilegeddon. There are still a lot of limitations as you’ve pointed out in this document.

    It might be that the algorithm evolves or iterates forward and utilizes more factors but it is still too early to tell.

  • John Public

    There is a company called FactPipe that has been working on this issue for some time. They appear to be taking a more interactive approach.
    Not sure where they stand since it looks like they are updating their site.

  • Chris McAuley

    I will admit that I am no expert in this field, but the thought of Google implementing an algorithm like this is, to me, unjustified. I just wonder how Google will decide what is true? When a news story breaks, there are many theories and questions. How will Google’s algorithm deal with that? And when is a fact declared a fact? Many of the scientific facts of today will not be facts next year. What about hypotheses? They are not facts. Will the algorithm distinguish between sites that deal only in facts and sites that deal in hypotheses? I am quite capable of fact-checking my own information.

  • Roger knows his stuff. I always pay attention to his articles because they are usually spot on, and this one is no exception.