A former Google software engineer commented in a Hacker News discussion about how Google works. Along the way, he mentioned that Google no longer uses the original PageRank algorithm.
Google Does Not Use Original PageRank?
The Hacker News discussion forked into a side discussion about creating a competing search engine, and an ex-Googler dropped in to discuss Google’s PageRank.
This is what the ex-Googler said about PageRank no longer being in use:
“The comments here that PageRank is Google’s secret sauce also aren’t really true – Google hasn’t used PageRank since 2006. The ones about the search & clickthrough data being important are closer…”
He then followed up with:
“They replaced it in 2006 with an algorithm that gives approximately-similar results but is significantly faster to compute. The replacement algorithm is the number that’s been reported in the toolbar, and what Google claims as PageRank (it even has a similar name, and so Google’s claim isn’t technically incorrect).
Both algorithms are O(N log N) but the replacement has a much smaller constant on the log N factor, because it does away with the need to iterate until the algorithm converges. That’s fairly important as the web grew from ~1-10M pages to 150B+.”
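To make the phrase “iterate until the algorithm converges” concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration only, not Google's production code. The function name, the toy link graph, the 0.85 damping factor, and the convergence tolerance are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Textbook power-iteration PageRank on a dict {page: [outlinked pages]}.

    Returns (ranks, iterations_used). All parameter values are illustrative.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        # Stop once the ranks barely change between passes (convergence).
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank, iteration


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical 3-page web
    ranks, passes = pagerank(toy_web)
    print(passes, {page: round(score, 4) for page, score in ranks.items()})
```

Every pass touches every link in the graph, and the number of passes needed is not known in advance. That repeated work is the cost the ex-Googler says the replacement avoids.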
PageRank and New PageRank
Hamlet Batista tweeted about the revelation contained in the Hacker News discussion.
Search patent expert Bill Slawski responded by tweeting:
“Google’s newer version of PageRank was granted as a patent in 2006. Coincidence?”
Bill Slawski wrote about this new PageRank in November 2015.
In that 2015 article, Bill wrote:
“Under this new patent, Google adds a diversified set of trusted pages to act as seed sites. When calculating rankings for pages. Google would calculate a distance from the seed pages to the pages being ranked.”
Here is what Bill noted about the new PageRank in a follow-up post from April 2018:
“The original PageRank patent, assigned to Stanford University, has expired. Google had an exclusive license to use PageRank. Google filed a PageRank update, with a different algorithm behind it.”
Bill then quoted from the patent:
“A popular search engine developed by Google Inc. of Mountain View, Calif. uses PageRank.RTM. as a page-quality metric for efficiently guiding the processes of web crawling, index selection, and web page ranking.”
Is New PageRank the Link Distance Ranking Algorithm?
The Google patents that Bill Slawski cites focus on ranking pages by their link distance from a trusted seed set. The name of the patent is Producing a Ranking for Pages Using Distances in a Web-link Graph.
It is evident from the title that this is a link distance ranking algorithm that uses distances from a trusted seed set to calculate a form of PageRank. It is not a trust algorithm.
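The general idea of ranking by distance from a seed set can be sketched in a few lines. This is a simplified illustration, not the method claimed in the patent. It assumes every link has the same length of 1 and uses a plain breadth-first search. The function names and the toy graph are hypothetical.

```python
from collections import deque


def seed_distances(links, seeds):
    """Shortest link distance from any seed page to each reachable page.

    Assumes every link has length 1 (a simplification made for this sketch).
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distance:  # first visit is the shortest path in BFS
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


def rank_by_distance(links, seeds):
    """Order reachable pages from closest to farthest from the seed set."""
    distance = seed_distances(links, seeds)
    return sorted(distance, key=lambda page: (distance[page], page))


if __name__ == "__main__":
    toy_web = {  # hypothetical link graph
        "seed1": ["a"],
        "seed2": ["b"],
        "a": ["c"],
        "b": ["c", "d"],
        "c": ["d"],
        "x": ["a"],  # links out, but nothing reachable from a seed links to it
    }
    print(rank_by_distance(toy_web, ["seed1", "seed2"]))
```

A single traversal outward from the seeds visits each link once, so there is no loop that repeats until convergence. Pages that no seed can reach, such as "x" in the toy graph, receive no distance at all.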
Original PageRank Algorithm No Longer in Use?
If this software engineer is to be believed, the original PageRank algorithm is no longer in use. It may have been replaced by a more efficient algorithm with a similar name, as Bill Slawski suggested.
Is this Really an ex-Googler?
I believe this is an ex-Googler. According to his Hacker News profile, his name is Jonathan Tang.
That name matches a LinkedIn profile with the following background information:
“Senior Software Engineer
Company Name: Google
Dates Employed: Jan 2009 – May 2014
I joined as a UI software engineer in Search and then gradually moved more toward backend work, eventually working with the full Search stack. Also helped Google+ and GFiber launch.”
Google Engineer Reveals More about Google
The engineer shared that one reason some may find Google search results unsatisfactory is that Google is tuned to satisfy the masses and not the individual. I called that the Fruit Loops effect, where Google, like a supermarket cereal aisle, will show users what they expect to see, which in some cases is Fruit Loops.
Here’s his explanation for why Google SERPs may be unsatisfactory to some:
“The reason for that is because Google’s building for a mainstream audience, because the mainstream (by definition) is much bigger than any niche. They increase aggregate happiness (though not your specific happiness) a lot more by doing so.”
Commercial Searches Subsidize non-Commercial Searches
The ex-Googler also discussed the percentage of revenue that comes from commercial searches, although he allowed that his numbers may be dated.
“Google makes basically 80% of their revenue from searches for commercial products or services (insurance, lawyers, therapists, SaaS, flowers, etc.) The remainder is split between AdSense, Cloud, Android, Google Play, GFiber, YouTube, DoubleClick, etc. (may be a bit higher now).”
How Google’s Document Retrieval Works
He then discussed how documents are retrieved for every query:
“Remember, search touches (nearly) every indexed document on every query – if you throw in 200ms request latency for 4B documents your request will take roughly 25 years to complete.
…it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I’m not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index.
Each one of these needs to be scored (well, sorta – there are various tricks you can use to avoid scoring some docs, which again I’m not at liberty to discuss), and it’s usually beneficial to merge the scores only after they have been computed for all query terms, because you have more information about context available then.”
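The arithmetic and the posting-list idea in that quote can be sketched as follows. The latency calculation uses the figures from the quote. Everything else is an assumption made for illustration: the function names, the toy documents, and the use of raw term counts as a stand-in score. It does not reflect how Google actually scores documents.

```python
from collections import defaultdict

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def naive_scan_years(doc_count, seconds_per_doc):
    """Years needed to visit every document one at a time."""
    return doc_count * seconds_per_doc / SECONDS_PER_YEAR


def build_index(docs):
    """Map each term to its posting list: {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query_terms):
    """Score each term's posting list first, then merge the per-term scores."""
    per_term_scores = {}
    for term in query_terms:
        # Only documents in this term's posting list are touched.
        per_term_scores[term] = dict(index.get(term, {}))
    merged = defaultdict(int)
    for scores in per_term_scores.values():
        for doc_id, score in scores.items():
            merged[doc_id] += score
    # Highest merged score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # 4 billion documents at 200 ms each, as in the quote.
    print(round(naive_scan_years(4_000_000_000, 0.2), 1), "years")
    docs = {1: "cheap car insurance", 2: "car repair", 3: "flower delivery"}
    print(search(build_index(docs), ["car", "insurance"]))
```

Four billion documents at 0.2 seconds each comes to 800 million seconds, which is a little over 25 years and matches the figure in the quote. With the index, the query for "car insurance" never looks at the flower delivery document.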
Is it Possible that the Original PageRank Is No Longer in Use?
If one thinks about it, it does make sense that the original PageRank algorithm might not be in use. It’s possible that it has evolved or been revised. The ex-Googler claims it has been completely replaced. That claim matches evidence visible in recent Google patent updates, where a new form of PageRank is claimed.
Read the Hacker News discussion here:
Read the Twitter discussion here