Google’s Gary Illyes shared a great deal of information about how Google detects duplicate pages and then chooses the canonical page to be included in the search engine results pages.
He also shared how at least twenty different signals are weighted in order to help identify the canonical page and why machine learning is used to adjust the weights.
How Google Handles Canonicalization
Gary begins by describing how sites are crawled and documents indexed. Then he moves on to the next step: canonicalization and duplicate detection.
He goes into detail about reducing content to a checksum, a number, which is then compared to the checksums of other pages to identify identical checksums.
“We collect signals and now we ended up with the next step, which is actually canonicalization and dupe detection.
…first you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other. And then you have to basically find a leader page for all of them.
And how we do that is perhaps how most people, other search engines do do it, which is basically reducing the content into a hash or checksum and then comparing the checksums.
And that’s because it’s much easier to do that than comparing perhaps the three thousand words…
…And so we are reducing the content into a checksum and we do that because we don’t want to scan the whole text because it just doesn’t make sense. Essentially it takes more resources and the result would be pretty much the same. So we calculate multiple kinds of checksums about textual content of the page and then we compare to checksums.”
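To make the checksum idea concrete, here is a minimal sketch in Python. It is not Google's implementation: the choice of SHA-256, the whitespace and case normalization, and the function names `checksum` and `cluster_exact_dupes` are all assumptions made for illustration. It only shows how comparing short hashes can stand in for comparing full page text.

```python
import hashlib
from collections import defaultdict


def checksum(text: str) -> str:
    """Reduce page text to a fixed-length fingerprint (SHA-256 is an assumption)."""
    # Collapse whitespace and lowercase so trivial formatting changes do not alter the hash.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_exact_dupes(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose text yields the same checksum; keep groups with 2+ members."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[checksum(text)].append(url)
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]


# Hypothetical example pages.
pages = {
    "https://example.com/a": "Blue widgets are great.",
    "https://example.com/a?ref=1": "Blue  widgets are GREAT.",
    "https://example.com/b": "Red widgets are fine.",
}
dupe_clusters = cluster_exact_dupes(pages)
```

In this sketch the two `/a` URLs land in the same cluster because their normalized text hashes to the same value, while `/b` stays on its own.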
Gary next answers whether this process catches near-duplicates or only exact duplicates:
“Good question. It can catch both. It can also catch near duplicates.
We have several algorithms that, for example, try to detect and then remove the boilerplate from the pages.
So, for example, we exclude the navigation from the checksum calculation. We remove the footer as well. And then you are left with what we call the centerpiece, which is the central content of the page, kind of like the meat of the page.
When we calculate to the checksums and we compare the checksums to each other, then those that are fairly similar, or at least a little bit similar, we will put them together in a dupe cluster.”
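Gary does not say which algorithms Google uses for boilerplate removal or similarity, so the following is only a sketch under stated assumptions. It treats a page as a hypothetical dictionary of named sections, drops the `nav` and `footer` sections to leave the centerpiece, and then compares centerpieces using word shingles and Jaccard similarity, a common textbook technique that is swapped in here purely for illustration.

```python
BOILERPLATE_SECTIONS = {"nav", "footer"}  # assumption: sections are already labeled


def centerpiece(page: dict[str, str]) -> str:
    """Keep only the sections that are not navigation or footer boilerplate."""
    return " ".join(
        text for section, text in page.items() if section not in BOILERPLATE_SECTIONS
    )


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(page_a: dict[str, str], page_b: dict[str, str]) -> float:
    """Jaccard similarity of the two centerpieces, from 0.0 to 1.0."""
    a, b = shingles(centerpiece(page_a)), shingles(centerpiece(page_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: A and B differ by one word; C shares only boilerplate with A.
page_a = {"nav": "Home About Contact",
          "main": "the quick brown fox jumps over the lazy dog",
          "footer": "Copyright A"}
page_b = {"nav": "Start Shop Help",
          "main": "the quick brown fox jumps over the lazy cat",
          "footer": "Copyright B"}
page_c = {"nav": "Home About Contact",
          "main": "completely different article about search engines",
          "footer": "Copyright A"}

near_dupe_score = similarity(page_a, page_b)   # 6 shared shingles out of 8 -> 0.75
unrelated_score = similarity(page_a, page_c)   # no shared shingles -> 0.0
```

Because the navigation and footer are excluded, pages A and C score 0.0 even though they share identical boilerplate, while A and B score high despite having different boilerplate. Any threshold for "fairly similar" would be a tuning decision that the podcast does not specify.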
Gary was then asked what a checksum is:
“A checksum is basically a hash of the content. Basically a fingerprint. Basically it’s a fingerprint of something. In this case, it’s the content of the file…
And then, once we’ve calculated these checksums, then we have the dupe cluster. Then we have to select one document, that we want to show in the search results.”
Gary then discussed the reason why Google prevents duplicate pages from appearing in the SERP:
“Why do we do that? We do that because typically users don’t like it when the same content is repeated across many search results. And we do that also because our storage space in the index is not infinite. Basically, why would we want to store duplicates in our index?”
Next he returns to the heart of the topic, detecting duplicates and selecting the canonical page:
“But, calculating which one to be the canonical, which page to lead the cluster, is actually not that easy. Because there are scenarios where even for humans it would be quite hard to tell which page should be the one that to be in the search results.
So we employ, I think, over twenty signals, we use over twenty signals, to decide which page to pick as canonical from a dupe cluster.
And most of you can probably guess like what these signals would be. Like one is obviously the content.
But it could be also stuff like PageRank for example, like which page has higher PageRank, because we still use PageRank after all these years.
It could be, especially on same site, which page is on an https URL, which page is included in the sitemap, or if one page is redirecting to the other page, then that’s a very clear signal that the other page should become canonical, the rel=canonical attribute… is quite a strong signal again… because… someone specified that that other page should be the canonical.
And then once we compared all these signals for all page pairs then we end up with actual canonical. And then each of these signals that we use have their own weight. And we use some machine learning voodoo to calculate the weights for these signals.”
He then gets more granular and explains why Google gives redirects a heavier weight than the http/https URL signal:
“But for example, to give you an idea, 301 redirect, or any sort of redirect actually, should be much higher weight when it comes to canonicalization than whether the page is on an http URL or https.
Because eventually the user would see the redirect target. So it doesn’t make sense to include the redirect source in the search results.”
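A toy sketch of weighted signals can illustrate the point. The signal names and every numeric weight below are invented for illustration; the only things taken from the podcast are that redirects should outweigh the http/https signal and that rel=canonical is a strong signal. Gary also describes comparing signals across page pairs, whereas this simplified version just scores each page on its own.

```python
# Hypothetical weights; the real values are not public.
WEIGHTS = {
    "redirect_target": 5.0,       # another page in the cluster redirects here
    "rel_canonical_target": 3.0,  # another page declares this one via rel=canonical
    "in_sitemap": 1.0,
    "https": 1.0,
}


def score(signals: dict[str, bool]) -> float:
    """Sum the weights of every signal that is present for a page."""
    return sum(WEIGHTS[name] for name, present in signals.items() if present)


def pick_canonical(cluster: dict[str, dict[str, bool]]) -> str:
    """Return the URL with the highest weighted score in a dupe cluster."""
    return max(cluster, key=lambda url: score(cluster[url]))


# Hypothetical cluster: the https page redirects to the http page.
cluster = {
    "https://example.com/old": {
        "redirect_target": False, "rel_canonical_target": False,
        "in_sitemap": True, "https": True,
    },
    "http://example.com/new": {
        "redirect_target": True, "rel_canonical_target": False,
        "in_sitemap": False, "https": False,
    },
}
canonical = pick_canonical(cluster)
```

With these made-up weights, the redirect target wins even though the redirect source is on https and listed in the sitemap, which mirrors the reasoning that users end up on the redirect target anyway.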
Mueller asks him why Google uses machine learning to adjust the signal weights:
“So do we get that wrong sometimes? Why do we need machine learning, like we clearly just write down these weights once and then it’s perfect, right?”
Gary then shared an anecdote of having worked on canonicalization, trying to introduce hreflang into the calculation as a signal. He related that it was a nightmare to try to adjust the weights manually. He said that manually adjusting the weights can throw off other weights, leading to unexpected outcomes such as strange search results that didn’t make sense.
He shared a bug example of pages with short URLs suddenly ranking better, which Gary called silly.
He also shared an anecdote about manually reducing the sitemap signal to deal with a canonicalization-related bug, only for that change to make another signal stronger, which then caused other issues.
The point is that all the weighted signals are tightly interrelated, and it takes machine learning to change the weighting successfully.
“Let’s say that… the weight of the sitemap signal is too high. And then, let’s say that the dupes team says, okay let’s reduce that signal a tiny bit.
But then when they reduce that signal a tiny bit, then some other signal becomes more powerful.
But you can’t actually control which signal because there are like twenty of them.
And then you tweak that other signal that suddenly became more powerful or heavier and then that throws off yet another signal. And then you tweak that one and basically it’s a never-ending game essentially, it’s a whack-a-mole.
So if you feed all these signals to a machine learning algorithm plus all the desired outcomes then you can train it to set these weights for you and then use those weights that were calculated or suggested by a machine learning algorithm.”
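The podcast does not say what kind of model Google uses, so the following is only a sketch of the general idea: feed in signals plus desired outcomes and let training set the weights. It uses a small logistic regression trained by stochastic gradient descent, which is an assumption chosen for simplicity. Each training example is a hypothetical pair of pages (A, B) described by the difference in their signals, with a label of 1 when A is the desired canonical and 0 when B is.

```python
import math

SIGNALS = ["redirect_target", "https", "in_sitemap"]  # hypothetical subset of signals


def train(examples: list[tuple[list[int], int]],
          epochs: int = 500, lr: float = 0.5) -> dict[str, float]:
    """Learn one weight per signal from labeled page pairs (logistic regression, SGD)."""
    weights = [0.0] * len(SIGNALS)
    for _ in range(epochs):
        for diff, label in examples:
            z = sum(w * x for w, x in zip(weights, diff))
            predicted = 1.0 / (1.0 + math.exp(-z))  # probability that A is canonical
            for i, x in enumerate(diff):
                weights[i] += lr * (label - predicted) * x
    return dict(zip(SIGNALS, weights))


# Each diff is signals(A) minus signals(B), so values are -1, 0, or 1.
# These desired outcomes are invented for illustration.
examples = [
    ([1, -1, 0], 1),   # A is a redirect target, B is https: A should win
    ([0, 1, 0], 1),    # A is https, otherwise equal: A should win
    ([0, 0, 1], 1),    # A is in the sitemap, otherwise equal: A should win
    ([-1, 1, 1], 0),   # B is a redirect target, A has https and sitemap: B should win
]
learned = train(examples)
```

The learned redirect weight ends up larger than the https weight without anyone setting either by hand. Adding a new signal in this setup means adding a column and more labeled examples, then retraining, rather than manually rebalancing every existing weight.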
John Mueller next asks whether those twenty weights, like the previously mentioned sitemap signal, could be considered ranking signals.
“Are those weights also like a ranking factor? …Or is canonicalization independent of ranking?”
Gary answers:
“So, canonicalization is completely independent of ranking. But the page that we choose as canonical that will end up in the search results pages, and that will be ranked but not based on these signals.”
Gary shared a great deal about how canonicalization works, including how complex it is. They discussed writing up this information at a later date, but they sounded daunted by the task.
The podcast episode was titled, “How technical Search content is written and published at Google, and more!” but I have to say that by far the most interesting part was Gary’s description of canonicalization inside Google.