Google uses a predictive method to detect duplicate content based on URL patterns, which could lead to pages being incorrectly identified as duplicates.
In order to prevent unnecessary crawling and indexing, Google tries to predict when pages may contain similar or duplicate content based on their URLs.
When Google crawls pages with similar URL patterns and finds they contain the same content, it may then determine that all other pages with that URL pattern have the same content as well.
Unfortunately for site owners, that could mean pages with unique content get written off as duplicates because they share a URL pattern with pages that are actual duplicates. Those pages would then be left out of Google’s index.
This topic is discussed during the Google Search Central SEO hangout recorded on March 5. Site owner Ruchit Patel asks Google’s John Mueller about his event website, where thousands of URLs are not being indexed correctly.
One of Mueller’s theories as to why that’s happening involves the predictive method used to detect duplicate content.
Read Mueller’s response in the section below.
Google’s John Mueller On Predicting Duplicate Content
Google has multiple levels of determining when web pages have duplicate content.
One of them is to look at the page content directly, and the other is to predict when pages are duplicates based on their URLs.
“What tends to happen on our side is we have multiple levels of trying to understand when there is duplicate content on a site. And one is when we look at the page’s content directly and we kind of see, well, this page has this content, this page has different content, we should treat them as separate pages.
The other thing is kind of a broader predictive approach that we have where we look at the URL structure of a website where we see, well, in the past, when we’ve looked at URLs that look like this, we’ve seen they have the same content as URLs like this. And then we’ll essentially learn that pattern and say, URLs that look like this are the same as URLs that look like this.”
Mueller goes on to explain the reason Google does this is to conserve resources when it comes to crawling and indexing.
When Google predicts a page is a duplicate version of another page because it has a similar URL, it may not even crawl that page to see what the content actually looks like.
“Even without looking at the individual URLs we can sometimes say, well, we’ll save ourselves some crawling and indexing and just focus on these assumed or very likely duplication cases. And I have seen that happen with things like cities.
I have seen that happen with things like, I don’t know, automobiles is another one where we saw that happen, where essentially our systems recognize that what you specify as a city name is something that is not so relevant for the actual URLs. And usually we learn that kind of pattern when a site provides a lot of the same content with alternate names.”
Mueller speaks to how Google’s predictive method of detecting duplicate content may affect event websites:
“So with an event site, I don’t know if this is the case for your website, with an event site it could happen that you take one city, and you take a city that is maybe one kilometer away, and the events pages that you show there are exactly the same because the same events are relevant for both of those places.
And you take a city maybe five kilometers away and you show exactly the same events again. And from our side, that could easily end up in a situation where we say, well, we checked 10 event URLs, and this parameter that looks like a city name is actually irrelevant because we checked 10 of them and it showed the same content.
And that’s something where our systems can then say, well, maybe the city name overall is irrelevant and we can just ignore it.”
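The pattern-learning Mueller describes can be sketched in a few lines. The following is a toy illustration only, with made-up data, function names, and thresholds, not Google’s actual system: sample crawled URLs that differ in a single path segment, and if every variant served identical content, predict that the segment (for example, a city name) is irrelevant.

```python
from collections import defaultdict

def learn_duplicate_patterns(crawled, min_samples=3):
    """Toy sketch of predictive duplicate detection by URL pattern.

    crawled: dict mapping URL path -> page content already fetched.
    Returns URL patterns (one segment wildcarded as '*') whose sampled
    variants all served identical content.
    """
    groups = defaultdict(list)
    for path, content in crawled.items():
        parts = path.strip("/").split("/")
        for i in range(len(parts)):
            # Wildcard segment i to form a candidate pattern.
            pattern = "/".join(p if j != i else "*" for j, p in enumerate(parts))
            groups[pattern].append(content)

    duplicate_patterns = set()
    for pattern, contents in groups.items():
        # Enough samples, and every variant returned the same content:
        # predict that the wildcarded segment is irrelevant.
        if len(contents) >= min_samples and len(set(contents)) == 1:
            duplicate_patterns.add(pattern)
    return duplicate_patterns

# Three nearby cities serving the exact same events page:
pages = {
    "/events/zurich": "concert A, concert B",
    "/events/adliswil": "concert A, concert B",
    "/events/dietikon": "concert A, concert B",
}
print(learn_duplicate_patterns(pages))  # {'events/*'}
```

In this toy model, the city segment gets learned as ignorable, which is exactly the failure mode Mueller warns about: once the pattern is learned, a city page with genuinely unique events could be skipped without ever being crawled.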
What can a site owner do to correct this problem?
As a potential fix for this problem, Mueller suggests looking for situations where there are real cases of duplicate content and limiting those as much as possible.
“So what I would try to do in a case like this is to see if you have this kind of situations where you have strong overlaps of content and to try to find ways to limit that as much as possible.
And that could be by using something like a rel canonical on the page and saying, well, this small city that is right outside the big city, I’ll set the canonical to the big city because it shows exactly the same content.
So that really every URL that we crawl on your website and index, we can see, well, this URL and its content are unique and it’s important for us to keep all of these URLs indexed.
Or we see clear information that this URL you know is supposed to be the same as this other one, you have maybe set up a redirect or you have a rel canonical set up there, and we can just focus on those main URLs and still understand that the city aspect there is critical for your individual pages.”
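In markup terms, the rel canonical approach Mueller suggests would look something like the snippet below. The domain and city paths are hypothetical examples, not from the hangout:

```html
<!-- On the small-city page, e.g. https://example.com/events/adliswil,
     which shows the same events as the nearby big city: -->
<link rel="canonical" href="https://example.com/events/zurich">
```

This tells Google which URL is the preferred version when two pages intentionally show the same content, so the remaining indexed URLs each carry unique content.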
Mueller doesn’t address this aspect of the issue, but it’s worth noting there’s no penalty or negative ranking signal associated with duplicate content.
At most, Google will not index duplicate content, but it won’t reflect negatively on the site overall.
Hear Mueller’s response in the video below: