Google’s John Mueller answered whether removing pages from a large site helps to solve the problem of pages that are discovered by Google but not crawled. John offered general insights on how to solve this issue.
Discovered – Currently Not Indexed
Search Console is a service provided by Google that communicates search-related issues and feedback.
Indexing status is an important part of Search Console because it tells a publisher how much of a site is indexed and eligible for ranking.
The indexing status of webpages is found in the Search Console Page Indexing report.
A report that a page was discovered by Google but not indexed is often a sign that a problem needs to be addressed.
There are multiple reasons why Google may discover a page but decline to index it, although Google’s official documentation only lists one reason.
“Discovered – currently not indexed
The page was found by Google, but not crawled yet.
Typically, Google wanted to crawl the URL but this was expected to overload the site; therefore Google rescheduled the crawl.
This is why the last crawl date is empty on the report.”
Google’s John Mueller offers more reasons for why a page would be discovered but not indexed.
De-indexing Non-indexed Pages To Improve Indexing Sitewide?
There is an idea that removing certain pages will help Google crawl the rest of the site by giving it fewer pages to crawl.
There is a perception that Google has a limited crawl capacity (crawl budget) allocated to every site.
Googlers have repeatedly said that there is no such thing as a crawl budget in the way that SEOs perceive it.
Google weighs a number of considerations when deciding how many pages to crawl, including the website server’s capacity to handle extensive crawling.
An underlying reason for why Google is choosy about how much it crawls is that Google doesn’t have enough capacity to store every single webpage on the Internet.
That’s why Google tends to index pages that have some value (if the server can handle it) and to not index other pages.
For more information on Crawl Budget read: Google Shares Insights into Crawl Budget
This is the question that was asked:
“Would deindexing and aggregating 8M used products into 2M unique indexable product pages help improve crawlability and indexability (Discovered – currently not indexed problem)?”
Google’s John Mueller first acknowledged that it was not possible to address the person’s specific issue, then offered general recommendations.
“It’s impossible to say.
I’d recommend reviewing the large site’s guide to crawl budget in our documentation.
For large sites, sometimes crawling more is limited by how your website can handle more crawling.
In most cases though, it’s more about overall website quality.
Are you significantly improving the overall quality of your website by going from 8 million pages to 2 million pages?
Unless you focus on improving the actual quality, it’s easy to just spend a lot of time reducing the number of indexable pages, but not actually making the website better, and that wouldn’t improve things for search.”
Mueller Offers Two Reasons for Discovered Not Indexed Problem
Google’s John Mueller offered two reasons why Google might discover a page but decline to index it.
- Server Capacity
- Overall Website Quality
1. Server Capacity
Mueller said that Google’s ability to crawl and index webpages can be “limited by how your website can handle more crawling.”
The larger a website gets, the more bot activity it takes to crawl it. Compounding the issue is that Google is not the only bot crawling a large site.
There are other legitimate bots, for example from Microsoft and Apple, that also are trying to crawl the site. Additionally there are many other bots, some legitimate and others related to hacking and data scraping.
That means that for a large site, especially in the evening hours, there can be thousands of bots using website server resources to crawl a large website.
That’s why one of the first questions I ask a publisher with an indexing problem is about the state of their server.
In general, a website with millions of pages, or even hundreds of thousands of pages, will need a dedicated server or a cloud host (because cloud servers offer scalable resources such as bandwidth, CPU and RAM).
Sometimes a hosting environment may need more memory assigned to a process, like the PHP memory limit, in order to help the server cope with high traffic and prevent 500 Internal Server Error responses.
Troubleshooting servers involves analyzing a server error log.
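A related check can be run on the access log to see how much crawler traffic ends in server errors. The following is a minimal sketch, not a definitive tool. It assumes the server writes the common "combined" log format, and the function name and the list of bot markers are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumed combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Hypothetical substrings used to group requests by crawler.
BOT_MARKERS = ("Googlebot", "bingbot", "Applebot")


def summarize_crawler_errors(lines):
    """Return {bot: (total_requests, server_error_responses)}."""
    hits = Counter()
    errors = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        bot = next((m for m in BOT_MARKERS if m.lower() in agent), "other")
        hits[bot] += 1
        if match.group("status").startswith("5"):
            errors[bot] += 1
    return {bot: (hits[bot], errors[bot]) for bot in hits}
```

Passing the lines of a log file to summarize_crawler_errors gives a per-crawler count of requests and 5xx responses. A high share of 5xx responses for a crawler is a hint that the server is struggling under crawl load. User-agent strings can be spoofed, so the grouping is only approximate.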
2. Overall Website Quality
This is an interesting reason for not indexing enough pages. Overall site quality is like a score or a determination that Google assigns to a website.
Parts of a Website Can Affect Overall Site Quality
John Mueller has said that a section of a website can affect the overall site quality determination.
“…for some things, we look at the quality of the site overall.
And when we look at the quality of the site overall, if you have significant portions that are lower quality it doesn’t matter for us like why they would be lower quality.
…if we see that there are significant parts that are lower quality then we might think overall this website is not so fantastic as we thought.”
Definition of Site Quality
Google’s John Mueller offered a definition of site quality in another Office Hours video:
“When it comes to the quality of the content, we don’t mean like just the text of your articles.
It’s really the quality of your overall website.
And that includes everything from the layout to the design.
Like, how you have things presented on your pages, how you integrate images, how you work with speed, all of those factors they kind of come into play there.”
How Long it Takes to Determine Overall Site Quality
Another fact about how Google determines site quality is how long the process takes: it can take months.
“It takes a lot of time for us to understand how a website fits in with regards to the rest of the Internet.
…And that’s something that can easily take, I don’t know, a couple of months, a half a year, sometimes even longer than a half a year…”
Optimizing a Site for Crawling and Indexing
Optimizing an entire site or a section of a site is kind of a general high-level way to look at the problem. It often comes down to optimizing individual pages on a scaled basis.
Particularly for ecommerce sites with thousands or millions of products, optimization can take several forms.
Things to look out for:
Main Menu
Make sure the main menu is optimized to take users to the important sections of the site most users are interested in. The main menu can also link to the most popular pages.
Link to Popular Sections and Pages
The most popular pages and sections can also be linked from a prominent section of the homepage.
This helps users get to the pages and sections that matter most to them but also signals to Google that these are important pages that should be indexed.
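One way to audit this is to check which priority URLs the homepage actually links to. The sketch below is only an illustration under stated assumptions: the homepage HTML has already been fetched, the list of priority URLs is supplied by the publisher, and the names LinkCollector and missing_from_homepage are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            # Drop the fragment so /bags/#top counts as /bags/
            self.links.add(absolute.split("#")[0])


def missing_from_homepage(homepage_html, base_url, priority_urls):
    """Return the priority URLs that the homepage does not link to."""
    collector = LinkCollector(base_url)
    collector.feed(homepage_html)
    return [url for url in priority_urls if url not in collector.links]
```

Any URL returned by missing_from_homepage is a candidate for a link from the main menu or a prominent homepage section. The comparison is an exact string match, so trailing slashes and tracking parameters would need to be normalized first.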
Improve Thin Content Pages
Thin content is basically pages with little useful content or pages that are mostly duplicates of other pages (templated content).
It’s not enough to just fill the pages with words. The words and sentences must have meaning and relevance to site visitors.
For products it can be measurements, weight, available colors, suggestions of other products to pair with it, brands that the products work best with, links to manuals, FAQs, ratings and other information that users will find valuable.
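Thin and templated pages can be roughly surfaced before rewriting them. The following is a minimal sketch that flags pages with very few words and pairs of pages whose text is nearly identical, using Jaccard similarity over three-word shingles. The thresholds and function names are hypothetical; Google does not publish any word-count or similarity threshold, so the numbers are only a starting point for manual review.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word tuples of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles two pages have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def flag_thin_pages(pages, min_words=50, max_similarity=0.8):
    """pages maps URL -> extracted page text. Returns URL -> list of reasons."""
    flagged = {}
    urls = list(pages)
    sets = {url: shingles(pages[url]) for url in urls}
    for url in urls:
        if len(re.findall(r"\w+", pages[url])) < min_words:
            flagged.setdefault(url, []).append("few words")
    # Pairwise comparison: fine for a sample, too slow for millions of pages.
    for i, first in enumerate(urls):
        for second in urls[i + 1:]:
            if jaccard(sets[first], sets[second]) >= max_similarity:
                flagged.setdefault(first, []).append(f"near-duplicate of {second}")
                flagged.setdefault(second, []).append(f"near-duplicate of {first}")
    return flagged
```

Running flag_thin_pages on a sample of product pages from one category gives a shortlist to review by hand. A flagged page is not necessarily low quality, and a page that passes is not necessarily useful to visitors.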
Solving Discovered Not Indexed for More Online Sales
In a physical store it seems like it’s enough to just put the products on the shelves.
But the reality is that it often takes knowledgeable salespeople to make those products fly off those shelves.
A webpage can play the role of a knowledgeable salesperson that communicates to Google why the page should be indexed and helps customers choose those products.
Watch the Google SEO Office Hours at the 13:41 minute mark:
Featured image by Shutterstock/Rembolle