Crawling and indexing – these are the two main tasks of the Google bot. Webmasters can facilitate the indexing of their websites by making several modifications in advance. This enables the bot to do a thorough job and gives the websites the opportunity to rank better.
The five steps below help you optimize how your website is crawled and indexed to make your website much easier to find on the Web.
1. The Basics
1.1 The Robots.txt
The robots.txt is a simple text file that gives the Google bot specific instructions on how the website should be crawled, for instance by excluding certain directories. These are often data-sensitive areas, such as login and customer accounts, that should not be crawled.
If you want to exclude a specific directory from the crawl, add a Disallow rule for that directory to robots.txt.
A star in such a rule is a placeholder (a so-called wildcard) and represents all other content associated with this directory.
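As a minimal sketch, the following Python snippet holds such a rule set as a string and checks it with the standard-library robots.txt parser. The directory /login/ and the domain www.example.com are hypothetical placeholders. The sketch leaves out the trailing star because Python's parser matches rules by plain prefix; for Google, /login/ and /login/* are treated the same.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: keep every crawler out of the /login/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login/
"""


def is_allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above permit user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, is_allowed("https://www.example.com/login/account") returns False, while a URL outside the directory, such as https://www.example.com/products/, stays crawlable.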
After creating the robots.txt file, you need to save it in the root directory of the website so that it is reachable under the path /robots.txt directly after the domain name.
Use the Google Search Console to test your robots.txt. Please note this requires you to have registered the website in the Search Console.
1.2 The XML Sitemap
Besides robots.txt, there is another file which plays a key role for indexing: the XML sitemap. This is a machine-readable file listing all the URLs on your website. This structured data is created as text and saved in XML format. Besides the URLs, the file also lets you transmit other information, such as when the various URLs were last updated.
After you have created the XML file, add it to the Google Search Console to inform Google of the existing URLs. However, the XML sitemap only recommends the URLs to Google and does not give the bot any instructions like the robots.txt file does. Google may therefore ignore parts of the file when indexing the website.
The XML sitemap is often handled poorly despite the fact that it is very handy in the indexing of new and large websites since it informs Google about all existing sub-pages. For instance, if you have new content on a webpage that is not very well interlinked, use the sitemap to inform Google about this content.
A simple XML sitemap without additional attributes consists of a urlset root element that contains one url entry per page, with the address of the page in a loc element.
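To make that structure concrete, here is a minimal sketch that builds a sitemap with Python's standard library. The function name build_sitemap and the example.com URLs are assumptions made for illustration; the namespace is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages of a small website.
    print(build_sitemap([
        "https://www.example.com/",
        "https://www.example.com/products/",
    ]))
```

Optional details such as the last modification date would go into additional child elements of each url entry.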
There are different ways to create a sitemap. Some CMS even come with the relevant tools for the automatic creation of a sitemap. You can also use any of the free programs available online.
After the sitemap is ready, save it in the root directory of your website, for example as sitemap.xml.
Compress the sitemap or generate it dynamically to save space on the server.
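A minimal sketch of the compression step, assuming gzip as the format (which the sitemap protocol allows) and a hypothetical helper name compress_sitemap:

```python
import gzip


def compress_sitemap(xml_text: str) -> bytes:
    """Return the gzip-compressed UTF-8 bytes of a sitemap document."""
    return gzip.compress(xml_text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical, nearly empty sitemap used only to show the call.
    sitemap = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset/>'
    with open("sitemap.xml.gz", "wb") as handle:
        handle.write(compress_sitemap(sitemap))
```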
Google requires you to split the sitemap if you have over 50,000 URLs, since that is the limit for a single sitemap file. In this case, you need to use an index and create a “sitemap of the sitemaps”. The index sitemap should contain links to all the individual XML sitemaps.
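The splitting and the index can be sketched as follows. The helper names and the example.com file names are hypothetical; the sitemapindex, sitemap and loc elements come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def split_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index XML string that lists each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical site with 120,001 pages: this needs three sitemap files.
    pages = [f"https://www.example.com/page-{n}" for n in range(120_001)]
    chunks = split_urls(pages)
    names = [f"https://www.example.com/sitemap-{n}.xml"
             for n in range(1, len(chunks) + 1)]
    print(build_sitemap_index(names))
```

Each chunk would then be written to its own sitemap file, and only the index needs to be submitted.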
You should then upload the file in the Search Console to enable Google to re-crawl the sub-pages.
If you have a lot of videos and images on your website, you should also check the indexing for the universal search by creating separate sitemaps for the images and videos. The structure of an XML sitemap for media files is similar to that of the normal sitemap.
In many cases, you want your website to be re-crawled as soon as possible after you have made several modifications. The Google Search Console helps in such cases. Call up the respective website there and immediately send it to the Google index. This function is limited to 500 URLs per month for every website.
2. Make Use of the Crawl Budget
The Google bot is a computer program designed to follow links, crawl URLs, and then interpret, classify, and index the content. To do this, the bot has a limited crawl budget. The number of pages which are crawled and indexed depends on the PageRank of the respective website, as well as on how easily the bot can follow the links on the website.
An optimized website architecture will make it much easier for the bot. In particular, flat hierarchies help ensure the bot accesses all available webpages. Just as users do not like having to go through more than four clicks to access desired content, the Google bot is often unable to go through large directory depths if the path is complicated.
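One way to check how flat a hierarchy really is: compute the click depth of every page with a breadth-first search over the internal links. The sketch below assumes you already have the link structure as a dictionary (for example from your own crawl); the function names and the four-click threshold parameter are illustrative choices, not a Google rule.

```python
from collections import deque


def click_depths(links, start):
    """Return the minimum number of clicks from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_deeper_than(links, start, max_clicks=4):
    """List pages that need more than max_clicks clicks from the start page."""
    depths = click_depths(links, start)
    return sorted(page for page, depth in depths.items() if depth > max_clicks)


if __name__ == "__main__":
    # Hypothetical link structure: each key links to the pages in its list.
    site = {
        "/": ["/shop", "/blog"],
        "/shop": ["/shop/shoes"],
        "/shop/shoes": ["/shop/shoes/running"],
        "/shop/shoes/running": ["/shop/shoes/running/trail"],
        "/shop/shoes/running/trail": ["/shop/shoes/running/trail/model-x"],
    }
    print(pages_deeper_than(site, "/"))
```

Pages that the function reports are candidates for an extra internal link from a page closer to the homepage.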
The crawling can also be influenced by using your internal links. Independently of the navigation menu, you can provide the bot with hints on other URLs using deep links within the text. This way, links that point to important content from your homepage will be crawled faster. Using descriptive anchor text for the link target gives the bot additional information about what to expect from the link and how to classify the content.
For the bot to be able to crawl your content faster, logically define your headings using h-tags. Here, you should make sure to structure the tags in hierarchical order. This means using the h1 tag for the main title and h2, h3, etc. for your subheadings.
Many CMS and web designers often use h-tags to format the sizes of their page headings since it is easier. This might confuse the Google bot during the crawl. You should use CSS to specify the font sizes independent of the content.
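A quick way to spot headings that were chosen for their font size rather than their level is to look for skipped levels. The following sketch uses Python's built-in HTML parser; the two rules it checks (the first heading is an h1, and no level is skipped on the way down) are assumptions of this example, not official Google requirements.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the levels of h1..h6 tags in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html):
    """Return a list of human-readable problems in the heading structure."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    problems = []
    if levels and levels[0] != 1:
        problems.append(f"first heading is h{levels[0]}, not h1")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"h{previous} is followed by h{current}")
    return problems
```

For a page with the headings h1, h2, h3 the list is empty; for h1 followed directly by h3 the function reports the skipped level.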
3. Avoid Forcing the Bot to Go Through Detours
Orphan pages and 404 errors stress the crawl budget unnecessarily.
Whenever the Google bot encounters an error page, it is unable to follow any other links and therefore has to go back and start anew from a different point. Browsers or crawlers are often unable to find a URL after website operators delete products from their online shop or after changes to the URLs. In such cases, the server returns a 404 error code (not found). A high number of such errors consumes a huge part of the bot’s crawl budget. Webmasters should make sure they fix such errors on a regular basis (also see #5 – “Monitoring”).
Orphan pages are pages that do not have any internal inbound links but might have external links. The bot is either unable to crawl such pages or is abruptly forced to stop the crawl. Similar to 404 errors, you should also try to avoid orphan pages. These pages often result from errors in web design or from internal links whose syntax is no longer correct.
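Both problems can be found from the same two inputs: the list of pages that exist and the internal links between them. The sketch below assumes you have both available, for example from a CMS export and your own crawl; the function names and sample paths are hypothetical.

```python
def find_orphans(pages, links, start="/"):
    """Return existing pages that no other page links to (start page excluded)."""
    linked = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source
    }
    return sorted(set(pages) - linked - {start})


def find_broken_links(pages, links):
    """Return (source, target) pairs whose target page does not exist."""
    existing = set(pages)
    return sorted(
        (source, target)
        for source, targets in links.items()
        for target in targets
        if target not in existing
    )


if __name__ == "__main__":
    # Hypothetical shop: /shop/hats was deleted, /old-landing lost its links.
    pages = ["/", "/shop", "/shop/shoes", "/old-landing"]
    links = {
        "/": ["/shop"],
        "/shop": ["/shop/shoes", "/shop/hats"],
        "/shop/shoes": ["/"],
    }
    print(find_orphans(pages, links))
    print(find_broken_links(pages, links))
```

Broken links are candidates for a redirect or a corrected link; orphan pages need at least one internal link or should be removed.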
4. Avoiding Duplicate Content
According to Google, duplicate content is no reason to take action against the respective website. However, this should not be interpreted to mean duplicate content should remain on the websites. If SEOs or webmasters do not do anything about it, the search engine goes ahead and decides which content to index and which URLs to ignore based on the strong similarity. Monitor and control how Google handles such content using these three measures:
- 301 redirects: Duplicate content can occur very quickly, especially if the version with www. and that without are indexed. The same also applies for secured connections via https. To avoid duplicate content, you should use a permanent redirect (301) pointing to the preferred version of the webpage. This requires either modifying your .htaccess file accordingly or adding the preferred version in the Google Search Console.
- Canonical tag: In particular, online shops run the risk of duplicate content arising simply because a product is available on multiple URLs. Solve this problem using a canonical tag. The tag informs the Google bot about the original URL version that should be indexed. You should make sure that all URLs that should not be indexed have a tag pointing to the canonical URL in your source code. There are different tools you can use to test your canonical tags. These tools help you identify pages that have no canonical tag or those that have faulty canonical tags. Ideally, every page should have a canonical tag. Unique/original pages should have self-referencing canonical tags.
- rel=alternate: This tag will be very useful if a website is available in various regional languages or if you have both a mobile and desktop version of your website. The tag informs the Google bot about an alternative URL with the same content.
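The first two measures can be sketched in a few lines of Python. The first function computes where a permanent redirect should point, assuming https://www.example.com is the preferred version (a hypothetical choice); the second checks whether a page carries a canonical tag and where it points. Neither replaces the server configuration itself, and all names here are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

PREFERRED_SCHEME = "https"                     # assumption: HTTPS is preferred
PREFERRED_HOST = "www.example.com"             # hypothetical preferred host
OWN_HOSTS = {"example.com", "www.example.com"}  # hosts serving the same site


def redirect_target(url):
    """Return the preferred URL for a 301 redirect, or None if none is needed."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in OWN_HOSTS:
        return None  # not one of our hosts
    preferred = urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
    )
    return None if preferred == url else preferred


class CanonicalCollector(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def canonical_status(page_url, html):
    """Classify a page as 'missing', 'multiple', 'self' or 'points elsewhere'."""
    collector = CanonicalCollector()
    collector.feed(html)
    if not collector.canonicals:
        return "missing"
    if len(collector.canonicals) > 1:
        return "multiple"
    return "self" if collector.canonicals[0] == page_url else "points elsewhere"
```

Pages reported as "missing" or "multiple" are the ones to fix first; "points elsewhere" is expected for duplicate product URLs and a warning sign for unique pages.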
5. Monitoring: Quick Fixes
Regularly checking the data in the Google Search Console is always a good way of knowing how Google crawls and indexes your website. The Search Console provides a lot of tips to help you optimize how your website is crawled.
Under “crawl errors”, you will find a detailed list of both 404 errors and the so-called “Soft 404 errors.” Soft 404 errors describe pages that show an error message or no real content, but for which the server returns a success code such as 200 instead of an error code.
Here, the crawl statistics are very revealing. These show how often the Google bot visited the website as well as the amount of data downloaded in the process. A sudden drop in the values can indicate errors on the website.
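The difference between the two error types can be illustrated with a small heuristic. This is only a sketch: the phrases it searches for are hypothetical examples that you would adapt to your own error templates, and Google's own detection is not public.

```python
# Hypothetical phrases that a site's error template might contain.
ERROR_PHRASES = ("page not found", "no longer available")


def classify_response(status, body):
    """Classify a fetched page as '404', 'soft 404', 'ok' or 'status <code>'."""
    if status == 404:
        return "404"
    if status == 200:
        text = body.lower()
        if any(phrase in text for phrase in ERROR_PHRASES):
            return "soft 404"  # error content served with a success code
        return "ok"
    return f"status {status}"
```

A deleted product page that still answers with status 200 and the text "This product is no longer available" would be classified as a soft 404; the fix is to make the server return a real 404 or redirect the URL.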
In addition to “Fetch as Google” and “robots.txt Tester”, the “URL parameters” tool can also be very useful. This enables webmasters and SEOs to specify how the Google bot should handle certain parameters of a URL. For instance, specifying the significance of a specific parameter for the interpretation of a URL helps you further optimize the crawl budget of the bot.
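To decide which parameters matter, it helps to see which of your URLs collapse into the same page once irrelevant parameters are removed. The sketch below assumes a hypothetical list of parameters that do not change the page content; it illustrates the idea behind parameter handling and does not talk to the Search Console.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change what the page shows.
IGNORED_PARAMETERS = {"sessionid", "utm_source", "utm_medium", "utm_campaign"}


def strip_ignored_parameters(url):
    """Return url without the query parameters listed in IGNORED_PARAMETERS."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMETERS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def group_duplicates(urls):
    """Group URLs that differ only in ignored parameters."""
    groups = {}
    for url in urls:
        groups.setdefault(strip_ignored_parameters(url), []).append(url)
    return groups
```

Every group with more than one URL is a set of addresses that the bot would otherwise crawl separately, even though they show the same content.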
The options explained in this article will help you optimize how your website is crawled and indexed by the Google bot. In turn, this makes your website much easier to find on Google. Thus, the aforementioned options set the basics for successful websites, so nothing stands in the way of better rankings.
Featured Image: OnPage.org
In-post Photos: Screenshots by [Irina Hey]. Taken June 2016