As the web evolves, so too do Google and SEO.
This means what is considered best practice is often in flux. What may have been good counsel yesterday is not so today.
This is especially true for sitemaps, which are almost as old as SEO itself.
The problem is, when every man and their dog has posted answers in forums, published recommendations on blogs and amplified opinions with social media, it takes time to sort valuable advice from misinformation.
So while most of us share a general understanding that submitting a sitemap to Google Search Console is important, you may not know the intricacies of how to implement one in a way that drives SEO key performance indicators (KPIs).
Let’s clear up the confusion around best practices for sitemaps today.
In this article we cover:
- What is an XML sitemap
- XML sitemap format
- Types of sitemaps
- XML sitemap indexation optimization
- XML sitemap best practice checklist
What Is an XML Sitemap
In simple terms, an XML sitemap is a list of your website’s URLs.
It acts as a roadmap to tell search engines what content is available and how to reach it.
Take a nine-page website as an example. A search engine will find all nine pages with one visit to the XML sitemap file.
On the website itself, it would have to jump through five internal links to find page 9.
This ability of an XML sitemap to assist crawlers in faster indexation is especially important for websites that:
- Have thousands of pages and/or a deep website architecture.
- Frequently add new pages.
- Frequently change content of existing pages.
- Suffer from weak internal linking and orphan pages.
- Lack a strong external link profile.
@nishanthstephen generally anything you put in a sitemap will be picked up sooner
— Gary “鯨理” Illyes (@methode) October 13, 2015
Side note: Submitting a sitemap with noindex URLs can also speed up deindexation. This can be more efficient than removing URLs in Google Search Console if you have many to be deindexed. But use this with care and be sure you only add such URLs temporarily to your sitemaps.
Even though search engines can technically find your URLs without it, by including pages in an XML sitemap you’re indicating that you consider them to be quality landing pages.
While there is no guarantee that an XML sitemap will get your pages crawled, let alone indexed or ranked, submitting one certainly increases your chances.
XML Sitemap Format
A one-page site using all available tags would have this XML sitemap:
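As a minimal sketch, the Python below builds such a file with the standard library. The example.com URL, the timestamp and the changefreq and priority values are placeholders chosen for illustration, not values from a real site.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Serialize the sitemap namespace as the default xmlns (no prefix).
ET.register_namespace("", SITEMAP_NS)


def build_one_page_sitemap() -> str:
    """Return a sitemap with a single URL that uses every standard tag."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    # Placeholder values for illustration only.
    for tag, value in [
        ("loc", "https://www.example.com/"),
        ("lastmod", "2018-08-01T12:00:00+00:00"),
        ("changefreq", "monthly"),
        ("priority", "1.0"),
    ]:
        ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = value
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_one_page_sitemap())
```

The output is a single `<url>` entry carrying `<loc>`, `<lastmod>`, `<changefreq>` and `<priority>` inside a `<urlset>` root.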
But how should an SEO use each of these tags? Is all the metadata valuable?
Loc (a.k.a. Location) Tag
This compulsory tag contains the absolute, canonical version of the URL location.
It should accurately reflect your site protocol (http or https) and if you have chosen to include or exclude www.
For international websites, the sitemap entry for each URL is also where you can implement your hreflang handling.
By using xhtml:link elements to indicate the language and region variants for each URL, you reduce page load time, a benefit that the other implementations (link elements in the <head> or HTTP headers) can’t offer.
Yoast has an epic post on hreflang for those wanting to learn more.
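A rough sketch of how hreflang entries could be generated for a sitemap, assuming a hypothetical mapping of language codes to URLs on example.com. Each variant gets its own `<url>` entry, and every entry lists all variants, itself included.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def hreflang_urlset(variants: dict[str, str]) -> str:
    """Build a urlset from a {hreflang code: absolute URL} mapping."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in variants.values():
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        # Every entry repeats the full set of alternates, including itself.
        for lang, href in variants.items():
            ET.SubElement(
                entry,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": lang, "href": href},
            )
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical language variants of one page.
    print(hreflang_urlset({
        "en-us": "https://www.example.com/",
        "de-de": "https://www.example.com/de/",
    }))
```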
Lastmod (a.k.a. Last Modified) Tag
An optional but highly recommended tag used to communicate the file’s last modified date and time.
John Mueller acknowledged that Google does use the lastmod metadata to understand when the page last changed and if it should be crawled, contradicting advice from Illyes in 2015.
The last modified time is especially critical for content sites as it assists Google to understand that you are the original publisher.
It’s also powerful to communicate freshness, but be sure to update modification date only when you have made meaningful changes.
Trying to trick search engines into believing your content is fresh, when it’s not, may result in a Google penalty.
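One way to keep lastmod honest is to change it only when the page content itself changes. The sketch below is one crude approximation of a "meaningful change": it fingerprints the whitespace-normalized body text and keeps the old date when the fingerprint is unchanged. The function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(body_text: str) -> str:
    """Hash the body text, ignoring differences in whitespace."""
    normalized = " ".join(body_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(old_fingerprint: str, old_lastmod: str,
                 body_text: str, now: datetime) -> tuple[str, str]:
    """Return (fingerprint, lastmod), bumping lastmod only on real changes."""
    new_fingerprint = content_fingerprint(body_text)
    if new_fingerprint == old_fingerprint:
        # Nothing substantive changed: keep the previous date.
        return old_fingerprint, old_lastmod
    stamp = now.astimezone(timezone.utc).isoformat(timespec="seconds")
    return new_fingerprint, stamp
```

A template tweak or a reformatted paragraph leaves the stored date alone; only edited copy moves it forward.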
Changefreq (a.k.a. Change Frequency) Tag
Once upon a time, this optional tag hinted to search engines how frequently content on the URL was expected to change.
But Mueller has stated that “change frequency doesn’t really play that much of a role with sitemaps” and that “it is much better to just specify the time stamp directly”.
Priority Tag
This optional tag ostensibly tells search engines how important a page is relative to your other URLs, on a scale from 0.0 to 1.0.
At best, it was only ever a hint to search engines and both Mueller and Illyes have clearly stated they ignore it.
The URL + last modification date is what we care about for websearch.
— John ☆.o(≧▽≦)o.☆ (@JohnMu) August 17, 2017
Your website needs an XML sitemap, but not necessarily the priority and change frequency metadata.
Use the lastmod tags accurately and focus your attention on ensuring you have the right URLs submitted.
Types of Sitemaps
There are many different types of sitemaps. Let’s look at the ones you actually need.
XML Sitemap Index
XML sitemaps have a couple of limitations:
- A maximum of 50,000 URLs.
- An uncompressed file size limit of 50MB.
Sitemaps can be compressed using gzip (the file name would become something similar to sitemap.xml.gz) to save bandwidth for your server. But once unzipped, the sitemap still can’t exceed either limit.
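A small sketch of that rule in Python: the limits are checked against the uncompressed file before it is gzipped. The function name and error messages are illustrative.

```python
import gzip

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB, measured on the uncompressed file


def compress_sitemap(xml_bytes: bytes, url_count: int) -> bytes:
    """Gzip a sitemap after checking both limits on the uncompressed data."""
    if url_count > MAX_URLS:
        raise ValueError(f"{url_count} URLs exceeds the {MAX_URLS} URL limit")
    if len(xml_bytes) > MAX_BYTES:
        raise ValueError("uncompressed sitemap exceeds the 50MB limit")
    return gzip.compress(xml_bytes)
```

The returned bytes would be written to a file such as sitemap.xml.gz.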
Whenever you exceed either limit, you will need to split your URLs across multiple XML sitemaps.
Those sitemaps can then be combined into a single XML sitemap index file, often named sitemap-index.xml. Essentially, a sitemap for sitemaps.
For exceptionally large websites, you can also create multiple sitemap index files. But be aware that you cannot nest sitemap index files.
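The splitting and indexing steps can be sketched as follows. The example.com URLs and the sitemap file names are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)


def chunk_urls(urls: list[str], size: int = 50_000) -> list[list[str]]:
    """Split a URL list into chunks that respect the per-sitemap URL limit."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls: list[str]) -> str:
    """Build a sitemap index that points at each child sitemap file."""
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")


if __name__ == "__main__":
    urls = [f"https://www.example.com/page-{i}" for i in range(120_000)]
    chunks = chunk_urls(urls)
    names = [f"https://www.example.com/sitemap-{n}.xml"
             for n in range(1, len(chunks) + 1)]
    print(build_sitemap_index(names))
```

Note that the index only lists `<sitemap>` entries; it never points at another index file.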
For search engines to easily find every one of your sitemap files at once, you will want to:
- Submit your sitemap index to Google Search Console and Bing Webmaster Tools.
- Specify your sitemap index URL in your robots.txt file, pointing search engines directly to your sitemap as you welcome them to crawl.
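In robots.txt, the sitemap is declared with a `Sitemap:` line. The sketch below shows a hypothetical robots.txt for example.com and uses Python's standard `urllib.robotparser` (Python 3.8+) to confirm the declaration is readable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the Disallow rule is only an example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/

Sitemap: https://www.example.com/sitemap-index.xml
"""


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(declared_sitemaps(ROBOTS_TXT))
```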
You can also submit sitemaps by pinging them to Google.
But beware: Google no longer pays attention to hreflang entries in “unverified sitemaps”, which Tom Anthony believes to mean those submitted via the ping URL.
XML Image Sitemap
Image sitemaps were designed to improve indexation of image content.
In modern day SEO, however, images are embedded within page content, so will be crawled along with the page URL.
Moreover, it’s best practice to utilize JSON-LD schema.org/ImageObject markup to call out image properties to search engines as it provides more attributes than an image XML sitemap.
Because of this, an XML image sitemap is unnecessary for most websites. Including an image sitemap would only waste crawl budget.
The exception to this is if images help drive your business, such as a stock photo website or ecommerce site gaining product page sessions from Google Image search.
Know that images don’t have to be on the same domain as your website to be submitted in a sitemap. You can use a CDN as long as it’s verified in Search Console.
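For sites that do need one, an image sitemap entry pairs each page URL with the images on it. A minimal sketch, assuming a hypothetical product page on example.com with images served from a hypothetical CDN host:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def image_sitemap(pages: dict[str, list[str]]) -> str:
    """Build an image sitemap from a {page URL: [image URLs]} mapping."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(entry, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(image_sitemap({
        "https://www.example.com/products/blue-shirt": [
            "https://cdn.example.net/img/blue-shirt-front.jpg",
            "https://cdn.example.net/img/blue-shirt-back.jpg",
        ],
    }))
```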
XML Video Sitemap
Similar to images, if videos are critical to your business, submit an XML video sitemap.
If not, a video sitemap is unnecessary.
Save your crawl budget for the page the video is embedded into, ensuring you mark up all videos with JSON-LD as a schema.org/VideoObject.
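A sketch of generating that markup, assuming hypothetical video details. The properties shown (name, description, thumbnailUrl, uploadDate, contentUrl) are a common subset of schema.org/VideoObject; which ones a given search feature requires is not covered here.

```python
import json


def video_jsonld(name: str, description: str, thumbnail_url: str,
                 upload_date: str, content_url: str) -> str:
    """Return a JSON-LD <script> tag describing one embedded video."""
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "contentUrl": content_url,
    }
    # Escape "</" so a stray "</script>" in the data cannot end the tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


if __name__ == "__main__":
    print(video_jsonld(
        name="How to fold a map",
        description="A short demonstration.",
        thumbnail_url="https://www.example.com/thumb.jpg",
        upload_date="2018-08-01",
        content_url="https://www.example.com/video.mp4",
    ))
```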
Google News Sitemap
Only sites registered with Google News should use this sitemap.
If you are, include articles published in the last two days, up to a limit of 1,000 URLs per sitemap, and update with fresh articles as soon as they’re published.
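The selection rule can be sketched as a simple filter: keep articles from the last two days, newest first, capped at 1,000 URLs. The `(url, published)` tuple shape is an assumption about how article data might be stored.

```python
from datetime import datetime, timedelta, timezone


def select_news_urls(articles: list[tuple[str, datetime]], now: datetime,
                     max_urls: int = 1000,
                     window: timedelta = timedelta(days=2)) -> list[str]:
    """Pick URLs published within the window, newest first, up to max_urls."""
    cutoff = now - window
    recent = [item for item in articles if item[1] >= cutoff]
    recent.sort(key=lambda item: item[1], reverse=True)
    return [url for url, _published in recent[:max_urls]]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        ("https://www.example.com/news/today", now - timedelta(hours=3)),
        ("https://www.example.com/news/last-week", now - timedelta(days=7)),
    ]
    print(select_news_urls(sample, now))
```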
Contrary to some online advice, Google News sitemaps don’t support image URLs.
Google recommends using schema.org image or og:image to specify your article thumbnail for Google News.
Mobile Sitemap
This is not needed for most websites.
Why? Because Mueller confirmed mobile sitemaps are for feature phone pages only, not for smartphone compatibility.
So unless you have unique URLs specifically designed for feature phones, a mobile sitemap will be of no benefit.
HTML Sitemap
XML sitemaps take care of search engine needs. HTML sitemaps were designed to assist human users to find content.
The question becomes, if you have a good user experience and well crafted internal links, do you need an HTML sitemap?
Check the page views of your HTML sitemap in Google Analytics. Chances are, they’re very low. If not, it’s a good indication that you need to improve your website navigation.
HTML sitemaps are generally linked in website footers, taking link equity from every single page of your website.
Ask yourself. Is that the best use of that link equity? Or are you including an HTML sitemap as a nod to legacy website best practices?
If few humans use it, and search engines don’t need it because you have strong internal linking and an XML sitemap, does that HTML sitemap have a reason to exist? I would argue no.
Dynamic XML Sitemap
Static sitemaps are simple to create using a tool such as Screaming Frog.
The problem is, as soon as you create or remove a page, your sitemap is outdated. If you modify the content of a page, the sitemap won’t automatically update the lastmod tag.
So unless you love manually creating and uploading sitemaps for every single change, it’s best to avoid static sitemaps.
Dynamic XML sitemaps, on the other hand, are automatically updated by your server to reflect relevant website changes as they occur.
To create a dynamic XML sitemap, you can:
- Ask your developer to code a custom script, being sure to provide clear specifications.
- Use a dynamic sitemap generator tool.
- Install a plugin for your CMS, for example the Yoast SEO plugin for WordPress.
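To make the custom script option concrete, here is a bare-bones sketch of a dynamic sitemap as a WSGI application. The `PAGES` list is a hypothetical stand-in for your CMS database; a real script would query live data instead. Because the XML is rendered on every request, new or changed pages show up without any manual upload.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)

# Hypothetical stand-in for a CMS database table of published pages.
PAGES = [
    {"url": "https://www.example.com/", "lastmod": "2018-08-01"},
    {"url": "https://www.example.com/about", "lastmod": "2018-07-15"},
]


def render_sitemap(pages: list[dict]) -> bytes:
    """Render the current page records as sitemap XML."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page["url"]
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(urlset, encoding="utf-8")


def sitemap_app(environ, start_response):
    """WSGI app that builds /sitemap.xml from PAGES on each request."""
    if environ.get("PATH_INFO") != "/sitemap.xml":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    body = render_sitemap(PAGES)
    start_response("200 OK", [
        ("Content-Type", "application/xml"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a local trial, the app could be served with `wsgiref.simple_server.make_server("", 8000, sitemap_app).serve_forever()` from the standard library.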
Dynamic XML sitemaps and a sitemap index are modern best practice. Mobile and HTML sitemaps are not.
Use image, video and Google News sitemaps only if improved indexation of these content types drives your KPIs.
XML Sitemap Indexation Optimization
Now for the fun part. How do you use XML sitemaps to drive SEO KPIs?
Only Include SEO Relevant Pages in XML Sitemaps
An XML sitemap is a list of pages you recommend to be crawled, which isn’t necessarily every page of your website.
A search spider arrives at your website with an “allowance” for how many pages it will crawl.
The XML sitemap indicates you consider the included URLs more important than those that aren’t blocked but aren’t in the sitemap.
You are using it to tell search engines “I’d really appreciate it if you’d focus on these URLs in particular.”
Essentially, it helps you use crawl budget effectively.
By including only SEO relevant pages, you help search engines crawl your site more intelligently in order to reap the benefits of better indexation.
You should exclude:
- Non-canonical pages.
- Duplicate pages.
- Paginated pages.
- Parameter or session ID based URLs.
- Site search result pages.
- Reply to comment URLs.
- Share via email URLs.
- URLs created by filtering that are unnecessary for SEO.
- Archive pages.
- Any redirections (3xx), missing pages (4xx) or server error pages (5xx).
- Pages blocked by robots.txt.
- Pages with noindex.
- Resource pages accessible by a lead gen form (e.g., white paper PDFs).
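Several of these exclusions can be automated. The sketch below covers only the mechanical ones (status codes, noindex, robots.txt blocking, non-canonical URLs and parameterized URLs) and assumes a hypothetical `Page` record produced by your own crawl or CMS export. Judgment calls, such as which archive or filter pages matter for SEO, still need a human.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Page:
    """Hypothetical crawl record for one URL."""
    url: str
    status: int = 200
    canonical: str = ""          # empty means the page is self-canonical
    noindex: bool = False
    blocked_by_robots: bool = False


def is_sitemap_worthy(page: Page) -> bool:
    """Apply the mechanical exclusion rules to one page record."""
    if page.status != 200:
        return False             # redirects, missing pages, server errors
    if page.noindex or page.blocked_by_robots:
        return False
    if page.canonical and page.canonical != page.url:
        return False             # non-canonical duplicate
    if urlsplit(page.url).query:
        return False             # parameter or session ID based URL
    return True
```

Filtering a crawl export with `[p.url for p in pages if is_sitemap_worthy(p)]` gives a first-pass URL list to review.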
I want to share an example from Michael Cottam about prioritizing pages:
Say your website has 1,000 pages. 475 of those 1,000 pages are SEO relevant content. You highlight those 475 pages in an XML sitemap, essentially asking Google to deprioritize indexing the remainder.
Now, let’s say Google crawls those 475 pages, and algorithmically decides that 175 are “A” grade, 200 are “B+”, and 100 “B” or “B-”. That’s a strong average grade, and probably indicates a quality website to which to send users.
Contrast that against submitting all 1,000 pages via the XML sitemap. Now, Google looks at the 1,000 pages you say are SEO relevant content, and sees over 50 percent are “D” or “F” pages. Your average grade isn’t looking so good anymore and that may harm your organic sessions.
But remember, Google is going to use your XML sitemap only as a clue to what’s important on your site.
Just because it’s not in your XML sitemap doesn’t necessarily mean that Google won’t index those pages.
When it comes to SEO, overall site quality is a key factor.
Do a site: search to see all the pages Google has indexed.
Once you find low-quality pages, invest the time to fix them. Not only is this better for search engines but more importantly offers a better user experience.
Manage crawl budget by limiting XML sitemap URLs to SEO relevant pages only and by reducing low-quality pages on your website.
Track the impact of your optimizations via the crawl stats in Google Search Console.
Fully Leverage Sitemap Reporting
While the Google Search Console sitemap report is useful for discovering warnings and errors (especially with the new Index Coverage report), you can do so much more than simple validation.
How can you make your sitemap indexation data actionable?
Imagine you have a website with 131,000 SEO relevant pages.
Because of the size, you need multiple sitemaps.
Both Google and Sitemaps.org show examples encouraging numbered naming.
So you may have ended up with sequentially numbered files such as sitemap-1, sitemap-2 and sitemap-3.
Not the most insightful naming convention.
Knowing that the whole site has 64 percent of its pages indexed isn’t particularly useful.
You need to discover which types of pages were left out.
The problem is, if you try to dive deeper, sequential sitemap names give no indication of which section of your website is suffering from the indexation problem.
It’s time-consuming to discover all the types of pages in sitemap-3.
Instead, use descriptive sitemap names that reflect your site structure.
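One simple way to derive descriptive names is from the first path segment of each URL. This is only a sketch: it assumes your URL structure mirrors your site sections, and the `sitemap-<section>.xml` naming pattern and the `pages` fallback for top-level URLs are arbitrary choices.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_section(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs into sitemap files named after their first path segment."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        segments = [part for part in urlsplit(url).path.split("/") if part]
        # Homepage and other top-level pages share one general sitemap.
        section = segments[0] if len(segments) > 1 else "pages"
        groups[f"sitemap-{section}.xml"].append(url)
    return dict(groups)
```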
By grouping different types of pages in clearly named sitemaps, you have isolated men’s product pages as your main indexation issue.
But article pages also aren’t performing too well.
To uncover insights into why, break the articles sitemap into smaller sitemaps based on an additional dimension.
Now it’s clear that health-related articles are the ones underperforming on indexation.
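The comparison itself is simple arithmetic: indexed divided by submitted, per sitemap, sorted so the weakest section comes first. The counts below are invented for illustration; they are chosen only so the totals match the 131,000 pages and 64 percent overall rate in this example, and the sitemap names are hypothetical.

```python
def indexation_rates(report: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Return (sitemap, indexed/submitted) pairs, lowest rate first."""
    rates = [
        (name, indexed / submitted)
        for name, (submitted, indexed) in report.items()
        if submitted  # skip empty sitemaps to avoid dividing by zero
    ]
    return sorted(rates, key=lambda pair: pair[1])


def overall_rate(report: dict[str, tuple[int, int]]) -> float:
    """Return the site-wide share of submitted URLs that are indexed."""
    submitted = sum(s for s, _ in report.values())
    indexed = sum(i for _, i in report.values())
    return indexed / submitted


# Invented (submitted, indexed) counts per sitemap.
REPORT = {
    "sitemap-mens-products.xml": (50_000, 20_000),
    "sitemap-womens-products.xml": (45_000, 40_500),
    "sitemap-articles.xml": (30_000, 18_000),
    "sitemap-categories.xml": (6_000, 5_340),
}

if __name__ == "__main__":
    print(f"overall: {overall_rate(REPORT):.0%}")
    for name, rate in indexation_rates(REPORT):
        print(f"{name}: {rate:.0%}")
```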
Sitemaps should be mutually exclusive, with each URL existing in only one sitemap. Otherwise, analysis becomes more difficult and you may see more indexed pages than submitted.
The same URL in multiple sitemaps is counted separately, which is why you could see something like that. I’d keep URLs in a single sitemap.
— John ☆.o(≧▽≦)o.☆ (@JohnMu) August 16, 2017
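A quick check for overlap, assuming you already have each sitemap's URL list in memory (the input shape is an assumption):

```python
from collections import defaultdict


def urls_in_multiple_sitemaps(sitemaps: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each URL that appears in more than one sitemap to those sitemaps."""
    seen: dict[str, set[str]] = defaultdict(set)
    for name, urls in sitemaps.items():
        for url in urls:
            seen[url].add(name)
    return {url: sorted(names) for url, names in seen.items() if len(names) > 1}
```

An empty result means the sitemaps are mutually exclusive.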
In addition to identifying warnings and errors, you can use XML sitemaps as a sleuthing tool to isolate indexation problems by descriptively naming sitemaps to reflect your website architecture.
Sitemaps are no longer divided up because they hit the 50,000 URL limit. They are grouped to provide actionable insights on page indexation (or lack thereof).
XML Sitemap Best Practice Checklist
Do invest time to:
✓ Include hreflang tags in XML sitemaps
✓ Include the <loc> and <lastmod> tags
✓ Compress sitemap files using gzip
✓ Use a sitemap index file
✓ Use image, video and Google News sitemaps only if indexation drives your KPIs
✓ Dynamically generate XML sitemaps
✓ Reference sitemap index URLs in robots.txt
✓ Submit sitemap index to both Google Search Console and Bing Webmaster Tools
✓ Include only SEO relevant pages in XML sitemaps
✓ Fix all sitemap errors and warnings
✓ Group URLs in descriptively named sitemaps based on page type
✓ Download sitemap data from GSC to analyze indexation rates
✓ Ensure URLs are included only in a single sitemap
Now, go check your own sitemap and make sure you’re doing it right.
Feature Image: Created by author, August 2018
Screenshots taken by author, August 2018