Google Search Console warns publishers about two kinds of 404 errors: 404 and soft 404.
While they’re both called 404, they are very different.
Consequently, it’s essential to understand the difference between the errors to fix them.
HTTP Status Codes
When a browser requests a webpage, the server responds with a status code that communicates whether the request was successful and, if not, why it wasn’t.
These responses are often referred to as HTTP response codes, but officially they are called HTTP status codes.
Status codes fall into five categories (1xx through 5xx); this article is specifically about one response, the 404 page not found status code.
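As a rough illustration of the five categories, the sketch below maps a status code to its class by its first digit. The function names `status_class` and `describe` are made up for this example; the reason phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus

# The five classes of HTTP status codes, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the class name for a status code (hypothetical helper)."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


def describe(code: int) -> str:
    """Combine the code, its standard reason phrase, and its class."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown"
    return f"{code} {phrase} ({status_class(code)})"


if __name__ == "__main__":
    for code in (200, 301, 404, 500):
        print(describe(code))
```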
The Meaning Of A 404 Response Code
All codes within the 4xx series of responses mean the request could not be fulfilled because of a problem with the request itself; the 404 code specifically means the page was not found.
The official definition is:
4xx (Client Error): The request contains bad syntax or cannot be fulfilled
The 404 response is ambiguous as to whether the webpage might return.
Examples Of Why 404 Page Not Found Happens
- If someone mistakenly deletes a webpage, the server responds with the 404 page not found response.
- If someone links to a non-existent webpage, the server responds that the page was not found (404).
The official documentation is clear about the ambiguity of whether a page is temporarily or permanently gone:
“The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists.
A 404 status code does not indicate whether this lack of representation is temporary or permanent…”
To summarize, the 404 page not found code means there was an error in the browser request because the requested page could not be found.
What Is A Soft 404 Error?
A soft 404 error is not an official status code. The server does not send a soft 404 response to a browser because there is no such thing as a soft 404 status code.
A soft 404 describes a situation in which the server presents a webpage and responds with a 200 OK status code, indicating success, even though the webpage or content is actually missing.
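The idea can be sketched as a simple heuristic: a response is suspicious if the status is 200 but the visible text either says the page is missing or is nearly empty. This is only an illustration and not how Google classifies pages; the phrase list, the `min_words` threshold of 50, and the function names are all assumptions chosen for the example.

```python
import re

# Hypothetical phrases that often appear on "missing page" templates.
NOT_FOUND_PHRASES = (
    "page not found",
    "error 404",
    "no longer available",
    "could not be found",
)


def visible_text(html: str) -> str:
    """Crudely strip scripts, styles, and tags to approximate visible text."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)


def looks_like_soft_404(status: int, html: str, min_words: int = 50) -> bool:
    """Flag a 200 response whose body reads like a missing or empty page."""
    if status != 200:
        # A real 404 (or any other status) is not a soft 404.
        return False
    text = visible_text(html).lower()
    if any(phrase in text for phrase in NOT_FOUND_PHRASES):
        return True
    # Very little text is treated as "thin"; the threshold is arbitrary.
    return len(text.split()) < min_words
```

A heuristic like this will produce false positives (for example, an article that discusses "page not found" errors), so flagged URLs are candidates for manual review rather than confirmed soft 404s.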
Four Common Reasons For A Soft 404
A webpage is missing, and a server sends 200 OK status.
This kind of soft 404 happens when a page is missing, but the server configuration redirects the missing page to the home page or a custom URL.
The page is gone, but the publisher has done something to fulfill the request for the missing page.
Content is missing or “thin.”
When content is completely missing, or there’s very little of it (a.k.a. thin content), the server will respond with a 200 status code, which means the request for the page was successful.
But for indexing purposes these are not successful webpage requests, so search engines call them soft 404s.
The missing page redirects to the home page.
Some mistakenly believe that there’s something wrong with a 404 error response.
So, to stop the 404 error responses, a publisher may redirect the missing page to the homepage, even though the homepage is not what was requested.
Google calls these failed page requests soft 404s.
Missing page redirected to a custom webpage.
Sometimes, missing pages redirect to a custom-made webpage that serves a 200 status code, which results in Google labeling these pages as soft 404s.
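The homepage-redirect pattern described above can be checked with a small comparison between the URL that was requested and the URL the visitor ends up on. This is a minimal sketch: `redirects_to_homepage` is a hypothetical helper, and it assumes you already have the final URL from an HTTP client that follows redirects.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lowercased hostname with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def _is_homepage(url: str) -> bool:
    """True when the URL points at the site root with no query string."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def redirects_to_homepage(requested_url: str, final_url: str) -> bool:
    """Flag a request for an inner page that ended up on the homepage."""
    return (
        _host(requested_url) == _host(final_url)
        and not _is_homepage(requested_url)
        and _is_homepage(final_url)
    )
```

For example, `redirects_to_homepage("https://example.com/old-page", "https://example.com/")` returns `True`, which marks the URL as a likely soft 404 candidate.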
Who Invented The Phrase Soft 404?
The concept of a soft 404 may have originated in a 2004 research paper titled, Towards an Understanding of the Web’s Decay (PDF).
The missing pages that are improperly substituted present a problem to search engines that are trying to index real pages.
Here is how the research paper frames soft 404s:
“According to the HTTP protocol when a request is made to a server for a page that is no longer available, the server is supposed to return an error code…
…in fact many servers, including most reputable ones, do not return a 404 code—instead the servers return a substitute page and an OK code (200).
…Our study shows that these type of substitutions, called “soft-404s” account for more than 15% of the dead links.”
Soft 404 Due To Coding Errors
There are cases where the page isn’t missing, but specific problems (like coding errors) have triggered Google to categorize it as a missing page.
It is essential to investigate soft 404s because they could signal broken code.
Typical coding issues:
- Missing file or include that’s supposed to populate a webpage with content.
- Database error.
- Empty search results pages.
404 Errors Have Two Main Causes
- An error in the link directs users to a page that doesn’t exist.
- A link to a page that used to exist but suddenly disappeared.
If the cause of the 404 is a linking error, you have to fix the links.
The tricky part of this task is finding all the broken links on a site. It can be more challenging to crawl large complex sites with thousands or millions of pages.
In instances like this, crawling tools come in handy.
There are many site crawlers to choose from: free tools such as Xenu and Greenflare, and paid software such as Screaming Frog, DeepCrawl, Botify, Sitebulb, and OnCrawl. Several of these offer free trials or free versions with limited features.
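To show what a crawler does at its core, here is a minimal sketch that collects the links on one page and reports those that return a 404. It is not a replacement for the tools above. The status lookup is passed in as `get_status`, a hypothetical callable you would back with a real HTTP client; keeping it separate makes the logic easy to try without network access.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page anchors and non-HTTP schemes.
        if href and not href.startswith(("#", "mailto:", "javascript:")):
            self.links.append(urljoin(self.base_url, href))


def find_broken_links(
    page_url: str,
    html: str,
    get_status: Callable[[str], Optional[int]],
) -> List[str]:
    """Return the links on a page whose status lookup yields 404."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    unique_links = list(dict.fromkeys(collector.links))  # de-duplicate, keep order
    return [link for link in unique_links if get_status(link) == 404]
```

A full crawler would repeat this for every internal link it discovers, which is where dedicated tools earn their keep on sites with thousands or millions of pages.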
A Page That No Longer Exists
When a page no longer exists, you have two options:
- Restore the page if the removal was accidental.
- 301 redirect it to the closest related page if the removal was on purpose.
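The decision between serving the page, redirecting it, and returning a 404 can be sketched as a simple lookup. The `resolve` function, the `live_pages` set, and the `redirects` mapping are hypothetical names for this example; in practice the redirect rules would live in your server or CMS configuration.

```python
from typing import Dict, Optional, Set, Tuple


def resolve(
    path: str,
    live_pages: Set[str],
    redirects: Dict[str, str],
) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect target) for a requested path."""
    if path in live_pages:
        return 200, None
    if path in redirects:
        # Page was removed on purpose and has a closely related replacement.
        return 301, redirects[path]
    # Truly missing: a plain 404 is the honest answer.
    return 404, None


if __name__ == "__main__":
    live = {"/", "/shoes/trail-runners"}
    moved = {"/shoes/old-trail-runners": "/shoes/trail-runners"}
    for requested in ("/shoes/trail-runners", "/shoes/old-trail-runners", "/hats"):
        print(requested, resolve(requested, live, moved))
```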
First, you have to locate all the linking errors on the site. As with finding linking errors on a large-scale website, you can use crawling tools.
However, crawling tools may not find orphaned pages: pages not linked from anywhere within the navigational links or from any of the pages.
Orphaned pages can arise when a page used to be part of the website but the internal links to it disappeared after a redesign, even though external links from other websites might still point to it.
To double-check if these kinds of pages exist on your site, you can use various tools.
How To Identify 404 Response Pages
Google Search Console Reports
The Coverage report lists 404 error URLs on a website.
The Search Console will report 404 pages as Google crawls through all the pages it can find. This can include links from other sites to a page that used to exist on your website.
You won’t find a missing page report in Google Analytics by default. However, you can track them in different ways.
For one, you can create a custom report and segment out pages with a page title mentioning Error 404 – Page Not Found.
Another way to find orphaned pages within Google Analytics is to create custom content groupings and assign all 404 pages to a content group.
Site: Operator Search Command
One cannot use the site: search command to find 404 errors because Google doesn’t index 404 webpages or soft 404 webpages.
Google’s site: search operator is useful for finding webpages on a site that contain a specific keyword phrase in the content of the webpages.
Google’s Search Console is the best source for identifying a list of soft 404s and regular 404s.
The website traffic error logs are a useful source for identifying 404 error responses.
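As a sketch of how server logs can surface 404s, the snippet below counts 404 responses per requested path. It assumes the log uses the widely used common or combined log format, where the quoted request line is followed by the status code; the `count_404s` name is made up for this example, and other log formats would need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches the quoted request line and the status code that follows it, e.g.
# "GET /old-page HTTP/1.1" 404 512
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def count_404s(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (path, hits) pairs for 404 responses, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts.most_common()
```

To run it against a real file, pass an open file handle, for example `count_404s(open("access.log"))`, where `access.log` stands in for wherever your server writes its logs.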
Other Backlink Research Tools
Backlink research tools like Majestic, Ahrefs, Moz Open Site Explorer, Sistrix, Semrush, LinkResearchTools, and CognitiveSEO can also help.
Most of these tools will export a list of backlinks linking to your domain. From there, you can check all the linked pages and look for 404 errors.
How To Fix Soft 404 Errors
Crawling tools won’t detect a soft 404 because it isn’t a 404 error. But you can use crawling tools to catch the underlying issues that lead to soft 404s.
Here are a few things to find:
- Thin Content: Some crawling tools report pages that have thin content along with a sortable word count. Start with the pages that have the fewest words to evaluate whether they have thin content.
- Duplicate Content: Some crawling tools are sophisticated enough to discern what percentage of the page is template content. And there are also tools made specifically for finding internal duplicate content like SiteLiner. If the main content is nearly the same as many other pages, you should look into these pages and determine why duplicate content exists on your site.
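Both checks in the list above can be approximated in a few lines. This is a rough sketch, not how any particular crawling tool works: it assumes you already have the main text of each page in a dictionary keyed by URL, uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and treats the 0.9 threshold as an arbitrary starting point.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def pages_by_word_count(pages: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (word count, url) pairs, thinnest pages first."""
    return sorted((len(text.split()), url) for url, text in pages.items())


def near_duplicates(
    pages: Dict[str, str], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Return pairs of URLs whose text similarity meets the threshold."""
    pairs = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            pairs.append((url_a, url_b))
    return pairs
```

Comparing every pair of pages grows quadratically with the number of pages, so this approach only suits small samples; for a large site you would compare pages within the same template or category.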
Aside from the crawling tools, you can also use Google Search Console and check under crawl errors to find pages listed under soft 404s.
Crawling an entire site to find issues that cause soft 404s allows you to locate and correct problems before Google detects them.
After detecting these soft 404 issues, you will need to correct them.
Most of the time, the solutions are common sense. They can include simple things like expanding pages with thin content or replacing duplicate content with new, unique content.
Throughout this process, here are a few things to consider:
Sometimes, thin content is caused by being too specific with the page topic, leaving you with little to say.
Merging several thin pages into one page can be more appropriate if the topics are related. Not only does this solve thin content issues, but it can fix duplicate content issues as well.
For example, an ecommerce site selling shoes in different colors and sizes may have a different URL for each size and color combination. This leaves a large number of pages with content that is thin and relatively identical.
The more effective approach is to put this all on one page instead and enumerate the options available.
Find Technical Issues That Cause Duplicate Content
Using even the most straightforward web crawling tool like Xenu (which doesn’t look at content but only URLs, response codes, and title tags), you can still find duplicate content issues by looking at URLs.
This includes www vs. non-www URLs, HTTP and HTTPS, with index.html and without, with tracking parameters and without, etc.
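One way to spot these URL-level duplicates in a crawl export is to reduce each URL to a normalized key and group URLs that share one. The sketch below is illustrative only: the function names are hypothetical, and the list of tracking parameters (`utm_*`, `gclid`, `fbclid`) is an assumption you would adjust for your own site.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend to match your own analytics setup.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def canonical_key(url: str) -> str:
    """Reduce a URL to a key that ignores common duplicate-causing variations."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # www vs. non-www
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # with index.html and without
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES) and key not in TRACKING_PARAMS
    ]
    # The scheme is dropped so HTTP and HTTPS variants share a key.
    return urlunsplit(("", host, path, urlencode(sorted(query)), ""))


def group_duplicates(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Return only the keys that more than one crawled URL maps to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Each group in the result is a set of URLs that likely serve the same content, which makes them candidates for a redirect or a canonical tag.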
404 Errors And Soft 404 Errors
The most important thing to remember about 404 errors is that if the pages are truly missing, then there is nothing to fix. It’s okay to show a 404 response for requests for pages that do not exist.
However, if the pages exist at a different URL, that is something to fix: redirect the broken link to the actual URL, restore the missing page, or redirect the old URL to the new page that replaced it.
A soft 404 is always the result of a problem that must be diagnosed and fixed.
Understanding the difference between the 404s is essential to keeping a website operating at peak performance.
Featured Image: Paulo Bobita/Search Engine Journal