14 Must-Know Tips For Crawling Millions Of Webpages

Learn how to successfully crawl millions of pages for an enterprise site SEO audit with these 14 helpful tips.

SEJ STAFF Roger Montti

July 8, 2022


Crawling enterprise sites has all the complexities of any normal crawl plus several additional factors that need to be considered before beginning the crawl.

The following approaches show how to accomplish a large-scale crawl and achieve the given objectives, whether it’s part of an ongoing checkup or a site audit.

1. Make The Site Ready For Crawling

An important thing to consider before crawling is the website itself.

It’s helpful to fix issues that may slow down a crawl before starting the crawl.

It may sound counterintuitive to fix something before running the crawl that is supposed to find what needs fixing, but when it comes to really big sites, a small problem multiplied by five million becomes a significant problem.

Adam Humphreys, the founder of Making 8 Inc. digital marketing agency, shared a clever solution he uses for identifying what is causing a slow TTFB (time to first byte), a metric that measures how responsive a web server is.

A byte is a unit of data, so TTFB is a measurement of how long it takes for the first byte of a response to be delivered to the browser.

More precisely, TTFB measures the time between a server receiving a request for a file and the first byte being delivered to the browser, which gives an indication of how fast the server is.

A way to measure TTFB is to enter a URL in Google’s PageSpeed Insights tool, which is powered by Google’s Lighthouse measurement technology.

TTFB Score on PageSpeed Insights Tool — *Screenshot from PageSpeed Insights Tool, July 2022*

Adam shared: “So a lot of times, Core Web Vitals will flag a slow TTFB for pages that are being audited. To get a truly accurate TTFB reading one can compare the raw text file, just a simple text file with no HTML, loading up on the server to the actual website.

Throw some Lorem ipsum or something on a text file and upload it then measure the TTFB. The idea is to see server response times in TTFB and then isolate what resources on the site are causing the latency.

More often than not it’s excessive plugins that people love. I refresh both Lighthouse in incognito and web.dev/measure to average out measurements. When I see 30–50 plugins or tons of JavaScript in the source code, it’s almost an immediate problem before even starting any crawling.”

When Adam says he’s refreshing the Lighthouse scores, he means that he tests the URL multiple times, because every test yields a slightly different score. The speed at which data is routed through the Internet is constantly changing, just as the speed of road traffic is constantly changing.

So what Adam does is collect multiple TTFB scores and average them to come up with a final score that then tells him how responsive a web server is.

If the server is not responsive, the PageSpeed Insights tool can provide an idea of why the server is not responsive and what needs to be fixed.
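The averaging approach described above can be sketched in Python. This is a minimal illustration and not Adam’s actual tooling: `measure_ttfb` and `average_ttfb` are hypothetical helper names, and the timing is only a client-side approximation of TTFB, since it also includes DNS lookup, connection setup, and TLS negotiation.

```python
import statistics
import time
import urllib.request


def measure_ttfb(url, timeout=10.0):
    """Return the seconds elapsed until the first byte of the response body is read.

    Assumption: this client-side timing stands in for TTFB. It also includes
    DNS, connection, and TLS time, so treat it as an approximation.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)  # block until at least one body byte (or EOF) arrives
    return time.perf_counter() - start


def average_ttfb(url, runs=5, measure=measure_ttfb):
    """Measure the same URL several times and return (mean, samples)."""
    samples = [measure(url) for _ in range(runs)]
    return statistics.mean(samples), samples
```

Running `average_ttfb` once against a plain text file uploaded to the server and once against a real page gives two averages to compare. A large gap between them suggests the latency comes from the site itself rather than from the server.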

2. Ensure Full Access To Server: Whitelist Crawler IP

Firewalls and CDNs (Content Delivery Networks) can block or slow down an IP from crawling a website.

So it’s important to identify all security plugins, server-level intrusion prevention software, and CDNs that may impede a site crawl.

Typical WordPress security plugins where a crawler IP may need to be added to the whitelist are Sucuri Web Application Firewall (WAF) and Wordfence.

3. Crawl During Off-Peak Hours

Crawling a site should ideally be unintrusive.

Under the best-case scenario, a server should be able to handle being aggressively crawled while also serving web pages to actual site visitors.

But on the other hand, it could be useful to test how well the server responds under load.

This is where real-time analytics or server log access will be useful because you can immediately see how the crawl may be affecting site visitors, although a slowing pace of crawling and 503 server responses are also clues that the server is under strain.

If it’s indeed the case that the server is straining to keep up then make note of that response and crawl the site during off-peak hours.

A CDN should in any case mitigate the effects of an aggressive crawl.
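One way to find off-peak hours from server logs is to count requests per hour of the day. The sketch below assumes an access log with Common Log Format timestamps such as `[08/Jul/2022:14:03:11 +0000]`; the function names are hypothetical and the log format on a given server may differ.

```python
import re
from collections import Counter

# Assumption: timestamps look like [08/Jul/2022:14:03:11 +0000]
HOUR_RE = re.compile(r"\[\d{2}/[A-Za-z]{3}/\d{4}:(\d{2}):\d{2}:\d{2} [+-]\d{4}\]")


def requests_per_hour(log_lines):
    """Count requests for each hour of the day (0-23) found in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = HOUR_RE.search(line)
        if match:  # lines without a recognizable timestamp are skipped
            counts[int(match.group(1))] += 1
    return counts


def quietest_hours(log_lines, n=3):
    """Return the n hours of the day with the fewest requests, quietest first."""
    counts = requests_per_hour(log_lines)
    return sorted(range(24), key=lambda hour: (counts.get(hour, 0), hour))[:n]
```

The result is only meaningful when the log covers enough days to be representative, and the hours are in whatever time zone the server writes to its log.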

4. Are There Server Errors?

The Google Search Console Crawl Stats report should be the first place to research if the server is having trouble serving pages to Googlebot.

Any issues in the Crawl Stats report should have the cause identified and fixed before crawling an enterprise-level website.

Server error logs are a gold mine of data that can reveal a wide range of errors that may affect how well a site is crawled. Of particular importance is being able to debug otherwise invisible PHP errors.

5. Server Memory

Perhaps something that’s not routinely considered for SEO is the amount of RAM (random access memory) that a server has.

RAM is like short-term memory, a place where a server stores information that it’s using in order to serve web pages to site visitors.

A server with insufficient RAM will become slow.

So if a server becomes slow during a crawl or doesn’t seem to be able to cope with a crawl, then this could be an SEO problem that affects how well Google is able to crawl and index web pages.

Take a look at how much RAM the server has.

A VPS (virtual private server) may need a minimum of 1 GB of RAM.

However, 2 GB to 4 GB of RAM may be recommended if the website is an online store with high traffic.

More RAM is generally better.

If the server has a sufficient amount of RAM but the server slows down then the problem might be something else, like the software (or a plugin) that’s inefficient and causing excessive memory requirements.

6. Periodically Verify The Crawl Data

Keep an eye out for crawl anomalies as the website is crawled.

Sometimes the crawler may report that the server was unable to respond to a request for a web page, generating something like a 503 Service Unavailable server response message.

So it’s useful to pause the crawl and check out what’s going on that might need fixing in order to proceed with a crawl that provides more useful information.

Sometimes it’s not getting to the end of the crawl that’s the goal.

The crawl itself is an important data point, so don’t feel frustrated that the crawl needs to be paused in order to fix something because the discovery is a good thing.
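The idea of pausing when the server starts returning 503 responses can be sketched as a small monitor that watches recent status codes. The class name, window size, and threshold below are all assumptions chosen for illustration, not settings from any particular crawler.

```python
from collections import deque


class StrainMonitor:
    """Track recent HTTP status codes and flag when 503s become too common."""

    def __init__(self, window=100, max_503_share=0.05):
        # Assumed defaults: look at the last 100 responses, tolerate 5% 503s.
        self.recent = deque(maxlen=window)
        self.max_503_share = max_503_share

    def record(self, status_code):
        """Store the status code of the most recent response."""
        self.recent.append(status_code)

    def should_pause(self):
        """Return True when the share of 503s in the window exceeds the limit."""
        if not self.recent:
            return False
        share = sum(1 for code in self.recent if code == 503) / len(self.recent)
        return share > self.max_503_share
```

A crawl loop would call `record` after every response and stop to investigate once `should_pause` returns True.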

7. Configure Your Crawler For Scale

Out of the box, a crawler like Screaming Frog may be set up for speed which is probably great for the majority of users. But it’ll need to be adjusted in order for it to crawl a large website with millions of pages.

Screaming Frog uses RAM for its crawl, which is great for a normal site but becomes less great for an enterprise-sized website.

This shortcoming is easy to overcome by switching the storage setting in Screaming Frog to database storage, which saves crawl data to disk.

This is the menu path for adjusting the storage settings:

Configuration > System > Storage > Database Storage

If possible, it’s highly recommended (but not absolutely required) to use an internal SSD (solid-state drive).

Most computers use a standard hard drive with moving parts inside.

An SSD has no moving parts and can transfer data at speeds from 10 to 100 times faster than a regular hard drive.

Using a computer with an SSD will help achieve a fast crawl, which is necessary for efficiently downloading millions of web pages.

To ensure an optimal crawl, allocate 4 GB of RAM, and no more than 4 GB, for a crawl of up to 2 million URLs.

For crawls of up to 5 million URLs, it is recommended that 8 GB of RAM be allocated.
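The two allocation figures above can be written as a simple lookup. This is only a restatement of the numbers in this article; the function name is hypothetical and the figures are not an official sizing formula.

```python
def recommended_ram_gb(url_count):
    """Return the RAM allocation in GB quoted in this article for a crawl size.

    Returns None for crawls larger than 5 million URLs, because the figures
    quoted here stop at that size.
    """
    if url_count <= 2_000_000:
        return 4
    if url_count <= 5_000_000:
        return 8
    return None
```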

Adam Humphreys shared: “Crawling sites is incredibly resource intensive and requires a lot of memory. A dedicated desktop or renting a server is a much faster method than a laptop.

I once spent almost two weeks waiting for a crawl to complete. I learned from that and got partners to build remote software so I can perform audits anywhere at any time.”

8. Connect To A Fast Internet

If you are crawling from your office then it’s paramount to use the fastest Internet connection possible.

Using the fastest available Internet can mean the difference between a crawl that takes hours to complete and a crawl that takes days.

In general, the fastest available Internet is over an ethernet connection and not over a Wi-Fi connection.

If your Internet access is over Wi-Fi, it’s still possible to get an ethernet connection by moving a laptop or desktop closer to the Wi-Fi router, which usually has ethernet ports on the back.

This seems like one of those “it goes without saying” pieces of advice but it’s easy to overlook because most people use Wi-Fi by default, without really thinking about how much faster it would be to connect the computer straight to the router with an ethernet cord.
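A quick calculation shows how the hours-versus-days difference arises. The crawl rates below are hypothetical, and the real rate depends on the server and crawler settings as well as the connection.

```python
def estimate_crawl_hours(url_count, urls_per_second):
    """Return the crawl duration in hours at a constant crawl rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second / 3600


# Hypothetical rates for a 5 million URL crawl:
slow = estimate_crawl_hours(5_000_000, 5)   # about 278 hours, roughly 11.6 days
fast = estimate_crawl_hours(5_000_000, 50)  # about 28 hours
```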

9. Cloud Crawling

For extraordinarily large and complex site crawls of over 5 million web pages, crawling from a server can be the best option.

All normal constraints from a desktop crawl are off when using a cloud server.

Ash Nallawalla, an Enterprise SEO specialist and author, has over 20 years of experience working with some of the world’s biggest enterprise technology firms.

So I asked him about crawling millions of pages.

He responded that he recommends crawling from the cloud for sites with over 5 million URLs.

Ash shared: “Crawling huge websites is best done in the cloud. I do up to 5 million URIs with Screaming Frog on my laptop in database storage mode, but our sites have far more pages, so we run virtual machines in the cloud to crawl them.

Our content is popular with scrapers for competitive data intelligence reasons, more so than copying the articles for their textual content.

We use firewall technology to stop anyone from collecting too many pages at high speed. It is good enough to detect scrapers acting in so-called “human emulation mode.” Therefore, we can only crawl from whitelisted IP addresses and a further layer of authentication.”

Adam Humphreys agreed with the advice to crawl from the cloud.

He said: “Crawling sites is incredibly resource intensive and requires a lot of memory. A dedicated desktop or renting a server is a much faster method than a laptop. I once spent almost two weeks waiting for a crawl to complete.

I learned from that and got partners to build remote software so I can perform audits anywhere at any time from the cloud.”

10. Partial Crawls

A technique for crawling large websites is to divide the site into parts and crawl each part in sequence so that the result is a sectional view of the website.

Another way to do a partial crawl is to divide the site into parts and crawl on a continual basis so that the snapshot of each section is not only kept up to date but any changes made to the site can be instantly viewed.

So rather than repeatedly crawling the entire site in a single pass, crawl it section by section on a schedule.

This is an approach that Ash strongly recommends.

Ash explained: “I have a crawl going on all the time. I am running one right now on one product brand. It is configured to stop crawling at the default limit of 5 million URLs.”

When I asked him the reason for a continual crawl he said it was because of issues beyond his control which can happen with businesses of this size where many stakeholders are involved.

Ash said: “For my situation, I have an ongoing crawl to address known issues in a specific area.”
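Dividing a site into parts can be sketched by grouping known URLs by their first path segment. This assumes the site’s sections map onto top-level directories, which will not be true of every site; the function name is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_section(urls):
    """Group URLs by their first path segment, e.g. /blog/post -> 'blog'."""
    sections = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        first_segment = path.strip("/").split("/")[0]
        # URLs with no path segment (the home page) go into a '(root)' bucket.
        sections[first_segment or "(root)"].append(url)
    return dict(sections)
```

Each resulting group can then be used as the seed list or include pattern for a separate crawl.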

11. Overall Snapshot: Limited Crawls

A way to get a high-level view of what a website looks like is to limit the crawl to just a sample of the site.

This is also useful for competitive intelligence crawls.

For example, on a Your Money Or Your Life project I worked on I crawled about 50,000 pages from a competitor’s website to see what kinds of sites they were linking out to.

I used that data to convince the client that their outbound linking patterns were poor and showed them the high-quality sites their top-ranked competitors were linking to.

So sometimes, a limited crawl can yield enough of a certain kind of data to get an idea of the health of the overall site.
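The outbound-link analysis described above can be approximated by tallying the external hosts linked from each crawled page. This is a sketch using only the Python standard library, not the tool used for that project; `LinkCollector` and `outbound_domains` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outbound_domains(page_url, html):
    """Count links per external host. Subdomains count as external here."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlsplit(page_url).hostname
    counts = Counter()
    for href in collector.hrefs:
        host = urlsplit(urljoin(page_url, href)).hostname
        if host and host != own_host:  # skips internal, mailto:, and tel: links
            counts[host] += 1
    return counts
```

Summing the counters across a sample of pages gives a list of the sites a competitor links out to most often.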

12. Crawl For Site Structure Overview

Sometimes one only needs to understand the site structure.

In order to do this faster one can set the crawler to not crawl external links and internal images.

There are other crawler settings that can be un-ticked in order to produce a faster crawl so that the only thing the crawler is focusing on is downloading the URL and the link structure.

13. How To Handle Duplicate Pages And Canonicals

Unless there’s a reason for indexing duplicate pages, it can be useful to set the crawler to ignore URL parameters and other URLs that are duplicates of a canonical URL.
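Ignoring URL parameters amounts to stripping the query string and keeping one copy of each resulting URL. The sketch below is a simplified illustration with hypothetical function names; it assumes every parameter is safe to drop, which is not true for sites where parameters select distinct content.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_without_parameters(urls):
    """Return parameter-free URLs in first-seen order, with duplicates removed."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_query(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```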

It’s possible to set a crawler to only crawl canonical pages. But if someone set paginated pages to canonicalize to the first page in the sequence then you’ll never discover this error.

For a similar reason, at least on the initial crawl, one might want to disobey noindex tags in order to identify instances of the noindex directive on pages that should be indexed.
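Checking canonicals and noindex directives during an initial crawl can be sketched as a per-page audit that records both signals instead of obeying them. The names below are hypothetical, and the canonical comparison is a plain string match that does not resolve relative URLs or normalize trailing slashes.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Record the rel=canonical URL and any robots noindex directive."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("robots", "googlebot"):
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def audit_page(url, html):
    """Return the canonical and noindex signals found on one page."""
    signals = HeadSignals()
    signals.feed(html)
    return {
        "url": url,
        "canonical": signals.canonical,
        # Naive comparison: flags any canonical that differs from the page URL.
        "canonical_points_elsewhere": signals.canonical is not None and signals.canonical != url,
        "noindex": signals.noindex,
    }
```

Paginated URLs that report `canonical_points_elsewhere` as true, or important pages that report `noindex` as true, are the kinds of errors that a crawl restricted to canonical, indexable pages would never surface.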

14. See What Google Sees

As you’ve no doubt noticed, there are many different ways to crawl a website consisting of millions of web pages.

A crawl budget is the amount of resources Google devotes to crawling a website for indexing.

The more webpages that are successfully indexed, the more pages have the opportunity to rank.

Small sites don’t really have to worry about Google’s crawl budget.

But maximizing Google’s crawl budget is a priority for enterprise websites.

In the previous tip, I advised against respecting noindex tags.

For this kind of crawl, however, you will actually want to obey noindex directives, because the goal is to get a snapshot that tells you how Google sees the entire website.

Google Search Console provides lots of information but crawling a website yourself with a user agent disguised as Google may yield useful information that can help improve getting more of the right pages indexed while discovering which pages Google might be wasting the crawl budget on.

For that kind of crawl, it’s important to set the crawler user agent to Googlebot, set the crawler to obey robots.txt, and set the crawler to obey the noindex directive.

That way, if the site is set to not show certain page elements to Googlebot you’ll be able to see a map of the site as Google sees it.

This is a great way to diagnose potential issues such as discovering pages that should be crawled but are getting missed.

For other sites, Google might be finding its way to pages that are useful to users but might be perceived as low quality by Google, like pages with sign-up forms.

Crawling with the Google user agent is useful for understanding how Google sees the site and helps to maximize the crawl budget.
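The robots.txt part of such a crawl can be sketched with Python’s standard `urllib.robotparser`. This is an approximation: Python’s parser applies the first matching rule, which can differ from how Google resolves conflicting Allow and Disallow rules, and the helper name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A Googlebot user agent string to send as the User-Agent request header.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
REQUEST_HEADERS = {"User-Agent": GOOGLEBOT_UA}


def allowed_for_googlebot(robots_txt, url):
    """Return True if the given robots.txt text lets Googlebot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)
```

A crawler that checks `allowed_for_googlebot` before each request, sends `REQUEST_HEADERS`, and skips pages carrying a noindex directive will build a map of the site that is closer to what Google sees.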

Beating The Learning Curve

One can crawl enterprise websites and learn how to crawl them the hard way. These fourteen tips should hopefully shave some time off the learning curve and make you more prepared to take on those enterprise-level clients with gigantic websites.


Featured Image: SvetaZi/Shutterstock
