
How to Block OpenAI ChatGPT From Using Your Website Content

ChatGPT gets access to website content to learn from it. This is how to block your content from becoming AI training data.


There is concern about the lack of an easy way to opt out of having one’s content used to train large language models (LLMs) like ChatGPT. There is a way to do it, but it’s neither straightforward nor guaranteed to work.

Updated 08-09-2023:

OpenAI published robots.txt instructions for blocking GPTBot.

GPTBot is the user agent for OpenAI’s crawler. OpenAI says it may crawl the web to improve its systems.

OpenAI doesn’t say that GPTBot is used to create the datasets that train ChatGPT. It could be, but they don’t say so explicitly. So keep that in mind if you’re thinking of blocking GPTBot to stay out of OpenAI’s training dataset, because that’s not necessarily what will happen.

Another consideration is that Common Crawl already crawls the Internet and offers a public dataset, so there’s no reason for OpenAI to duplicate that work.

More on how to block Common Crawl further down in this article.

The full user agent string for GPTBot is:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

It can be blocked (disallowed) via robots.txt with the following lines:

User-agent: GPTBot
Disallow: /

GPTBot also obeys the following directives, which control which parts of a website are allowed for crawling and which parts are prohibited.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
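Before deploying rules like these, it can help to sanity-check them with Python’s standard-library robots.txt parser. This is a minimal sketch using the hypothetical directory names from the example above:

```python
# Minimal sketch: verify which paths the example rules allow or block
# for GPTBot, using Python's stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot may fetch pages under /directory-1/ but not /directory-2/.
print(parser.can_fetch("GPTBot", "/directory-1/page.html"))  # True
print(parser.can_fetch("GPTBot", "/directory-2/page.html"))  # False
```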

OpenAI also publishes an IP range that can be used to identify the official GPTBot (as opposed to a crawler that is spoofing the user agent).

It’s possible to block that IP range through .htaccess but the IP range can change, which means that the .htaccess file will have to be updated.
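For illustration, here is a hypothetical .htaccess sketch (assuming Apache 2.4’s Require syntax from mod_authz_core) that denies the GPTBot IP ranges published as of 08-09-2023:

```apache
# Hypothetical sketch: deny requests from the GPTBot IP ranges
# published as of 08-09-2023. These ranges can change, so re-check
# OpenAI's current list before relying on this.
<RequireAll>
    Require all granted
    Require not ip 20.15.240.64/28
    Require not ip 20.15.240.80/28
    Require not ip 20.15.240.96/28
    Require not ip 20.15.240.176/28
    Require not ip 20.15.241.0/28
    Require not ip 20.15.242.128/28
    Require not ip 20.15.242.144/28
    Require not ip 20.15.242.192/28
    Require not ip 40.83.2.64/28
</RequireAll>
```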

This point cannot be overstated: the IP range can change, so always check to see what the most current IP ranges are.

So it’s more convenient to use the IP ranges for confirming that the user agent is genuine, and to block GPTBot with the robots.txt file.

These are the current GPTBot IP ranges as of 08-09-2023:

20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
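These published ranges can be used to check whether a request claiming to be GPTBot actually comes from OpenAI. A minimal Python sketch using the standard library’s ipaddress module (with the ranges current as of 08-09-2023, which can change):

```python
# Sketch: confirm that an IP address claiming to be GPTBot falls inside
# one of OpenAI's published ranges (as of 08-09-2023 -- re-check the
# current list before relying on these).
import ipaddress

GPTBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in [
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "20.15.242.192/28", "40.83.2.64/28",
]]

def is_official_gptbot_ip(addr: str) -> bool:
    """Return True if addr falls inside any published GPTBot range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in GPTBOT_RANGES)

# An address inside 20.15.240.64/28 passes; an arbitrary address does not.
print(is_official_gptbot_ip("20.15.240.70"))  # True
print(is_official_gptbot_ip("8.8.8.8"))       # False
```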

How AIs Learn From Your Content

Large Language Models (LLMs) are trained on data that originates from multiple sources. Many of these datasets are open source and are freely used for training AIs.

In general, Large Language Models use a wide variety of sources to train from.

Examples of the kinds of sources used:

  • Wikipedia
  • Government court records
  • Books
  • Emails
  • Crawled websites

There are portals and websites that offer datasets and give away vast amounts of information.

One of the portals is hosted by Amazon, offering thousands of datasets at the Registry of Open Data on AWS.

Screenshot from Amazon, January 2023

The Amazon portal is just one of many that host thousands of datasets.

Wikipedia lists 28 portals for downloading datasets, including the Google Dataset Search and Hugging Face portals for finding thousands of datasets.

Datasets Used to Train ChatGPT

ChatGPT is based on GPT-3.5, also known as InstructGPT.

The datasets used to train GPT-3.5 are the same used for GPT-3. The major difference between the two is that GPT-3.5 used a technique known as reinforcement learning from human feedback (RLHF).

The five datasets used to train GPT-3 (and GPT-3.5) are described on page 9 of the research paper, Language Models Are Few-Shot Learners (PDF).

The datasets are:

  1. Common Crawl (filtered)
  2. WebText2
  3. Books1
  4. Books2
  5. Wikipedia

Of the five datasets, the two that are based on a crawl of the Internet are:

  • Common Crawl
  • WebText2

About the WebText2 Dataset

WebText2 is a private OpenAI dataset created by crawling links from Reddit that had at least three upvotes.

The idea is that these URLs are trustworthy and will contain quality content.

WebText2 is an extended version of the original WebText dataset developed by OpenAI.

The original WebText dataset had about 15 billion tokens. WebText was used to train GPT-2.

WebText2 is slightly larger at 19 billion tokens. WebText2 is what was used to train GPT-3 and GPT-3.5.

OpenWebText2

WebText2 (created by OpenAI) is not publicly available.

However, there is a publicly available open-source version called OpenWebText2, a public dataset created using the same crawl patterns, which presumably offers a similar, if not identical, set of URLs as OpenAI’s WebText2.

I only mention this in case someone wants to know what’s in WebText2. One can download OpenWebText2 to get an idea of the URLs contained in it.

A cleaned up version of OpenWebText2 can be downloaded here. The raw version of OpenWebText2 is available here.

I couldn’t find information about the user agent used for either crawler; it may simply identify itself as Python, but I’m not sure.

So as far as I know, there is no user agent to block, although I’m not 100% certain.

Nevertheless, we do know that if your site is linked from Reddit with at least three upvotes then there’s a good chance that your site is in both the closed-source OpenAI WebText2 dataset and the open-source version of it, OpenWebText2.

More information about OpenWebText2 is here.

Common Crawl

One of the most commonly used datasets consisting of Internet content is the Common Crawl dataset that’s created by a non-profit organization called Common Crawl.

Common Crawl data comes from a bot that crawls the entire Internet.

The data is downloaded by organizations wishing to use the data and then cleaned of spammy sites, etc.

The name of the Common Crawl bot is CCBot.

CCBot obeys the robots.txt protocol, so it is possible to block Common Crawl with robots.txt and prevent your website data from making it into another dataset.

However, if your site has already been crawled then it’s likely already included in multiple datasets.

Nevertheless, by blocking Common Crawl it’s possible to opt your website content out of new datasets sourced from future Common Crawl crawls.

This is what I meant at the very beginning of the article when I wrote that the process is “neither straightforward nor guaranteed to work.”

The CCBot User-Agent string is:

CCBot/2.0

Add the following to your robots.txt file to block the Common Crawl bot:

User-agent: CCBot
Disallow: /
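You can check that this rule blocks CCBot everywhere without affecting other crawlers, again with Python’s standard-library robots.txt parser (Googlebot here is just an example of an unrelated user agent):

```python
# Sketch: verify the rule above blocks CCBot site-wide while leaving
# other crawlers unaffected.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("CCBot", "/any/page.html"))      # False
print(parser.can_fetch("Googlebot", "/any/page.html"))  # True
```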

An additional way to confirm whether a CCBot user agent is legitimate is to check that it crawls from Amazon AWS IP addresses.

CCBot also obeys the nofollow robots meta tag directives.

Use this in your robots meta tag:

<meta name="CCBot" content="nofollow">

A Consideration Before You Block any Bots

Many datasets, including Common Crawl, could be used by companies that filter and categorize URLs in order to create lists of websites to target with advertising.

For example, a company named Alpha Quantum offers a dataset of URLs categorized using the Interactive Advertising Bureau Taxonomy. The dataset is useful for AdTech marketing and contextual advertising.  Exclusion from a database like that could cause a publisher to lose potential advertisers.

Blocking AI From Using Your Content

Search engines allow websites to opt out of being crawled. Common Crawl also allows opting out. But there is currently no way to remove one’s website content from existing datasets.

Furthermore, research scientists don’t seem to offer website publishers a way to opt out of being crawled.

The article, Is ChatGPT Use Of Web Content Fair? explores the topic of whether it’s even ethical to use website data without permission or a way to opt out.

Many publishers may appreciate being given more say in the near future over how their content is used, especially by AI products like ChatGPT.

Whether that will happen is unknown at this time.



SEJ Staff: Roger Montti, Owner at Martinibuster.com