OpenAI has launched GPTBot, a new web crawler that gathers data to improve artificial intelligence models such as GPT-4 and a future GPT-5.
How GPTBot Works
GPTBot identifies itself with the user agent token and full user-agent string below. It crawls the web for data that can improve the accuracy, capabilities, and safety of AI technology.
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
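For site owners who want to spot these requests in their own logs or middleware, the check can be sketched in Python as follows. This is a minimal illustration, not OpenAI's code: the function name `is_gptbot` and the regular expression are assumptions introduced here, and the match is deliberately limited to the `GPTBot` token rather than the full string, since the version number may change.

```python
import re

# Hypothetical helper: matches the "GPTBot" token as a whole word, in any letter case.
GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header value contains the GPTBot token."""
    return bool(GPTBOT_TOKEN.search(user_agent or ""))


# The full user-agent string quoted above.
full_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# An ordinary browser-style string, used here only for contrast.
browser_ua = "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"

print(is_gptbot(full_ua))
print(is_gptbot(browser_ua))
```

Any client can send an arbitrary User-Agent header, so a match here is a hint about the traffic source, not proof of it.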
According to OpenAI, crawled pages are filtered to remove sources that require paywall access, sources that violate OpenAI’s policies, and sources known to gather personally identifiable information.
Data collected by GPTBot could give AI models a significant boost.
By allowing it to access your site, you add your content to this data pool and contribute to the wider AI ecosystem.
However, it’s not a one-size-fits-all scenario. OpenAI has given web admins the power to choose whether or not to grant GPTBot access to their websites.
Restricting GPTBot Access
If website owners wish to restrict GPTBot from their site, they can modify their robots.txt file.
By including the following, they can prevent GPTBot from accessing the entirety of their website.
User-agent: GPTBot
Disallow: /
In contrast, those who wish to grant partial access can customize the directories that GPTBot can access. To do this, add the following to the robots.txt file.
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
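Before deploying either set of rules, it can help to check how a standards-following parser reads them. The sketch below uses Python's standard-library `urllib.robotparser` for that purpose. It shows how that parser interprets the rules, not how GPTBot itself evaluates them, and the helper `build_parser` and the sample paths are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Rules that block GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rules that grant partial access.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

block_all = build_parser(BLOCK_ALL)
partial = build_parser(PARTIAL)

# Full block: GPTBot is refused, other crawlers are unaffected.
print(block_all.can_fetch("GPTBot", "/any/page.html"))
print(block_all.can_fetch("Googlebot", "/any/page.html"))

# Partial access: directory-1 is open to GPTBot, directory-2 is not.
print(partial.can_fetch("GPTBot", "/directory-1/page.html"))
print(partial.can_fetch("GPTBot", "/directory-2/page.html"))
```

Each directive sits on its own line, which is what the parser expects. Paths that match no rule are allowed by default.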
GPTBot’s requests to websites originate from IP address ranges documented on OpenAI’s website. This gives web admins added transparency about the source of the traffic on their sites.
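Because a user-agent string can be spoofed, comparing the client IP against the published ranges is a stronger check. The sketch below shows the idea with Python's standard-library `ipaddress` module. The ranges in `EXAMPLE_RANGES` are placeholders from the reserved documentation address blocks, not OpenAI's actual ranges, and `ip_in_ranges` is a hypothetical helper name. Substitute the ranges listed on OpenAI's website.

```python
import ipaddress
from typing import Iterable

# PLACEHOLDER ranges from reserved documentation blocks (RFC 5737).
# These are NOT OpenAI's ranges; replace them with the published list.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def ip_in_ranges(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if the IP address falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# An address inside the first placeholder range.
print(ip_in_ranges("192.0.2.44", EXAMPLE_RANGES))

# An address outside both placeholder ranges.
print(ip_in_ranges("203.0.113.9", EXAMPLE_RANGES))
```

A request that claims to be GPTBot in its user-agent header but arrives from outside the documented ranges is likely not coming from OpenAI.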
The choice to allow or disallow the GPTBot web crawler determines whether your site’s content contributes to AI advancement, and it could significantly affect your site’s data privacy and security.
Legal And Ethical Concerns
OpenAI’s latest news has sparked a debate on Hacker News around the ethics and legality of using scraped web data to train proprietary AI systems.
GPTBot identifies itself so web admins can block it via robots.txt, but some argue there’s no benefit to allowing it, unlike search engine crawlers that drive traffic. A significant concern is copyrighted content being used without attribution. ChatGPT does not currently cite sources.
There are also questions about how GPTBot handles licensed images, videos, music, and other media found on websites. If that media ends up in model training, it could constitute copyright infringement. Some experts think crawled data could degrade models if AI-written content gets fed back into training.
Conversely, some believe OpenAI has the right to use public web data freely, likening it to a person learning from online content. However, others argue that OpenAI should share profits if it monetizes web data for commercial gain.
Overall, GPTBot has opened complex debates around ownership, fair use, and the incentives of web content creators. While following robots.txt is a good step, transparency is still lacking. The tech community wonders how its data will be used as AI products advance rapidly.
Featured image: Vitor Miranda/Shutterstock