OpenAI’s Web Crawler

There is an uproar on the internet about GPTBot and how it will be used to train GPT-5.

However, crawlers are nearly as old as the web itself, and companies like Google use them to perform actions for their products automatically.

Crawlers, or bots, are generic terms for automated processes that discover and scan websites by following links from one page to the next.

Google uses crawlers and fetchers to perform actions for its products, either automatically or triggered by user request.

Bots have therefore long been used extensively by various companies; should OpenAI’s transparency, and the ability to opt out, not be lauded?

GPTBot, OpenAI’s web crawler, can be identified by its user agent token and full user-agent string:

User agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
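As a rough illustration of how a site could recognise this crawler in its own request logs or middleware, here is a minimal Python sketch. The function name is_gptbot and the regular expression are assumptions made for this example, not something OpenAI publishes, and a user-agent header can be spoofed, so this is only a first-pass check.

```python
import re

# Hypothetical pattern for this sketch: match the GPTBot token as a whole word.
GPTBOT_PATTERN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header value contains the GPTBot token."""
    return bool(GPTBOT_PATTERN.search(user_agent or ""))


# The full user-agent string quoted above.
full_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

print(is_gptbot(full_ua))                                          # True
print(is_gptbot("Mozilla/5.0 (X11; Linux x86_64) Firefox/117.0"))  # False
```

A check like this is useful for measuring how often GPTBot visits; the robots.txt rules described next are what actually tell the crawler where it may go.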

Disallowing GPTBot

To block GPTBot from accessing a website, add the following to the site’s robots.txt:


User-agent: GPTBot
Disallow: /
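To sanity-check a rule like this before deploying it, one option is Python’s standard-library urllib.robotparser. The sketch below parses the two lines above and asks whether GPTBot, and a made-up agent called SomeOtherBot, may fetch a page. The example.com URL and the second bot name are placeholders, and this only shows how Python’s parser reads the rules, not how GPTBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The blocking rules from above, as they would appear in robots.txt.
rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# example.com and SomeOtherBot are placeholders for this sketch.
url = "https://example.com/any/page"
gptbot_allowed = parser.can_fetch("GPTBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(gptbot_allowed)  # False: GPTBot is disallowed everywhere
print(other_allowed)   # True: no rule applies to other agents
```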

GPTBot Customised Access

To direct GPTBot to access only parts of a site, add the GPTBot token to the site’s robots.txt like this:


User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
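The effect of partial rules is easiest to see path by path. The following sketch, again using Python’s urllib.robotparser, evaluates three illustrative paths against the Allow and Disallow lines for GPTBot. The example.com host, the page names, and the /directory-3/ path are assumptions for this example. Python’s parser applies the first matching rule, whereas some crawlers use longest-match precedence, so results can differ where rules overlap.

```python
from urllib.robotparser import RobotFileParser

# One directive per line, as robots.txt requires.
rules = [
    "User-agent: GPTBot",
    "Allow: /directory-1/",
    "Disallow: /directory-2/",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical paths used only to exercise the rules.
paths = [
    "/directory-1/page.html",
    "/directory-2/page.html",
    "/directory-3/page.html",
]

# Map each path to whether GPTBot may fetch it under these rules.
results = {
    path: parser.can_fetch("GPTBot", "https://example.com" + path)
    for path in paths
}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Note that a path covered by neither rule, such as /directory-3/ here, is treated as allowed; to restrict GPTBot to /directory-1/ only, the Disallow line would need to cover the rest of the site.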

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
