
OpenAI’s Web Crawler
There is an upheaval on the internet regarding GPTBot and how the data it collects will be used to train GPT-5.
However, crawlers are as old as the internet itself; companies like Google have long used them to automatically perform actions for their products.
Crawler (or bot) is a generic term for an automated process that discovers and scans websites by following links from one page to another.
Google, for example, uses crawlers and fetchers to perform actions for its products, either automatically or triggered by a user request.
Since bots have been used extensively by various companies for years, should OpenAI's transparency and the ability to opt out not be lauded?
GPTBot, OpenAI’s web crawler, can be identified by its user-agent token and full user-agent string:
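According to OpenAI’s published documentation, the crawler identifies itself as follows:

```
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
```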
Disallowing GPTBot
To block GPTBot from accessing a site, add the following to the site’s robots.txt:
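Per OpenAI’s guidance, a site-wide block looks like this:

```
User-agent: GPTBot
Disallow: /
```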
GPTBot Customised Access
To allow GPTBot to access only certain parts of a site, add the GPTBot token to the site’s robots.txt with Allow and Disallow directives for the relevant paths:
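OpenAI’s example combines an Allow and a Disallow rule; the directory names below are placeholders, not anything a real site requires. The effect of such rules can be checked with Python’s standard-library robots.txt parser, as this sketch shows:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt granting GPTBot partial access.
# "/directory-1/" and "/directory-2/" are placeholder paths.
robots_txt = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot may fetch pages under the allowed directory...
print(rp.can_fetch("GPTBot", "/directory-1/page.html"))  # True
# ...but not under the disallowed one.
print(rp.can_fetch("GPTBot", "/directory-2/page.html"))  # False
```

Note that robots.txt is advisory: it relies on the crawler honouring the rules, which OpenAI states GPTBot does.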
I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.