AI tech leader specifies how to disallow ChatGPT's crawler
Publishers can now check whether ChatGPT is crawling their content and block access to any or all of it via the robots.txt protocol, following documentation OpenAI has published on its GPTBot crawler.
OpenAI carries the following statement in the documentation: "Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site."
The documentation is here.
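Based on OpenAI's documentation, blocking uses standard robots.txt directives addressed to the GPTBot user agent. A sketch of both a full block and a partial one (the directory names are placeholders, not paths from the documentation):

```
# Block GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Or allow GPTBot into some sections while blocking others
User-agent: GPTBot
Allow: /public-articles/
Disallow: /premium-content/
```

The file must be served at the site root (e.g. example.com/robots.txt), and like all robots.txt rules it relies on the crawler choosing to honour it.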
As SEORoundtable has previously noted, publishers were already able to block ChatGPT plugins.
The ability will come as welcome news, particularly to small and medium-sized publishers who have no direct channel for striking deals with companies such as OpenAI, whose business model benefits at its core from harvesting content.
Alternatively, others may use the ability to decide which parts of their site can be crawled, steering the crawler toward a particular kind of content.