How to block GenAI crawlers such as Google's Bard or OpenAI's ChatGPT from your website

By: Rob Corbidge, 04 October 2023

Keep unwanted bots at bay by using the robots.txt file to tell AI crawlers you don't want them to scrape your content.

Overview

With the emergence of sophisticated AI technologies such as OpenAI's ChatGPT and Google's Bard, fed by various web crawlers, the internet is awash with automated agents hoovering up your content.

While these bots can be innovative and beneficial, GenAI's "take and ask for permission later" approach should be nipped in the bud. This article provides a guide on how to block ChatGPT and other AI crawlers from accessing a website.

Understanding the Bots

Before moving on to blocking strategies, it is important to understand how AI bots and crawlers work and how to identify them.

Bots like ChatGPT may interact with web content through APIs or web scraping, while other generic web crawlers scan websites to index them for search engines or data retrieval purposes. Identifying them typically involves analysing user agent strings, IP addresses, or behavioural patterns.
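As a rough illustration of the user agent approach, the sketch below checks a request's User-Agent header against the crawler names listed later in this article. The function name and the sample header strings are assumptions made for the example, and a user agent can be spoofed, so treat a match as a hint rather than proof.

```python
from typing import Optional

# Lowercased crawler names taken from the robots.txt list in this article.
AI_CRAWLER_TOKENS = ("anthropic-ai", "ccbot", "chatgpt-user", "cohere-ai", "gptbot")


def is_ai_crawler(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a known AI crawler name."""
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


# Illustrative header values only; real crawler strings may differ.
print(is_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
print(is_ai_crawler("Mozilla/5.0 (Windows NT 10.0; rv:118.0) Gecko/20100101 Firefox/118.0"))
```

A check like this could sit in server-side logging to see which of these bots are actually visiting, before or alongside any robots.txt rules.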

Here we look mostly at the robots.txt file, which tells crawlers to skip your website. Compliance is voluntary, but no playing dumb if they don't!

How to block GenAI crawlers using Robots.txt

A robots.txt file instructs bots on how they should interact with the website. To block all bots or specific ones from accessing your entire site or specific sections, modify the robots.txt file.
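To make the "entire site or specific sections" distinction concrete, here is a minimal sketch that generates robots.txt rules from a list of bot names and paths. The helper name build_robots_txt and the /premium/ and /archive/ paths are hypothetical, chosen only for illustration.

```python
from typing import Iterable


def build_robots_txt(user_agents: Iterable[str], paths: Iterable[str] = ("/",)) -> str:
    """Build one User-agent group per bot, with a Disallow line for each path."""
    path_list = list(paths)
    groups = []
    for agent in user_agents:
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in path_list]
        groups.append("\n".join(lines))
    # Groups are separated by a blank line, as in a hand-written robots.txt.
    return "\n\n".join(groups) + "\n"


# Block two bots from the whole site (the default path is "/").
print(build_robots_txt(["GPTBot", "CCBot"]))

# Block the same bots from two hypothetical sections only.
print(build_robots_txt(["GPTBot", "CCBot"], ["/premium/", "/archive/"]))
```

Passing "*" as the bot name produces a rule aimed at every crawler, which would also cover search engines, so use it with care.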

Google, Bard, Google-Extended, Google-not-care

UPDATE: It looks like the blocker for Google's AI, Google-Extended, does NOT stop the new Google Search Generative Experience from scanning your content. The only way to fully block it is to go nuclear and fully block Google crawlers from your site - which means no search juice. Decisions decisions... or something the competition authorities should look at.

https://searchengineland.com/google-extended-does-not-stop-google-search-generative-experience-from-using-your-sites-content-433058


The Google-Extended instruction is below for completeness. 

User-agent: Google-Extended
Disallow: /

Robots.txt

A current selection of bots to block is as follows (please let us know of any additions via the Contact Us page!):

--------------------------------------------

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: GPTBot
Disallow: /

# Wayback Machine bonus ban
User-agent: ia_archiver
Disallow: /

--------------------------------------------
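Before publishing, it can be worth checking that the rules do what you expect. The sketch below feeds the rules above to Python's standard urllib.robotparser module and asks whether a given bot may fetch a page. The is_allowed helper and the example.com URL are assumptions for illustration, and this only tests how a well-behaved parser reads the file, not whether a given crawler will obey it.

```python
from urllib.robotparser import RobotFileParser

# The rules from this article, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/any-page"))
```

With these rules, GPTBot and CCBot should come back as not allowed, while Googlebot, which has no matching group, should still be allowed.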