AI content harvesting - a practical experiment

By: Rob Corbidge, 06 September 2023

With generative AI output being a moving target, it's hard to ascertain how closely it mimics the content the underlying LLMs were trained on. Boilerplate text to the rescue.

Gauging how closely generative AI systems such as ChatGPT reproduce the content they've been trained on is tricky, yet an elegantly simple piece of research has shown how faithful they are to the original content.

Nick Diakopoulos, Professor of Computational Journalism at Northwestern University in the US, selected samples of boilerplate text used on popular websites and fed incomplete versions of them into both GPT-3 and GPT-4 models, allowing him to see how the Large Language Models completed the boilerplate text.

One sample of such boilerplate text was taken from the New York Times, for example:

"The Times is committed to publishing a diversity of letters to the editor. We’d like to hear what you think about this or any of our articles. Here are some tips. And here’s our email:"

This text has obviously appeared many, many times during the period in which OpenAI was able to access NYT content for the purposes of training ChatGPT.

Importantly, it is indisputably content harvested from a particular source, in this case the NYT.

In the experiment, only a part of the text was given as an input, such as "The Times is committed to publishing" and the LLMs were required to finish it.

Diakopoulos found that both GPT-3 and GPT-4 would produce a fairly faithful version of the boilerplate text, with GPT-4 producing a complete version.

The experiment used various prefixes of the same text, and also varied the models' temperature setting.
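To give a sense of how such completion fidelity might be scored, here is a minimal sketch using a simple character-level similarity ratio. The sample completion and the `fidelity` helper are illustrative assumptions for this sketch, not Diakopoulos's actual method or real model output:

```python
import difflib

# Reference boilerplate as published by the NYT (quoted in the article).
reference = (
    "The Times is committed to publishing a diversity of letters to the "
    "editor. We'd like to hear what you think about this or any of our "
    "articles. Here are some tips. And here's our email:"
)

# The truncated prefix fed to the models in the experiment.
prompt = "The Times is committed to publishing"

def fidelity(completion: str, original: str) -> float:
    """Return a 0-1 similarity ratio between a completion and the original text."""
    return difflib.SequenceMatcher(None, completion, original).ratio()

# Hypothetical model output, used purely for illustration.
sample_completion = (
    "The Times is committed to publishing a diversity of letters to the editor."
)

print(f"fidelity: {fidelity(sample_completion, reference):.2f}")
```

A score of 1.0 would mean the model reproduced the boilerplate verbatim; partial completions, like the hypothetical one above, land somewhere in between.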

The full results and methodology are here.