AI content harvesting - a practical experiment

With generative AI content being a moving target, it's hard to ascertain how closely it mimics the content the LLM systems are trained on. Boilerplate text to the rescue.

by Rob Corbidge
Published: 11:25, 06 September 2023

Rob Corbidge is Head of Content Intelligence at Glide Publishing Platform, applying the latest knowledge about advances and ideas in the publishing industry to our own product and helping clients get the most from their content.


Gauging how closely generative AI systems such as ChatGPT reproduce the content they've been trained on is tricky, yet an elegantly simple piece of research has shown how faithful they are to the original content.

Nick Diakopoulos, Professor of Computational Journalism at Northwestern University in the US, selected samples of boilerplate text used on popular websites and fed incomplete versions of them into both the GPT-3 and GPT-4 models to see how the Large Language Models completed the text.

One sample of such boilerplate text was taken from the New York Times, for example:

"The Times is committed to publishing a diversity of letters to the editor. We’d like to hear what you think about this or any of our articles. Here are some tips. And here’s our email: letters@nytimes.com."

This text has obviously appeared many, many times during the period that OpenAI was able to access NYT content for the purposes of training ChatGPT.

Importantly, it is indisputably content farmed from a particular source, in this case the NYT.

In the experiment, only a part of the text was given as an input, such as "The Times is committed to publishing", and the LLMs were required to finish it.

Diakopoulos found that both GPT-3 and GPT-4 would produce a fairly faithful version of the boilerplate text, with GPT-4 producing a complete version.
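For readers who want to try something similar, here is a minimal sketch of how "fairly faithful" could be put into numbers. It is not the method used in the research: the function names, the six-word prefix, and the use of Python's difflib similarity ratio are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# The NYT boilerplate quoted above (with straight apostrophes).
BOILERPLATE = (
    "The Times is committed to publishing a diversity of letters to the editor. "
    "We'd like to hear what you think about this or any of our articles. "
    "Here are some tips. And here's our email: letters@nytimes.com."
)


def normalize(text):
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    return " ".join(text.replace("\u2019", "'").lower().split())


def split_prompt(text, n_words):
    """Return (prompt, expected_rest): the first n_words and the remainder."""
    words = text.split()
    return " ".join(words[:n_words]), " ".join(words[n_words:])


def fidelity(expected, completion):
    """Similarity between 0.0 and 1.0; 1.0 means identical after normalizing.

    This is an illustrative metric (an assumption), not the study's measure.
    """
    matcher = SequenceMatcher(
        None, normalize(expected), normalize(completion), autojunk=False
    )
    return matcher.ratio()


# Hypothetical choice: give the model only the first six words.
prompt, expected = split_prompt(BOILERPLATE, 6)
```

Whatever a model returns for the prompt can then be passed to fidelity(expected, completion). A score near 1.0 suggests the boilerplate was reproduced almost word for word, while a low score suggests the model wrote something of its own.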

The experiment used various parts of the same text as prompts, with the models' temperature setting, which controls how random the output is, also varied.
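A rough sketch of how such a sweep over prompt lengths and temperatures might be organised is below. The prefix lengths, the temperature values, and the complete callable are all hypothetical: the article does not give the settings used, and echo_model is a stand-in so the example runs without any LLM service. A real run would swap in a function that calls the model being tested.

```python
from itertools import product

# First sentence of the NYT boilerplate quoted above.
TEXT = "The Times is committed to publishing a diversity of letters to the editor."


def run_grid(text, prefix_lengths, temperatures, complete):
    """Ask `complete` for a completion at every (prefix length, temperature) pair.

    `complete` is any callable taking (prompt, temperature) and returning text.
    """
    words = text.split()
    results = []
    for n_words, temperature in product(prefix_lengths, temperatures):
        prompt = " ".join(words[:n_words])
        results.append(
            {
                "prefix_words": n_words,
                "temperature": temperature,
                "prompt": prompt,
                "completion": complete(prompt, temperature),
            }
        )
    return results


def echo_model(prompt, temperature):
    """Stand-in for a real model: returns a placeholder string, not generated text."""
    return f"[completion for {len(prompt.split())}-word prompt at T={temperature}]"


# Assumed values for illustration only.
rows = run_grid(TEXT, [3, 6, 10], [0.0, 0.7, 1.0], echo_model)
for row in rows:
    print(row["prefix_words"], row["temperature"], row["completion"])
```

Keeping the model behind a plain callable means the same grid can be rerun against different models, which is the comparison the experiment was making between GPT-3 and GPT-4.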

The full results and methodology are here.
