
The rise of the robot gangsters

Content-harvesting robo-bandits are demolishing long-held standards on the open internet.

by Rob Corbidge
Published: 13:35, 27 June 2024

Rob Corbidge is Head of Content Intelligence at Glide Publishing Platform, applying the latest knowledge about advances and ideas in the publishing industry to our own product and helping clients get the most from their content.


During the westward expansion of the United States, the various governmental authorities in charge were at times simply reactive in their decision making. Settlers often made the facts on the ground before anyone in authority even knew of it, and land ended up being recategorised retrospectively. If there was a gold rush, then all bets were off.

Something similar seems to be occurring with greater frequency, as data-hungry Gen-AI businesses violate a basic protocol, the Robots Exclusion Protocol, which has helped keep the internet reasonably honest for years.

Soon after I wrote about the suspect crawling activities of "AI answers" company Perplexity, as revealed by Wired, Reuters reported that plenty of others are ignoring specific crawling exclusions in the robots.txt file in order to harvest data for LLM training purposes. That is according to research from a business aiming to insert itself between publishers and AI companies.
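
For readers less familiar with the mechanics, here is a minimal Python sketch of what honouring the Robots Exclusion Protocol looks like from the crawler's side, using the standard library's `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` and `SomeOtherBot` user agents, and the example.com URLs are all made up for illustration, and `is_allowed` is a hypothetical helper name, not anything the companies mentioned here are known to run.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: one named AI crawler is
# excluded from the whole site, everyone else only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        verdict = "may fetch" if is_allowed(ROBOTS_TXT, agent, story) else "must skip"
        print(f"{agent} {verdict} {story}")
```

The check is entirely voluntary. Nothing in the protocol stops a crawler from skipping it, or from announcing itself under a different user agent, which is the kind of behaviour being alleged.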

A caveat here: the company that says it has proof of this, TollBit, is "positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them". So it has a dog in the fight, as it were, but I'm willing to entertain that it might be a dog publishers can harness.

So, like the Wild West, we're seeing a land grab. Except the land is occupied by original content producers such as publishers. Are we to expect some retrospective legal rulings about content theft that change nothing after the fact?

Perplexity's CEO Aravind Srinivas has given an interview with Fast Company in which he attempted to explain away the concerns about content harvesting, telling the publication that the "mysterious web crawler that Wired identified was not owned by Perplexity, but by a third-party provider of web crawling and indexing services".

So that's OK then? It's not our burglar giving us the stuff. "It's complicated," said Srinivas. You bet it is. Perplexity have this week announced investment from SoftBank that values them at $3 billion.

Srinivas then took the position of a Formula One race team boss who has been discovered using a triple-baffle unobtainium centrifugal redistributor not specifically banned by the rulebook, telling Fast Company that the Robots Exclusion Protocol is "not a legal framework" and suggesting that the emergence of AI requires a new kind of working relationship between content creators, or publishers, and sites like his.

Reddit have moved in the past few days to update their own robots.txt file. A spokesman told TechCrunch that "bots and crawlers will be rate-limited or blocked if they don’t abide by Reddit’s Public Content Policy and don’t have an agreement with the platform".

As a reminder, Reddit has cut a $60 million deal with Google to allow Google to train its AI models on Reddit's user-generated content. So it's holding a big legal hand if anyone violates the terms of its Public Content Policy and has no usage agreement with the platform.

Then again, most publishers don't have such legal recourse, so maybe there is a slot for specialist outfits along the lines of TollBit to cut through the crap and get us all a better deal?
