In a nascent industry of supposed agility and innovation, Big AI already looks like a bloated monopoly ripe for disruption.
Nuclear warheads on Soviet missiles were historically larger than those of their NATO counterparts. The reason for this wasn't directly malign - if not being malign is an option with strategic nuclear weaponry - it was rather driven by the fact that the communist atomic arsenal consisted of missiles a good degree less accurate than those fielded by their capitalist adversaries. Consequently, a logical decision was made in the East to carry larger warheads and negate any deficits in accuracy.
Raw power was used to mitigate inferior design.
Cold War logic was very, very cold indeed, but what light does such chilling information shed on the current state of publishing?
Allow me to illustrate a little further by way of something I learned from a senior Glide engineer, back when I didn't know what I didn't know. They explained that simply adding "more" to a running system - power, processing capacity, whatever the more was - to make it perform better was, in effect, an admission of failure. If your code was inefficient, if the fundamental design decisions of your system were lacking, if your architecture didn't account properly for real usage, all these things could be hidden - by using a larger instance, for example. But that meant the customer was paying more and you hadn't made them the best thing you could make.
These are the people who build Glide. They don't waste your time or money!
Yet in the dawn of generative AI, the idea that "more is good" - in fact that "more is the only good" - is an article of faith. The "more" I speak of, of course, is the intellectual property of publishers and its consumption as the absurdly benign-sounding "training data". The relentless and merciless purse-trawling of someone else's creativity in the endless pursuit of "scale".
An image of progress has been conjured for us, one that looks like vast processing halls powered by newly-constructed nuclear reactors, all sitting and humming behind a cavernous technological gullet which, if only we shovel the entire creative work of humanity into it, will yield results beyond our wildest dreams.
It is this illusion that many technocrats, investors, and politicians have seemingly bought into as a way of burnishing their visionary credentials, while not really having a clear focus on, or even idea of, what that vision is. Above it all there's more than a whiff of Eau de FOMO.
Disrupting the disruption
A recent development rather pops the commercial GenAI ethos of "big is everything, and everything is ours".
A project called Common Pile v0.1 has created an 8TB dataset comprising "a corpus of openly licensed and public domain text for training large language models." Nothing stolen or purloined, quite a tight list of sources, and zero synthetic data.
There's a tidy breakdown of its capabilities here, which also includes a strong refutation of the line so often trotted out that it's impossible to train high-performing AI without stealing data.
Is it an open-source moonshot against the dominance of the major commercial players? Well, it's a full, usable dataset that developers can build from, and its creators at EleutherAI have thanked the University of Toronto and Vector Institute, Hugging Face, the Allen Institute for Artificial Intelligence, Teraflop AI, Cornell University, MIT, CMU, Lila Sciences, poolside, the University of Maryland, College Park, and Lawrence Livermore National Laboratory for help in its production.
Not a bad list of moonshooters, and it leaves EleutherAI looking like the OpenAI that OpenAI once aspired to be, before it was overcome by delusions of omniscience and riches.
The paper on Common Pile v0.1 is here.
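For developers who want to kick the tyres, the corpus is published on the Hugging Face Hub, which means it can be streamed with the standard `datasets` library rather than downloaded wholesale. Below is a minimal sketch; the repository identifier and the `text` field name are assumptions to verify against the Common Pile pages on the Hub.

```python
# Minimal sketch: stream a few Common Pile records instead of fetching
# the full ~8TB corpus. The repository ID and the "text" field are
# assumptions -- check the Common Pile pages on the Hugging Face Hub.
from datasets import load_dataset

stream = load_dataset(
    "common-pile/comma_v0.1_training_dataset",  # hypothetical ID; verify on the Hub
    split="train",
    streaming=True,  # iterate lazily over records rather than downloading
)

for i, record in enumerate(stream):
    print(record["text"][:200])  # peek at the start of each document
    if i >= 2:
        break
```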
Tellingly, and with the honest insight only those involved in such a project would possess, the creators write that "the past several years have seen dozens of lawsuits regarding the use of data in machine learning. These lawsuits have not meaningfully changed data sourcing practices in LLM training, but they have drastically decreased the transparency companies engage in".
Concealment of sources for shady reasons by ballooning AI firms trying to dictate the landscape of their future? We know it, and they know we know, but it's good to see such subterfuge spelled out.
Eleuther has built a test LLM called Comma using the dataset, and it performed favourably against commercial rivals. The creators regard the project as ongoing work with ongoing improvements as the norm, and add that "as the amount of accessible openly-licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve".
Obviously it's important not to regard a single project with good results as the answer to publishing's copyright woes, but the point is that Common Pile disproves the idea that only "more is good", and shows you don't need to base your entire commercial strategy on getting rich off other people's creative labour in order to build GenAI systems. It's another DeepSeek moment, perhaps.
Meanwhile the publishing industry (and users of AI too, let's not forget, many of whom feel a bit grubby trying out some of the best-known systems) has ever more options on its side of the AI fence: smaller fish which eat less and are easier to catch.
ProRata, the AI firm behind the Gist search engine powered by publisher content, has announced its 500th title signed up as a partner. Elsewhere, Taboola - yes - has launched a search/answer engine called DeeperDive, which similarly uses information supplied by publishing partners. More are coming.
A big lie of AI
Aside from any efforts in creating working AI, I'd argue the No.1 priority for today's Big AI incumbents, and in fact their barely-guided but most powerful weapon, has been to use their money and influence to convince legislators and investment funds that they should remain the incumbents. For an industry barely out of its cradle, it all looks a bit like the methods of the slow and bloated monopolies of eras past.
They all took a leaf out of the Google playbook to insist and lobby that they know best, that only their current methods to create AI models will succeed, that stealing content is the only way to do it, and that only more content and more investment money will keep the balloon inflated. If it deflates even a bit, well, it's like ʻOumuamua all over again and AI in its entirety will slip by like an uncaptured asteroid. That's the FOMO behind the political (in)decisions.
I don't know enough about datacentre technology to know whether allowing those incumbents to put datacentres in space, or to become their own nuclear powers to run them, is either necessary or wise. But I do know such plans sound damn impressive, and that seems to be most of the job in the bigger-is-better game.
The only gravel in the gearbox is that the likes of DeepSeek have shown there are alternatives in how models are built. That the likes of EleutherAI show there are alternatives in where content is sourced. And that there are already alternatives to the overnight monopolies fighting over tomorrow's earnings.
So, all hail the little guys, the little firms with the little nets, and a little bit of ingenuity swimming against the tide in the seemingly tired and bloated world of Big AI. When it comes to saving publishing, maybe littler is best.
How does Glide Publishing Platform work for you?
No matter where you are on your CMS journey, we're here to help. Want more info or to see Glide Publishing Platform in action? We got you.
Book a demo