Could OpenAI be forced to wipe its training data?

By: Rob Corbidge, 17 August 2023

The gloves could come off as the NYT prepares to get militant over its content being used by what it clearly sees as a competitor

Raising the prospect of OpenAI having to scrub its training data clean of content it has taken without permission, lawyers for the New York Times are exploring the possibility of taking the ChatGPT company to court to protect the NYT's intellectual property rights, NPR has reported.

The key thing that has occurred is that the NYT obviously considers OpenAI a competitor. They are not being sucked in by the "if you don't participate you won't benefit" talk. 

If you can recall the full-on blizzard-of-change days, when the future of publishing was what happened next week, then you'll remember the NYT adopting a maxim of "we will survive this if no one else does" at the boardroom level. It's served them well.

This NPR leak is likely a negotiating tactic of course. No one really wants to go to litigation, but if it's necessary then the NYT are showing OpenAI they have options in their armoury, and likely more friends in Washington too.

As the well-sourced NPR report says: "For weeks, The Times and the maker of ChatGPT have been locked in tense negotiations over reaching a licensing deal in which OpenAI would pay The Times for incorporating its stories in the tech company's AI tools, but the discussions have become so contentious that the paper is now considering legal action."

Will this be the seismic battle between those who create content and those who are looking to create content for free on the back of someone else's content, and then sell it back to them? Can the Generative AI juggernaut be made to live inside boundaries that are beneficial to the content creators and owners that the technology requires to thrive?

Paraphrasing its source, NPR reported that chief among the concerns senior NYT figures have is that ChatGPT is becoming a direct competitor with them by generating text that "answers questions based on the original reporting and writing of the paper's staff".

The prospect of OpenAI having to delete its dataset is also raised, because if "OpenAI is found to have violated any copyrights in this process, federal law allows for the infringing articles to be destroyed at the end of the case".

What happens in the US is under US law of course, but any action by the NYT should have wider consequences, at the very least by boosting the morale of other publishers and demonstrating that the correct way to think about the Generative AI businesses is that they are rivals.

It's worth noting that the draft EU legislation on Generative AI proposes "publishing summaries of copyrighted data used for training".

The legal threat from the NYT comes in the same week that Google has detailed more SGE features, essentially next-generation AI-assisted search. Key among those is an article summary feature that allows users, once on a page, to press a button to get a summary of the content. It's designed to work "only on articles that are freely available to the public on the web".

It's their summary. Not your summary. A publisher's summary is designed to complement the content; Google's looks like replacing it.

Is there some dwell time diminution competition I'm not aware of? I can't help but feel Google just want to be seen doing something. The example Google used was a key-points summary of an article about the famed American highway, Route 66. SGE delivered a wonderfully short set of bullet points.

Yet Route 66 is now largely a tourist route, a route to be taken to enjoy the journey and scenery, and dwell on the automobile's powerful place in US history. Why would you want such a summary unless you're a professional bore, or are hoping to win the prize ham at next week's pub quiz?

By your theoretical use cases so shall you know your product?