OpenAI and Microsoft face major copyright suit from nearly 400 local newspapers

A coalition of nearly 400 local and regional newspapers has filed a copyright lawsuit against OpenAI and Microsoft, alleging the companies scraped and ingested their content without permission or compensation to train AI models. For AI builders, the case is the latest sign that legal scrutiny of training data is growing and licensing expectations are tightening, which could reshape how developers source data for model training.

What happened

Publishers that collectively own and operate nearly 400 newspapers filed the lawsuit in federal court in New York, accusing OpenAI and Microsoft of systematically scraping, copying, and ingesting copyrighted news articles to train their AI models. The complaint alleges that Microsoft "secretly crawled" websites, including content behind paywalls, and copied the information to its own servers. The coalition is represented by former New Jersey Attorney General Matt Platkin. This lawsuit adds to a growing wave of copyright infringement lawsuits against OpenAI from publishers such as The New York Times, Ziff Davis, Merriam-Webster, and Encyclopedia Britannica.

Why AI builders should care

The case directly challenges the legality of using news content scraped from the web, including articles behind paywalls, as training data without explicit licensing. If the court finds that scraping and copying these articles infringes copyright, it could set a precedent that forces AI companies to negotiate licensing agreements with publishers before using their content. For builders relying on large language models trained on web-scale data, this could mean reduced access to certain types of content or increased costs for licensed training data.

Practical implications

AI developers may need to audit their training data pipelines to confirm that they exclude content from publishers that have not granted permission. The lawsuit could accelerate the shift toward licensed data sources, similar to how OpenAI has already signed deals with some publishers. Builders should monitor how this case influences the availability of news content in models like ChatGPT and Microsoft Copilot. If the plaintiffs prevail, the defendants could face financial penalties, and court-ordered data removal could affect model performance or require retraining.
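As a rough illustration of what such an audit could look like, the Python sketch below splits a set of training records into kept and flagged groups based on each record's source domain. Everything in it is an assumption for illustration: the domain names, the `url` field, and the function names are hypothetical, and a real pipeline would draw its list from actual licensing records rather than a hard-coded set.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Hypothetical list of domains whose publishers have not granted permission.
UNLICENSED_DOMAINS = {"example-news.com", "dailyexample.org"}


def source_domain(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blocked(url: str, blocked: set[str]) -> bool:
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = source_domain(url)
    return any(host == d or host.endswith("." + d) for d in blocked)


def audit(records: list[dict], blocked: set[str]) -> tuple[list[dict], list[dict]]:
    """Split records into (kept, flagged) using each record's 'url' field."""
    kept, flagged = [], []
    for record in records:
        (flagged if is_blocked(record["url"], blocked) else kept).append(record)
    return kept, flagged


# Made-up sample records; the 'url' and 'text' fields are assumed.
records = [
    {"url": "https://www.example-news.com/story/1", "text": "sample text"},
    {"url": "https://blog.example.net/post", "text": "sample text"},
    {"url": "https://local.dailyexample.org/a", "text": "sample text"},
]
kept, flagged = audit(records, UNLICENSED_DOMAINS)
print(len(kept), len(flagged))
```

Matching on the hostname suffix catches subdomains without flagging unrelated domains that merely contain a blocked name. Records whose URL lacks a scheme yield no hostname and are kept by this sketch, so a real audit would likely want to flag those for manual review instead.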

Caveats

This is an early-stage lawsuit, and legal outcomes are uncertain. The case may take years to resolve, and the specific claims may be narrowed or dismissed. Different jurisdictions may interpret copyright law differently, and the defendants will likely argue that their use of publicly available content constitutes fair use. The details of the complaint and any counterarguments are still developing.

FAQs

What are the allegations against OpenAI and Microsoft regarding training data?

The publishers allege that OpenAI and Microsoft scraped, copied, and ingested their copyrighted news articles, including content behind paywalls, without permission or compensation to train AI models.

How many newspapers are involved in the lawsuit and who represents them?

Nearly 400 local and regional newspapers are part of the coalition, which is represented by former New Jersey Attorney General Matt Platkin.

What legal issues are at stake in the publishers' copyright claims?

The core issues include copyright infringement, the legality of scraping paywalled content, and whether AI training requires explicit licensing from content creators.

Could this affect licensing terms or financial penalties for AI training?

Yes. A ruling against OpenAI and Microsoft could establish licensing requirements for training data and result in financial damages, which would influence how other AI companies source data.

Which other publishers have filed similar lawsuits?

Other publishers that have sued OpenAI include The New York Times, Ziff Davis, Merriam-Webster, and Encyclopedia Britannica.

What does this mean for AI models like Copilot or ChatGPT in terms of training data usage?

The lawsuit highlights ongoing uncertainty around training data usage. If the case restricts access to news content, models may have less up-to-date information or require licensed data feeds, potentially affecting their utility for tasks like news summarization or fact-checking.