Research Indicates OpenAI is Taking Initiative on Copyright Matters

OpenAI Accused of Copyright Infringement: O’Reilly Media Takes a Stand

Background of the Issue

Tim O’Reilly, founder and CEO of O’Reilly Media, alleges that OpenAI used his company’s copyrighted books without permission to train its GPT-4o model. The accusation arrives amid a growing number of lawsuits against OpenAI over the unauthorized use of copyrighted material in training its AI models, including those behind ChatGPT. OpenAI has denied these claims, asserting that it has not violated any copyright laws.

The Study on AI Training Data

O’Reilly, together with researchers Sruly Rosenblat and Ilan Strauss, authored a study titled “Beyond Public Access in LLM Pre-Training Data: Non-public book content in OpenAI’s Models,” released by the AI Disclosures Project. The study examined whether OpenAI’s models had been trained, without consent, on 34 copyrighted O’Reilly Media books that are primarily available behind paid access.

What is Non-Public Content?

In the context of this research, non-public content refers to material that is behind a paywall or otherwise not available for free. The study excludes content accessible on websites that engage in the illegal distribution of copyrighted material.

Methodology Used in the Research

To investigate the claims, the researchers employed a technique called a DE-COP membership inference attack. The method poses a series of multiple-choice questions to the model: each question presents a verbatim passage from an O’Reilly book alongside several machine-generated paraphrases of it, and asks the model to identify the original.

If the model reliably picks out the verbatim passages, that suggests it may have seen, and partially memorized, the text during training.
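As a rough illustration of the idea (not the study's actual implementation), a DE-COP-style trial can be sketched as follows. Here `ask_model` is a hypothetical stand-in for querying an LLM; it takes a list of candidate passages and returns the index of the one the model believes is the original.

```python
import random

def decop_trial(ask_model, verbatim, paraphrases):
    """One trial: shuffle the verbatim passage in with its paraphrases
    and check whether the model singles it out."""
    options = paraphrases + [verbatim]
    random.shuffle(options)
    return options[ask_model(options)] == verbatim

def guess_rate(ask_model, trials):
    """Fraction of trials in which the model identified the verbatim text.
    With k options per question, rates well above 1/k hint that the model
    has memorized the passages."""
    hits = sum(decop_trial(ask_model, v, ps) for v, ps in trials)
    return hits / len(trials)

# Toy demonstration with an "oracle" that always recognizes the originals:
originals = {"the quick brown fox", "lorem ipsum dolor"}
oracle = lambda opts: next(i for i, o in enumerate(opts) if o in originals)
trials = [
    ("the quick brown fox", ["a fast tan fox", "a speedy fox", "quick foxes"]),
    ("lorem ipsum dolor", ["placeholder text", "dummy latin", "filler words"]),
]
print(guess_rate(oracle, trials))  # 1.0
```

A model with no exposure to the books should guess at roughly chance level; the gap between its guess rate and chance is what the attack measures.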

Measuring Results

The researchers computed an AUROC (Area Under the Receiver Operating Characteristic curve) score to quantify how reliably each model distinguished the O’Reilly passages from unseen material. Higher scores suggest the model was trained on the copyrighted text, while scores near 50% are consistent with no exposure to that data.
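AUROC has a simple probabilistic reading: it is the chance that a randomly chosen "member" passage receives a higher recognition score than a randomly chosen non-member passage, with ties counted as half. A minimal sketch (illustrative only, with made-up scores, not the study's data):

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison:
    the probability that a positive (in-training) score outranks
    a negative (unseen) score; ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Perfect separation -> 1.0; indistinguishable scores -> 0.5 (chance).
print(auroc([0.9, 0.8, 0.7], [0.4, 0.3, 0.2]))  # 1.0
print(auroc([0.5, 0.5], [0.5, 0.5]))            # 0.5
```

This is why a score just above 50%, like GPT-3.5's, reads as weak evidence, while GPT-4o's reported 82% is a much stronger signal.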

Findings from the Testing

The research included tests on various versions of OpenAI’s models, including GPT-3.5 Turbo, GPT-4o Mini, and GPT-4o. The results were mixed. Notably, the GPT-4o model, which was released in May 2024, achieved an AUROC score of 82%. This score implies a strong chance that it was indeed trained on O’Reilly’s books. The researchers speculated that OpenAI may have utilized the LibGen database, known for hosting copyrighted books, to train its model.

By comparison, the older GPT-3.5 Turbo scored just above 50%, suggesting minimal training on O’Reilly content. The smaller GPT-4o Mini did not appear to have been trained on O’Reilly’s texts at all, which the researchers speculated might reflect its smaller parameter count reducing its capacity to memorize training data.

Implications of the Research

The authors emphasized the AI industry’s growing reliance on non-public data for model training. They called for greater transparency about the sources of pre-training data and for formal licensing frameworks governing the use of content in AI models. Without fair compensation for content creators, they warned, the quality and diversity of material published online could decline, eroding the overall value of the internet as a resource.

Industry Responses and Legal Developments

As AI companies like OpenAI navigate these legal challenges, they are increasingly striking licensing agreements with publishers and social networks. OpenAI, for instance, has signed deals with Reddit and Time for access to their content and archives. At the same time, OpenAI has lobbied for changes to copyright law, urging the U.S. government to loosen restrictions it argues hinder innovation.

Simultaneously, some companies are implementing protective measures for copyrighted content. For example, Cloudflare has introduced an AI system designed to block unauthorized scrapers attempting to access and misuse content.

This situation highlights the ongoing dialogue between content creators, AI companies, and policymakers regarding the future of copyright in the age of artificial intelligence.
