Internal Documents Reveal Meta Employees Talked About Using Copyrighted Material For AI Training

Internal Discussions and Copyright Issues in AI Model Training at Meta

Unveiling Internal Chats

Recent court documents reveal that Meta employees have had ongoing internal conversations about using copyrighted materials to train the company’s AI models. The documents come from the ongoing case Kadrey v. Meta, one of a series of copyright disputes in the United States centered on artificial intelligence. The plaintiffs, who include well-known authors Sarah Silverman and Ta-Nehisi Coates, dispute Meta’s claim that training on copyrighted works, particularly books, falls under "fair use."

Strategic Decisions by Meta

Previously submitted documents indicated that CEO Mark Zuckerberg approved the AI team’s use of copyrighted content. Furthermore, it appears that Meta paused discussions about licensing agreements with book publishers, possibly to sidestep legal concerns. New filings include snippets from internal communications that shed light on Meta’s approach to using copyrighted data for its Llama AI models.

In one significant chat, summarizing a discussion from February 2023, a Meta engineer, Xavier Martinet, advocated acquiring books directly rather than negotiating licenses through traditional means. Martinet suggested that trying to secure permission beforehand could slow down the process, stating, “ask forgiveness, not permission.” He speculated that many startups likely use pirated books, implying that unauthorized sources might be a viable option for Meta as well.

Discussions on Licensing and Legal Approvals

Melanie Kambadur, a senior manager on the Llama team, emphasized that while any usage of publicly available data still necessitates approvals, the likelihood of securing those approvals had increased due to the company’s larger budget and more efficient legal support.

“Yeah, we definitely need to get licenses or approvals on publicly available data still,” she remarked during the chats, pointing out that the legal team was becoming more lenient with their evaluations.

Potential Use of Libgen

In another internal discussion, Kambadur suggested looking into Libgen, a well-known site that aggregates links to copyrighted works. Libgen has faced numerous legal challenges and fines for copyright infringement. Some within Meta seemed to believe that not utilizing Libgen could place the company at a disadvantage in the highly competitive AI landscape.

Sony Theakanath, director of product management at Meta, characterized access to Libgen as essential for achieving top performance against rival AI models. He also stated that, to mitigate legal risk, Meta would not publicly acknowledge its use of Libgen data.

Navigating Legal Risks

Meta’s internal strategy included the potential removal of any clearly marked pirated content from the data sourced from Libgen. To further shield the company from litigation risk, employees discussed tuning the models to avoid answering prompts that could lead to legal issues.

Additionally, there are indications that Meta may have sourced data from Reddit, an action that raises ethical concerns regarding data privacy and copyrights, especially as Reddit has indicated intentions to charge companies for similar access in the future.

Changing Legal Perspectives on Data Usage

Recent communications also show Meta’s leadership contemplating reversing previous decisions related to the use of certain types of licensed content, like Quora posts or scientific articles. Chaya Nayak, director of product management for Meta’s generative AI organization, emphasized the necessity of expanding their training data to ensure competitive performance.

Case Developments

The complaint in Kadrey v. Meta has been amended multiple times. The latest filings suggest that Meta is under scrutiny for possibly comparing pirated books to licensed books to determine whether pursuing a licensing agreement would be worthwhile. To strengthen its legal defense, Meta has recruited two high-profile Supreme Court litigators, reflecting the seriousness of the allegations against the company.

As this case progresses, it highlights a pivotal moment at the intersection of AI development and copyright law, with implications not only for Meta but also for the broader tech industry navigating similar challenges.
