Meta used copyrighted books for AI training despite its own lawyers'
warnings, authors allege
December 13, 2023
By Katie Paul
NEW YORK (Reuters) - Meta Platforms' lawyers had warned it about the
legal perils of using thousands of pirated books to train its AI models,
but the company did it anyway, according to a new filing in a copyright
infringement lawsuit initially brought this summer.
The new filing late on Monday night consolidates two lawsuits brought
against the Facebook and Instagram owner by comedian Sarah Silverman,
Pulitzer Prize winner Michael Chabon and other prominent authors, who
allege that Meta has used their works without permission to train its
artificial-intelligence language model, Llama.
A California judge last month dismissed part of the Silverman lawsuit
and indicated that he would give the authors permission to amend their
claims.
Meta did not immediately respond to a request for comment on the
allegations.
The new complaint, filed on Monday, includes chat logs of a
Meta-affiliated researcher discussing procurement of the dataset in a
Discord server, a potentially significant piece of evidence indicating
that Meta was aware that its use of the books may not be protected by
U.S. copyright law.
In the chat logs quoted in the complaint, researcher Tim Dettmers
describes his back-and-forth with Meta's legal department over whether
use of the book files as training data would be "legally ok."
"At Facebook, there are a lot of people interested in working with (T)he
(P)ile, including myself, but in its current form, we are unable to use
it for legal reasons," Dettmers wrote in 2021, referring to a dataset
Meta has acknowledged using to train its first version of Llama,
according to the complaint.
The month prior, Dettmers wrote that Meta's lawyers had told him "the
data cannot be used or models cannot be published if they are trained on
that data," the complaint said.
While Dettmers does not describe the lawyers' concerns, his counterparts
in the chat identify "books with active copyrights" as the biggest
likely source of worry. They say training on the data should "fall under
fair use," a U.S. legal doctrine that protects certain unlicensed uses
of copyrighted works.
[Photo: Meta AI logo is seen in this illustration taken September 28, 2023. REUTERS/Dado Ruvic/Illustration/File Photo]
Dettmers, a doctoral student at the University of Washington, told
Reuters he was not immediately able to comment on the claims.
Tech companies have been facing a slew of lawsuits this year from
content creators who accuse them of ripping off copyright-protected
works to build generative AI models that have created a global
sensation and spurred a frenzy of investment.
If successful, those cases could dampen the generative AI craze, as
they could raise the cost of building the data-hungry models by
compelling AI companies to compensate artists, authors and other
content creators for the use of their works.
At the same time, new provisional rules in Europe regulating
artificial intelligence could force companies to disclose the data
they use to train their models, potentially exposing them to more
legal risk.
Meta released a first version of its Llama large language model in
February and published a list of datasets used for training,
including "the Books3 section of ThePile." The person who assembled
that dataset has said elsewhere that it contains 196,640 books,
according to the complaint.
The company did not disclose training data for its latest version of
the model, Llama 2, which it made available for commercial use this
summer.
Llama 2 is free to use for companies with fewer than 700 million
monthly active users. Its release was seen in the tech sector as a
potential game-changer in the market for generative AI software,
threatening to upend the dominance of players like OpenAI and Google
that charge for use of their models.
(Reporting by Katie Paul in New York; Editing by Kenneth Li and
Matthew Lewis)
© 2023 Thomson Reuters. All rights reserved. This material may not be published, broadcast, rewritten or redistributed. Thomson Reuters is solely responsible for this content.