> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use
> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"
It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.
https://investors.autodesk.com/news-releases/news-release-de...
https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...
Funky quote:
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
Stealing is stealing. Let's stop with the double standards.
Sayi "they have the money" is not an argument. It's about the amount of effort that is needed to individually buy, scan, process millions of pages. If that's done for you, why re-do it all?
We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?
You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improving book scanning for training.
This could be reduced to a single act of book destruction per copyrighted work, or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers, e.g. people who own a physical copy or a license to one. Obviously, the implementation could get complex, but we wouldn't have to destroy books very often.
Someone correct me if I am wrong, but aren't these works being digitized and transformed in a way to make a profit off of the information included in them?
It would be one thing for an individual to make personal use of one or more books, but you'd have to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model clearly goes against what copyright stands for.
Which of the following are true?
(a) the legal industry is susceptible to influence and corruption
(b) engineers don't understand how to legally interpret legal text
(c) AI tech is new, and judges aren't technically qualified to decide these scenarios
The most likely option is (c), as we've seen this pattern many times before.
> Alsup detailed Anthropic's training process with books: The OpenAI rival spent "many millions of dollars" buying used print books, which the company or its vendors then stripped of their bindings, cut the pages, and scanned into digital files.
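Mechanically, the step after the cutting is plain bulk OCR. Here's a minimal sketch of that digitize step, assuming the Tesseract engine plus the pytesseract and Pillow packages are installed; the pages/ directory and the *.png naming are hypothetical:

```python
# Minimal sketch: turn a directory of scanned page images into text.
# Assumes Tesseract, pytesseract, and Pillow are installed; the
# "pages" directory and *.png naming are hypothetical.
from pathlib import Path

import pytesseract
from PIL import Image

def digitize(scan_dir: str) -> str:
    """OCR every scanned page image, in order, into one text blob."""
    pages = sorted(Path(scan_dir).glob("*.png"))
    return "\n".join(pytesseract.image_to_string(Image.open(p)) for p in pages)

print(digitize("pages"))
```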
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.
> Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said.
A not-so-subtle difference.
That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.
Or is it perhaps not a universal cultural/moral aspect?
I guess, for example, people in Europe could be more sensitive to it.
https://ia800101.us.archive.org/15/items/gov.uscourts.cand.4...
Right, guys? We don't have "rules for thee but not for me" in the land of the free?
Some previous discussions:
https://news.ycombinator.com/item?id=44367850
Also, please don't use the word "learning"; use "creating software using copyrighted materials".
Also, let's think together: if the law doesn't work, how can we prevent AI companies from using our work through technical measures?
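For what it's worth, one blunt technical measure is to refuse to serve self-identified AI crawlers at all. Here's a minimal sketch using Flask; GPTBot, ClaudeBot, and CCBot are crawler User-Agent names their operators publish, while the app itself is a hypothetical stand-in for your site:

```python
# Minimal sketch: refuse requests whose User-Agent matches a known
# AI-training crawler. The bot names are published crawler identifiers;
# the Flask app is a hypothetical stand-in for a real site.
from flask import Flask, abort, request

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot")

app = Flask(__name__)

@app.before_request
def refuse_ai_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(bot in ua for bot in AI_CRAWLERS):
        abort(403)  # don't serve the page at all

@app.route("/")
def index():
    return "content for human readers"
```

Of course, this only stops crawlers that identify themselves honestly; a scraper sending a browser User-Agent walks right past it, which is exactly why people keep reaching for the law instead.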
If I didn’t license all the books I trained on, am I not depriving the publisher of revenue, given people will pay me for the AI instead of buying the book?
As long as you buy the book, it should still be legal; that is, if you actually buy the book and not a "read only" eBook.
But the 7_000_000 pirated books are a huge issue, and one that we have a lot of reason to believe isn't specific to Anthropic.
But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only a human can do that. And LLM developers are not aspiring to become writers and do not learn anything; they just want to profit by making software using copyrighted material. Also, people do not read millions of books to become a writer.
2020s: (Steals a bunch of books to profit off the acquired knowledge.)
"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling
If they committed piracy 7 million times and the minimum statutory fine for each instance is $750, then the law says Anthropic is liable for $5.25 billion. I just want it out there that they definitely broke the law and that the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens we at least can't pretend we didn't know.
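For scale, the back-of-envelope arithmetic, taking the 7 million figure from the case and the statutory range of $750 to $150,000 per willfully infringed work from 17 U.S.C. § 504(c):

```python
# Back-of-envelope statutory damages: $750 floor to $150,000 willful
# ceiling per infringed work (17 U.S.C. § 504(c)), times 7M works.
works = 7_000_000
print(f"minimum:         ${works * 750:>16,}")      # $5,250,000,000
print(f"willful maximum: ${works * 150_000:>16,}")  # $1,050,000,000,000
```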
Can anyone make a compelling argument that any of these AI companies have the public's best interest in mind (alignment/superalignment)?
Ensure the models are open source, so everyone can use them, as everyone's data is in there?
Close those companies and force them to delete the models, as they used copyrighted material?
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to start making claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish. Especially when the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM[0]. Analyzing that amount of data is just intractable, and we currently do not have the mathematical tools to do so. But this is a much harder problem to crack when we're just conjecturing, and ultimately this makes interpretability more difficult.
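To make the contamination point concrete, here is the naive check one would like to run over the training set, sketched against a toy in-memory corpus; even this crude exact-match test is what becomes intractable at "everything on the internet" scale:

```python
# Naive contamination check: flag a benchmark item whose word n-grams
# appear verbatim in the training corpus. Toy in-memory version; the
# whole point is that this does not scale to a web-sized corpus.
def ngrams(text: str, n: int = 8) -> set:
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(test_item: str, train_docs: list, n: int = 8) -> bool:
    probe = ngrams(test_item, n)
    return any(probe & ngrams(doc, n) for doc in train_docs)
```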
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt: "Steal everything, then if you get big, let the lawyers figure it out."[1] This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the, if not the, richest companies to ever exist.
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion dollar company, 2020 when it became the first two trillion dollar company, and 2022 when it became the first three trillion dollar company.[2] Now we have 10 companies north of the trillion dollar mark![3] (5 above $2T and 3 above $3T) These values have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better. To at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker some deal for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement), and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There are always moral gray areas, but is this really one of them? I even say this as someone who has torrented books myself![4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even that controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...
[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...
[3] https://companiesmarketcap.com/
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
> In fact this business was the ultimate in deconstruction: First one and then the other would pull books off the racks and toss them into the shredder's maw. The maintenance labels made calm phrases of the horror: The raging maw was a "NaviCloud custom debinder." The fabric tunnel that stretched out behind it was a "camera tunnel...." The shredded fragments of books and magazines flew down the tunnel like leaves in a tornado, twisting and tumbling. The inside of the fabric was stitched with thousands of tiny cameras. The shreds were being photographed again and again, from every angle and orientation, till finally the torn leaves dropped into a bin just in front of Robert. Rescued data. BRRRRAP! The monster advanced another foot into the stacks, leaving another foot of empty shelves behind it.
Against companies like Elsevier locking up the world's knowledge.
Authors are no different from scientists; many had government funding at one point, and it's the publishing companies that got most of the sales.
You can disagree and think Aaron Swartz was evil, but you can't have it both ways.
You can take what Anthropic has shown you is possible and do this yourself now.
isohunt: freedom of information
If I were China, I would buy every lawyer to drown western AI companies in lawsuits, because it's an easy way to win the AI race.
I.e., this is not a big deal. The only difference now is that people are rapidly frothing to be outraged at the mere sniff of new tech on the horizon. Overton window in effect.