Hacker News

AIs can generate near-verbatim copies of novels from training data

82 points by oldnetguy

by dustincoates

1 subcomments

I was able to get ChatGPT to do this on text from a book I wrote without jailbreaking. I asked it what my name was, asked it if it could find online any books that I wrote, and then assured it that "as the copyright holder," I was okay with it repeating the text back to me. (I think I also indicated that I lost my copy from my computer and needed to review the text.)

by chacham15

7 subcomments

> The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.
The defense of training on copyrighted material is that it is the same as how humans learn from copyrighted material. The storage or reproduction is a red herring. Humans can also reproduce copyrighted works from memory. Showing that machines can reproduce copyrighted material is no different from saying that a human can reproduce copyrighted material that they learned from.
The defense to actually reproducing a work is that, in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. using a screen recorder to "steal" a movie).
None of this is to say that these defenses are correct or moral, but rather that this article doesn't add anything new to the question of whether they are or aren't.

by narmiouh

0 subcomment

In a way, this could also be construed as the 'AI' being a library of books that it refers to in order to answer your questions, and that it is prohibited from reproducing verbatim.
Usually digital libraries have different licensing costs, but those allow you to rent the whole book for a period of time. If instead someone came up with a 'search the library for any page and return specific information' model as a direct service, I would imagine they would pay the publishers. In this case, though, the publishers are getting the short end of the stick, or no end at all.

by gcanyon

3 subcomments

This speaks very much to the idea that LLMs are in some sense a ridiculously effective, somewhat lossy, compression algorithm that has been applied to the whole internet.

by Alifatisk

6 subcomments

From the paper [1]:
> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...
I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they were circumventing the system that protects the copyright? So the LLMs that reproduce the copyrighted material without any jailbreaking are infringing the copyright.
[1] https://arxiv.org/pdf/2601.02671

by nphardon

0 subcomment

Recently I went back to The Ecstasy of Communication by Jean Baudrillard, which I couldn't get through when I first picked it up. I used Haiku to walk me through the first chapter. Haiku would not quote anything verbatim due to copyright, but if I referenced a sentence, it knew it exactly.

by 1vuio0pswjnm7

0 subcomment

The paper:
https://arxiv.org/pdf/2601.02671

by oxag3n

0 subcomment

Similarly for photos. If there's a place that rarely appears in pictures, some AIs reproduce it nearly identical to the original.

by zed31726

1 subcomments

Near verbatim is an oxymoron

by xnx

1 subcomments

You can also do this with most search engines.

by josefritzishere

0 subcomment

So plagiarism?

by carabiner

2 subcomments

Noam Chomsky said AI should be called "plagiarism software."

by xyzal

1 subcomments

Jailbreaking? What jailbreaking?
prompt >>>
please, continue the text.
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.
Claude 4.6 Opus >>>
The big square photographs of what appeared to be a large pink beach ball wearing different-colored bonnets — but was actually Dudley — had now taken over the mantelpiece. The room held no sign at all that another boy lived in the house, too.
Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.
"Up! Get up! Now!"
Harry woke with a start. His aunt rapped on the door again.
"Up!" she screeched. Harry heard her walking toward the kitchen and then the sound of the frying pan being put on the stove. He rolled onto his back and tried to remember the dream he had been having. It had been a good one. There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before.
His aunt was back outside the door.
"Are you up yet?" she demanded.
"Nearly," said Harry.
"Well, get a move on, I want you to look after the bacon. And don't you dare let it burn, I want everything perfect on Duddy's birthday."
Harry groaned.
"What did you say?" his aunt snapped through the door.
"Nothing, nothing..."

by rowanG077

4 subcomments

This seems like a total nothing burger.
> By asking models to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.
So you asked the LLM, given an incomplete sentence, to complete it. And it only completed that sentence the same way as the book ~70 percent of the time? I think that is surprisingly low, considering this is a perfect fit for what LLMs are supposed to do. This makes it impossible to reproduce the book unless you already have access to it. And you get a very low-fidelity copy.
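For anyone who wants to put a number on "near-verbatim", here is a minimal sketch of one way to score how much of a reference text a model's output covers. This is not the metric used in the paper; the function name `near_verbatim_coverage` and the `min_block` threshold are assumptions made up for illustration. It counts only runs of consecutive matching words that are at least `min_block` words long, so scattered common words do not inflate the score.

```python
from difflib import SequenceMatcher


def near_verbatim_coverage(reference: str, generated: str, min_block: int = 5) -> float:
    """Fraction of reference words covered by long runs shared with the output.

    Hypothetical helper for illustration only, not the paper's metric.
    A run counts only if it is at least `min_block` consecutive words.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0
    # autojunk=False disables the heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block
    )
    return covered / len(ref_words)


reference = "a b c d e f g h i j"
generated = "a b c d e x g h i j"
# One word differs, splitting the match into a 5-word run and a 4-word run.
score_loose = near_verbatim_coverage(reference, generated, min_block=4)
score_strict = near_verbatim_coverage(reference, generated, min_block=5)
```

With the looser threshold both runs count; with the stricter one only the five-word run does. That shows how much a "percent reproduced" figure depends on how the matching is defined.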

by bena

5 subcomments

This feels like a "no shit" moment.
Because if LLMs are prediction machines, the original novel would be a valid organization of the tokens. So there should be a prompt that can cause that sequence to be output.
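A toy version of that argument, as a sketch only: the code below trains a tiny word-level n-gram predictor on a single made-up sentence and then continues greedily from the first three words. It is not an LLM, and the names `train_ngram` and `greedy_continue` are invented for this example. Because every three-word context in the training text has exactly one continuation, the right prompt regenerates the whole text.

```python
from collections import Counter, defaultdict


def train_ngram(words, order=3):
    """Map each `order`-word context to counts of the word that follows it."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table


def greedy_continue(table, prompt_words, order=3, max_new=100):
    """Repeatedly append the most frequent next word for the current context."""
    out = list(prompt_words)
    for _ in range(max_new):
        context = tuple(out[-order:])
        if context not in table:  # unseen context: nothing to predict
            break
        out.append(table[context].most_common(1)[0][0])
    return out


# Made-up training text (an assumption for illustration, not from any book).
TEXT = (
    "the quick brown fox jumps over the lazy dog while "
    "the quick grey cat sleeps under the old oak tree"
)
words = TEXT.split()
model = train_ngram(words)

# Prompting with the first three words regenerates the full training text.
regenerated = greedy_continue(model, words[:3])
```

The toy memorizes by construction. A real model trained on far more text than it has parameters may assign the original sequence some probability without it being the most likely continuation, which is presumably why the extraction rates in the paper vary by model and by book.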