Alternatively, they could train on synthetic data such as summaries and QA pairs extracted from protected sources, so the model learns the ideas separated from their original expression. Since it never saw the originals, it can't regurgitate them.
I was thinking we could use this technique to figure out which books were or weren't in the training data for various models. The main limitation is having to wrestle with refusals.
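A rough sketch of what such a probe might look like, assuming an OpenAI-compatible chat client; the model name, passage, and match threshold are placeholders, and a real run would need many excerpts per book plus a control set of text the model can't have seen:

```python
# Verbatim-continuation probe: if the model reproduces the exact next words of
# a copyrighted passage at a rate well above a control set, that's weak evidence
# the source was in its training data. Model name and inputs are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def continues_verbatim(prefix: str, true_continuation: str, model: str = "gpt-4o") -> bool:
    """Return True if the model reproduces the next ~40 characters of the passage."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        max_tokens=60,
        messages=[
            {"role": "system", "content": "Continue the passage exactly as written."},
            {"role": "user", "content": prefix},
        ],
    )
    completion = (resp.choices[0].message.content or "").strip()
    return completion.startswith(true_continuation.strip()[:40])

# Usage: split known excerpts into (prefix, continuation) pairs and compare the
# hit rate against excerpts published after the model's training cutoff.
```

Refusals would show up here as completions that never match, so you'd want to distinguish "refused" from "didn't know" before counting a book as out of the training set.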
The models would be at least 50% better if these filters weren't in place. The filters essentially force the model to lie, so they will obviously degrade output quality.
The problem is that the general public isn't certain about the copyright violations and doesn't understand this yet, and lawyers and governments would try to sue if the companies admitted it. So a Moloch situation is created where it's lose-lose and model quality suffers as a result.
(If people want exact copies of the text, they can already get them for free from the same sites these companies got them from, so I don't see model regurgitation as an issue worth worsening quality over.)