Mr. Chatterbox is a Victorian-era ethically trained model
- One thing I think would be very useful here is national archive data: there must be thousands of letters, memos, and official documents exchanged between people alive back then, now under the care of museums and governments.
One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.
We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]; a minimal sketch of what such a pipeline could look like follows the reference below. This is going to be transformative for historians eager to dive into millions of interesting primary sources.
[1] https://generativehistory.substack.com/p/gemini-3-solves-han...
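Here is that sketch, assuming an OpenAI-style vision API; the model name, prompt, and folder layout are illustrative assumptions, not the process the linked post describes:

```python
# Sketch: transcribing photographed archive pages with a vision-capable LLM.
# Model name, prompt, and paths are illustrative assumptions.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def transcribe(image_path: Path) -> str:
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this document verbatim, preserving line breaks."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

out = Path("transcripts")
out.mkdir(exist_ok=True)
for page in sorted(Path("photos").glob("*.jpg")):
    (out / f"{page.stem}.txt").write_text(transcribe(page))
```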
- Prior art: https://news.ycombinator.com/item?id=46590280
>TimeCapsuleLLM: LLM trained only on data from 1800-1875
- I'd missed this when I first published my post but it turns out Trip had a much more detailed write-up of the project here: https://www.estragon.news/mr-chatterbox-or-the-modern-promet...
- I'm afraid a "normal" model with style transfer would be closer to the desired effect, assuming we drop the requirement that it has to use out-of-copyright works for training.
Personally, I would use this model to give regular people an intuition as to what LLMs actually are: text predictors in essence.
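A minimal sketch of that intuition, using the Hugging Face transformers library; the model name "gpt2" is a stand-in for any small causal LM, not Mr. Chatterbox itself:

```python
# Sketch: showing that an LLM is, at bottom, a next-token ranker.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")    # "gpt2" is just an example
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The weather in London is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]     # scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    # The model literally ranks candidate next tokens by probability.
    print(f"{tok.decode(int(idx))!r}: {p.item():.3f}")
```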
- I am sure the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life + 70 copyright, even for books published before the copyright extensions (and when the EU extended copyright to life + 70, out-of-copyright works were brought back into copyright). For example, Shaw's works only came out of copyright at the end of 2020. There are probably a few works by younger or longer-lived authors that are still in copyright.
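For illustration, the commenter's rule of thumb reduces to a one-line check (a sketch: UK/EU copyright runs to the end of the 70th calendar year after the author's death):

```python
def in_uk_public_domain(author_death_year: int, year: int = 2026) -> bool:
    """Life + 70: the work clears copyright once 70 full calendar
    years have passed since the year the author died."""
    return year > author_death_year + 70

# George Bernard Shaw died in 1950, so his works cleared at the end of 2020:
print(in_uk_public_domain(1950, year=2020))  # False (still in copyright)
print(in_uk_public_domain(1950, year=2021))  # True  (public domain)
```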
- The hard turn from this:
> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.
To this:
> I got Claude Code to do most of the work
Gave me whiplash
by bossyTeacher
0 subcomments
- Prompt: do you know what america is?
Response: Indeed! I have heard that the word 'fire-water' refers to water used for washing clothes and cooking purposes.
- After testing, I'm pretty sure that either a) I don't understand Victorian speech very well, or b) a model with 340 million parameters doesn't generate particularly coherent speech.
by lovelearning
7 subcomments
- I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.
by no-name-here
0 subcomments
- Also see the post yesterday by simonw on this: https://news.ycombinator.com/item?id=47575062
- You could try these techniques to get over the data sparsity.
https://qlabs.sh/10x
It’s really cool; I’d love to see it get smarter.
- Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.
by gen6acd60af
0 subcomments
>Honestly, it’s pretty terrible.
>But what a fun project!
- I wonder if you could generate synthetic Victorian-era training data.
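One hedged way that could look, assuming an OpenAI-style client; the model name and prompts are illustrative, and the generator itself would of course not be "ethically trained" in the post's sense:

```python
# Sketch: generating pseudo-Victorian training text with a modern LLM.
# Client, model name, and prompts are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You write short passages in the style of 1860s British prose: "
    "formal diction, long periodic sentences, no anachronisms."
)

def synth_passage(topic: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-tuned model would do
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Write about 150 words on: {topic}"},
        ],
    )
    return resp.choices[0].message.content

print(synth_passage("the arrival of the railway in a market town"))
```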
- I say, those chat logs read like Wodehouse.
by voidUpdate
3 subcomments
- It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM.