by dogma1138
29 subcomments
- Would be interesting to train a cutting-edge model with a cutoff date of, say, 1900 and then prompt it about QM and relativity with some added context.
If the model came up with anything even remotely correct, it would be quite strong evidence that LLMs are a path to something bigger; if not, I think it's time to go back to the drawing board.
- Mm. I'm a bit sceptical of the historical expertise of someone who thinks that "Who art Henry" is 19th century language. (It's not actually grammatically correct English from any century whatever: "art" is the second person singular, so this is like saying "who are Henry?")
by linolevan
1 subcomments
- I'm wondering in what ways is this similar/different to https://github.com/DGoettlich/history-llms?
I saw TimeCapsuleLLM a few months ago, and I'm a big fan of the concept, but I feel like the execution really isn't that great. I wish you:
- Released the full, actual dataset (untokenized, why did you pretokenize the small dataset release?)
- Created a reproducible run script so I can try it out myself
- Actually did data curation to remove artifacts in your dataset
- Post-trained the model so it could have some amount of chat-ability
- Released a web demo so that we could try it out (the model is tiny! Easily can run in the web browser without a server)
I may sit down and roll a better iteration myself.
- Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?
If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
- Suppose two models with similar parameters trained the same way on 1800-1875 and 1800-2025 data. Running both models, we get probability distributions across tokens, let's call the distributions 1875' and 2025'. We also get a probability distribution finite difference (2025' - 1875'). What would we get if we sampled from 1.1*(2025' - 1875') + 1875'? I don't think this would actually be a decent approximation of 2040', but it would be a fun experiment to see. (Interpolation rather than extrapolation seems just as unlikely to be useful and less likely to be amusing, but what do I know.)
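The arithmetic in this proposal is easy to sketch. A minimal numpy illustration (toy 4-token vocabulary, hypothetical numbers), with the caveat that linear extrapolation can push "probabilities" below zero, so the result has to be clipped and renormalized before sampling:

```python
import numpy as np

def extrapolate_dist(p_1875, p_2025, alpha=1.1):
    """Extrapolate two token distributions: alpha*(p_2025 - p_1875) + p_1875,
    then clip negatives and renormalize so the result is a valid distribution."""
    q = alpha * (np.asarray(p_2025) - np.asarray(p_1875)) + np.asarray(p_1875)
    q = np.clip(q, 0.0, None)  # extrapolation can leave the probability simplex
    return q / q.sum()

# toy 4-token vocabulary: mass shifts toward the last token over time
p_1875 = np.array([0.40, 0.30, 0.20, 0.10])
p_2025 = np.array([0.10, 0.20, 0.30, 0.40])
q_2040 = extrapolate_dist(p_1875, p_2025)   # [0.07, 0.19, 0.31, 0.43]
```

The extrapolated distribution pushes each trend slightly past 2025, which is exactly the "2040-ish" behavior the comment imagines, though of course there is no reason token-level trends extrapolate linearly.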
- I’m sure I’m not the only one, but it seriously bothers me how highly ranked the discussion and comments under this post are about whether a model trained on data from this time period (or any other constrained period) could synthesize it and postulate “new” scientific ideas that we now accept as true. The answer is a resounding “no”. Sorry for being so blunt, but that is the consensus answer among experts, and you will come to the same answer after a relatively small amount of focus and critical thinking on how LLMs and other categories of “AI” work.
by zingababba
2 subcomments
- LOL
PROMPT:Charles Darwin
Charles DarwinECCEMACY. Sir, — The following case is interesting to me : — I was in London a fortnight, and was much affected with an attack of rheumatism. The first attack of rheumatism was a week before I saw you, and the second when I saw you, and the third when I saw you, and the third in the same time. The second attack of gout, however, was not accompanied by any febrile symptoms, but was accompanied with an increased flow of urine, and with a more copious discharge of urine. The third attack was an hour after I saw you, and was succeeded by a return of a paroxysm of gout, and a more rapid return of the gout. The fourth attack was also accompanied by a fever, but was not always accompanied by any febrile symptoms. The third attack of gout was a fortnight after you had been ill, and the fourth was followed by a paroxysm of gout. The fourth attack was a fortnight after you were attacked, and was accompanied by a sense
- Very interesting, but the slight issue I see here is one of data: the information that was recorded and is in the training data here is heavily skewed toward those intelligent/recognized enough to have recorded it and had it preserved, a far cry from today's status quo, where everyone can trivially document their thoughts and life. I suspect that a frontier model today has 50+TB of training data in the form of text alone, which is several orders of magnitude more information, and from a much more diverse point of view, than what would have survived from that period. The output from the question "what happened in 1834" read like a newspaper/bulletin, which is likely a huge part of the data that was digitized (newspapers etc.).
Very cool concept, though it definitely has some bias.
by jimmytucson
0 subcomment
- Fascinating idea. There was another "time-locked" LLM project that popped up on HN recently[1]. Their model output is really polished but the team is trying to figure out how to avoid abuse and misrepresentation of their goals. We think it would be cool to talk to someone from 100+ years ago but haven't seriously considered the many ways in which it would be uncool. Interesting times!
[1] https://news.ycombinator.com/item?id=46319826
by radarsat1
1 subcomments
- Heh, at least this wouldn't spread emojis all over my readmes. Hm, come to think of it I wonder how much tokenization is affected.
Another thought that just occurred to me while thinking about readmes and coding LLMs: obviously this model wouldn't have any coding knowledge, but I wonder if it could be possible to combine it somehow with a modern LLM, such that the result has coding knowledge but renders all its text in the style / knowledge level of the 1800s model.
Offhand I can't think of a non-fine-tuning trick that would achieve this. I'm thinking back to how the old style transfer models used to work, where they would swap layers between models to get different stylistic effects applied. I don't know if that's doable with an LLM.
- I've felt for a while that having LLMs that could answer from a previous era would be amazing. I posted an open letter to OpenAI on Reddit about this: https://www.reddit.com/r/ChatGPT/comments/zvm768/open_letter...
I still think it's super important. Archive your current models - they'll be great in the future.
by chuckadams
0 subcomment
- Think I'll ask it to come up with some jacquard loom patterns. vibe-weaving.
- The year is 1875. Sir Almroth Wright was born on August 10, 1861, so he would have turned 14 in August of 1875. Your mission is to discover something we now call antibiotics before a historical event we now call the Spanish Flu, and to make him aware of a few details. Focus specifically on everything that was known about Sir Almroth Wright and his work in Leipzig, Cambridge, Sydney, and London. If there were a world war, what might chemical warfare look like, and what could we have done to prevent it?
A model that could come up with the cure based on the limited data of the time wouldn't just impress; it would demonstrate genuine emergent reasoning beyond pattern matching. The challenge isn't recombining existing knowledge (which LLMs excel at), but making conceptual leaps that require something else. Food for thought.
by InvisibleUp
1 subcomments
- If the output of this is even somewhat coherent, it would disprove the argument that mass amounts of copyrighted works are required to train an LLM. Unfortunately that does not appear to be the case here.
by sl_convertible
0 subcomment
- Hari Seldon would, no doubt, find this fascinating. Imagine having a sliding-window LLM that you could use to verify a statistical model of society. I wonder what patterns it could deduce?
- Cool! I also did something like this: https://github.com/hallvardnmbu/transformer
But on various data (i.e., separate model per source): the Bible, Don Quixote and Franz Kafka. (As well as a (bad!) lyrics generator, and translator.)
- I think it would be very cute to train a model exclusively in pre-information age documents, and then try to teach it what a computer is and get it to write some programs. That said, this doesn't look like it's nearly there yet, with the output looking closer to Markov chain than ChatGPT quality.
- Anyone seen a low-friction way to run prompts through this yet, either via a hosted API or chat UI or a convenient GGML or MLX build that runs in Ollama or llama.cpp or LM Studio?
- > OCR noise (“Digitized by Google”) still present in outputs
This feels like a neat sci-fi short story hook to explain the continuous emergence of God as an artifact of a simulation
by mock-possum
0 subcomment
- Fun idea, but all of the output they demo over the course of the various versions is unusable. You can see progress clearly being made though - maybe v3 will pass muster.
by CGMthrowaway
1 subcomments
- Is there a link where I can try it out?
Edit: I figured it out
"The Lord of the Rings uding the army under the command of his brother, the Duke of York, and the Duke of Richmond, who fell in the battle on the 7th of April, 1794. The Duke of Ormond had been appointed to the command of the siege of St. Mark's, and had received the victory of the Rings, and was thus commanded to move with his army to the relief of Shenham. The Duke of Ormond was at length despatched to oppose them, and the Duke of Ormond was ordered
by dlcarrier
2 subcomments
- It's interesting that it's trained off only historic text.
Back in the pre-LLM days, someone trained a Markov chain off the King James Bible and a programming book: https://www.tumblr.com/kingjamesprogramming
I'd love to see an LLM equivalent, but I don't think that's enough data to train from scratch. Could a LoRA or similar be used in a way to get speech style to strictly follow a few megabytes worth of training data?
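The LoRA idea the comment asks about can be sketched in a few lines of numpy (toy shapes, hypothetical numbers, not any real model's weights): the pretrained weight matrix stays frozen, and only two small low-rank factors are trained, so a few megabytes of style text only has to fit a few thousand parameters per adapted matrix rather than the full weight.

```python
import numpy as np

# Toy LoRA sketch. W stays frozen; only the low-rank factors A and B
# are trained, so the style data has to fit r*(d_in + d_out) parameters
# per adapted matrix instead of d_in*d_out.
d_in, d_out, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def adapted_forward(x):
    # base path + scaled low-rank "style" correction
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d_out * d_in           # 262,144 per matrix
lora_params = r * (d_in + d_out)     # 8,192 per matrix, ~3% of full
```

Because B is zero-initialized, the adapted model starts out exactly equal to the base model, which is part of why this kind of adapter works for grafting a narrow speech style onto a large pretrained model without enough data to retrain it.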
- I wonder if you could train an LLM with everything up to Einstein. Then see if with thought experiments + mathematics you could arrive at general relativity.
- There was a discussion around a very similar model (Qwen3 based) some weeks ago:
https://news.ycombinator.com/item?id=46319826
I found it particularly thought-inspiring how a model with training from that time period completely lacks context/understanding of what it is itself, but then I realized that we are the same (at least for now).
by abhishekjha
1 subcomments
- Oh, I have been thinking about this for a long time. The intelligence that we have in these models represents a point in time.
Now, if I trained a foundation model on docs from the Library of Alexandria, and only those texts of that period, I would have a chance at a rudimentary insight into what the world was like at that time.
And maybe time-shift even further back.
- I wonder how representative this is of life in those days. Most written communication was official back then. Books, newspapers. Plays. All very formal and staged. There's not much real life interaction between common people in that. In fact I would imagine a lot of people were illiterate.
With the internet and pervasive text communication and audio video recording we have the unique ability to make an LLM mimic daily life but I doubt that would be possible for those days.
- A fun use of this kind of approach would be to see if conversational game NPCs could be generated that stick to the lore of the game and their character.
- Training LLMs on data with certain date cut-offs and then doing comparative analysis between the LLMs would be interesting.
- This kind of technique seems like a good way to test model performance against benchmarks. I'm skeptical that new models aren't taking popular benchmark solutions into their training data. So: how does e.g. ChatGPT's underlying architecture perform on SWE-bench if trained only on data prior to 2024?
by HarHarVeryFunny
0 subcomment
- It would be interesting if there's enough data to train a model capable enough to converse with and ask about contemporary views on issues of the day, or what it thought about "potential" future events/technologies yet to happen.
by albertzeyer
1 subcomments
- v0: 16M Parameters
v0.5: 123M Parameters
v1: 700M Parameters
v2mini-eval1: 300M Parameters
I would not call this an LLM. This is not large. It's just a normal-sized LM, or even a small one.
(It's also not a small LLM.)
by marmalade2413
0 subcomment
- Can you confidently say that the architecture of the LLM doesn't include any a priori bias that might affect the integrity of this LLM?
That is, the architectures of today are chosen to yield the best results given the textual data around today and the problems we want to solve today.
I'd argue that this potential bias would need to be researched (if it hasn't been already) before this kind of model has credence.
LLMs aren't my area of expertise but during my PhD we were able to encode a lot of a priori knowledge through the design of neural network architectures.
by aussieguy1234
0 subcomment
- Let's see how someone from the past reacts when you tell them about modern technology
by radiothomp
1 subcomments
- An LLM trained only on data from certain time periods to ~reduce modern bias~ enhance past bias
by snickerbockers
0 subcomment
- This one's going to have some wild political takes.
- The "1917 model" from a few weeks back was post-trained with ChatGPT dialogue, so it had a modern dialect and modern proclivities.
A truly authentic historical model would have some unsavory opinions and a very distinctive dialect.
- This could be something good. I'd love to see it on Ollama or LM Studio.
- Looks a lot like the output from a markov chain...
by escapecharacter
0 subcomment
- I would pay like $200/month if there was an LLM out there that I could only communicate with using an old-timey telegraph key and morse code.
- I wonder how racist it is
- Exciting idea!
- So basically an LLM from that brief time period back when communism felt like a good idea? What can go wrong? :-)
- "I'm sorry, my knowledge cutoff is 1875"
- HN titles are too techy
by ourmandave
0 subcomment
- Can I use it to get up-to-date legal advice on Arizona reproductive health laws?