This has been done successfully in the past:
https://huggingface.co/featherless-ai/QRWKV-72B
Note that this is a 72B model, which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
Experiments I want to build on top of it:
1. Adding LSP context to the embeddings. That way the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.
2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, so we would not rely on a huge static token dictionary (see the sketch below).
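A minimal sketch of how both ideas could fit into one input layer. Everything here is hypothetical: the `PatchEmbedder` module, the `LSP_ROLES` set, and the pooling choice are made up for illustration, not taken from the linked repo.

```python
# Hypothetical sketch: embed variable-length byte patches (idea 2) and add an
# LSP-derived role embedding per patch (idea 1). Not from the QRWKV repo.
import torch
import torch.nn as nn

# Assumed, simplified set of symbol roles a language server could report.
LSP_ROLES = {"none": 0, "definition": 1, "reference": 2, "type": 3}

class PatchEmbedder(nn.Module):
    """Embed raw source bytes in variable-size patches instead of a fixed
    token vocabulary, and mix in an LSP role signal for each patch."""

    def __init__(self, d_model: int = 512, max_patch: int = 8):
        super().__init__()
        self.max_patch = max_patch
        # 256 possible byte values plus a padding id.
        self.byte_emb = nn.Embedding(256 + 1, d_model, padding_idx=256)
        self.role_emb = nn.Embedding(len(LSP_ROLES), d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, patches: list[bytes], roles: list[str]) -> torch.Tensor:
        rows = []
        for patch, role in zip(patches, roles):
            ids = list(patch[: self.max_patch])
            ids += [256] * (self.max_patch - len(ids))        # pad to max_patch
            byte_vecs = self.byte_emb(torch.tensor(ids))       # (max_patch, d)
            pooled = byte_vecs.mean(dim=0)                     # one vector per patch
            pooled = pooled + self.role_emb(torch.tensor(LSP_ROLES[role]))
            rows.append(self.proj(pooled))
        return torch.stack(rows)                               # (num_patches, d)

# A 2-byte patch and a 6-byte patch -> two embeddings with different
# "compression ratios" (bytes per embedding), no static token dictionary.
emb = PatchEmbedder()
out = emb([b"fn", b"main()"], ["definition", "reference"])
print(out.shape)  # torch.Size([2, 512])
```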
I'm aware that papers exist exploring these ideas, but so far no popular/good open source models employ them. Unless someone can prove me wrong.
> Standard AI models (like GPT-4) treat data using Global Geometry. They imagine every word as a point floating in a massive, flat, high-dimensional room. To see how two words relate, they draw a straight line between them.
> Local Topology changes the "room" into a landscape (a manifold). Instead of a flat void, the data exists on a curved surface that has hills, valleys, and paths.
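To make the quoted distinction concrete, here is a toy comparison of straight-line (Euclidean) distance against a geodesic approximated over a k-nearest-neighbour graph, in the spirit of manifold-learning methods like Isomap. The half-circle data, the choice of k, and the use of `scipy.sparse.csgraph.shortest_path` are my own illustration, not anything from the quoted text.

```python
# Toy illustration: points on a curved 1-D manifold (a half-circle) embedded
# in 2-D. The straight line says the endpoints are close; the geodesic along
# the neighbourhood graph says they are far.
import numpy as np
from scipy.sparse.csgraph import shortest_path

# 50 points along a half-circle of radius 1.
t = np.linspace(0, np.pi, 50)
pts = np.stack([np.cos(t), np.sin(t)], axis=1)

# Pairwise Euclidean distances ("global geometry": straight lines in the room).
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Keep only each point's k nearest neighbours ("local topology": small steps
# along the surface), then chain them with a shortest-path search.
k = 3
graph = np.full_like(d, np.inf)          # inf entries = no edge
for i in range(len(pts)):
    nbrs = np.argsort(d[i])[1 : k + 1]   # skip the point itself
    graph[i, nbrs] = d[i, nbrs]
geodesic = shortest_path(graph, directed=False)

print("straight line, endpoint to endpoint:", round(d[0, -1], 3))        # ~2.0
print("geodesic along the manifold:        ", round(geodesic[0, -1], 3))  # ~3.14
```

The two numbers measure the same pair of points; only the notion of "distance" changes, which is the whole difference the quote is pointing at.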