FRESH

Hacker News

Continuous Autoregressive Language Models

110 points by Anon84

by killerstorm

1 subcomments

Would be interesting to combine it with Reasoning In the Latent Space: feed the vector from the output layer of transformer back to input.
Obviously, you can't do it in pre-training. But you can add it later as an optional 'extra' vector, I think. E.g. `input_embedding + MLP(prev_output) * alpha`. Alpha is zero during pre-training.

by mentalgear

1 subcomments

Very interesting. Also I find these training parameters quite elegant:
- Diversity: This term encourages the model to generate a diverse set of samples, preventing mode collapse. - Fidelity: This term rewards the model for making predictions that are close to the ground-truth
I'm wondering if a continuos next-vector generative approach also increase innate "reasoning" capabilities of the model, since it could potentially capture more of the semantics of the data vs just tokens.

by mike_hearn

0 subcomment

If they can reinvent RL so it works with this then I guess the big labs will be all over it, as ~halving inference costs would be huge (especially if Ed Zitron's leaked OpenAI inf costs are accurate). Potentially the difference between inferencing being profitable and loss making. It's an elegant approach.
I also wonder how far they can push K if other aspects are tweaked. The approach of just doubling each parameter each time leaves a lot of space between the chosen value and the next value known to not work.

by vatsachak

0 subcomment

K being fixed here seems like it will eventually be done away with
When I'm thinking about math proofs, sometimes I can have a single idea which can be unfolded into a hundred lines of proof
Maybe I'm getting the wrong analogy here, but if vectors = ideas then K should depend on the vector

by notrealyme123

0 subcomment

by suddenlybananas

0 subcomment

The technique of compressing tokens down reminds me a bit of byte latent transformers

by Gormanu

0 subcomment