I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.
AFAIK, ChatGPT Voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also start sounding Indian), and assuming ethnicity or biasing based on accent.
It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models from an abundance of caution.
Maybe a transformer could be running in parallel, but at a much lower frequency, where the linear model feeds it "summary" tokens once per second, whose information would mostly be "text", but also some hint of emotion and other cues. Then the output of this could be fed back to the linear model so that it would know what it was saying and with what emotion. Basically the transformer would be the low-frequency, long-range context thinker (and feeler), and the linear model would translate that to and from phonetics.
They'd be trained in parallel, so those transformer tokens would attain meaning at training time, not something that would have to be pre-defined. So it'd still be purely phonetic e2e, no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might have smaller representation in the token.
Probably would never reach the level of text-based LLMs for logic and code and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.
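To make it concrete, here's a very rough sketch assuming PyTorch; every module choice and dimension is made up, and the point is only the two clocks (fast phonetic model, slow context model) and the once-per-second summary token fed back down:

    import torch
    import torch.nn as nn

    class DualRateSketch(nn.Module):
        def __init__(self, frame_dim=128, hidden=256, ctx_dim=512):
            super().__init__()
            # Fast path: stands in for the "linear model" running at frame rate.
            self.fast = nn.GRU(frame_dim + ctx_dim, hidden, batch_first=True)
            # One "summary" token per second, handed to the slow path.
            self.summarize = nn.Linear(hidden, ctx_dim)
            # Slow path: the low-frequency, long-range context thinker.
            self.slow = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(ctx_dim, nhead=8, batch_first=True),
                num_layers=2)
            self.to_phonetics = nn.Linear(hidden, frame_dim)

        def forward(self, frames, ctx):
            # frames: (B, frames_per_second, frame_dim), one second of audio features
            # ctx:    (B, T_slow, ctx_dim), summary tokens so far (seeded with a start token)
            ctx_last = ctx[:, -1:, :].expand(-1, frames.size(1), -1)
            h, _ = self.fast(torch.cat([frames, ctx_last], dim=-1))
            summary = self.summarize(h[:, -1:, :])             # new once-per-second token
            ctx = self.slow(torch.cat([ctx, summary], dim=1))  # slow model updates its context
            return self.to_phonetics(h), ctx                   # next-frame prediction + new context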
Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
I don't know if it's possible to use 400-byte tokens in a transformer model, but I would be very tempted to try.
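Mechanically a "400-byte token" is just an embedding into the model dimension; whether a transformer can learn anything through it is the open question. A toy sketch, with all names and sizes made up, assuming PyTorch:

    import torch
    import torch.nn as nn

    SAMPLE_RATE = 44_100
    SAMPLES_PER_FRAME = 1152                               # MPEG-1 Layer III
    BITRATE = 128_000

    frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE      # ~26.1 ms per frame
    frame_bytes = 144 * BITRATE // SAMPLE_RATE             # ~417 bytes (418 with padding)
    pcm_bytes = SAMPLES_PER_FRAME * 2 * 2                  # 16-bit stereo: 4608 bytes
    print(frame_ms, frame_bytes, pcm_bytes / frame_bytes)  # ~26.1  417  ~11x

    class FrameEmbedder(nn.Module):
        # One MP3 frame -> one token: embed each of the 418 raw bytes and pool.
        def __init__(self, frame_len=418, d_model=768):
            super().__init__()
            self.byte_embed = nn.Embedding(256, d_model)
            self.pos = nn.Parameter(torch.zeros(frame_len, d_model))

        def forward(self, frames):                         # (B, T, frame_len) uint8
            x = self.byte_embed(frames.long()) + self.pos
            return x.mean(dim=2)                           # (B, T, d_model)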
I attempted some similar VQ-VAE work, but trying to tokenize rendered text instead. I was curious whether I could make a visual LLM working on 10 pt rendered font, and I also tried using PDF sources. The basic idea was to do what the more advanced diffusion image models can do when they generate images of text: make a dedicated text-image diffusion model that does completions. I further wondered if I could embed things like document type and language, so you could have a latent representation of text more abstracted than current dictionary tokenizers. I learned a lot, and thought it was all beautifully displayed in this post.
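For anyone wanting to poke at the same idea, the heart of it is just nearest-codebook lookup over rendered patches; a toy sketch (patch and codebook sizes are arbitrary, assuming PyTorch):

    import torch
    import torch.nn as nn

    class PatchVQ(nn.Module):
        # Toy VQ layer: encode a rendered-text patch, snap it to the nearest
        # codebook vector, and use the straight-through trick for gradients.
        def __init__(self, patch_pixels=16 * 16, dim=64, codebook_size=1024):
            super().__init__()
            self.encode = nn.Linear(patch_pixels, dim)
            self.codebook = nn.Embedding(codebook_size, dim)

        def forward(self, patches):                        # (B, N, patch_pixels) in [0, 1]
            z = self.encode(patches)                       # (B, N, dim)
            d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
            ids = d.argmin(dim=-1)                         # discrete "image-of-text" tokens
            q = self.codebook(ids)
            q = z + (q - z).detach()                       # straight-through estimator
            return q, ids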
Obviously working directly with audio is vastly more complex than with text.
But it is very exciting to see how part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.
I even have to wonder if, at some point, we'll ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on some set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.
I can imagine such a model being arrived at statistically (determining the necessary number of parameters), and then almost becoming "hard-coded" as a standard since human anatomy doesn't change much there, beyond certain ranges.
I think it's called formant speech encoding, and it would be interesting if LLMs wound up massively advancing that field, since historically I think it's had more to do with speech synthesis than audio compression.
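Classic formant synthesis is basically the source-filter version of that idea: an excitation pushed through a handful of resonators sitting at the formant frequencies. A bare-bones numpy/scipy sketch of one static vowel (the formant values are ballpark figures for an /a/, not any real speaker model):

    import numpy as np
    from scipy.signal import lfilter

    fs = 16_000
    n = int(0.5 * fs)                                   # half a second of samples

    # Source: a crude glottal excitation, a 110 Hz impulse train.
    f0 = 110
    source = (np.arange(n) % int(fs / f0) == 0).astype(float)

    # Filter: one second-order resonator per formant (frequency, bandwidth in Hz),
    # cascaded in series.
    signal = source
    for freq, bw in [(800, 80), (1200, 90), (2500, 120)]:
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * freq / fs
        a = [1.0, -2 * r * np.cos(theta), r * r]
        b = [sum(a)]                                    # unity gain at DC
        signal = lfilter(b, a, signal)

    signal /= np.max(np.abs(signal))                    # normalize; write out with soundfile etc.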
Their approach to accessibility is quite unfortunate, however. unmute [1], which uses the approach discussed in this post, runs quite well, with a claimed feature of adapting to any voice provided you have a 10-second recording. This is not made available to the public at all, despite an issue raised since July. [2]
Given the pace of the industry, it is a shame that we need to look elsewhere instead of using otherwise well-designed tooling.
[1] https://news.ycombinator.com/item?id=44109610 [2] https://github.com/kyutai-labs/unmute/issues/99
But I can say the same about tokenization. LLMs first convert groups of characters to tokens, then use that to generate tokens, and then convert the tokens back to characters. That's not real understanding! If LLMs are so smart, we should be able to skip the tokenization step.
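To be concrete about the pipeline being mocked: the model never sees characters at all, only ids. A toy version with a made-up vocabulary (real tokenizers are BPE with tens of thousands of merges, but the shape is the same):

    # Toy vocabulary: groups of characters <-> integer ids.
    vocab = {"un": 0, "believ": 1, "able": 2, "!": 3}
    inv = {i: s for s, i in vocab.items()}

    def encode(text):
        # Greedy longest-match segmentation, the simplest stand-in for BPE.
        ids, i = [], 0
        while i < len(text):
            piece = next(p for p in sorted(vocab, key=len, reverse=True)
                         if text.startswith(p, i))
            ids.append(vocab[piece])
            i += len(piece)
        return ids

    ids = encode("unbelievable!")                  # [0, 1, 2, 3]
    # ...the model only ever predicts the next id from ids like these...
    print("".join(inv[i] for i in ids))            # back to characters: unbelievable!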
I think even for text models, "streams" could be useful. Perhaps if the LLM sees too long a pause after explaining something and asking a question, it could interject with a "do you need help?" or something. Plain chat GPTs don't have that ability.
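Mechanically you could bolt that on outside the model today; a sketch with asyncio, where the timeout, the nudge text, and the send() / LLM call are all placeholders:

    import asyncio

    async def chat_with_nudges(user_messages: asyncio.Queue, send, pause_seconds=30):
        # Wait for the user's next message; if they go quiet for too long after
        # the model asked something, interject once instead of sitting silent.
        while True:
            try:
                message = await asyncio.wait_for(user_messages.get(), pause_seconds)
            except asyncio.TimeoutError:
                await send("Still there? Do you need help with that last step?")
                message = await user_messages.get()
            await send(f"(model reply to: {message!r})")   # stand-in for the real LLM call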
All the streaming services are shit at it. They can't do much beyond shallow similarities or hardcoded recommendations that are probably just based on manually-entered keywords like the genre etc.
Has that already been done?
Or is it yet another of those what-could-have-been utopian things that got crippled before it was born because of corporate overcontrol/overcaution (not being able to train on copyrighted music)?
Maybe some open-source project could do it?
(I don't even feel confident asking an AI whether a music-rec AI exists, because ChatGPT 5 didn't know ChatGPT 5 was out, and Claude still thinks iOS 26 isn't out yet... sigh)
Read some Wittgenstein and Goodman, but especially Derrida, who calls this logocentrism.