Speaking of taking my money, what's the economic model for a company like this? They've published a fair amount about their architecture - enough that I imagine frontier labs could implement it. Patents? Trade secrets? It's hard for me to understand how you'd beat the training compute and know-how at Anthropic/GOOG/oAI/Meta without some sort of legal protection.
I can't wait to see what these model architectures do with like 30-40% lower latency and more model intelligence. Very appealing. For reference, these look to be roughly 1/10 the size of Opus 4.7 / GPT 5.x series -- 275B, 12B active. So there's lots of room to add intelligence, and lots of hope that we could see lower latency.
> Time-Aligned Micro-Turns. The interaction model works with micro-turns continuously interleaving the processing of 200ms worth of input and generation of 200ms worth of output. Rather than consuming a complete user-turn and generating a complete response, both input and output tokens are treated as streams. Working with 200ms chunks of these streams enables near real-time concurrency of multiple input and output modalities.
That's probably the main thing distinguishing it from the multimodal models of other frontier labs, as far as I can tell.
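If I'm reading that right, the control loop is roughly the sketch below (the function names here are placeholders I made up, not their actual API): the model never waits for an end-of-turn signal, it just alternates between ingesting ~200ms of input and emitting ~200ms of output, where the output is allowed to be silence.

```python
# Rough sketch of the micro-turn idea as I understand it -- transcribe_chunk,
# generate_chunk, play_chunk etc. are invented names, not the real interface.
CHUNK_MS = 200  # each micro-turn covers 200ms of input audio and 200ms of output audio

def run_session(mic_stream, model, speaker):
    context = []  # shared running token history covering both directions
    for audio_in in mic_stream.chunks(ms=CHUNK_MS):
        # fold the latest 200ms of user audio into the context as input tokens
        context.extend(model.transcribe_chunk(audio_in))
        # immediately generate the next 200ms of model audio, conditioned on
        # everything heard so far -- no waiting for the user's turn to end
        audio_out, out_tokens = model.generate_chunk(context, ms=CHUNK_MS)
        context.extend(out_tokens)
        speaker.play_chunk(audio_out)  # may just be silence if the model stays quiet
```

Presumably "not interrupting" is then just the model choosing to emit silence on its half of the stream, rather than a separate end-of-turn detector.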
With the exception of the real-time translation (which seems like it should be a separate product all by itself), none of the use-cases they presented had much utility. I don't want anything to count the number of animals in my stories or time a trivia quiz for me. The auto-slouch-detector, while the demo was pretty funny, just seems so dystopian and weird. An AI that interrupts you mid-sentence to scold you about taking your elderly parents mountain biking, instead of waiting for you to finish before scolding you? No thanks.
The UX is also an issue - the model interrupting the user (even when apparently required by these strange use-cases) is jarring and makes you lose your flow. You can even see this in the demo videos they put out - the employees/actors had to really concentrate to keep speaking as if they weren't being interrupted by a brash robotic machine. A human participating in one of these (rare) "invited interruptions" has the ability to speak "under" the main speaker, and I feel it's generally timed with a lot of nuance.
Even in the auto-translation demo, they ducked the human's audio but the AI steamrolled him; it would have been impossible to actually do that demo without either an incredible amount of control over one's speaking or (more likely) muting the output. A human translator has a way of "pointing" the "output" at the intended listener.
The very best part of this tech was shown in the first video, where the AI doesn't needlessly interrupt the user. That seems to me more like a fix for an important bug that current models still (somehow) have.
Maybe a good use-case for this would be counting "um's" and the like while practising public speaking.
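That's also about the simplest thing you could bolt onto a streaming transcript - a toy version (purely hypothetical, just pattern-matching over transcribed chunks) would be something like:

```python
# Toy filler-word counter over streamed transcript chunks -- purely illustrative.
import re
from collections import Counter

FILLERS = {"um", "uh", "erm", "like"}

def count_fillers(transcript_chunks):
    counts = Counter()
    for chunk in transcript_chunks:  # e.g. one chunk per micro-turn of transcribed speech
        for word in re.findall(r"[a-z']+", chunk.lower()):
            if word in FILLERS:
                counts[word] += 1
    return counts

# count_fillers(["Um, so today I want to, uh, talk about...", "like I was saying..."])
# -> Counter({'um': 1, 'uh': 1, 'like': 1})
```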
Local models will catch up soon.
Every demo by OpenAI showing off their models is "tell me how tall the Statue of Liberty is divided by the year the inventor of the steam engine was born". It's cool, but it's so hard to find an actual use. As a personal answer machine I find it very useful, but if someone had told me 5 years ago, "here's a natural-language computer at least as smart as every 15-year-old, and it costs a few bucks per million words", I would have thought the applications would just scream out. Yet to this day - outside of programming (a big deal, tbc) - no one has found a good use for intelligence. It's so, so weird.
I guess even a company can't just automatically make more money by hiring more people, but I'm still confused.
Is it separate batches of special "skills" that are added post-training? How can they guarantee the models won't eventually lose a skill?