- My understanding is that this is purely a strategic choice by the bigger labs. When OpenAI released Whisper, it was by far best-in-class, and they haven't released any major upgrades since then. It's been 3.5 years... Whisper is older than ChatGPT.
Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording of myself in a car, with me talking, and another passenger talking to the driver, and the radio playing, me in English, the radio in Portuguese, and the driver+passenger in Spanish, Gemini can parse all 4 audio streams as well as other background noises and give a translation for each one, including figuring out which voice belongs to which person, and what everyone's names are (if it's possible to figure that out from the conversation).
I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.
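For anyone curious what that kind of request looks like in code, here's a minimal sketch using the google-genai SDK. The model name, file path, and prompt wording are my own assumptions, not anything from the comment above.

  # Minimal sketch: send a messy in-car recording to Gemini and ask it to
  # separate, transcribe, and translate every audio stream.
  # Model name, file path, and prompt wording are assumptions.
  from google import genai

  client = genai.Client()  # reads the API key from the environment
  audio = client.files.upload(file="car_ride.m4a")  # hypothetical recording

  prompt = (
      "Separate every audio stream in this recording (each speaker, the radio, "
      "background noise). Transcribe and translate each one into English, say "
      "which voice belongs to which person, and infer names if the conversation "
      "reveals them."
  )

  response = client.models.generate_content(
      model="gemini-2.5-pro",
      contents=[audio, prompt],
  )
  print(response.text)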
by d4rkp4ttern
3 subcomments
- It's amazing how good open-weight STT and TTS have gotten, so there's no need to pay for Wispr Flow, Superwhisper, Eleven-Labs etc.
Sharing my setup in case it's useful for others; it works especially well with CLI agents like Claude Code or Codex-CLI:
STT: Hex [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate what it understood, and it gives back a nicely structured version -- this both confirms understanding and likely helps the CLI agent stay on track. Hex is a native macOS app and leverages CoreML and the Neural Engine for extremely fast transcription. (I used to recommend a similar app, Handy, but it has frequent stuttering issues, and Hex is actually even faster, which I didn't think was possible!)
TTS: Kyutai's Pocket-TTS [2], just 100M params, with amazing speech quality (English only). I made a voice plugin [3] based on this for Claude Code, so it can speak short updates whenever CC stops. It uses a combination of hooks that nudge the main agent to append a speakable summary, falling back to a headless agent in case the main agent forgets (a rough sketch of the hook idea is below, after the links). Turns out to be surprisingly useful. It's also fun, as you can customize the speaking style to mirror your vibe, "colorful language", etc.
The voice plugin provides commands to control it:
/voice:speak stop
/voice:speak azelma (change the voice)
/voice:speak prompt <your arbitrary prompt to control the style>
[1] Hex: https://github.com/kitlangton/Hex
[2] Pocket-TTS: https://github.com/kyutai-labs/pocket-tts
[3] Voice plugin for Claude Code: https://pchalasani.github.io/claude-code-tools/plugins-detai...
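To make the hook mechanism above concrete, here is a rough, hypothetical sketch of a Stop-hook script that reads the hook payload from stdin and pipes a summary to a local TTS command. The "speakable_summary" field and the `tts` command name are placeholders I made up; the actual plugin [3] does considerably more.

  #!/usr/bin/env python3
  # Hypothetical Stop-hook sketch: speak a short update when Claude Code stops.
  # The "speakable_summary" field and the local `tts` command are placeholders.
  import json, subprocess, sys

  payload = json.load(sys.stdin)  # Claude Code passes hook input as JSON on stdin
  summary = payload.get("speakable_summary", "Claude Code has finished this step.")
  # Pipe the text to whatever local TTS you have installed.
  subprocess.run(["tts", "--voice", "azelma"], input=summary.encode(), check=False)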
by nowittyusername
2 subcomments
- Good article and I agree with everything in there. For my own voice agent I decided to make it push-to-talk by default, because the problems with the model accurately guessing end of utterance are just too great. I think it can be solved eventually, but I haven't seen a really good example of it being done with current tech, including this lab's. Fundamentally it comes down to the fact that different humans have different ways of speaking, and the human listening to them updates their own internal model of that speech pattern, adjusting after a couple of interactions and arriving at the proper way of speaking with that person. Something very similar will need to happen, at very low latency, for it to succeed in the audio ML world, and I don't think we have anything like that yet. It seems the best you can currently do is tune the model on a generic speech pattern that you expect to fit a large percentage of the population; anyone who falls outside of that will feel the pain of getting interrupted every time.
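To illustrate the kind of per-speaker adaptation described above, here's a toy sketch: a VAD-based endpointer whose silence threshold drifts toward the pauses a particular speaker actually makes. All constants and the adaptation rule are made up; production systems use learned end-of-utterance models.

  # Toy end-of-utterance sketch: adapt the silence threshold to the speaker.
  # Constants and the adaptation rule are assumptions, not a real system.
  import webrtcvad

  SAMPLE_RATE = 16000
  FRAME_MS = 30  # webrtcvad accepts 10/20/30 ms frames of 16-bit mono PCM
  vad = webrtcvad.Vad(2)  # aggressiveness 0-3

  def endpoints(frames, threshold_ms=700.0):
      """Yield frame indices where an utterance likely ended."""
      silence_ms = 0.0
      pauses = []
      for i, frame in enumerate(frames):
          if vad.is_speech(frame, SAMPLE_RATE):
              if 0 < silence_ms < threshold_ms:
                  # A mid-utterance pause: drift the threshold toward ~1.5x
                  # this speaker's typical pause length.
                  pauses.append(silence_ms)
                  typical = sum(pauses) / len(pauses)
                  threshold_ms = 0.8 * threshold_ms + 0.2 * (1.5 * typical)
              silence_ms = 0.0
          else:
              silence_ms += FRAME_MS
              if silence_ms >= threshold_ms:
                  yield i  # likely end of utterance
                  silence_ms = 0.0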
- There's too much noise at large organizations
- OpenAI and Google are too scared of music industry lawyers to tackle this. Internally they no doubt have models that would crush these startups overnight if they chose to release them.
- Never any mention of Soniox, even though they're on the Pareto frontier [1].
[1] https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
by giancarlostoro
3 subcomments
- OpenAI being the death star and audio AI being the rebels is such a weird comparison, like what? Wouldn't the real rebels be the ones running their own models locally?
- There’s simply not enough of a market for these bigger orgs to be truly interested or invested in audio, video, and even, to an extent, images.
They’ll wait for progress to be made and then buy the capability/expertise/talent when the time is right.
by d4rkp4ttern
1 subcomment
- Speaking of audio + AI, here's a "learning hack" I've been trying with voice mode, and the 3 big AI labs still haven't nailed it:
While on a walk with a mobile phone and earphones, dump an article/paper/HN post/GitHub repo into the mobile chat app (ChatGPT, Claude, or Gemini) and use voice mode to have it walk you through it conversationally, so you can ask follow-up questions during the walk-through and have the AI do web searches, etc. I know I could do something like this with NotebookLM, but I want to engage in the conversation, and while NotebookLM does have an interactive mode, it has been super flaky to say the least.
I pay for ChatGPT Pro and the voice mode is really bad: it pretends to do web searches and makes up things, and when pushed says it didn't actually read the article. Also the voice sounds super-condescending.
The Gemini Pro mobile app similarly refuses to open links and sounds as if it's talking to a baby.
The Claude mobile app was the best among these: the voice is very tolerable in terms of tone, but like the others it can't open links. It does do web searches, but it only gets some kind of page summaries and doesn't actually go into the links themselves to give me details.
by AustinDev
1 subcomment
- Audio models are also tiny, which is probably why small labs are doing well in the space. I run a LoRA'd Whisper v3 Large for a client. We can fit 4 versions of the model in memory at once on a ~$1/hr A10 and still have half the VRAM left over.
Each of the LoRA tunes we did took maybe 2-3 hours on the same A10 instance.
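For anyone wondering what that looks like in practice, here's a minimal sketch of attaching a LoRA adapter to Whisper large-v3 with HuggingFace PEFT. The rank, target modules, and training details are my assumptions, not the parent's actual configuration.

  # Minimal LoRA-on-Whisper sketch with HuggingFace transformers + peft.
  # Rank, target modules, and training setup are assumptions.
  from transformers import WhisperForConditionalGeneration, WhisperProcessor
  from peft import LoraConfig, get_peft_model

  base = "openai/whisper-large-v3"
  model = WhisperForConditionalGeneration.from_pretrained(base)
  processor = WhisperProcessor.from_pretrained(base)

  lora = LoraConfig(
      r=16,
      lora_alpha=32,
      target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
      lora_dropout=0.05,
      bias="none",
  )
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # a tiny fraction of the ~1.5B base weights
  # ...then train with Seq2SeqTrainer on domain audio. The adapter itself is
  # only tens of MB, which is why several variants fit next to the base model
  # on a single A10.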
- I check every day for a new full-duplex model. I was so hyped about PersonaPlex from their demos, but in my test it was oddly dumb and unable to follow instructions.
So I am hoping for something like PersonaPlex but a bit larger.
Has anyone tested MiniCPM-o? How is it at instruction following?
- Moshi was an amazing tech demo; building the entire stack from scratch in 6 months with a small team was an impressive show of skill: a 7B text LLM (data + training), emotive TTS for synthetic data generation (again, model + data collection), a synthetic-data pipeline, a novel speech codec, a Rust inference stack for low latency, and an audio LLM architecture including a text "thoughts" stream, which was novel.
But this is a fluff piece: "underfunded" here means a total of around $400 million ($330 million in the initial round, $70 million for Gradium). Compare that to ElevenLabs, who built their initial product on a $2 million pre-seed.
A bunch of other stuff there is disingenuous, like comparing their 7B model to Llama-3 405B (hint: the 7B model is a _lot_ dumber). There's also an outright lie: that a team of 4 made Moshi, which is corrected to 8 _in the same piece_ if you read far enough.
- Is there something that will read books to me? I.e., I have some books in epub format and want audiobook versions of them, with a nice voice.
- Most current "voice assistants" still feel like glorified walkie-talkies... you talk, pause awkwardly, they respond, and any interruption breaks the flow
- Can someone recommend a service that will generate a loopable engine drone for a "WWII Plane Japan Kawasaki Ki-61"? It doesn't have to be perfect, just convincing in a Hollywood-blockbuster context, and not just a warmed-over clone of a Merlin engine sound. Turns out Suno will make whatever background music I need, but I want a "unique sound effect on demand" service. I'm not convinced voice AI stuff is sustainable.
by umairnadeem123
0 subcomments
- i buy the thesis that audio is a wedge because latency/streaming constraints are brutal, but i wonder if it's also just that evaluation is easier. with vision, it's hard to say if a model is 'right' without human taste, but with speech you can measure wer, speaker similarity, diarization errors, and stream jitter. do you think the real moat is infra (real-time) or data (voices / conversational corpora)?
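To the "evaluation is easier" point, a quick illustration of how mechanical speech evals are compared to vision, using the jiwer package; the strings are made up.

  # Word error rate is one deterministic number, no human taste required.
  # Example strings are made up.
  import jiwer

  reference = "turn left at the next intersection"
  hypothesis = "turn left at the next inter section"
  print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
  # Speaker similarity, diarization error rate, and stream jitter are similarly
  # scriptable, which keeps audio leaderboards cheap to run.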
by bossyTeacher
2 subcomments
- Surprised ElevenLabs is not mentioned
- Maybe OpenAI has finally learned that it can't keep dancing at all the parties at once, when every party is progressing towards "commodity".
- Probably because the big companies have their focus elsewhere.
by mrbluecoat
0 subcomments
- The bigger players probably avoid it because it's a bigger legal liability: https://news.ycombinator.com/item?id=47025864
..plenty of money to be made elsewhere
- Also: porn.
Audio is too niche and porn is too ethically messy and legally risky.
There's also music, which the giants also don't touch. Suno is actually really impressive.
by SilverElfin
1 subcomment
- Does Wisprflow count as an audio “lab”?
- Right now small labs also have the best chance at tool harness improvements, which can yield just as many gains in AI performance as model training research.
by RobMurray
1 subcomment
- for a laugh enter nonsense at https://gradium.ai/
You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.