Hacker News

VibeVoice: Open-source frontier voice AI

386 points by tosh

by steinvakt2

9 subcomments

This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual input.
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

by maxloh

12 subcomments

I think we should stop calling this type of model open source. They are really "open weight." The training code is proprietary and was never released.
https://github.com/microsoft/VibeVoice/issues/102

by isodev

1 subcomments

I think in this category, Voxtral by Mistral is a lot better. It also happens to be small enough to run on WebGPU: https://huggingface.co/spaces/mistralai/Voxtral-Realtime-Web...

by pluc

2 subcomments

Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243

by d4rkp4ttern

1 subcomments

This was on HN 7 months ago:
https://news.ycombinator.com/item?id=45114245
Every time an STT/TTS model is posted I wonder if it will change my current workflow on macOS, which is:
STT with Parakeet-V3 via Hex [1] app for near-instant good-enough transcription for talking to AI agents.
TTS using Kyutai’s Pocket-TTS, an amazing 100M-param model. I used this to make a "voice" plugin [2] for Claude Code.
So far I haven’t seen anything that replaces these for me, or haven't been persuaded enough to spend time testing an alternative (explore/exploit and all that).
[1] Hex STT app - https://github.com/kitlangton/Hex, which is macOS-only. (also good free/OSS alternatives: Handy, VoiceInk. No need for Wispr, Superwhisper etc)
[2] Claude Code Voice Plugin - https://pchalasani.github.io/claude-code-tools/plugins-detai...

by embedding-shape

2 subcomments

Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?

by aqme28

4 subcomments

Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.

by CubsFan1060

2 subcomments

Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/

by Anonyneko

1 subcomments

You have selected Microsoft Sam as the computer's default voice.

by podgietaru

2 subcomments

So we've really just settled on Vibe as the verb for AI then?

by ryukoposting

1 subcomments

Holy moly, a Microsoft AI product that isn't named Copilot!

by xnx

1 subcomments

Still waiting for the open weights model that conclusively beats the multi-year old Whisper in accuracy, features, and performance.

by vijgaurav

0 subcomment

The 60-minute single-pass transcription is the part that actually matters. Most ASR models chunk audio, and you lose speaker continuity across boundaries. If the diarization actually holds up on hour-long recordings without drifting, that's a real solve for podcast and meeting transcription workflows.
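To make the boundary problem concrete, here is a minimal sketch of fixed-length chunking. The 30-second chunk length and the function name are assumptions chosen for illustration, not something VibeVoice or any specific model is confirmed to use here. Every boundary it counts is a point where a chunk-based pipeline has to reconcile speaker labels between two independently processed pieces.

```python
def chunk_spans(total_seconds: float, chunk_seconds: float) -> list[tuple[float, float]]:
    """Split a recording into consecutive (start, end) spans of at most chunk_seconds."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans


# Hypothetical example: a one-hour recording cut into 30-second chunks.
spans = chunk_spans(60 * 60, 30)
boundaries = len(spans) - 1  # each boundary is a place where speaker identity can drift
print(len(spans), boundaries)
```

A single-pass model has no such boundaries, which is why the hour-long context matters for diarization.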

by chaosprint

0 subcomment

Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:
https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...

by triage8004

0 subcomment

Surprised it wasn't called Copilot Voice

by ipotapov

0 subcomment

I built speech-swift, which focuses on on-device speech processing like VibeVoice, but specifically leverages Apple Silicon's capabilities for ASR, TTS, and VAD without cloud dependency. Our ASR supports 52 languages with a real-time factor of 0.06. https://soniqo.audio/benchmarks
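For anyone unfamiliar with the metric, real-time factor is conventionally processing time divided by audio duration, so lower is faster. A rough sketch of what 0.06 would mean in practice, assuming that conventional definition applies to the figure above (the function name is illustrative):

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to process audio at a given real-time factor."""
    return audio_seconds * rtf


# Assumed example: one hour of audio at the quoted real-time factor of 0.06.
one_hour = 60 * 60
estimate = processing_seconds(one_hour, 0.06)
print(f"{estimate / 60:.1f} minutes")
```

Under that assumption, an hour of audio would take roughly three and a half minutes to process.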

by mberg

0 subcomment

I've been using VibeVoice's ASR (speech to text) model quite intensively for the past month and have found it to be a lot more reliable and out-of-the-box functional than Whisper, Parakeet and other models. The fact that it has diarization built into the model is a huge win in my book. Without that you have to run a separate model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice gives you reliably great results on its own. Big fan.

by frangonf

1 subcomments

I took a look into local options for ASR and diarization some months ago, and I missed that VibeVoice now has this feature.
My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen or NeMo solutions caught up in segmentation and speaker recognition?
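For context, the two-model approach usually means running transcription and diarization separately and then merging the outputs by time overlap. A minimal sketch of that merge step follows, using made-up segments and speaker turns; the data shapes and function names are assumptions for illustration, not the actual output format of Whisper or Pyannote.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each (start, end, text) segment with the speaker turn that overlaps it most."""
    labeled = []
    for start, end, text in segments:
        best = max(turns, key=lambda t: overlap(start, end, t[0], t[1]), default=None)
        if best is not None and overlap(start, end, best[0], best[1]) > 0:
            labeled.append((best[2], text))
        else:
            labeled.append(("unknown", text))
    return labeled


# Hypothetical transcript segments and diarization turns.
segments = [(0.0, 4.0, "Welcome to the show."), (4.2, 9.0, "Thanks for having me.")]
turns = [(0.0, 4.1, "SPEAKER_00"), (4.1, 9.5, "SPEAKER_01")]
labeled = assign_speakers(segments, turns)
for speaker, text in labeled:
    print(f"{speaker}: {text}")
```

A model with diarization built in skips this merge entirely, which is the trade-off being asked about.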

by Void_

4 subcomments

In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com):
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

by JumpCrisscross

3 subcomments

What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?

by leadgenman

0 subcomment

This is really great. I know it's not a new model, and it does often hallucinate, but these are really great frontier open-source voice AI models.

by Mobius01

0 subcomment

Microsoft has historically made poor choices in product naming, but this has to be a new low.

by dragonfax

1 subcomments

Shouldn't it be called something like "Copilot Voice"?

by yayadarsh

0 subcomment

Someone tell me if this is better or worse than Parakeet

by BlastBash192

1 subcomments

Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.

by ChrisArchitect

1 subcomments

Previously:
Sept 2025 https://news.ycombinator.com/item?id=45114245

by low_tech_punk

0 subcomment

When mixing languages, why does the English have a Chinese accent and the Chinese have an English accent? Is it a feature or a bug?

by lizardking

0 subcomment

Microsoft continues to be completely incapable of coming up with good names for their products and services

by threepts

0 subcomment

Explains most of the shit they have been pushing with Windows 11. Perhaps all that bloatware was VibeVoiced too.

by solomatov

0 subcomment

It would have been better if they provided not just weights, but also some frontend where it is usable as is.

by mistic92

0 subcomment

For me it's giving very poor results.

by yapyap

0 subcomment

Sounds like Msft wanted to coast on the “vibecode” vibe popularity?

by isolay

0 subcomment

Seriously, VibeVoice? Microslop really has a penchant for the worst names.

by khimaros

0 subcomment

looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested

by nickandbro

0 subcomment

This is a very good model, but can it be run on the web?

by unixhero

0 subcomment

What do they mean by frontier voice?

by dnivra26

0 subcomment

Any idea how this STT compares to Whisper large or turbo?

by decide1000

0 subcomment

Isn't voxtral much better?

by walthamstow

0 subcomment

Seems quite heavy for an STT model. Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck

by Zopieux

0 subcomment

English only?

by starkeeper

0 subcomment

Microsoft is famous for choosing terrible names, but how could they be this terrible?

by villgax

0 subcomment

lol they rug-pulled the 7B for our own safety some months ago

by simjnd

0 subcomment

What a terrible name

by matpb

0 subcomment

[flagged]

by vicchenai

0 subcomment

[dead]