VibeVoice: A Frontier Open-Source Text-to-Speech Model

436 points by lastdong

by simiones

11 subcomments

I read the comments praising these voices as very lifelike, and went to the page primed to hear very convincing voices. That is not at all what I heard though.
The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.
The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample, that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that, and (2) the different character systems make it extremely clear that the model needs to switch between different languages. Peut-être que cela n'aurait pas été si simple if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).
And, of course, the singing part is painfully bad, I am very curious why they even included it.

by giancarlostoro

3 subcomments

I really hope someone within Microsoft is naming their open source coding agent Microsoft VibeCode. Let this be a thing. It's either that or "Lo", then you can have Lo work with Phi, so you can Vibe code with Lo Phi.
https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

by davorak

2 subcomments

Any insight on why the code and the large model were removed? Some copies are floating around and are MIT licensed. In cases like this I do not know why the projects are yanked. If the project was mistakenly released under MIT, copied elsewhere, is any damage control possible by yanking the copies you have control over? Mostly seems like bad PR, if minor.

by malnourish

3 subcomments

This is clearly high quality but there's something about the voices, the male voices in particular, which immediately register as computer generated. My audio vocabulary is not rich enough to articulate what it is.

by strangescript

1 subcomments

The male voices seem much worse than the female voices, borderline robotic. Every sample on their website starts with a female voice. They clearly are aware of the issue.

by aargh_aargh

4 subcomments

Is there a current, updated list (ideally, a ranking) of the best open weights TTS models?
I'm actually more interested in STT (ASR) but the choices there are rather limited.

by TheAceOfHearts

1 subcomments

Unfortunately it's not usable if you're GPU-poor. Couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66 second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor the best TTS model I've tried so far is Kokoro.
Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and it generates an annotated output, which can be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything correctly in a single pass.
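For scale, the CPU timing quoted above (832 seconds of compute for a 66 second clip) can be expressed as a real-time factor. This is only a sketch of the arithmetic; the function name `real_time_factor` is made up for illustration and is not part of VibeVoice or any TTS library.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced.

    Values above 1.0 mean generation is slower than playback.
    (Hypothetical helper, named here only for illustration.)
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures taken from the comment: 832 s to generate a 66 s clip on CPU.
rtf = real_time_factor(832, 66)
print(f"RTF = {rtf:.1f}x")  # 832 / 66 = 12.606..., prints "RTF = 12.6x"
```

At roughly 12.6 seconds of compute per second of audio, an hour of speech would take on the order of 12.6 hours on that machine.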

by Insanity

2 subcomments

What an odd name to me, because "Vibe" is, in my mind, equal to somewhat poor quality. Like "Vibe Coding". But that's probably just some bias from my side.

by Meneth

1 subcomments

Open-source, eh? Where's the training data, then?

by rafaelmn

2 subcomments

The Spontaneous Emotion dialog sounds like a team member venting through LLMs.
They could have skipped the singing part, it would be better if the model did not try to do that :)

by stuffoverflow

0 subcomment

VibeVoice-Large is the first local TTS that can produce convincing Finnish speech with little to no accent. I tinkered with it yesterday and was pleasantly surprised at how good the voice cloning is and how it "clones" the emotion in the speech as well.

by crvdgc

0 subcomment

Very impressive that it can reproduce the Mandarin accent when speaking English and English accent when speaking Mandarin.

by data-ottawa

0 subcomment

Looks like the repo went private
https://github.com/microsoft/VibeVoice
I was trying to get this working on strix halo.

by lxe

0 subcomment

There are 2 "best" TTS models out right now: HiggsAudio and VibeVoice. I found that Higgs is both faster and much higher fidelity than Vibe. Can't speak to expressiveness, but don't sleep on it.

by mpaepper

0 subcomment

Unfortunate naming, given that 7 months ago I named my repo, which does open source, locally running speech to text, vibevoice:
https://github.com/mpaepper/vibevoice

by ndkap

0 subcomment

Here is AI being as close as possible to the most animated person I know and here I am sounding robotic in every conversation I have, despite my best efforts to sound otherwise. Sometimes, I just wish I could have an AI speak for me

by glenstein

0 subcomment

Very good and I could see how I might believe they are real people if I let my guard down. The male voice sounded a little sedated though and there was a smoothness to it that could be samey over long stretches.
Still not at the astonishing level of Google Notebook text to speech which has been out for a while now. I still can't believe how good that one is.

by regularfry

0 subcomment

Ok, this is nit-picking, but it's very obvious that the sample voices these were trained with were captured in different audio environments. There's noticeable reverb on the male voice that's not there on the other.
So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.

by cush

0 subcomment

To me this is like early generative AI art, where the images came out very "smooth" and visually buttery, but instead there's no timbre to the voices. Intonation issues aside, these models could use a touch of vocal fry and some body to be more believable

by bityard

0 subcomment

I thought the name sounded familiar, I'm guessing it's no relation to this project which has been around for 7 months? https://github.com/mpaepper/vibevoice

by faxmeyourcode

0 subcomment

I tried the colab notebook that they link to and couldn't replicate the quality for whatever reason. I just swapped out the text and let it run on the introduction paragraph of Metamorphosis by Franz Kafka and it seemingly could not handle the intricacies.

by wewewedxfgdf

2 subcomments

I'm really hoping one day there will be a TTS that does really nice British accents - I've surveyed them all deeply, none do.
Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British.

by bazlan

0 subcomment

Sad to not see vui on the comparisons!
A 100M podcast model
https://huggingface.co/spaces/fluxions/vui-space

by ementally

2 subcomments

they vibecoded their demo website? the text is invisible on Firefox.

by qwertytyyuu

0 subcomment

Woah they even imitate the western Chinese accent well

by baal80spam

2 subcomments

Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking.

by ml_basics

0 subcomment

what's the relationship between this work and the recently announced voice models from Microsoft AI? https://microsoft.ai/news/two-new-in-house-models/

by ehutch79

0 subcomment

The examples are kind of off-putting. We're definitely in uncanny valley territory here.

by nextworddev

0 subcomment

Still haven’t found anything better than kokoro tts. Anyone know something better?

by weeb

1 subcomments

does anyone know of recent TTS options that let you specify IPA rather than written words? Azure lets you do this, but something local (and better than existing OS voices) would be great for my project.

by tehlike

0 subcomment

The comments in the HTML code are in Chinese, which is very interesting.

by egorfine

1 subcomments

[deleted - I'm an idiot]

by swiftcoder

1 subcomments

Ah, yes, the Furious 7 soundtrack. Definitely something everyone recalls

by baxuz

1 subcomments

Looking forward to the day when tts and speech recognition will work on Croatian, or other less prevalent languages.
It seems that it's only variants of English, Spanish and Chinese which are somewhat working.

by throwaw12

0 subcomment

Will there be support for SSML to give more control over the conversation?

by lagniappe

1 subcomments

Bots should never sing.

by Havoc

3 subcomments

MIT license - very nice!

by agos

1 subcomments

seemingly supports only English, Indian and Chinese

by cush

0 subcomment

I tried using the demo but it just errors out

by lyu07282

1 subcomments

Did they delete the repo? It's 404 for me now: https://github.com/microsoft/VibeVoice

by amelius

2 subcomments

I tried some TTS models a while ago, but I noticed that none of them allowed you to put markup statements in the text. For example, it would be nice to do something like:
```
     Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]
```
etc.
In fact, I think this kind of thing is absolutely necessary if you want to use this to replace a voice actor.
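As a rough illustration of the idea, a front end could split such annotated text into plain-text and tag segments before anything reaches the TTS model. The bracket syntax is the hypothetical one from the example above, and `split_markup` is an invented name; none of the models discussed in this thread is known to accept input in this form.

```python
import re
from typing import List, Tuple

# Hypothetical inline tag syntax: lowercase words inside square brackets.
TAG = re.compile(r"\[([a-z_ ]+)\]")


def split_markup(text: str) -> List[Tuple[str, str]]:
    """Split annotated text into ordered ("text", ...) and ("tag", ...) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]"
for kind, value in split_markup(example):
    print(kind, repr(value))
```

How a model would actually consume the tag segments (as style conditioning, as non-speech sounds, or something else) is left open here; the sketch only covers the parsing step.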

by sciencesama

1 subcomments

Need this for mac

by watsonmusic

0 subcomment

one of the best models built by Microsoft

by anarticle

0 subcomment

The first example sounds like a cry for help.
Some of them have tone wobbles which iirc was more common in early TTS models. Looks like the huge context window is really helping out here.

by viggity

1 subcomments

I feel like this is a step in the right direction, but a lot of emotive text-to-speech models are only changing the duration and loudness of each word; the timing and pauses are better too.
I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

by enigma101

1 subcomments

only microsoft could come up with such a name rofl