- If you want to try out the voice cloning yourself you can do that on this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text, and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that read in your voice.
I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
- I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423
Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
uv run https://tools.simonwillison.net/python/q3_tts.py \
'I am a pirate, give me your gold!' \
-i 'gruff voice' -o pirate.wav
by TheAceOfHearts
2 subcomments
- Interesting model. I've managed to get the 0.6B param model running on my old 1080, and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm: when you chain the various snippets together, you really don't know what direction it's gonna go.
Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
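The chunking workflow described above can be sketched with a small splitter that keeps each chunk under a character budget while preferring sentence boundaries (a generic, model-agnostic sketch - the 200-character limit comes from the comment, everything else is illustrative):

```python
import re

def chunk_text(text, limit=200):
    """Split text into chunks of at most `limit` characters,
    preferring to break at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence longer than the limit.
        while len(sentence) > limit:
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if len(current) + len(sentence) + 1 <= limit:
            current = f"{current} {sentence}".strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be fed to the TTS model in turn and the resulting audio files concatenated.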
by genewitch
3 subcomments
- it isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and i thought the stuff from two years ago was about the best we were going to get. I don't know the size of these; i scrolled to the samples. I am going to get the models set up somewhere and test them out.
Now, maybe the results were cherry-picked; i know everyone else who has released one of these cherry-picks which samples to publish. However, this is the first time i've considered it plausible to use AI TTS to remaster old radio plays and the like, where a section of audio is unintelligible but can be deduced from context - like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of like Bob Bailey and people of that era.
by throwaw12
9 subcomments
- Qwen team, please please please, release something that outperforms the coding abilities of Opus 4.5.
Although I like the model, I don't like the leadership of that company, how closed it is, or how divisive they are in terms of politics.
- In my tests this doesn't come close to the years-old coqui/XTTS-v2, which has great voice cloning capabilities and creates rich, emotional speech with low latency. I've tried out several local TTS projects over the years, but i'm somewhat confused that nothing seems to be able to match coqui despite the leaps that we see in other areas of AI. Can somebody with more knowledge in this field explain why that might be? Or am i completely missing something?
- Amusingly, one of their examples (the final Age Control example) is prompted to have an American English accent, but to my ear it sounds like an Australian trying to sound American, haha.
- I can't quite figure this out: Can you save a generated voice for reuse later? The mlx-audio I looked at seems to take the text itself in every interface and doesn't expose it as a separate object. (I can dive deeper, but wanted to check if anyone's done it already)
by rahimnathwani
4 subcomments
- Has anyone successfully run this on a Mac? The installation instructions appear to assume an NVIDIA GPU (CUDA, FlashAttention), and I’m not sure whether it works with PyTorch’s Metal/MPS backend.
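For what it's worth, PyTorch does expose a runtime check for the Metal/MPS backend. A minimal device-selection sketch (assuming only stock PyTorch - whether the model's attention implementation actually runs on MPS is a separate question):

```python
def pick_device():
    """Choose the best available PyTorch device string.

    Falls back gracefully if torch isn't installed at all."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon Metal backend
    return "cpu"

device = pick_device()
```

On a recent-PyTorch Mac this typically returns "mps", and the model can be moved with `.to(device)` - though any FlashAttention-specific code path would still need to be disabled or swapped for the default attention implementation.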
by PunchyHamster
2 subcomments
- Looking forward to my grandma being scammed by one!
by satvikpendem
0 subcomments
- This would be great for audiobooks; some of the current AI TTS models still struggle with those.
by d4rkp4ttern
0 subcomments
- Curious how it compares to last week’s release of Kyutai’s Pocket-TTS [1] which is just 100M params, and excellent in both speed and quality (English only). I use it in my voice plugin [2] for quick voice updates in Claude Code.
[1] https://github.com/kyutai-labs/pocket-tts
[2] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
by anotherevan
0 subcomments
- Is there any way to take a cloned voice model and plug into Android TTS and/or Windows?
I have a friend with a paralysed larynx who often uses his phone or a small laptop to type in order to communicate. I know he would love it if it were possible to take old recordings of him speaking and use them to give him back "his" voice, at least in some small measure.
by 7777777phil
0 subcomments
- Here is a Colab Notebook where you can test it on any of the available GPUs (H100, A100, T4): https://colab.research.google.com/drive/1szmNh25TmMpPd4aKjWX...
by thedangler
2 subcomments
- Kind of a noob here: how would I implement this locally?
How do I pass it audio to process? I'm assuming it's in the API spec?
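The exact audio-input API depends on the serving stack, but most voice-cloning setups take a short reference WAV plus its transcript. A pure-stdlib sketch of writing and sanity-checking such a reference clip (illustrative only - this is not the Qwen API):

```python
import wave

def load_reference_wav(path):
    """Read a reference WAV clip; return (sample_rate, channels, duration_s).
    Voice-cloning models typically want a few seconds of clean mono speech."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, channels, duration

# Write a one-second silent 16 kHz mono clip just to demonstrate the round trip.
with wave.open("ref.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit PCM
    wf.setframerate(16000)  # 16 kHz
    wf.writeframes(b"\x00\x00" * 16000)

rate, channels, duration = load_reference_wav("ref.wav")
```

From there, a real pipeline would pass the WAV path (or raw samples) plus the matching transcript to whatever clone endpoint the chosen server exposes.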
- Voice actors are so cooked. Some of the demos arguably sounded way better than a lot of indie voice acting.
by indigodaddy
2 subcomments
- How does the cloning compare to pocket TTS?
- Haha, something I want to try out. I have started using voice input more and more instead of typing, and now I am on my second app and second model, namely Handy and Parakeet V3.
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
- i made an epub to audiobook generator using this with optional LLM integration for dramatized output: https://github.com/khimaros/autiobook -- also submitted here: https://news.ycombinator.com/item?id=46737968
- I still don't know anyone who has managed to get Qwen3-Omni working properly on a local machine.
by JonChesterfield
0 subcomments
- I see a lot of references to `device_map="cuda:0"` but no CUDA in the GitHub repo. Is the complete stack FlashAttention plus this Python plus the weights file, or does one need vLLM running as well?
by naveen-zerocool
0 subcomments
- I just created a video trying it out - https://youtu.be/0LU9nmnR0cs
by albertwang
7 subcomments
- great news, this looks great! is it just me, or do most of the english audio samples sound like anime voices?
- Any recommendations for an iOS app to test models like this? There are a few good ones for text gen, and it’s a great way to try models
- Prepare for an influx of sensational hot-mic clips allegedly from high profile people
- Tried the voice clone with a 30s Trump clip (with reference text), and it didn't sound like him at all.
- Can anyone please provide directions/links to tools that can be run locally, and that take an audio recording of a voice as an input, and produce an output with the same voice saying the same thing with the same intonations, but with a fixed/changed accent?
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had some accent.
by dangoodmanUT
0 subcomments
- Many voices clone better than 11labs, though admittedly at a lower bitrate.
- Honestly, this seems like it could be pretty cool for video games. I always liked Oblivion's 'Radiant AI', this could be a natural progression, give characters motivations, relations with the player and other NPCs and have an LLM spit out background dialogue, then have another model generate the audio.
by ideashower
2 subcomments
- Huh. One of the English Voice Clone examples features Obama.
by wahnfrieden
2 subcomments
- How is it for Japanese?
- So now we're getting every movie in "original voice" but in the local language? Can't wait to watch anime or Bollywood :D