TTS doesn't suck anymore
6 months ago I wrote a blog post on how TTS still sucked. The tables have turned. We have options now.
About 6 months ago I wrote a small rant on how open source TTS models still sucked. 6 months later, I’m happy to report that isn’t the case anymore.
In January this year, Qwen, the well-known Chinese AI lab, released Qwen3-TTS, an open-weights series of TTS models. The release included 2 CustomVoice models (pre-made voices + style control), 2 base models (zero-shot voice cloning + fine-tuning), and a VoiceDesign model (create voices from descriptions). At 0.6B and 1.7B parameters, they’re all quite small.
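To put "quite small" in numbers, here is a rough back-of-the-envelope sketch of how much memory the weights alone would take. It assumes the weights are stored in bf16 (2 bytes per parameter), which is my assumption and not something stated in the release; it also ignores activations, caches, and any audio codec or tokenizer that ships alongside the model. The helper name `weight_footprint_gb` is made up for illustration.

```python
# Rough weight-only memory estimate for a model of a given parameter count.
# Assumption: dtype sizes below; real checkpoints may use a different precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_footprint_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billions * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    # The two parameter counts mentioned in the post.
    for size in (0.6, 1.7):
        print(f"{size}B params in bf16: ~{weight_footprint_gb(size):.1f} GB of weights")
```

Under those assumptions, both sizes should fit comfortably on a single consumer GPU, which is a big part of why this release is practical to self-host.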

There is a lot to like. First, it fully supports voice cloning, via fine-tuning or zero-shot conditioning. Voxtral, for example, doesn’t. Second, the license is Apache 2.0, which means we can do whatever we want with it (beware). Third, it’s supported by a strong inference engine, in this case vLLM-Omni. And most importantly, it avoids many of the small issues other open-source TTS models had when generating longer pieces of text: squeaks, audio drops, weird pacing, etc.
There are some caveats. There is a small bug in the fine-tuning code that causes some weird accelerations in the generated audio. The base models also don’t support “style guidance”: you can’t tell your fine-tuned model to sound very angry or very sad.
For some reason, it’s not on the Speech Arena leaderboard for open-weights models. There are two entries in the ranking, but I’m not sure what model they refer to. From my experience, though, it’s not all about the model. The inference stack around it and how it “behaves in the wild” matter most, and Qwen3-TTS delivers.
I spent a morning getting a dataset ready for fine-tuning by..


