Chatterbox
Premium
Zero-shot voice cloning with expressive speech in 23 languages
Speed: Fast
Quality: Very Good
Cloning: Yes
Languages: 23
About Chatterbox
Chatterbox is a zero-shot voice cloning text-to-speech (TTS) model from Resemble AI. It clones a voice from just a few seconds of reference audio and generates natural, expressive speech in 23 languages. Chatterbox also supports paralinguistic tags that add non-verbal sounds such as laughter and coughs to generated speech.
Key Features
Zero-Shot Voice Cloning
Clone any voice from a few seconds of audio - no training required.
23 Languages
From Arabic to Chinese, covering most major world languages.
Expressive Tags
Add [laugh], [cough], [chuckle] for natural paralinguistic sounds.
Fast Inference
Sub-200ms latency with the Turbo variant for real-time applications.
Use Cases
Voice cloning for content creation
Multilingual voice applications
Character voice design for games
Personalized voice assistants
Frequently Asked Questions
Chatterbox is a zero-shot voice cloning TTS model from Resemble AI. It can replicate any voice from just a few seconds of reference audio and generate natural speech in 23 languages.
Yes, Chatterbox is fully MIT licensed - both code and model weights. It can be used freely in commercial applications. Generated audio includes an optional neural watermark that can be disabled.
Chatterbox supports 23 languages: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese.
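The language list above can be captured as a small lookup table for validating input before a request is made. This is a minimal sketch: the two-letter ISO 639-1 codes and the helper name `resolve_language` are assumptions for illustration, since this page lists language names only and does not say how languages are identified in requests.

```python
# ISO 639-1 codes for the 23 listed languages (the codes are an assumption;
# the page itself only gives language names).
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "sw": "Swahili", "tr": "Turkish", "zh": "Chinese",
}


def resolve_language(code: str) -> str:
    """Return the language name for a code, or raise ValueError if unsupported."""
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[key]
```

Calling `resolve_language("fr")` returns `"French"`, while a code outside the list, such as `"xx"`, raises `ValueError`.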
Provide a reference audio clip of any voice (a few seconds is enough). Chatterbox extracts the voice characteristics and can then generate new speech in that voice. No fine-tuning or training is needed.
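The cloning workflow described above can be sketched with the open-source `chatterbox-tts` Python package. This assumes the package and `torchaudio` are installed and that a GPU is available; the function name `clone_voice` and its parameters are hypothetical, and a hosted service may expose a different interface.

```python
def clone_voice(text: str, reference_wav: str, output_wav: str,
                device: str = "cuda") -> None:
    """Generate `text` in the voice of `reference_wav` and save it to `output_wav`."""
    # Imported lazily because both packages are heavy, optional dependencies.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    # Load the pretrained model onto the chosen device.
    model = ChatterboxTTS.from_pretrained(device=device)

    # The reference clip conditions generation; there is no fine-tuning step.
    wav = model.generate(text, audio_prompt_path=reference_wav)

    # Write the waveform at the model's native sample rate.
    torchaudio.save(output_wav, wav, model.sr)
```

A call might look like `clone_voice("Hello there.", "speaker.wav", "cloned.wav")`, where `speaker.wav` is a placeholder path to a few seconds of clean reference audio.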
Chatterbox supports special tags in your text: [laugh] for laughter, [cough] for coughing, and [chuckle] for chuckling. These add natural non-verbal sounds to the generated speech.
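Because only three tags are listed, it can help to check a script for unrecognized tags before sending it for generation. The sketch below is illustrative only: the helper `find_unsupported_tags` is hypothetical, and it assumes tags are lowercase and case-sensitive, which this page does not state.

```python
import re

# Tags named on this page; anything else in square brackets is flagged.
SUPPORTED_TAGS = {"laugh", "cough", "chuckle"}
TAG_PATTERN = re.compile(r"\[([A-Za-z_]+)\]")


def find_unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in SUPPORTED_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]


script = "That was close [laugh]. Sorry [cough], where was I? [sigh]"
print(find_unsupported_tags(script))  # ['sigh']
```

How an unsupported tag is actually handled at generation time is not described here, so flagging it up front is the cautious choice.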
The standard variant generates speech quickly on GPU. The Turbo variant achieves sub-200ms latency, making it suitable for real-time conversational applications.
Chatterbox requires 4-8GB of VRAM depending on the variant. The standard model works well with 6GB, while Turbo requires 4GB.
Chatterbox and F5-TTS both support voice cloning, but Chatterbox supports more languages (23 vs 2) and includes expressive tags. F5-TTS may produce slightly more natural prosody for English. Choose Chatterbox for multilingual cloning.
Chatterbox and OpenVoice both offer voice cloning with good quality. Chatterbox supports 23 languages, more than OpenVoice. OpenVoice offers tone style controls (friendly, sad, angry, etc.) that Chatterbox does not.
Technical Specs
- Generation Speed: Fast
- Output Quality: Very Good
- Voice Cloning: Supported
- Languages: 23
- GPU VRAM: 4-8GB
- Credits/1000 chars: 25
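The credit rate in the spec list can be turned into a quick cost estimate. This is a rough sketch: the helper `estimate_credits` is hypothetical, and rounding up to a whole credit is an assumption, since the page does not say how partial credits are billed or whether tags and whitespace count as characters.

```python
CREDITS_PER_1000_CHARS = 25  # rate taken from the spec list above


def estimate_credits(text: str) -> int:
    """Estimate credits for `text`, rounding up to a whole credit (assumed)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (len(text) * CREDITS_PER_1000_CHARS + 999) // 1000
```

Under these assumptions, a 1,000-character script costs 25 credits and a 400-character script costs 10.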