Chatterbox

Premium

Zero-shot voice cloning with expressive speech in 23 languages

Speed: Fast
Quality: Very Good
Cloning: Yes
Languages: 23

About Chatterbox

Chatterbox is a powerful voice cloning TTS model from Resemble AI. It performs zero-shot voice cloning from just a few seconds of reference audio, supporting 23 languages with natural expression. Chatterbox includes paralinguistic tags for adding natural sounds like laughter and coughs to generated speech.

Key Features

Zero-Shot Voice Cloning

Clone any voice from a few seconds of audio - no training required.

23 Languages

From Arabic to Chinese, covering most major world languages.

Expressive Tags

Add [laugh], [cough], [chuckle] for natural paralinguistic sounds.

Fast Inference

Sub-200ms latency with the Turbo variant for real-time applications.

Use Cases

  • Voice cloning for content creation
  • Multilingual voice applications
  • Character voice design for games
  • Personalized voice assistants

Frequently Asked Questions

Chatterbox is a zero-shot voice cloning TTS model from Resemble AI. It can replicate any voice from just a few seconds of reference audio and generate natural speech in 23 languages.

Chatterbox is fully MIT licensed, covering both code and model weights, so it can be used freely in commercial applications. Generated audio includes an optional neural watermark that can be disabled.

Chatterbox supports 23 languages: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese.
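As a rough illustration, the language list above can be kept as a lookup table for validating user input before sending a request. The keys are standard ISO 639-1 codes; whether a particular Chatterbox API accepts these exact codes is an assumption, and `is_supported` is a hypothetical helper name.

```python
# ISO 639-1 codes for the 23 languages listed above.
# Assumption: the source does not say which identifiers the model or API expects.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "sw": "Swahili", "tr": "Turkish", "zh": "Chinese",
}


def is_supported(code: str) -> bool:
    """Return True if the language code is in the table (case-insensitive)."""
    return code.strip().lower() in SUPPORTED_LANGUAGES
```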

Provide a reference audio clip of any voice (a few seconds is enough). Chatterbox extracts the voice characteristics and can then generate new speech in that voice. No fine-tuning or training is needed.
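Before submitting a reference clip, it can help to confirm that it really is a few seconds long. The sketch below uses only Python's standard `wave` module and works for uncompressed PCM WAV files. The 3-second threshold and the helper names are assumptions for illustration; the source only says "a few seconds".

```python
import wave

# Assumption: the source gives no exact minimum, only "a few seconds".
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str, min_seconds: float = MIN_REFERENCE_SECONDS) -> bool:
    """Return True if the clip is at least `min_seconds` long."""
    return wav_duration_seconds(path) >= min_seconds
```

This checks length only; it says nothing about noise or recording quality, which also affect cloning results.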

Chatterbox supports special tags in your text: [laugh] for laughter, [cough] for coughing, and [chuckle] for chuckling. These add natural non-verbal sounds to the generated speech.
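A small pre-flight check can catch mistyped tags before text is sent for synthesis. This is a minimal sketch that assumes the three tags named above are the full supported set; `unsupported_tags` is a hypothetical helper, not part of Chatterbox.

```python
import re

# Tags named in the text above. Assumption: this is the complete set.
SUPPORTED_TAGS = {"laugh", "cough", "chuckle"}

# Matches any bracketed token such as [laugh] and captures its contents.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not supported, in order of appearance."""
    return [
        tag for tag in TAG_PATTERN.findall(text)
        if tag.strip().lower() not in SUPPORTED_TAGS
    ]
```

For example, `unsupported_tags("Oh no [sigh] well [chuckle]")` flags only `sigh`.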

The standard variant generates speech quickly on GPU. The Turbo variant achieves sub-200ms latency, making it suitable for real-time conversational applications.

Chatterbox requires 4-8GB of VRAM depending on the variant. The standard model works well with 6GB, while Turbo requires 4GB.
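The VRAM guidance above can be turned into a simple selection rule. The sketch below takes the 6GB and 4GB figures from the text; preferring the standard model whenever it fits is an assumption, and `pick_variant` is a hypothetical helper.

```python
from typing import Optional

# VRAM figures from the text above: standard works well with 6GB, Turbo needs 4GB.
VRAM_REQUIRED_GB = {"standard": 6.0, "turbo": 4.0}


def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Pick a variant for the given VRAM, or None if neither fits.

    Assumption: the standard model is preferred whenever it fits.
    """
    if available_vram_gb >= VRAM_REQUIRED_GB["standard"]:
        return "standard"
    if available_vram_gb >= VRAM_REQUIRED_GB["turbo"]:
        return "turbo"
    return None
```

An application that needs real-time latency might instead prefer Turbo regardless of available memory.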

Compared with F5-TTS: both support voice cloning, but Chatterbox supports more languages (23 vs 2) and includes expressive tags. F5-TTS may produce slightly more natural prosody for English. Choose Chatterbox for multilingual cloning.

Compared with OpenVoice: both offer good-quality voice cloning. Chatterbox supports 23 languages, more than OpenVoice. OpenVoice offers tone style controls (friendly, sad, angry, etc.) that Chatterbox does not.

Technical Specs

  • Generation Speed: Fast
  • Output Quality: Very Good
  • Voice Cloning: Supported
  • Languages: 23
  • GPU VRAM: 4-8GB
  • Credits/1000 chars: 25
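The credit rate in the specs can be used to estimate the cost of a script before generating it. This is a sketch only: the rate of 25 credits per 1000 characters comes from the list above, while rounding up to a whole credit and counting every character (including spaces and tags) are assumptions about billing.

```python
import math

CREDITS_PER_1000_CHARS = 25  # from the specs list above


def estimate_credits(text: str) -> int:
    """Estimate the credits needed to synthesize `text`.

    Assumption: partial credits are rounded up and all characters count.
    """
    return math.ceil(len(text) * CREDITS_PER_1000_CHARS / 1000)
```

Under these assumptions a 1000-character script costs 25 credits and a 100-character line costs 3.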

Try Chatterbox Now

Generate your first audio free. No credit card required.
