VITS

Standard

Fast End-to-End TTS with Natural Speech

Very Fast Speed

Good Quality

No Cloning

10 Languages

About VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a fast neural TTS model that generates natural-sounding speech directly from text, without a separate vocoder stage. It combines a variational autoencoder with adversarial training for efficient synthesis. VITS suits batch processing and other applications that need both quality and speed.

Key Features

Fast Synthesis

End-to-end architecture for rapid speech generation.

Batch Processing

Efficiently process multiple texts simultaneously.

Natural Speech

VAE+GAN training produces natural prosody and rhythm.

Multi-Speaker

Single model supports multiple speaker voices.

Efficient

Low memory footprint with good performance.

Open Source

MIT licensed for any use case.

Use Cases

Batch Audio Generation

E-Learning Platforms

News Readers

Automated Announcements

IVR Systems

High-Volume Content

VITS Voices

View All 109

LJSpeech (English Female)

VCTK Speaker 225 (English Female)

VCTK Speaker 226 (English Male)

VCTK Speaker 227 (English Male)

VCTK Speaker 228 (English Female)

VCTK Speaker 229

VCTK Speaker 230

VCTK Speaker 231

VCTK Speaker 232

VCTK Speaker 233

VCTK Speaker 234

VCTK Speaker 236

Frequently Asked Questions

What is VITS?

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end neural TTS model that combines VAE and GAN training. It generates natural speech quickly and efficiently.

Is VITS free to use?

The VITS model itself is open source under the MIT license. On TextToSpeechAI, we charge 10 credits per 1,000 characters (Standard tier), a low rate made possible by its efficient resource usage.
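As a rough illustration of the Standard-tier rate quoted above, the sketch below estimates credits from a character count. It assumes simple pro-rata billing; how the service actually rounds or counts characters is not stated here, so `estimate_credits` is a hypothetical helper, not part of any official client.

```python
CREDITS_PER_1000_CHARS = 10  # Standard-tier rate quoted in the FAQ


def estimate_credits(char_count: int) -> float:
    """Estimate credits for a text of char_count characters.

    Assumption: billing is pro-rata per character. Real billing may
    round up to a block size or count characters differently.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count * CREDITS_PER_1000_CHARS / 1000


script = "Welcome to the course. " * 100  # 2,300 characters
print(len(script), estimate_credits(len(script)))
```

Under this assumption, a 2,500-character script would cost an estimated 25 credits.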

What languages does VITS support?

Language support depends on the trained model. Dedicated models are commonly available for English, Chinese, Japanese, Korean, German, French, and other major languages.

How fast is VITS?

VITS is very fast, generating speech in real time or faster on a GPU. Its end-to-end architecture avoids the separate acoustic-model and vocoder stages used by conventional pipelines, enabling rapid synthesis.
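"Real time or faster" is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced. A minimal sketch of that calculation follows; the timings are illustrative placeholders, not measurements of VITS.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 2 s of compute for 10 s of audio.
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
```

An RTF below 1.0 means audio is generated faster than it plays back.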

Does VITS support voice cloning?

Standard VITS does not support voice cloning; it uses pre-trained speaker models. For voice cloning, use StyleTTS 2, F5-TTS, OpenVoice, or Tortoise instead.

What is the audio quality of VITS?

VITS produces good-quality audio with natural prosody. While not at the level of StyleTTS 2 or Tortoise, it offers excellent quality for its speed, especially for batch processing scenarios.

How much GPU memory does VITS need?

VITS is very memory-efficient, requiring only 1-2 GB of VRAM. It runs well on consumer GPUs and can even run on CPU with reasonable performance.

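Can I use VITS commercially?

Yes. VITS is MIT licensed, which permits commercial use; the main condition is that the copyright and license notice be kept with copies of the software. It is widely used in commercial products and services.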

How do I use VITS with the TextToSpeechAI API?

Select a VITS voice from our voice library (marked with a VITS badge) and use it in your API requests. VITS is a good fit for applications that need fast turnaround on many requests.
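For the many-requests case mentioned above, one common pattern is to fan texts out over a small worker pool. The sketch below shows only that pattern: `synthesize_stub` and the voice name are hypothetical placeholders that make no network call, and the real TextToSpeechAI request format is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_stub(text: str, voice: str) -> bytes:
    """Hypothetical placeholder; a real API client call would go here."""
    return f"{voice}:{text}".encode("utf-8")


def synthesize_batch(texts, voice, max_workers=4):
    """Synthesize several texts concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(lambda t: synthesize_stub(t, voice), texts))


prompts = ["Welcome.", "Press 1 for support.", "Goodbye."]
clips = synthesize_batch(prompts, voice="hypothetical-vits-voice")
print(len(clips))
```

Threads suit this workload because each request mostly waits on I/O; the worker count should respect whatever rate limits apply to your account.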

What audio formats does VITS output?

VITS outputs WAV audio at 22,050 Hz natively. Through TextToSpeechAI, you can request MP3, WAV, or OGG formats with automatic conversion.
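To make the 22,050 Hz figure concrete, the sketch below writes one second of a test tone as a WAV file in memory with Python's standard `wave` module and reads the header back. The mono, 16-bit PCM layout is an assumption for illustration; the source does not state the bit depth or channel count of the delivered audio.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # Hz, the native VITS rate stated above


def tone_wav_bytes(seconds: float = 1.0, freq: float = 440.0) -> bytes:
    """Build a mono 16-bit PCM WAV (assumed layout) holding a sine tone."""
    n = int(SAMPLE_RATE * seconds)
    samples = [
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
        for i in range(n)
    ]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 2 bytes = 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(struct.pack(f"<{n}h", *samples))
    return buf.getvalue()


data = tone_wav_bytes(1.0)
with wave.open(io.BytesIO(data), "rb") as r:
    rate, frames = r.getframerate(), r.getnframes()
print(rate, frames, frames / rate)
```

At this rate and layout, each second of audio holds 22,050 samples, or about 44 KB of uncompressed PCM data.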

Can I adjust speed and pitch with VITS?

Yes, VITS supports speed adjustments, and some models support pitch modification. These allow customization of the voice output for different use cases.
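In the original open-source VITS implementation, speed is controlled at inference time by a `length_scale` value that multiplies the predicted durations, so values above 1 slow speech down. The sketch below converts a user-facing speed multiplier into that scale; the `speed` parameter and helper name are hypothetical, and how TextToSpeechAI exposes speed control is not specified here.

```python
def speed_to_length_scale(speed: float) -> float:
    """Map a speed multiplier (hypothetical parameter) to a duration scale.

    speed=1.0 leaves durations unchanged; speed=2.0 halves them.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed


for speed in (0.8, 1.0, 1.25):
    print(speed, round(speed_to_length_scale(speed), 3))
```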

How does VITS compare to other TTS engines?

VITS offers an excellent balance of speed and quality for standard TTS needs. It is similar to Piper in speed, with slightly higher quality. For voice cloning, use one of the cloning-capable models; for the highest quality, use StyleTTS 2.

Technical Specs

Generation Speed: Very Fast
Output Quality: Good
Voice Cloning: Not Supported
Languages: 10
GPU VRAM: 1-2 GB
Credits per 1,000 characters: 10

Try VITS Now

Generate your first audio free. No credit card required.


Other TTS Engines

Bark

Chatterbox

CosyVoice2