Understand the two voice generation modes available for your AI assistants and when to use each one
| Mode | How it works | Typical latency | Best for | Voice options |
| --- | --- | --- | --- | --- |
| Pipeline | Speech-to-Text → LLM → Text-to-Speech | ~800–1500 ms | Complex reasoning, dynamic prompts, multi-sentence replies | All library voices, including custom-cloned |
| Speech-to-Speech (Multimodal) | Direct speech-to-speech generation (no intermediate text) | ~300–600 ms | Snappy back-and-forth, short and reactive replies | Limited set; expanding over time |
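
If you configure assistants programmatically, the mode and voice usually live together in the assistant's configuration. The TypeScript sketch below is illustrative only: the field names (`voiceMode`, `voiceId`) and the `AssistantConfig` type are assumptions, not a documented schema, so map them onto whatever your settings UI or API actually exposes.

```typescript
// Illustrative only: field names here are assumptions, not a documented schema.
type VoiceMode = "pipeline" | "speech-to-speech";

interface AssistantConfig {
  name: string;
  voiceMode: VoiceMode;
  // Only Pipeline mode can use any library or custom-cloned voice.
  voiceId?: string;
}

// Pipeline: ~800–1500 ms latency, full voice library, suits long, reasoned replies.
const supportAgent: AssistantConfig = {
  name: "support-agent",
  voiceMode: "pipeline",
  voiceId: "custom-cloned-voice-id", // hypothetical voice identifier
};

// Speech-to-Speech: ~300–600 ms latency, limited voice set, suits quick turns.
const receptionist: AssistantConfig = {
  name: "receptionist",
  voiceMode: "speech-to-speech",
};
```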
1. Open the assistant's settings.
2. Select a mode: Pipeline for complex, multi-sentence replies, or Speech-to-Speech for the fastest back-and-forth (a scripted sketch follows these steps).
3. Choose a voice (Pipeline only; Speech-to-Speech uses its more limited voice set).
4. Place a quick test call to compare latency and voice quality.
5. Decide on a mode and roll it out.
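
If your workspace exposes a REST API, steps 2–4 can be scripted. The sketch below is a minimal example under assumed endpoints (`PATCH /assistants/{id}` to change the mode, `POST /calls` to start a test call) and an assumed bearer-token header; substitute the real endpoints and payload shape from your API reference.

```typescript
// Hypothetical endpoints and payload shape; adjust to your actual API reference.
const BASE_URL = "https://api.example.com";
const API_KEY = process.env.VOICE_API_KEY ?? "";

async function setVoiceMode(
  assistantId: string,
  voiceMode: "pipeline" | "speech-to-speech",
): Promise<unknown> {
  // Steps 2–3: select a mode (and, for Pipeline, optionally a voice).
  const res = await fetch(`${BASE_URL}/assistants/${assistantId}`, {
    method: "PATCH",
    headers: { Authorization: `Bearer ${API_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ voiceMode }),
  });
  if (!res.ok) throw new Error(`Failed to update assistant: ${res.status}`);
  return res.json();
}

async function placeTestCall(assistantId: string, phoneNumber: string): Promise<unknown> {
  // Step 4: place a quick test call to hear latency and voice quality yourself.
  const res = await fetch(`${BASE_URL}/calls`, {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ assistantId, phoneNumber }),
  });
  if (!res.ok) throw new Error(`Failed to start test call: ${res.status}`);
  return res.json();
}
```

Running this once per mode against the same assistant gives you a direct comparison for step 5: keep whichever mode sounds better for your use case.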