Latency varies by language, model, and network conditions. Values below are typical ranges.
Quick comparison
| Mode | How it works | Typical latency | Best for | Voice options |
|---|---|---|---|---|
| Pipeline | Speech-to-Text → LLM → Text-to-Speech | ~800–1500 ms | Complex reasoning, dynamic prompts, multi-sentence replies | All library voices, including custom-cloned |
| Speech-to-Speech (Multimodal) | Direct speech-to-speech generation (no intermediate text) | ~300–600 ms | Snappy back-and-forth, short and reactive replies | Limited set; expanding over time |
1. Pipeline
- Label in UI: Pipeline
- How it works: Speech-to-Text → LLM → Text-to-Speech
- Latency: ~800–1500 ms (depends on language and model)
- Best for: Complex reasoning, dynamic prompts, multi-sentence replies
- Supports all voices in the library (including custom-cloned voices).
- Handles long-form answers or paragraph-style responses well.
- Allows the LLM to inject variables and reference earlier context cleanly.
When to choose Pipeline
- You need rich, multi-sentence answers (e.g., support queries, detailed explanations).
- The assistant must reason over structured data or complex prompts.
- You need full control over the spoken voice (for example, a cloned or brand voice).
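Because Pipeline chains three stages, its latency is roughly the sum of the stages. The sketch below illustrates that arithmetic only. It assumes the stages run strictly one after another (a simplification), and the per-stage figures are hypothetical values chosen so that the total matches the ~800–1500 ms range quoted above; they are not measurements of any real component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Latency range for one stage, in milliseconds (illustrative only)."""
    name: str
    low_ms: int
    high_ms: int


def total_range(stages):
    """Return (low, high) total latency, assuming stages run sequentially."""
    low = sum(stage.low_ms for stage in stages)
    high = sum(stage.high_ms for stage in stages)
    return low, high


# Hypothetical per-stage split; only the overall range comes from this page.
PIPELINE_STAGES = [
    StageLatency("speech-to-text", 200, 400),
    StageLatency("llm", 400, 700),
    StageLatency("text-to-speech", 200, 400),
]

if __name__ == "__main__":
    low, high = total_range(PIPELINE_STAGES)
    print(f"Pipeline total: ~{low}-{high} ms")
```

The point of the sketch is that every stage adds to the wait the caller hears, which is why a mode with no intermediate text step can respond faster.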
2. Speech-to-Speech (Multimodal)
- Label in UI: Speech-to-speech
- How it works: Direct speech-to-speech generation (no intermediate text)
- Latency: ~300–600 ms (ultra low)
- Best for: Natural back-and-forth, short and reactive replies
- Fast turn-taking – callers experience near-instant responses.
- Generates more expressive prosody natively (intonation, fillers).
- Currently supports a limited voice set, but more are added regularly.
When to choose Speech-to-Speech
- The conversation needs to feel snappy (sales, booking confirmations).
- Your replies are generally short sentences or quick acknowledgements.
- You can accept the system-provided voice options in exchange for faster interaction.
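The two "When to choose" lists can be read as a simple decision rule. The following is a minimal sketch of that rule, not part of the product: `Requirements` and `suggest_mode` are hypothetical names introduced here, and the logic simply treats any Pipeline criterion as decisive.

```python
from dataclasses import dataclass

PIPELINE = "Pipeline"
SPEECH_TO_SPEECH = "Speech-to-speech"


@dataclass
class Requirements:
    """Hypothetical checklist mirroring the criteria listed on this page."""
    needs_custom_voice: bool = False       # cloned or brand voice required
    needs_complex_reasoning: bool = False  # structured data, complex prompts
    long_replies: bool = False             # multi-sentence or paragraph answers


def suggest_mode(req: Requirements) -> str:
    """Suggest a voice engine mode; any Pipeline criterion selects Pipeline."""
    if req.needs_custom_voice or req.needs_complex_reasoning or req.long_replies:
        return PIPELINE
    # Short, reactive conversations with system-provided voices.
    return SPEECH_TO_SPEECH


if __name__ == "__main__":
    print(suggest_mode(Requirements(long_replies=True)))
    print(suggest_mode(Requirements()))
```

Treat the result as a starting point only; a test call in each mode is still the better guide.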
Switching modes
Test both modes and pick the best balance of speed and quality for your use case.
1. Open assistant settings
Go to Assistant → Settings → Voice Engine for the specific assistant.
2. Select a mode
Choose Pipeline or Speech-to-speech based on conversation style and latency needs.
3. Choose a voice (if using Pipeline)
Select a built-in voice or a custom-cloned voice. See: Voice Selection & Voice Cloning.
4. Place quick test calls
Record two short calls, one per mode, covering your most common scenarios.
Confirm acceptable latency, turn-taking, and tone consistency.
5. Decide and roll out
Pick the mode that best fits your flow and keep monitoring call recordings for quality.
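To make the comparison between test calls less subjective, you can note the response delay for each turn while reviewing the recordings and summarize the numbers per mode. The sketch below assumes you collect those delays by hand; the function names, the sample values, and the 600 ms budget are all hypothetical and are not produced by any platform API.

```python
from statistics import median


def summarize(latencies_ms):
    """Return the median and worst response delay for one test call."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(latencies_ms)
    return {"median_ms": median(ordered), "worst_ms": ordered[-1]}


def within_budget(latencies_ms, budget_ms):
    """True if even the slowest turn stayed within the latency budget."""
    return summarize(latencies_ms)["worst_ms"] <= budget_ms


# Hypothetical hand-recorded delays (ms) from one call per mode.
SAMPLE_CALLS = {
    "Pipeline": [950, 1100, 1300],
    "Speech-to-speech": [350, 420, 580],
}

if __name__ == "__main__":
    for mode, delays in SAMPLE_CALLS.items():
        print(mode, summarize(delays), "ok" if within_budget(delays, 600) else "over budget")
```

Latency is only one of the three checks; turn-taking and tone consistency still need a listen.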

