Speech-to-Speech (S2S) mode skips the separate STT and TTS stages: the model consumes and produces audio natively, which keeps latency very low (~300 ms).
Configuration
Set pipeline_mode: "s2s" in the agent config. The stt, llm, and tts fields are ignored in S2S mode.
{
  "pipeline_mode": "s2s",
  "s2s": {
    "provider": "openai",
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }
}
Options
| Field | Required | Description |
|---|---|---|
| provider | Yes | openai or google |
| model | No | Model name (defaults per provider) |
| voice | No | Voice name (default: alloy for openai, Charon for google) |
| turn_detection | No | server_vad (default) or pipecat_vad |
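The options above can be checked before an agent starts. The following is a minimal sketch, not the platform's own validation: normalize_s2s is a hypothetical helper, and it assumes the "s2s" block has already been parsed into a Python dict. The allowed values and voice defaults come from the table. The default model names are not listed here, so the sketch leaves model unset rather than guessing.

```python
from typing import Any, Dict

# Allowed values and defaults, copied from the options table.
PROVIDERS = {"openai", "google"}
TURN_DETECTION_MODES = {"server_vad", "pipecat_vad"}
DEFAULT_VOICE = {"openai": "alloy", "google": "Charon"}


def normalize_s2s(s2s: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: validate an "s2s" block and fill in documented defaults."""
    provider = s2s.get("provider")
    if provider not in PROVIDERS:
        raise ValueError(
            f"provider must be one of {sorted(PROVIDERS)}, got {provider!r}"
        )

    turn_detection = s2s.get("turn_detection", "server_vad")
    if turn_detection not in TURN_DETECTION_MODES:
        raise ValueError(f"unsupported turn_detection: {turn_detection!r}")

    return {
        "provider": provider,
        # None means "use the provider's default model"; the actual
        # default names are not documented in the table.
        "model": s2s.get("model"),
        "voice": s2s.get("voice", DEFAULT_VOICE[provider]),
        "turn_detection": turn_detection,
    }
```

For example, normalize_s2s({"provider": "google"}) fills in the Charon voice and server_vad, while a missing or unknown provider raises ValueError.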
Pipeline
transport.input → user_agg (VAD)
→ S2S_LLM (OpenAI Realtime / Gemini Live WebSocket)
→ transport.output
→ context_aggregator.assistant → observability
Required API Keys
| Provider | Environment Variable |
|---|---|
| OpenAI Realtime | OPENAI_API_KEY |
| Gemini Live | GOOGLE_API_KEY |
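A startup check for the key in the table above might look like the sketch below. The function name missing_api_key is an assumption made for illustration, and the provider keys ("openai", "google") are assumed to match the provider values used in the s2s config.

```python
import os
from typing import Mapping, Optional

# Provider value -> required environment variable, from the table above.
REQUIRED_ENV_VAR = {
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_api_key(
    provider: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Hypothetical helper: return the name of the unset variable, or None if it is set."""
    var = REQUIRED_ENV_VAR[provider]
    # An empty string is treated the same as an unset variable.
    return None if env.get(var) else var
```

Passing env explicitly keeps the check easy to test; in normal use, calling missing_api_key("openai") reads the process environment.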
S2S mode cannot be combined with voicemail_detection.enabled: true.
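This restriction can be caught before deployment. The sketch below assumes that voicemail_detection is a nested object with an enabled flag in the same agent config, as the dotted path suggests; check_voicemail_conflict is a hypothetical name, not part of any documented API.

```python
from typing import Any, Dict


def check_voicemail_conflict(config: Dict[str, Any]) -> None:
    """Hypothetical helper: reject configs that enable voicemail detection in S2S mode."""
    s2s_enabled = config.get("pipeline_mode") == "s2s"
    # Assumed layout: {"voicemail_detection": {"enabled": true}}
    voicemail = config.get("voicemail_detection") or {}
    if s2s_enabled and voicemail.get("enabled") is True:
        raise ValueError(
            "pipeline_mode 's2s' cannot be combined with "
            "voicemail_detection.enabled: true"
        )
```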