Speech-to-Speech (S2S) mode skips the separate STT and TTS stages: the model consumes and produces audio natively, which keeps latency very low (~300 ms).

Configuration

Set pipeline_mode: "s2s" in the agent config. The stt, llm, and tts fields are ignored in S2S mode.
{
  "pipeline_mode": "s2s",
  "s2s": {
    "provider": "openai",
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }
}

Options

Field           Required  Description
provider        Yes       openai or google
model           No        Model name (defaults per provider)
voice           No        Voice name (default: alloy for openai, Charon for google)
turn_detection  No        server_vad (default) or pipecat_vad
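
The options above can be checked before an agent starts. The following is a minimal sketch, not part of the product: normalize_s2s_config is a hypothetical helper name, and the per-provider voice defaults assume that alloy belongs to openai and Charon to google. The model field is left untouched because its per-provider defaults are not listed here.

```python
# Hypothetical helper; allowed values and defaults mirror the Options table.
ALLOWED_PROVIDERS = {"openai", "google"}
ALLOWED_TURN_DETECTION = {"server_vad", "pipecat_vad"}
# Assumption: alloy is the openai default voice, Charon the google default.
DEFAULT_VOICES = {"openai": "alloy", "google": "Charon"}


def normalize_s2s_config(agent_config: dict) -> dict:
    """Return a copy of the s2s block with documented defaults filled in.

    Raises ValueError when a documented constraint is violated.
    """
    if agent_config.get("pipeline_mode") != "s2s":
        raise ValueError('pipeline_mode must be "s2s"')

    s2s = dict(agent_config.get("s2s") or {})

    # provider is the only required field.
    provider = s2s.get("provider")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"provider must be one of {sorted(ALLOWED_PROVIDERS)}")

    # Optional fields fall back to the documented defaults.
    s2s.setdefault("voice", DEFAULT_VOICES[provider])
    s2s.setdefault("turn_detection", "server_vad")

    if s2s["turn_detection"] not in ALLOWED_TURN_DETECTION:
        raise ValueError(
            f"turn_detection must be one of {sorted(ALLOWED_TURN_DETECTION)}"
        )
    return s2s
```

Calling normalize_s2s_config({"pipeline_mode": "s2s", "s2s": {"provider": "google"}}) fills in voice and turn_detection and leaves model unset, so the provider default applies.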

Pipeline

transport.input → user_agg (VAD)
  → S2S_LLM (OpenAI Realtime / Gemini Live WebSocket)
  → transport.output
  → context_aggregator.assistant → observability

Required API Keys

Provider         Environment Variable
OpenAI Realtime  OPENAI_API_KEY
Gemini Live      GOOGLE_API_KEY

S2S mode cannot be combined with voicemail_detection.enabled: true.
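
Both requirements (the provider's API key and the voicemail detection restriction) can be verified up front. This is a rough sketch under stated assumptions: preflight_errors is a hypothetical name, and voicemail_detection is assumed to be a nested object with an enabled flag, as the dotted name suggests.

```python
import os

# Environment variable required by each S2S provider (from the table above).
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "google": "GOOGLE_API_KEY"}


def preflight_errors(agent_config: dict, env=None) -> list:
    """Collect human-readable problems; an empty list means the config passes."""
    env = os.environ if env is None else env
    errors = []

    provider = (agent_config.get("s2s") or {}).get("provider")
    key = REQUIRED_KEYS.get(provider)
    if key is None:
        errors.append(f"unknown S2S provider: {provider!r}")
    elif not env.get(key):
        errors.append(f"{key} is not set")

    # Assumed shape: {"voicemail_detection": {"enabled": true}}.
    voicemail = agent_config.get("voicemail_detection") or {}
    if agent_config.get("pipeline_mode") == "s2s" and voicemail.get("enabled") is True:
        errors.append(
            "S2S mode cannot be combined with voicemail_detection.enabled: true"
        )
    return errors
```

Passing env explicitly (for example env={}) makes the check easy to exercise without touching the real process environment.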