WebSocket Protocol

The WebSocket API provides real-time bidirectional communication for live conversations. Clients connect, authenticate, and exchange messages for conversation management, user input, and AI responses.

Connection

Connect to the WebSocket endpoint:

ws://localhost:3000/ws

Authentication

After connecting, the client must authenticate with an API key:

Client → Server:

```json
{
  "requestId": "req-1",
  "type": "auth",
  "apiKey": "your-project-api-key",
  "sessionSettings": {
    "sendVoiceInput": true,
    "sendTextInput": true,
    "receiveVoiceOutput": true,
    "receiveTranscriptionUpdates": true,
    "receiveEvents": false
  }
}
```

Server → Client:

```json
{
  "requestId": "req-1",
  "type": "auth",
  "sessionId": "session-uuid",
  "projectSettings": {
    "projectId": "my-project",
    "acceptVoice": true,
    "generateVoice": true,
    "asrConfig": { ... }
  }
}
```

Session Settings

| Setting | Default | Description |
| --- | --- | --- |
| `sendVoiceInput` | `true` | Client will send voice audio |
| `sendTextInput` | `true` | Client will send text messages |
| `receiveVoiceOutput` | `true` | Client wants TTS audio chunks |
| `receiveTranscriptionUpdates` | `true` | Client wants intermediate transcription chunks |
| `receiveEvents` | `true` | Client wants raw conversation event broadcasts |
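The defaults in the table can be applied on the client before sending `auth`, so that the message always states every flag explicitly. The following is a minimal sketch; `buildAuthMessage` and `DEFAULT_SESSION_SETTINGS` are hypothetical names introduced for illustration and are not part of the API.

```typescript
// Session settings as documented in the table; every flag defaults to true.
interface SessionSettings {
  sendVoiceInput: boolean;
  sendTextInput: boolean;
  receiveVoiceOutput: boolean;
  receiveTranscriptionUpdates: boolean;
  receiveEvents: boolean;
}

const DEFAULT_SESSION_SETTINGS: SessionSettings = {
  sendVoiceInput: true,
  sendTextInput: true,
  receiveVoiceOutput: true,
  receiveTranscriptionUpdates: true,
  receiveEvents: true,
};

interface AuthMessage {
  requestId: string;
  type: "auth";
  apiKey: string;
  sessionSettings: SessionSettings;
}

// Hypothetical helper: merges caller overrides over the documented defaults.
function buildAuthMessage(
  requestId: string,
  apiKey: string,
  overrides: Partial<SessionSettings> = {},
): AuthMessage {
  return {
    requestId,
    type: "auth",
    apiKey,
    sessionSettings: { ...DEFAULT_SESSION_SETTINGS, ...overrides },
  };
}

// Same message as the authentication example: only receiveEvents is turned off.
const auth = buildAuthMessage("req-1", "your-project-api-key", { receiveEvents: false });
console.log(JSON.stringify(auth, null, 2));
```

Serializing `auth` with `JSON.stringify` yields a payload of the same shape as the client message in the Authentication section.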

Message Format

Requests and responses share a common envelope. A response echoes the `requestId` of the request it answers:

Client → Server:

```json
{
  "requestId": "unique-id",
  "type": "message_type",
  "sessionId": "session-id",
  ...payload
}
```

Server → Client:

```json
{
  "requestId": "original-request-id",
  "type": "message_type",
  "sessionId": "session-id",
  ...payload
}
```
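Because the response carries the original `requestId`, a client can match replies to outstanding requests. The sketch below shows one way to do that; `RequestTracker` and its `transmit` callback are hypothetical, and the `req-N` id scheme is only an assumption modeled on the examples in this document.

```typescript
// Shared envelope fields; any other keys are message-specific payload.
interface Envelope {
  requestId?: string;
  type: string;
  sessionId?: string;
  [key: string]: unknown;
}

type ResponseHandler = (response: Envelope) => void;

// Hypothetical helper that pairs each response with the request that caused it.
class RequestTracker {
  private nextId = 1;
  private pending = new Map<string, ResponseHandler>();

  send(
    transmit: (json: string) => void,
    type: string,
    sessionId: string,
    payload: { [key: string]: unknown },
    onResponse: ResponseHandler,
  ): string {
    const requestId = "req-" + this.nextId++;
    this.pending.set(requestId, onResponse);
    // Envelope fields go last so a payload key cannot overwrite them.
    transmit(JSON.stringify({ ...payload, requestId, type, sessionId }));
    return requestId;
  }

  // Returns true when the message answered a pending request.
  handleIncoming(raw: string): boolean {
    const message = JSON.parse(raw) as Envelope;
    if (message.requestId === undefined) return false;
    const handler = this.pending.get(message.requestId);
    if (handler === undefined) return false;
    this.pending.delete(message.requestId);
    handler(message);
    return true;
  }
}

// Usage with an in-memory transport standing in for socket.send().
const outbox: string[] = [];
const replies: Envelope[] = [];
const tracker = new RequestTracker();
const trackedId = tracker.send(
  (json) => { outbox.push(json); },
  "end_conversation",
  "session-uuid",
  { conversationId: "conv-uuid" },
  (response) => { replies.push(response); },
);
// Simulate the server echoing the requestId back.
tracker.handleIncoming(
  JSON.stringify({ requestId: trackedId, type: "end_conversation", sessionId: "session-uuid" }),
);
```

Messages that carry no `requestId` make `handleIncoming` return `false`, so the caller can route them to a separate handler.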

Conversation Lifecycle

Start Conversation

Client → Server:

```json
{
  "requestId": "req-2",
  "type": "start_conversation",
  "sessionId": "session-uuid",
  "userId": "user-123",
  "stageId": "greeting",
  "timezone": "America/New_York"
}
```

| Field | Required | Description |
| --- | --- | --- |
| `userId` | Yes | User initiating the conversation |
| `stageId` | Yes | Stage to start at |
| `agentId` | No | Override the default agent |
| `timezone` | No | IANA timezone identifier (e.g. `America/New_York`). Takes highest precedence in the timezone resolution chain: `start_conversation.timezone` → `userProfile.timezone` → `project.timezone` → UTC. Persisted for the lifetime of the conversation so resume works correctly. |

Server → Client:

```json
{
  "requestId": "req-2",
  "type": "start_conversation",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid"
}
```
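The timezone resolution chain for `start_conversation` can be read as "first value that is set wins". The sketch below illustrates that precedence and adds a client-side check of the IANA identifier before sending. Both function names are hypothetical, and treating "unset" as `undefined` is an assumption; the server's actual implementation is not shown in this document.

```typescript
// Mirrors the documented precedence: request → user profile → project → UTC.
function resolveTimezone(
  requestTimezone?: string,
  userProfileTimezone?: string,
  projectTimezone?: string,
): string {
  return requestTimezone ?? userProfileTimezone ?? projectTimezone ?? "UTC";
}

// Client-side sanity check: Intl rejects unknown IANA identifiers with a RangeError.
function isValidTimezone(timezone: string): boolean {
  try {
    new Intl.DateTimeFormat("en-US", { timeZone: timezone });
    return true;
  } catch {
    return false;
  }
}

console.log(resolveTimezone("America/New_York", "Europe/Berlin", "Asia/Tokyo"));
console.log(resolveTimezone(undefined, undefined, "Asia/Tokyo"));
console.log(isValidTimezone("America/New_York"));
```

Validating the identifier on the client is optional; it only avoids sending a value the runtime already knows to be malformed.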

Resume Conversation

Client → Server:

```json
{
  "requestId": "req-3",
  "type": "resume_conversation",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid"
}
```

End Conversation

Client → Server:

```json
{
  "requestId": "req-4",
  "type": "end_conversation",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid"
}
```

User Input

Text Input

Client → Server:

```json
{
  "requestId": "req-5",
  "type": "send_user_text_input",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "text": "Hello, I need help with my order"
}
```

Server → Client:

```json
{
  "requestId": "req-5",
  "type": "send_user_text_input",
  "sessionId": "session-uuid",
  "inputTurnId": "turn-uuid"
}
```

Voice Input

Voice input uses a three-step streaming protocol:

1. Start voice stream:

```json
{
  "requestId": "req-6",
  "type": "start_user_voice_input",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid"
}
```

2. Send audio chunks (repeated):

```json
{
  "type": "send_user_voice_chunk",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "inputTurnId": "turn-uuid",
  "audioData": "<base64-encoded-audio>",
  "ordinal": 1
}
```

3. End voice stream:

```json
{
  "type": "end_user_voice_input",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "inputTurnId": "turn-uuid"
}
```
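Steps 2 and 3 of the voice protocol can be generated from a list of already base64-encoded audio chunks. The sketch below rests on two assumptions that this document does not state: that `inputTurnId` is returned in the server's reply to `start_user_voice_input`, and that ordinals start at 1 and increase by 1. `buildVoiceStream` is a hypothetical helper.

```typescript
interface VoiceChunkMessage {
  type: "send_user_voice_chunk";
  sessionId: string;
  conversationId: string;
  inputTurnId: string;
  audioData: string; // base64-encoded audio
  ordinal: number;
}

interface VoiceEndMessage {
  type: "end_user_voice_input";
  sessionId: string;
  conversationId: string;
  inputTurnId: string;
}

// Builds steps 2 and 3 for one voice turn. Step 1 is sent separately because
// inputTurnId is assumed to come back in the server's reply to it.
function buildVoiceStream(
  sessionId: string,
  conversationId: string,
  inputTurnId: string,
  base64Chunks: string[],
): Array<VoiceChunkMessage | VoiceEndMessage> {
  const chunks = base64Chunks.map(
    (audioData, index): VoiceChunkMessage => ({
      type: "send_user_voice_chunk",
      sessionId,
      conversationId,
      inputTurnId,
      audioData,
      ordinal: index + 1, // assumption: ordinals start at 1 and increase by 1
    }),
  );
  const end: VoiceEndMessage = {
    type: "end_user_voice_input",
    sessionId,
    conversationId,
    inputTurnId,
  };
  return [...chunks, end];
}

// Two placeholder chunks followed by the end-of-stream marker.
const voiceStream = buildVoiceStream("session-uuid", "conv-uuid", "turn-uuid", ["AAAA", "BBBB"]);
for (const message of voiceStream) {
  console.log(JSON.stringify(message));
}
```

In a live client each message would be sent as soon as its audio chunk is captured rather than built up front; the array form is used here only to keep the example self-contained.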

ASR Transcription Updates

When `receiveTranscriptionUpdates` is enabled, the server sends interim and final transcriptions:

```json
{
  "type": "user_transcribed_chunk",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "inputTurnId": "turn-uuid",
  "chunkId": "chunk-uuid",
  "chunkText": "Hello, I need",
  "ordinal": 1,
  "isFinal": false
}
```

AI Responses

AI responses are streamed as a sequence of messages:

1. Start Generation

```json
{
  "type": "start_ai_generation_output",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "outputTurnId": "turn-uuid",
  "expectVoice": true
}
```

2. Text Chunks (streamed)

```json
{
  "type": "ai_transcribed_chunk",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "outputTurnId": "turn-uuid",
  "chunkId": "chunk-uuid",
  "chunkText": "I'd be happy to help",
  "ordinal": 1,
  "isFinal": false
}
```

3. Voice Chunks (streamed, if voice enabled)

```json
{
  "type": "send_ai_voice_chunk",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "outputTurnId": "turn-uuid",
  "audioData": "<base64-encoded-audio>",
  "audioFormat": "mp3",
  "sampleRate": 44100,
  "ordinal": 1,
  "isFinal": false
}
```

4. End Generation

```json
{
  "type": "end_ai_generation_output",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "outputTurnId": "turn-uuid",
  "fullText": "I'd be happy to help you with your order. Can you give me your order number?"
}
```
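On the client, the four message types can be folded into a single per-turn state. The sketch below types only the fields it reads and makes one assumption worth flagging: text chunks are treated as incremental pieces that are concatenated in `ordinal` order for a live preview. Once `end_ai_generation_output` arrives, its `fullText` replaces the preview. All names other than the message types and fields are hypothetical.

```typescript
type AiMessage =
  | { type: "start_ai_generation_output"; outputTurnId: string; expectVoice: boolean }
  | { type: "ai_transcribed_chunk"; outputTurnId: string; chunkText: string; ordinal: number }
  | { type: "send_ai_voice_chunk"; outputTurnId: string; audioData: string; ordinal: number }
  | { type: "end_ai_generation_output"; outputTurnId: string; fullText: string };

interface OrderedChunk {
  ordinal: number;
  data: string;
}

interface AiTurn {
  expectVoice: boolean;
  text: OrderedChunk[];
  audio: OrderedChunk[];
  fullText: string | null; // set once end_ai_generation_output arrives
}

function newTurn(): AiTurn {
  return { expectVoice: false, text: [], audio: [], fullText: null };
}

// Returns a new array kept sorted by ordinal, so late chunks land in place.
function insertByOrdinal(chunks: OrderedChunk[], chunk: OrderedChunk): OrderedChunk[] {
  return chunks.concat([chunk]).sort((a, b) => a.ordinal - b.ordinal);
}

// Folds one server message into the turn state and returns the updated state.
function applyAiMessage(turn: AiTurn, message: AiMessage): AiTurn {
  switch (message.type) {
    case "start_ai_generation_output":
      return { ...newTurn(), expectVoice: message.expectVoice };
    case "ai_transcribed_chunk":
      return { ...turn, text: insertByOrdinal(turn.text, { ordinal: message.ordinal, data: message.chunkText }) };
    case "send_ai_voice_chunk":
      return { ...turn, audio: insertByOrdinal(turn.audio, { ordinal: message.ordinal, data: message.audioData }) };
    case "end_ai_generation_output":
      return { ...turn, fullText: message.fullText };
  }
}

// Assumption: chunks are deltas; fullText is authoritative once available.
function displayText(turn: AiTurn): string {
  return turn.fullText ?? turn.text.map((chunk) => chunk.data).join("");
}

// Chunks arriving out of order are still shown in ordinal order.
const incoming: AiMessage[] = [
  { type: "start_ai_generation_output", outputTurnId: "turn-uuid", expectVoice: true },
  { type: "ai_transcribed_chunk", outputTurnId: "turn-uuid", chunkText: " to help", ordinal: 2 },
  { type: "ai_transcribed_chunk", outputTurnId: "turn-uuid", chunkText: "I'd be happy", ordinal: 1 },
];
const partialTurn = incoming.reduce(applyAiMessage, newTurn());
console.log(displayText(partialTurn)); // prints "I'd be happy to help"
```

A client handling several conversations at once would keep one `AiTurn` per `outputTurnId`.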

Image and Audio Outputs

Multimodal responses deliver non-text output in dedicated messages, for example an image:

```json
{
  "type": "send_ai_image_output",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "outputTurnId": "turn-uuid",
  "imageData": "<base64-encoded-image>",
  "mimeType": "image/png",
  "sequenceNumber": 1
}
```
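Since `imageData` is base64 and `mimeType` is supplied, an image message can be turned into a `data:` URL for display. The sketch below also sorts by `sequenceNumber`, on the assumption that it orders images within a turn; the helper names are hypothetical.

```typescript
interface AiImageOutput {
  type: "send_ai_image_output";
  sessionId: string;
  conversationId: string;
  outputTurnId: string;
  imageData: string; // base64-encoded image bytes
  mimeType: string;
  sequenceNumber: number;
}

// Builds a data URL usable as an <img> src.
function toDataUrl(image: AiImageOutput): string {
  return "data:" + image.mimeType + ";base64," + image.imageData;
}

// Assumption: sequenceNumber defines display order within one output turn.
function imagesInOrder(images: AiImageOutput[]): string[] {
  return images
    .slice()
    .sort((a, b) => a.sequenceNumber - b.sequenceNumber)
    .map((image) => toDataUrl(image));
}

const sampleImage: AiImageOutput = {
  type: "send_ai_image_output",
  sessionId: "session-uuid",
  conversationId: "conv-uuid",
  outputTurnId: "turn-uuid",
  imageData: "iVBORw0KGgo=",
  mimeType: "image/png",
  sequenceNumber: 1,
};
console.log(toDataUrl(sampleImage)); // prints "data:image/png;base64,iVBORw0KGgo="
```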

Client Commands

Clients can send commands to control the conversation:

Go to Stage

```json
{
  "requestId": "req-7",
  "type": "go_to_stage",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "stageId": "troubleshooting"
}
```

Set Variable

```json
{
  "requestId": "req-8",
  "type": "set_var",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "stageId": "current-stage",
  "variableName": "selectedProduct",
  "variableValue": "Widget Pro"
}
```

Get Variable

```json
{
  "requestId": "req-9",
  "type": "get_var",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "stageId": "current-stage",
  "variableName": "selectedProduct"
}
```

Get All Variables

```json
{
  "requestId": "req-10",
  "type": "get_all_vars",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "stageId": "current-stage"
}
```

Run Action

```json
{
  "requestId": "req-11",
  "type": "run_action",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "actionName": "check-order-status",
  "parameters": { "orderId": "ORD-123" }
}
```

Call Tool

```json
{
  "requestId": "req-12",
  "type": "call_tool",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "toolId": "translate",
  "parameters": { "text": "Hello", "language": "es" }
}
```
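The six commands differ only in their `type` and payload fields, which makes them a natural fit for a discriminated union. The sketch below is one possible client-side typing; `ClientCommand`, `CommandContext`, and `buildCommand` are hypothetical names, and `variableValue` is typed as a string only because the example shows a string.

```typescript
// One variant per documented command; field names follow the examples.
type ClientCommand =
  | { type: "go_to_stage"; stageId: string }
  | { type: "set_var"; stageId: string; variableName: string; variableValue: string }
  | { type: "get_var"; stageId: string; variableName: string }
  | { type: "get_all_vars"; stageId: string }
  | { type: "run_action"; actionName: string; parameters: { [key: string]: unknown } }
  | { type: "call_tool"; toolId: string; parameters: { [key: string]: unknown } };

interface CommandContext {
  sessionId: string;
  conversationId: string;
}

// Wraps a command in the shared envelope fields.
function buildCommand(requestId: string, context: CommandContext, command: ClientCommand) {
  return {
    requestId,
    sessionId: context.sessionId,
    conversationId: context.conversationId,
    ...command,
  };
}

const commandContext: CommandContext = { sessionId: "session-uuid", conversationId: "conv-uuid" };

const goToStage = buildCommand("req-7", commandContext, {
  type: "go_to_stage",
  stageId: "troubleshooting",
});
const callTool = buildCommand("req-12", commandContext, {
  type: "call_tool",
  toolId: "translate",
  parameters: { text: "Hello", language: "es" },
});
console.log(JSON.stringify(goToStage));
console.log(JSON.stringify(callTool));
```

With this typing, the compiler rejects a command whose payload does not match its `type`, such as a `get_var` without `variableName`.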

Event Broadcasting

When `receiveEvents` is enabled in session settings, the server broadcasts conversation events:

```json
{
  "type": "conversation_event",
  "sessionId": "session-uuid",
  "conversationId": "conv-uuid",
  "eventType": "classification",
  "eventData": {
    "classifierId": "default-classifier",
    "input": "I need help with returns",
    "actions": [{ "name": "handle_return", "parameters": {} }]
  }
}
```
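A client that subscribes to events will typically branch on `eventType` before reading `eventData`. The sketch below handles only the `classification` event and takes its payload shape from the example above; other event types and their payloads are not covered here, and `classificationActions` is a hypothetical helper.

```typescript
interface ConversationEvent {
  type: "conversation_event";
  sessionId: string;
  conversationId: string;
  eventType: string;
  eventData: unknown; // shape depends on eventType
}

// Returns the action names of a classification event, or [] for anything else.
function classificationActions(event: ConversationEvent): string[] {
  if (event.eventType !== "classification") return [];
  if (typeof event.eventData !== "object" || event.eventData === null) return [];
  const data = event.eventData as { actions?: unknown };
  if (!Array.isArray(data.actions)) return [];
  return data.actions.map((action: { name: string }) => action.name);
}

const classificationEvent: ConversationEvent = {
  type: "conversation_event",
  sessionId: "session-uuid",
  conversationId: "conv-uuid",
  eventType: "classification",
  eventData: {
    classifierId: "default-classifier",
    input: "I need help with returns",
    actions: [{ name: "handle_return", parameters: {} }],
  },
};
console.log(classificationActions(classificationEvent)); // prints [ 'handle_return' ]
```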

See Conversations for all event types.

Error Handling

Errors are returned with a standard structure:

```json
{
  "requestId": "req-5",
  "type": "error",
  "sessionId": "session-uuid",
  "error": {
    "code": "INVALID_STATE",
    "message": "Cannot send input while generating response"
  }
}
```
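Because errors use `"type": "error"` with a nested `error` object, a client can separate them from regular messages with a type guard right after parsing. The sketch below is one way to do that; the names are hypothetical, and `requestId` and `sessionId` are marked optional as a defensive assumption rather than documented behavior.

```typescript
interface ErrorMessage {
  requestId?: string;
  type: "error";
  sessionId?: string;
  error: { code: string; message: string };
}

// Checks only the fields that identify an error message.
function isErrorMessage(value: unknown): value is ErrorMessage {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { type?: unknown; error?: unknown };
  if (candidate.type !== "error") return false;
  const detail = candidate.error as { code?: unknown; message?: unknown } | null | undefined;
  return (
    typeof detail === "object" &&
    detail !== null &&
    typeof detail.code === "string" &&
    typeof detail.message === "string"
  );
}

type ParsedMessage =
  | { ok: true; message: { [key: string]: unknown } }
  | { ok: false; error: ErrorMessage };

// Parses one incoming frame and separates errors from regular messages.
function parseServerMessage(raw: string): ParsedMessage {
  const value: unknown = JSON.parse(raw);
  if (isErrorMessage(value)) return { ok: false, error: value };
  return { ok: true, message: value as { [key: string]: unknown } };
}

const parsedError = parseServerMessage(
  JSON.stringify({
    requestId: "req-5",
    type: "error",
    sessionId: "session-uuid",
    error: { code: "INVALID_STATE", message: "Cannot send input while generating response" },
  }),
);
if (!parsedError.ok) {
  console.log("[" + parsedError.error.error.code + "] " + parsedError.error.error.message);
}
```

Returning a tagged result instead of throwing lets the caller decide whether an error such as `INVALID_STATE` should be retried or surfaced to the user.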
