# Vela Protocol and State Machine

## Event Protocol

The shared code-level contract lives in the Yarn workspace package `@vela/protocol`, so the gateway and the UI import the same event names and envelope shape.

Current gateway baseline:

- WebSocket endpoint: `/ws`
- the gateway sends `session.ready` and `session.state` immediately after a successful socket upgrade
- the gateway accepts only JSON text messages in the shared envelope shape

Current UI baseline:

- the browser opens a WebSocket directly to `/ws`
- the UI tracks connection status separately from gateway session status
- the UI currently consumes server events but does not yet send `session.start` or any audio events

## WebSocket Message Envelope

Every WebSocket message uses one envelope format:

```ts
// Generic over the event name and the event body.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};
```

This increment intentionally keeps the envelope minimal:

- `type` identifies the event
- `payload` carries the event body
- no sequence numbers, timestamps, or protocol version fields yet
- future changes should be additive when possible

### Client → Server

```ts
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };
```

#### Client event intent

- `session.start` initializes a voice session without locking in transport or auth details yet
- `input_audio.append` carries a chunk of captured input audio as an encoded string
- `input_audio.commit` marks the current buffered user turn as ready for downstream processing
- `response.cancel` interrupts the active listen/think/speak flow

### Current skeleton behavior

- on connect, the gateway creates an ephemeral in-memory session and emits `session.ready` plus `session.state`
- `session.start` is accepted as an idempotent session acknowledgment and re-sends readiness/state
- `input_audio.append` updates the ephemeral session record and moves the
  session to `listening`
- `input_audio.commit` resets the minimal buffered state and returns the session to `idle`
- `response.cancel` resets the minimal session state back to `idle`
- malformed JSON produces `error` with code `invalid_json`
- invalid envelopes or unsupported client event names produce `error` with code `invalid_message`
- malformed WebSocket frames are rejected without crashing the gateway process

### UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

```text
not connected → connecting → connected → disconnected → error
```

Notes:

- this UI state is transport-oriented and is separate from the shared gateway `session.state` payload
- `session.state` currently reflects the gateway session phase (`idle`, `listening`, `thinking`, `speaking`)
- the UI treats malformed server messages, browser WebSocket errors, and gateway `error` events as safe error states instead of throwing

### Server → Client

```ts
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };
```

#### Server event intent

- `session.ready` confirms that the gateway created a session identity
- `session.state` exposes the coarse session phase needed by the later UI shell
- `transcript.partial` and `transcript.final` support incremental and completed user text display
- `response.text.delta` supports streamed assistant text without committing to audio output details yet
- `response.completed` marks the current assistant turn as done
- `error` is the minimal recoverable failure shape for both
  UI and gateway work

## Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

- the later gateway WebSocket session skeleton
- the later UI voice-session shell

Explicitly deferred for later increments:

- tool-calling events
- streamed TTS/output-audio events
- reconnect/resume semantics
- protocol version negotiation
- provider-specific metadata fields

## State Machine

```text
idle → listening → thinking → speaking → idle
```

`response.cancel` can occur at:

- listening → restart
- thinking → cancel
- speaking → stop immediately

## `response.cancel` Handling

Requirements:

- immediate stop of TTS playback
- immediate stop of LLM streaming
- reset session state to listening or idle, depending on UX decision

### Mechanism

The `response.cancel` event cancels:

- TTS process
- current LLM request
- tool execution when possible

This shared contract uses `response.cancel` consistently for that cancellation signal.

## Protocol Notes for Implementation

- keep the protocol backward compatible when possible
- prefer additive event changes over breaking renames
- document protocol updates in this file whenever implementation changes behavior
- when implementation diverges from the initial contract, update this document in the same change
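
### Illustrative sketch: validating and applying client events

To make the contract concrete, here is a minimal sketch of how a gateway might validate an incoming text message and apply the skeleton behavior described in this document. It is not the actual `@vela/protocol` or gateway code. The helper names (`parseClientMessage`, `nextPhase`, `GatewayError`, `ParseResult`) and the error message strings are assumptions made for illustration; only the event names, payload shapes, error codes, and phase values come from this document.

```typescript
// Hypothetical helpers; only event names, payload shapes, error codes,
// and phase values are taken from the contract in this document.
type SessionPhase = "idle" | "listening" | "thinking" | "speaking";

type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

type GatewayError = {
  type: "error";
  payload: { code: string; message: string; retryable?: boolean };
};

type ParseResult =
  | { ok: true; event: ClientEvent }
  | { ok: false; error: GatewayError };

function okResult(event: ClientEvent): ParseResult {
  return { ok: true, event };
}

function failResult(code: string, message: string): ParseResult {
  return { ok: false, error: { type: "error", payload: { code, message } } };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Turns a raw text frame into a typed client event or an `error` event.
function parseClientMessage(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return failResult("invalid_json", "Message is not valid JSON");
  }
  if (!isRecord(data)) {
    return failResult("invalid_message", "Message is not an envelope object");
  }
  const eventType = data["type"];
  const payload = data["payload"];
  if (typeof eventType !== "string" || !isRecord(payload)) {
    return failResult("invalid_message", "Envelope needs `type` and `payload`");
  }
  switch (eventType) {
    case "session.start":
      return okResult({ type: "session.start", payload: {} });
    case "input_audio.commit":
      return okResult({ type: "input_audio.commit", payload: {} });
    case "response.cancel":
      return okResult({ type: "response.cancel", payload: {} });
    case "input_audio.append": {
      const chunk = payload["chunk"];
      if (typeof chunk !== "string") {
        return failResult("invalid_message", "`chunk` must be a string");
      }
      return okResult({ type: "input_audio.append", payload: { chunk } });
    }
    default:
      return failResult("invalid_message", "Unsupported client event name");
  }
}

// Phase transitions as listed under "Current skeleton behavior".
function nextPhase(current: SessionPhase, event: ClientEvent): SessionPhase {
  switch (event.type) {
    case "session.start":
      return current; // idempotent acknowledgment, phase unchanged
    case "input_audio.append":
      return "listening";
    case "input_audio.commit":
      return "idle";
    case "response.cancel":
      return "idle";
  }
}
```

Returning a result object instead of throwing mirrors the requirement that malformed input produces an `error` event without crashing the gateway process. A caller would send `result.error` back over the socket when `result.ok` is false, and otherwise pass `result.event` to `nextPhase` and emit the resulting `session.state`.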