Vela Protocol and State Machine

Event Protocol

The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the gateway and UI import the same event names and envelope shape.

Current gateway baseline:

  • WebSocket endpoint: /ws
  • the gateway sends session.ready and session.state immediately after a successful socket upgrade
  • the gateway accepts only JSON text messages that use the shared envelope shape

Current UI baseline:

  • the browser opens a WebSocket directly to /ws
  • the UI tracks connection status separately from gateway session status
  • once connected and after session.ready has arrived, the UI can send mocked.turn.trigger to request one deterministic mocked turn for the active session

WebSocket Message Envelope

Every WebSocket message uses one envelope format:

type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

This increment intentionally keeps the envelope minimal:

  • type identifies the event
  • payload carries the event body
  • no sequence numbers, timestamps, or protocol version fields yet
  • future changes should be additive when possible
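
As a rough illustration of how small the envelope is, the following sketch builds one and checks the shape of an incoming value. The helper names encodeEnvelope and isEnvelope are hypothetical and are not claimed to exist in @vela/protocol; rejecting array payloads is also an assumption.

```typescript
// Envelope shape copied from the contract above.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper: builds an envelope and serializes it as JSON text.
function encodeEnvelope<TType extends string, TPayload>(
  type: TType,
  payload: TPayload,
): string {
  const envelope: MessageEnvelope<TType, TPayload> = { type, payload };
  return JSON.stringify(envelope);
}

// Hypothetical structural check: a string `type` plus a plain object `payload`.
// Treating arrays as invalid payloads is an assumption of this sketch.
function isEnvelope(value: unknown): value is MessageEnvelope<string, object> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { type?: unknown; payload?: unknown };
  return (
    typeof candidate.type === "string" &&
    typeof candidate.payload === "object" &&
    candidate.payload !== null &&
    !Array.isArray(candidate.payload)
  );
}
```

Because the envelope carries no sequence numbers, timestamps, or version fields, a check like this only has to look at two keys.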

Client → Server

type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

Client event intent

  • session.start initializes a voice session without locking in transport or auth details yet
  • mocked.turn.trigger asks the gateway to run one obviously mocked, deterministic transcript/response turn
  • input_audio.append carries a chunk of captured input audio as an encoded string
  • input_audio.commit marks the current buffered user turn as ready for downstream processing
  • response.cancel interrupts the active listen/think/speak flow
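
The client events above could be sent with a small typed helper, sketched below. TextSocket, sendEvent, and sendUserTurn are hypothetical names used only for illustration. The sketch assumes the caller has already encoded each audio chunk as a string, since this document does not specify the encoding.

```typescript
// ClientEvent union copied from the contract above.
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Minimal subset of a WebSocket: anything that can send a text frame.
interface TextSocket {
  send(data: string): void;
}

// Serializes one client event as a JSON text message.
function sendEvent(socket: TextSocket, event: ClientEvent): void {
  socket.send(JSON.stringify(event));
}

// Streams already-encoded audio chunks, then marks the user turn as ready.
function sendUserTurn(socket: TextSocket, encodedChunks: readonly string[]): void {
  for (const chunk of encodedChunks) {
    sendEvent(socket, { type: "input_audio.append", payload: { chunk } });
  }
  sendEvent(socket, { type: "input_audio.commit", payload: {} });
}
```

Typing the helper against ClientEvent means a misspelled event name or a missing chunk field fails at compile time rather than as an invalid_message error at runtime.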

Current skeleton behavior

  • on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
  • session.start is accepted as an idempotent session acknowledgment and re-sends readiness/state
  • mocked.turn.trigger is accepted only when no other mocked turn is already in flight for that session
  • a mocked turn emits deterministic transcript.final, response.text.delta, response.completed, and session.state events in protocol-valid order
  • input_audio.append updates the ephemeral session record and moves the session to listening
  • input_audio.commit resets the minimal buffered state and returns the session to idle
  • response.cancel resets the minimal session state back to idle
  • a second mocked-turn trigger during an active mocked turn produces error with code mocked_turn_in_flight
  • malformed JSON produces error with code invalid_json
  • invalid envelopes or unsupported client event names produce error with code invalid_message
  • malformed WebSocket frames are rejected without crashing the gateway process
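
The error cases above could be told apart roughly as follows. This is a minimal sketch, not the gateway's actual code: decodeClientMessage, its Decoded result type, and the error message strings are hypothetical, and the in-flight flag is passed in as a plain boolean for simplicity.

```typescript
// Error codes named in the list above.
type ErrorCode = "invalid_json" | "invalid_message" | "mocked_turn_in_flight";

// Client event names from the ClientEvent union.
const CLIENT_EVENT_TYPES: readonly string[] = [
  "session.start",
  "mocked.turn.trigger",
  "input_audio.append",
  "input_audio.commit",
  "response.cancel",
];

type Decoded =
  | { ok: true; type: string; payload: Record<string, unknown> }
  | { ok: false; code: ErrorCode; message: string };

function fail(code: ErrorCode, message: string): Decoded {
  return { ok: false, code, message };
}

// Classifies one raw text message; it never throws on bad input.
function decodeClientMessage(raw: string, mockedTurnInFlight: boolean): Decoded {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return fail("invalid_json", "Message is not valid JSON.");
  }
  if (typeof parsed !== "object" || parsed === null) {
    return fail("invalid_message", "Message is not an envelope object.");
  }
  const { type, payload } = parsed as { type?: unknown; payload?: unknown };
  if (typeof type !== "string" || CLIENT_EVENT_TYPES.indexOf(type) === -1) {
    return fail("invalid_message", "Unsupported or missing event type.");
  }
  if (typeof payload !== "object" || payload === null || Array.isArray(payload)) {
    return fail("invalid_message", "Payload must be an object.");
  }
  if (type === "mocked.turn.trigger" && mockedTurnInFlight) {
    return fail("mocked_turn_in_flight", "A mocked turn is already running.");
  }
  return { ok: true, type, payload: payload as Record<string, unknown> };
}
```

Returning a result value instead of throwing keeps bad input from crashing the handler, in line with the last bullet above.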

UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

not connected
 → connecting
 → connected
 → disconnected
 → error

Notes:

  • this UI state is transport-oriented and is separate from the shared gateway session.state payload
  • session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
  • the UI keeps the mocked-turn control disabled until session.ready arrives, and also while disconnected or while a mocked turn is already in flight
  • the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
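
One way to picture the notes above is a small reducer over transport inputs. This is a sketch under assumptions: the UiInput event names, the UiState shape, and the reduce and canTriggerMockedTurn helpers are hypothetical and are not taken from the UI code.

```typescript
type ConnectionStatus =
  | "not connected" | "connecting" | "connected" | "disconnected" | "error";

type UiState = {
  connection: ConnectionStatus;
  sessionReady: boolean;
  mockedTurnInFlight: boolean;
};

// Hypothetical names for browser-side inputs; not part of @vela/protocol.
type UiInput =
  | { kind: "connect" }
  | { kind: "open" }
  | { kind: "close" }
  | { kind: "socket_error" }
  | { kind: "trigger_sent" }
  | { kind: "message"; raw: string };

const initialState: UiState = {
  connection: "not connected",
  sessionReady: false,
  mockedTurnInFlight: false,
};

// Any change of transport status drops readiness and in-flight tracking.
function withConnection(connection: ConnectionStatus): UiState {
  return { connection, sessionReady: false, mockedTurnInFlight: false };
}

// Returns the envelope type, or null when the message is malformed.
function readType(raw: string): string | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return null;
    const type = (parsed as { type?: unknown }).type;
    return typeof type === "string" ? type : null;
  } catch {
    return null;
  }
}

function reduce(state: UiState, input: UiInput): UiState {
  switch (input.kind) {
    case "connect": return withConnection("connecting");
    case "open": return { ...state, connection: "connected" };
    case "close": return withConnection("disconnected");
    case "socket_error": return withConnection("error");
    case "trigger_sent": return { ...state, mockedTurnInFlight: true };
    case "message": {
      const type = readType(input.raw);
      // Malformed messages and gateway error events become a safe error state.
      if (type === null || type === "error") return withConnection("error");
      if (type === "session.ready") return { ...state, sessionReady: true };
      // Simplification: a fuller UI would also clear this on session.state idle.
      if (type === "response.completed") return { ...state, mockedTurnInFlight: false };
      return state;
    }
  }
}

function canTriggerMockedTurn(state: UiState): boolean {
  return state.connection === "connected" && state.sessionReady && !state.mockedTurnInFlight;
}
```

Keeping connection separate from the gateway's session.state payload mirrors the first note: the reducer never inspects the session phase to decide transport status.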

Server → Client

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

Server event intent

  • session.ready confirms that the gateway created a session identity
  • session.state exposes the coarse session phase needed by the later UI shell
  • transcript.partial and transcript.final support incremental and completed user text display
  • response.text.delta supports streamed assistant text without committing to audio output details yet
  • response.completed marks the current assistant turn as done
  • error is the minimal recoverable failure shape for both UI and gateway work
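
On the receiving side, the server events can be folded into a small view model. The following is a minimal sketch: TurnView, emptyView, and applyServerEvent are hypothetical names, and showing the latest partial or final transcript in a single field is an assumption about how a UI might present it.

```typescript
// ServerEvent union copied from the contract above.
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

type TurnView = {
  phase: "idle" | "listening" | "thinking" | "speaking";
  userText: string;
  assistantText: string;
  completed: boolean;
  lastErrorCode: string | null;
};

const emptyView: TurnView = {
  phase: "idle",
  userText: "",
  assistantText: "",
  completed: false,
  lastErrorCode: null,
};

// Pure update step: returns a new view for each incoming server event.
function applyServerEvent(view: TurnView, event: ServerEvent): TurnView {
  switch (event.type) {
    case "session.ready":
      return view; // the session identity is not part of this view
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
    case "transcript.final":
      return { ...view, userText: event.payload.text }; // latest text wins
    case "response.text.delta":
      return { ...view, assistantText: view.assistantText + event.payload.text };
    case "response.completed":
      return { ...view, completed: true };
    case "error":
      return { ...view, lastErrorCode: event.payload.code };
  }
}
```

Because the switch covers every member of the union, adding a new server event to the contract later would surface as a compile error here under strict settings, which fits the additive-change intent.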

Deterministic mocked turn sequence

For this increment, mocked.turn.trigger produces one fixed interaction for the active session:

session.state(listening)
→ transcript.final("[mocked user] What is the current mocked vertical slice?")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)

Notes:

  • the content is intentionally fixed and obviously mocked
  • no audio, STT, LLM, TTS, or external providers participate in this flow
  • response.cancel can stop the mocked turn early and return the session to idle
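
The fixed sequence and its early-cancel path can be sketched as a data table plus a small emitter. This is an illustration only: MockedEvent, MOCKED_TURN, and emitMockedTurn are hypothetical names, and the sketch runs synchronously, whereas a real gateway would presumably pace the events over time.

```typescript
// Subset of the server events that the mocked turn uses.
type MockedEvent =
  | { type: "session.state"; payload: { value: "idle" | "listening" | "thinking" | "speaking" } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

// The fixed sequence listed above, in protocol order.
const MOCKED_TURN: readonly MockedEvent[] = [
  { type: "session.state", payload: { value: "listening" } },
  { type: "transcript.final", payload: { text: "[mocked user] What is the current mocked vertical slice?" } },
  { type: "session.state", payload: { value: "thinking" } },
  { type: "session.state", payload: { value: "speaking" } },
  { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
  {
    type: "response.text.delta",
    payload: { text: "This is a deterministic mocked response from the gateway vertical slice." },
  },
  { type: "response.completed", payload: {} },
  { type: "session.state", payload: { value: "idle" } },
];

// Emits events in order; if cancellation is observed before an event,
// it emits session.state(idle) and stops instead.
function emitMockedTurn(
  send: (event: MockedEvent) => void,
  isCancelled: () => boolean,
): void {
  for (const event of MOCKED_TURN) {
    if (isCancelled()) {
      send({ type: "session.state", payload: { value: "idle" } });
      return;
    }
    send(event);
  }
}
```

Keeping the sequence as data makes the "deterministic and obviously mocked" property easy to assert in tests on both the gateway and UI sides.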

Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

  • the later gateway WebSocket session skeleton
  • the later UI voice-session shell

Explicitly deferred for later increments:

  • freeform typed user input
  • tool-calling events
  • streamed TTS/output-audio events
  • reconnect/resume semantics
  • protocol version negotiation
  • provider-specific metadata fields

State Machine

idle
 → listening
 → thinking
 → speaking
 → idle

response.cancel can arrive during any active phase, with a different effect in each:

  • listening → restart
  • thinking → cancel
  • speaking → stop immediately

response.cancel Handling Requirements

  • immediate stop of TTS playback
  • immediate stop of LLM streaming
  • reset session state to listening or idle, depending on the UX decision (the current skeleton resets to idle)
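
The phase cycle and the cancel reset can be written as a transition table. This is a minimal sketch: NEXT_PHASE, advance, and cancel are hypothetical names, and the cancelTarget parameter is an assumption that stands in for the open UX decision, defaulting to idle as the current skeleton does.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Normal forward progression: idle → listening → thinking → speaking → idle.
const NEXT_PHASE: Record<Phase, Phase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

function advance(phase: Phase): Phase {
  return NEXT_PHASE[phase];
}

// response.cancel: any active phase resets to the configured target.
// Cancelling while already idle is treated as a no-op (an assumption).
function cancel(phase: Phase, cancelTarget: "idle" | "listening" = "idle"): Phase {
  return phase === "idle" ? "idle" : cancelTarget;
}
```

For example, cancel("speaking") yields "idle", while cancel("thinking", "listening") models the alternative where the session goes straight back to listening.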

Mechanism

The response.cancel event cancels:

  • TTS process
  • current LLM request
  • tool execution when possible

The shared contract uses response.cancel as the single cancellation signal for all of these.

Protocol Notes for Implementation

  • keep the protocol backward compatible when possible
  • prefer additive event changes over breaking renames
  • document protocol updates in this file in the same change that alters implementation behavior or diverges from the initial contract