Vela Protocol and State Machine

Event Protocol

The shared code-level contract lives in the Yarn workspace package @vela/protocol, so that both the gateway and the UI import the same event names and envelope shape.

WebSocket Message Envelope

Every WebSocket message uses one envelope format:

type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

This increment intentionally keeps the envelope minimal:

  • type identifies the event
  • payload carries the event body
  • no sequence numbers, timestamps, or protocol version fields yet
  • future changes should be additive when possible
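To make the envelope shape concrete, here is a minimal sketch of encoding and validating a frame. It assumes messages travel as JSON text frames, which this document does not specify; parseEnvelope and encodeEnvelope are hypothetical helper names, not exports of @vela/protocol.

```typescript
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Assumption: frames are JSON text. Returns null for anything that is not
// an object with a string `type` and a `payload` field.
function parseEnvelope(raw: string): MessageEnvelope<string, unknown> | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== "object" || value === null) return null;
  const candidate = value as { type?: unknown; payload?: unknown };
  if (typeof candidate.type !== "string") return null;
  if (!("payload" in candidate)) return null;
  return { type: candidate.type, payload: candidate.payload };
}

// Serializes an envelope with exactly the two fields the contract defines.
function encodeEnvelope<TType extends string, TPayload>(
  type: TType,
  payload: TPayload,
): string {
  const envelope: MessageEnvelope<TType, TPayload> = { type, payload };
  return JSON.stringify(envelope);
}
```

Because the guard checks only type and payload, a frame that later gains extra top-level fields would still parse, which fits the additive-change goal.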

Client → Server

type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

Client event intent

  • session.start initializes a voice session without locking in transport or auth details yet
  • input_audio.append carries a chunk of captured input audio as an encoded string
  • input_audio.commit marks the current buffered user turn as ready for downstream processing
  • response.cancel interrupts the active listen/think/speak flow
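As an illustration of how these events combine, the sketch below builds the messages for one user turn: a series of input_audio.append events followed by a single input_audio.commit. buildUserTurn is a hypothetical helper, and the chunks are assumed to be already encoded, since the contract does not yet fix an encoding.

```typescript
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Builds the client events for one user turn from already-encoded chunks.
// The commit event is always last, marking the buffered turn as ready.
function buildUserTurn(encodedChunks: string[]): ClientEvent[] {
  const appends = encodedChunks.map(
    (chunk): ClientEvent => ({
      type: "input_audio.append",
      payload: { chunk },
    }),
  );
  return [...appends, { type: "input_audio.commit", payload: {} }];
}
```

A caller would typically send a session.start event once, then serialize and send each event from buildUserTurn in order.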

Server → Client

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

Server event intent

  • session.ready confirms that the gateway created a session identity
  • session.state exposes the coarse session phase needed by the later UI shell
  • transcript.partial and transcript.final support incremental and completed user text display
  • response.text.delta supports streamed assistant text without committing to audio output details yet
  • response.completed marks the current assistant turn as done
  • error is the minimal shape for reporting recoverable failures to both UI and gateway code; retryable, when present, indicates whether the failed operation can be retried
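One way a UI might consume these events is a reducer over the discriminated union, sketched below. SessionView and applyServerEvent are hypothetical names, not part of the contract. The sketch also makes assumptions the contract does not state: each transcript.partial replaces the previous partial, response.text.delta values are concatenated, and a missing retryable is treated as false.

```typescript
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

// Hypothetical UI-side view model; not part of the shared contract.
type SessionView = {
  sessionId: string | null;
  phase: "idle" | "listening" | "thinking" | "speaking";
  partialTranscript: string;
  finalTranscript: string;
  responseText: string;
  responseDone: boolean;
  lastError: { code: string; message: string; retryable: boolean } | null;
};

const initialView: SessionView = {
  sessionId: null,
  phase: "idle",
  partialTranscript: "",
  finalTranscript: "",
  responseText: "",
  responseDone: false,
  lastError: null,
};

function applyServerEvent(view: SessionView, event: ServerEvent): SessionView {
  switch (event.type) {
    case "session.ready":
      return { ...view, sessionId: event.payload.sessionId };
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
      // Assumption: a partial replaces the previous partial.
      return { ...view, partialTranscript: event.payload.text };
    case "transcript.final":
      return { ...view, partialTranscript: "", finalTranscript: event.payload.text };
    case "response.text.delta":
      // Assumption: deltas are appended in arrival order.
      return {
        ...view,
        responseText: view.responseText + event.payload.text,
        responseDone: false,
      };
    case "response.completed":
      return { ...view, responseDone: true };
    case "error":
      // Assumption: an absent retryable flag means "do not retry".
      return {
        ...view,
        lastError: { ...event.payload, retryable: event.payload.retryable ?? false },
      };
    default: {
      // Compile-time exhaustiveness check: fails if a new event is unhandled.
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

The never check in the default branch means that adding an event to ServerEvent surfaces as a compile error in the reducer until a case is added for it.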

Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

  • the later gateway WebSocket session skeleton
  • the later UI voice-session shell

Explicitly deferred for later increments:

  • tool-calling events
  • streamed TTS/output-audio events
  • reconnect/resume semantics
  • protocol version negotiation
  • provider-specific metadata fields

State Machine

idle
 → listening
 → thinking
 → speaking
 → idle
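The cycle above can be expressed as a small lookup table. This is a sketch only: NEXT_PHASE, advance, and isForwardTransition are hypothetical names, and the table covers just the forward path, not cancellation.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Forward path of the state machine: each phase has exactly one successor.
const NEXT_PHASE: Record<Phase, Phase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

// Returns the phase that follows `phase` on the normal (non-cancelled) path.
function advance(phase: Phase): Phase {
  return NEXT_PHASE[phase];
}

// True only when `to` is the direct successor of `from`.
function isForwardTransition(from: Phase, to: Phase): boolean {
  return NEXT_PHASE[from] === to;
}
```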

response.cancel can arrive in any active state, with a different effect in each:

  • listening: restart
  • thinking: cancel
  • speaking: stop immediately

response.cancel Handling Requirements

  • immediate stop of TTS playback
  • immediate stop of LLM streaming
  • reset session state to listening or idle, depending on UX decision
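These requirements can be sketched as a pure function that maps the current phase to the actions to take and the phase to land in. planCancel and CancelPlan are hypothetical names. The reset target is a parameter because the UX decision is still open, and two behaviors are assumptions: cancelling while idle is a no-op, and cancelling while speaking stops the LLM stream as well as TTS.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// The UX decision left open above: where a cancelled turn lands.
type CancelResetTarget = "idle" | "listening";

// Hypothetical plan describing what to do on response.cancel.
type CancelPlan = {
  stopTts: boolean;
  stopLlm: boolean;
  next: Phase;
};

function planCancel(phase: Phase, resetTo: CancelResetTarget): CancelPlan {
  switch (phase) {
    case "speaking":
      // Assumption: the LLM may still be streaming while TTS plays, so stop both.
      return { stopTts: true, stopLlm: true, next: resetTo };
    case "thinking":
      return { stopTts: false, stopLlm: true, next: resetTo };
    case "listening":
      // Nothing is streaming yet; only the session state is reset.
      return { stopTts: false, stopLlm: false, next: resetTo };
    case "idle":
      // Assumption: cancel with no active turn is a no-op.
      return { stopTts: false, stopLlm: false, next: "idle" };
  }
}
```

Keeping the decision in a pure function lets the reset target change later without touching the code that actually stops TTS or LLM work.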

Mechanism

The response.cancel event cancels:

  • TTS process
  • current LLM request
  • tool execution when possible

The shared contract uses response.cancel as the single cancellation signal for all three.
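One possible way to fan a single cancellation out to TTS, LLM, and tool work is the standard AbortController API, sketched below. This document does not prescribe a mechanism; TurnCanceller and streamDeltas are hypothetical names, and streamDeltas stands in for any cancellable task.

```typescript
// Hypothetical per-session holder for the current turn's abort signal.
class TurnCanceller {
  private controller = new AbortController();

  // Signal to hand to the TTS, LLM, and tool tasks of the current turn.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Aborts everything bound to the current signal, then arms a fresh
  // controller so the next turn does not start out already cancelled.
  cancel(): void {
    this.controller.abort();
    this.controller = new AbortController();
  }
}

// Stand-in for a streaming task: emits chunks until done or aborted.
// Returns true if every chunk was emitted, false if it stopped early.
function streamDeltas(
  chunks: string[],
  signal: AbortSignal,
  emit: (text: string) => void,
): boolean {
  for (const chunk of chunks) {
    if (signal.aborted) return false; // checked before each chunk
    emit(chunk);
  }
  return true;
}
```

A gateway handler for response.cancel would call cancel() once, and every task holding that turn's signal would stop at its next check. Tool execution can only honor the signal where the tool itself supports interruption, matching the "when possible" caveat above.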

Protocol Notes for Implementation

  • keep the protocol backward compatible when possible
  • prefer additive event changes over breaking renames
  • document protocol updates in this file, in the same change that alters implementation behavior or diverges from the initial contract