Vela Protocol and State Machine

Event Protocol

The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the gateway and UI import the same event names and envelope shape.

Current gateway baseline:

  • WebSocket endpoint: /ws
  • the gateway sends session.ready and session.state immediately after a successful socket upgrade
  • the gateway accepts only JSON text messages that use the shared envelope shape

Current UI baseline:

  • the browser opens a WebSocket directly to /ws
  • the UI tracks connection status separately from gateway session status
  • once connected and after session.ready has arrived, the UI can send mocked.turn.trigger to request one deterministic mocked turn for the active session
  • the UI exposes a push-to-talk mic control shell that sends placeholder input_audio.append on press and input_audio.commit on release without capturing real audio

WebSocket Message Envelope

Every WebSocket message uses one envelope format:

type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

This increment intentionally keeps the envelope minimal:

  • type identifies the event
  • payload carries the event body
  • no sequence numbers, timestamps, or protocol version fields yet
  • future changes should be additive when possible
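
A minimal sketch of a runtime shape check for this envelope. The isEnvelope helper is hypothetical and not confirmed to exist in @vela/protocol; it also assumes every payload is a plain object, which matches the events listed in this document.

```typescript
// Envelope shape copied from the contract above.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper: checks only the envelope shape, not whether the
// event name is one the gateway or UI actually supports.
function isEnvelope(value: unknown): value is MessageEnvelope<string, object> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { type?: unknown; payload?: unknown };
  return (
    typeof candidate.type === "string" &&
    typeof candidate.payload === "object" &&
    candidate.payload !== null &&
    // Assumption: payloads are plain objects, so arrays are rejected.
    !Array.isArray(candidate.payload)
  );
}
```

Because the envelope carries no version or sequence fields yet, a check like this stays valid if later increments add optional fields.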

Client → Server

type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

Client event intent

  • session.start initializes a voice session without locking in transport or auth details yet
  • mocked.turn.trigger asks the gateway to run one obviously mocked, deterministic transcript/response turn
  • input_audio.append carries a chunk of captured input audio as an encoded string
  • input_audio.commit marks the current buffered user turn as ready for downstream processing
  • response.cancel interrupts the active listen/think/speak flow
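
As an illustration of how these events look on the wire, the sketch below builds the two messages that one press/release cycle of the mic control would send. The helper names pushToTalkCycle and encode, and the placeholder chunk value, are assumptions made for this example only.

```typescript
// Client event union copied from the contract above.
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Hypothetical helper: the events for one press (append) and release (commit).
function pushToTalkCycle(placeholderChunk: string): ClientEvent[] {
  return [
    { type: "input_audio.append", payload: { chunk: placeholderChunk } },
    { type: "input_audio.commit", payload: {} },
  ];
}

// Hypothetical helper: serializes an event as the JSON text frame the gateway expects.
function encode(event: ClientEvent): string {
  return JSON.stringify(event);
}
```

A caller would pass each encoded string to the socket's send method, for example pushToTalkCycle("placeholder").map(encode).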

Current skeleton behavior

  • on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
  • session.start is accepted as an idempotent session acknowledgment and re-sends readiness/state
  • mocked.turn.trigger is accepted only when no other mocked turn is already in flight for that session
  • a mocked turn emits deterministic transcript.final, response.text.delta, response.completed, and session.state events in protocol-valid order
  • input_audio.append updates the ephemeral session record and moves the session to listening
  • input_audio.commit resets the minimal buffered state and returns the session to idle
  • after a completed placeholder input cycle, the same socket can still send mocked.turn.trigger
  • response.cancel is safe to send even when no mocked turn is active
  • response.cancel stops any still-pending mocked turn events for the active turn and resets the minimal session state back to idle
  • a second mocked-turn trigger during an active mocked turn produces error with code mocked_turn_in_flight
  • malformed JSON produces error with code invalid_json
  • invalid envelopes or unsupported client event names produce error with code invalid_message
  • malformed WebSocket frames are rejected without crashing the gateway process
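
The rules above can be sketched as a pure handler over an in-memory session record. This is an illustration under stated assumptions, not the gateway's actual code: the Session fields, the Outcome type, and handleRawMessage are hypothetical names, and emitting server events is left out.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";
type ErrorCode = "invalid_json" | "invalid_message" | "mocked_turn_in_flight";

// Hypothetical in-memory session record; field names are illustrative.
type Session = { phase: Phase; bufferedChunks: number; mockedTurnInFlight: boolean };

type Outcome =
  | { kind: "accepted"; event: string }
  | { kind: "error"; code: ErrorCode };

const CLIENT_EVENT_NAMES: readonly string[] = [
  "session.start",
  "mocked.turn.trigger",
  "input_audio.append",
  "input_audio.commit",
  "response.cancel",
];

function handleRawMessage(session: Session, raw: string): Outcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { kind: "error", code: "invalid_json" };
  }
  const invalid: Outcome = { kind: "error", code: "invalid_message" };
  if (typeof parsed !== "object" || parsed === null) return invalid;
  const { type, payload } = parsed as { type?: unknown; payload?: unknown };
  if (typeof type !== "string" || CLIENT_EVENT_NAMES.indexOf(type) === -1) return invalid;
  if (typeof payload !== "object" || payload === null) return invalid;

  switch (type) {
    case "mocked.turn.trigger":
      if (session.mockedTurnInFlight) {
        return { kind: "error", code: "mocked_turn_in_flight" };
      }
      session.mockedTurnInFlight = true;
      break;
    case "input_audio.append":
      // Per-field payload validation (e.g. chunk) is not modeled in this sketch.
      session.bufferedChunks += 1;
      session.phase = "listening";
      break;
    case "input_audio.commit":
      session.bufferedChunks = 0;
      session.phase = "idle";
      break;
    case "response.cancel":
      // Safe even when no mocked turn is active.
      session.mockedTurnInFlight = false;
      session.phase = "idle";
      break;
    default:
      // session.start: idempotent acknowledgment, no state change here.
      break;
  }
  return { kind: "accepted", event: type };
}
```

Returning an outcome instead of throwing mirrors the requirement that bad input never crashes the gateway process.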

UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

not connected
 → connecting
 → connected
 → disconnected
 → error

Notes:

  • this UI state is transport-oriented and is separate from the shared gateway session.state payload
  • session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
  • the UI disables the mocked-turn control while disconnected, before session.ready arrives, or while a mocked turn is already in flight
  • the UI disables the mic control while disconnected, before session.ready, or while a mocked turn is already in flight
  • pressing the mic control sends one placeholder input_audio.append chunk and releasing it sends input_audio.commit
  • the UI copy explicitly labels the mic button as a control shell and not real microphone capture
  • the UI shows a cancel control and enables it only while a mocked turn is active
  • after cancel returns the gateway to idle, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
  • the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
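
The enable/disable rules in these notes can be expressed as one pure function. The sketch below is illustrative: UiState, Controls, and enabledControls are hypothetical names, and it assumes that "disconnected" in the notes means any connection status other than connected.

```typescript
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical UI-side state; transport status is kept separate from session readiness.
type UiState = {
  connection: ConnectionStatus;
  sessionReady: boolean;
  mockedTurnInFlight: boolean;
};

type Controls = { mockedTurn: boolean; mic: boolean; cancel: boolean };

function enabledControls(ui: UiState): Controls {
  // Both start-style controls share the same three preconditions.
  const canStart =
    ui.connection === "connected" && ui.sessionReady && !ui.mockedTurnInFlight;
  return {
    mockedTurn: canStart,
    mic: canStart,
    // Cancel is enabled only while a mocked turn is active.
    cancel: ui.mockedTurnInFlight,
  };
}
```

Deriving control state from a single function keeps the mocked-turn and mic controls from drifting apart as conditions are added.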

Server → Client

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

Server event intent

  • session.ready confirms that the gateway created a session identity
  • session.state exposes the coarse session phase needed by the later UI shell
  • transcript.partial and transcript.final support incremental and completed user text display
  • response.text.delta supports streamed assistant text without committing to audio output details yet
  • response.completed marks the current assistant turn as done
  • error is the minimal recoverable failure shape for both UI and gateway work
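
One way a client could consume these events is a small reducer over a view model. This is a sketch only: the View type and applyServerEvent are hypothetical, and it assumes transcript events replace the displayed transcript while response deltas append.

```typescript
// Server event union copied from the contract above.
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: "idle" | "listening" | "thinking" | "speaking" } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical view model for the UI shell.
type View = {
  sessionId: string | null;
  phase: "idle" | "listening" | "thinking" | "speaking";
  transcript: string;
  response: string;
  lastError: string | null;
};

function applyServerEvent(view: View, event: ServerEvent): View {
  switch (event.type) {
    case "session.ready":
      return { ...view, sessionId: event.payload.sessionId };
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
    case "transcript.final":
      // Assumption: each transcript event carries the full text to display.
      return { ...view, transcript: event.payload.text };
    case "response.text.delta":
      return { ...view, response: view.response + event.payload.text };
    case "error":
      return { ...view, lastError: event.payload.code };
    default:
      // response.completed: nothing to change in this minimal view.
      return view;
  }
}
```

Switching on the type field lets the compiler narrow each payload, so adding an event to the union surfaces every place that needs handling.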

Deterministic mocked turn sequence

For this increment, mocked.turn.trigger produces one fixed interaction for the active session:

session.state(listening)
→ transcript.final("[mocked user] What is the current mocked vertical slice?")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)

Notes:

  • the content is intentionally fixed and obviously mocked
  • no audio, STT, LLM, TTS, or external providers participate in this flow
  • response.cancel can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to idle
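
For reference, the fixed sequence can be written down as data. The sketch below copies the events and strings from the sequence above; the MOCKED_TURN constant and the two helper functions are hypothetical names used only to show how a test might inspect the sequence.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Only the server events that appear in the mocked turn.
type MockedTurnEvent =
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

const MOCKED_TURN: MockedTurnEvent[] = [
  { type: "session.state", payload: { value: "listening" } },
  {
    type: "transcript.final",
    payload: { text: "[mocked user] What is the current mocked vertical slice?" },
  },
  { type: "session.state", payload: { value: "thinking" } },
  { type: "session.state", payload: { value: "speaking" } },
  { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
  {
    type: "response.text.delta",
    payload: { text: "This is a deterministic mocked response from the gateway vertical slice." },
  },
  { type: "response.completed", payload: {} },
  { type: "session.state", payload: { value: "idle" } },
];

// Concatenates the streamed deltas into the full assistant text.
function responseText(events: MockedTurnEvent[]): string {
  let text = "";
  for (const event of events) {
    if (event.type === "response.text.delta") text += event.payload.text;
  }
  return text;
}

// Lists the session phases in the order they are announced.
function phaseOrder(events: MockedTurnEvent[]): Phase[] {
  const phases: Phase[] = [];
  for (const event of events) {
    if (event.type === "session.state") phases.push(event.payload.value);
  }
  return phases;
}
```

Keeping the sequence as a constant makes it easy for gateway and UI tests to assert against the same fixture.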

Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

  • the later gateway WebSocket session skeleton
  • the later UI voice-session shell

Explicitly deferred for later increments:

  • freeform typed user input
  • tool-calling events
  • streamed TTS/output-audio events
  • reconnect/resume semantics
  • protocol version negotiation
  • provider-specific metadata fields

State Machine

idle
 → listening
 → thinking
 → speaking
 → idle
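
The cycle above can be encoded as a small transition table. This sketch is an assumption-laden illustration: NEXT_PHASE and canTransition are hypothetical names, and the rule that any phase may return directly to idle is inferred from the current response.cancel and input_audio.commit behavior rather than stated as part of the contract.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Forward edges copied from the diagram above.
const NEXT_PHASE: Record<Phase, Phase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

// Assumption: returning to idle is always allowed, which covers
// response.cancel and input_audio.commit in the current skeleton.
function canTransition(from: Phase, to: Phase): boolean {
  return to === "idle" || NEXT_PHASE[from] === to;
}
```

A guard like this can reject out-of-order session.state updates in tests without constraining how cancellation is implemented.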

Current mocked-pipeline behavior:

  • during an active mocked turn, response.cancel returns the session to idle immediately
  • any mocked turn timers that have not fired yet are dropped, so no later response.text.delta or response.completed events are emitted for the cancelled turn
  • once idle is restored, the same WebSocket session can start another mocked turn without reconnecting
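
A minimal sketch of the timer-dropping behavior described above, assuming the mocked turn is driven by setTimeout. The createMockedTurnRunner factory, its stepMs delay, and the choice to always emit session.state(idle) on cancel are assumptions for illustration, not the gateway's confirmed implementation.

```typescript
type OutgoingEvent = { type: string; payload: object };
type Emit = (event: OutgoingEvent) => void;

// Hypothetical runner: schedules one timer per event and can drop them all.
function createMockedTurnRunner(emit: Emit, stepMs: number = 50) {
  let pending: Array<ReturnType<typeof setTimeout>> = [];
  let inFlight = false;

  function start(events: OutgoingEvent[]): boolean {
    // A caller would answer a rejected start with error code mocked_turn_in_flight.
    if (inFlight || events.length === 0) return false;
    inFlight = true;
    pending = events.map((event, index) =>
      setTimeout(() => {
        emit(event);
        if (index === events.length - 1) {
          // Last event delivered: the turn is over.
          inFlight = false;
          pending = [];
        }
      }, index * stepMs),
    );
    return true;
  }

  function cancel(): void {
    // Timers that have not fired yet are cleared, so nothing more is emitted
    // for the cancelled turn.
    pending.forEach((timer) => clearTimeout(timer));
    pending = [];
    inFlight = false;
    // Assumption: idle is announced even when no turn was active.
    emit({ type: "session.state", payload: { value: "idle" } });
  }

  return { start, cancel, isInFlight: (): boolean => inFlight };
}
```

Because cancel clears the in-flight flag synchronously, the same runner can accept a new start call right after cancellation, matching the no-reconnect behavior described above.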

More general future-state expectations:

response.cancel can arrive during any active phase, with a different effect in each:

  • listening → restart
  • thinking → cancel
  • speaking → stop immediately

response.cancel Handling Requirements

  • immediate stop of TTS playback
  • immediate stop of LLM streaming
  • reset session state to listening or idle, depending on UX decision

Mechanism

The response.cancel event cancels:

  • TTS process
  • current LLM request
  • tool execution when possible

This shared contract uses response.cancel consistently for that cancellation signal.

Protocol Notes for Implementation

  • keep the protocol backward compatible when possible
  • prefer additive event changes over breaking renames
  • update this file in the same change whenever the implementation changes protocol behavior or diverges from the initial contract