Vela Protocol and State Machine

Event Protocol

The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the gateway and UI import the same event names and envelope shape.

Current gateway baseline:

  • WebSocket endpoint: /ws
  • the gateway sends session.ready and session.state immediately after a successful socket upgrade
  • the gateway accepts only JSON text messages that use the shared envelope shape

Current UI baseline:

  • the browser opens a WebSocket directly to /ws
  • the UI tracks connection status separately from gateway session status
  • the UI exposes a push-to-talk mic control shell that captures no real audio: it sends a placeholder input_audio.append on press and input_audio.commit on release
  • the push-to-talk shell is the only supported way to start a mocked turn from the shipped UI

WebSocket Message Envelope

Every WebSocket message uses one envelope format:

type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

This increment intentionally keeps the envelope minimal:

  • type identifies the event
  • payload carries the event body
  • no sequence numbers, timestamps, or protocol version fields yet
  • future changes should be additive when possible
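
As a minimal sketch of how either side might build and check envelopes: makeEnvelope and isEnvelope below are hypothetical helper names, not documented exports of @vela/protocol, and the guard assumes that payloads are always JSON objects, as they are for every event listed in this document.

```typescript
// Envelope shape as documented above.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper: build an envelope with exactly the two documented fields.
function makeEnvelope<TType extends string, TPayload>(
  type: TType,
  payload: TPayload,
): MessageEnvelope<TType, TPayload> {
  return { type, payload };
}

// Hypothetical type guard: requires a string `type` and a non-null,
// non-array object `payload`. Extra top-level fields are ignored.
function isEnvelope(value: unknown): value is MessageEnvelope<string, object> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.type === "string" &&
    typeof candidate.payload === "object" &&
    candidate.payload !== null &&
    !Array.isArray(candidate.payload)
  );
}

// Serialized form of a payload-free client event.
const wire = JSON.stringify(makeEnvelope("input_audio.commit", {}));
```

Because the guard ignores unknown top-level fields, a later additive change such as a sequence number would not break a receiver written this way.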

Client → Server

type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

Client event intent

  • session.start initializes a voice session without locking in transport or auth details yet
  • mocked.turn.trigger is a retired legacy event name that the gateway now rejects with a deterministic recoverable error
  • input_audio.append carries a chunk of captured input audio as an encoded string
  • input_audio.commit marks the current buffered user turn as ready for downstream processing
  • response.cancel interrupts the active listen/think/speak flow
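
The supported client events could be sent with a thin wrapper like the sketch below. SocketLike, sendEvent, pressMic, releaseMic, and cancelTurn are hypothetical names chosen for illustration, and the "placeholder" chunk value is an assumption; this document does not state what the shipped UI sends as its placeholder chunk.

```typescript
// Client event union as documented above.
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// The retired trigger is excluded so the compiler rejects attempts to send it.
type SendableClientEvent = Exclude<ClientEvent, { type: "mocked.turn.trigger" }>;

// Minimal socket surface; a browser WebSocket or a test double both satisfy it.
interface SocketLike {
  send(data: string): void;
}

function sendEvent(socket: SocketLike, event: SendableClientEvent): void {
  socket.send(JSON.stringify(event));
}

// Mic press: one placeholder chunk (the value is an assumption).
function pressMic(socket: SocketLike, placeholderChunk = "placeholder"): void {
  sendEvent(socket, { type: "input_audio.append", payload: { chunk: placeholderChunk } });
}

// Mic release: mark the buffered turn as ready.
function releaseMic(socket: SocketLike): void {
  sendEvent(socket, { type: "input_audio.commit", payload: {} });
}

// Cancel control: interrupt the active turn.
function cancelTurn(socket: SocketLike): void {
  sendEvent(socket, { type: "response.cancel", payload: {} });
}
```

Narrowing the sendable union at the type level is one way to keep new UI code from reintroducing the retired event name.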

Current skeleton behavior

  • on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
  • session.start is accepted as an idempotent session acknowledgment and re-sends readiness/state
  • mocked.turn.trigger is rejected deterministically with error.code = unsupported_mocked_turn_trigger
  • input_audio.append updates the ephemeral session record and moves the session to listening
  • each accepted input_audio.append emits one deterministic transcript.partial for the current placeholder turn
  • input_audio.commit emits exactly one deterministic transcript.final and then starts the deterministic mocked assistant response stream for that push-to-talk turn
  • after a completed placeholder input cycle, the same socket can start another placeholder push-to-talk turn without reconnecting
  • response.cancel is safe to send even when no mocked turn is active
  • response.cancel stops any still-pending mocked turn events for the active turn and resets the minimal session state back to idle
  • malformed JSON produces error with code invalid_json
  • invalid envelopes or unrecognized client event names produce error with code invalid_message (the retired mocked.turn.trigger keeps its own code, as noted above)
  • malformed WebSocket frames are rejected without crashing the gateway process
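
The three error paths above can be sketched as a single classification step. classifyInbound and the Inbound type are hypothetical names, and the message strings for invalid_json and invalid_message are placeholders; only the error codes and the mocked.turn.trigger message come from this document. Payload field validation (for example, a missing chunk) is not specified here and is left out.

```typescript
type ErrorPayload = { code: string; message: string; retryable?: boolean };

type Inbound =
  | { ok: true; type: string; payload: object }
  | { ok: false; error: ErrorPayload };

function invalidMessage(): Inbound {
  // Message text is a placeholder, not the gateway's actual wording.
  return { ok: false, error: { code: "invalid_message", message: "Invalid message envelope." } };
}

// Hypothetical classifier for one inbound text frame.
function classifyInbound(raw: string): Inbound {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Message text is a placeholder, not the gateway's actual wording.
    return { ok: false, error: { code: "invalid_json", message: "Message is not valid JSON." } };
  }
  if (typeof parsed !== "object" || parsed === null) return invalidMessage();

  const { type, payload } = parsed as { type?: unknown; payload?: unknown };
  if (typeof type !== "string" || typeof payload !== "object" || payload === null) {
    return invalidMessage();
  }

  switch (type) {
    case "mocked.turn.trigger":
      // Retired event name: rejected with its own deterministic code.
      return {
        ok: false,
        error: {
          code: "unsupported_mocked_turn_trigger",
          message:
            "mocked.turn.trigger is no longer supported; use input_audio.append and input_audio.commit instead.",
        },
      };
    case "session.start":
    case "input_audio.append":
    case "input_audio.commit":
    case "response.cancel":
      return { ok: true, type, payload };
    default:
      return invalidMessage();
  }
}
```

Returning a result object instead of throwing keeps a bad frame from escaping the handler, in line with the requirement that malformed input must not crash the gateway process.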

UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

not connected
 → connecting
 → connected
 → disconnected
 → error

Notes:

  • this UI state is transport-oriented and is separate from the shared gateway session.state payload
  • session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
  • the UI disables the mic control while disconnected, before session.ready, or while a mocked turn is already in flight
  • pressing the mic control sends one placeholder input_audio.append chunk and releasing it sends input_audio.commit
  • while a placeholder push-to-talk turn is in progress, the UI renders the latest transcript.partial
  • after placeholder commit, the UI renders the transcript.final, clears the partial-only display, and streams the mocked assistant text from the downstream response events
  • the UI copy explicitly labels the mic button as a control shell and not real microphone capture
  • the UI shows a cancel control and enables it only while a mocked turn is active
  • after cancel returns the gateway to idle, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
  • the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
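
A minimal sketch of the transport state machine and the control-enabling rules above. The TransportSignal names, the function names, and the choice to keep error sticky when a close follows an error are all assumptions made for illustration; the document lists only the five states and the enable/disable rules.

```typescript
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical transport signals, loosely modeled on browser WebSocket callbacks.
type TransportSignal = "connect" | "open" | "close" | "error";

function nextConnectionStatus(
  current: ConnectionStatus,
  signal: TransportSignal,
): ConnectionStatus {
  switch (signal) {
    case "connect":
      return "connecting";
    case "open":
      // Only a pending connection can become connected.
      return current === "connecting" ? "connected" : current;
    case "close":
      // Assumption: a close that follows an error keeps the error visible.
      return current === "error" ? "error" : "disconnected";
    case "error":
      return "error";
  }
}

type ControlInputs = {
  status: ConnectionStatus;
  sessionReady: boolean; // true once session.ready has been received
  turnInFlight: boolean; // true while a mocked turn is active
};

// Mic is disabled while not connected, before session.ready, or mid-turn.
function micEnabled(inputs: ControlInputs): boolean {
  return inputs.status === "connected" && inputs.sessionReady && !inputs.turnInFlight;
}

// Cancel is enabled only while a mocked turn is active.
function cancelEnabled(inputs: ControlInputs): boolean {
  return inputs.turnInFlight;
}
```

Keeping the transport status separate from the gateway session phase, as the notes above describe, means these functions never need to inspect session.state.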

Server → Client

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

Server event intent

  • session.ready confirms that the gateway created a session identity
  • session.state exposes the coarse session phase needed by the later UI shell
  • transcript.partial and transcript.final support incremental and completed user text display
  • response.text.delta supports streamed assistant text without committing to audio output details yet
  • response.completed marks the current assistant turn as done
  • error is the minimal recoverable failure shape for both UI and gateway work
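
On the UI side, these events could be folded into view state with a reducer like the sketch below. TurnView, initialView, and applyServerEvent are hypothetical names. Clearing the previous response text when a new transcript.final arrives is an assumption; the document does not say when rendered text from an earlier turn is reset.

```typescript
type SessionPhase = "idle" | "listening" | "thinking" | "speaking";

// Server event union as documented above.
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: SessionPhase } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical view state for one socket.
type TurnView = {
  sessionId: string | null;
  phase: SessionPhase;
  partial: string;
  finalTranscript: string;
  responseText: string;
  lastErrorCode: string | null;
};

const initialView: TurnView = {
  sessionId: null,
  phase: "idle",
  partial: "",
  finalTranscript: "",
  responseText: "",
  lastErrorCode: null,
};

function applyServerEvent(view: TurnView, event: ServerEvent): TurnView {
  switch (event.type) {
    case "session.ready":
      return { ...view, sessionId: event.payload.sessionId };
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
      // Each partial replaces the previous one.
      return { ...view, partial: event.payload.text };
    case "transcript.final":
      // Assumption: a new final transcript starts a fresh response display.
      return { ...view, partial: "", finalTranscript: event.payload.text, responseText: "" };
    case "response.text.delta":
      // Deltas are appended in arrival order.
      return { ...view, responseText: view.responseText + event.payload.text };
    case "response.completed":
      return view;
    case "error":
      // Errors are recorded, not thrown, so the UI stays in a safe state.
      return { ...view, lastErrorCode: event.payload.code };
  }
}
```

A pure reducer like this is straightforward to test by replaying a recorded event sequence through it.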

Legacy mocked turn trigger rejection

For this increment, direct mocked.turn.trigger requests no longer start a mocked turn:

mocked.turn.trigger
→ error(code="unsupported_mocked_turn_trigger", message="mocked.turn.trigger is no longer supported; use input_audio.append and input_audio.commit instead.")

Notes:

  • this rejection is deterministic and recoverable
  • the session remains available for the supported push-to-talk flow on the same socket

Deterministic placeholder push-to-talk transcript and mocked response sequence

For this increment, the existing mic-control shell still sends placeholder input_audio.append on press and input_audio.commit on release. The gateway now translates that shell flow into deterministic mocked transcript events and then reuses the existing mocked response stream:

input_audio.append #1
→ session.state(listening) when entering the turn
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")

input_audio.append #N (N > 1)
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")

input_audio.commit after N appends
→ transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)

Safe deterministic edge cases for this mocked placeholder flow:

  • commit without any prior append is accepted and emits transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")
  • repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
  • after the final transcript, placeholder commit follows the deterministic mocked thinking → speaking → response.text.delta* → response.completed → idle path
  • response.cancel can interrupt this mocked post-commit response path; already-rendered transcript or assistant text is not retracted

Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

  • the later gateway WebSocket session skeleton
  • the later UI voice-session shell

Explicitly deferred for later increments:

  • freeform typed user input
  • tool-calling events
  • streamed TTS/output-audio events
  • reconnect/resume semantics
  • protocol version negotiation
  • provider-specific metadata fields

State Machine

idle
 → listening
 → thinking
 → speaking
 → idle

Current mocked-pipeline behavior:

  • during an active mocked turn, response.cancel returns the session to idle immediately
  • any mocked turn timers that have not fired yet are dropped, so no later response.text.delta or response.completed events are emitted for the cancelled turn
  • the same cancellation behavior applies when a mocked turn was started by input_audio.commit
  • once idle is restored, the same WebSocket session can start another placeholder push-to-talk turn without reconnecting
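
One way to get the "drop timers that have not fired" behavior is to track every pending timer handle for the active turn. MockedTurnTimers and handleResponseCancel are hypothetical names, the delay values are up to the caller, and emitting session.state(idle) from the cancel handler is an assumption about how the reset is surfaced to the client.

```typescript
type Emit = (type: string, payload: object) => void;

// Hypothetical holder for the pending timers of one mocked turn.
class MockedTurnTimers {
  private handles: Array<ReturnType<typeof setTimeout>> = [];

  // Schedule one future emission and remember its handle until it fires.
  schedule(delayMs: number, action: () => void): void {
    const handle = setTimeout(() => {
      this.handles = this.handles.filter((h) => h !== handle);
      action();
    }, delayMs);
    this.handles.push(handle);
  }

  // Clear every timer that has not fired yet; returns how many were dropped.
  cancelAll(): number {
    const dropped = this.handles.length;
    this.handles.forEach((handle) => clearTimeout(handle));
    this.handles = [];
    return dropped;
  }

  pendingCount(): number {
    return this.handles.length;
  }
}

// Safe to call with no active turn: cancelAll is then a no-op.
function handleResponseCancel(timers: MockedTurnTimers, emit: Emit): void {
  timers.cancelAll();
  // Assumption: the reset is announced with a session.state event.
  emit("session.state", { value: "idle" });
}
```

Because cleared timers never run their callbacks, no later response.text.delta or response.completed can be emitted for the cancelled turn, and the same timers object can be reused for the next turn.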

More general future-state expectations:

response.cancel can arrive during any active phase, with a different effect in each:

  • listening → restart
  • thinking → cancel
  • speaking → stop immediately

response.cancel Handling Requirements

  • immediate stop of TTS playback
  • immediate stop of LLM streaming
  • reset session state to listening or idle, depending on the UX decision

Mechanism

The response.cancel event cancels:

  • TTS process
  • current LLM request
  • tool execution when possible

This shared contract uses response.cancel consistently for that cancellation signal.
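
This document does not prescribe how a future gateway should fan one cancellation signal out to TTS, the LLM request, and tool execution. As one possible shape, the sketch below uses the standard AbortController; ActiveResponse and onCancel are hypothetical names, and the registered callbacks stand in for whatever each subsystem needs to do to stop.

```typescript
// Hypothetical holder for the work that belongs to one assistant response.
class ActiveResponse {
  private readonly controller = new AbortController();

  // Signal that can be passed to APIs accepting an AbortSignal.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Each subsystem registers how to stop itself.
  onCancel(stop: () => void): void {
    if (this.controller.signal.aborted) {
      // Already cancelled: stop late registrations right away.
      stop();
      return;
    }
    this.controller.signal.addEventListener("abort", stop, { once: true });
  }

  // Called when response.cancel arrives; repeated calls have no further effect.
  cancel(): void {
    this.controller.abort();
  }
}
```

A gateway built this way would create one ActiveResponse per turn and call cancel() from its response.cancel handler. Tool execution that cannot be interrupted would simply not register a callback, which matches the "when possible" qualifier above.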

Protocol Notes for Implementation

  • keep the protocol backward compatible when possible
  • prefer additive event changes over breaking renames
  • document protocol updates in this file whenever implementation changes behavior
  • when implementation diverges from the initial contract, update this document in the same change