# Vela Protocol and State Machine

## Event Protocol

The shared code-level contract lives in the Yarn workspace package `@vela/protocol` so both the gateway and UI import the same event names and envelope shape.

Current gateway baseline:

- WebSocket endpoint: `/ws`
- the gateway sends `session.ready` and `session.state` immediately after a successful socket upgrade
- the gateway accepts JSON text messages only in the shared envelope shape

Current UI baseline:

- the browser opens a WebSocket directly to `/ws`
- the UI tracks connection status separately from gateway session status
- the UI can send `mocked.turn.trigger` after `session.ready` while connected to request one deterministic mocked turn for the active session
- the UI exposes a push-to-talk mic control shell that sends placeholder `input_audio.append` on press and `input_audio.commit` on release without capturing real audio

## WebSocket Message Envelope

Every WebSocket message uses one envelope format:

```ts
// Generic over the event name and its payload type.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};
```

This increment intentionally keeps the envelope minimal:

- `type` identifies the event
- `payload` carries the event body
- no sequence numbers, timestamps, or protocol version fields yet
- future changes should be additive when possible

### Client → Server

```ts
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };
```

#### Client event intent

- `session.start` initializes a voice session without locking in transport or auth details yet
- `mocked.turn.trigger` asks the gateway to run one obviously mocked, deterministic transcript/response turn
- `input_audio.append` carries a chunk of captured input audio as an encoded string
- `input_audio.commit` marks the current buffered user turn as ready for downstream processing
- `response.cancel`
interrupts the active listen/think/speak flow

### Current skeleton behavior

- on connect, the gateway creates an ephemeral in-memory session and emits `session.ready` plus `session.state`
- `session.start` is accepted as an idempotent session acknowledgment and re-sends readiness/state
- `mocked.turn.trigger` is accepted only when no other mocked turn is already in flight for that session
- a mocked turn emits deterministic `transcript.final`, `response.text.delta`, `response.completed`, and `session.state` events in protocol-valid order
- `input_audio.append` updates the ephemeral session record and moves the session to `listening`
- each accepted `input_audio.append` emits one deterministic `transcript.partial` for the current placeholder turn
- `input_audio.commit` emits exactly one deterministic `transcript.final` and then starts the same deterministic mocked assistant response stream used by `mocked.turn.trigger`
- after a completed placeholder input cycle, the same socket can still send `mocked.turn.trigger`
- `response.cancel` is safe to send even when no mocked turn is active
- `response.cancel` stops any still-pending mocked turn events for the active turn and resets the minimal session state back to `idle`
- a second mocked-turn trigger during an active mocked turn produces `error` with code `mocked_turn_in_flight`
- malformed JSON produces `error` with code `invalid_json`
- invalid envelopes or unsupported client event names produce `error` with code `invalid_message`
- malformed WebSocket frames are rejected without crashing the gateway process

### UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

```text
not connected → connecting → connected → disconnected → error
```

Notes:

- this UI state is transport-oriented and is separate from the shared gateway `session.state` payload
- `session.state` currently reflects the gateway session phase (`idle`, `listening`, `thinking`,
`speaking`)
- the UI disables the mocked-turn control until `session.ready` arrives, while disconnected, or while a mocked turn is already in flight
- the UI disables the mic control while disconnected, before `session.ready`, or while a mocked turn is already in flight
- pressing the mic control sends one placeholder `input_audio.append` chunk and releasing it sends `input_audio.commit`
- while a placeholder push-to-talk turn is in progress, the UI renders the latest `transcript.partial`
- after placeholder commit, the UI renders the `transcript.final`, clears the partial-only display, and streams the mocked assistant text from the downstream response events
- the UI copy explicitly labels the mic button as a control shell and not real microphone capture
- the UI shows a cancel control and enables it only while a mocked turn is active
- after cancel returns the gateway to `idle`, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
- the UI treats malformed server messages, browser WebSocket errors, and gateway `error` events as safe error states instead of throwing

### Server → Client

```ts
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };
```

#### Server event intent

- `session.ready` confirms that the gateway created a session identity
- `session.state` exposes the coarse session phase needed by the later UI shell
- `transcript.partial` and `transcript.final` support incremental and completed user text display
- `response.text.delta` supports streamed assistant text without
committing to audio output details yet
- `response.completed` marks the current assistant turn as done
- `error` is the minimal recoverable failure shape for both UI and gateway work

### Deterministic mocked turn sequence

For this increment, `mocked.turn.trigger` produces one fixed interaction for the active session:

```text
session.state(listening)
→ transcript.final("[mocked user] What is the current mocked vertical slice?")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)
```

Notes:

- the content is intentionally fixed and obviously mocked
- no audio, STT, LLM, TTS, or external providers participate in this flow
- `response.cancel` can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to `idle`

### Deterministic placeholder push-to-talk transcript and mocked response sequence

For this increment, the existing mic-control shell still sends placeholder `input_audio.append` on press and `input_audio.commit` on release.
The gateway now translates that shell flow into deterministic mocked transcript events and then reuses the existing mocked response stream:

```text
input_audio.append #1
→ session.state(listening) when entering the turn
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")

input_audio.append #N (N > 1)
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")

input_audio.commit after N appends
→ transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)
```

Safe deterministic edge cases for this mocked placeholder flow:

- commit without any prior append is accepted and emits `transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")`
- repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
- after the final transcript, placeholder commit follows the same mocked `thinking → speaking → response.text.delta* → response.completed → idle` path as `mocked.turn.trigger`
- `response.cancel` can interrupt this mocked post-commit response path the same way it interrupts `mocked.turn.trigger`; already-rendered transcript or assistant text is not retracted

## Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

- the later gateway WebSocket session skeleton
- the later UI voice-session shell

Explicitly deferred for later increments:

- freeform typed user input
- tool-calling events
- streamed TTS/output-audio events
- reconnect/resume semantics
- protocol version negotiation
- provider-specific metadata
fields

## State Machine

```text
idle → listening → thinking → speaking → idle
```

Current mocked-pipeline behavior:

- during an active mocked turn, `response.cancel` returns the session to `idle` immediately
- any mocked turn timers that have not fired yet are dropped, so no later `response.text.delta` or `response.completed` events are emitted for the cancelled turn
- the same cancellation behavior applies when a mocked turn was started by `input_audio.commit`
- once `idle` is restored, the same WebSocket session can start another mocked turn without reconnecting

More general future-state expectations:

`response.cancel` can occur at:

- listening → restart
- thinking → cancel
- speaking → stop immediately

## `response.cancel` Handling Requirements

- immediate stop of TTS playback
- immediate stop of LLM streaming
- reset session state to listening or idle, depending on UX decision

### Mechanism

The `response.cancel` event cancels:

- TTS process
- current LLM request
- tool execution when possible

This shared contract uses `response.cancel` consistently for that cancellation signal.

## Protocol Notes for Implementation

- keep the protocol backward compatible when possible
- prefer additive event changes over breaking renames
- document protocol updates in this file whenever implementation changes behavior
- when implementation diverges from the initial contract, update this document in the same change
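
### Illustrative phase-transition sketch

The State Machine section and the current mocked-pipeline cancel rule can be sketched as a small pure function. This is a minimal sketch, not the gateway's actual implementation: `PhaseSignal`, `FORWARD`, `nextPhase`, and `replay` are hypothetical names that are not part of `@vela/protocol`, and the direct `idle → thinking` step for a commit without a prior append is an assumption, since this document does not say whether `listening` is emitted first in that case.

```typescript
// Session phases, matching the `session.state` payload values in this document.
type SessionPhase = "idle" | "listening" | "thinking" | "speaking";

// Hypothetical internal signals (illustrative only, not protocol events).
type PhaseSignal =
  | "turn_started" // an accepted input_audio.append or mocked.turn.trigger
  | "input_committed" // transcript.final has been emitted
  | "response_started" // mocked response stream is about to begin
  | "response_completed" // response.completed has been sent
  | "cancelled"; // response.cancel was received

// Forward edges of the idle → listening → thinking → speaking → idle loop.
const FORWARD: Record<SessionPhase, Partial<Record<PhaseSignal, SessionPhase>>> = {
  // Assumption: a commit without a prior append jumps straight to thinking.
  idle: { turn_started: "listening", input_committed: "thinking" },
  listening: { input_committed: "thinking" },
  thinking: { response_started: "speaking" },
  speaking: { response_completed: "idle" },
};

function nextPhase(current: SessionPhase, signal: PhaseSignal): SessionPhase {
  // Current mocked-pipeline rule: cancel always lands on idle, and is a
  // harmless no-op when the session is already idle.
  if (signal === "cancelled") return "idle";
  // A signal with no edge from the current phase leaves the phase unchanged,
  // so repeated appends while listening keep the session in listening.
  return FORWARD[current][signal] ?? current;
}

// Replays a list of signals from idle and returns the resulting phase.
function replay(signals: PhaseSignal[]): SessionPhase {
  return signals.reduce<SessionPhase>(nextPhase, "idle");
}
```

For example, `replay(["turn_started", "input_committed", "response_started", "cancelled"])` ends in `idle`, mirroring a cancel during the speaking phase. Keeping the function pure leaves timer cleanup and event suppression to the caller, which would still need to drop pending mocked-turn timers itself.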