Vela Protocol and State Machine
Event Protocol
The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the
gateway and UI import the same event names and envelope shape.
Current gateway baseline:
- WebSocket endpoint: /ws
- the gateway sends session.ready and session.state immediately after a successful socket upgrade
- the gateway accepts JSON text messages only in the shared envelope shape
Current UI baseline:
- the browser opens a WebSocket directly to /ws
- the UI tracks connection status separately from gateway session status
- the UI can send mocked.turn.trigger after session.ready while connected to request one deterministic mocked turn for the active session
- the UI exposes a push-to-talk mic control shell that sends placeholder input_audio.append on press and input_audio.commit on release without capturing real audio
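The push-to-talk shell described above can be sketched as follows. This is a minimal illustration, not the UI's actual code: createMicShell, the send callback, and the PLACEHOLDER_CHUNK value are hypothetical names chosen for the example.

```typescript
// Hypothetical push-to-talk shell: sends placeholder events only and never captures audio.
type MicEvent =
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} };

// Assumed placeholder value; the document does not specify the chunk contents.
const PLACEHOLDER_CHUNK = "placeholder";

function createMicShell(send: (json: string) => void) {
  let pressed = false;
  return {
    // Press: send exactly one placeholder append.
    press(): void {
      if (pressed) return; // ignore a repeated press without a release
      pressed = true;
      const append: MicEvent = {
        type: "input_audio.append",
        payload: { chunk: PLACEHOLDER_CHUNK },
      };
      send(JSON.stringify(append));
    },
    // Release: commit the placeholder turn.
    release(): void {
      if (!pressed) return; // ignore a release without a prior press
      pressed = false;
      const commit: MicEvent = { type: "input_audio.commit", payload: {} };
      send(JSON.stringify(commit));
    },
  };
}
```

In a browser, the send callback would typically wrap WebSocket.send on the open /ws socket.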
WebSocket Message Envelope
Every WebSocket message uses one envelope format:
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};
This increment intentionally keeps the envelope minimal:
- type identifies the event
- payload carries the event body
- no sequence numbers, timestamps, or protocol version fields yet
- future changes should be additive when possible
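A minimal sketch of how a caller might build and check this envelope at runtime. The helpers makeEnvelope and isEnvelope are hypothetical and are not confirmed exports of @vela/protocol. Requiring payload to be a non-null, non-array object is an assumption based on the payload shapes documented here.

```typescript
// Hypothetical helpers; not confirmed exports of @vela/protocol.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Build an envelope from an event name and an event body.
function makeEnvelope<TType extends string, TPayload>(
  type: TType,
  payload: TPayload,
): MessageEnvelope<TType, TPayload> {
  return { type, payload };
}

// Runtime shape check: a string `type` and an object `payload`.
// Assumption: payloads are always plain objects, as in the events documented here.
function isEnvelope(value: unknown): value is MessageEnvelope<string, object> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { type?: unknown; payload?: unknown };
  return (
    typeof candidate.type === "string" &&
    typeof candidate.payload === "object" &&
    candidate.payload !== null &&
    !Array.isArray(candidate.payload)
  );
}
```

The check covers only the envelope shape. Validating the event name and the payload fields is a separate step.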
Client → Server
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };
Client event intent
- session.start initializes a voice session without locking in transport or auth details yet
- mocked.turn.trigger asks the gateway to run one obviously mocked, deterministic transcript/response turn
- input_audio.append carries a chunk of captured input audio as an encoded string
- input_audio.commit marks the current buffered user turn as ready for downstream processing
- response.cancel interrupts the active listen/think/speak flow
Current skeleton behavior
- on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
- session.start is accepted as an idempotent session acknowledgment and re-sends readiness/state
- mocked.turn.trigger is accepted only when no other mocked turn is already in flight for that session
- a mocked turn emits deterministic transcript.final, response.text.delta, response.completed, and session.state events in protocol-valid order
- input_audio.append updates the ephemeral session record and moves the session to listening
- input_audio.commit resets the minimal buffered state and returns the session to idle
- after a completed placeholder input cycle, the same socket can still send mocked.turn.trigger
- response.cancel is safe to send even when no mocked turn is active
- response.cancel stops any still-pending mocked turn events for the active turn and resets the minimal session state back to idle
- a second mocked-turn trigger during an active mocked turn produces error with code mocked_turn_in_flight
- malformed JSON produces error with code invalid_json
- invalid envelopes or unsupported client event names produce error with code invalid_message
- malformed WebSocket frames are rejected without crashing the gateway process
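The error mapping in the list above can be sketched as a single classification function. This is an illustration under assumptions, not the gateway's code: handleClientMessage, SessionRecord, GatewayError, and the error message strings are hypothetical. Only the three error codes and the client event names come from this document.

```typescript
// Hypothetical names; only the error codes and event names come from the contract.
type GatewayError = { type: "error"; payload: { code: string; message: string } };
type Accepted = { type: "accepted"; eventName: string };
type SessionRecord = { mockedTurnInFlight: boolean };

const CLIENT_EVENT_NAMES: string[] = [
  "session.start",
  "mocked.turn.trigger",
  "input_audio.append",
  "input_audio.commit",
  "response.cancel",
];

function gatewayError(code: string, message: string): GatewayError {
  return { type: "error", payload: { code, message } };
}

function handleClientMessage(raw: string, session: SessionRecord): GatewayError | Accepted {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return gatewayError("invalid_json", "message is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) {
    return gatewayError("invalid_message", "message is not an envelope object");
  }
  const candidate = parsed as { type?: unknown; payload?: unknown };
  if (
    typeof candidate.type !== "string" ||
    typeof candidate.payload !== "object" ||
    candidate.payload === null ||
    CLIENT_EVENT_NAMES.indexOf(candidate.type) === -1
  ) {
    return gatewayError("invalid_message", "invalid envelope or unsupported event name");
  }
  if (candidate.type === "mocked.turn.trigger") {
    // Only one mocked turn may be in flight per session.
    if (session.mockedTurnInFlight) {
      return gatewayError("mocked_turn_in_flight", "a mocked turn is already in flight");
    }
    session.mockedTurnInFlight = true;
  }
  return { type: "accepted", eventName: candidate.type };
}
```

The sketch stops at classification. Emitting session.ready, session.state, and the mocked turn events, and clearing mockedTurnInFlight when a turn completes or is cancelled, are left out.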
UI connection shell behavior
The UI currently exposes a small browser-side connection state machine for the WebSocket transport:
not connected
→ connecting
→ connected
→ disconnected
→ error
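One way to read the diagram above is as a small reducer over transport signals. The signal names ("connect", "open", "close", "error") and the exact transition rules are assumptions made for this sketch. The document lists only the five states.

```typescript
// The five states come from the diagram; the signals and rules are assumed.
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

type TransportSignal = "connect" | "open" | "close" | "error";

function nextConnectionStatus(
  current: ConnectionStatus,
  signal: TransportSignal,
): ConnectionStatus {
  switch (signal) {
    case "connect":
      // Start connecting unless a socket is already opening or open.
      return current === "connecting" || current === "connected" ? current : "connecting";
    case "open":
      // Only a pending connection can become connected.
      return current === "connecting" ? "connected" : current;
    case "close":
      // Closing matters only while a socket is opening or open.
      return current === "connecting" || current === "connected" ? "disconnected" : current;
    case "error":
      return "error";
    default:
      return current;
  }
}
```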
Notes:
- this UI state is transport-oriented and is separate from the shared gateway session.state payload
- session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
- the UI disables the mocked-turn control until session.ready arrives, while disconnected, or while a mocked turn is already in flight
- the UI disables the mic control while disconnected, before session.ready, or while a mocked turn is already in flight
- pressing the mic control sends one placeholder input_audio.append chunk and releasing it sends input_audio.commit
- the UI copy explicitly labels the mic button as a control shell and not real microphone capture
- the UI shows a cancel control and enables it only while a mocked turn is active
- after cancel returns the gateway to idle, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
- the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
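The enable/disable rules in these notes reduce to a pure function of three flags. The names UiSnapshot and deriveControls are hypothetical. The sketch assumes the UI already tracks whether the socket is connected, whether session.ready has arrived, and whether a mocked turn is active.

```typescript
// Hypothetical snapshot of the UI flags the notes refer to.
type UiSnapshot = {
  connected: boolean; // transport status is "connected"
  sessionReady: boolean; // session.ready has arrived on this socket
  mockedTurnActive: boolean; // a mocked turn is currently in flight
};

function deriveControls(ui: UiSnapshot) {
  // Both start-style controls share the same preconditions.
  const canStart = ui.connected && ui.sessionReady && !ui.mockedTurnActive;
  return {
    mockedTurnEnabled: canStart,
    micEnabled: canStart,
    // Cancel is enabled only while a mocked turn is active.
    cancelEnabled: ui.mockedTurnActive,
  };
}
```

Deriving the flags in one place keeps the mocked-turn and mic controls from drifting apart as more conditions are added.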
Server → Client
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };
Server event intent
- session.ready confirms that the gateway created a session identity
- session.state exposes the coarse session phase needed by the later UI shell
- transcript.partial and transcript.final support incremental and completed user text display
- response.text.delta supports streamed assistant text without committing to audio output details yet
- response.completed marks the current assistant turn as done
- error is the minimal recoverable failure shape for both UI and gateway work
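A client could fold these events into view state with a reducer along the following lines. ViewState, initialView, and applyServerEvent are hypothetical names. Replacing the transcript on each transcript event and appending each response.text.delta is an assumed reading of the intent above, not confirmed UI behavior.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical view state for the sketch.
type ViewState = {
  sessionId: string | null;
  phase: Phase;
  transcript: string;
  responseText: string;
  responseDone: boolean;
  lastErrorCode: string | null;
};

const initialView: ViewState = {
  sessionId: null,
  phase: "idle",
  transcript: "",
  responseText: "",
  responseDone: false,
  lastErrorCode: null,
};

function applyServerEvent(view: ViewState, event: ServerEvent): ViewState {
  switch (event.type) {
    case "session.ready":
      return { ...view, sessionId: event.payload.sessionId };
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
    case "transcript.final":
      // Each transcript event replaces the displayed user text.
      return { ...view, transcript: event.payload.text };
    case "response.text.delta":
      // Deltas are appended in arrival order.
      return { ...view, responseText: view.responseText + event.payload.text, responseDone: false };
    case "response.completed":
      return { ...view, responseDone: true };
    case "error":
      // Errors are recorded rather than thrown.
      return { ...view, lastErrorCode: event.payload.code };
    default:
      return view;
  }
}
```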
Deterministic mocked turn sequence
For this increment, mocked.turn.trigger produces one fixed interaction for the active session:
session.state(listening)
→ transcript.final("[mocked user] What is the current mocked vertical slice?")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)
Notes:
- the content is intentionally fixed and obviously mocked
- no audio, STT, LLM, TTS, or external providers participate in this flow
- response.cancel can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to idle
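The fixed sequence above can be written as plain data, which makes the ordering easy to assert in tests. The function name mockedTurnEvents is hypothetical. The event order and text are copied from the sequence in this section.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

type MockedTurnEvent =
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

// Returns a fresh copy of the fixed mocked turn, in emission order.
function mockedTurnEvents(): MockedTurnEvent[] {
  return [
    { type: "session.state", payload: { value: "listening" } },
    {
      type: "transcript.final",
      payload: { text: "[mocked user] What is the current mocked vertical slice?" },
    },
    { type: "session.state", payload: { value: "thinking" } },
    { type: "session.state", payload: { value: "speaking" } },
    { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
    {
      type: "response.text.delta",
      payload: {
        text: "This is a deterministic mocked response from the gateway vertical slice.",
      },
    },
    { type: "response.completed", payload: {} },
    { type: "session.state", payload: { value: "idle" } },
  ];
}
```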
Contract Scope for This Increment
This contract is intentionally limited to the smallest event set needed to unblock:
- the later gateway WebSocket session skeleton
- the later UI voice-session shell
Explicitly deferred for later increments:
- freeform typed user input
- tool-calling events
- streamed TTS/output-audio events
- reconnect/resume semantics
- protocol version negotiation
- provider-specific metadata fields
State Machine
idle
→ listening
→ thinking
→ speaking
→ idle
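The forward cycle above can be encoded as a lookup table. NEXT_PHASE, advance, and isForwardTransition are hypothetical names. The sketch covers only the forward cycle and does not model shortcuts back to idle, such as response.cancel or input_audio.commit.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Forward cycle only: idle → listening → thinking → speaking → idle.
const NEXT_PHASE: Record<Phase, Phase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

// Move one step along the forward cycle.
function advance(phase: Phase): Phase {
  return NEXT_PHASE[phase];
}

// True when `to` is the direct forward successor of `from`.
function isForwardTransition(from: Phase, to: Phase): boolean {
  return NEXT_PHASE[from] === to;
}
```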
Current mocked-pipeline behavior:
- during an active mocked turn, response.cancel returns the session to idle immediately
- any mocked turn timers that have not fired yet are dropped, so no later response.text.delta or response.completed events are emitted for the cancelled turn
- once idle is restored, the same WebSocket session can start another mocked turn without reconnecting
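The cancel behavior above can be sketched without real timers by treating each pending event as one queued step. createMockedTurnRunner and its tick method are hypothetical; tick stands in for one timer firing. Emitting session.state(idle) only when a turn was actually active is an assumption of this sketch.

```typescript
type Emitted = { type: string; payload: object };

// Hypothetical timer-free runner: `tick` stands in for one timer firing.
function createMockedTurnRunner(emit: (event: Emitted) => void) {
  let pending: Emitted[] = [];
  return {
    // Queue a turn; returns false when another turn is still in flight.
    start(events: Emitted[]): boolean {
      if (pending.length > 0) return false;
      pending = events.slice();
      return true;
    },
    // Emit the next pending event, if any.
    tick(): void {
      const next = pending.shift();
      if (next !== undefined) emit(next);
    },
    // Drop everything not yet emitted, then report idle.
    // Safe to call when no turn is active (it does nothing in that case).
    cancel(): void {
      const wasActive = pending.length > 0;
      pending = [];
      if (wasActive) emit({ type: "session.state", payload: { value: "idle" } });
    },
    isActive(): boolean {
      return pending.length > 0;
    },
  };
}
```

Because cancel empties the queue before emitting idle, no later response.text.delta or response.completed can follow a cancelled turn, and start succeeds again on the same runner.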
More general future-state expectations:
response.cancel can occur at:
- listening → restart
- thinking → cancel
- speaking → stop immediately
response.cancel Handling Requirements
- immediate stop of TTS playback
- immediate stop of LLM streaming
- reset session state to listening or idle, depending on UX decision
Mechanism
The response.cancel event cancels:
- TTS process
- current LLM request
- tool execution when possible
This shared contract uses response.cancel consistently for that cancellation signal.
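For the future-state mechanism above, one possible shape is a cancel scope that collects stop callbacks for in-flight work and invokes them all when response.cancel arrives. createCancelScope, register, and cancelAll are hypothetical names, and no real TTS, LLM, or tool APIs are shown. A callback returns false when its work cannot be stopped, which mirrors "tool execution when possible".

```typescript
// A stop callback returns true when the work was actually stopped.
type StopFn = () => boolean;

// Hypothetical cancel scope; not part of the current contract.
function createCancelScope() {
  let entries: { label: string; stop: StopFn }[] = [];
  return {
    // Register in-flight work (for example TTS playback or an LLM request).
    register(label: string, stop: StopFn): void {
      entries.push({ label, stop });
    },
    // Invoke every registered callback once and return the labels that stopped.
    cancelAll(): string[] {
      const current = entries;
      entries = []; // clear first so a repeated cancel is a no-op
      const stopped: string[] = [];
      current.forEach((entry) => {
        if (entry.stop()) stopped.push(entry.label);
      });
      return stopped;
    },
  };
}
```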
Protocol Notes for Implementation
- keep the protocol backward compatible when possible
- prefer additive event changes over breaking renames
- document protocol updates in this file whenever implementation changes behavior
- when implementation diverges from the initial contract, update this document in the same change