Vela Protocol and State Machine
Event Protocol
The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the
gateway and UI import the same event names and envelope shape.
Current gateway baseline:
- WebSocket endpoint: /ws
- the gateway sends session.ready and session.state immediately after a successful socket upgrade
- the gateway accepts JSON text messages only in the shared envelope shape
Current UI baseline:
- the browser opens a WebSocket directly to /ws
- the UI tracks connection status separately from gateway session status
- the UI can send mocked.turn.trigger after session.ready while connected to request one deterministic mocked turn for the active session
- the UI exposes a push-to-talk mic control shell that sends placeholder input_audio.append on press and input_audio.commit on release without capturing real audio
WebSocket Message Envelope
Every WebSocket message uses one envelope format:
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};
This increment intentionally keeps the envelope minimal:
- type identifies the event
- payload carries the event body
- no sequence numbers, timestamps, or protocol version fields yet
- future changes should be additive when possible
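As an illustration of how small the envelope is, the following sketch checks and serializes it. The helper names isEnvelope and encodeEnvelope are hypothetical and are not confirmed exports of @vela/protocol. The guard deliberately ignores extra top-level fields, which is one way to stay tolerant of additive changes.

```typescript
// Envelope shape as documented above.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper: checks only the envelope shape, not the event name.
// Unknown extra fields are ignored so additive changes do not break readers.
function isEnvelope(value: unknown): value is MessageEnvelope<string, object> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.type === "string" &&
    typeof candidate.payload === "object" &&
    candidate.payload !== null
  );
}

// Hypothetical helper: wraps a payload in the envelope and serializes it
// as a JSON text message.
function encodeEnvelope<TType extends string, TPayload>(
  type: TType,
  payload: TPayload,
): string {
  const envelope: MessageEnvelope<TType, TPayload> = { type, payload };
  return JSON.stringify(envelope);
}
```

For example, encodeEnvelope("session.start", {}) yields a text frame that isEnvelope accepts once parsed, while an object without a payload field is rejected.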
Client → Server
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };
Client event intent
- session.start initializes a voice session without locking in transport or auth details yet
- mocked.turn.trigger asks the gateway to run one obviously mocked, deterministic transcript/response turn
- input_audio.append carries a chunk of captured input audio as an encoded string
- input_audio.commit marks the current buffered user turn as ready for downstream processing
- response.cancel interrupts the active listen/think/speak flow
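The client events above combine into a push-to-talk turn as zero or more appends followed by one commit. A minimal sketch follows. The function name pushToTalkTurn is hypothetical, and the chunk strings are treated as opaque because this document does not specify the audio encoding.

```typescript
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Hypothetical helper: builds the client events for one push-to-talk turn.
// Each chunk is an already-encoded string; the encoding itself is left open.
function pushToTalkTurn(chunks: string[]): ClientEvent[] {
  const turn: ClientEvent[] = chunks.map(
    (chunk): ClientEvent => ({ type: "input_audio.append", payload: { chunk } }),
  );
  // The commit marks the buffered user turn as ready for processing.
  turn.push({ type: "input_audio.commit", payload: {} });
  return turn;
}
```

Calling pushToTalkTurn with an empty array still produces a single commit, which matches the gateway accepting a commit without prior appends.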
Current skeleton behavior
- on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
- session.start is accepted as an idempotent session acknowledgment and re-sends readiness/state
- mocked.turn.trigger is accepted only when no other mocked turn is already in flight for that session
- a mocked turn emits deterministic transcript.final, response.text.delta, response.completed, and session.state events in protocol-valid order
- input_audio.append updates the ephemeral session record and moves the session to listening
- each accepted input_audio.append emits one deterministic transcript.partial for the current placeholder turn
- input_audio.commit emits exactly one deterministic transcript.final and then starts the same deterministic mocked assistant response stream used by mocked.turn.trigger
- after a completed placeholder input cycle, the same socket can still send mocked.turn.trigger
- response.cancel is safe to send even when no mocked turn is active
- response.cancel stops any still-pending mocked turn events for the active turn and resets the minimal session state back to idle
- a second mocked-turn trigger during an active mocked turn produces error with code mocked_turn_in_flight
- malformed JSON produces error with code invalid_json
- invalid envelopes or unsupported client event names produce error with code invalid_message
- malformed WebSocket frames are rejected without crashing the gateway process
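The error handling described in the list above could be sketched as a pure function from one raw text message to the events sent back immediately. This is not the gateway's actual code: GatewaySession, handleRawMessage, and the error message strings are assumptions, and the audio and cancel paths are left out.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "error"; payload: { code: string; message: string } };

// Hypothetical in-memory session record; the field names are assumptions.
type GatewaySession = { id: string; phase: Phase; mockedTurnInFlight: boolean };

function errorEvent(code: string, message: string): ServerEvent {
  return { type: "error", payload: { code, message } };
}

// Returns the events to send back immediately for one raw text message.
function handleRawMessage(session: GatewaySession, raw: string): ServerEvent[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [errorEvent("invalid_json", "Message is not valid JSON.")];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return [errorEvent("invalid_message", "Message is not an envelope.")];
  }
  const envelope = parsed as { type?: unknown; payload?: unknown };
  if (
    typeof envelope.type !== "string" ||
    typeof envelope.payload !== "object" ||
    envelope.payload === null
  ) {
    return [errorEvent("invalid_message", "Envelope needs type and payload.")];
  }
  switch (envelope.type) {
    case "session.start":
      // Idempotent acknowledgment: re-send readiness and current state.
      return [
        { type: "session.ready", payload: { sessionId: session.id } },
        { type: "session.state", payload: { value: session.phase } },
      ];
    case "mocked.turn.trigger":
      if (session.mockedTurnInFlight) {
        return [errorEvent("mocked_turn_in_flight", "A mocked turn is already running.")];
      }
      session.mockedTurnInFlight = true; // the turn events are emitted elsewhere
      return [];
    case "input_audio.append":
    case "input_audio.commit":
    case "response.cancel":
      return []; // accepted; handling omitted from this sketch
    default:
      return [errorEvent("invalid_message", "Unsupported client event name.")];
  }
}
```

Keeping the function free of socket access makes the three error codes easy to test without a running gateway.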
UI connection shell behavior
The UI currently exposes a small browser-side connection state machine for the WebSocket transport:
not connected
→ connecting
→ connected
→ disconnected
→ error
Notes:
- this UI state is transport-oriented and is separate from the shared gateway session.state payload
- session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
- the UI disables the mocked-turn control until session.ready arrives, while disconnected, or while a mocked turn is already in flight
- the UI disables the mic control while disconnected, before session.ready, or while a mocked turn is already in flight
- pressing the mic control sends one placeholder input_audio.append chunk and releasing it sends input_audio.commit
- while a placeholder push-to-talk turn is in progress, the UI renders the latest transcript.partial
- after placeholder commit, the UI renders the transcript.final, clears the partial-only display, and streams the mocked assistant text from the downstream response events
- the UI copy explicitly labels the mic button as a control shell and not real microphone capture
- the UI shows a cancel control and enables it only while a mocked turn is active
- after cancel returns the gateway to idle, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
- the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
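The control rules in the notes above reduce to a small pure function. The sketch below assumes a hypothetical UiShellState shape; the real UI may track these flags differently.

```typescript
// Transport-level status, using the labels from the diagram above.
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical UI state; the field names are assumptions for illustration.
type UiShellState = {
  connection: ConnectionStatus;
  sessionReady: boolean;
  mockedTurnInFlight: boolean;
};

type ControlAvailability = { mockedTurn: boolean; mic: boolean; cancel: boolean };

// Derives which controls are enabled from the current UI state.
function controlAvailability(shell: UiShellState): ControlAvailability {
  // Both start controls need a live socket, a ready session, and no turn running.
  const canStartTurn =
    shell.connection === "connected" &&
    shell.sessionReady &&
    !shell.mockedTurnInFlight;
  return {
    mockedTurn: canStartTurn,
    mic: canStartTurn,
    // Cancel is enabled only while a mocked turn is active.
    cancel: shell.mockedTurnInFlight,
  };
}
```

Deriving the flags from state, rather than toggling them in event handlers, keeps the three controls consistent with each other.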
Server → Client
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };
Server event intent
- session.ready confirms that the gateway created a session identity
- session.state exposes the coarse session phase needed by the later UI shell
- transcript.partial and transcript.final support incremental and completed user text display
- response.text.delta supports streamed assistant text without committing to audio output details yet
- response.completed marks the current assistant turn as done
- error is the minimal recoverable failure shape for both UI and gateway work
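One way a client might fold these server events into display state is a reducer over the discriminated union. The TurnView shape and the applyServerEvent name are hypothetical, and resetting text between turns is intentionally left out of this sketch.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical view model for one turn; the field names are assumptions.
type TurnView = {
  sessionId: string | null;
  phase: Phase;
  partialTranscript: string;
  finalTranscript: string;
  responseText: string;
  responseDone: boolean;
  lastErrorCode: string | null;
};

const initialView: TurnView = {
  sessionId: null,
  phase: "idle",
  partialTranscript: "",
  finalTranscript: "",
  responseText: "",
  responseDone: false,
  lastErrorCode: null,
};

// Pure reducer: returns a new view and never throws on an error event.
function applyServerEvent(view: TurnView, event: ServerEvent): TurnView {
  switch (event.type) {
    case "session.ready":
      return { ...view, sessionId: event.payload.sessionId };
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
      // Each partial replaces the previous one.
      return { ...view, partialTranscript: event.payload.text };
    case "transcript.final":
      // The final transcript clears the partial-only display.
      return { ...view, partialTranscript: "", finalTranscript: event.payload.text };
    case "response.text.delta":
      return { ...view, responseText: view.responseText + event.payload.text };
    case "response.completed":
      return { ...view, responseDone: true };
    case "error":
      return { ...view, lastErrorCode: event.payload.code };
    default:
      return view;
  }
}
```

Because the reducer is pure, a recorded event list can be replayed with Array.prototype.reduce to reproduce the rendered state.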
Deterministic mocked turn sequence
For this increment, mocked.turn.trigger produces one fixed interaction for the active session:
session.state(listening)
→ transcript.final("[mocked user] What is the current mocked vertical slice?")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)
Notes:
- the content is intentionally fixed and obviously mocked
- no audio, STT, LLM, TTS, or external providers participate in this flow
- response.cancel can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to idle
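The fixed sequence above can be written down as data, which makes the ordering easy to assert in tests. This is a sketch: the names mockedTurnEvents, MOCKED_USER_TEXT, and MOCKED_RESPONSE_CHUNKS are hypothetical, while the event order and text are copied from the sequence above.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

type TurnEvent =
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

const MOCKED_USER_TEXT = "[mocked user] What is the current mocked vertical slice?";
const MOCKED_RESPONSE_CHUNKS = [
  "[mocked assistant] ",
  "This is a deterministic mocked response from the gateway vertical slice.",
];

// Builds the full ordered event list for one mocked turn.
function mockedTurnEvents(): TurnEvent[] {
  const turnEvents: TurnEvent[] = [
    { type: "session.state", payload: { value: "listening" } },
    { type: "transcript.final", payload: { text: MOCKED_USER_TEXT } },
    { type: "session.state", payload: { value: "thinking" } },
    { type: "session.state", payload: { value: "speaking" } },
  ];
  // One delta per fixed response chunk, in order.
  MOCKED_RESPONSE_CHUNKS.forEach((text) => {
    turnEvents.push({ type: "response.text.delta", payload: { text } });
  });
  turnEvents.push({ type: "response.completed", payload: {} });
  turnEvents.push({ type: "session.state", payload: { value: "idle" } });
  return turnEvents;
}
```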
Deterministic placeholder push-to-talk transcript and mocked response sequence
For this increment, the existing mic-control shell still sends placeholder input_audio.append on press and input_audio.commit on release. The gateway now translates that shell flow into deterministic mocked transcript events and then reuses the existing mocked response stream:
input_audio.append #1
→ session.state(listening) when entering the turn
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")
input_audio.append #N (N > 1)
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")
input_audio.commit after N appends
→ transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)
Safe deterministic edge cases for this mocked placeholder flow:
- commit without any prior append is accepted and emits transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")
- repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
- after the final transcript, placeholder commit follows the same mocked thinking → speaking → response.text.delta* → response.completed → idle path as mocked.turn.trigger
- response.cancel can interrupt this mocked post-commit response path the same way it interrupts mocked.turn.trigger; already-rendered transcript or assistant text is not retracted
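The chunk-count-based transcript strings above can be produced by two small functions. The function names are hypothetical, and the strings are copied from the sequence and edge cases listed in this section.

```typescript
// Text for transcript.partial after the given number of appends (1 or more).
function mockedPartialText(appendCount: number): string {
  const base = "[mocked partial] Placeholder push-to-talk transcript in progress";
  // The first append has no chunk count; later appends include it.
  return appendCount > 1 ? `${base} (${appendCount} chunks).` : `${base}.`;
}

// Text for transcript.final on commit, including the commit-without-append case.
function mockedFinalText(appendCount: number): string {
  const base = "[mocked final] Placeholder push-to-talk transcript completed";
  return appendCount > 0
    ? `${base} from ${appendCount} appended chunk(s).`
    : `${base} without appended audio.`;
}
```

For instance, three appends followed by a commit produce a partial ending in "(3 chunks)." and a final ending in "from 3 appended chunk(s).".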
Contract Scope for This Increment
This contract is intentionally limited to the smallest event set needed to unblock:
- the later gateway WebSocket session skeleton
- the later UI voice-session shell
Explicitly deferred for later increments:
- freeform typed user input
- tool-calling events
- streamed TTS/output-audio events
- reconnect/resume semantics
- protocol version negotiation
- provider-specific metadata fields
State Machine
idle
→ listening
→ thinking
→ speaking
→ idle
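The cycle above can be encoded as a lookup table. The sketch below also treats a reset to idle from any non-idle phase as allowed, which reflects the cancel behavior described in this document. The names NEXT_PHASE and isAllowedTransition are hypothetical.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Forward edges of the cycle: idle → listening → thinking → speaking → idle.
const NEXT_PHASE: Record<Phase, Phase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

// Accepts the forward edge, plus a reset to idle from any non-idle phase
// (the cancel path).
function isAllowedTransition(from: Phase, to: Phase): boolean {
  const isForward = NEXT_PHASE[from] === to;
  const isCancelReset = to === "idle" && from !== "idle";
  return isForward || isCancelReset;
}
```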
Current mocked-pipeline behavior:
- during an active mocked turn, response.cancel returns the session to idle immediately
- any mocked turn timers that have not fired yet are dropped, so no later response.text.delta or response.completed events are emitted for the cancelled turn
- the same cancellation behavior applies when a mocked turn was started by input_audio.commit
- once idle is restored, the same WebSocket session can start another mocked turn without reconnecting
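The cancellation behavior above can be modeled without real timers. The sketch below replaces timers with an explicit pending queue, where each emitNext call stands in for one timer firing. The class name MockedTurnRunner is hypothetical, and emitting nothing when cancel arrives with no active turn is an assumption; the document only says that case is safe.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

type TurnEvent =
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

class MockedTurnRunner {
  phase: Phase = "idle";
  private pending: TurnEvent[] = [];

  // Queues the response part of a mocked turn.
  // Returns false when a turn is already in flight.
  start(): boolean {
    if (this.pending.length > 0) return false;
    this.pending = [
      { type: "session.state", payload: { value: "thinking" } },
      { type: "session.state", payload: { value: "speaking" } },
      { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
      { type: "response.completed", payload: {} },
      { type: "session.state", payload: { value: "idle" } },
    ];
    return true;
  }

  // Stands in for one timer firing: emits the next pending event, if any.
  emitNext(): TurnEvent | undefined {
    const next = this.pending.shift();
    if (next !== undefined && next.type === "session.state") {
      this.phase = next.payload.value;
    }
    return next;
  }

  // Drops everything still pending and returns the session to idle.
  cancel(): TurnEvent[] {
    if (this.pending.length === 0 && this.phase === "idle") {
      return []; // assumption: nothing is emitted when no turn is active
    }
    this.pending = [];
    this.phase = "idle";
    return [{ type: "session.state", payload: { value: "idle" } }];
  }
}
```

After cancel, emitNext returns undefined, so no later delta or completed event can leak from the cancelled turn, and start succeeds again on the same runner.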
More general future-state expectations:
response.cancel can occur at:
- listening → restart
- thinking → cancel
- speaking → stop immediately
response.cancel Handling Requirements
- immediate stop of TTS playback
- immediate stop of LLM streaming
- reset session state to listening or idle, depending on UX decision
Mechanism
The response.cancel event cancels:
- TTS process
- current LLM request
- tool execution when possible
This shared contract uses response.cancel consistently for that cancellation signal.
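As a rough illustration of these future-state expectations, the per-phase handling could be planned by a pure function. Everything here is hypothetical: the step names, the planCancel function, and the reading of "restart" as discarding buffered input. The reset target is a parameter because the choice between listening and idle is still an open UX decision.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Hypothetical step names; none of these exist in the current gateway.
type CancelStep =
  | "discard_buffered_input"
  | "abort_llm_request"
  | "abort_tool_execution"
  | "stop_tts_playback";

type CancelPlan = { steps: CancelStep[]; nextPhase: Phase };

// Plans what response.cancel should stop, based on the current phase.
function planCancel(phase: Phase, resetTo: "listening" | "idle"): CancelPlan {
  switch (phase) {
    case "listening":
      // "restart": drop what was buffered so far.
      return { steps: ["discard_buffered_input"], nextPhase: resetTo };
    case "thinking":
      // Stop the LLM request and any tool execution when possible.
      return { steps: ["abort_llm_request", "abort_tool_execution"], nextPhase: resetTo };
    case "speaking":
      // Stop playback immediately, and stop the LLM stream feeding it.
      return { steps: ["stop_tts_playback", "abort_llm_request"], nextPhase: resetTo };
    default:
      // Nothing is active while idle, so cancel is a no-op.
      return { steps: [], nextPhase: "idle" };
  }
}
```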
Protocol Notes for Implementation
- keep the protocol backward compatible when possible
- prefer additive event changes over breaking renames
- document protocol updates in this file whenever implementation changes behavior
- when implementation diverges from the initial contract, update this document in the same change