# Vela Protocol and State Machine

## Event Protocol

The shared code-level contract lives in the Yarn workspace package `@vela/protocol` so that both the
gateway and the UI import the same event names and envelope shape.

Current gateway baseline:

- WebSocket endpoint: `/ws`
- the gateway sends `session.ready` and `session.state` immediately after a successful socket upgrade
- the gateway accepts only JSON text messages in the shared envelope shape

Current UI baseline:

- the browser opens a WebSocket directly to `/ws`
- the UI tracks connection status separately from gateway session status
- the UI can send `mocked.turn.trigger` after `session.ready` while connected to request one deterministic mocked turn for the active session
- the UI exposes a push-to-talk mic control shell that sends placeholder `input_audio.append` on press and `input_audio.commit` on release without capturing real audio

## WebSocket Message Envelope

Every WebSocket message uses one envelope format:

```ts
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};
```

This increment intentionally keeps the envelope minimal:

- `type` identifies the event
- `payload` carries the event body
- no sequence numbers, timestamps, or protocol version fields yet
- future changes should be additive when possible

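A receiver can check this shape before trusting a message. The guard below is a minimal sketch, not the `@vela/protocol` implementation: `isEnvelope` is a hypothetical helper name, and it assumes the `payload` key must be present even when the payload is an empty object.

```ts
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper (not part of the documented contract): narrows an
// unknown value to the minimal envelope shape.
function isEnvelope(value: unknown): value is MessageEnvelope<string, unknown> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  // Assumption: `payload` must be present, even if it is an empty object.
  return typeof candidate.type === "string" && "payload" in candidate;
}

// Usage: check a decoded frame before dispatching on `type`.
const incoming: unknown = JSON.parse('{"type":"session.start","payload":{}}');
const accepted = isEnvelope(incoming);
```

Because the guard checks only the two envelope fields, additive fields introduced later would not cause older receivers to reject a message.
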
### Client → Server

```ts
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };
```

#### Client event intent

- `session.start` initializes a voice session without locking in transport or auth details yet
- `mocked.turn.trigger` asks the gateway to run one obviously mocked, deterministic transcript/response turn
- `input_audio.append` carries a chunk of captured input audio as an encoded string
- `input_audio.commit` marks the current buffered user turn as ready for downstream processing
- `response.cancel` interrupts the active listen/think/speak flow

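As an illustration of how a client might emit these events, the sketch below serializes them through a generic text sender. `SendText`, `sendClientEvent`, `pressMic`, and `releaseMic` are hypothetical names, and the `"placeholder"` chunk value is an assumption; the contract only requires `chunk` to be a string.

```ts
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Hypothetical sender type: anything that accepts a text frame, such as a
// bound WebSocket `send`.
type SendText = (frame: string) => void;

function sendClientEvent(send: SendText, event: ClientEvent): void {
  send(JSON.stringify(event));
}

// Press sends one placeholder chunk (assumed value, no real audio).
function pressMic(send: SendText): void {
  sendClientEvent(send, { type: "input_audio.append", payload: { chunk: "placeholder" } });
}

// Release commits the buffered placeholder turn.
function releaseMic(send: SendText): void {
  sendClientEvent(send, { type: "input_audio.commit", payload: {} });
}

// Usage with an in-memory sink instead of a real socket.
const sentFrames: string[] = [];
pressMic((frame) => sentFrames.push(frame));
releaseMic((frame) => sentFrames.push(frame));
```

Typing the sender as a plain function keeps the helpers testable without a browser WebSocket.
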
### Current skeleton behavior

- on connect, the gateway creates an ephemeral in-memory session and emits `session.ready` plus `session.state`
- `session.start` is accepted as an idempotent session acknowledgment and re-sends readiness/state
- `mocked.turn.trigger` is accepted only when no other mocked turn is already in flight for that session
- a mocked turn emits deterministic `transcript.final`, `response.text.delta`, `response.completed`, and `session.state` events in protocol-valid order
- `input_audio.append` updates the ephemeral session record and moves the session to `listening`
- each accepted `input_audio.append` emits one deterministic `transcript.partial` for the current placeholder turn
- `input_audio.commit` emits exactly one deterministic `transcript.final` and then starts the same deterministic mocked assistant response stream used by `mocked.turn.trigger`
- after a completed placeholder input cycle, the same socket can still send `mocked.turn.trigger`
- `response.cancel` is safe to send even when no mocked turn is active
- `response.cancel` stops any still-pending mocked turn events for the active turn and resets the minimal session state back to `idle`
- a second mocked-turn trigger during an active mocked turn produces `error` with code `mocked_turn_in_flight`
- malformed JSON produces `error` with code `invalid_json`
- invalid envelopes or unsupported client event names produce `error` with code `invalid_message`
- malformed WebSocket frames are rejected without crashing the gateway process

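The three error codes above can be pictured as two small checks. This is a sketch under stated assumptions rather than the gateway's actual code: `classifyFrame`, `tryStartMockedTurn`, and the `Session` shape are hypothetical, and per-event payload validation is left out.

```ts
type ClientEventName =
  | "session.start"
  | "mocked.turn.trigger"
  | "input_audio.append"
  | "input_audio.commit"
  | "response.cancel";

const CLIENT_EVENT_NAMES: readonly string[] = [
  "session.start",
  "mocked.turn.trigger",
  "input_audio.append",
  "input_audio.commit",
  "response.cancel",
];

type Classified =
  | { ok: true; type: ClientEventName; payload: unknown }
  | { ok: false; code: "invalid_json" | "invalid_message" };

// Hypothetical helper: maps a raw text frame to an event or an error code.
function classifyFrame(raw: string): Classified {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, code: "invalid_json" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, code: "invalid_message" };
  }
  const { type, payload } = parsed as { type?: unknown; payload?: unknown };
  // Unknown event names and missing payloads share one error code.
  if (typeof type !== "string" || CLIENT_EVENT_NAMES.indexOf(type) === -1 || payload === undefined) {
    return { ok: false, code: "invalid_message" };
  }
  return { ok: true, type: type as ClientEventName, payload };
}

// Hypothetical session record: only the in-flight flag matters here.
type Session = { mockedTurnInFlight: boolean };

function tryStartMockedTurn(
  session: Session,
): { ok: true } | { ok: false; code: "mocked_turn_in_flight" } {
  if (session.mockedTurnInFlight) return { ok: false, code: "mocked_turn_in_flight" };
  session.mockedTurnInFlight = true;
  return { ok: true };
}
```

Returning a result object instead of throwing mirrors the requirement that bad input produces an `error` event without crashing the gateway process.
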
### UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

```text
not connected
  → connecting
  → connected
  → disconnected
  → error
```

Notes:

- this UI state is transport-oriented and is separate from the shared gateway `session.state` payload
- `session.state` currently reflects the gateway session phase (`idle`, `listening`, `thinking`, `speaking`)
- the UI disables the mocked-turn control until `session.ready` arrives, while disconnected, or while a mocked turn is already in flight
- the UI disables the mic control while disconnected, before `session.ready`, or while a mocked turn is already in flight
- pressing the mic control sends one placeholder `input_audio.append` chunk and releasing it sends `input_audio.commit`
- while a placeholder push-to-talk turn is in progress, the UI renders the latest `transcript.partial`
- after placeholder commit, the UI renders the `transcript.final`, clears the partial-only display, and streams the mocked assistant text from the downstream response events
- the UI copy explicitly labels the mic button as a control shell and not real microphone capture
- the UI shows a cancel control and enables it only while a mocked turn is active
- after cancel returns the gateway to `idle`, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
- the UI treats malformed server messages, browser WebSocket errors, and gateway `error` events as safe error states instead of throwing

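The enable/disable rules in these notes reduce to a pure function of three inputs. The sketch below is one way to express them; `UiState`, `Controls`, `deriveControls`, and the field names are hypothetical and not taken from the UI code.

```ts
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical UI state: transport status is tracked separately from the
// gateway session flags, as the notes above require.
type UiState = {
  connection: ConnectionStatus;
  sessionReady: boolean; // true once `session.ready` has arrived
  turnInFlight: boolean; // true while a mocked turn is active
};

type Controls = { mockedTurn: boolean; mic: boolean; cancel: boolean };

function deriveControls(state: UiState): Controls {
  // Both start controls share the same preconditions.
  const canStartTurn =
    state.connection === "connected" && state.sessionReady && !state.turnInFlight;
  return {
    mockedTurn: canStartTurn,
    mic: canStartTurn,
    // Cancel is enabled only while a mocked turn is active.
    cancel: state.turnInFlight,
  };
}

// Usage: a connected, ready, idle session enables both start controls.
const idleControls = deriveControls({
  connection: "connected",
  sessionReady: true,
  turnInFlight: false,
});
```
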
### Server → Client

```ts
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };
```

#### Server event intent

- `session.ready` confirms that the gateway created a session identity
- `session.state` exposes the coarse session phase needed by the later UI shell
- `transcript.partial` and `transcript.final` support incremental and completed user text display
- `response.text.delta` supports streamed assistant text without committing to audio output details yet
- `response.completed` marks the current assistant turn as done
- `error` is the minimal recoverable failure shape for both UI and gateway work

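A consumer of these events can fold them into display state with a reducer. The following is an illustrative sketch, not the UI's implementation: `ViewState`, `initialView`, and `applyServerEvent` are hypothetical names, and clearing the previous response text when a new `transcript.final` arrives is an assumption.

```ts
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: "idle" | "listening" | "thinking" | "speaking" } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical display state for the voice-session shell.
type ViewState = {
  sessionId: string | null;
  phase: "idle" | "listening" | "thinking" | "speaking";
  partialTranscript: string;
  finalTranscript: string;
  responseText: string;
  lastError: string | null;
};

const initialView: ViewState = {
  sessionId: null,
  phase: "idle",
  partialTranscript: "",
  finalTranscript: "",
  responseText: "",
  lastError: null,
};

function applyServerEvent(view: ViewState, event: ServerEvent): ViewState {
  switch (event.type) {
    case "session.ready":
      return { ...view, sessionId: event.payload.sessionId };
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
      return { ...view, partialTranscript: event.payload.text };
    case "transcript.final":
      // Assumption: a new final transcript starts a fresh response display.
      return { ...view, partialTranscript: "", finalTranscript: event.payload.text, responseText: "" };
    case "response.text.delta":
      // Deltas are appended in arrival order.
      return { ...view, responseText: view.responseText + event.payload.text };
    case "response.completed":
      return view;
    case "error":
      return { ...view, lastError: `${event.payload.code}: ${event.payload.message}` };
  }
}
```

Because the reducer never throws on a known event, gateway `error` events land in `lastError` as ordinary state.
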
### Deterministic mocked turn sequence

For this increment, `mocked.turn.trigger` produces one fixed interaction for the active session:

```text
session.state(listening)
  → transcript.final("[mocked user] What is the current mocked vertical slice?")
  → session.state(thinking)
  → session.state(speaking)
  → response.text.delta("[mocked assistant] ")
  → response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
  → response.completed
  → session.state(idle)
```

Notes:

- the content is intentionally fixed and obviously mocked
- no audio, STT, LLM, TTS, or external providers participate in this flow
- `response.cancel` can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to `idle`

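One way to picture the fixed sequence and its cancellation is a step list driven by a scheduler. This is a sketch only: `startMockedTurn` and the injected `schedule` callback are hypothetical, and the gateway presumably uses real timers where this example accepts any deferral function.

```ts
type Phase = "idle" | "listening" | "thinking" | "speaking";

type MockedStep =
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

const MOCKED_TURN_STEPS: readonly MockedStep[] = [
  { type: "session.state", payload: { value: "listening" } },
  { type: "transcript.final", payload: { text: "[mocked user] What is the current mocked vertical slice?" } },
  { type: "session.state", payload: { value: "thinking" } },
  { type: "session.state", payload: { value: "speaking" } },
  { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
  { type: "response.text.delta", payload: { text: "This is a deterministic mocked response from the gateway vertical slice." } },
  { type: "response.completed", payload: {} },
  { type: "session.state", payload: { value: "idle" } },
];

// Hypothetical runner: emits one step per scheduled tick and returns a
// cancel function that suppresses every step not yet emitted.
function startMockedTurn(
  emit: (event: MockedStep) => void,
  schedule: (task: () => void) => void,
): () => void {
  let cancelled = false;
  let index = 0;
  const runNext = (): void => {
    const step = MOCKED_TURN_STEPS[index];
    if (cancelled || step === undefined) return;
    emit(step);
    index += 1;
    schedule(runNext);
  };
  schedule(runNext);
  return () => {
    // Cancelling a finished or already-cancelled turn is a safe no-op.
    if (cancelled || index >= MOCKED_TURN_STEPS.length) return;
    cancelled = true;
    emit({ type: "session.state", payload: { value: "idle" } });
  };
}
```

Passing `(task) => setTimeout(task, 50)` as `schedule` would space the events out; injecting the scheduler keeps the ordering logic testable without waiting on timers.
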
### Deterministic placeholder push-to-talk transcript and mocked response sequence

For this increment, the existing mic-control shell still sends placeholder `input_audio.append` on press and `input_audio.commit` on release. The gateway now translates that shell flow into deterministic mocked transcript events and then reuses the existing mocked response stream:

```text
input_audio.append #1
  → session.state(listening) when entering the turn
  → transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")

input_audio.append #N (N > 1)
  → transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")

input_audio.commit after N appends
  → transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
  → session.state(thinking)
  → session.state(speaking)
  → response.text.delta("[mocked assistant] ")
  → response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
  → response.completed
  → session.state(idle)
```

Safe deterministic edge cases for this mocked placeholder flow:

- commit without any prior append is accepted and emits `transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")`
- repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
- after the final transcript, placeholder commit follows the same mocked `thinking → speaking → response.text.delta* → response.completed → idle` path as `mocked.turn.trigger`
- `response.cancel` can interrupt this mocked post-commit response path the same way it interrupts `mocked.turn.trigger`; already-rendered transcript or assistant text is not retracted

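The chunk-count-based transcript text above can be written as two small pure functions. The strings are copied from the sequence in this section; the function names `partialTranscript` and `finalTranscript` are hypothetical, and this is a sketch rather than the gateway's code.

```ts
// Text for the `transcript.partial` emitted after the Nth append (N >= 1).
function partialTranscript(chunkCount: number): string {
  return chunkCount <= 1
    ? "[mocked partial] Placeholder push-to-talk transcript in progress."
    : `[mocked partial] Placeholder push-to-talk transcript in progress (${chunkCount} chunks).`;
}

// Text for the `transcript.final` emitted on commit after N appends (N >= 0).
function finalTranscript(chunkCount: number): string {
  return chunkCount === 0
    ? "[mocked final] Placeholder push-to-talk transcript completed without appended audio."
    : `[mocked final] Placeholder push-to-talk transcript completed from ${chunkCount} appended chunk(s).`;
}

// Usage: three appends followed by a commit.
const latestPartial = partialTranscript(3);
const committed = finalTranscript(3);
```

Deriving the text only from the chunk count is what keeps repeated runs byte-for-byte identical.
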
## Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

- the later gateway WebSocket session skeleton
- the later UI voice-session shell

Explicitly deferred for later increments:

- freeform typed user input
- tool-calling events
- streamed TTS/output-audio events
- reconnect/resume semantics
- protocol version negotiation
- provider-specific metadata fields

## State Machine

```text
idle
  → listening
  → thinking
  → speaking
  → idle
```

Current mocked-pipeline behavior:

- during an active mocked turn, `response.cancel` returns the session to `idle` immediately
- any mocked turn timers that have not fired yet are dropped, so no later `response.text.delta` or `response.completed` events are emitted for the cancelled turn
- the same cancellation behavior applies when a mocked turn was started by `input_audio.commit`
- once `idle` is restored, the same WebSocket session can start another mocked turn without reconnecting

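The cycle and the current cancel rule fit in a small transition table. This sketch covers only the mocked-pipeline behavior described above; `NEXT_PHASE`, `advance`, and `cancelTurn` are hypothetical names.

```ts
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Forward transitions of the session cycle.
const NEXT_PHASE: Record<Phase, Phase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

function advance(phase: Phase): Phase {
  return NEXT_PHASE[phase];
}

// Current mocked behavior: `response.cancel` returns to `idle` from any phase.
function cancelTurn(_phase: Phase): Phase {
  return "idle";
}

// Usage: a full turn walks the cycle back to `idle`.
const afterFullTurn = advance(advance(advance(advance("idle"))));
```
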
More general future-state expectations:

`response.cancel` can occur during:

- `listening` → restart
- `thinking` → cancel
- `speaking` → stop immediately

## `response.cancel` Handling Requirements

- immediate stop of TTS playback
- immediate stop of LLM streaming
- reset session state to `listening` or `idle`, depending on the UX decision

### Mechanism

The `response.cancel` event cancels:

- TTS process
- current LLM request
- tool execution when possible

This shared contract uses `response.cancel` consistently for that cancellation signal.

## Protocol Notes for Implementation

- keep the protocol backward compatible when possible
- prefer additive event changes over breaking renames
- document protocol updates in this file whenever implementation changes behavior
- when implementation diverges from the initial contract, update this document in the same change