Johannes Kresner 4b11703c93 feat(vela-ui): add voice session shell
Add a minimal UI shell that connects to the gateway WebSocket and exposes developer-visible session state. Align the architecture, protocol, setup, integration, and backlog docs with the current UI increment.
2026-04-08 18:40:45 +02:00


# Vela Protocol and State Machine
## Event Protocol
The shared code-level contract lives in the Yarn workspace package `@vela/protocol` so both the
gateway and UI import the same event names and envelope shape.
Current gateway baseline:
- WebSocket endpoint: `/ws`
- the gateway sends `session.ready` and `session.state` immediately after a successful socket upgrade
- the gateway accepts JSON text messages only in the shared envelope shape
Current UI baseline:
- the browser opens a WebSocket directly to `/ws`
- the UI tracks connection status separately from gateway session status
- the UI currently consumes server events but does not send `session.start` or any audio events yet
## WebSocket Message Envelope
Every WebSocket message uses one envelope format:
```ts
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};
```
This increment intentionally keeps the envelope minimal:
- `type` identifies the event
- `payload` carries the event body
- no sequence numbers, timestamps, or protocol version fields yet
- future changes should be additive when possible
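Because the envelope is this minimal, validating it at runtime is cheap. A sketch of such a guard, assuming the gateway only needs to confirm the envelope shape before dispatching on `type` (the name `isMessageEnvelope` is illustrative, not part of `@vela/protocol`):

```typescript
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Illustrative runtime guard: checks only the minimal envelope shape,
// not specific event names or payload fields.
function isMessageEnvelope(
  value: unknown
): value is MessageEnvelope<string, unknown> {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { type?: unknown }).type === "string" &&
    "payload" in value
  );
}
```

Keeping the guard shape-only means additive payload changes never break it, matching the additive-evolution goal above.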
### Client → Server
```ts
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };
```
#### Client event intent
- `session.start` initializes a voice session without locking in transport or auth details yet
- `input_audio.append` carries a chunk of captured input audio as an encoded string
- `input_audio.commit` marks the current buffered user turn as ready for downstream processing
- `response.cancel` interrupts the active listen/think/speak flow
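On the wire these are plain JSON text frames. A minimal sketch of producing one, assuming a base64 string for the audio chunk (the contract only says "encoded string"; base64 and the `encodeClientEvent` helper are illustrative assumptions):

```typescript
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Serialize a client event into the wire format the gateway accepts:
// a JSON text message in the shared envelope shape.
function encodeClientEvent(event: ClientEvent): string {
  return JSON.stringify(event);
}

// Example frame carrying one captured audio chunk. "AAAA" stands in for
// base64-encoded audio bytes; the actual encoding is not pinned down yet.
const appendFrame = encodeClientEvent({
  type: "input_audio.append",
  payload: { chunk: "AAAA" },
});
```

In the browser this string would be passed to `WebSocket.send` on the `/ws` connection.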
### Current skeleton behavior
- on connect, the gateway creates an ephemeral in-memory session and emits `session.ready` plus `session.state`
- `session.start` is accepted as an idempotent session acknowledgment and re-sends readiness/state
- `input_audio.append` updates the ephemeral session record and moves the session to `listening`
- `input_audio.commit` resets the minimal buffered state and returns the session to `idle`
- `response.cancel` resets the minimal session state back to `idle`
- malformed JSON produces `error` with code `invalid_json`
- invalid envelopes or unsupported client event names produce `error` with code `invalid_message`
- malformed WebSocket frames are rejected without crashing the gateway process
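The dispatch above can be sketched as a single message handler. This is a hypothetical reduction of the behavior list, not the gateway's actual source; `handleRawMessage` and its return shape are illustrative:

```typescript
type SessionPhase = "idle" | "listening" | "thinking" | "speaking";

type GatewayError = {
  type: "error";
  payload: { code: string; message: string };
};

// Hypothetical sketch of the skeleton dispatch: parse, validate, mutate
// the ephemeral session, and surface the two documented error codes.
function handleRawMessage(
  raw: string,
  session: { phase: SessionPhase }
): GatewayError | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { type: "error", payload: { code: "invalid_json", message: "malformed JSON" } };
  }
  const env: { type?: unknown } =
    typeof parsed === "object" && parsed !== null ? (parsed as { type?: unknown }) : {};
  switch (env.type) {
    case "session.start":
      return null; // idempotent ack; readiness/state re-sent elsewhere
    case "input_audio.append":
      session.phase = "listening";
      return null;
    case "input_audio.commit":
    case "response.cancel":
      session.phase = "idle"; // both reset the minimal session state
      return null;
    default:
      return { type: "error", payload: { code: "invalid_message", message: "unsupported event" } };
  }
}
```

Returning an error value instead of throwing mirrors the requirement that bad input never crashes the gateway process.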
### UI connection shell behavior
The UI currently exposes a small browser-side connection state machine for the WebSocket transport:
```text
not connected
→ connecting
→ connected
→ disconnected
→ error
```
Notes:
- this UI state is transport-oriented and is separate from the shared gateway `session.state` payload
- `session.state` currently reflects the gateway session phase (`idle`, `listening`, `thinking`, `speaking`)
- the UI treats malformed server messages, browser WebSocket errors, and gateway `error` events as safe error states instead of throwing
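The transport state machine above can be sketched as a small reducer. The event names (`open_requested`, `opened`, `closed`, `failed`) are illustrative labels for the browser WebSocket lifecycle, not part of the shared protocol:

```typescript
type ConnectionState =
  | "not_connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Illustrative transport events mirroring the browser WebSocket
// lifecycle; this layer is separate from the gateway `session.state`.
type TransportEvent = "open_requested" | "opened" | "closed" | "failed";

// Each transport event maps directly to a target state, so errors and
// disconnects land in safe states instead of throwing.
function nextConnectionState(
  _state: ConnectionState,
  event: TransportEvent
): ConnectionState {
  switch (event) {
    case "open_requested":
      return "connecting";
    case "opened":
      return "connected";
    case "closed":
      return "disconnected";
    case "failed":
      return "error";
  }
}
```

The current state parameter is unused here but kept in the signature so later increments can make transitions state-dependent without changing callers.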
### Server → Client
```ts
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };
```
#### Server event intent
- `session.ready` confirms that the gateway created a session identity
- `session.state` exposes the coarse session phase needed by the later UI shell
- `transcript.partial` and `transcript.final` support incremental and completed user text display
- `response.text.delta` supports streamed assistant text without committing to audio output details yet
- `response.completed` marks the current assistant turn as done
- `error` is the minimal recoverable failure shape for both UI and gateway work
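A UI consuming these events mostly folds them into view state. A hypothetical view-model reducer over a subset of the server events, assuming partial transcripts are replaced while response deltas accumulate (the `ViewModel` shape and `applyServerEvent` are illustrative, not the UI shell's actual code):

```typescript
type UiServerEvent =
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

type ViewModel = {
  phase: string;
  transcript: string;
  reply: string;
  turnDone: boolean;
};

// Hypothetical reducer: each server event updates one slice of UI state.
function applyServerEvent(view: ViewModel, event: UiServerEvent): void {
  switch (event.type) {
    case "session.state":
      view.phase = event.payload.value;
      break;
    case "transcript.partial":
    case "transcript.final":
      view.transcript = event.payload.text; // final replaces the partial
      break;
    case "response.text.delta":
      view.reply += event.payload.text; // streamed deltas accumulate
      break;
    case "response.completed":
      view.turnDone = true;
      break;
  }
}
```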
## Contract Scope for This Increment
This contract is intentionally limited to the smallest event set needed to unblock:
- the later gateway WebSocket session skeleton
- the later UI voice-session shell
Explicitly deferred for later increments:
- tool-calling events
- streamed TTS/output-audio events
- reconnect/resume semantics
- protocol version negotiation
- provider-specific metadata fields
## State Machine
```text
idle
→ listening
→ thinking
→ speaking
→ idle
```
`response.cancel` can arrive in any active phase:
- during `listening` → restart the turn
- during `thinking` → cancel the pending response
- during `speaking` → stop output immediately
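The diagram and cancel edges together define the legal phase transitions. A sketch of that transition table, assuming cancellation always lands back in `idle` (the `canTransition` helper is illustrative):

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Forward edges from the diagram, plus the `response.cancel` edges back
// to idle. A sketch of the contract, not gateway source.
const transitions: Record<Phase, Phase[]> = {
  idle: ["listening"],
  listening: ["thinking", "idle"], // idle via response.cancel (restart)
  thinking: ["speaking", "idle"], // idle via response.cancel
  speaking: ["idle"], // normal completion or immediate stop
};

function canTransition(from: Phase, to: Phase): boolean {
  return transitions[from].includes(to);
}
```

A table like this lets both sides reject nonsensical phase jumps (e.g. `idle` straight to `speaking`) instead of silently accepting them.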
## `response.cancel` Handling Requirements
- immediate stop of TTS playback
- immediate stop of LLM streaming
- reset the session state to `listening` or `idle`, depending on the UX decision
### Mechanism
The `response.cancel` event cancels:
- TTS process
- current LLM request
- tool execution when possible
This shared contract uses `response.cancel` consistently for that cancellation signal.
## Protocol Notes for Implementation
- keep the protocol backward compatible when possible
- prefer additive event changes over breaking renames
- whenever implementation changes behavior or diverges from this contract, update this document in the same change