Johannes Kresner 28712443cc feat(vela): start mocked response flow after push-to-talk commit

2026-04-08 21:20:17 +02:00

Vela Protocol and State Machine

Event Protocol

The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the gateway and UI import the same event names and envelope shape.

Current gateway baseline:

WebSocket endpoint: /ws
the gateway sends session.ready and session.state immediately after a successful socket upgrade
the gateway accepts JSON text messages only in the shared envelope shape

Current UI baseline:

the browser opens a WebSocket directly to /ws
the UI tracks connection status separately from gateway session status
the UI can send mocked.turn.trigger after session.ready while connected to request one deterministic mocked turn for the active session
the UI exposes a push-to-talk mic control shell that sends placeholder input_audio.append on press and input_audio.commit on release without capturing real audio

WebSocket Message Envelope

Every WebSocket message uses one envelope format:

type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

This increment intentionally keeps the envelope minimal:

type identifies the event
payload carries the event body
no sequence numbers, timestamps, or protocol version fields yet
future changes should be additive when possible
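
As a minimal sketch of the envelope rule above, a receiver could check a decoded value with a type guard before dispatching on type. The name isEnvelope is hypothetical and is not confirmed to exist in @vela/protocol; requiring payload to be a non-null, non-array object is also an assumption.

```typescript
// Envelope shape as documented above.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper (not confirmed to exist in @vela/protocol).
// Assumption: payload must be a non-null, non-array object.
function isEnvelope(value: unknown): value is MessageEnvelope<string, object> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const payload = candidate["payload"];
  return (
    typeof candidate["type"] === "string" &&
    typeof payload === "object" &&
    payload !== null &&
    !Array.isArray(payload)
  );
}
```

Because the guard only looks at type and payload, extra fields added later (for example a sequence number) would still pass, which fits the additive-change goal.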

Client → Server

type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

Client event intent

session.start initializes a voice session without locking in transport or auth details yet
mocked.turn.trigger asks the gateway to run one obviously mocked, deterministic transcript/response turn
input_audio.append carries a chunk of captured input audio as an encoded string
input_audio.commit marks the current buffered user turn as ready for downstream processing
response.cancel interrupts the active listen/think/speak flow
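
The push-to-talk pair can be sketched as two serialized frames, one for press and one for release. The names VelaClientEvent, encodeClientEvent, and pushToTalkFrames are hypothetical, and the placeholder chunk content is an assumption because this document does not specify it.

```typescript
// Mirrors the ClientEvent union above under a local, hypothetical name.
type VelaClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "mocked.turn.trigger"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Serializes one client event into the JSON text frame the gateway accepts.
function encodeClientEvent(event: VelaClientEvent): string {
  return JSON.stringify(event);
}

// Assumption: one placeholder chunk per press; its content is arbitrary here.
function pushToTalkFrames(placeholderChunk: string): string[] {
  return [
    encodeClientEvent({
      type: "input_audio.append",
      payload: { chunk: placeholderChunk },
    }),
    encodeClientEvent({ type: "input_audio.commit", payload: {} }),
  ];
}
```

Each returned string would be passed to the socket's send method in order: the first on press, the second on release.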

Current skeleton behavior

on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
session.start is accepted as an idempotent session acknowledgment and re-sends readiness/state
mocked.turn.trigger is accepted only when no other mocked turn is already in flight for that session
a mocked turn emits deterministic transcript.final, response.text.delta, response.completed, and session.state events in protocol-valid order
input_audio.append updates the ephemeral session record and moves the session to listening
each accepted input_audio.append emits one deterministic transcript.partial for the current placeholder turn
input_audio.commit emits exactly one deterministic transcript.final and then starts the same deterministic mocked assistant response stream used by mocked.turn.trigger
after a completed placeholder input cycle, the same socket can still send mocked.turn.trigger
response.cancel is safe to send even when no mocked turn is active
response.cancel stops any still-pending mocked turn events for the active turn and resets the minimal session state back to idle
a second mocked-turn trigger during an active mocked turn produces an error event with code mocked_turn_in_flight
malformed JSON produces an error event with code invalid_json
invalid envelopes or unsupported client event names produce an error event with code invalid_message
malformed WebSocket frames are rejected without crashing the gateway process
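
The two validation error codes above can be illustrated with a small classifier for incoming text frames. This is a sketch under assumptions, not the gateway's actual code: classifyClientMessage and its result type are hypothetical names, and per-event payload checks (such as chunk being a string) are left out for brevity.

```typescript
type GatewayErrorCode = "invalid_json" | "invalid_message";

type ClassifiedClientMessage =
  | { ok: true; type: string; payload: object }
  | { ok: false; code: GatewayErrorCode };

// Client event names taken from the ClientEvent union in this document.
function isSupportedClientEventName(name: string): boolean {
  switch (name) {
    case "session.start":
    case "mocked.turn.trigger":
    case "input_audio.append":
    case "input_audio.commit":
    case "response.cancel":
      return true;
    default:
      return false;
  }
}

// Hypothetical helper: maps a raw text frame to a parsed event or an error code.
function classifyClientMessage(raw: string): ClassifiedClientMessage {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, code: "invalid_json" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, code: "invalid_message" };
  }
  const { type, payload } = parsed as { type?: unknown; payload?: unknown };
  if (typeof type !== "string" || !isSupportedClientEventName(type)) {
    return { ok: false, code: "invalid_message" };
  }
  if (typeof payload !== "object" || payload === null || Array.isArray(payload)) {
    return { ok: false, code: "invalid_message" };
  }
  return { ok: true, type, payload };
}
```

Returning a result value instead of throwing keeps a bad frame from escaping the handler, in line with the requirement that malformed input must not crash the gateway process.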

UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

not connected
 → connecting
 → connected
 → disconnected
 → error

Notes:

this UI state is transport-oriented and is separate from the shared gateway session.state payload
session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
the UI disables the mocked-turn control while disconnected, before session.ready arrives, or while a mocked turn is already in flight
the UI disables the mic control while disconnected, before session.ready, or while a mocked turn is already in flight
pressing the mic control sends one placeholder input_audio.append chunk and releasing it sends input_audio.commit
while a placeholder push-to-talk turn is in progress, the UI renders the latest transcript.partial
after placeholder commit, the UI renders the transcript.final, clears the partial-only display, and streams the mocked assistant text from the downstream response events
the UI copy explicitly labels the mic button as a control shell and not real microphone capture
the UI shows a cancel control and enables it only while a mocked turn is active
after cancel returns the gateway to idle, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
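
The enable/disable rules in the notes above reduce to a small pure function. The sketch below assumes the UI tracks three facts (transport status, whether session.ready has arrived, and whether a mocked turn is in flight); the names UiConnectionStatus, UiShellSnapshot, and controlAvailability are hypothetical.

```typescript
// Transport-level states listed in the connection state machine above.
type UiConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical snapshot of what the UI shell knows at a given moment.
interface UiShellSnapshot {
  connection: UiConnectionStatus;
  sessionReady: boolean;
  mockedTurnInFlight: boolean;
}

interface ControlAvailability {
  mockedTurnEnabled: boolean;
  micEnabled: boolean;
  cancelEnabled: boolean;
}

function controlAvailability(snapshot: UiShellSnapshot): ControlAvailability {
  // Both turn-starting controls share the same three preconditions.
  const canStartTurn =
    snapshot.connection === "connected" &&
    snapshot.sessionReady &&
    !snapshot.mockedTurnInFlight;
  return {
    mockedTurnEnabled: canStartTurn,
    micEnabled: canStartTurn,
    // Cancel is enabled only while a mocked turn is active.
    cancelEnabled: snapshot.mockedTurnInFlight,
  };
}
```

Deriving the flags from one snapshot, rather than toggling each control imperatively, makes it harder for the mic and mocked-turn controls to drift out of sync.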

Server → Client

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

Server event intent

session.ready confirms that the gateway created a session identity
session.state exposes the coarse session phase needed by the later UI shell
transcript.partial and transcript.final support incremental and completed user text display
response.text.delta supports streamed assistant text without committing to audio output details yet
response.completed marks the current assistant turn as done
error is the minimal recoverable failure shape for both UI and gateway work
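
To show how these events could drive a display, the sketch below folds server events into a small view model. It illustrates the event intent only and is not the UI's actual implementation: TranscriptView, emptyTranscriptView, and applyServerEvent are hypothetical names, and clearing assistant text at the start of a new turn is deliberately not modeled.

```typescript
type VelaServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: "idle" | "listening" | "thinking" | "speaking" } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical view model for the transcript/response panel.
interface TranscriptView {
  phase: "idle" | "listening" | "thinking" | "speaking";
  partialText: string;
  finalText: string;
  assistantText: string;
  lastErrorCode: string | null;
}

const emptyTranscriptView: TranscriptView = {
  phase: "idle",
  partialText: "",
  finalText: "",
  assistantText: "",
  lastErrorCode: null,
};

function applyServerEvent(view: TranscriptView, event: VelaServerEvent): TranscriptView {
  switch (event.type) {
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
      // Each partial replaces the previous one.
      return { ...view, partialText: event.payload.text };
    case "transcript.final":
      // The final transcript supersedes the partial-only display.
      return { ...view, partialText: "", finalText: event.payload.text };
    case "response.text.delta":
      // Deltas are appended in arrival order.
      return { ...view, assistantText: view.assistantText + event.payload.text };
    case "error":
      return { ...view, lastErrorCode: event.payload.code };
    default:
      // session.ready and response.completed do not change this view.
      return view;
  }
}
```

Because the function returns a new object and never mutates its input, it can be used directly as a reducer over a list of received events.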

Deterministic mocked turn sequence

For this increment, mocked.turn.trigger produces one fixed interaction for the active session:

session.state(listening)
→ transcript.final("[mocked user] What is the current mocked vertical slice?")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)

Notes:

the content is intentionally fixed and obviously mocked
no audio, STT, LLM, TTS, or external providers participate in this flow
response.cancel can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to idle
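
The fixed sequence and its cancellation rule can be written down as data plus one pure function. This is a sketch: the gateway is described elsewhere in this document as using timers, whereas the code below models cancellation as truncating the list of pending events. The names mockedTurnSequence and emittedWithCancel are hypothetical, and emitting session.state(idle) on cancel is an assumption based on the session returning to idle.

```typescript
type TurnPhase = "idle" | "listening" | "thinking" | "speaking";

type MockedTurnEvent =
  | { type: "session.state"; payload: { value: TurnPhase } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

function sessionStateEvent(value: TurnPhase): MockedTurnEvent {
  return { type: "session.state", payload: { value } };
}

// The eight events of the deterministic mocked turn, in protocol order.
function mockedTurnSequence(): MockedTurnEvent[] {
  return [
    sessionStateEvent("listening"),
    {
      type: "transcript.final",
      payload: { text: "[mocked user] What is the current mocked vertical slice?" },
    },
    sessionStateEvent("thinking"),
    sessionStateEvent("speaking"),
    { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
    {
      type: "response.text.delta",
      payload: {
        text: "This is a deterministic mocked response from the gateway vertical slice.",
      },
    },
    { type: "response.completed", payload: {} },
    sessionStateEvent("idle"),
  ];
}

// Models response.cancel arriving after `emittedBeforeCancel` events were sent:
// pending events are dropped and the session returns to idle.
function emittedWithCancel(
  sequence: MockedTurnEvent[],
  emittedBeforeCancel: number,
): MockedTurnEvent[] {
  if (emittedBeforeCancel >= sequence.length) {
    // Turn already finished; cancel is a safe no-op in this model.
    return sequence.slice();
  }
  return [...sequence.slice(0, emittedBeforeCancel), sessionStateEvent("idle")];
}
```

For example, cancelling after the first five events yields those five followed by session.state(idle), with no second delta and no response.completed.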

Deterministic placeholder push-to-talk transcript and mocked response sequence

For this increment, the existing mic-control shell still sends placeholder input_audio.append on press and input_audio.commit on release. The gateway now translates that shell flow into deterministic mocked transcript events and then reuses the existing mocked response stream:

input_audio.append #1
→ session.state(listening) when entering the turn
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")

input_audio.append #N (N > 1)
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")

input_audio.commit after N appends
→ transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)

Safe deterministic edge cases for this mocked placeholder flow:

commit without any prior append is accepted and emits transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")
repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
after the final transcript, placeholder commit follows the same mocked thinking → speaking → response.text.delta* → response.completed → idle path as mocked.turn.trigger
response.cancel can interrupt this mocked post-commit response path the same way it interrupts mocked.turn.trigger; already-rendered transcript or assistant text is not retracted
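
The chunk-count-based transcript texts above can be captured in two small functions. The strings are copied from this section; the function names placeholderPartial and placeholderFinal are hypothetical.

```typescript
// Partial transcript for the Nth accepted input_audio.append (appendCount >= 1).
function placeholderPartial(appendCount: number): string {
  return appendCount <= 1
    ? "[mocked partial] Placeholder push-to-talk transcript in progress."
    : `[mocked partial] Placeholder push-to-talk transcript in progress (${appendCount} chunks).`;
}

// Final transcript for input_audio.commit after appendCount appends (0 allowed).
function placeholderFinal(appendCount: number): string {
  return appendCount === 0
    ? "[mocked final] Placeholder push-to-talk transcript completed without appended audio."
    : `[mocked final] Placeholder push-to-talk transcript completed from ${appendCount} appended chunk(s).`;
}
```

Because both texts depend only on the append count, the gateway needs to keep just one counter per placeholder turn to stay deterministic.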

Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

the later gateway WebSocket session skeleton
the later UI voice-session shell

Explicitly deferred for later increments:

freeform typed user input
tool-calling events
streamed TTS/output-audio events
reconnect/resume semantics
protocol version negotiation
provider-specific metadata fields

State Machine

idle
 → listening
 → thinking
 → speaking
 → idle

Current mocked-pipeline behavior:

during an active mocked turn, response.cancel returns the session to idle immediately
any mocked turn timers that have not fired yet are dropped, so no later response.text.delta or response.completed events are emitted for the cancelled turn
the same cancellation behavior applies when a mocked turn was started by input_audio.commit
once idle is restored, the same WebSocket session can start another mocked turn without reconnecting
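
The phase cycle and the current cancel rule can be expressed as a lookup table. This is a minimal sketch with hypothetical names (VelaPhase, advancePhase, cancelPhase); it covers only the mocked-pipeline behavior described above, not the future-state expectations below.

```typescript
type VelaPhase = "idle" | "listening" | "thinking" | "speaking";

// Forward transitions of the state machine: each phase has exactly one successor.
const NEXT_PHASE: Record<VelaPhase, VelaPhase> = {
  idle: "listening",
  listening: "thinking",
  thinking: "speaking",
  speaking: "idle",
};

function advancePhase(phase: VelaPhase): VelaPhase {
  return NEXT_PHASE[phase];
}

// Current mocked-pipeline rule: response.cancel always lands in idle.
function cancelPhase(_phase: VelaPhase): VelaPhase {
  return "idle";
}
```

Typing the table as Record<VelaPhase, VelaPhase> means the compiler reports an error if a phase is added to the union without a successor.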

More general future-state expectations:

response.cancel can arrive during any active phase:

listening → restart
thinking → cancel
speaking → stop immediately

response.cancel Handling Requirements

immediate stop of TTS playback
immediate stop of LLM streaming
reset session state to listening or idle, depending on UX decision

Mechanism

The response.cancel event cancels:

TTS process
current LLM request
tool execution when possible

This shared contract uses response.cancel consistently for that cancellation signal.
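
One way to fan a single response.cancel out to several subsystems is a small cancellation scope that each subsystem registers a cleanup with. This is purely illustrative for the future-state mechanism: CancelScope is a hypothetical class, no TTS, LLM, or tool integration exists in this increment, and swallowing cleanup errors is an assumption that mirrors "tool execution when possible".

```typescript
// Hypothetical helper: collects cleanups and runs them once on cancel.
class CancelScope {
  private cleanups: Array<() => void> = [];
  private cancelled = false;

  // Registers a cleanup; runs it immediately if the scope is already cancelled.
  onCancel(cleanup: () => void): void {
    if (this.cancelled) {
      this.runSafely(cleanup);
    } else {
      this.cleanups.push(cleanup);
    }
  }

  // Idempotent: a second cancel does nothing.
  cancel(): void {
    if (this.cancelled) return;
    this.cancelled = true;
    const pending = this.cleanups;
    this.cleanups = [];
    pending.forEach((cleanup) => this.runSafely(cleanup));
  }

  // Best effort: one failing cleanup must not block the others.
  private runSafely(cleanup: () => void): void {
    try {
      cleanup();
    } catch {
      // Assumption: failures are ignored here; real code would likely log them.
    }
  }
}
```

A gateway could create one scope per turn, register the TTS, LLM, and tool cleanups as they start, and call cancel when response.cancel arrives.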

Protocol Notes for Implementation

keep the protocol backward compatible when possible
prefer additive event changes over breaking renames
update this document in the same change whenever implementation changes protocol behavior or diverges from the initial contract
