Johannes Kresner 8e14eaeed0 feat(vela): retire legacy mocked turn trigger

2026-04-08 21:50:18 +02:00

Vela Protocol and State Machine

Event Protocol

The shared code-level contract lives in the Yarn workspace package @vela/protocol so both the gateway and UI import the same event names and envelope shape.

Current gateway baseline:

WebSocket endpoint: /ws
the gateway sends session.ready and session.state immediately after a successful socket upgrade
the gateway accepts JSON text messages only in the shared envelope shape

Current UI baseline:

the browser opens a WebSocket directly to /ws
the UI tracks connection status separately from gateway session status
the UI exposes a push-to-talk mic control shell that sends a placeholder input_audio.append on press and input_audio.commit on release; it does not capture real audio
the push-to-talk shell is the only supported mocked turn entry path from the shipped UI

WebSocket Message Envelope

Every WebSocket message uses one envelope format:

type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

This increment intentionally keeps the envelope minimal:

type identifies the event
payload carries the event body
no sequence numbers, timestamps, or protocol version fields yet
future changes should be additive when possible
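
As a minimal sketch of how the envelope might be built and checked at runtime: makeEnvelope and isEnvelope are hypothetical helper names, not exports confirmed for @vela/protocol, and tolerating extra fields is an assumption made here to match the "additive when possible" goal.

```typescript
// Envelope shape copied from the contract above.
type MessageEnvelope<TType extends string, TPayload> = {
  type: TType;
  payload: TPayload;
};

// Hypothetical helper: build an envelope with a literal event type.
function makeEnvelope<TType extends string, TPayload>(
  type: TType,
  payload: TPayload,
): MessageEnvelope<TType, TPayload> {
  return { type, payload };
}

// Hypothetical guard: checks only the two fields the contract defines.
// Extra fields are tolerated (assumption) so additive changes stay harmless.
function isEnvelope(value: unknown): value is MessageEnvelope<string, unknown> {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate["type"] === "string" && "payload" in candidate;
}

// Round trip through the JSON text form that travels over the socket.
const wire = JSON.stringify(makeEnvelope("input_audio.append", { chunk: "AAAA" }));
const decoded: unknown = JSON.parse(wire);
```

The guard deliberately says nothing about the payload shape; per-event payload checks would sit on top of it.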

Client → Server

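type ClientEvent =
  | { type: "session.start"; payload: {} }
  // Retired legacy event name; the gateway rejects it with unsupported_mocked_turn_trigger.
  | { type: "mocked.turn.trigger"; payload: {} }
  // chunk carries captured input audio as an encoded string.
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };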

Client event intent

session.start initializes a voice session without locking in transport or auth details yet
mocked.turn.trigger is a retired legacy event name that the gateway now rejects with a deterministic recoverable error
input_audio.append carries a chunk of captured input audio as an encoded string
input_audio.commit marks the current buffered user turn as ready for downstream processing
response.cancel interrupts the active listen/think/speak flow
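
The supported client events could be sent from a push-to-talk control roughly as follows. This is a sketch only: createPushToTalk, SendText, and PLACEHOLDER_CHUNK are hypothetical names, and the placeholder chunk value is invented for illustration because this document does not fix the chunk encoding.

```typescript
// Supported client events copied from the contract (retired event omitted).
type ClientEvent =
  | { type: "session.start"; payload: {} }
  | { type: "input_audio.append"; payload: { chunk: string } }
  | { type: "input_audio.commit"; payload: {} }
  | { type: "response.cancel"; payload: {} };

// Hypothetical transport hook: anything that can send one JSON text frame.
type SendText = (text: string) => void;

// Hypothetical placeholder value; no real audio is captured.
const PLACEHOLDER_CHUNK = "placeholder";

function createPushToTalk(send: SendText) {
  const sendEvent = (event: ClientEvent): void => send(JSON.stringify(event));
  return {
    // Press: one placeholder append.
    press: (): void =>
      sendEvent({ type: "input_audio.append", payload: { chunk: PLACEHOLDER_CHUNK } }),
    // Release: commit the buffered turn.
    release: (): void => sendEvent({ type: "input_audio.commit", payload: {} }),
    // Cancel: interrupt the active flow.
    cancel: (): void => sendEvent({ type: "response.cancel", payload: {} }),
  };
}
```

In a browser, send would typically wrap WebSocket.send on the /ws connection.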

Current skeleton behavior

on connect, the gateway creates an ephemeral in-memory session and emits session.ready plus session.state
session.start is accepted as an idempotent session acknowledgment; the gateway re-sends session.ready and session.state
mocked.turn.trigger is rejected deterministically with error.code = unsupported_mocked_turn_trigger
input_audio.append updates the ephemeral session record and moves the session to listening
each accepted input_audio.append emits one deterministic transcript.partial for the current placeholder turn
input_audio.commit emits exactly one deterministic transcript.final and then starts the deterministic mocked assistant response stream for that push-to-talk turn
after a completed placeholder input cycle, the same socket can start another placeholder push-to-talk turn without reconnecting
response.cancel is safe to send even when no mocked turn is active
response.cancel stops any still-pending mocked turn events for the active turn and resets the minimal session state back to idle
malformed JSON produces error with code invalid_json
invalid envelopes or unsupported client event names (other than the retired mocked.turn.trigger, which has its own error code) produce error with code invalid_message
malformed WebSocket frames are rejected without crashing the gateway process
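
The error-code rules in this list can be sketched as a single classification step. classifyClientMessage and the Classified type are hypothetical names, and the message strings for invalid_json and invalid_message are invented here. Only the three error codes and the retired-trigger message come from this document. Per-event payload validation is left out.

```typescript
type ErrorPayload = { code: string; message: string };

type Classified =
  | { ok: true; type: string; payload: unknown }
  | { ok: false; error: ErrorPayload };

// Event names the gateway currently acts on.
const SUPPORTED_EVENTS = new Set<string>([
  "session.start",
  "input_audio.append",
  "input_audio.commit",
  "response.cancel",
]);

function reject(code: string, message: string): Classified {
  return { ok: false, error: { code, message } };
}

function classifyClientMessage(raw: string): Classified {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Hypothetical message text.
    return reject("invalid_json", "Message is not valid JSON.");
  }
  if (typeof parsed !== "object" || parsed === null) {
    return reject("invalid_message", "Message is not an envelope.");
  }
  const candidate = parsed as Record<string, unknown>;
  const type = candidate["type"];
  if (typeof type !== "string" || !("payload" in candidate)) {
    return reject("invalid_message", "Message is not an envelope.");
  }
  if (type === "mocked.turn.trigger") {
    // Message text as documented for the retired event.
    return reject(
      "unsupported_mocked_turn_trigger",
      "mocked.turn.trigger is no longer supported; use input_audio.append and input_audio.commit instead.",
    );
  }
  if (!SUPPORTED_EVENTS.has(type)) {
    return reject("invalid_message", "Unsupported client event name.");
  }
  return { ok: true, type, payload: candidate["payload"] };
}
```

Returning a value instead of throwing keeps a bad frame from taking down the socket handler, which matches the requirement that malformed input never crashes the gateway.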

UI connection shell behavior

The UI currently exposes a small browser-side connection state machine for the WebSocket transport:

not connected
 → connecting
 → connected
 → disconnected
 → error

Notes:

this UI state is transport-oriented and is separate from the shared gateway session.state payload
session.state currently reflects the gateway session phase (idle, listening, thinking, speaking)
the UI disables the mic control while disconnected, before session.ready, or while a mocked turn is already in flight
pressing the mic control sends one placeholder input_audio.append chunk and releasing it sends input_audio.commit
while a placeholder push-to-talk turn is in progress, the UI renders the latest transcript.partial
after placeholder commit, the UI renders the transcript.final, clears the partial-only display, and streams the mocked assistant text from the downstream response events
the UI copy explicitly labels the mic button as a control shell and not real microphone capture
the UI shows a cancel control and enables it only while a mocked turn is active
after cancel returns the gateway to idle, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
the UI treats malformed server messages, browser WebSocket errors, and gateway error events as safe error states instead of throwing
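
The enable/disable rules for the mic and cancel controls can be expressed as two small predicates. This is a sketch under assumptions: UiShellState and both function names are hypothetical, and treating "connected" as the only transport state in which the mic is enabled is an interpretation of the notes above.

```typescript
// Transport states from the UI connection state machine above.
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical view-model fields for the control shell.
type UiShellState = {
  connection: ConnectionStatus;
  sessionReady: boolean; // true once session.ready has arrived
  turnInFlight: boolean; // true while a mocked turn is active
};

// Mic is usable only on a ready, connected session with no turn in flight.
function isMicEnabled(state: UiShellState): boolean {
  return state.connection === "connected" && state.sessionReady && !state.turnInFlight;
}

// Cancel is enabled only while a mocked turn is active.
function isCancelEnabled(state: UiShellState): boolean {
  return state.turnInFlight;
}
```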

Server → Client

type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | {
      type: "session.state";
      payload: { value: "idle" | "listening" | "thinking" | "speaking" };
    }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | {
      type: "error";
      payload: { code: string; message: string; retryable?: boolean };
    };

Server event intent

session.ready confirms that the gateway created a session identity
session.state exposes the coarse session phase needed by the later UI shell
transcript.partial and transcript.final support incremental and completed user text display
response.text.delta supports streamed assistant text without committing to audio output details yet
response.completed marks the current assistant turn as done
error is the minimal recoverable failure shape for both UI and gateway work
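
One way a client might fold these server events into display state is a pure reducer. TurnView and applyServerEvent are hypothetical names, and the sketch models a single turn: when assistant text is reset between turns is not specified in this document, so it is not handled here.

```typescript
// Server events copied from the contract above.
type ServerEvent =
  | { type: "session.ready"; payload: { sessionId: string } }
  | { type: "session.state"; payload: { value: "idle" | "listening" | "thinking" | "speaking" } }
  | { type: "transcript.partial"; payload: { text: string } }
  | { type: "transcript.final"; payload: { text: string } }
  | { type: "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} }
  | { type: "error"; payload: { code: string; message: string; retryable?: boolean } };

// Hypothetical view state for one turn.
type TurnView = {
  phase: "idle" | "listening" | "thinking" | "speaking";
  partial: string;
  finalTranscript: string;
  assistantText: string;
  lastErrorCode: string | null;
};

function applyServerEvent(view: TurnView, event: ServerEvent): TurnView {
  switch (event.type) {
    case "session.ready":
    case "response.completed":
      return view; // nothing to render in this sketch
    case "session.state":
      return { ...view, phase: event.payload.value };
    case "transcript.partial":
      return { ...view, partial: event.payload.text }; // latest partial wins
    case "transcript.final":
      return { ...view, partial: "", finalTranscript: event.payload.text };
    case "response.text.delta":
      return { ...view, assistantText: view.assistantText + event.payload.text };
    case "error":
      return { ...view, lastErrorCode: event.payload.code }; // safe state, no throw
  }
}
```

Because already-rendered text is only ever appended to or replaced by newer events, a cancel that returns the session to idle leaves the transcript and response text in place.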

Legacy mocked turn trigger rejection

For this increment, direct mocked.turn.trigger requests no longer start a mocked turn:

mocked.turn.trigger
→ error(code="unsupported_mocked_turn_trigger", message="mocked.turn.trigger is no longer supported; use input_audio.append and input_audio.commit instead.")

Notes:

this rejection is deterministic and recoverable
the session remains available for the supported push-to-talk flow on the same socket

Deterministic placeholder push-to-talk transcript and mocked response sequence

For this increment, the existing mic-control shell still sends placeholder input_audio.append on press and input_audio.commit on release. The gateway now translates that shell flow into deterministic mocked transcript events and then reuses the existing mocked response stream:

input_audio.append #1
→ session.state(listening) when entering the turn
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")

input_audio.append #N (N > 1)
→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")

input_audio.commit after N appends
→ transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
→ session.state(thinking)
→ session.state(speaking)
→ response.text.delta("[mocked assistant] ")
→ response.text.delta("This is a deterministic mocked response from the gateway vertical slice.")
→ response.completed
→ session.state(idle)

Safe deterministic edge cases for this mocked placeholder flow:

commit without any prior append is accepted and emits transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")
repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
after the final transcript, placeholder commit follows the deterministic mocked thinking → speaking → response.text.delta* → response.completed → idle path
response.cancel can interrupt this mocked post-commit response path; already-rendered transcript or assistant text is not retracted
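
The deterministic sequence above can be reproduced by pure functions of the append count. The function names below are hypothetical and the gateway's real implementation may be structured differently. The transcript and response strings are copied from the sequence listing.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// Subset of the server events used by the mocked flow.
type MockedEvent =
  | { type: "session.state"; payload: { value: Phase } }
  | { type: "transcript.partial" | "transcript.final" | "response.text.delta"; payload: { text: string } }
  | { type: "response.completed"; payload: {} };

function partialFor(appendCount: number): string {
  return appendCount <= 1
    ? "[mocked partial] Placeholder push-to-talk transcript in progress."
    : `[mocked partial] Placeholder push-to-talk transcript in progress (${appendCount} chunks).`;
}

function finalFor(appendCount: number): string {
  return appendCount === 0
    ? "[mocked final] Placeholder push-to-talk transcript completed without appended audio."
    : `[mocked final] Placeholder push-to-talk transcript completed from ${appendCount} appended chunk(s).`;
}

// appendCount includes the append being handled (1 for the first append).
function eventsForAppend(appendCount: number): MockedEvent[] {
  const events: MockedEvent[] = [];
  if (appendCount === 1) {
    events.push({ type: "session.state", payload: { value: "listening" } });
  }
  events.push({ type: "transcript.partial", payload: { text: partialFor(appendCount) } });
  return events;
}

// appendCount is the number of appends received before the commit (may be 0).
function eventsForCommit(appendCount: number): MockedEvent[] {
  return [
    { type: "transcript.final", payload: { text: finalFor(appendCount) } },
    { type: "session.state", payload: { value: "thinking" } },
    { type: "session.state", payload: { value: "speaking" } },
    { type: "response.text.delta", payload: { text: "[mocked assistant] " } },
    {
      type: "response.text.delta",
      payload: { text: "This is a deterministic mocked response from the gateway vertical slice." },
    },
    { type: "response.completed", payload: {} },
    { type: "session.state", payload: { value: "idle" } },
  ];
}
```

Keeping the sequence a pure function of the append count makes it straightforward to assert in tests, independent of when each event is actually emitted.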

Contract Scope for This Increment

This contract is intentionally limited to the smallest event set needed to unblock:

the later gateway WebSocket session skeleton
the later UI voice-session shell

Explicitly deferred for later increments:

freeform typed user input
tool-calling events
streamed TTS/output-audio events
reconnect/resume semantics
protocol version negotiation
provider-specific metadata fields

State Machine

idle
 → listening
 → thinking
 → speaking
 → idle
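
As an illustration of the cycle above, a transition function might look like the sketch below. The step names "speak" and "complete" are hypothetical internal triggers, not protocol events, and ignoring append or commit during thinking or speaking is an assumption, since this document does not define that case.

```typescript
type Phase = "idle" | "listening" | "thinking" | "speaking";

// "append", "commit", "cancel" mirror client events; "speak" and "complete"
// are hypothetical internal steps of the mocked pipeline.
type Step = "append" | "commit" | "speak" | "complete" | "cancel";

function nextPhase(phase: Phase, step: Step): Phase {
  switch (step) {
    case "cancel":
      return "idle"; // safe from any phase, including idle
    case "append":
      return phase === "idle" || phase === "listening" ? "listening" : phase;
    case "commit":
      // Commit without a prior append is accepted, so idle is a valid source.
      return phase === "idle" || phase === "listening" ? "thinking" : phase;
    case "speak":
      return phase === "thinking" ? "speaking" : phase;
    case "complete":
      return phase === "speaking" ? "idle" : phase;
  }
}
```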

Current mocked-pipeline behavior:

during an active mocked turn, response.cancel returns the session to idle immediately
any mocked turn timers that have not fired yet are dropped, so no later response.text.delta or response.completed events are emitted for the cancelled turn
the same cancellation behavior applies when a mocked turn was started by input_audio.commit
once idle is restored, the same WebSocket session can start another placeholder push-to-talk turn without reconnecting
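
The timer-dropping behavior can be sketched with an injected scheduler so the example runs without real clocks. Scheduler, ManualScheduler, and MockedTurn are all hypothetical names, the 10 ms step spacing is invented, and emitting session.state(idle) on cancel is an assumption about how the reset is signalled.

```typescript
// Hypothetical timer abstraction; a real gateway might wrap setTimeout instead.
interface Scheduler {
  schedule(fn: () => void, delayMs: number): number;
  cancel(handle: number): void;
}

// Manual scheduler: time only moves when advance() is called.
class ManualScheduler implements Scheduler {
  private now = 0;
  private nextHandle = 1;
  private pending = new Map<number, { at: number; fn: () => void }>();

  schedule(fn: () => void, delayMs: number): number {
    const handle = this.nextHandle++;
    this.pending.set(handle, { at: this.now + delayMs, fn });
    return handle;
  }

  cancel(handle: number): void {
    this.pending.delete(handle);
  }

  advance(ms: number): void {
    this.now += ms;
    const due = Array.from(this.pending.entries())
      .filter(([, timer]) => timer.at <= this.now)
      .sort((a, b) => a[1].at - b[1].at);
    for (const [handle, timer] of due) {
      // Skip timers cancelled by an earlier callback in this advance.
      if (this.pending.delete(handle)) timer.fn();
    }
  }
}

const MOCKED_STEPS = [
  "session.state(thinking)",
  "session.state(speaking)",
  "response.text.delta",
  "response.text.delta",
  "response.completed",
  "session.state(idle)",
];

class MockedTurn {
  private handles: number[] = [];
  private active = false;
  private scheduler: Scheduler;
  private emit: (label: string) => void;

  constructor(scheduler: Scheduler, emit: (label: string) => void) {
    this.scheduler = scheduler;
    this.emit = emit;
  }

  start(): void {
    this.active = true;
    this.handles = [];
    MOCKED_STEPS.forEach((label, index) => {
      const isLast = index === MOCKED_STEPS.length - 1;
      const handle = this.scheduler.schedule(() => {
        if (isLast) this.active = false;
        this.emit(label);
      }, (index + 1) * 10); // hypothetical spacing
      this.handles.push(handle);
    });
  }

  cancel(): void {
    if (!this.active) return; // safe when no mocked turn is active
    for (const handle of this.handles) this.scheduler.cancel(handle);
    this.handles = [];
    this.active = false;
    this.emit("session.state(idle)"); // assumption: idle is announced on cancel
  }
}
```

Cancelling by handle rather than checking a flag inside each callback means a cancelled turn leaves nothing queued, so a new turn on the same socket starts from a clean slate.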

More general future-state expectations:

response.cancel can arrive in any active phase:

listening → restart
thinking → cancel
speaking → stop immediately

response.cancel Handling Requirements

immediate stop of TTS playback
immediate stop of LLM streaming
reset session state to listening or idle, depending on UX decision

Mechanism

The response.cancel event cancels:

TTS process
current LLM request
tool execution when possible

This shared contract uses response.cancel consistently for that cancellation signal.
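
For the future-state mechanism, one possible shape is a registry of in-flight work that a single response.cancel fans out to. This is a sketch, not a planned design: Cancellable and ActiveWork are hypothetical names, and the TTS, LLM, and tool handles are stand-ins for components this document does not define yet.

```typescript
// Hypothetical handle for anything that can be stopped: a TTS process,
// an LLM request, or a tool execution.
interface Cancellable {
  cancel(): void;
}

class ActiveWork {
  private items = new Set<Cancellable>();

  // Track a unit of work; the returned function untracks it on normal completion.
  track(item: Cancellable): () => void {
    this.items.add(item);
    return () => {
      this.items.delete(item);
    };
  }

  // Cancel everything currently tracked and report how many cancels succeeded.
  cancelAll(): number {
    const snapshot = Array.from(this.items);
    this.items.clear();
    let cancelled = 0;
    for (const item of snapshot) {
      try {
        item.cancel();
        cancelled += 1;
      } catch {
        // Best effort: some work (for example a tool call) may not be cancellable.
      }
    }
    return cancelled;
  }
}
```

Clearing the set before cancelling keeps a second response.cancel from touching the same handles twice.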

Protocol Notes for Implementation

keep the protocol backward compatible when possible
prefer additive event changes over breaking renames
document protocol updates in this file whenever implementation changes behavior
when implementation diverges from the initial contract, update this document in the same change
