feat(vela): mock push-to-talk transcript updates

2026-04-08 20:13:36 +02:00
parent 103bb11954
commit 98bcc543f5
8 changed files with 179 additions and 6 deletions
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -36,7 +36,7 @@ The repository now includes separate runnable workspaces for the UI and gateway
 - PWA enabled
 - WebSocket client

-The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can open and close a browser WebSocket connection to the gateway `/ws` endpoint, show explicit connection status (`not connected`, `connecting`, `connected`, `disconnected`, `error`), expose mic control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events, trigger one deterministic mocked turn while connected, and render the mocked user transcript plus mocked assistant response for the active session. This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
+The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can open and close a browser WebSocket connection to the gateway `/ws` endpoint, show explicit connection status (`not connected`, `connecting`, `connected`, `disconnected`, `error`), expose mic control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events, trigger one deterministic mocked turn while connected, render deterministic placeholder partial/final transcripts for the push-to-talk shell, and render the mocked user transcript plus mocked assistant response for the existing mocked-turn path. This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.

 #### Responsibilities

@@ -105,6 +105,7 @@ The current implementation is a minimal Fastify service with `/`, `/health`, and
 - WebSocket upgrades on `/ws` create an ephemeral session immediately
 - the gateway sends `session.ready` followed by `session.state` (`idle`) when the socket is established
 - valid minimal client events, including placeholder `input_audio.append` / `input_audio.commit`, can move the session between `idle` and `listening`
+- placeholder `input_audio.append` emits deterministic mocked `transcript.partial` events and `input_audio.commit` emits one deterministic mocked `transcript.final`
 - `mocked.turn.trigger` drives a fixed transcript/response event sequence over the existing shared protocol
 - only one mocked turn is allowed in flight per session at a time
 - invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
@@ -115,12 +116,13 @@ The current implementation is a minimal Fastify service with `/`, `/health`, and
 - exposes connect, disconnect, mic-control shell interactions, and mocked-turn controls
 - does not request microphone permission or capture real microphone audio
 - only emits placeholder `input_audio.append` / `input_audio.commit` events; it does not send real audio data or play back audio
+- renders the latest placeholder partial transcript during a push-to-talk shell turn and replaces it with the final deterministic transcript on commit
 - reads mocked transcript and mocked response events from the shared protocol contract

 ## Voice Pipeline

 ```text
-Mic control shell / mocked turn button → Placeholder `input_audio.append` / `input_audio.commit` or mocked session flow → Transcript events → Response text events → UI
+Mic control shell / mocked turn button → Placeholder `input_audio.append` / `input_audio.commit` or mocked session flow → Deterministic transcript events → Mocked response text events when using `mocked.turn.trigger` → UI
 ```

 This mocked vertical slice intentionally stands in for the future real pipeline:
--- a/docs/backlog.md
+++ b/docs/backlog.md
@@ -38,7 +38,7 @@ Prove the end-to-end interaction model with mocked or stubbed providers.
 - [x] create a minimal UI with mic control
 - [x] create a gateway WebSocket session skeleton
 - [x] implement a mocked transcript/response vertical slice over the existing WebSocket session
-- implement mocked STT flow for partial transcript events
+- [x] implement mocked STT flow for partial transcript events
 - implement mocked LLM response streaming beyond the fixed deterministic slice
 - implement stubbed audio playback or placeholder TTS output
 - [x] implement interrupt handling across the mocked pipeline
@@ -190,6 +190,8 @@ Polish the system after the core voice loop is reliable.
 - `apps/vela-gateway` now exposes a minimal `/ws` WebSocket session skeleton with ephemeral in-memory sessions and defensive message handling
 - `apps/vela-gateway` now accepts `mocked.turn.trigger` and emits protocol-valid mocked transcript/response events with one in-flight mocked turn per session
 - `apps/vela-gateway` now supports placeholder input-audio append/commit cycles before running another mocked turn on the same socket
+- `apps/vela-gateway` now emits deterministic `transcript.partial` events for placeholder `input_audio.append` messages and exactly one deterministic `transcript.final` for each placeholder `input_audio.commit`
+- `apps/vela-ui` now renders the latest placeholder partial transcript during the push-to-talk shell turn and replaces it with the deterministic final transcript on commit
 - `apps/vela-ui` now exposes a cancel control for active mocked turns and keeps already-rendered transcript/response text visible after cancellation
 - `apps/vela-gateway` now honors `response.cancel` during mocked turns by stopping pending mocked response events, returning the session to `idle`, and allowing a new mocked turn on the same socket
 - `apps/vela-protocol` now provides the shared WebSocket event contract for the UI and gateway
--- a/docs/protocol.md
+++ b/docs/protocol.md
@@ -62,7 +62,8 @@ type ClientEvent =
 - `mocked.turn.trigger` is accepted only when no other mocked turn is already in flight for that session
 - a mocked turn emits deterministic `transcript.final`, `response.text.delta`, `response.completed`, and `session.state` events in protocol-valid order
 - `input_audio.append` updates the ephemeral session record and moves the session to `listening`
-- `input_audio.commit` resets the minimal buffered state and returns the session to `idle`
+- each accepted `input_audio.append` emits one deterministic `transcript.partial` for the current placeholder turn
+- `input_audio.commit` emits exactly one deterministic `transcript.final`, resets the minimal buffered state, and returns the session to `idle`
 - after a completed placeholder input cycle, the same socket can still send `mocked.turn.trigger`
 - `response.cancel` is safe to send even when no mocked turn is active
 - `response.cancel` stops any still-pending mocked turn events for the active turn and resets the minimal session state back to `idle`
@@ -90,6 +91,8 @@ Notes:
 - the UI disables the mocked-turn control until `session.ready` arrives, while disconnected, or while a mocked turn is already in flight
 - the UI disables the mic control while disconnected, before `session.ready`, or while a mocked turn is already in flight
 - pressing the mic control sends one placeholder `input_audio.append` chunk and releasing it sends `input_audio.commit`
+- while a placeholder push-to-talk turn is in progress, the UI renders the latest `transcript.partial`
+- after placeholder commit, the UI renders the `transcript.final` and clears the partial-only display
 - the UI copy explicitly labels the mic button as a control shell and not real microphone capture
 - the UI shows a cancel control and enables it only while a mocked turn is active
 - after cancel returns the gateway to `idle`, the UI clears the active-turn indicator but keeps any transcript or response text that was already rendered
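
The UI notes above (latest partial shown during the turn, final replaces it on commit) can be sketched as a small state update. This is an illustrative sketch only, not the code in `apps/vela-ui`: the `TranscriptView` shape, the `applyTranscriptEvent` name, and the simplified event objects are assumptions, and the real protocol envelope may carry more fields.

```typescript
// Hypothetical, simplified event shapes; the real shared protocol envelope may differ.
type TranscriptEvent =
  | { type: "transcript.partial"; text: string }
  | { type: "transcript.final"; text: string };

// Hypothetical view state for the push-to-talk shell.
interface TranscriptView {
  partial: string | null; // latest partial, shown only while the turn is in progress
  final: string | null; // last final transcript
}

function applyTranscriptEvent(
  view: TranscriptView,
  event: TranscriptEvent,
): TranscriptView {
  switch (event.type) {
    case "transcript.partial":
      // Each new partial replaces the previous one.
      return { ...view, partial: event.text };
    case "transcript.final":
      // The final transcript is rendered and the partial-only display is cleared.
      return { partial: null, final: event.text };
  }
}
```

Returning a new object instead of mutating in place keeps the update easy to plug into a reactive store, though any equivalent state handling would satisfy the documented behavior.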
@@ -144,6 +147,29 @@ Notes:
 - no audio, STT, LLM, TTS, or external providers participate in this flow
 - `response.cancel` can stop the mocked turn early, suppress any later mocked response events for that turn, and return the session to `idle`

+### Deterministic placeholder push-to-talk transcript sequence
+
+For this increment, the existing mic-control shell still sends placeholder `input_audio.append` on press and `input_audio.commit` on release. The gateway now translates that shell flow into deterministic mocked transcript events only:
+
+```text
+input_audio.append #1
+→ session.state(listening) when entering the turn
+→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress.")
+
+input_audio.append #N (N > 1)
+→ transcript.partial("[mocked partial] Placeholder push-to-talk transcript in progress (N chunks).")
+
+input_audio.commit after N appends
+→ transcript.final("[mocked final] Placeholder push-to-talk transcript completed from N appended chunk(s).")
+→ session.state(idle)
+```
+
+Safe deterministic edge cases for this mocked placeholder flow:
+
+- commit without any prior append is accepted and emits `transcript.final("[mocked final] Placeholder push-to-talk transcript completed without appended audio.")`
+- repeated appends during one placeholder turn are accepted and each append replaces the latest partial transcript with a chunk-count-based deterministic value
+- placeholder commit does not automatically start assistant thinking, response streaming, or audio playback
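
As a rough illustration of the sequence and edge cases above, the deterministic transcript text can be derived from nothing more than a per-turn append counter. This is a sketch under assumptions, not the gateway's actual code: `PlaceholderTurn`, `onAppend`, and `onCommit` are hypothetical names, the event objects are simplified, and `session.state` emission is left out. Only the transcript strings are taken from the documented sequence.

```typescript
// Hypothetical, simplified event shapes; the real shared protocol envelope may differ.
type TranscriptEvent =
  | { type: "transcript.partial"; text: string }
  | { type: "transcript.final"; text: string };

// Hypothetical per-session placeholder state: how many appends this turn has seen.
interface PlaceholderTurn {
  appendCount: number;
}

// Each accepted append yields one partial whose text depends only on the chunk count.
function onAppend(turn: PlaceholderTurn): TranscriptEvent {
  turn.appendCount += 1;
  const n = turn.appendCount;
  const text =
    n === 1
      ? "[mocked partial] Placeholder push-to-talk transcript in progress."
      : `[mocked partial] Placeholder push-to-talk transcript in progress (${n} chunks).`;
  return { type: "transcript.partial", text };
}

// Commit yields exactly one final and resets the counter for the next turn.
function onCommit(turn: PlaceholderTurn): TranscriptEvent {
  const n = turn.appendCount;
  turn.appendCount = 0;
  const text =
    n === 0
      ? "[mocked final] Placeholder push-to-talk transcript completed without appended audio."
      : `[mocked final] Placeholder push-to-talk transcript completed from ${n} appended chunk(s).`;
  return { type: "transcript.final", text };
}
```

Because the output depends only on the counter, the same press/release pattern always produces the same events, which is what makes the flow straightforward to assert against in tests.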
+
 ## Contract Scope for This Increment

 This contract is intentionally limited to the smallest event set needed to unblock: