# Vela Architecture

## High-Level Architecture

    [ Browser (PWA UI) ]
              |
          WebSocket
              |
    [ Vela Gateway (NanoPi R6S) ]
              |
              +--> STT (local or NAS)
              +--> Ollama (NAS GPU)
              +--> Kokoro TTS (NAS or NanoPi)
              +--> Home Assistant
              +--> SearXNG
## Core Components

### Repository Structure

    apps/
        vela-ui/
        vela-gateway/
The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.
### Frontend — vela-ui

#### Tech

- SvelteKit
- PWA enabled
- WebSocket client
The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can:

- open and close a browser WebSocket connection to the gateway `/ws` endpoint
- show explicit connection status (not connected, connecting, connected, disconnected, error)
- expose mic-control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events
- trigger one deterministic mocked turn while connected
- render deterministic placeholder partial/final transcripts for the push-to-talk shell
- render the mocked user transcript plus the mocked assistant response for the existing mocked-turn path

This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
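The connection statuses and the guard on the mocked-turn trigger can be sketched as a small model. This is illustrative only; the type and function names are hypothetical, not taken from the vela-ui source.

```typescript
// Hypothetical sketch of the connection states the shell renders.
type ConnectionStatus =
  | "not_connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// The mocked-turn trigger is guarded: it only fires while the socket is
// connected and while no mocked turn is already in flight.
function canTriggerMockedTurn(
  status: ConnectionStatus,
  turnInFlight: boolean,
): boolean {
  return status === "connected" && !turnInFlight;
}
```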
#### Responsibilities
Current shell responsibilities:
- connection state rendering
- mocked-turn trigger rendering with disconnected/in-flight guards
- mocked transcript and mocked assistant response rendering
- developer-oriented session metadata rendering
- browser session connect/disconnect controls
Future UI responsibilities:
- audio capture from microphone
- audio playback for TTS
- broader voice-session UI state rendering
- interrupt handling
#### Main Screen
Current shell:
- developer-focused voice-session panel
- connect button
- disconnect button
- mocked-turn button
- connection status indicator
- mocked transcript display
- mocked assistant response display
- session metadata display
Future interactive voice screen:
- large mic button
- live transcript
- streamed assistant response text
- state indicator:
  - idle
  - listening
  - thinking
  - speaking
- interrupt button during speaking
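The state indicator and the interrupt button above suggest a small state machine. This is a sketch under assumed transitions; the event names are hypothetical and the real protocol may differ.

```typescript
// Illustrative state machine for the future voice-screen indicator.
type VoiceState = "idle" | "listening" | "thinking" | "speaking";
type VoiceEvent =
  | "mic_pressed"      // user starts push-to-talk
  | "transcript_final" // STT finished
  | "tts_started"      // assistant audio begins
  | "tts_done"         // assistant audio ends
  | "interrupt";       // user taps the interrupt button

function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (event) {
    case "mic_pressed":
      return state === "idle" ? "listening" : state;
    case "transcript_final":
      return state === "listening" ? "thinking" : state;
    case "tts_started":
      return state === "thinking" ? "speaking" : state;
    case "tts_done":
    case "interrupt":
      // both end playback; interrupt is ignored outside of speaking
      return state === "speaking" ? "idle" : state;
  }
}
```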
### Backend — vela-gateway

#### Tech

- Fastify (Node)
- WebSocket-based session layer
The current implementation is a minimal Fastify service with `/`, `/health`, and a documented `/ws` WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
#### Responsibilities
- session lifecycle
- audio ingestion
- STT orchestration
- LLM orchestration
- tool execution
- TTS orchestration
- event streaming
#### Current WebSocket skeleton
- `GET /ws` documents the route for plain HTTP clients and returns `426 Upgrade Required`
- WebSocket upgrades on `/ws` create an ephemeral session immediately
- the gateway sends `session.ready` followed by `session.state` (idle) when the socket is established
- valid minimal client events, including placeholder `input_audio.append` / `input_audio.commit`, can move the session between `idle` and `listening`
- placeholder `input_audio.append` emits deterministic mocked `transcript.partial` events and `input_audio.commit` emits one deterministic mocked `transcript.final`
- `mocked.turn.trigger` drives a fixed transcript/response event sequence over the existing shared protocol
- only one mocked turn is allowed in flight per session at a time
- invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
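The defensive frame handling could look roughly like this sketch: bad input yields an error result instead of an exception, so the process stays up. The envelope shape and names here are assumptions, not the real protocol contract.

```typescript
// Hedged sketch of defensive parsing of incoming WebSocket frames.
// Envelope shape is assumed: a JSON object with a string "type" field.
type Envelope = { type: string; payload?: unknown };
type ParseResult =
  | { ok: true; event: Envelope }
  | { ok: false; error: "invalid_json" | "invalid_envelope" };

function parseClientFrame(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    // malformed JSON never throws past this point
    return { ok: false, error: "invalid_json" };
  }
  if (
    typeof data !== "object" ||
    data === null ||
    typeof (data as { type?: unknown }).type !== "string"
  ) {
    return { ok: false, error: "invalid_envelope" };
  }
  return { ok: true, event: data as Envelope };
}
```

A caller would reply with a protocol-level error event on the `ok: false` branch rather than closing the socket or crashing.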
#### Current UI shell behavior
- renders a minimal developer-focused voice-session panel
- exposes connect, disconnect, mic-control shell interactions, and mocked-turn controls
- does not request microphone permission or capture real microphone audio
- only emits placeholder `input_audio.append` / `input_audio.commit` events; it does not send real audio data or play back audio
- renders the latest placeholder partial transcript during a push-to-talk shell turn and replaces it with the final deterministic transcript on commit
- reads mocked transcript and mocked response events from the shared protocol contract
## Voice Pipeline
Mic control shell / mocked turn button → Placeholder `input_audio.append` / `input_audio.commit` or mocked session flow → Deterministic transcript events → Mocked response text events when using mocked.turn.trigger → UI
This mocked vertical slice intentionally stands in for the future real pipeline:

    Mic → Gateway → STT → Transcript
        → LLM → Tool Calls → Results
        → LLM → Final Response
        → TTS → Audio Stream → UI
## Gateway Internal Flow
1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream
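The LLM/tool portion of the flow above (steps 4 through 8) can be sketched with every provider stubbed out. None of these names come from vela-gateway; this is an assumed shape, not the actual implementation.

```typescript
// Rough sketch of one turn: the LLM either answers directly or requests
// a tool, and tool results are fed back until a final response appears.
type ToolCall = { name: string; args: Record<string, unknown> };
type LlmDecision =
  | { kind: "response"; text: string }
  | { kind: "tool"; call: ToolCall };

async function runTurn(
  transcript: string,
  llm: (prompt: string) => Promise<LlmDecision>,
  runTool: (call: ToolCall) => Promise<string>,
): Promise<string> {
  let decision = await llm(transcript); // steps 4-5
  while (decision.kind === "tool") {
    const result = await runTool(decision.call); // step 6
    // step 7: feed the tool result back to the LLM
    decision = await llm(`${transcript}\n[tool:${decision.call.name}] ${result}`);
  }
  return decision.text; // step 8; text/TTS streaming (steps 9-10) omitted
}
```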
## LLM Layer

### Location

- NAS with RTX 3050 8GB
### Role
- intent parsing
- tool selection
- response generation
### Constraints
- must use a tool-calling schema
- must not directly control systems
- target approximately 7B-class models because of hardware limits
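A tool-calling schema in the OpenAI-style function-tool format, which Ollama-hosted models commonly accept, might look like the sketch below. The `set_light` tool is hypothetical: the LLM only emits a call matching this schema, and the gateway executes it against Home Assistant, so the model never controls systems directly.

```typescript
// Illustrative tool definition; the tool name and fields are assumptions.
const lightTool = {
  type: "function",
  function: {
    name: "set_light",
    description: "Turn a Home Assistant light on or off",
    parameters: {
      type: "object",
      properties: {
        entity_id: { type: "string" },
        state: { type: "string", enum: ["on", "off"] },
      },
      required: ["entity_id", "state"],
    },
  },
} as const;
```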
## Naming

- system: Vela
- gateway: `vela-gateway`
- UI: `vela-ui`
- voice profile: `vela-neutral`