Vela Architecture

High-Level Architecture

[ Browser (PWA UI) ]
        |
   WebSocket
        |
[ Vela Gateway (NanoPi R6S) ]
        |
        +--> STT (local or NAS)
        +--> Ollama (NAS GPU)
        +--> Kokoro TTS (NAS or NanoPi)
        +--> Home Assistant
        +--> SearXNG

Core Components

Repository Structure

apps/
  vela-ui/
  vela-gateway/

The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.

Frontend — vela-ui

Tech

  • SvelteKit
  • PWA enabled
  • WebSocket client

The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can:

  • open and close a browser WebSocket connection to the gateway /ws endpoint
  • show explicit connection status (not connected, connecting, connected, disconnected, error)
  • expose mic-control shell interactions that emit placeholder input_audio.append / input_audio.commit events
  • trigger one deterministic mocked turn while connected
  • render deterministic placeholder partial/final transcripts for the push-to-talk shell
  • stream the mocked assistant response for both mocked.turn.trigger and push-to-talk commits

This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
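
As a rough illustration, the shell's connection and mic-control handling could look like the sketch below, written as a plain TypeScript module (Svelte stores omitted for brevity). The /ws path, status labels, and client event names come from this document; the gateway URL, the { type: ... } envelope shape, and the function names are assumptions:

```ts
// Sketch only: the envelope shape and default URL are assumptions,
// not the shipped implementation. Text frames are assumed throughout.
type ConnectionStatus =
  | 'not connected' | 'connecting' | 'connected' | 'disconnected' | 'error';

let status: ConnectionStatus = 'not connected';
let socket: WebSocket | undefined;

export function connect(url = 'ws://gateway.local/ws'): void {
  status = 'connecting';
  socket = new WebSocket(url);
  socket.onopen = () => { status = 'connected'; };
  socket.onclose = () => { status = 'disconnected'; };
  socket.onerror = () => { status = 'error'; };
  socket.onmessage = (ev) => handleGatewayEvent(JSON.parse(String(ev.data)));
}

export function disconnect(): void {
  socket?.close();
}

// Mic-control shell: placeholder events only, no real audio is attached.
export function micHold(): void {
  socket?.send(JSON.stringify({ type: 'input_audio.append' }));
}

export function micRelease(): void {
  socket?.send(JSON.stringify({ type: 'input_audio.commit' }));
}

function handleGatewayEvent(event: { type: string }): void {
  // session.ready, session.state, transcript.*, and mocked response
  // events arrive here and drive the shell's rendering.
  console.log('gateway event:', event.type);
}
```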

Responsibilities

Current shell responsibilities:

  • connection state rendering
  • mocked-turn trigger rendering with disconnected/in-flight guards
  • mocked transcript and mocked assistant response rendering
  • developer-oriented session metadata rendering
  • browser session connect/disconnect controls

Future UI responsibilities:

  • audio capture from microphone
  • audio playback for TTS
  • broader voice-session UI state rendering
  • interrupt handling

Main Screen

Current shell:

  • developer-focused voice-session panel
  • connect button
  • disconnect button
  • mocked-turn button
  • connection status indicator
  • mocked transcript display
  • mocked assistant response display
  • session metadata display

Future interactive voice screen:

  • large mic button
  • live transcript
  • streamed assistant response text
  • state indicator:
    • idle
    • listening
    • thinking
    • speaking
  • interrupt button during speaking

Backend — vela-gateway

Tech

  • Fastify (Node)
  • WebSocket-based session layer

The current implementation is a minimal Fastify service with /, /health, and a documented /ws WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
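
A minimal sketch of that skeleton, assuming the @fastify/websocket plugin (v10+, where the route handler receives the WebSocket directly; older versions pass a connection wrapper). The port and the session-record fields are illustrative, not the real schema:

```ts
// Sketch assuming @fastify/websocket v10+; session fields are illustrative.
import Fastify from 'fastify';
import websocket from '@fastify/websocket';
import { randomUUID } from 'node:crypto';

const app = Fastify();
await app.register(websocket);

// One ephemeral in-memory record per live socket; nothing is persisted.
const sessions = new Map<string, { id: string; turnInFlight: boolean }>();

app.get('/', async () => ({ service: 'vela-gateway' }));
app.get('/health', async () => ({ status: 'ok' }));

app.get('/ws', { websocket: true }, (socket) => {
  const session = { id: randomUUID(), turnInFlight: false };
  sessions.set(session.id, session);
  socket.send(JSON.stringify({ type: 'session.ready', sessionId: session.id }));
  socket.send(JSON.stringify({ type: 'session.state', state: 'idle' }));
  socket.on('close', () => sessions.delete(session.id));
});

await app.listen({ port: 3000, host: '0.0.0.0' }); // port is an assumption
```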

Responsibilities

  • session lifecycle
  • audio ingestion
  • STT orchestration
  • LLM orchestration
  • tool execution
  • TTS orchestration
  • event streaming

Current WebSocket skeleton

  • GET /ws documents the route for plain HTTP clients and returns 426 Upgrade Required
  • WebSocket upgrades on /ws create an ephemeral session immediately
  • the gateway sends session.ready followed by session.state (idle) when the socket is established
  • valid minimal client events, including placeholder input_audio.append / input_audio.commit, can move the session through the mocked turn states on one socket
  • placeholder input_audio.append emits deterministic mocked transcript.partial events and input_audio.commit emits one deterministic mocked transcript.final before starting the existing mocked assistant response stream
  • mocked.turn.trigger drives a fixed transcript/response event sequence over the existing shared protocol
  • only one mocked turn is allowed in flight per session at a time
  • invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up (see the sketch below)
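
A variant of the /ws route from the gateway sketch above, showing how @fastify/websocket's handler/wsHandler split can serve the documented 426 behavior while handling frames defensively. The envelope checks, mock texts, 'response.delta' name, and the runMockedTurn helper are assumptions:

```ts
// Assumes the `app` and plugin registration from the previous sketch.
// runMockedTurn stands in for the shared mocked response engine and is
// hypothetical; a real version would stream multiple text events.
function runMockedTurn(
  socket: { send(data: string): void },
  session: { turnInFlight: boolean },
): void {
  if (session.turnInFlight) return; // one mocked turn in flight per session
  session.turnInFlight = true;
  socket.send(JSON.stringify({ type: 'response.delta', text: 'mocked text' }));
  session.turnInFlight = false;
}

app.route({
  method: 'GET',
  url: '/ws',
  // Plain HTTP clients get the documented 426 Upgrade Required response.
  handler: async (_req, reply) =>
    reply.code(426).send({ error: 'Upgrade Required', hint: 'connect over WebSocket' }),
  wsHandler: (socket) => {
    const session = { turnInFlight: false };
    socket.on('message', (raw: Buffer) => {
      let event: { type?: unknown };
      try {
        event = JSON.parse(raw.toString('utf8'));
      } catch {
        return; // invalid JSON: drop the frame, keep the process up
      }
      if (typeof event.type !== 'string') return; // invalid envelope: ignore

      switch (event.type) {
        case 'input_audio.append':
          socket.send(JSON.stringify({ type: 'transcript.partial', text: 'mock partial' }));
          break;
        case 'input_audio.commit':
          socket.send(JSON.stringify({ type: 'transcript.final', text: 'mock final' }));
          runMockedTurn(socket, session);
          break;
        case 'mocked.turn.trigger':
          runMockedTurn(socket, session);
          break;
        default:
          break; // unknown event types are ignored
      }
    });
  },
});
```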

Current UI shell behavior

  • renders a minimal developer-focused voice-session panel
  • exposes connect, disconnect, mic-control shell interactions, and mocked-turn controls
  • does not request microphone permission or capture real microphone audio
  • only emits placeholder input_audio.append / input_audio.commit events; it does not send real audio data or play back audio
  • renders the latest placeholder partial transcript during a push-to-talk shell turn, replaces it with the deterministic final transcript on commit, and appends the streamed mocked assistant text for that same turn (see the sketch after this list)
  • reads mocked transcript and mocked response events from the shared protocol contract
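
The transcript and response rendering rules amount to a small reducer over gateway events, sketched below. The 'response.delta' event name is an assumption; this document does not name the mocked response events:

```ts
// Sketch of the shell's rendering rules for one push-to-talk turn.
interface TurnView {
  transcript: string;
  transcriptFinal: boolean;
  response: string;
}

export function applyEvent(
  view: TurnView,
  event: { type: string; text?: string },
): TurnView {
  switch (event.type) {
    case 'transcript.partial':
      // Show only the latest partial while the turn is still open.
      return view.transcriptFinal ? view : { ...view, transcript: event.text ?? '' };
    case 'transcript.final':
      // Replace the partial with the deterministic final transcript.
      return { ...view, transcript: event.text ?? '', transcriptFinal: true };
    case 'response.delta': // name hypothetical
      // Append streamed mocked assistant text for the same turn.
      return { ...view, response: view.response + (event.text ?? '') };
    default:
      return view;
  }
}
```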

Voice Pipeline

Mic-control shell / mocked-turn button
→ Placeholder input_audio.append / input_audio.commit or mocked session flow
→ Deterministic transcript events
→ Shared mocked response engine
→ Mocked response text events
→ UI
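
For illustration, one push-to-talk shell turn might move frames like these over the socket. The transcript event names appear elsewhere in this document; the 'response.delta' name and all texts are made up:

```ts
// Illustrative frame sequence for one push-to-talk shell turn.
const exampleTurn = [
  { type: 'input_audio.append' },                           // client, while mic held
  { type: 'transcript.partial', text: 'turn on the' },      // gateway, deterministic
  { type: 'input_audio.commit' },                           // client, mic released
  { type: 'transcript.final', text: 'turn on the lights' }, // gateway
  { type: 'response.delta', text: 'Okay, turning on ' },    // gateway, streamed
  { type: 'response.delta', text: 'the lights.' },
];
```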

This mocked vertical slice intentionally stands in for the future real pipeline:

Mic → Gateway → STT → Transcript
→ LLM → Tool Calls → Results
→ LLM → Final Response
→ TTS → Audio Stream → UI

Gateway Internal Flow

1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream (see the sketch below)
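
Since none of this flow is implemented yet, the sketch below is purely hypothetical: it maps the ten steps onto placeholder provider interfaces. Every name here is an assumption:

```ts
// Hypothetical orchestration sketch for the future flow above.
interface ToolCall { name: string; args: unknown }
interface LlmResult { text: string; toolCall?: ToolCall }

interface Providers {
  stt: {
    transcribe(
      audio: AsyncIterable<Uint8Array>,
      onPartial: (text: string) => void,
    ): Promise<string>;
  };
  llm: { respond(transcript: string, toolResult?: string): Promise<LlmResult> };
  tools: { execute(call: ToolCall): Promise<string> };
  tts: { synthesize(text: string): AsyncIterable<Uint8Array> };
}

async function runTurn(
  audio: AsyncIterable<Uint8Array>,
  p: Providers,
  emit: (event: object) => void,
): Promise<void> {
  // Steps 1-3: stream STT and surface partial transcripts as they arrive.
  const transcript = await p.stt.transcribe(audio, (text) =>
    emit({ type: 'transcript.partial', text }),
  );
  emit({ type: 'transcript.final', text: transcript });

  // Steps 4-5: the LLM either answers directly or requests a tool call.
  let result = await p.llm.respond(transcript);

  // Steps 6-8: execute the tool, feed the result back for the final answer.
  if (result.toolCall) {
    const toolResult = await p.tools.execute(result.toolCall);
    result = await p.llm.respond(transcript, toolResult);
  }

  // Steps 9-10: send the text stream, then the synthesized audio stream.
  emit({ type: 'response.text', text: result.text }); // event name hypothetical
  for await (const chunk of p.tts.synthesize(result.text)) {
    emit({ type: 'response.audio', byteLength: chunk.byteLength });
  }
}
```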

LLM Layer

Location

  • NAS with RTX 3050 8GB

Role

  • intent parsing
  • tool selection
  • response generation

Constraints

  • must use a tool-calling schema (see the example after this list)
  • must not directly control systems
  • target approximately 7B-class models because of hardware limits
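
For example, a Home Assistant tool might be declared to the model in the OpenAI-style tool format that Ollama's chat API accepts. Vela's real schema is not yet defined; the tool name and parameters below are hypothetical:

```ts
// Hypothetical tool declaration for the tool-calling constraint above.
const tools = [
  {
    type: 'function',
    function: {
      name: 'home_assistant_call_service',
      description: 'Call a Home Assistant service such as light.turn_on',
      parameters: {
        type: 'object',
        properties: {
          domain: { type: 'string', description: 'e.g. "light"' },
          service: { type: 'string', description: 'e.g. "turn_on"' },
          entity_id: { type: 'string', description: 'target entity' },
        },
        required: ['domain', 'service', 'entity_id'],
      },
    },
  },
];
```

The gateway, not the model, would execute the selected tool, which keeps the "must not directly control systems" constraint intact.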

Naming

  • system: Vela
  • gateway: vela-gateway
  • UI: vela-ui
  • voice profile: vela-neutral