Vela Architecture

High-Level Architecture

[ Browser (PWA UI) ]
        |
   WebSocket
        |
[ Vela Gateway (NanoPi R6S) ]
        |
        +--> STT (local or NAS)
        +--> Ollama (NAS GPU)
        +--> Kokoro TTS (NAS or NanoPi)
        +--> Home Assistant
        +--> SearXNG

Core Components

Repository Structure

apps/
  vela-ui/
  vela-gateway/

The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.

Frontend — vela-ui

Tech

  • SvelteKit
  • PWA enabled
  • WebSocket client

The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can open and close a browser WebSocket connection to the gateway /ws endpoint, show explicit connection status (not connected, connecting, connected, disconnected, error), trigger one deterministic mocked turn while connected, and render the mocked user transcript and the mocked assistant response for the active session. Microphone capture, real provider integration, and audio playback are still future work.
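
A minimal sketch of that connection handling, assuming a plain browser WebSocket pointed at the gateway; the local URL, the status variable shape, and the logging are illustrative assumptions rather than the shipped code:

```ts
// Connection states surfaced by the shell, matching the statuses listed above.
type ConnectionStatus =
  | 'not-connected'
  | 'connecting'
  | 'connected'
  | 'disconnected'
  | 'error';

let status: ConnectionStatus = 'not-connected';
let socket: WebSocket | null = null;

// The gateway origin is an assumption for local development.
function connect(url = 'ws://localhost:3000/ws'): void {
  status = 'connecting';
  socket = new WebSocket(url);

  socket.addEventListener('open', () => { status = 'connected'; });
  socket.addEventListener('close', () => { status = 'disconnected'; socket = null; });
  socket.addEventListener('error', () => { status = 'error'; });
  socket.addEventListener('message', (event) => {
    // Protocol events (session.ready, session.state, mocked turn events)
    // would be parsed here and pushed into the shell's rendering state.
    console.debug('gateway event', JSON.parse(event.data as string));
  });
}

function disconnect(): void {
  socket?.close();
}
```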

Responsibilities

Current shell responsibilities:

  • connection state rendering
  • mocked-turn trigger rendering with disconnected/in-flight guards
  • mocked transcript and mocked assistant response rendering
  • developer-oriented session metadata rendering
  • browser session connect/disconnect controls

Future UI responsibilities:

  • audio capture from microphone (a capture sketch follows this list)
  • audio playback for TTS
  • broader voice-session UI state rendering
  • interrupt handling
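
As noted in the list above, microphone capture is still future work; a minimal sketch using the standard getUserMedia and MediaRecorder browser APIs might look like this, with the chunk interval and the idea of forwarding chunks over the /ws socket as assumptions:

```ts
// Planned microphone capture, sketched with the standard browser APIs.
// The 250 ms chunk interval and forwarding each chunk over the gateway
// socket are assumptions, not decided behaviour.
async function captureMicrophone(onChunk: (chunk: Blob) => void): Promise<MediaRecorder> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);

  recorder.addEventListener('dataavailable', (event) => {
    if (event.data.size > 0) onChunk(event.data); // e.g. send over the /ws socket
  });

  recorder.start(250); // emit an audio chunk roughly every 250 ms
  return recorder;
}
```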

Main Screen

Current shell:

  • developer-focused voice-session panel
  • connect button
  • disconnect button
  • mocked-turn button
  • connection status indicator
  • mocked transcript display
  • mocked assistant response display
  • session metadata display

Future interactive voice screen:

  • large mic button
  • live transcript
  • streamed assistant response text
  • state indicator:
    • idle
    • listening
    • thinking
    • speaking
  • interrupt button during speaking

Backend — vela-gateway

Tech

  • Fastify (Node)
  • WebSocket-based session layer

The current implementation is a minimal Fastify service with /, /health, and a documented /ws WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
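
A minimal sketch of that skeleton, assuming Fastify with the @fastify/websocket plugin; the payloads, port, and session-record fields are illustrative, and the plugin's WebSocket handler signature varies between versions:

```ts
import Fastify from 'fastify';
import websocket from '@fastify/websocket';
import { randomUUID } from 'node:crypto';

interface SessionRecord {
  id: string;
  state: 'idle' | 'listening';
  turnInFlight: boolean;
}

const app = Fastify({ logger: true });
await app.register(websocket);

// One ephemeral in-memory session record per live socket connection.
const sessions = new Map<string, SessionRecord>();

app.get('/', async () => ({ service: 'vela-gateway' }));
app.get('/health', async () => ({ status: 'ok' }));

app.route({
  method: 'GET',
  url: '/ws',
  // Plain HTTP clients get a short description of the route and a 426.
  handler: async (_request, reply) => {
    reply.code(426).send({ error: 'Upgrade Required', hint: 'connect with a WebSocket client' });
  },
  // Depending on the plugin version the first argument is the WebSocket
  // itself or a stream wrapper around it; a raw socket is assumed here.
  wsHandler: (socket) => {
    const session: SessionRecord = { id: randomUUID(), state: 'idle', turnInFlight: false };
    sessions.set(session.id, session);

    socket.send(JSON.stringify({ type: 'session.ready', sessionId: session.id }));
    socket.send(JSON.stringify({ type: 'session.state', state: session.state }));

    socket.on('close', () => sessions.delete(session.id));
  },
});

await app.listen({ port: 3000, host: '0.0.0.0' });
```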

Responsibilities

  • session lifecycle
  • audio ingestion
  • STT orchestration
  • LLM orchestration
  • tool execution
  • TTS orchestration
  • event streaming

Current WebSocket skeleton

  • GET /ws documents the route for plain HTTP clients and returns 426 Upgrade Required
  • WebSocket upgrades on /ws create an ephemeral session immediately
  • the gateway sends session.ready followed by session.state (idle) when the socket is established
  • valid minimal client events can move the session between idle and listening
  • mocked.turn.trigger drives a fixed transcript/response event sequence over the existing shared protocol (sketched after this list)
  • only one mocked turn is allowed in flight per session at a time
  • invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
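
An illustrative TypeScript view of that contract, as referenced in the list above; session.ready, session.state, and mocked.turn.trigger come from this document, while the envelope fields and remaining event names are assumptions made for the sketch:

```ts
// Client-to-gateway events. Only mocked.turn.trigger appears in this
// document; the listen start/stop names are assumptions for the
// idle <-> listening transition mentioned above.
type ClientEvent =
  | { type: 'session.listen.start' }
  | { type: 'session.listen.stop' }
  | { type: 'mocked.turn.trigger' };

// Gateway-to-client events. session.ready and session.state are documented;
// the mocked transcript/response names and fields are illustrative.
type ServerEvent =
  | { type: 'session.ready'; sessionId: string }
  | { type: 'session.state'; state: 'idle' | 'listening' }
  | { type: 'mocked.transcript'; text: string }
  | { type: 'mocked.response'; text: string };

// Defensive envelope parsing: invalid JSON or a missing type field is
// rejected with null instead of throwing, so the process stays up.
function parseClientEvent(raw: string): ClientEvent | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (value && typeof value === 'object' && typeof (value as { type?: unknown }).type === 'string') {
      return value as ClientEvent;
    }
    return null;
  } catch {
    return null;
  }
}
```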

Current UI shell behavior

  • renders a minimal developer-focused voice-session panel
  • exposes connect, disconnect, and mocked-turn controls
  • does not request microphone permission
  • does not send or process audio data
  • reads mocked transcript and mocked response events from the shared protocol contract

Voice Pipeline

Mocked turn button → Gateway mocked session flow → Transcript events → Response text events → UI

This mocked vertical slice intentionally stands in for the future real pipeline:

Mic → Gateway → STT → Transcript
→ LLM → Tool Calls → Results
→ LLM → Final Response
→ TTS → Audio Stream → UI
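
For concreteness, one mocked turn from the vertical slice above might look like this on the wire, reusing the illustrative names from the protocol sketch; only mocked.turn.trigger is a documented event name, and the texts are placeholders:

```ts
// Placeholder trace of a single deterministic mocked turn.
const mockedTurnTrace = [
  { from: 'ui', event: { type: 'mocked.turn.trigger' } },
  { from: 'gateway', event: { type: 'mocked.transcript', text: 'what time is it' } },
  { from: 'gateway', event: { type: 'mocked.response', text: 'It is three in the afternoon.' } },
  { from: 'gateway', event: { type: 'session.state', state: 'idle' } },
];
```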

Gateway Internal Flow

1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream
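
A sketch of how the gateway might orchestrate these ten steps once real providers are wired in; every provider interface below (STT, LLM, tools, TTS) is a hypothetical placeholder, not an existing module:

```ts
// Hypothetical provider interfaces; none of these exist in the repository yet.
interface SttPartial { text: string; isFinal: boolean }
interface LlmResult { toolCall?: unknown; fullText: string; textStream: AsyncIterable<string> }

interface Providers {
  stt: { stream(audio: AsyncIterable<Uint8Array>): AsyncIterable<SttPartial> };
  llm: { complete(input: { prompt: string; toolResult?: unknown }): Promise<LlmResult> };
  tools: { execute(call: unknown): Promise<unknown> };
  tts: { synthesize(text: string): AsyncIterable<Uint8Array> };
}

async function runTurn(
  send: (event: unknown) => void,
  audio: AsyncIterable<Uint8Array>,
  providers: Providers,
): Promise<void> {
  // Steps 1-3: stream audio through STT and emit partial transcripts.
  let finalTranscript = '';
  for await (const partial of providers.stt.stream(audio)) {
    send({ type: 'transcript.partial', text: partial.text });
    if (partial.isFinal) finalTranscript = partial.text;
  }

  // Steps 4-7: call the LLM; if it picks a tool, execute it and feed the
  // result back for the final response.
  let result = await providers.llm.complete({ prompt: finalTranscript });
  if (result.toolCall) {
    const toolResult = await providers.tools.execute(result.toolCall);
    result = await providers.llm.complete({ prompt: finalTranscript, toolResult });
  }

  // Steps 8-9: stream the response text to the UI.
  for await (const chunk of result.textStream) {
    send({ type: 'response.delta', text: chunk });
  }

  // Step 10: stream synthesized TTS audio.
  for await (const audioChunk of providers.tts.synthesize(result.fullText)) {
    send({ type: 'tts.audio', bytes: audioChunk });
  }
}
```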

LLM Layer

Location

  • NAS with RTX 3050 (8 GB)

Role

  • intent parsing
  • tool selection
  • response generation

Constraints

  • must use a tool-calling schema (an illustrative definition follows this list)
  • must not directly control systems
  • target approximately 7B-class models because of hardware limits
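
As referenced above, an illustrative tool definition in the OpenAI-style function shape that Ollama's tool calling accepts; the tool name, parameters, and entity are hypothetical, and the intent is that the model only selects a tool and fills its arguments while the gateway performs the actual call:

```ts
// Hypothetical Home Assistant light tool; the model never controls the
// system directly, it only emits a call the gateway then executes.
const homeAssistantLightTool = {
  type: 'function',
  function: {
    name: 'set_light_state',
    description: 'Turn a Home Assistant light on or off',
    parameters: {
      type: 'object',
      properties: {
        entity_id: { type: 'string', description: 'Light entity, e.g. light.living_room' },
        state: { type: 'string', enum: ['on', 'off'] },
      },
      required: ['entity_id', 'state'],
    },
  },
};
```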

Naming

  • system: Vela
  • gateway: vela-gateway
  • UI: vela-ui
  • voice profile: vela-neutral