# Vela Architecture

## High-Level Architecture

```text
[ Browser (PWA UI) ]
        |
    WebSocket
        |
[ Vela Gateway (NanoPi R6S) ]
        |
        +--> STT (local or NAS)
        +--> Ollama (NAS GPU)
        +--> Kokoro TTS (NAS or NanoPi)
        +--> Home Assistant
        +--> SearXNG
```

## Core Components

## Repository Structure

```text
apps/
  vela-ui/
  vela-gateway/
```

The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.

### Frontend — `vela-ui`

#### Tech

- SvelteKit
- PWA enabled
- WebSocket client

The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can:

- open and close a browser WebSocket connection to the gateway `/ws` endpoint
- show explicit connection status (`not connected`, `connecting`, `connected`, `disconnected`, `error`)
- expose mic-control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events
- trigger one deterministic mocked turn while connected
- render the mocked user transcript plus mocked assistant response for the active session

This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
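The connection-status handling described above can be sketched as a small browser-side client. This is a minimal illustration, not the shipped `vela-ui` code: the gateway URL and the `{ type: ... }` event envelope shape are assumptions, while the `/ws` path, the status labels, and the placeholder event names come from this document.

```typescript
// Hypothetical sketch of the browser WebSocket client shell.
// Only /ws, the status labels, and the placeholder event names are from the doc.
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

class VoiceSessionClient {
  status: ConnectionStatus = "not connected";
  private ws: WebSocket | null = null;

  connect(url = "ws://localhost:3000/ws"): void {
    // URL is an assumption; the real gateway host/port may differ.
    this.status = "connecting";
    this.ws = new WebSocket(url);
    this.ws.onopen = () => { this.status = "connected"; };
    this.ws.onclose = () => { this.status = "disconnected"; };
    this.ws.onerror = () => { this.status = "error"; };
  }

  // Placeholder mic-control events; no real audio is captured or sent yet.
  appendAudio(): void {
    this.ws?.send(JSON.stringify({ type: "input_audio.append" }));
  }

  commitAudio(): void {
    this.ws?.send(JSON.stringify({ type: "input_audio.commit" }));
  }

  disconnect(): void {
    this.ws?.close();
  }
}
```

The explicit `ConnectionStatus` union mirrors the five states the shell renders, so the UI can bind its status indicator directly to this one field.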
#### Responsibilities

Current shell responsibilities:

- connection state rendering
- mocked-turn trigger rendering with disconnected/in-flight guards
- mocked transcript and mocked assistant response rendering
- developer-oriented session metadata rendering
- browser session connect/disconnect controls

Future UI responsibilities:

- audio capture from microphone
- audio playback for TTS
- broader voice-session UI state rendering
- interrupt handling

#### Main Screen

Current shell:

- developer-focused voice-session panel
- connect button
- disconnect button
- mocked-turn button
- connection status indicator
- mocked transcript display
- mocked assistant response display
- session metadata display

Future interactive voice screen:

- large mic button
- live transcript
- streamed assistant response text
- state indicator:
  - idle
  - listening
  - thinking
  - speaking
- interrupt button during speaking

### Backend — `vela-gateway`

#### Tech

- Fastify (Node)
- WebSocket-based session layer

The current implementation is a minimal Fastify service with `/`, `/health`, and a documented `/ws` WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
#### Responsibilities

- session lifecycle
- audio ingestion
- STT orchestration
- LLM orchestration
- tool execution
- TTS orchestration
- event streaming

#### Current WebSocket skeleton

- `GET /ws` documents the route for plain HTTP clients and returns `426 Upgrade Required`
- WebSocket upgrades on `/ws` create an ephemeral session immediately
- the gateway sends `session.ready` followed by `session.state` (`idle`) when the socket is established
- valid minimal client events, including placeholder `input_audio.append` / `input_audio.commit`, can move the session between `idle` and `listening`
- `mocked.turn.trigger` drives a fixed transcript/response event sequence over the existing shared protocol
- only one mocked turn is allowed in flight per session at a time
- invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up

### Current UI shell behavior

- renders a minimal developer-focused voice-session panel
- exposes connect, disconnect, mic-control shell interactions, and mocked-turn controls
- does not request microphone permission or capture real microphone audio
- only emits placeholder `input_audio.append` / `input_audio.commit` events; it does not send real audio data or play back audio
- reads mocked transcript and mocked response events from the shared protocol contract

## Voice Pipeline

```text
Mic control shell / mocked turn button
  → Placeholder `input_audio.append` / `input_audio.commit` or mocked session flow
  → Transcript events
  → Response text events
  → UI
```

This mocked vertical slice intentionally stands in for the future real pipeline:

```text
Mic → Gateway → STT → Transcript → LLM → Tool Calls → Results → LLM
    → Final Response → TTS → Audio Stream → UI
```

## Gateway Internal Flow

```text
1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream
```

## LLM Layer

### Location

- NAS with RTX 3050 8GB

### Role

- intent parsing
- tool selection
- response generation

### Constraints

- must use a tool-calling schema
- must not directly control systems
- target approximately 7B-class models because of hardware limits

## Naming

- system: **Vela**
- gateway: `vela-gateway`
- UI: `vela-ui`
- voice profile: `vela-neutral`
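The LLM-layer constraints above (structured tool calling, no direct system control) can be sketched as a minimal envelope the gateway executes on the model's behalf. Everything here is hypothetical: the `ToolCall` shape, the `executeTool` function, and the example tool name are illustrative placeholders, not the real Vela or Home Assistant API.

```typescript
// Hypothetical tool-calling envelope: the model only emits structured calls;
// the gateway performs the side effect and feeds the result back to the LLM.
interface ToolCall {
  tool: string;
  arguments: Record<string, unknown>;
}

function executeTool(call: ToolCall): string {
  switch (call.tool) {
    // Illustrative tool name; the real Home Assistant integration may differ.
    case "home_assistant.turn_on":
      return `turned on ${String(call.arguments["entity_id"])}`; // stubbed result
    default:
      return `unknown tool: ${call.tool}`;
  }
}
```

Routing every action through a schema like this is what keeps the model from directly controlling systems: an unrecognized or malformed call is rejected at the gateway rather than acted on.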