# Vela Architecture

## High-Level Architecture

    [ Browser (PWA UI) ]
              |
          WebSocket
              |
    [ Vela Gateway (NanoPi R6S) ]
              |
              +--> STT (local or NAS)
              +--> Ollama (NAS GPU)
              +--> Kokoro TTS (NAS or NanoPi)
              +--> Home Assistant
              +--> SearXNG
## Core Components

### Repository Structure

    apps/
        vela-ui/
        vela-gateway/
The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.
### Frontend — vela-ui

#### Tech

- SvelteKit
- PWA enabled
- WebSocket client
The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can:

- open and close a browser WebSocket connection to the gateway `/ws` endpoint
- show explicit connection status (not connected, connecting, connected, disconnected, error)
- expose mic-control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events
- trigger one deterministic mocked turn while connected
- render deterministic placeholder partial/final transcripts for the push-to-talk shell
- render the mocked user transcript plus the mocked assistant response for the existing mocked-turn path

This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
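The connection statuses and the guard on the mocked-turn trigger can be sketched as a small model. This is illustrative only; the type and function names are hypothetical, not taken from the vela-ui source.

```typescript
// Hypothetical sketch of the connection states the shell renders.
type ConnectionStatus =
  | "not_connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// The mocked-turn trigger is guarded: it only fires while the socket is
// connected and while no mocked turn is already in flight.
function canTriggerMockedTurn(
  status: ConnectionStatus,
  turnInFlight: boolean,
): boolean {
  return status === "connected" && !turnInFlight;
}
```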
#### Responsibilities
Current shell responsibilities:
- connection state rendering
- mocked-turn trigger rendering with disconnected/in-flight guards
- mocked transcript and mocked assistant response rendering
- developer-oriented session metadata rendering
- browser session connect/disconnect controls
Future UI responsibilities:
- audio capture from microphone
- audio playback for TTS
- broader voice-session UI state rendering
- interrupt handling
#### Main Screen
Current shell:
- developer-focused voice-session panel
- connect button
- disconnect button
- mocked-turn button
- connection status indicator
- mocked transcript display
- mocked assistant response display
- session metadata display
Future interactive voice screen:
- large mic button
- live transcript
- streamed assistant response text
- state indicator:
  - idle
  - listening
  - thinking
  - speaking
- interrupt button during speaking
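The state indicator and the interrupt button above suggest a small state machine. This is a sketch under assumed transitions; the event names are hypothetical and the real protocol may differ.

```typescript
// Illustrative state machine for the future voice-screen indicator.
type VoiceState = "idle" | "listening" | "thinking" | "speaking";
type VoiceEvent =
  | "mic_pressed"      // user starts push-to-talk
  | "transcript_final" // STT finished
  | "tts_started"      // assistant audio begins
  | "tts_done"         // assistant audio ends
  | "interrupt";       // user taps the interrupt button

function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (event) {
    case "mic_pressed":
      return state === "idle" ? "listening" : state;
    case "transcript_final":
      return state === "listening" ? "thinking" : state;
    case "tts_started":
      return state === "thinking" ? "speaking" : state;
    case "tts_done":
    case "interrupt":
      // both end playback; interrupt is ignored outside of speaking
      return state === "speaking" ? "idle" : state;
  }
}
```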
### Backend — vela-gateway

#### Tech

- Fastify (Node)
- WebSocket-based session layer
The current implementation is a minimal Fastify service with `/`, `/health`, and a documented `/ws` WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
#### Responsibilities
- session lifecycle
- audio ingestion
- STT orchestration
- LLM orchestration
- tool execution
- TTS orchestration
- event streaming
#### Current WebSocket skeleton
- `GET /ws` documents the route for plain HTTP clients and returns `426 Upgrade Required`
- WebSocket upgrades on `/ws` create an ephemeral session immediately
- the gateway sends `session.ready` followed by `session.state` (idle) when the socket is established
- valid minimal client events, including placeholder `input_audio.append` / `input_audio.commit`, can move the session between `idle` and `listening`
- placeholder `input_audio.append` emits deterministic mocked `transcript.partial` events and `input_audio.commit` emits one deterministic mocked `transcript.final`
- `mocked.turn.trigger` drives a fixed transcript/response event sequence over the existing shared protocol
- only one mocked turn is allowed in flight per session at a time
- invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
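The defensive frame handling could look roughly like this sketch: bad input yields an error result instead of an exception, so the process stays up. The envelope shape and names here are assumptions, not the real protocol contract.

```typescript
// Hedged sketch of defensive parsing of incoming WebSocket frames.
// Envelope shape is assumed: a JSON object with a string "type" field.
type Envelope = { type: string; payload?: unknown };
type ParseResult =
  | { ok: true; event: Envelope }
  | { ok: false; error: "invalid_json" | "invalid_envelope" };

function parseClientFrame(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    // malformed JSON never throws past this point
    return { ok: false, error: "invalid_json" };
  }
  if (
    typeof data !== "object" ||
    data === null ||
    typeof (data as { type?: unknown }).type !== "string"
  ) {
    return { ok: false, error: "invalid_envelope" };
  }
  return { ok: true, event: data as Envelope };
}
```

A caller would reply with a protocol-level error event on the `ok: false` branch rather than closing the socket or crashing.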
#### Current UI shell behavior
- renders a minimal developer-focused voice-session panel
- exposes connect, disconnect, mic-control shell interactions, and mocked-turn controls
- does not request microphone permission or capture real microphone audio
- only emits placeholder `input_audio.append` / `input_audio.commit` events; it does not send real audio data or play back audio
- renders the latest placeholder partial transcript during a push-to-talk shell turn and replaces it with the final deterministic transcript on commit
- reads mocked transcript and mocked response events from the shared protocol contract
## Voice Pipeline
Mic control shell / mocked turn button → Placeholder `input_audio.append` / `input_audio.commit` or mocked session flow → Deterministic transcript events → Mocked response text events when using mocked.turn.trigger → UI
This mocked vertical slice intentionally stands in for the future real pipeline:

    Mic → Gateway → STT → Transcript
        → LLM → Tool Calls → Results
        → LLM → Final Response
        → TTS → Audio Stream → UI
## Gateway Internal Flow
1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream
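The LLM/tool portion of the flow above (steps 4 through 8) can be sketched with every provider stubbed out. None of these names come from vela-gateway; this is an assumed shape, not the actual implementation.

```typescript
// Rough sketch of one turn: the LLM either answers directly or requests
// a tool, and tool results are fed back until a final response appears.
type ToolCall = { name: string; args: Record<string, unknown> };
type LlmDecision =
  | { kind: "response"; text: string }
  | { kind: "tool"; call: ToolCall };

async function runTurn(
  transcript: string,
  llm: (prompt: string) => Promise<LlmDecision>,
  runTool: (call: ToolCall) => Promise<string>,
): Promise<string> {
  let decision = await llm(transcript); // steps 4-5
  while (decision.kind === "tool") {
    const result = await runTool(decision.call); // step 6
    // step 7: feed the tool result back to the LLM
    decision = await llm(`${transcript}\n[tool:${decision.call.name}] ${result}`);
  }
  return decision.text; // step 8; text/TTS streaming (steps 9-10) omitted
}
```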
## LLM Layer

### Location

- NAS with RTX 3050 8GB
### Role
- intent parsing
- tool selection
- response generation
### Constraints
- must use a tool-calling schema
- must not directly control systems
- target approximately 7B-class models because of hardware limits
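A tool-calling schema in the OpenAI-style function-tool format, which Ollama-hosted models commonly accept, might look like the sketch below. The `set_light` tool is hypothetical: the LLM only emits a call matching this schema, and the gateway executes it against Home Assistant, so the model never controls systems directly.

```typescript
// Illustrative tool definition; the tool name and fields are assumptions.
const lightTool = {
  type: "function",
  function: {
    name: "set_light",
    description: "Turn a Home Assistant light on or off",
    parameters: {
      type: "object",
      properties: {
        entity_id: { type: "string" },
        state: { type: "string", enum: ["on", "off"] },
      },
      required: ["entity_id", "state"],
    },
  },
} as const;
```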
## Naming

- system: Vela
- gateway: `vela-gateway`
- UI: `vela-ui`
- voice profile: `vela-neutral`