# Vela Architecture

## High-Level Architecture

```text
[ Browser (PWA UI) ]
         |
     WebSocket
         |
[ Vela Gateway (NanoPi R6S) ]
         |
         +--> STT (local or NAS)
         +--> Ollama (NAS GPU)
         +--> Kokoro TTS (NAS or NanoPi)
         +--> Home Assistant
         +--> SearXNG
```

## Core Components
## Repository Structure

```text
apps/
  vela-ui/
  vela-gateway/
```

The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.

### Frontend — `vela-ui`

#### Tech

- SvelteKit
- PWA enabled
- WebSocket client

The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can open and close a browser WebSocket connection to the gateway `/ws` endpoint, show explicit connection status (`not connected`, `connecting`, `connected`, `disconnected`, `error`), and surface session metadata for developers. Microphone capture, transcript rendering, interrupt controls, streamed assistant response display, and audio playback are not part of the current shell and remain future work.
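The five connection statuses above can be modeled as a small pure transition function, which is easy to render from and to unit-test. This is an illustrative sketch, not the shell's actual source; the `SocketEvent` names are assumptions mapped onto the browser `WebSocket` lifecycle callbacks.

```typescript
// Status names come from the shell's documented states; the panel starts
// at "not connected" before the first connect attempt.
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical event names, one per WebSocket lifecycle hook
// (Connect button click, onopen, onclose, onerror).
type SocketEvent = "connect" | "open" | "close" | "error";

// Pure reducer: given the current status and a socket lifecycle event,
// return the status the panel should display next.
function nextStatus(current: ConnectionStatus, event: SocketEvent): ConnectionStatus {
  if (event === "connect") return "connecting";
  if (event === "open") return "connected";
  if (event === "close") {
    // a close before the socket ever opened is surfaced as an error
    return current === "connecting" ? "error" : "disconnected";
  }
  return "error"; // event === "error"
}
```

Keeping the transition logic separate from the WebSocket object lets the Svelte component subscribe to a single status value.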

#### Responsibilities

Current shell responsibilities:

- connection state rendering
- developer-oriented session metadata rendering
- browser session connect/disconnect controls

Future UI responsibilities:

- audio capture from microphone
- audio playback for TTS
- broader voice-session UI state rendering
- interrupt handling

#### Main Screen

Current shell:

- developer-focused voice-session panel
- connect button
- disconnect button
- connection status indicator
- session metadata display

Future interactive voice screen:

- large mic button
- live transcript
- streamed assistant response text
- state indicator:
  - idle
  - listening
  - thinking
  - speaking
- interrupt button during speaking

### Backend — `vela-gateway`

#### Tech

- Fastify (Node)
- WebSocket-based session layer

The current implementation is a minimal Fastify service with `/`, `/health`, and a documented `/ws` WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection and removes it on disconnect.

#### Responsibilities

- session lifecycle
- audio ingestion
- STT orchestration
- LLM orchestration
- tool execution
- TTS orchestration
- event streaming

#### Current WebSocket skeleton

- `GET /ws` documents the route for plain HTTP clients and returns `426 Upgrade Required`
- WebSocket upgrades on `/ws` create an ephemeral session immediately
- the gateway sends `session.ready` followed by `session.state` (`idle`) when the socket is established
- valid minimal client events can move the session between `idle` and `listening`
- invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
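The defensive frame handling in the skeleton can be sketched as a pure function over (state, raw frame). The client event names `session.start_listening` and `session.stop_listening` are assumptions for illustration; only `session.ready`, `session.state`, `idle`, and `listening` are named by the documented contract.

```typescript
type SessionState = "idle" | "listening";

interface Outcome {
  state: SessionState;
  reply: { type: string; state?: SessionState; message?: string };
}

// Process one inbound text frame: bad input never throws, it just
// produces an error event, so a malformed client cannot crash the gateway.
function handleFrame(state: SessionState, raw: string): Outcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { state, reply: { type: "error", message: "invalid JSON" } };
  }
  const type = (parsed as { type?: unknown })?.type;
  if (type === "session.start_listening" && state === "idle") {
    return { state: "listening", reply: { type: "session.state", state: "listening" } };
  }
  if (type === "session.stop_listening" && state === "listening") {
    return { state: "idle", reply: { type: "session.state", state: "idle" } };
  }
  // malformed envelope or invalid transition: state is left unchanged
  return { state, reply: { type: "error", message: "invalid envelope" } };
}
```

Keeping the transition pure makes the `idle` ↔ `listening` rules testable without opening a socket.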
### Current UI shell behavior

- renders a minimal developer-focused voice-session panel
- exposes connect and disconnect controls only
- does not request microphone permission
- does not send or process audio data
- reads `session.ready`, `session.state`, and `error` messages from the shared protocol contract

## Voice Pipeline

```text
Mic → Gateway → STT → Transcript
              → LLM → Tool Calls → Results
              → LLM → Final Response
              → TTS → Audio Stream → UI
```

## Gateway Internal Flow

```text
1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream
```
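The flow above can be sketched as one async handler with the external services injected as stubs. This is a shape sketch under assumptions, not gateway code: the `Stages` interface, event names, and single tool round-trip are all illustrative (streaming partial transcripts and chunked TTS are elided).

```typescript
type Emit = (event: string, payload: string) => void;

// Placeholder stage signatures standing in for real STT/Ollama/TTS clients.
interface Stages {
  stt: (audio: Uint8Array) => Promise<string>;
  llm: (prompt: string) => Promise<{ tool?: string; text?: string }>;
  tool: (name: string) => Promise<string>;
  tts: (text: string) => Promise<Uint8Array>;
}

async function handleUtterance(audio: Uint8Array, stages: Stages, emit: Emit): Promise<void> {
  const transcript = await stages.stt(audio);          // steps 1-4: audio in, final transcript out
  emit("transcript.final", transcript);
  let decision = await stages.llm(transcript);         // step 5: direct response OR tool call
  if (decision.tool) {
    const result = await stages.tool(decision.tool);   // step 6: gateway executes the tool
    decision = await stages.llm(`${transcript}\n${result}`); // steps 7-8: result fed back
  }
  const text = decision.text ?? "";
  emit("response.text", text);                         // step 9: text stream
  await stages.tts(text);                              // step 10: TTS (audio streaming elided)
  emit("response.audio", "stream");
}
```

Injecting the stages keeps the orchestration testable with fakes before any STT, LLM, or TTS backend is wired in.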
## LLM Layer

### Location

- NAS with RTX 3050 8GB

### Role

- intent parsing
- tool selection
- response generation

### Constraints

- must use a tool-calling schema
- must not directly control systems
- target approximately 7B-class models because of hardware limits
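The first two constraints can be sketched together: the model only *proposes* a structured tool call, and the gateway validates it against an allow-list before executing anything. The shape below is hypothetical; the tool names, fields, and allow-list are not the project's actual schema.

```typescript
// Hypothetical envelope a tool-calling model would emit.
interface ToolCall {
  tool: string;                        // e.g. "home_assistant.call_service" (illustrative)
  arguments: Record<string, unknown>;
}

// The gateway, not the model, controls systems: only allow-listed
// tools are ever executed. Names here are assumptions.
const ALLOWED_TOOLS = new Set(["home_assistant.call_service", "searxng.search"]);

// Validate a raw model output; anything malformed or unlisted is rejected.
function validateToolCall(raw: string): ToolCall | null {
  try {
    const parsed = JSON.parse(raw) as Partial<ToolCall>;
    if (
      typeof parsed.tool === "string" &&
      ALLOWED_TOOLS.has(parsed.tool) &&
      parsed.arguments !== null &&
      typeof parsed.arguments === "object"
    ) {
      return parsed as ToolCall;
    }
  } catch {
    // invalid JSON falls through to rejection
  }
  return null;
}
```

Rejecting rather than repairing bad calls keeps misbehaving 7B-class model output from reaching Home Assistant or other systems.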
## Naming

- system: **Vela**
- gateway: `vela-gateway`
- UI: `vela-ui`
- voice profile: `vela-neutral`