# Vela Architecture
## High-Level Architecture

```text
[ Browser (PWA UI) ]
          |
      WebSocket
          |
[ Vela Gateway (NanoPi R6S) ]
          |
          +--> STT (local or NAS)
          +--> Ollama (NAS GPU)
          +--> Kokoro TTS (NAS or NanoPi)
          +--> Home Assistant
          +--> SearXNG
```

## Core Components
## Repository Structure

```text
apps/
  vela-ui/
  vela-gateway/
```

The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.
### Frontend — `vela-ui`
#### Tech

- SvelteKit
- PWA enabled
- WebSocket client

The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can:

- open and close a browser WebSocket connection to the gateway `/ws` endpoint
- show explicit connection status (`not connected`, `connecting`, `connected`, `disconnected`, `error`)
- expose mic-control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events
- trigger one deterministic mocked turn while connected
- render the mocked user transcript and mocked assistant response for the active session

This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
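The connection-status handling above can be sketched as a pure reducer over browser WebSocket events. This is an illustrative sketch, not the actual `vela-ui` code; the `nextStatus` name and event labels are assumptions.

```typescript
// Hypothetical sketch of the shell's connection-status handling; the
// function and event names are illustrative, not the real vela-ui wiring.
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

type SocketEvent = "connect" | "open" | "close" | "error";

// Pure reducer: given the current status and a WebSocket lifecycle event,
// return the next status the shell should render.
function nextStatus(
  current: ConnectionStatus,
  event: SocketEvent,
): ConnectionStatus {
  switch (event) {
    case "connect": // user pressed the connect button
      return "connecting";
    case "open": // socket handshake completed
      return "connected";
    case "close": // a close after a successful open is a plain disconnect
      return current === "connected" ? "disconnected" : current;
    case "error":
      return "error";
  }
}
```

Keeping the mapping pure makes the five documented states easy to test without opening a real socket.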
#### Responsibilities
Current shell responsibilities:

- connection state rendering
- mocked-turn trigger rendering with disconnected/in-flight guards
- mocked transcript and mocked assistant response rendering
- developer-oriented session metadata rendering
- browser session connect/disconnect controls

Future UI responsibilities:

- audio capture from microphone
- audio playback for TTS
- broader voice-session UI state rendering
- interrupt handling

#### Main Screen
Current shell:

- developer-focused voice-session panel
- connect button
- disconnect button
- mocked-turn button
- connection status indicator
- mocked transcript display
- mocked assistant response display
- session metadata display

Future interactive voice screen:

- large mic button
- live transcript
- streamed assistant response text
- state indicator:
  - idle
  - listening
  - thinking
  - speaking
- interrupt button during speaking

### Backend — `vela-gateway`
#### Tech

- Fastify (Node)
- WebSocket-based session layer

The current implementation is a minimal Fastify service with `/`, `/health`, and a documented `/ws` WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
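The per-socket session bookkeeping can be sketched as a small in-memory registry. The shapes and names here (`Session`, `openSession`, `closeSession`) are assumptions for illustration, not the actual `vela-gateway` code.

```typescript
// Sketch of the ephemeral per-socket session record described above.
// Names and fields are assumptions, not the real vela-gateway types.
interface Session {
  id: string;
  state: "idle" | "listening";
  mockedTurnInFlight: boolean; // at most one mocked turn per session
}

const sessions = new Map<string, Session>();

// Called when a WebSocket upgrade on /ws succeeds.
function openSession(id: string): Session {
  const session: Session = { id, state: "idle", mockedTurnInFlight: false };
  sessions.set(id, session); // one in-memory record per live socket
  return session;
}

// Called on socket close: the record is dropped, nothing is persisted.
function closeSession(id: string): void {
  sessions.delete(id);
}
```

Because the map is the only store, a gateway restart deliberately discards all sessions, matching the "ephemeral" behavior above.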
#### Responsibilities
- session lifecycle
- audio ingestion
- STT orchestration
- LLM orchestration
- tool execution
- TTS orchestration
- event streaming

#### Current WebSocket skeleton

- `GET /ws` documents the route for plain HTTP clients and returns `426 Upgrade Required`
- WebSocket upgrades on `/ws` create an ephemeral session immediately
- the gateway sends `session.ready` followed by `session.state` (`idle`) when the socket is established
- valid minimal client events, including placeholder `input_audio.append` / `input_audio.commit`, can move the session between `idle` and `listening`
- `mocked.turn.trigger` drives a fixed transcript/response event sequence over the existing shared protocol
- only one mocked turn is allowed in flight per session at a time
- invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
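The defensive frame handling in the last point can be sketched as a parse step that never throws: bad input maps to a rejection value instead of an exception. The event type strings come from the documented protocol; the `ParseResult` shape and helper name are assumptions.

```typescript
// Sketch of defensive client-frame handling: invalid JSON and invalid
// envelopes become rejection values, so a malformed frame cannot crash
// the gateway process. ParseResult is an assumed shape.
type ClientEventType =
  | "input_audio.append"
  | "input_audio.commit"
  | "mocked.turn.trigger";

type ParseResult =
  | { ok: true; type: ClientEventType }
  | { ok: false; reason: "invalid_json" | "invalid_envelope" };

const KNOWN_TYPES: ReadonlySet<string> = new Set([
  "input_audio.append",
  "input_audio.commit",
  "mocked.turn.trigger",
]);

function parseClientFrame(raw: string): ParseResult {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
  const type = (value as { type?: unknown } | null)?.type;
  if (typeof type !== "string" || !KNOWN_TYPES.has(type)) {
    return { ok: false, reason: "invalid_envelope" };
  }
  return { ok: true, type: type as ClientEventType };
}
```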
### Current UI shell behavior

- renders a minimal developer-focused voice-session panel
- exposes connect, disconnect, mic-control shell interactions, and mocked-turn controls
- does not request microphone permission or capture real microphone audio
- only emits placeholder `input_audio.append` / `input_audio.commit` events; it does not send real audio data or play back audio
- reads mocked transcript and mocked response events from the shared protocol contract

## Voice Pipeline

```text
Mic control shell / mocked turn button
  → placeholder input_audio.append / input_audio.commit or mocked session flow
  → transcript events
  → response text events
  → UI
```
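Because the mocked slice is deterministic, the whole turn can be pictured as a fixed event list. The event type strings below are illustrative placeholders, not the real shared-protocol names.

```typescript
// Deterministic stand-in for a real turn: the same events in the same
// order every time. Event type names here are invented placeholders.
interface ServerEvent {
  type: string;
  text?: string;
}

function mockedTurnEvents(): ServerEvent[] {
  return [
    { type: "transcript", text: "mocked user transcript" },
    { type: "response.text", text: "mocked assistant response" },
    { type: "session.state", text: "idle" }, // turn complete, back to idle
  ];
}
```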
This mocked vertical slice intentionally stands in for the future real pipeline:

```text
Mic → Gateway → STT → Transcript
              → LLM → Tool Calls → Results
              → LLM → Final Response
              → TTS → Audio Stream → UI
```

## Gateway Internal Flow

```text
 1. Receive audio
 2. Run STT (streaming)
 3. Emit partial transcripts
 4. On final transcript → call LLM
 5. LLM decides → direct response OR tool call
 6. Execute tool
 7. Feed result back to LLM
 8. Generate final response
 9. Send text stream
10. Send TTS stream
```
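The ten steps above can be sketched as one async function with the STT, LLM, tool, and TTS stages injected as callbacks, so the control flow (including the tool-call round trip) is visible without real providers. All interface names and signatures here are assumptions; streaming and partial transcripts are collapsed to plain values.

```typescript
// Sketch of the gateway's turn loop with providers injected. Names and
// shapes are assumptions, not the real vela-gateway interfaces.
interface Providers {
  stt: (audio: Uint8Array) => Promise<string>; // steps 1-3, final transcript only
  llm: (prompt: string) => Promise<{ toolCall?: string; text?: string }>; // step 5
  runTool: (toolCall: string) => Promise<string>; // step 6
  tts: (text: string) => Promise<Uint8Array>; // step 10
}

async function runTurn(
  audio: Uint8Array,
  p: Providers,
): Promise<{ text: string; speech: Uint8Array }> {
  const transcript = await p.stt(audio); // steps 1-4
  let decision = await p.llm(transcript); // step 5: respond or call a tool
  if (decision.toolCall !== undefined) {
    const result = await p.runTool(decision.toolCall); // step 6
    // step 7: feed the tool result back so the LLM can generate the
    // final response (step 8)
    decision = await p.llm(`${transcript}\n[tool result] ${result}`);
  }
  const text = decision.text ?? ""; // step 9 (text stream collapsed to a string)
  return { text, speech: await p.tts(text) }; // step 10
}
```

Injecting the stages keeps the orchestration testable with stubs, which mirrors how the mocked slice already exercises the protocol without external providers.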
## LLM Layer
### Location

- NAS with an RTX 3050 (8 GB)

### Role

- intent parsing
- tool selection
- response generation

### Constraints

- must use a tool-calling schema
- must not directly control systems
- target approximately 7B-class models because of hardware limits
## Naming

- system: **Vela**
- gateway: `vela-gateway`
- UI: `vela-ui`
- voice profile: `vela-neutral`