# Vela Architecture
## High-Level Architecture
```text
[ Browser (PWA UI) ]
         |
     WebSocket
         |
[ Vela Gateway (NanoPi R6S) ]
         |
         +--> STT (local or NAS)
         +--> Ollama (NAS GPU)
         +--> Kokoro TTS (NAS or NanoPi)
         +--> Home Assistant
         +--> SearXNG
```
## Core Components
## Repository Structure
```text
apps/
  vela-ui/
  vela-gateway/
```
The repository now includes separate runnable workspaces for the UI and gateway so implementation can proceed independently while staying aligned through shared documentation.
### Frontend — `vela-ui`
#### Tech
- SvelteKit
- PWA enabled
- WebSocket client
The current implementation is a minimal SvelteKit app with a single voice-session shell page. The shipped UI can:

- open and close a browser WebSocket connection to the gateway `/ws` endpoint
- show explicit connection status (`not connected`, `connecting`, `connected`, `disconnected`, `error`)
- expose mic-control shell interactions that emit placeholder `input_audio.append` / `input_audio.commit` events
- render deterministic placeholder partial/final transcripts for the push-to-talk shell
- stream the mocked assistant response after a push-to-talk commit

This remains a shell only: there is no real microphone capture, real provider integration, or audio playback yet.
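The connection statuses above can be modeled as a small pure reducer that maps browser WebSocket lifecycle events to the label the shell renders. This is an illustrative sketch, not the shipped `vela-ui` code; the `nextStatus` function and the event names are assumptions:

```typescript
// Connection statuses the shell UI displays (names taken from the list above).
type ConnectionStatus =
  | "not connected"
  | "connecting"
  | "connected"
  | "disconnected"
  | "error";

// Hypothetical browser WebSocket lifecycle events that drive the indicator.
type SocketEvent = "connect_clicked" | "open" | "close" | "error";

// Pure reducer: given the current status and a socket event, return the next
// status to render. Keeping this pure makes the shell testable without a
// real WebSocket connection.
function nextStatus(current: ConnectionStatus, event: SocketEvent): ConnectionStatus {
  switch (event) {
    case "connect_clicked":
      return "connecting";
    case "open":
      return "connected";
    case "error":
      return "error";
    case "close":
      // Preserve a prior error; otherwise a close means disconnected.
      return current === "error" ? "error" : "disconnected";
  }
}
```

A Svelte store wrapping this reducer would let the status indicator update reactively as socket events arrive.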
#### Responsibilities
Current shell responsibilities:
- connection state rendering
- mocked transcript and mocked assistant response rendering
- developer-oriented session metadata rendering
- browser session connect/disconnect controls
Future UI responsibilities:
- audio capture from microphone
- audio playback for TTS
- broader voice-session UI state rendering
- interrupt handling
#### Main Screen
Current shell:
- developer-focused voice-session panel
- connect button
- disconnect button
- connection status indicator
- mocked transcript display
- mocked assistant response display
- session metadata display
Future interactive voice screen:
- large mic button
- live transcript
- streamed assistant response text
- state indicator:
  - idle
  - listening
  - thinking
  - speaking
- interrupt button during speaking
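The four indicator states imply a small transition table, with the interrupt control only live while speaking. A hedged TypeScript sketch (event names such as `mic_pressed` are hypothetical, and the real UI may allow different transitions):

```typescript
// The four indicator states listed above.
type VoiceState = "idle" | "listening" | "thinking" | "speaking";

// Hypothetical events a future interactive voice screen could emit.
type VoiceEvent = "mic_pressed" | "commit" | "response_started" | "response_done" | "interrupt";

// Legal transitions per state; anything absent is a no-op.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle:      { mic_pressed: "listening" },
  listening: { commit: "thinking" },
  thinking:  { response_started: "speaking" },
  speaking:  { response_done: "idle", interrupt: "idle" }, // interrupt button only live here
};

// Returns the next state, or the current state if the event is not legal
// there (e.g. pressing interrupt while idle does nothing).
function step(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}
```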
### Backend — `vela-gateway`
#### Tech
- Fastify (Node)
- WebSocket-based session layer
The current implementation is a minimal Fastify service with `/`, `/health`, and a documented `/ws` WebSocket session endpoint. The gateway keeps one ephemeral in-memory session record per live socket connection, removes it on disconnect, and can run one deterministic mocked turn per session without involving any external providers.
#### Responsibilities
- session lifecycle
- audio ingestion
- STT orchestration
- LLM orchestration
- tool execution
- TTS orchestration
- event streaming
#### Current WebSocket skeleton
- `GET /ws` documents the route for plain HTTP clients and returns `426 Upgrade Required`
- WebSocket upgrades on `/ws` create an ephemeral session immediately
- the gateway sends `session.ready` followed by `session.state` (`idle`) when the socket is established
- valid minimal client events, including placeholder `input_audio.append` / `input_audio.commit`, can move the session through the mocked turn states on one socket
- placeholder `input_audio.append` emits deterministic mocked `transcript.partial` events
- `input_audio.commit` emits one deterministic mocked `transcript.final`, then starts the existing mocked assistant response stream
- only one mocked turn is allowed in flight per session at a time
- invalid JSON, invalid envelopes, and malformed frames are handled defensively so the process stays up
- retired `mocked.turn.trigger` messages are rejected with a deterministic recoverable error
### Current UI shell behavior
- renders a minimal developer-focused voice-session panel
- exposes connect, disconnect, and mic-control shell interactions
- does not request microphone permission or capture real microphone audio
- only emits placeholder `input_audio.append` / `input_audio.commit` events; it does not send real audio data or play back audio
- renders the latest placeholder partial transcript during a push-to-talk shell turn
- replaces it with the final deterministic transcript on commit
- appends streamed mocked assistant text for that same push-to-talk turn
- reads mocked transcript and mocked response events from the shared protocol contract
## Voice Pipeline
```text
Mic control shell → Placeholder `input_audio.append` / `input_audio.commit` → Deterministic transcript events → Shared mocked response engine → Mocked response text events → UI
```
This mocked vertical slice intentionally stands in for the future real pipeline:
```text
Mic → Gateway → STT → Transcript
    → LLM → Tool Calls → Results
    → LLM → Final Response
    → TTS → Audio Stream → UI
```
## Gateway Internal Flow
```text
1. Receive audio
2. Run STT (streaming)
3. Emit partial transcripts
4. On final:
   → call LLM
5. LLM decides:
   → direct response OR tool call
6. Execute tool
7. Feed result back to LLM
8. Generate final response
9. Send text stream
10. Send TTS stream
```
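The ten steps above can be collapsed into one async orchestration sketch. The provider interfaces and method names below are hypothetical stand-ins for the real STT, Ollama, and Kokoro integrations, shown only to make the control flow concrete:

```typescript
// Hypothetical provider seams the gateway would call during one turn.
interface Providers {
  stt: (audio: Uint8Array) => Promise<string>;                       // steps 1–4: audio → final transcript
  llm: (prompt: string) => Promise<{ tool?: string; text: string }>; // step 5: direct response OR tool call
  runTool: (name: string) => Promise<string>;                        // step 6: execute tool
  tts: (text: string) => Promise<Uint8Array>;                        // step 10: synthesize speech
}

// One turn, end to end: transcript → (optional tool round-trip) → final
// response text plus synthesized audio.
async function runTurn(audio: Uint8Array, p: Providers): Promise<{ text: string; audio: Uint8Array }> {
  const transcript = await p.stt(audio);
  let reply = await p.llm(transcript);
  if (reply.tool) {
    const result = await p.runTool(reply.tool);                      // step 6
    reply = await p.llm(`${transcript}\n[tool result] ${result}`);   // steps 7–8: feed back, final response
  }
  const speech = await p.tts(reply.text);
  return { text: reply.text, audio: speech };                        // steps 9–10: stream text and audio
}
```

In the real gateway each seam would stream (partial transcripts, token-by-token text, audio chunks) rather than resolve once, but the sequencing is the same.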
## LLM Layer
### Location
- NAS with RTX 3050 8GB
### Role
- intent parsing
- tool selection
- response generation
### Constraints
- must use a tool-calling schema
- must not directly control systems
- target approximately 7B-class models because of hardware limits
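A minimal shape for these constraints: the model emits a structured tool call, and only the gateway decides whether it may run, checked against an allow-list. The tool names and the `ToolCall` shape below are illustrative assumptions, not a defined Vela schema:

```typescript
// Structured call the LLM is constrained to emit instead of acting directly.
interface ToolCall {
  tool: string;                  // e.g. "home_assistant.call_service" (hypothetical name)
  args: Record<string, unknown>;
}

// The gateway's allow-list; tool names here echo the integrations in the
// architecture diagram but are placeholders.
const ALLOWED_TOOLS = new Set(["home_assistant.call_service", "searxng.search"]);

// The gateway, not the model, authorizes execution. Anything off the
// allow-list is rejected, so the LLM never controls systems directly.
function authorize(call: ToolCall): boolean {
  return ALLOWED_TOOLS.has(call.tool);
}
```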
## Naming
- system: **Vela**
- gateway: `vela-gateway`
- UI: `vela-ui`
- voice profile: `vela-neutral`