
# **Edge Data Center Architecture**

## Table of Contents

- [Admin](#admin)
- [Foundational Philosophy](#foundational-philosophy)
  - [Core Methodologies](#core-methodologies)
  - [Security Principles](#security-principles)
  - [Primary And Backup Machines](#primary-and-backup-machines)
  - [Design Intent](#design-intent)
  - [Dev vs Production Process](#dev-vs-production-process)
  - [Traffic Flow](#traffic-flow)
  - [Email/Comms](#emailcomms)
  - [Logging](#logging)
  - [Layered Architecture](#layered-architecture)
    - [Layer – Hardware](#layer-hardware)
    - [Layer - Routing](#layer-routing)
    - [Layer – Operating System Standard](#layer-operating-system-standard)
    - [Layer - VMs](#layer-vms)
- [VM Strategy](#vm-strategy)
  - [Two Apps Max Per VM (Default)](#two-apps-max-per-vm-default)
  - [Core VM Roles](#core-vm-roles)
    - [Primary Workstation/Control Planes](#primary-workstationcontrol-planes)
      - [Windows Host Layer](#windows-host-layer)
      - [WSL Ubuntu Layer](#wsl-ubuntu-layer)
    - [VM Development Layer](#vm-development-layer)
    - [VM Production Layer](#vm-production-layer)
    - [Model Gateway (control-only)](#model-gateway-control-only)
    - [GPU Node (compute-only)](#gpu-node-compute-only)
    - [SSH + Access Layer + Backup](#ssh-access-layer-backup)
    - [Development/Production Machines](#developmentproduction-machines)
    - [Old Machines as HTOP Monitoring Only](#old-machines-as-htop-monitoring-only)
    - [Git Strategy](#git-strategy)
    - [Monitoring Access](#monitoring-access)
    - [VM For Email System](#vm-for-email-system)
- [Networking Backbone](#networking-backbone)
  - [Tailscale Mesh Network](#tailscale-mesh-network)
  - [API Access Model](#api-access-model)
- [AI Integration Strategy](#ai-integration-strategy)
  - [Development-Only AI](#development-only-ai)
  - [Production AI](#production-ai)
  - [Promotion Workflow](#promotion-workflow)
- [Monitoring Strategy](#monitoring-strategy)
  - [Per-VM Monitoring](#per-vm-monitoring)
  - [Future Option](#future-option)
- [Standardization Framework](#standardization-framework)
  - [Canonical Ubuntu Base](#canonical-ubuntu-base)
  - [Directory Standard](#directory-standard)
  - [ProxMox](#proxmox)
  - [Bootstrap Script](#bootstrap-script)
  - [Containerization Philosophy](#containerization-philosophy)
- [Backup SOP](#backup-sop)
  - [Router](#router)
  - [Workstation/Control Planes](#workstationcontrol-planes)
  - [Proxmox](#proxmox-1)
  - [VMs](#vms)
  - [<primary-nas>](#nas0)
  - [Offsite](#offsite)
  - [SSH](#ssh)
  - [Gitea/projects](#giteaprojects)
- [Disaster Recovery Definition](#disaster-recovery-definition)
- [Environment Variable Policy](#environment-variable-policy)
- [Logging Strategy](#logging-strategy)
- [VM Environments](#vm-environments)
- [API Keys](#api-keys)
- [Configurations](#configurations)
  - [Tailscale](#tailscale)
  - [SSH](#ssh-1)
    - [Windows Native SSH (PowerShell / Windows Terminal)](#windows-native-ssh-powershell-windows-terminal)
    - [WSL Ubuntu SSH (Edge Control Plane)](#wsl-ubuntu-ssh-edge-control-plane)
  - [Git](#git)
  - [VM Template](#vm-template)
  - [Proxmox Setup](#proxmox-setup)
  - [VM Template Setup](#vm-template-setup)
  - [VM Caddy-router setup](#vm-caddy-router-setup)
  - [Caddy SOP](#caddy-sop)
  - [VM Setup Using Template Process](#vm-setup-using-template-process)
  - [Install Discourse on ProxMox VM](#install-discourse-on-proxmox-vm)
  - [Ops Monitor Setup And SOP](#ops-monitor-setup-and-sop)

# **Admin**

Original brainstorming chat, named "Edge Data Center Platform Apps Structuring": [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference)

Master list of projects, agents/workflows, tools, script/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference

Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md

# **Foundational Philosophy**

## Core Methodologies

* Standardization over improvisation.
* Reproducibility via templates and scripts.
* VM-level isolation.
* Git-first deployment.
* Minimal dependencies.
* Decentralize and isolate machines and processes to minimize the blast radius if anything is compromised.
* Gradual complexity scaling.

## Security Principles

* No public SSH exposure anywhere
* No port forwarding.
* Tailscale for private networking.
* Separate dev and prod.
* Keep Ubuntu/Proxmox/Installs clean.
* Principle of minimal installed services.
* AI agents never read production.env, never hold a production SSH key, and never hold production DB credentials. AI agents operate only in dev. Full AI security policy here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Production public access:
  * Reverse proxy (Caddy) controls ports 80 and 443 and routes to all apps from the Internet
  * HTTPS only
* Domains: domain transfer lock, domain privacy
* Firewalls on all systems
* Dedicated infrastructure email account used for cloud service (Cloudflare, GoDaddy, etc.) logins, operating in its own browser session, with minimal browser apps installed
* MFA and YubiKey for infrastructure account logins
* uBlock Origin/Lite
* Backups:
  * 3 copies of data (e.g., NAS)
  * 2 different storage types (e.g., cloud)
  * 1 copy stored offsite (e.g., external drive)
* Full online security policies and practices here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Cloudflared could be added as well, but may not be necessary
* Route domains through Cloudflare in proxied mode (orange cloud)
* Apps run in VMs, not Docker or other container runtimes. VMs limit blast radius the most.
* See C:\projects\infra\edge-data-center-main\online-security-practices.md for more
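
The 3-2-1 backup rule listed above can be checked mechanically. The following is a minimal sketch, not part of the existing tooling; the `BackupCopy` type, its field names, and the example inventory are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label for the copy
    media: str     # storage type, e.g. "nas", "cloud", "external-drive"
    offsite: bool  # True if the copy is kept away from the main location


def satisfies_3_2_1(copies: list) -> bool:
    """Return True when the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3                   # 3 copies of data
    enough_media = len({c.media for c in copies}) >= 2  # 2 storage types
    has_offsite = any(c.offsite for c in copies)        # 1 copy offsite
    return enough_copies and enough_media and has_offsite


# Illustrative inventory only; replace with the real backup targets.
copies = [
    BackupCopy("nas-copy", "nas", offsite=False),
    BackupCopy("cloud-copy", "cloud", offsite=True),
    BackupCopy("portable-drive", "external-drive", offsite=True),
]
print(satisfies_3_2_1(copies))
```

A check like this could run after each backup job so that a silently missing copy is noticed early.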

## Primary And Backup Machines

Every primary standalone VM has a secondary warm backup machine that takes over if the primary fails.

## Design Intent

This architecture supports:

* Remote secure access.
* AI-assisted development.
* Production stability.
* Monitoring visibility.
* Future GPU expansion.
* Sovereign model hosting.
* Reduced SaaS dependency.
* Long-term cost control.

## Dev vs Production Process

Production is always on its own machine, with a warm backup on standby in case the primary fails.

Dev is done on a separate machine: either an ephemeral dev machine or a dev machine with dev apps in Docker containers. If Docker is used:

Git transports:

- source code
- configs
- assets
- scripts

Docker provides only:

- runtime packaging
- dependency isolation
- environment management

So if:

- your app code is portable
- dependencies are compatible
- paths/configs are sane

then:

Docker vs non-Docker is mostly irrelevant.

Whether Docker is used or not, Git is used to maintain code.

However, the runtime must be the same across all machines.

## Traffic Flow

* All domains are routed through Cloudflare in proxied mode (orange cloud), with cache and security rules in place.
* For inbound access: no open ports, Tailscale only.
* Internal on LAN: SSH keys only, no passwords.
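
The "SSH keys only, no passwords" rule can be verified against a host's `sshd_config`. The sketch below is an assumption-laden illustration: it only reads global, whitespace-separated `Keyword value` lines, stops at the first `Match` block, and ignores `Include` files and other authentication methods. The function names are hypothetical.

```python
def effective_sshd_options(config_text: str) -> dict:
    """Parse global sshd_config options; the first value for a keyword wins."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # sshd keywords are case-insensitive
        if keyword == "match":
            break  # Match blocks are conditional; this sketch checks globals only
        value = parts[1].strip().lower() if len(parts) > 1 else ""
        options.setdefault(keyword, value)  # keep the first occurrence
    return options


def is_key_only(config_text: str) -> bool:
    """True if passwords are explicitly disabled and public keys are allowed."""
    opts = effective_sshd_options(config_text)
    # OpenSSH enables PasswordAuthentication by default, so it must be set to "no".
    password_off = opts.get("passwordauthentication", "yes") == "no"
    pubkey_on = opts.get("pubkeyauthentication", "yes") == "yes"
    return password_off and pubkey_on


sample = "# hardened example\nPasswordAuthentication no\nPubkeyAuthentication yes\n"
print(is_key_only(sample))
```

On a real host, the effective configuration printed by `sshd -T` is a more reliable input than the raw file, since it resolves includes and defaults.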

## Email/Comms

* Gmail is primary
* Thunderbird connects to Gmail accounts over IMAPS for offline backup
* Secondary SMTP set up to send from desktop if Gmail is down
* mbsync Thunderbird to email backup
* All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS

## Logging
https://chat.example.invalid/private-reference

* Prometheus+Grafana = server hardware metrics (gauges)
* Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
* Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
* Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
* Container logs: Docker restarts, failures
* Security logs: fail2ban, UFW, Cloudflare security events
* VM inventory scan: scan all apps, software, dependencies, packages installed
* Cloudflare telemetry: edge telemetry from Cloudflare.
* Email telemetry: listmonk and Postmark data on emails: opens, bounces, clicks, etc.
* AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
* Google Analytics telemetry
* Business event telemetry: payments, subscriptions, signups
* Uptime monitoring: uptimemonitor for URL-based apps; apps without a URL, like ops monitor, need separate alerts.
* Internet-down monitoring: a Raspberry Pi does this through uptime monitor
* Electricity monitoring: is the grid down, is the UPS down, solar production
* Public network: Cloudflare down, internet down, etc.
* Weather data
* Calendar, browser history, location data, image metadata, Garmin FIT data

Any quantitative data that can be reasoned over to find correlations and patterns. Logs are sent to a unified telemetry lake and parsed as needed for analysis.
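
For very different sources to land in one telemetry lake, each record needs a common envelope. The sketch below shows one possible shape as JSON lines. The field names (`ts`, `source`, `kind`, `payload`) and the `make_event` helper are assumptions, not an existing schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_event(source: str, kind: str, payload: dict,
               ts: Optional[datetime] = None) -> str:
    """Wrap one record in a common envelope and return it as a JSON line."""
    ts = ts or datetime.now(timezone.utc)
    record = {
        "ts": ts.astimezone(timezone.utc).isoformat(),  # always stored in UTC
        "source": source,    # e.g. "caddy", "fail2ban", "listmonk"
        "kind": kind,        # e.g. "access", "security", "email"
        "payload": payload,  # source-specific fields, kept unchanged
    }
    return json.dumps(record, sort_keys=True)


print(make_event("caddy", "access", {"status": 200, "path": "/"}))
```

Keeping the source-specific fields inside `payload` means new log sources can be added without changing the envelope.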

## Layered Architecture

### *Layer – Hardware*

Tier 1, 2 and 3 classification:  see Physical Servers and PC SOP:  C:\projects\infra\edge-data-center-main\physical-machine-tiers.md

### *Layer \- Routing*

* Caddy runs on its own VM, which is public-facing and the only publicly exposed VM
* Proxmox only
* No extra tooling installed
* Treated like firmware
* Clean, stable, minimal

### *Layer – Operating System Standard*

* Ubuntu 24.04 LTS (Noble) everywhere
* No Debian/Ubuntu mixing
* No version drift

### *Layer \- VMs*

# **VM Strategy**

## Two Apps Max Per VM (Default)

* Each machine runs Proxmox and two VMs, each VM with one app. This could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending huge amounts of time reinstalling many VMs and apps on one machine.
* No Docker unless the app requires it; in some cases a VM will be dockerized to host additional apps.
* Isolation handled at the VM level.
* Services managed via systemd.

This reduces complexity and improves debuggability.

## Core VM Roles

### *Primary Workstation/Control Planes*

#### Windows Host Layer

* Tailscale
* Git For Windows \- used as primary git
* GitHub Desktop \- used as primary git
* FileZilla
* AI coding agent installed via VS Code
* Development only
* Snapshottable
* Windows Terminal
* PowerShell 7
* BitLocker: if Windows Pro is installed
* Python for personal automation as needed

#### WSL Ubuntu Layer

* Git
* OpenSSH client
* curl
* build-essential
* Python
* Node
* AI CLI tooling
* Optional Docker (if required)
* API keys
* tmux (session persistence)
* htop (local visibility)
* jq (JSON processing)
* yq (YAML processing)
* direnv (env management)
* gh CLI (GitHub CLI)
* make (if using build pipelines)
* Ripgrep
* Fzf
* Whisper STT for speech-to-text processing

### *VM Development Layer*

* Prometheus+Grafana+Loki+Promtail...logging layer
* Git
* OpenSSH client
* curl
* build-essential
* Python
* Node
* Optional Docker (if required)
* tmux (session persistence)
* htop (local visibility)
* jq (JSON processing)
* yq (YAML processing)
* direnv (env management)
* gh CLI (GitHub CLI)
* make (if using build pipelines)
* Ripgrep
* Fzf
* Unzip
* Tailscale
* fail2ban

### *VM Production Layer*

* openssh-server
* tailscale
* Prometheus+Grafana...logging layer
* runtime dependencies for the app
* systemd service
* git

### *Model Gateway (control-only)*

* Lightweight API service

* Auth/rate limiting for internal callers (apps)

* Logging/metrics

* Provider routing config

* Tailscale

* Prometheus+Grafana...logging layer

### *GPU Node (compute-only)*

* Ubuntu 24.04

* NVIDIA drivers

* Model server

* Tailscale

* No Git

* Prometheus+Grafana
* No reverse proxy

* No public ports

### *SSH \+ Access Layer \+ Backup*

Primary workstation/control planes must have:

* SSH keys configured
* Known\_hosts cleanly maintained
* Key-based login only
* No passwords
* BitLocker enabled
* Secure backup of SSH keys
* Encrypted backup of WSL distro

### *Development/Production Machines*

* Prometheus+Grafana...logging layer
* Tailscale installed

### *Old Machines as HTOP Monitoring Only*

* antiX Linux
* openssh-server
* tailscale
* fail2ban
* tmux
* htop
* curl
* jq
* net-tools

### *Git Strategy*

Primary workstation/control planes should be:

* Primary Git origin working copy on Windows workstation/control planes
* Push to GitHub
* Dev VM clones from Git
* Prod pulls from Git
* No manual SFTP deploy.

### *Monitoring Access*

Primary workstation/control planes:

* Access Prometheus+Grafana...logging layer dashboards via Tailscale
* SSH into VMs for deeper inspection
* Does NOT host monitoring
* Keep monitoring on VMs.

### *VM For Email System*

Useful for logs and mail utilities

* `sudo apt install -y mailutils logrotate ca-certificates`

# **Networking Backbone**

## Tailscale Mesh Network

Purpose:

* Secure private communication between all machines.
* No port forwarding.
* No exposed public services.

Characteristics:

* WireGuard-based encryption.
* Private 100.x.x.x addressing.
* Works locally and remotely.
* Same addressing scheme everywhere.

Installed on:

* All hardware

## API Access Model

Internal services (e.g., inference server) bind to:

* 100.x.x.x:port
* Not exposed publicly.
* Applications communicate securely over the Tailscale mesh.
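
A service can refuse to start if its bind address is not a Tailscale address. The sketch below assumes Tailscale's IPv4 addresses come from the CGNAT range `100.64.0.0/10`, which is narrower than "100.x.x.x". The helper names are hypothetical.

```python
import ipaddress

# Assumption: Tailscale assigns IPv4 node addresses from 100.64.0.0/10 (CGNAT).
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(addr: str) -> bool:
    """True if addr parses as an IP address inside the Tailscale range."""
    try:
        return ipaddress.ip_address(addr) in TAILSCALE_NET
    except ValueError:
        return False  # not an IP address at all


def require_private_bind(addr: str) -> str:
    """Return addr unchanged, or raise if binding there would expose the service."""
    if not is_tailscale_ip(addr):
        raise ValueError(f"refusing to bind to non-Tailscale address: {addr}")
    return addr


print(is_tailscale_ip("100.101.102.103"))
print(is_tailscale_ip("0.0.0.0"))
```

Calling `require_private_bind` before opening the listening socket turns a misconfigured `0.0.0.0` bind into a startup error instead of a silent exposure.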

# **AI Integration Strategy**

## Development-Only AI

AI tools installed only on:

* <primary-workstation> WSL

AI Responsibilities:

* Edit code
* Run tests
* Restart dev services
* Assist in development

AI does NOT:

* Access production directly
* Hold production credentials
* Modify live systems

## Production AI

* GPU nodes are never publicly exposed.

* Inference APIs bind only to Tailscale IP.

* Public traffic flows through reverse proxy and app layer.

* App layer enforces auth, rate limits, billing, logging.

* GPU nodes are compute-only and stateless.

## Promotion Workflow

dev → git commit → git push → prod pulls → restart service

Production remains deterministic.
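
The production half of this workflow (pull, then restart) could be scripted roughly as follows. This is a sketch under stated assumptions: the app path and service name are placeholders, `--ff-only` is a suggested safeguard rather than the documented procedure, and `systemctl restart` normally needs root.

```python
import subprocess


def promotion_commands(app_dir: str, service: str) -> list:
    """Build the commands for the 'prod pulls, then restart service' steps."""
    return [
        # --ff-only refuses to merge, so prod only ever moves to pushed commits.
        ["git", "-C", app_dir, "pull", "--ff-only"],
        ["systemctl", "restart", service],
    ]


def promote(app_dir: str, service: str, dry_run: bool = True) -> None:
    """Run the promotion steps, or only print them when dry_run is True."""
    for cmd in promotion_commands(app_dir, service):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing step


# Hypothetical app path and unit name, shown in dry-run mode only.
promote("/srv/apps/example-app", "example-app.service")
```

Because `check=True` aborts on a failed pull, the service is not restarted on stale or conflicting code.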

# **Monitoring Strategy**

## Per-VM Monitoring

Install on each production VM:

* Prometheus+Grafana...logging layer (real-time dashboard)
* No central aggregation required initially.
* Tmux \>\>\>htop

Monitor:

* CPU
* Memory
* Load
* Disk I/O
* Network
* Process count
* Docker (if ever used)

## Future Option

If scaling increases:

* Add alerting layer (e.g., Uptime Kuma)
* Add PostHog for more analysis

# **Standardization Framework**

## Canonical Ubuntu Base

All VMs must:

* Run Ubuntu 24.04 LTS (Noble)
* Use same apt repositories
* Use same SSH configuration
* Use same directory layout

## Directory Standard

* /srv/apps
* <app-data-root>
* /srv/backups
* Consistent across all machines.
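
The layout above can be created idempotently, so a bootstrap step may be re-run safely. This minimal sketch covers only the two literal paths (`/srv/apps`, `/srv/backups`). The app data root is left out because its path is not given here, and the demo runs in a temporary directory rather than the real `/srv`.

```python
import tempfile
from pathlib import Path

# Sub-directories of /srv named in the standard above.
STANDARD_SUBDIRS = ("apps", "backups")


def ensure_layout(srv_root: Path) -> list:
    """Create the standard directories under srv_root and return their paths."""
    created = []
    for name in STANDARD_SUBDIRS:
        path = srv_root / name
        path.mkdir(parents=True, exist_ok=True)  # idempotent: safe to re-run
        created.append(path)
    return created


# Demo against a throwaway directory instead of the real /srv.
with tempfile.TemporaryDirectory() as tmp:
    for path in ensure_layout(Path(tmp) / "srv"):
        print(path.name, path.is_dir())
```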

## ProxMox

Installed on multi-core machines to create and manage VMs

See Setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"

* Proxmox VE
* SSH (key only)
* Tailscale
* Proxmox firewall

## Bootstrap Script

Create a reusable bootstrap.sh

Installs:

* Baseline packages
* Prometheus+Grafana...logging layer (prod only)
* Tailscale
* Directory structure
* User setup

Ensures reproducibility.

## Containerization Philosophy

Default:

* No Docker.

* Native systemd services.

Use Docker only when:

* Multiple services per VM required.

* Isolation becomes necessary.

* CI/CD complexity grows.

# **Backup SOP**

## Router

Backup all configs and download \- note router version number as well. Save to <primary-nas>. Restore to backup router as well.  See C:\projects\infra\hardware\routers for specific instructions

## Workstation/Control Planes

Google Drive real time to OneDrive via BackupHQ

Backup WSL Ubuntu with new software changes/updates as tar to <primary-nas>. Manual tar → copy to

See  /path/to/infra\\edge-<primary-workstation>\\README.md

## Proxmox

Backup config files only.  See [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference) Caddy Installation Guide towards end

## VMs

* Proxmox UI\>VM\>Backup : Manual, see scripts file
* Snapshots via proxmox UI \- Manual, before server updates
* VM specific: from wsl workstation:   see scripts file
* Automated backup daily 2 am:  see [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
  * Backup lod dev, lod prod, community-indx-earth to <primary-nas> promox-backup
  * Real time sync <primary-nas> to <secondary-nas>; real time sync nas to expansion drives

## <primary-nas>

* Real-time backup to <secondary-nas> and external drives

## Offsite

* Backups are copied to portable external drives. If the location will be left unattended, the last person to leave takes the external drive with them.

## SSH

* SSH key backup location: <primary-nas>

## Gitea/projects

# **Disaster Recovery Definition**

Failure Scenarios

* Dell dies → travel laptop takes over

* AVA dies → clone server takes over

* <primary-nas> dies → <secondary-nas> restore

* GPU node dies â†’ inference unavailable but no data loss

# **Environment Variable Policy**

* .env policy: never committed to Git; enforce via gitignore

* Secrets store on workstation/control plane: wsl..\~/.secrets/

* Backup and store on <primary-nas>

* Dev:  .env loaded by app

* Production: apps receive secrets via a systemd environment file: EnvironmentFile=<app-install-path>

* .env never world-readable

* File permissions set to 600

* AI agents must never read production.env
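
The permission rules above (mode 600, never world-readable) can be audited with the standard library. This is a sketch under stated assumptions: the `env_file_problems` helper and the file name in the demo are illustrative, and the check is meaningful on POSIX filesystems only.

```python
import stat
import tempfile
from pathlib import Path


def env_file_problems(path: Path) -> list:
    """Return a list of policy violations for one env file (empty if clean)."""
    problems = []
    mode = stat.S_IMODE(path.stat().st_mode)  # permission bits only
    if mode != 0o600:
        problems.append(f"mode is {mode:o}, expected 600")
    if mode & stat.S_IROTH:
        problems.append("file is world-readable")
    return problems


# Demo on a throwaway file; a real audit would point at the deployed env file.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / "example.env"
    env_file.write_text("EXAMPLE_KEY=placeholder\n")
    env_file.chmod(0o644)
    print(env_file_problems(env_file))
    env_file.chmod(0o600)
    print(env_file_problems(env_file))
```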

# **Logging Strategy**

* journald persistent storage enabled
* SystemMaxUse=500M (or similar)
* logrotate enabled for app logs
* Application logs stored in: <app-data-root>/logs/appname

# **VM Environments**

https://docs.example.invalid/private-reference

# **API Keys**

* Stored in password vault/manager
* Rotate keys if a workstation is compromised

# **Configurations**

## Tailscale

<private-ip> <primary-workstation> (<primary-workstation>)  <user>@  windows

<private-ip> caddy-router

 ssh \-i \~/.ssh/edge\_control\_plane <ssh-user>@<private-ip>

## SSH

### *Windows Native SSH (PowerShell / Windows Terminal)*

main workstation:  <windows-user-profile>\\.ssh backed up to NAS/Archive

In WSL,  <windows-wsl-user-dir>/.ssh/

Use the above when running SSH directly from Windows, or when using FTP, PuTTY, etc.

### *WSL Ubuntu SSH (Edge Control Plane)*

Main workstation:  <linux-user-home>/.ssh/  backed up to NAS/Archive

Key created with Google Ed password to encrypt the backups using 7-zip

Use the above when running SSH from inside WSL, managing Proxmox VMs, or acting as the edge control plane.

\~/.ssh/edge\_control\_plane

\~/.ssh/edge\_control\_plane.pub

## Git

Edge-control-plane:  git repository for workstations/control plane

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)

Edge-pre01: git repository for pre01 server running proxmox and VMs

## VM Template

* VM 9000 → ubuntu-2404-base
* Status → Converted to template
* 32GB disk
* 2GB RAM
* Clean state
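
New VMs are cloned from template 9000. The sketch below only builds the Proxmox `qm clone` command line and does not run it. The new VM ID and name are placeholders, and the flags should be confirmed with `qm help clone` on the Proxmox host before use.

```python
TEMPLATE_VMID = 9000  # ubuntu-2404-base, per the list above


def clone_command(new_vmid: int, name: str, full: bool = True) -> list:
    """Build a 'qm clone' command that copies the template to a new VM."""
    if new_vmid == TEMPLATE_VMID:
        raise ValueError("new VM ID must differ from the template ID")
    cmd = ["qm", "clone", str(TEMPLATE_VMID), str(new_vmid), "--name", name]
    if full:
        cmd.append("--full")  # full copy instead of a linked clone
    return cmd


# Hypothetical VM ID and name, printed for review rather than executed.
print(" ".join(clone_command(101, "example-app-dev")))
```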

## Proxmox Setup

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)

## VM Template Setup

Use this document

## VM Caddy-router setup

C:\projects\infra\software\caddy-router

## Caddy SOP

C:\projects\infra\software\caddy-router

## VM Setup Using Template Process

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)

## Install Discourse on ProxMox VM

"/path/to/infra\\edge-community-indx-earth\\discourse-export-topics-csv\\docs\\Discourse Install on ProxMox VM.md"

## Ops Monitor Setup And SOP

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)