# **Edge Data Center Architecture**

## Table of Contents

- [Admin](#admin)
- [Foundational Philosophy](#foundational-philosophy)
  - [Core Methodologies](#core-methodologies)
  - [Security Principles](#security-principles)
  - [Primary And Backup Machines](#primary-and-backup-machines)
  - [Design Intent](#design-intent)
  - [Dev vs Production Process](#dev-process)
  - [Traffic Flow](#traffic-flow)
  - [Email/Comms](#emailcomms)
  - [Logging](#logging)
  - [Layered Architecture](#layered-architecture)
    - [Layer – Hardware](#layer-hardware)
    - [Layer - Routing](#layer-routing)
    - [Layer – Operating System Standard](#layer-operating-system-standard)
    - [Layer - VMs](#layer-vms)
- [VM Strategy](#vm-strategy)
  - [One App Per VM (Default)](#one-app-per-vm-default)
  - [Core VM Roles](#core-vm-roles)
    - [Primary Workstation/Control Planes](#primary-workstationcontrol-planes)
      - [Windows Host Layer](#windows-host-layer)
      - [WSL Ubuntu Layer](#wsl-ubuntu-layer)
    - [VM Development Layer](#vm-development-layer)
    - [VM Production Layer](#vm-production-layer)
    - [Model Gateway (control-only)](#model-gateway-control-only)
    - [GPU Node (compute-only)](#gpu-node-compute-only)
    - [SSH + Access Layer + Backup](#ssh-access-layer-backup)
    - [Development/Production Machines](#developmentproduction-machines)
    - [Old Machines as HTOP Monitoring Only](#old-machines-as-htop-monitoring-only)
    - [Git Strategy](#git-strategy)
    - [Monitoring Access](#monitoring-access)
    - [VM For Email System](#vm-for-email-system)
- [Networking Backbone](#networking-backbone)
  - [Tailscale Mesh Network](#tailscale-mesh-network)
  - [API Access Model](#api-access-model)
- [AI Integration Strategy](#ai-integration-strategy)
  - [Development-Only AI](#development-only-ai)
  - [Production AI](#production-ai)
  - [Promotion Workflow](#promotion-workflow)
- [Monitoring Strategy](#monitoring-strategy)
  - [Per-VM Monitoring](#per-vm-monitoring)
  - [Future Option](#future-option)
- [Standardization Framework](#standardization-framework)
  - [Canonical Ubuntu Base](#canonical-ubuntu-base)
  - [Directory Standard](#directory-standard)
  - [ProxMox](#proxmox)
  - [Bootstrap Script](#bootstrap-script)
  - [Containerization Philosophy](#containerization-philosophy)
- [Backup SOP](#backup-sop)
  - [Router](#router)
  - [Workstation/Control Planes](#workstationcontrol-planes)
  - [Proxmox](#proxmox-1)
  - [VMs](#vms)
  - [](#nas0)
  - [Offsite](#offsite)
  - [SSH](#ssh)
  - [Gitea/projects](#giteaprojects)
- [Disaster Recovery Definition](#disaster-recovery-definition)
- [Environment Variable Policy](#environment-variable-policy)
- [Logging Strategy](#logging-strategy)
- [VM Environments](#vm-environments)
- [API Keys](#api-keys)
- [Configurations](#configurations)
  - [Tailscale](#tailscale)
  - [SSH](#ssh-1)
    - [Windows Native SSH (PowerShell / Windows Terminal)](#windows-native-ssh-powershell-windows-terminal)
    - [WSL Ubuntu SSH (Edge Control Plane)](#wsl-ubuntu-ssh-edge-control-plane)
  - [Git](#git)
  - [VM Template](#vm-template)
  - [Proxmox Setup](#proxmox-setup)
  - [VM Template Setup](#vm-template-setup)
  - [VM Caddy-router setup](#vm-caddy-router-setup)
  - [Caddy SOP](#caddy-sop)
  - [VM Setup Using Template Process](#vm-setup-using-template-process)
  - [Install Discourse on ProxMox VM](#install-discourse-on-proxmox-vm)
  - [Ops Monitor Setup And SOP](#ops-monitor-setup-and-sop)

# **Admin**

Original brainstorming chat, named "Edge Data Center Platform Apps Structuring": [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference)

Master list of projects, agents/workflows, tools, scripts/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference

Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md

# **Foundational Philosophy**

## Core Methodologies

* Standardization over improvisation.
* Reproducibility via templates and scripts.
* VM-level isolation.
* Git-first deployment.
* Minimal dependencies.
* Decentralize and isolate machines and processes to minimize the blast radius if anything is compromised.
* Gradual complexity scaling.

## Security Principles

* No public SSH exposure anywhere.
* No port forwarding.
* Tailscale for private networking.
* Separate dev and prod.
* Keep Ubuntu/Proxmox/installs clean.
* Principle of minimal installed services.
* AI agents never read production.env, hold a production SSH key, or hold production DB credentials. AI agents operate only in dev. Full AI security policy here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Production public access:
  * Reverse proxy (Caddy) controls ports 80 and 443 and routes Internet traffic to all apps
  * HTTPS only
* Domains: domain transfer lock, domain privacy
* Firewalls on all systems
* Dedicated infrastructure email account used for cloud service (Cloudflare, GoDaddy, etc.) logins, operating in its own browser session, with minimal browser apps installed
* MFA and YubiKey for infrastructure account logins
* uBlock Origin/Lite
* Backups:
  * 3 copies of data (e.g., NAS)
  * 2 different storage types (e.g., cloud)
  * 1 copy stored offsite (e.g., external drive)
* Full online security policies and practices here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Cloudflared could be added as well, but may not be necessary
* Route domains through Cloudflare with the proxy enabled (orange cloud)
* VM apps, not Docker or other. VMs limit blast radius the most.
* See C:\projects\infra\edge-data-center-main\online-security-practices.md for more

## Primary And Backup Machines

A primary stand-alone VM always has a secondary warm backup machine in case the primary fails.

## Design Intent

This architecture supports:

* Remote secure access.
* AI-assisted development.
* Production stability.
* Monitoring visibility.
* Future GPU expansion.
* Sovereign model hosting.
* Reduced SaaS dependency.
* Long-term cost control.
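
The 3-2-1 backup rule listed under Security Principles (3 copies, 2 storage types, 1 offsite) lends itself to a mechanical check. The following is a minimal sketch, not part of the existing tooling: the `BackupCopy` type, the `satisfies_3_2_1` function, and the copy names and media labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset. All field values used here are illustrative."""
    name: str        # hypothetical label, e.g. "nas"
    media_type: str  # storage type, e.g. "nas", "cloud", "external-drive"
    offsite: bool    # True if the copy lives away from the primary site


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Return True when the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media_type for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


if __name__ == "__main__":
    # Assumed example plan: NAS on site, cloud copy, portable drive taken offsite.
    plan = [
        BackupCopy("nas", "nas", offsite=False),
        BackupCopy("cloud", "cloud", offsite=True),
        BackupCopy("portable-drive", "external-drive", offsite=True),
    ]
    print(satisfies_3_2_1(plan))
```

A check like this could run against an inventory of backup targets so that removing one target surfaces as a failed rule rather than a silent gap.
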
## Dev vs Production Process

Production is always on its own machine, with a warm backup on standby in case the primary fails.

Dev is done on a separate machine: either an ephemeral dev machine or a dev machine with dev apps in Docker containers.

If Docker:

Git is transporting:

- source code
- configs
- assets
- scripts

Docker is only:

- runtime packaging
- dependency isolation
- environment management

So if:

- your app code is portable
- dependencies are compatible
- paths/configs are sane

then: Docker vs non-Docker is mostly irrelevant.

Whether Docker or not, Git is used to maintain code. But the runtime has to be the same across all machines.

## Traffic Flow

* All domains routed through Cloudflare with the proxy enabled (orange cloud), with cache and security rules in place.
* For inbound entry: no open ports, only Tailscale.
* Internal on LAN: SSH keys only, no passwords.

## Email/Comms

* Gmail is primary
* Thunderbird connects to Gmail accounts over IMAPS for offline backup
* Secondary SMTP setup to send from desktop if Gmail is down
* mbsync Thunderbird to email backup
* All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS

## Logging

https://chat.example.invalid/private-reference

* Prometheus+Grafana = server hardware metrics (gauges)
* Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
* Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
* Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
* Container logs: Docker restarts, failures
* Security logs: fail2ban, UFW, Cloudflare security events
* VM inventory scan: scan all apps, software, dependencies, packages installed
* Cloudflare telemetry: edge telemetry from CF.
* Email telemetry: listmonk and Postmark data on emails: opens, bounces, clicks, etc.
* AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
* Google Analytics telemetry
* Business event telemetry: payments, subscriptions, signups
* Uptime monitoring: uptimemonitor for URL-based apps; non-URL apps, like ops monitor, need alerts set separately.
* Internet-down monitoring: Raspberry Pi does this through uptime monitor
* Electricity monitoring: is grid down, is UPS down, solar production
* Public network: Cloudflare down, internet down, etc.
* Weather data
* Calendar, browser history, location data, image metadata, Garmin FIT data

Any quantitative data to reason over and to find correlations and patterns in. Logging gets sent to a unified telemetry lake and is parsed as needed for analysis.

## Layered Architecture

### *Layer – Hardware*

Tier 1, 2 and 3 classification: see Physical Servers and PC SOP: C:\projects\infra\edge-data-center-main\physical-machine-tiers.md

### *Layer \- Routing*

* Caddy on its own VM, which is public facing and the only publicly exposed VM
* Proxmox only
* No extra tooling installed
* Treated like firmware
* Clean, stable, minimal

### *Layer – Operating System Standard*

* Ubuntu 24.04 LTS (Noble) everywhere
* No Debian/Ubuntu mixing
* No version drift

### *Layer \- VMs*

# **VM Strategy**

## Two Apps Max Per VM (Default)

* Each machine runs Proxmox and two VMs, each VM with an app. It could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending a huge amount of time reinstalling a bunch of VMs and apps on one machine.
* No Docker unless the app requires it; in some cases a VM will be dockerized to host additional apps.
* Isolation handled at the VM level.
* Services managed via systemd.

This reduces complexity and improves debuggability.
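
As an illustration of "services managed via systemd", the sketch below renders a minimal unit file for a single app. It is only a sketch under stated assumptions: the `render_unit` helper, the app name `exampleapp`, and the paths under `/srv/apps` are hypothetical, and the directives shown (`EnvironmentFile=`, `ExecStart=`, `Restart=`, `WantedBy=`) are standard systemd options rather than this project's actual unit files.

```python
def render_unit(app: str, exec_start: str, env_file: str, user: str) -> str:
    """Return the text of a minimal systemd service unit for one app."""
    lines = [
        "[Unit]",
        f"Description={app} service",
        # Start only after the network is up, since apps talk over the mesh.
        "After=network-online.target",
        "Wants=network-online.target",
        "",
        "[Service]",
        # Run as a dedicated, non-root user.
        f"User={user}",
        # Secrets are injected by systemd, not read from Git.
        f"EnvironmentFile={env_file}",
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All names and paths below are hypothetical placeholders.
    print(render_unit(
        app="exampleapp",
        exec_start="/srv/apps/exampleapp/run.sh",
        env_file="/srv/apps/exampleapp/production.env",
        user="exampleapp",
    ))
```

Generating units from one function keeps every VM's service definition identical in shape, which fits the standardization goal; the rendered text would still need to be written to `/etc/systemd/system/` and enabled by hand or by a bootstrap script.
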
## Core VM Roles

### *Primary Workstation/Control Planes*

#### Windows Host Layer

* Tailscale
* Git For Windows \- used as primary git
* GitHub Desktop \- used as primary git
* FileZilla
* AI coding agent installed via VS Code
* Development only
* Snapshottable
* Windows Terminal
* PowerShell 7
* BitLocker: if Windows Pro version installed
* Python for personal automation as needed

#### WSL Ubuntu Layer

* Git
* OpenSSH client
* curl
* build-essential
* Python
* Node
* AI CLI tooling
* Optional Docker (if required)
* API keys
* tmux (session persistence)
* htop (local visibility)
* jq (JSON processing)
* yq (YAML processing)
* direnv (env management)
* gh CLI (GitHub CLI)
* make (if using build pipelines)
* ripgrep
* fzf
* Whisper STT for speech-to-text processing

### *VM Development Layer*

* Prometheus+Grafana+Loki+Promtail...logging layer
* Git
* OpenSSH client
* curl
* build-essential
* Python
* Node
* Optional Docker (if required)
* tmux (session persistence)
* htop (local visibility)
* jq (JSON processing)
* yq (YAML processing)
* direnv (env management)
* gh CLI (GitHub CLI)
* make (if using build pipelines)
* ripgrep
* fzf
* unzip
* Tailscale
* fail2ban

### *VM Production Layer*

* openssh-server
* tailscale
* Prometheus+Grafana...logging layer
* runtime dependencies for the app
* systemd service
* git

### *Model Gateway (control-only)*

* Lightweight API service
* Auth/rate limiting for internal callers (apps)
* Logging/metrics
* Provider routing config
* Tailscale
* Prometheus+Grafana...logging layer

### *GPU Node (compute-only)*

* Ubuntu 24.04
* NVIDIA drivers
* Model server
* Tailscale
* No Git
* Prometheus+Grafana
* No reverse proxy
* No public ports

### *SSH \+ Access Layer \+ Backup*

Primary workstation/control planes must have:

* SSH keys configured
* known\_hosts cleanly maintained
* Key-based login only
* No passwords
* BitLocker enabled
* Secure backup of SSH keys
* Encrypted backup of WSL distro

### *Development/Production Machines*

* Prometheus+Grafana...logging layer
* Tailscale installed

### *Old Machines as HTOP Monitoring Only*

* antiX Linux
* openssh-server
* tailscale
* fail2ban
* tmux
* htop
* curl
* jq
* net-tools

### *Git Strategy*

Git flows outward from the primary workstation/control planes:

* Primary Git origin working copy on Windows workstation/control planes
* Push to GitHub
* Dev VM clones from Git
* Prod pulls from Git
* No manual SFTP deploy.

### *Monitoring Access*

Primary workstation/control planes:

* Access Prometheus+Grafana...logging layer dashboards via Tailscale
* SSH into VMs for deeper inspection
* Does NOT host monitoring
* Keep monitoring on VMs.

### *VM For Email System*

Useful for logs and mail utilities:

* sudo apt install \-y mailutils logrotate ca-certificates

# **Networking Backbone**

## Tailscale Mesh Network

Purpose:

* Secure private communication between all machines.
* No port forwarding.
* No exposed public services.

Characteristics:

* WireGuard-based encryption.
* Private 100.x.x.x addressing.
* Works locally and remotely.
* Same addressing scheme everywhere.

Installed on:

* All hardware

## API Access Model

Internal services (e.g., inference server) bind to:

* 100.x.x.x:port
* Not exposed publicly.
* Applications communicate securely over the Tailscale mesh.

# **AI Integration Strategy**

## Development-Only AI

AI tools installed only on:

* WSL

AI Responsibilities:

* Edit code
* Run tests
* Restart dev services
* Assist in development

AI does NOT:

* Access production directly
* Hold production credentials
* Modify live systems

## Production AI

* GPU nodes are never publicly exposed.
* Inference APIs bind only to the Tailscale IP.
* Public traffic flows through the reverse proxy and app layer.
* App layer enforces auth, rate limits, billing, logging.
* GPU nodes are compute-only and stateless.

## Promotion Workflow

dev → git commit → git push → prod pulls → restart service

Production remains deterministic.
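
The production half of the promotion workflow ("prod pulls → restart service") can be sketched as two commands run in order. This is a minimal sketch, assuming the app is a Git checkout managed by a systemd service; the `promotion_commands` and `promote` helpers, the path `/srv/apps/exampleapp`, and the unit name `exampleapp.service` are hypothetical.

```python
import subprocess
from typing import Callable, List


def promotion_commands(app_dir: str, service: str) -> List[List[str]]:
    """Commands a production VM would run for 'prod pulls -> restart service'."""
    return [
        # Fast-forward only: refuse to pull if prod has diverged from origin.
        ["git", "-C", app_dir, "pull", "--ff-only"],
        # Restart the app's systemd service so the new code is loaded.
        ["systemctl", "restart", service],
    ]


def run_checked(cmd: List[str]) -> None:
    """Run one command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


def promote(app_dir: str, service: str,
            run: Callable[[List[str]], None] = run_checked) -> None:
    """Run the promotion steps in order; an exception from `run` stops the rest."""
    for cmd in promotion_commands(app_dir, service):
        run(cmd)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    promote("/srv/apps/exampleapp", "exampleapp.service",
            run=lambda cmd: print(" ".join(cmd)))
```

Using `--ff-only` is one way to keep production deterministic: if someone edited files directly on the production VM, the pull fails instead of silently merging, and the service is not restarted.
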
# **Monitoring Strategy**

## Per-VM Monitoring

Install on each production VM:

* Prometheus+Grafana...logging layer (real-time dashboard)
* No central aggregation required initially.
* Tmux \>\>\>htop

Monitor:

* CPU
* Memory
* Load
* Disk I/O
* Network
* Process count
* Docker (if ever used)

## Future Option

If scaling increases:

* Add alerting layer (e.g., Uptime Kuma)
* Add PostHog for more analysis

# **Standardization Framework**

## Canonical Ubuntu Base

All VMs must:

* Run Ubuntu 24.04 LTS (Noble)
* Use the same apt repositories
* Use the same SSH configuration
* Use the same directory layout

## Directory Standard

* /srv/apps
*
* /srv/backups
* Consistent across all machines.

## ProxMox

Installed on multi-core machines to create and manage VMs.

See setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"

* Proxmox VE
* SSH (key only)
* Tailscale
* Proxmox firewall

## Bootstrap Script

Create a reusable bootstrap.sh.

Installs:

* Baseline packages
* Prometheus+Grafana...logging layer (prod only)
* Tailscale
* Directory structure
* User setup

Ensures reproducibility.

## Containerization Philosophy

Default:

* No Docker.
* Native systemd services.

Use Docker only when:

* Multiple services per VM required.
* Isolation becomes necessary.
* CI/CD complexity grows.

# **Backup SOP**

## Router

Back up all configs and download them \- note the router version number as well. Save to . Restore to backup router as well. See C:\projects\infra\hardware\routers for specific instructions.

## Workstation/Control Planes

Google Drive real time to OneDrive via BackupHQ

Back up WSL Ubuntu with new software changes/updates as tar to .

Manual tar → copy to

See /path/to/infra\\edge-\\README.md

## Proxmox

Back up config files only.
See [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference), Caddy Installation Guide towards the end.

## VMs

* Proxmox UI\>VM\>Backup: manual, see scripts file
* Snapshots via Proxmox UI \- manual, before server updates
* VM specific: from WSL workstation: see scripts file
* Automated backup daily at 2 am: see [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Backup lod dev, lod prod, community-indx-earth to promox-backup
* Real time sync to ; real time sync nas to expansion drives

##

* Real-time backup to and external drives

## Offsite

* Backups are copied to portable external drives, so if no one will be at the location, the last person to leave takes the external drive with them.

## SSH

* SSH key backup location:

## Gitea/projects

# **Disaster Recovery Definition**

Failure Scenarios

* Dell dies → travel laptop takes over
* AVA dies → clone server takes over
* dies → restore
* GPU node dies → inference unavailable but no data loss

# **Environment Variable Policy**

* .env policy: never committed to Git; enforce via .gitignore
* Secrets store on workstation/control plane: wsl..\~/.secrets/
* Backup and store on
* Dev: .env loaded by app
* Production: apps receive secrets via a systemd environment file: EnvironmentFile=
* .env never world-readable
* File permissions set to 600
* AI agents must never read production.env

# **Logging Strategy**

* journald persistent storage enabled
* SystemMaxUse=500M (or similar)
* logrotate enabled for app logs
* Application logs stored in: /logs/appname

# **VM Environments**

https://docs.example.invalid/private-reference

# **API Keys**

* Stored in password vault/manager
* Rotate if a workstation is compromised

# **Configurations**

## Tailscale

() @ windows caddy-router ssh \-i \~/.ssh/edge\_control\_plane @

## SSH

### *Windows Native SSH (PowerShell / Windows Terminal)*

Main workstation: \\.ssh backed up to NAS/Archive

In WSL, /.ssh/

Use the above when running SSH from Windows directly, using FTP, PuTTY, etc.

### *WSL Ubuntu SSH (Edge Control Plane)*

Main workstation: /.ssh/ backed up to NAS/Archive

Key created with Google Ed password to encrypt the backups using 7-zip

Use the above when running SSH from inside WSL, managing Proxmox VMs, acting as edge control plane

\~/.ssh/edge\_control\_plane

\~/.ssh/edge\_control\_plane.pub

## Git

Edge-control-plane: Git repository for workstations/control plane [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)

Edge-pre01: Git repository for pre01 server running Proxmox and VMs

## VM Template

* VM 9000 → ubuntu-2404-base
* Status → Converted to template
* 32GB disk
* 2GB RAM
* Clean state

## Proxmox Setup

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)

## VM Template Setup

Use this document

## VM Caddy-router setup

C:\projects\infra\software\caddy-router

## Caddy SOP

C:\projects\infra\software\caddy-router

## VM Setup Using Template Process

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)

## Install Discourse on ProxMox VM

"/path/to/infra\\edge-community-indx-earth\\discourse-export-topics-csv\\docs\\Discourse Install on ProxMox VM.md"

## Ops Monitor Setup And SOP

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
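
The Environment Variable Policy earlier in this document requires .env files to have permissions 600 and never be world-readable. A check along those lines might look like the sketch below. It assumes a POSIX filesystem (as on the Ubuntu VMs); the `env_file_problems` helper is hypothetical, and the demo uses a throwaway temporary file rather than any real secrets file.

```python
import os
import stat
import tempfile
from typing import List


def env_file_problems(path: str) -> List[str]:
    """Return a list of policy violations for one env file (empty if compliant)."""
    # Keep only the permission bits (e.g. 0o600), dropping the file-type bits.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode != 0o600:
        problems.append(f"mode is {mode:o}, expected 600")
    return problems


if __name__ == "__main__":
    # Demo on a temporary file; no real .env file is touched.
    with tempfile.NamedTemporaryFile(suffix=".env", delete=False) as handle:
        demo_path = handle.name
    try:
        os.chmod(demo_path, 0o644)
        print(env_file_problems(demo_path))
        os.chmod(demo_path, 0o600)
        print(env_file_problems(demo_path))
    finally:
        os.remove(demo_path)
```

A check of this kind could be one step in a bootstrap or deploy script, failing early if a secrets file was created with looser permissions than the policy allows.
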