edge-data-center-main-OPEN/Edge-Compute-Center-Architecture.md
eddiesoehnel c5bb951281 added
2026-06-15 07:25:04 -06:00

Edge Data Center Architecture

Admin

Original brainstorming chat, named "Edge Data Center Platform Apps Structuring": https://chat.example.invalid/private-reference

Master list of projects, agents/workflows, tools, script/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference

Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md

Foundational Philosophy

Core Methodologies

  • Standardization over improvisation.
  • Reproducibility via templates and scripts.
  • VM-level isolation.
  • Git-first deployment.
  • Minimal dependencies.
  • Decentralize and isolate machines and processes to minimize the blast radius if anything is compromised.
  • Gradual complexity scaling.

Security Principles

  • No public SSH exposure anywhere
  • No port forwarding.
  • Tailscale for private networking.
  • Separate dev and prod.
  • Keep Ubuntu/Proxmox/Installs clean.
  • Principle of minimal installed services.
  • AI agents never read production.env, never hold a production SSH key, and never hold production DB credentials. AI agents operate only in dev. Full AI security policy here: https://docs.example.invalid/private-reference
  • Production public access:
    • Reverse proxy (Caddy) controls ports 80 and 443 and routes to all apps from the Internet
    • HTTPS only
  • Domains: domain transfer lock, domain privacy
  • Firewalls on all systems
  • Dedicated infrastructure email account used for cloud service (Cloudflare, GoDaddy, etc) logins, operating in its own browser session, with minimal browser apps installed
  • MFA and YubiKey for infrastructure account logins
  • uBlock Origin/Lite
  • Backups:
    • 3 copies of data (e.g. NAS)
    • 2 different storage types (e.g. cloud)
    • 1 copy stored offsite (e.g. external drive)
  • Full online security policies and practices here: https://docs.example.invalid/private-reference
  • Cloudflared could be added as well, but may not be necessary
  • Route domains through Cloudflare with proxying enabled (orange cloud)
  • Run apps in VMs, not Docker or other container runtimes. VMs limit blast radius the most.
  • See C:\projects\infra\edge-data-center-main\online-security-practices.md for more

Primary And Backup Machines

Every primary stand-alone VM has a secondary warm backup machine that takes over if the primary fails.

Design Intent

This architecture supports:

  • Remote secure access.
  • AI-assisted development.
  • Production stability.
  • Monitoring visibility.
  • Future GPU expansion.
  • Sovereign model hosting.
  • Reduced SaaS dependency.
  • Long-term cost control.

Dev vs Production Process

Production is always on its own machine, with a warm backup on standby if the primary fails.

Dev is done on a separate machine: either an ephemeral dev machine or a dev machine with dev apps in Docker containers. If Docker is used:

Git transports:

  • source code
  • configs
  • assets
  • scripts

Docker provides only:

  • runtime packaging
  • dependency isolation
  • environment management

So if:

  • your app code is portable
  • dependencies are compatible
  • paths/configs are sane

then:

Docker vs non-Docker is mostly irrelevant.

Whether Docker is used or not, Git is used to maintain the code.

But the runtime has to be the same across all machines.

Traffic Flow

  • All domains are routed through Cloudflare with proxying enabled (orange cloud), with cache and security rules in place.
  • For inbound admin access, no open ports; Tailscale only.
  • Internally on the LAN, SSH keys only, no passwords.

Email/Comms

  • Gmail is primary
  • Thunderbird connects to the Gmail accounts over IMAP for offline backup
  • Secondary SMTP setup to send from desktop if Gmail is down
  • mbsync: Thunderbird to email backup
  • All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS

Logging

https://chat.example.invalid/private-reference

  • Prometheus+Grafana = server hardware metrics (gauges)
  • Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
  • Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
  • Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
  • Container logs: Docker restarts, failures
  • Security logs: fail2ban, UFW, Cloudflare security events
  • VM inventory scan: scan all apps, software, dependencies, and packages installed
  • Cloudflare telemetry: edge telemetry from CF.
  • Email telemetry: listmonk and Postmark data on emails (opens, bounces, clicks, etc.)
  • AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
  • Google Analytics telemetry
  • Business event telemetry: payments, subscriptions, signups
  • Uptime monitoring: uptimemonitor for URL-based apps; apps without a URL, like ops monitor, need their own alerts.
  • Internet-down monitoring: a Raspberry Pi does this through uptime monitor
  • Electricity monitoring: is the grid down, is the UPS down, solar production
  • Public network: Cloudflare down, internet down, etc.
  • Weather data
  • Calendar, browser history, location data, image metadata, Garmin FIT data

Capture any quantitative data that can be reasoned over to find correlations and patterns. Logging gets sent to a unified telemetry lake and is parsed as needed for analysis.

Layered Architecture

Layer – Hardware

Tier 1, 2 and 3 classification: see Physical Servers and PC SOP: C:\projects\infra\edge-data-center-main\physical-machine-tiers.md

Layer - Routing

  • Caddy runs on its own VM, which is public facing and the only publicly exposed VM
  • Proxmox only
  • No extra tooling installed
  • Treated like firmware
  • Clean, stable, minimal
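
A minimal sketch of how the Caddy VM might route one public hostname to an app VM over the tailnet. The hostname app.example.com, the upstream 100.64.0.10:8000, and the helper name caddy_site are hypothetical placeholders, not values from this architecture. Caddy obtains certificates and redirects HTTP to HTTPS by default, which fits the HTTPS-only rule.

```shell
# Print a Caddyfile site block that proxies one public hostname to one
# internal upstream. Both arguments are supplied by the caller; the values
# used in the example call below are hypothetical.
caddy_site() {
    domain=$1
    upstream=$2
    printf '%s {\n\treverse_proxy %s\n}\n' "$domain" "$upstream"
}

# Example: emit a block that could be appended to /etc/caddy/Caddyfile.
caddy_site app.example.com 100.64.0.10:8000
```

Generating the blocks from a script keeps routing changes in Git, in line with the Git-first deployment principle.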

Layer – Operating System Standard

  • Ubuntu 24.04 LTS (Noble) everywhere
  • No Debian/Ubuntu mixing
  • No version drift

Layer - VMs

VM Strategy

Two Apps Max Per VM (Default)

  • Each machine runs Proxmox and two VMs, each VM with one app. It could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending huge amounts of time reinstalling a bunch of VMs and apps on one machine.
  • No Docker unless the app requires it; in some cases a VM will be dockerized to host additional apps.
  • Isolation handled at the VM level.
  • Services managed via systemd.

This reduces complexity and improves debuggability.

Core VM Roles

Primary Workstation/Control Planes

Windows Host Layer

  • Tailscale
  • Git For Windows - used as primary git
  • GitHub Desktop - used as primary git
  • FileZilla
  • AI coding agent installed via VS Code
  • Development only
  • Snapshottable
  • Windows Terminal
  • PowerShell 7
  • BitLocker, if a Windows Pro edition is installed
  • Python for personal automation as needed

WSL Ubuntu Layer

  • Git
  • OpenSSH client
  • curl
  • build-essential
  • Python
  • Node
  • AI CLI tooling
  • Optional Docker (if required)
  • API keys
  • tmux (session persistence)
  • htop (local visibility)
  • jq (JSON processing)
  • yq (YAML processing)
  • direnv (env management)
  • gh CLI (GitHub CLI)
  • make (if using build pipelines)
  • ripgrep
  • fzf
  • Whisper STT for speech-to-text processing

VM Development Layer

  • Prometheus+Grafana+Loki+Promtail...logging layer
  • Git
  • OpenSSH client
  • curl
  • build-essential
  • Python
  • Node
  • Optional Docker (if required)
  • tmux (session persistence)
  • htop (local visibility)
  • jq (JSON processing)
  • yq (YAML processing)
  • direnv (env management)
  • gh CLI (GitHub CLI)
  • make (if using build pipelines)
  • Ripgrep
  • Fzf
  • Unzip
  • Tailscale
  • fail2ban

VM Production Layer

  • openssh-server
  • tailscale
  • Prometheus+Grafana...logging layer
  • runtime dependencies for the app
  • systemd service
  • git

Model Gateway (control-only)

  • Lightweight API service

  • Auth/rate limiting for internal callers (apps)

  • Logging/metrics

  • Provider routing config

  • Tailscale

  • Prometheus+Grafana...logging layer

GPU Node (compute-only)

  • Ubuntu 24.04

  • NVIDIA drivers

  • Model server

  • Tailscale

  • No Git

  • Prometheus+Grafana

  • No reverse proxy

  • No public ports

SSH + Access Layer + Backup

Primary workstation/control planes must have:

  • SSH keys configured
  • Known_hosts cleanly maintained
  • Key-based login only
  • No passwords
  • BitLocker enabled
  • Secure backup of SSH keys
  • Encrypted backup of WSL distro

Development/Production Machines

  • Prometheus+Grafana...logging layer
  • Tailscale installed

Old Machines as HTOP Monitoring Only

  • antiX Linux
  • openssh-server
  • tailscale
  • fail2ban
  • tmux
  • htop
  • curl
  • jq
  • net-tools

Git Strategy

The Git flow from the primary workstation/control planes should be:

  • Primary Git working copy (the origin of all changes) lives on the Windows workstation/control planes
  • Push to GitHub
  • Dev VM clones from Git
  • Prod pulls from Git
  • No manual SFTP deploy.

Monitoring Access

Primary workstation/control planes:

  • Access Prometheus+Grafana...logging layer dashboards via Tailscale
  • SSH into VMs for deeper inspection
  • Does NOT host monitoring
  • Keep monitoring on VMs.

VM For Email System

Useful for logs and mail utilities:

    sudo apt install -y mailutils logrotate ca-certificates

Networking Backbone

Tailscale Mesh Network

Purpose:

  • Secure private communication between all machines.
  • No port forwarding.
  • No exposed public services.

Characteristics:

  • WireGuard-based encryption.
  • Private 100.x.x.x addressing.
  • Works locally and remotely.
  • Same addressing scheme everywhere.

Installed on:

  • All hardware

API Access Model

Internal services (e.g., inference server) bind to:

  • 100.x.x.x:port
  • Not exposed publicly.
  • Applications communicate securely over the Tailscale mesh.
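
A small sketch of a pre-flight check a service start script could run before binding. It relies on the fact that Tailscale assigns IPv4 addresses from the CGNAT range 100.64.0.0/10, so "100.x.x.x" in practice means a second octet between 64 and 127. The function name is_tailscale_ip is hypothetical.

```shell
# Return 0 if $1 is an IPv4 address inside 100.64.0.0/10
# (first octet 100, second octet 64-127), non-zero otherwise.
is_tailscale_ip() {
    ip=$1
    # Reject empty input and anything other than digits and dots.
    case "$ip" in
        ''|*[!0-9.]*) return 1 ;;
    esac
    # Split on dots into positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        [ -n "$octet" ] && [ "$octet" -le 255 ] || return 1
    done
    [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
}
```

On a machine with Tailscale installed, `is_tailscale_ip "$(tailscale ip -4)"` would confirm the address a service is about to bind to is a tailnet address rather than a LAN or public one.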

AI Integration Strategy

Development-Only AI

AI tools installed only on:

  • WSL

AI Responsibilities:

  • Edit code
  • Run tests
  • Restart dev services
  • Assist in development

AI does NOT:

  • Access production directly
  • Hold production credentials
  • Modify live systems

Production AI

  • GPU nodes are never publicly exposed.

  • Inference APIs bind only to Tailscale IP.

  • Public traffic flows through reverse proxy and app layer.

  • App layer enforces auth, rate limits, billing, logging.

  • GPU nodes are compute-only and stateless.

Promotion Workflow

dev → git commit → git push → prod pulls → restart service

Production remains deterministic.
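
The production half of this workflow can be sketched as a short script run on the prod VM. This is an illustration under stated assumptions, not the project's actual deploy script: the app directory /srv/apps/myapp, the unit name myapp.service, and the DRY_RUN switch are all hypothetical.

```shell
# Hypothetical defaults; adjust the app directory and unit name per VM.
APP_DIR=${APP_DIR:-/srv/apps/myapp}
SERVICE=${SERVICE:-myapp.service}
# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Pull the already-pushed commit, then restart the service.
# --ff-only refuses to merge, so prod can only move forward to what is in Git.
promote() {
    run git -C "$APP_DIR" pull --ff-only &&
    run sudo systemctl restart "$SERVICE"
}
```

Using `--ff-only` means a pull fails instead of creating a merge commit if prod has drifted, which keeps production limited to exactly what was pushed.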

Monitoring Strategy

Per-VM Monitoring

Install on each production VM:

  • Prometheus+Grafana...logging layer (real-time dashboard)
  • No central aggregation required initially.
  • htop running inside a tmux session

Monitor:

  • CPU
  • Memory
  • Load
  • Disk I/O
  • Network
  • Process count
  • Docker (if ever used)

Future Option

If scaling increases:

  • Add alerting layer (e.g., Uptime Kuma)
  • Add Posthog for more analysis

Standardization Framework

Canonical Ubuntu Base

All VMs must:

  • Run Ubuntu 24.04 LTS (Noble)
  • Use same apt repositories
  • Use same SSH configuration
  • Use same directory layout

Directory Standard

  • /srv/apps
  • /srv/backups
  • Consistent across all machines.

Proxmox

Installed on multi-core machines to create and manage VMs

See Setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"

  • Proxmox VE
  • SSH (key only)
  • Tailscale
  • Proxmox firewall

Bootstrap Script

Create a reusable bootstrap.sh

Installs:

  • Baseline packages
  • Prometheus+Grafana...logging layer (prod only)
  • Tailscale
  • Directory structure
  • User setup

Ensures reproducibility.
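
A minimal sketch of what the core of bootstrap.sh could look like, covering only the directory standard and the baseline package step. The package list is an assumption drawn from the VM layer lists in this document, and the function names are hypothetical. Tailscale, the logging layer, and user setup are left out because they need their own repositories and credentials.

```shell
# Assumed baseline; trim or extend per VM role.
BASELINE_PACKAGES="git curl tmux htop jq openssh-server"

# Create the standard directory layout under an optional root prefix.
# On a real VM the prefix is "/"; a scratch directory allows a rehearsal.
make_layout() {
    root=${1:-/}
    mkdir -p "${root%/}/srv/apps" "${root%/}/srv/backups"
}

# Print the install command rather than running it, so the script can be
# reviewed before anything is installed.
install_command() {
    printf 'sudo apt-get install -y %s\n' "$BASELINE_PACKAGES"
}
```

A rehearsal such as `make_layout "$(mktemp -d)"` exercises the layout step without touching the real /srv tree.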

Containerization Philosophy

Default:

  • No Docker.

  • Native systemd services.

Use Docker only when:

  • Multiple services per VM required.

  • Isolation becomes necessary.

  • CI/CD complexity grows.

Backup SOP

Router

Back up all configs and download them, noting the router version number as well. Save to . Restore to the backup router as well. See C:\projects\infra\hardware\routers for specific instructions

Workstation/Control Planes

Google Drive real time to OneDrive via BackupHQ

Backup WSL Ubuntu with new software changes/updates as tar to . Manual tar → copy to

See /path/to/infra\edge-\README.md

Proxmox

Backup config files only. See https://chat.example.invalid/private-reference Caddy Installation Guide towards end

VMs

  • Proxmox UI>VM>Backup : Manual, see scripts file
  • Snapshots via proxmox UI - Manual, before server updates
  • VM specific: from wsl workstation: see scripts file
  • Automated backup daily 2 am: see https://docs.example.invalid/private-reference
    • Backup lod dev, lod prod, community-indx-earth to promox-backup
    • Real time sync to ; real time sync nas to expansion drives

  • Real-time backup to and external drives
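
One way the daily 2 am job could be expressed, shown as a sketch only; the actual automation is in the linked document. It prints an /etc/cron.d entry that calls Proxmox's vzdump. The helper name, the storage name, and the VM IDs in the usage example are hypothetical.

```shell
# Print an /etc/cron.d line that runs vzdump daily at 02:00 as root.
# $1 = Proxmox storage name, remaining arguments = VM IDs to back up.
backup_cron_line() {
    storage=$1
    shift
    # Require at least one VM ID.
    [ "$#" -gt 0 ] || return 1
    printf '0 2 * * * root vzdump %s --storage %s --mode snapshot --compress zstd\n' "$*" "$storage"
}
```

For example, `backup_cron_line backups 101 102` prints a line for VMs 101 and 102; snapshot mode lets the VMs keep running during the backup.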

Offsite

  • Backups are copied to portable external drives, so if no one will be at the location, the last person to leave takes the external drive with them

SSH

  • SSH key backup location:

Gitea/projects

Disaster Recovery Definition

Failure Scenarios

  • Dell dies → travel laptop takes over

  • AVA dies → clone server takes over

  • dies → restore

  • GPU node dies → inference unavailable but no data loss

Environment Variable Policy

  • .env policy: never committed to Git; enforce via gitignore

  • Secrets store on workstation/control plane: WSL, ~/.secrets/

  • Backup and store on

  • Dev: .env loaded by app

  • Production: apps receive secrets via a systemd environment file (EnvironmentFile=)

  • .env never world-readable

  • File permissions set to 600

  • AI agents must never read production.env
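
The permission rule above can be checked mechanically. The following is a sketch that assumes GNU stat, as shipped on Ubuntu; the function name env_file_is_private is hypothetical.

```shell
# Return 0 only if the file exists and its mode is exactly 600
# (read/write for the owner, nothing for group or others).
env_file_is_private() {
    [ -f "$1" ] || return 1
    mode=$(stat -c '%a' "$1") || return 1
    [ "$mode" = "600" ]
}
```

A deploy script could call `env_file_is_private /path/to/production.env || exit 1` before restarting a service (the path is a placeholder), and `git check-ignore -q .env` can confirm in the same step that the file is covered by .gitignore.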

Logging Strategy

  • journald persistent storage enabled
  • SystemMaxUse=500M (or similar)
  • logrotate enabled for app logs
  • Application logs stored in: /logs/appname
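
A sketch of how the journald settings could be applied as a drop-in file rather than by editing journald.conf directly. Storage= and SystemMaxUse= are standard journald.conf options; the file name retention.conf and the function name are hypothetical.

```shell
# Write a journald drop-in that enables persistent storage and caps disk use.
# $1 = target directory (defaults to the standard drop-in location).
write_journald_dropin() {
    dir=${1:-/etc/systemd/journald.conf.d}
    mkdir -p "$dir" || return 1
    cat > "$dir/retention.conf" <<'EOF'
[Journal]
Storage=persistent
SystemMaxUse=500M
EOF
}
```

On a real VM this would run as root and be followed by `systemctl restart systemd-journald` so the new limits take effect.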

VM Environments

https://docs.example.invalid/private-reference

API Keys

  • Stored in password vault/manager
  • Rotate keys if a workstation is compromised

Configurations

Tailscale

() @ windows

caddy-router

ssh -i ~/.ssh/edge_control_plane @

SSH

Windows Native SSH (PowerShell / Windows Terminal)

main workstation: \.ssh backed up to NAS/Archive

In WSL, /.ssh/

Use the above when running SSH from Windows directly, using FTP, PuTTY, etc.

WSL Ubuntu SSH (Edge Control Plane)

Main workstation: /.ssh/ backed up to NAS/Archive

Key created with Google Ed password to encrypt the backups using 7-zip

Use the above when running SSH from inside WSL, managing Proxmox VMs, or acting as the edge control plane

~/.ssh/edge_control_plane

~/.ssh/edge_control_plane.pub

Git

Edge-control-plane: git repository for workstations/control plane

https://docs.example.invalid/private-reference

Edge-pre01: git repository for pre01 server running proxmox and VMs

VM Template

  • VM 9000 → ubuntu-2404-base
  • Status → Converted to template
  • 32GB disk
  • 2GB RAM
  • Clean state
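
Creating a VM from this template can be sketched with Proxmox's qm clone. The helper below only builds and prints the command so it can be reviewed before running on the Proxmox host; the function name and the example VM name are hypothetical, and the lower bound of 100 reflects Proxmox's minimum VM ID.

```shell
# Template VM ID from the list above.
TEMPLATE_ID=9000

# Print the qm command that makes a full clone of the template.
# $1 = new numeric VM ID (>= 100, not the template itself), $2 = VM name.
clone_command() {
    new_id=$1
    name=$2
    case "$new_id" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$new_id" -ge 100 ] && [ "$new_id" -ne "$TEMPLATE_ID" ] || return 1
    [ -n "$name" ] || return 1
    printf 'qm clone %s %s --name %s --full\n' "$TEMPLATE_ID" "$new_id" "$name"
}
```

A full clone is independent of the template's disk, which suits the goal of keeping each VM separately restorable.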

Proxmox Setup

https://docs.example.invalid/private-reference

VM Template Setup

Use this document

VM Caddy-router setup

C:\projects\infra\software\caddy-router

Caddy SOP

C:\projects\infra\software\caddy-router

VM Setup Using Template Process

https://docs.example.invalid/private-reference

Install Discourse on Proxmox VM

"/path/to/infra\edge-community-indx-earth\discourse-export-topics-csv\docs\Discourse Install on ProxMox VM.md"

Ops Monitor Setup And SOP

https://docs.example.invalid/private-reference