Edge Data Center Architecture
Table of Contents
- Admin
- Foundational Philosophy
- VM Strategy
- Two Apps Max Per VM (Default)
- Core VM Roles
- Networking Backbone
- AI Integration Strategy
- Monitoring Strategy
- Standardization Framework
- Backup SOP
- Disaster Recovery Definition
- Environment Variable Policy
- Logging Strategy
- VM Environments
- API Keys
- Configurations
Admin
Original brainstorming chat, named "Edge Data Center Platform Apps Structuring": https://chat.example.invalid/private-reference
Master list of projects, agents/workflows, tools, scripts/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference
Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md
Foundational Philosophy
Core Methodologies
- Standardization over improvisation.
- Reproducibility via templates and scripts.
- VM-level isolation.
- Git-first deployment.
- Minimal dependencies.
- Decentralize and isolate machines and processes to minimize blast radius if anything is compromised.
- Gradual complexity scaling.
Security Principles
- No public SSH exposure anywhere
- No port forwarding.
- Tailscale for private networking.
- Separate dev and prod.
- Keep Ubuntu/Proxmox/Installs clean.
- Principle of minimal installed services.
- AI agents never read production.env, hold the production SSH key, or hold production DB credentials. AI agents operate only in dev. Full AI security policy here: https://docs.example.invalid/private-reference
- Production public access:
- Reverse proxy (Caddy) controls ports 80 and 443 and routes to all apps from the Internet
- HTTPS only
- Domains: domain transfer lock, domain privacy
- Firewalls on all systems
- Dedicated infrastructure email account used for cloud service logins (Cloudflare, GoDaddy, etc.), operating in its own browser session, with minimal browser apps installed
- MFA and YubiKey for infrastructure account logins
- uBlock Origin/Lite
- Backups:
- 3 copies of data (e.g., NAS)
- 2 different storage types (e.g., cloud)
- 1 copy stored offsite (e.g., external drive)
- Full online security policies and practices here: https://docs.example.invalid/private-reference
- Cloudflared could be added as well, but may not be necessary
- Route domains through Cloudflare in proxied (orange cloud) mode
- Apps run in VMs, not Docker or other containers. VMs limit blast radius the most.
- See C:\projects\infra\edge-data-center-main\online-security-practices.md for more
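The 3-2-1 backup rule listed under Backups above (3 copies, 2 storage types, 1 offsite) can be checked mechanically. The following is a minimal Python sketch, assuming a hypothetical inventory of backup copies; `BackupCopy`, `satisfies_3_2_1`, and the sample entries are illustrative names, not part of the existing setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical inventory record)."""
    name: str
    storage_type: str  # e.g. "nas", "cloud", "external-drive"
    offsite: bool


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Return True if the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_types = len({c.storage_type for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_types and has_offsite


# Illustrative inventory only; real backup locations are not specified here.
inventory = [
    BackupCopy("nas", "nas", offsite=False),
    BackupCopy("cloud", "cloud", offsite=True),
    BackupCopy("external-drive", "external-drive", offsite=False),
]
print(satisfies_3_2_1(inventory))
```

A check like this could run after each backup job so that a missing offsite copy is noticed before it is needed.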
Primary And Backup Machines
Every primary stand-alone VM has a secondary warm backup machine that takes over if the primary fails.
Design Intent
This architecture supports:
- Remote secure access.
- AI-assisted development.
- Production stability.
- Monitoring visibility.
- Future GPU expansion.
- Sovereign model hosting.
- Reduced SaaS dependency.
- Long-term cost control.
Dev vs Production Process
Production is always on its own machine, with a warm backup on standby if the primary fails.
Dev is done on a separate machine: either an ephemeral dev machine or a dev machine with dev apps in Docker containers. If Docker is used:
Git is transporting:
- source code
- configs
- assets
- scripts
Docker is only:
- runtime packaging
- dependency isolation
- environment management
So if:
- your app code is portable
- dependencies are compatible
- paths/configs are sane
then:
Docker vs non-Docker is mostly irrelevant.
Whether Docker is used or not, Git is used to maintain code.
But runtime has to be the same across all machines.
Traffic Flow
- All domains routed through Cloudflare in proxied (orange cloud) mode, with cache and security rules in place.
- Inbound access: no open ports, Tailscale only.
- Internal on LAN: SSH keys only, no passwords.
Email/Comms
- Gmail is primary
- Thunderbird connects to Gmail accounts over IMAPS for offline backup
- Secondary SMTP setup to send from desktop if Gmail is down
- mbsync Thunderbird to email backup
- All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS
Logging
https://chat.example.invalid/private-reference
- Prometheus+Grafana = server hardware metrics (gauges)
- Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
- Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
- Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
- Container logs: Docker restarts, failures
- Security logs: fail2ban, UFW, Cloudflare security events
- VM inventory scan: scan all apps, software, dependencies, packages installed
- Cloudflare telemetry: edge telemetry from CF.
- Email telemetry: listmonk and Postmark data on emails - opens, bounces, clicks, etc.
- AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
- Google Analytics telemetry
- Business event telemetry: payments, subscriptions, signups
- Uptime monitoring: an uptime monitor covers URL-based apps; non-URL services, like the ops monitor, need their own alerts.
- Internet-down monitoring: a Raspberry Pi does this through the uptime monitor
- Electricity monitoring: is grid down, is UPS down, solar production
- Public network: Cloudflare down, internet down, etc.
- Weather data
- Calendar, browser history, location data, image metadata, Garmin FIT data
Any quantitative data is kept so it can be reasoned over for correlations and patterns. Logging gets sent to a unified telemetry lake and is parsed as needed for analysis.
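Because every source above feeds one telemetry lake, a shared record shape makes later correlation easier. The following is a minimal sketch, assuming JSON Lines as the storage format and hypothetical field names (`ts`, `source`, `kind`, `data`); none of these are specified by the current setup.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_event(source: str, kind: str, data: dict,
               ts: Optional[datetime] = None) -> str:
    """Serialize one telemetry event as a single JSON line."""
    when = ts or datetime.now(timezone.utc)
    record = {
        "ts": when.astimezone(timezone.utc).isoformat(),  # always UTC
        "source": source,  # e.g. "caddy", "fail2ban", "listmonk"
        "kind": kind,      # e.g. "access", "ban", "bounce"
        "data": data,      # source-specific payload, stored as-is
    }
    return json.dumps(record, sort_keys=True)


# Example: one web access event from the reverse proxy.
print(make_event("caddy", "access", {"status": 200, "path": "/"}))
```

Keeping the payload unparsed under `data` matches the "parsed as needed" approach: only the envelope is standardized up front.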
Layered Architecture
Layer – Hardware
Tier 1, 2 and 3 classification: see Physical Servers and PC SOP: C:\projects\infra\edge-data-center-main\physical-machine-tiers.md
Layer - Routing
- Caddy runs on its own VM, which is the only publicly exposed VM
- Proxmox only
- No extra tooling installed
- Treated like firmware
- Clean, stable, minimal
Layer – Operating System Standard
- Ubuntu 24.04 LTS (Noble) everywhere
- No Debian/Ubuntu mixing
- No version drift
Layer - VMs
VM Strategy
Two Apps Max Per VM (Default)
- Each machine runs Proxmox and two VMs, each VM with one app. This could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending huge amounts of time reinstalling a bunch of VMs and apps on one machine.
- No Docker unless an app requires it; in some cases a VM will be dockerized to host additional apps.
- Isolation handled at the VM level.
- Services managed via systemd.
This reduces complexity and improves debuggability.
Core VM Roles
Primary Workstation/Control Planes
Windows Host Layer
- Tailscale
- Git for Windows - used as primary git
- GitHub Desktop - used as primary git
- FileZilla
- AI coding agent installed via VS Code
- Development only
- Snapshottable
- Windows Terminal
- PowerShell 7
- BitLocker: if Windows Pro is installed
- Python for personal automation as needed
WSL Ubuntu Layer
- Git
- OpenSSH client
- curl
- build-essential
- Python
- Node
- AI CLI tooling
- Optional Docker (if required)
- API keys
- tmux (session persistence)
- htop (local visibility)
- jq (JSON processing)
- yq (YAML processing)
- direnv (env management)
- gh CLI (GitHub CLI)
- make (if using build pipelines)
- Ripgrep
- Fzf
- Whisper STT for speech-to-text processing
VM Development Layer
- Prometheus+Grafana+Loki+Promtail...logging layer
- Git
- OpenSSH client
- curl
- build-essential
- Python
- Node
- Optional Docker (if required)
- tmux (session persistence)
- htop (local visibility)
- jq (JSON processing)
- yq (YAML processing)
- direnv (env management)
- gh CLI (GitHub CLI)
- make (if using build pipelines)
- Ripgrep
- Fzf
- Unzip
- Tailscale
- fail2ban
VM Production Layer
- openssh-server
- tailscale
- Prometheus+Grafana...logging layer
- runtime dependencies for the app
- systemd service
- git
Model Gateway (control-only)
- Lightweight API service
- Auth/rate limiting for internal callers (apps)
- Logging/metrics
- Provider routing config
- Tailscale
- Prometheus+Grafana...logging layer
GPU Node (compute-only)
- Ubuntu 24.04
- NVIDIA drivers
- Model server
- Tailscale
- No Git
- Prometheus+Grafana
- No reverse proxy
- No public ports
SSH + Access Layer + Backup
Primary workstation/control planes must have:
- SSH keys configured
- Known_hosts cleanly maintained
- Key-based login only
- No passwords
- BitLocker enabled
- Secure backup of SSH keys
- Encrypted backup of WSL distro
Development/Production Machines
- Prometheus+Grafana...logging layer
- Tailscale installed
Old Machines as HTOP Monitoring Only
- antiX Linux
- openssh-server
- tailscale
- fail2ban
- tmux
- htop
- curl
- jq
- net-tools
Git Strategy
Primary workstation/control planes should be:
- Primary Git origin working copy on Windows workstation/control planes
- Push to GitHub
- Dev VM clones from Git
- Prod pulls from Git
- No manual SFTP deploy.
Monitoring Access
Primary workstation/control planes:
- Access Prometheus+Grafana...logging layer dashboards via Tailscale
- SSH into VMs for deeper inspection
- Does NOT host monitoring
- Keep monitoring on VMs.
VM For Email System
Useful for logs and mail utilities
sudo apt install -y mailutils logrotate ca-certificates
Networking Backbone
Tailscale Mesh Network
Purpose:
- Secure private communication between all machines.
- No port forwarding.
- No exposed public services.
Characteristics:
- WireGuard-based encryption.
- Private 100.x.x.x addressing.
- Works locally and remotely.
- Same addressing scheme everywhere.
Installed on:
- All hardware
API Access Model
Internal services (e.g., inference server) bind to:
- 100.x.x.x:port
- Not exposed publicly.
- Applications communicate securely over the Tailscale mesh.
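A service can refuse to start if it is about to bind outside the tailnet. The following is a minimal sketch, relying on the fact that Tailscale assigns IPv4 addresses from the CGNAT range 100.64.0.0/10 (a subset of the "100.x.x.x" shorthand used above); `is_tailscale_ip` and `check_bind` are hypothetical helper names.

```python
import ipaddress

# Tailscale assigns IPv4 addresses from the CGNAT range 100.64.0.0/10.
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(addr: str) -> bool:
    """Return True if addr is an IPv4 address inside the Tailscale range."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # not a valid IP address at all
    return ip.version == 4 and ip in TAILSCALE_NET


def check_bind(addr: str) -> None:
    """Raise if a service is about to bind outside the tailnet."""
    if not is_tailscale_ip(addr):
        raise ValueError(f"refusing to bind to non-Tailscale address: {addr}")


print(is_tailscale_ip("100.100.1.2"))  # inside 100.64.0.0/10
print(is_tailscale_ip("0.0.0.0"))      # wildcard bind, outside the range
```

Calling `check_bind` before opening the listening socket would catch an accidental `0.0.0.0` bind, which would expose the service on every interface.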
AI Integration Strategy
Development-Only AI
AI tools installed only on:
- WSL
AI Responsibilities:
- Edit code
- Run tests
- Restart dev services
- Assist in development
AI does NOT:
- Access production directly
- Hold production credentials
- Modify live systems
Production AI
- GPU nodes are never publicly exposed.
- Inference APIs bind only to Tailscale IP.
- Public traffic flows through reverse proxy and app layer.
- App layer enforces auth, rate limits, billing, logging.
- GPU nodes are compute-only and stateless.
Promotion Workflow
dev → git commit → git push → prod pulls → restart service
Production remains deterministic.
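The production side of this workflow (pull, then restart) can be sketched as a small script. This is an assumption-laden example, not the existing deploy tooling: `deploy_commands`, `deploy`, the `/srv/apps/myapp` path, and the `myapp.service` unit name are all placeholders.

```python
import subprocess
from typing import List


def deploy_commands(app_dir: str, service: str) -> List[List[str]]:
    """Build the prod-side commands: pull from Git, then restart the unit."""
    return [
        # --ff-only refuses to merge, so prod never diverges from the remote.
        ["git", "-C", app_dir, "pull", "--ff-only"],
        ["sudo", "systemctl", "restart", service],
    ]


def deploy(app_dir: str, service: str, dry_run: bool = True) -> None:
    """Print each step; run it only when dry_run is False."""
    for cmd in deploy_commands(app_dir, service):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the deploy at the first failing step.
            subprocess.run(cmd, check=True)


# Dry run with placeholder names; nothing is executed.
deploy("/srv/apps/myapp", "myapp.service")
```

Using a fast-forward-only pull supports the determinism goal: production only ever runs commits that exist on the remote.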
Monitoring Strategy
Per-VM Monitoring
Install on each production VM:
- Prometheus+Grafana...logging layer (real-time dashboard)
- No central aggregation required initially.
- tmux session running htop
Monitor:
- CPU
- Memory
- Load
- Disk I/O
- Network
- Process count
- Docker (if ever used)
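For a quick check without opening a dashboard, a few of the metrics above can be read with the Python standard library alone. This is a rough sketch, not a replacement for the Prometheus+Grafana layer; it covers only CPU count, load, and disk usage, and `snapshot` is a hypothetical helper name.

```python
import os
import shutil


def snapshot(path: str = "/") -> dict:
    """Collect a few point-in-time host metrics using only the stdlib."""
    load_1m, load_5m, load_15m = os.getloadavg()  # Unix only
    disk = shutil.disk_usage(path)
    # Guard against a zero-size filesystem to avoid division by zero.
    used_pct = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    return {
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
        "disk_used_pct": used_pct,
    }


print(snapshot())
```

Memory, disk I/O, network, and process count need `/proc` parsing or a third-party library, so they are left to the monitoring stack.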
Future Option
If scaling increases:
- Add alerting layer (e.g., Uptime Kuma)
- Add PostHog for more analysis
Standardization Framework
Canonical Ubuntu Base
All VMs must:
- Run Ubuntu 24.04 LTS (Noble)
- Use same apt repositories
- Use same SSH configuration
- Use same directory layout
Directory Standard
- /srv/apps
- /srv/backups
- Consistent across all machines.
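The directory standard is easy to enforce idempotently. The following is a minimal sketch, assuming a hypothetical `ensure_layout` helper; the demo runs against a temporary directory so it does not touch the real `/srv`.

```python
import tempfile
from pathlib import Path
from typing import List

# Standard directories from the list above, relative to the filesystem root.
STANDARD_DIRS = ("srv/apps", "srv/backups")


def ensure_layout(root: Path = Path("/")) -> List[Path]:
    """Create the standard directories under root if missing; return them."""
    paths = []
    for rel in STANDARD_DIRS:
        target = root / rel
        # exist_ok makes the call safe to repeat on an already-set-up machine.
        target.mkdir(parents=True, exist_ok=True)
        paths.append(target)
    return paths


# Demo in a throwaway directory instead of the real root.
with tempfile.TemporaryDirectory() as tmp:
    for p in ensure_layout(Path(tmp)):
        print(p.relative_to(tmp))
```

On a real VM the call would be `ensure_layout()` with root privileges; ownership and permissions are left out of this sketch.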
Proxmox
Installed on multi-core machines to create and manage VMs
See Setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"
- Proxmox VE
- SSH (key only)
- Tailscale
- Proxmox firewall
Bootstrap Script
Create a reusable bootstrap.sh
Installs:
- Baseline packages
- Prometheus+Grafana...logging layer (prod only)
- Tailscale
- Directory structure
- User setup
Ensures reproducibility.
Containerization Philosophy
Default:
- No Docker.
- Native systemd services.
Use Docker only when:
- Multiple services per VM required.
- Isolation becomes necessary.
- CI/CD complexity grows.
Backup SOP
Router
Back up all configs and download them; note the router version number as well. Save to . Restore to the backup router as well. See C:\projects\infra\hardware\routers for specific instructions
Workstation/Control Planes
Google Drive real time to OneDrive via BackupHQ
Backup WSL Ubuntu with new software changes/updates as tar to . Manual tar → copy to
See /path/to/infra\edge-\README.md
Proxmox
Backup config files only. See https://chat.example.invalid/private-reference Caddy Installation Guide towards end
VMs
- Proxmox UI > VM > Backup: manual, see scripts file
- Snapshots via Proxmox UI: manual, before server updates
- VM specific: from WSL workstation, see scripts file
- Automated backup daily 2 am: see https://docs.example.invalid/private-reference
- Backup lod dev, lod prod, community-indx-earth to promox-backup
- Real time sync to ; real time sync nas to expansion drives
- Real-time backup to and external drives
Offsite
- Backups are copied to portable external drives, so if the location will be left unattended, the last person to leave takes the external drive with them
SSH
- SSH key backup location:
Gitea/projects
Disaster Recovery Definition
Failure Scenarios
- Dell dies → travel laptop takes over
- AVA dies → clone server takes over
- dies → restore
- GPU node dies → inference unavailable but no data loss
Environment Variable Policy
- .env policy: never committed to Git; enforce via .gitignore
- Secrets stored on workstation/control plane: WSL, ~/.secrets/
- Backup and store on
- Dev: .env loaded by app
- Production: apps receive secrets via a systemd environment file: EnvironmentFile=
- .env never world-readable
- File permissions set to 600
- AI agents must never read production.env
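The permission rules in this policy (mode 600, never world-readable) can be audited with a short script. The following is a minimal sketch, assuming a hypothetical `env_file_problems` helper and a Unix host; the demo uses a throwaway file, not a real env file.

```python
import os
import stat
import tempfile
from typing import List


def env_file_problems(path: str) -> List[str]:
    """Return policy violations for an env file (empty list means OK)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode != 0o600:
        problems.append(f"mode is {mode:o}, expected 600")
    return problems


# Demo on a temporary file: first a bad mode, then the required one.
with tempfile.NamedTemporaryFile() as f:
    os.chmod(f.name, 0o644)
    print(env_file_problems(f.name))
    os.chmod(f.name, 0o600)
    print(env_file_problems(f.name))
```

A real audit would point this at whatever file the unit's `EnvironmentFile=` setting references and fail loudly on a non-empty result.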
Logging Strategy
- journald persistent storage enabled
- SystemMaxUse=500M (or similar)
- logrotate enabled for app logs
- Application logs stored in: /logs/appname
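A Python app can write to the standard log location in a way that cooperates with logrotate. The following is a minimal sketch, assuming a hypothetical `make_logger` helper and an `app.log` file name; `WatchedFileHandler` from the standard library reopens the file when logrotate moves or replaces it. The demo writes to a temporary directory rather than `/logs`.

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path


def make_logger(app: str, log_root: str = "/logs") -> logging.Logger:
    """Return a logger that writes to <log_root>/<app>/app.log."""
    log_dir = Path(log_root) / app
    log_dir.mkdir(parents=True, exist_ok=True)
    # WatchedFileHandler reopens the file if logrotate moves or replaces it.
    handler = logging.handlers.WatchedFileHandler(log_dir / "app.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(app)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# Demo against a throwaway directory; "myapp" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    log = make_logger("myapp", tmp)
    log.info("service started")
    for h in list(log.handlers):
        h.close()
        log.removeHandler(h)
    print((Path(tmp) / "myapp" / "app.log").read_text().strip())
```

Services that log only to stdout are captured by journald instead, in which case the file handler is unnecessary.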
VM Environments
https://docs.example.invalid/private-reference
API Keys
- Stored in password vault/manager
- Rotate keys if a workstation is compromised
Configurations
Tailscale
() @ windows
caddy-router
ssh -i ~/.ssh/edge_control_plane @
SSH
Windows Native SSH (PowerShell / Windows Terminal)
main workstation: \.ssh backed up to NAS/Archive
In WSL, /.ssh/
Use the above when connecting over SSH from Windows directly, using FTP, PuTTY, etc.
WSL Ubuntu SSH (Edge Control Plane)
Main workstation: /.ssh/ backed up to NAS/Archive
Key created with Google Ed password to encrypt the backups using 7-zip
Use the above when running SSH from inside WSL, managing Proxmox VMs, or acting as the edge control plane
~/.ssh/edge_control_plane
~/.ssh/edge_control_plane.pub
Git
Edge-control-plane: git repository for workstations/control plane
https://docs.example.invalid/private-reference
Edge-pre01: git repository for pre01 server running proxmox and VMs
VM Template
- VM 9000 → ubuntu-2404-base
- Status → Converted to template
- 32GB disk
- 2GB RAM
- Clean state
Proxmox Setup
https://docs.example.invalid/private-reference
VM Template Setup
Use this document
VM Caddy-router setup
C:\projects\infra\software\caddy-router
Caddy SOP
C:\projects\infra\software\caddy-router
VM Setup Using Template Process
https://docs.example.invalid/private-reference
Install Discourse on ProxMox VM
"/path/to/infra\edge-community-indx-earth\discourse-export-topics-csv\docs\Discourse Install on ProxMox VM.md"