edge-data-center-main-OPEN/Edge-Compute-Center-Architecture.md
eddiesoehnel c5bb951281 added
2026-06-15 07:25:04 -06:00


# **Edge Data Center Architecture**
## Table of Contents
- [Admin](#admin)
- [Foundational Philosophy](#foundational-philosophy)
- [Core Methodologies](#core-methodologies)
- [Security Principles](#security-principles)
- [Primary And Backup Machines](#primary-and-backup-machines)
- [Design Intent](#design-intent)
- [Dev vs Production Process](#dev-process)
- [Traffic Flow](#traffic-flow)
- [Email/Comms](#emailcomms)
- [Logging](#logging)
- [Layered Architecture](#layered-architecture)
- [Layer – Hardware](#layer-hardware)
- [Layer - Routing](#layer-routing)
- [Layer – Operating System Standard](#layer-operating-system-standard)
- [Layer - VMs](#layer-vms)
- [VM Strategy](#vm-strategy)
- [Two Apps Max Per VM (Default)](#two-apps-max-per-vm-default)
- [Core VM Roles](#core-vm-roles)
- [Primary Workstation/Control Planes](#primary-workstationcontrol-planes)
- [Windows Host Layer](#windows-host-layer)
- [WSL Ubuntu Layer](#wsl-ubuntu-layer)
- [VM Development Layer](#vm-development-layer)
- [VM Production Layer](#vm-production-layer)
- [Model Gateway (control-only)](#model-gateway-control-only)
- [GPU Node (compute-only)](#gpu-node-compute-only)
- [SSH + Access Layer + Backup](#ssh-access-layer-backup)
- [Development/Production Machines](#developmentproduction-machines)
- [Old Machines as HTOP Monitoring Only](#old-machines-as-htop-monitoring-only)
- [Git Strategy](#git-strategy)
- [Monitoring Access](#monitoring-access)
- [VM For Email System](#vm-for-email-system)
- [Networking Backbone](#networking-backbone)
- [Tailscale Mesh Network](#tailscale-mesh-network)
- [API Access Model](#api-access-model)
- [AI Integration Strategy](#ai-integration-strategy)
- [Development-Only AI](#development-only-ai)
- [Production AI](#production-ai)
- [Promotion Workflow](#promotion-workflow)
- [Monitoring Strategy](#monitoring-strategy)
- [Per-VM Monitoring](#per-vm-monitoring)
- [Future Option](#future-option)
- [Standardization Framework](#standardization-framework)
- [Canonical Ubuntu Base](#canonical-ubuntu-base)
- [Directory Standard](#directory-standard)
- [ProxMox](#proxmox)
- [Bootstrap Script](#bootstrap-script)
- [Containerization Philosophy](#containerization-philosophy)
- [Backup SOP](#backup-sop)
- [Router](#router)
- [Workstation/Control Planes](#workstationcontrol-planes)
- [Proxmox](#proxmox-1)
- [VMs](#vms)
- [<primary-nas>](#nas0)
- [Offsite](#offsite)
- [SSH](#ssh)
- [Gitea/projects](#giteaprojects)
- [Disaster Recovery Definition](#disaster-recovery-definition)
- [Environment Variable Policy](#environment-variable-policy)
- [Logging Strategy](#logging-strategy)
- [VM Environments](#vm-environments)
- [API Keys](#api-keys)
- [Configurations](#configurations)
- [Tailscale](#tailscale)
- [SSH](#ssh-1)
- [Windows Native SSH (PowerShell / Windows Terminal)](#windows-native-ssh-powershell-windows-terminal)
- [WSL Ubuntu SSH (Edge Control Plane)](#wsl-ubuntu-ssh-edge-control-plane)
- [Git](#git)
- [VM Template](#vm-template)
- [Proxmox Setup](#proxmox-setup)
- [VM Template Setup](#vm-template-setup)
- [VM Caddy-router setup](#vm-caddy-router-setup)
- [Caddy SOP](#caddy-sop)
- [VM Setup Using Template Process](#vm-setup-using-template-process)
- [Install Discourse on ProxMox VM](#install-discourse-on-proxmox-vm)
- [Ops Monitor Setup And SOP](#ops-monitor-setup-and-sop)
# **Admin**
Original brainstorming chat: [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference) and named Edge Data Center Platform Apps Structuring
Master list of projects, agents/workflows, tools, script/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference
Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md
# **Foundational Philosophy**
## Core Methodologies
* Standardization over improvisation.
* Reproducibility via templates and scripts.
* VM-level isolation.
* Git-first deployment.
* Minimal dependencies.
* Decentralize and isolate machines and processes to minimize the blast radius if anything is compromised.
* Gradual complexity scaling.
## Security Principles
* No public SSH exposure anywhere
* No port forwarding.
* Tailscale for private networking.
* Separate dev and prod.
* Keep Ubuntu/Proxmox/Installs clean.
* Principle of minimal installed services.
* AI agents never read production.env, never hold a production SSH key, and never hold production DB credentials. AI agents operate only in dev. Full AI security policy here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Production public access:
* Reverse proxy (Caddy) controls ports 80 and 443 and routes to all apps from the Internet
* HTTPS only
* Domains: domain transfer lock, domain privacy
* Firewalls on all systems
* Dedicated infrastructure email account used for cloud service (Cloudflare, GoDaddy, etc.) logins, operating in its own browser session, with minimal browser apps installed
* MFA and YubiKey on infrastructure account logins
* uBlock Origin/Lite
* Backups:
    * 3 copies of data (e.g. NAS)
    * 2 different storage types (e.g. cloud)
    * 1 copy stored offsite (e.g. external drive)
* Full online security policies and practices here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Cloudflared could be added as well, but may not be necessary
* Route domains through Cloudflare proxied orange
* VM apps, not docker or other. VMs limit blast radius the most.
* See C:\projects\infra\edge-data-center-main\online-security-practices.md for more
## Primary And Backup Machines
Every primary stand-alone VM has a secondary warm backup machine that takes over if the primary fails.
## Design Intent
This architecture supports:
* Remote secure access.
* AI-assisted development.
* Production stability.
* Monitoring visibility.
* Future GPU expansion.
* Sovereign model hosting.
* Reduced SaaS dependency.
* Long-term cost control.
## Dev vs Production Process
Production is always on its own machine, with a warm backup on standby if the primary fails.
Dev is done on a separate machine: either an ephemeral dev machine or a dev machine with dev apps in Docker containers. If Docker is used:
Git is transporting:
- source code
- configs
- assets
- scripts
Docker is only:
- runtime packaging
- dependency isolation
- environment management
So if:
- your app code is portable
- dependencies are compatible
- paths/configs are sane
then Docker vs non-Docker is mostly irrelevant.
Whether or not Docker is used, Git is used to maintain the code, but the runtime has to be the same across all machines.
## Traffic Flow
* All domains routed through Cloudflare proxied orange with cache and security rules in place.
* For inbound access, no open ports; Tailscale only.
* Internal on LAN: SSH keys only, no passwords.
## Email/Comms
* Gmail is primary
* Thunderbird connects to Gmail accounts over IMAPS for offline backup
* Secondary SMTP setup to send from desktop if Gmail is down
* Mbsync Thunderbird to email backup
* All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS
## Logging
https://chat.example.invalid/private-reference
* Prometheus+Grafana = server hardware metrics (gauges)
* Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
* Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
* Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
* Container logs: Docker restarts, failures
* Security logs: fail2ban, UFW, Cloudflare security events
* VM inventory scan: scan all apps, software, dependencies, packages installed
* Cloudflare telemetry: edge telemetry from CF.
* Email telemetry: listmonk and Postmark data on emails: opens, bounces, clicks, etc.
* AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
* Google Analytics telemetry
* Business event telemetry: payments, subscriptions, signups
* Uptime monitoring: uptime monitor for URL-based apps; apps without a URL, like ops monitor, need separately configured alerts.
* Internet-down monitoring: Raspberry Pi does this through uptime monitor
* Electricity monitoring: is grid down, is UPS down, solar production
* Public network: Cloudflare down, internet down, etc.
* Weather data
* Calendar, browser history, location data, image metadata, Garmin FIT data
The goal is to capture any quantitative data that can be reasoned over to find correlations and patterns. Logs are sent to a unified telemetry lake and parsed as needed for analysis.
## Layered Architecture
### *Layer – Hardware*
Tier 1, 2 and 3 classification: see Physical Servers and PC SOP: C:\projects\infra\edge-data-center-main\physical-machine-tiers.md
### *Layer \- Routing*
* Caddy runs on its own VM, which is public facing and the only publicly exposed VM
* Proxmox only
* No extra tooling installed
* Treated like firmware
* Clean, stable, minimal
### *Layer – Operating System Standard*
* Ubuntu 24.04 LTS (Noble) everywhere
* No Debian/Ubuntu mixing
* No version drift
### *Layer \- VMs*
# **VM Strategy**
## Two Apps Max Per VM (Default)
* Each machine runs Proxmox and two VMs, each VM with one app. There could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending huge amounts of time reinstalling a bunch of VMs and apps on one machine.
* No Docker unless an app requires it; in some cases a VM is dockerized to host additional apps
* Isolation handled at the VM level.
* Services managed via systemd.
This reduces complexity and improves debuggability.
## Core VM Roles
### *Primary Workstation/Control Planes*
#### Windows Host Layer
* Tailscale
* Git For Windows \- used as primary git
* GitHub Desktop \- used as primary git
* Filezilla
* AI coding agent installed via VS code
* Development only
* Snapshottable
* Windows Terminal
* PowerShell 7
* BitLocker: if Windows Pro is installed
* Python for personal automation as needed
#### WSL Ubuntu Layer
* Git
* OpenSSH client
* curl
* build-essential
* Python
* Node
* AI CLI tooling
* Optional Docker (if required)
* API keys
* tmux (session persistence)
* htop (local visibility)
* jq (JSON processing)
* yq (YAML processing)
* direnv (env management)
* gh CLI (GitHub CLI)
* make (if using build pipelines)
* Ripgrep
* Fzf
* Whisper STT for speech-to-text processing
### *VM Development Layer*
* Prometheus+Grafana+Loki+Promtail...logging layer
* Git
* OpenSSH client
* curl
* build-essential
* Python
* Node
* Optional Docker (if required)
* tmux (session persistence)
* htop (local visibility)
* jq (JSON processing)
* yq (YAML processing)
* direnv (env management)
* gh CLI (GitHub CLI)
* make (if using build pipelines)
* Ripgrep
* Fzf
* Unzip
* Tailscale
* fail2ban
### *VM Production Layer*
* openssh-server
* tailscale
* Prometheus+Grafana...logging layer
* runtime dependencies for the app
* systemd service
* git
### *Model Gateway (control-only)*
* Lightweight API service
* Auth/rate limiting for internal callers (apps)
* Logging/metrics
* Provider routing config
* Tailscale
* Prometheus+Grafana...logging layer
### *GPU Node (compute-only)*
* Ubuntu 24.04
* NVIDIA drivers
* Model server
* Tailscale
* No Git
* Prometheus+Grafana
* No reverse proxy
* No public ports
### *SSH \+ Access Layer \+ Backup*
Primary workstation/control planes must have:
* SSH keys configured
* Known\_hosts cleanly maintained
* Key-based login only
* No passwords
* BitLocker enabled
* Secure backup of SSH keys
* Encrypted backup of WSL distro
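
The key-only rule above can be spot-checked on any machine that runs an SSH server. The sketch below is one way to do that, not an existing script in this repo: `audit_sshd_config` is a hypothetical name, and it only reads explicit directives in a single file (it does not follow `Include` or `Match` blocks). On a live host, `sshd -T` prints the effective configuration and is the more complete check.

```sh
# Hypothetical helper: report whether an sshd_config-style file explicitly
# disables password-style logins. Reads one file only.
audit_sshd_config() (
    file="$1"
    status=0
    # sshd keywords are case-insensitive and the first occurrence wins.
    for want in "passwordauthentication no" "kbdinteractiveauthentication no"; do
        key="${want%% *}"
        got="$(grep -i -E "^[[:space:]]*${key}[[:space:]]" "$file" | head -n 1 \
            | tr '[:upper:]' '[:lower:]' | awk '{print $1, $2}')"
        if [ "$got" = "$want" ]; then
            echo "ok: $want"
        else
            echo "FAIL: expected '$want', found '${got:-unset}'"
            status=1
        fi
    done
    exit "$status"
)
```

Example: `audit_sshd_config /etc/ssh/sshd_config` exits non-zero if either directive is missing or set to `yes`.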
### *Development/Production Machines*
* Prometheus+Grafana...logging layer
* Tailscale installed
### *Old Machines as HTOP Monitoring Only*
* antiX Linux
* openssh-server
* tailscale
* fail2ban
* tmux
* htop
* curl
* jq
* net-tools
### *Git Strategy*
Primary workstation/control planes should be:
* Primary Git origin working copy on Windows workstation/control planes
* Push to GitHub
* Dev VM clones from Git
* Prod pulls from Git
* No manual SFTP deploy.
### *Monitoring Access*
Primary workstation/control planes:
* Access Prometheus+Grafana...logging layer dashboards via Tailscale
* SSH into VMs for deeper inspection
* Does NOT host monitoring
* Keep monitoring on VMs.
### *VM For Email System*
Useful for logs and mail utilities:
* `sudo apt install -y mailutils logrotate ca-certificates`
# **Networking Backbone**
## Tailscale Mesh Network
Purpose:
* Secure private communication between all machines.
* No port forwarding.
* No exposed public services.
Characteristics:
* WireGuard-based encryption.
* Private 100.x.x.x addressing.
* Works locally and remotely.
* Same addressing scheme everywhere.
Installed on:
* All hardware
## API Access Model
Internal services (e.g., inference server) bind to:
* 100.x.x.x:port
* Not exposed publicly.
* Applications communicate securely over the Tailscale mesh.
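
Before a service binds to an address, it can help to confirm the address really is a Tailscale one. A minimal sketch, assuming the standard Tailscale node range 100.64.0.0/10 (the `100.x.x.x` addresses above); `is_tailscale_ip` is a hypothetical helper name.

```sh
# Hypothetical helper: succeed only if the IPv4 address is inside
# 100.64.0.0/10, the range Tailscale assigns node addresses from.
is_tailscale_ip() (
    ip="$1"
    case "$ip" in
        "" | *[!0-9.]*) exit 1 ;;   # reject anything but digits and dots
    esac
    old_ifs="$IFS"; IFS=.
    set -- $ip                      # split into octets
    IFS="$old_ifs"
    [ "$#" -eq 4 ] || exit 1
    for octet in "$@"; do
        [ -n "$octet" ] && [ "$octet" -le 255 ] || exit 1
    done
    # First octet 100, second octet 64..127 covers 100.64.0.0/10.
    [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
)

# Example on a node with Tailscale installed:
#   bind_ip="$(tailscale ip -4)" && is_tailscale_ip "$bind_ip" && echo "bind to $bind_ip"
```

A start script could refuse to launch the inference server when this check fails, so a misconfigured node never falls back to binding on a LAN or public address.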
# **AI Integration Strategy**
## Development-Only AI
AI tools installed only on:
* <primary-workstation> WSL
AI Responsibilities:
* Edit code
* Run tests
* Restart dev services
* Assist in development
AI does NOT:
* Access production directly
* Hold production credentials
* Modify live systems
## Production AI
* GPU nodes are never publicly exposed.
* Inference APIs bind only to Tailscale IP.
* Public traffic flows through reverse proxy and app layer.
* App layer enforces auth, rate limits, billing, logging.
* GPU nodes are compute-only and stateless.
## Promotion Workflow
dev → git commit → git push → prod pulls → restart service
Production remains deterministic.
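
The "prod pulls → restart service" half of the workflow could look roughly like the sketch below. The function names, the app directory and the unit name are all assumptions for illustration; with `DRY_RUN=1` (the default here) nothing is executed and the commands are only printed.

```sh
# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical deploy helper, run on the production VM.
deploy() (
    app_dir="$1"    # Git checkout of the app (assumed to live under /srv/apps)
    service="$2"    # systemd unit name (assumed)
    # --ff-only refuses to create a merge, so prod only moves to pushed commits.
    run git -C "$app_dir" pull --ff-only || exit 1
    run sudo systemctl restart "$service" || exit 1
    run systemctl is-active "$service"
)
```

Example: `DRY_RUN=1 deploy /srv/apps/myapp myapp.service` prints the three commands; `DRY_RUN=0` runs them and exits non-zero if the service is not active afterwards.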
# **Monitoring Strategy**
## Per-VM Monitoring
Install on each production VM:
* Prometheus+Grafana...logging layer (real-time dashboard)
* No central aggregation required initially.
* tmux → htop
Monitor:
* CPU
* Memory
* Load
* Disk I/O
* Network
* Process count
* Docker (if ever used)
## Future Option
If scaling increases:
* Add alerting layer (e.g., Uptime Kuma)
* Add Posthog for more analysis
# **Standardization Framework**
## Canonical Ubuntu Base
All VMs must:
* Run Ubuntu 24.04 LTS (Noble)
* Use same apt repositories
* Use same SSH configuration
* Use same directory layout
## Directory Standard
* /srv/apps
* <app-data-root>
* /srv/backups
* Consistent across all machines.
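
One way to keep the layout identical everywhere is to create it from a single function. This is a sketch only: `make_layout` is a hypothetical name, and the second argument stands in for the `<app-data-root>` placeholder, whose real value is not given here.

```sh
# Hypothetical helper: create the standard directories under a prefix.
# PREFIX is "" on a real VM; pass a scratch directory to rehearse safely.
make_layout() (
    prefix="$1"
    app_data_root="$2"   # stands in for <app-data-root>
    for dir in /srv/apps "$app_data_root" /srv/backups; do
        mkdir -p "${prefix}${dir}" || exit 1
        echo "ready: ${prefix}${dir}"
    done
)
```

Example rehearsal: `make_layout /tmp/layout-test /srv/example-data` creates the tree under `/tmp/layout-test` without touching the real `/srv`.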
## ProxMox
Installed on multi-core machines to create and manage VMs
See Setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"
* Proxmox VE
* SSH (key only)
* Tailscale
* Proxmox firewall
## Bootstrap Script
Create a reusable bootstrap.sh
Installs:
* Baseline packages
* Prometheus+Grafana...logging layer (prod only)
* Tailscale
* Directory structure
* User setup
Ensures reproducibility.
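
A possible skeleton for bootstrap.sh is sketched below. The package list and the `deploy` user are assumptions, the logging-layer install is left out, and the Tailscale step uses the vendor's published install script. `DRY_RUN=1` (the default) prints each step instead of running it.

```sh
#!/bin/sh
# Sketch of bootstrap.sh. Package list and user name are assumptions.
BASE_PACKAGES="openssh-server git curl"
APP_USER="${APP_USER:-deploy}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

bootstrap() {
    run sudo apt-get update
    # Word splitting of BASE_PACKAGES is intentional here.
    run sudo apt-get install -y $BASE_PACKAGES
    # Tailscale's official install script.
    run sh -c 'curl -fsSL https://tailscale.com/install.sh | sh'
    run sudo mkdir -p /srv/apps /srv/backups
    run sudo useradd --create-home --shell /bin/bash "$APP_USER"
}
```

Calling `bootstrap` with `DRY_RUN=1` gives a reviewable list of steps; setting `DRY_RUN=0` on a fresh VM cloned from the template would execute them.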
## Containerization Philosophy
Default:
* No Docker.
* Native systemd services.
Use Docker only when:
* Multiple services per VM required.
* Isolation becomes necessary.
* CI/CD complexity grows.
# **Backup SOP**
## Router
Back up and download all configs, and note the router version number as well. Save to <primary-nas>. Restore to the backup router as well. See C:\projects\infra\hardware\routers for specific instructions
## Workstation/Control Planes
Google Drive real time to OneDrive via BackupHQ
Backup WSL Ubuntu with new software changes/updates as tar to <primary-nas>. Manual tar → copy to
See /path/to/infra\\edge-<primary-workstation>\\README.md
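
The manual tar step could be wrapped in a small function like the one below. It is a sketch under assumptions: `backup_dir` is a hypothetical name, and the destination is whatever path the NAS share is mounted at, which is not specified here. From the Windows side, `wsl --export` is an alternative that captures the whole distro in one file.

```sh
# Hypothetical helper: archive SRC into DEST_DIR as a dated .tar.gz and
# confirm the archive is readable before reporting success.
backup_dir() (
    src="$1"
    dest_dir="$2"
    name="$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest_dir" || exit 1
    # -C keeps archive paths relative to SRC's parent directory.
    tar -czf "$dest_dir/$name" -C "$(dirname "$src")" "$(basename "$src")" || exit 1
    # Listing the archive catches truncated or corrupt output.
    tar -tzf "$dest_dir/$name" >/dev/null || exit 1
    echo "$dest_dir/$name"
)
```

The function prints the path of the archive it wrote, so the copy to the NAS can be scripted as a follow-up step.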
## Proxmox
Backup config files only. See [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference) Caddy Installation Guide towards end
## VMs
* Proxmox UI\>VM\>Backup : Manual, see scripts file
* Snapshots via proxmox UI \- Manual, before server updates
* VM specific: from wsl workstation: see scripts file
* Automated backup daily 2 am: see [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
* Backup lod dev, lod prod, community-indx-earth to <primary-nas> promox-backup
* Real time sync <primary-nas> to <secondary-nas>; real time sync nas to expansion drives
## <primary-nas>
* Real-time backup to <secondary-nas> and external drives
## Offsite
* Backups are copied to portable external drives; if the location is being left unattended, the last person to leave takes the external drive with them
## SSH
* SSH key backup location: <primary-nas>
## Gitea/projects
# **Disaster Recovery Definition**
Failure Scenarios
* Dell dies → travel laptop takes over
* AVA dies → clone server takes over
* <primary-nas> dies → <secondary-nas> restore
* GPU node dies → inference unavailable but no data loss
# **Environment Variable Policy**
* .env policy: never committed to Git; enforce via gitignore
* Secrets stored on workstation/control plane in WSL: \~/.secrets/
* Backup and store on <primary-nas>
* Dev: .env loaded by app
* Production: apps receive secrets via a systemd environment file: EnvironmentFile=<app-install-path>
* .env never world-readable
* File permissions set to 600
* AI agents must never read production.env
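
The permission and gitignore rules above lend themselves to a pre-deploy check. A minimal sketch, assuming the repository lists `.env` on its own line in `.gitignore`; `check_env_policy` is a hypothetical name.

```sh
# Hypothetical pre-deploy check: the env file must be mode 600 and
# .env must be listed in the repository's .gitignore.
check_env_policy() (
    env_file="$1"
    gitignore="$2"
    status=0
    # find prints the path only when the permission bits are exactly 600.
    if [ -z "$(find "$env_file" -prune -perm 600 2>/dev/null)" ]; then
        echo "FAIL: $env_file is missing or not mode 600"
        status=1
    fi
    # -x -F: match the literal line ".env" exactly.
    if ! grep -q -x -F '.env' "$gitignore"; then
        echo "FAIL: .env is not listed in $gitignore"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "ok: $env_file"
    exit "$status"
)
```

Example: `check_env_policy ./.env ./.gitignore` run from a repo root before a push; a non-zero exit can block the deploy step.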
# **Logging Strategy**
* journald persistent storage enabled
* SystemMaxUse=500M (or similar)
* logrotate enabled for app logs
* Application logs stored in: <app-data-root>/logs/appname
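
The journald settings can be applied as a drop-in file rather than by editing the main config. The sketch below assumes the usual drop-in directory `/etc/systemd/journald.conf.d`; journald's size cap for persistent storage is spelled `SystemMaxUse=`. The function name and the `persistent.conf` file name are hypothetical.

```sh
# Hypothetical helper: write a journald drop-in into CONF_DIR.
# On a real VM, CONF_DIR would be /etc/systemd/journald.conf.d (as root).
write_journald_dropin() (
    conf_dir="$1"
    max_use="${2:-500M}"
    mkdir -p "$conf_dir" || exit 1
    cat > "$conf_dir/persistent.conf" <<EOF
[Journal]
Storage=persistent
SystemMaxUse=$max_use
EOF
)
```

After writing the file on a real VM, `sudo systemctl restart systemd-journald` makes journald pick it up.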
# **VM Environments**
https://docs.example.invalid/private-reference
# **API Keys**
* Stored in password vault/manager
* Rotate keys if a workstation is compromised
# **Configurations**
## Tailscale
<private-ip> <primary-workstation> (<primary-workstation>) <user>@ windows
<private-ip> caddy-router
ssh \-i \~/.ssh/edge\_control\_plane <ssh-user>@<private-ip>
## SSH
### *Windows Native SSH (PowerShell / Windows Terminal)*
main workstation: <windows-user-profile>\\.ssh backed up to NAS/Archive
In WSL, <windows-wsl-user-dir>/.ssh/
Use the above when connecting over SSH from Windows directly, using FTP, PuTTY, etc.
### *WSL Ubuntu SSH (Edge Control Plane)*
Main workstation: <linux-user-home>/.ssh/ backed up to NAS/Archive
Key created with Google Ed password to encrypt the backups using 7-zip
Use the above when running SSH from inside WSL, managing Proxmox VMs, or acting as the edge control plane
\~/.ssh/edge\_control\_plane
\~/.ssh/edge\_control\_plane.pub
## Git
Edge-control-plane: git repository for workstations/control plane
[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
Edge-pre01: git repository for pre01 server running proxmox and VMs
## VM Template
* VM 9000 → ubuntu-2404-base
* Status → Converted to template
* 32GB disk
* 2GB RAM
* Clean state
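
New VMs can be cloned from this template on the Proxmox host with `qm clone`. The sketch below only builds and prints the command so it can be reviewed first; `clone_cmd`, the example VM ID and the example name are assumptions, not values from this setup.

```sh
# Hypothetical helper: print the Proxmox command that would make a full
# clone of the base template (VM 9000). It never runs qm itself.
clone_cmd() (
    new_id="$1"
    name="$2"
    template_id="${3:-9000}"
    case "$new_id" in
        "" | *[!0-9]*) echo "new VM ID must be numeric" >&2; exit 1 ;;
    esac
    # --full 1 requests an independent copy instead of a linked clone.
    echo "qm clone $template_id $new_id --name $name --full 1"
)
```

Example: `clone_cmd 120 app-dev` prints `qm clone 9000 120 --name app-dev --full 1`, which can then be run on the Proxmox host.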
## Proxmox Setup
[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
## VM Template Setup
Use this document
## VM Caddy-router setup
C:\projects\infra\software\caddy-router
## Caddy SOP
C:\projects\infra\software\caddy-router
## VM Setup Using Template Process
[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)
## Install Discourse on ProxMox VM
"/path/to/infra\\edge-community-indx-earth\\discourse-export-topics-csv\\docs\\Discourse Install on ProxMox VM.md"
## Ops Monitor Setup And SOP
[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)