adminprojects/edge-data-center-main-OPEN

eddiesoehnel c5bb951281 added

2026-06-15 07:25:04 -06:00

20 KiB


Edge Data Center Architecture

Admin
Foundational Philosophy
VM Strategy
- Two Apps Max Per VM (Default)
- Core VM Roles
Networking Backbone
- Tailscale Mesh Network
- API Access Model
AI Integration Strategy
Monitoring Strategy
- Per-VM Monitoring
- Future Option
Standardization Framework
Backup SOP
- Router
- Workstation/Control Planes
- Proxmox
- VMs
- Offsite
- SSH
- Gitea/projects
Disaster Recovery Definition
Environment Variable Policy
Logging Strategy
VM Environments
API Keys
Configurations

Admin

Original brainstorming chat, named "Edge Data Center Platform Apps Structuring": https://chat.example.invalid/private-reference

Master list of projects, agents/workflows, tools, scripts/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference

Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md

Foundational Philosophy

Core Methodologies

Standardization over improvisation.
Reproducibility via templates and scripts.
VM-level isolation.
Git-first deployment.
Minimal dependencies.
Decentralize and isolate machines and processes to minimize the blast radius if anything is compromised.
Gradual complexity scaling.

Security Principles

No public SSH exposure anywhere
No port forwarding.
Tailscale for private networking.
Separate dev and prod.
Keep Ubuntu/Proxmox/Installs clean.
Principle of minimal installed services.
AI agents never read production.env, never hold a production SSH key, and never hold production DB credentials. AI agents operate only in dev. Full AI security policy here: https://docs.example.invalid/private-reference
Production public access:
- Reverse proxy (Caddy) controls ports 80 and 443 and routes Internet traffic to all apps
- HTTPS only
Domains: domain transfer lock, domain privacy
Firewalls on all systems
Dedicated infrastructure email account used for cloud service (Cloudflare, GoDaddy, etc.) logins, operating in its own browser session, with minimal browser apps installed
MFA and YubiKey for infrastructure account logins
uBlock Origin/Lite
Backups:
- 3 copies of data (e.g., NAS)
- 2 different storage types (e.g., cloud)
- 1 copy stored offsite (e.g., external drive)
Full online security policies and practices here: https://docs.example.invalid/private-reference
Cloudflared could be added as well, but may not be necessary.
Route domains through Cloudflare with proxying enabled (orange cloud).
Apps run in VMs, not Docker or other containers. VMs limit blast radius the most.
See C:\projects\infra\edge-data-center-main\online-security-practices.md for more
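
The 3-2-1 backup rule listed above can be checked mechanically. The following is a minimal sketch, not existing tooling: the `BackupCopy` record, the media labels, and the `satisfies_3_2_1` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data. All field values here are illustrative."""
    name: str
    media: str      # storage type, e.g. "nas", "cloud", "external-drive"
    offsite: bool   # True if the copy is kept away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if there are >= 3 copies, >= 2 media types, >= 1 offsite."""
    media_types = {c.media for c in copies}
    offsite_count = sum(1 for c in copies if c.offsite)
    return len(copies) >= 3 and len(media_types) >= 2 and offsite_count >= 1


# Example inventory (assumed, not taken from the real environment).
copies = [
    BackupCopy("nas", "nas", offsite=False),
    BackupCopy("onedrive", "cloud", offsite=True),
    BackupCopy("portable-drive", "external-drive", offsite=True),
]
print(satisfies_3_2_1(copies))
```

A check like this could run after the nightly backup job, so a missing copy is noticed before it is needed.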

Primary And Backup Machines

A primary stand-alone VM always has a secondary warm backup machine in case the primary fails.

Design Intent

This architecture supports:

Remote secure access.
AI-assisted development.
Production stability.
Monitoring visibility.
Future GPU expansion.
Sovereign model hosting.
Reduced SaaS dependency.
Long-term cost control.

Dev vs Production Process

Production is always on its own machine, with a warm backup on standby if the primary fails.

Dev is done on a separate machine, either an ephemeral dev machine or a dev machine with dev apps in Docker containers. If Docker:

Git is transporting:

source code
configs
assets
scripts

Docker is only:

runtime packaging
dependency isolation
environment management

So if:

your app code is portable
dependencies are compatible
paths/configs are sane

then:

Docker vs non-Docker is mostly irrelevant.

Whether Docker or not, Git is used to maintain code.

But runtime has to be the same across all machines.

Traffic Flow

All domains are routed through Cloudflare with proxying enabled (orange cloud), with cache and security rules in place.
For inbound access, no open ports, only Tailscale.
Internal on LAN: SSH keys only, no passwords.
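
The "SSH keys only" rule above can be verified against a server's sshd_config. This is a minimal sketch under stated assumptions: the helper name `sshd_is_key_only` is hypothetical, only the whitespace-separated `Keyword value` form is handled, and `Include` and `Match` blocks are not interpreted.

```python
def sshd_is_key_only(config_text: str) -> bool:
    """Check sshd_config text for key-only login.

    sshd uses the first value it sees for each keyword, and keywords are
    case-insensitive. Include and Match blocks are ignored in this sketch.
    """
    seen: dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        seen.setdefault(key, value)  # first occurrence wins
    # OpenSSH defaults: PasswordAuthentication yes, PubkeyAuthentication yes.
    password_off = seen.get("passwordauthentication", "yes") == "no"
    pubkey_on = seen.get("pubkeyauthentication", "yes") == "yes"
    return password_off and pubkey_on


sample = "PasswordAuthentication no\nPubkeyAuthentication yes\n"
print(sshd_is_key_only(sample))
```

A commented-out `PasswordAuthentication no` line does not count, because the default of `yes` still applies.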

Email/Comms

Gmail is primary
Thunderbird connects to Gmail accounts over IMAP for offline backup
Secondary SMTP set up to send from desktop if Gmail is down
mbsync: Thunderbird to email backup
All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS

Logging

https://chat.example.invalid/private-reference

Prometheus+Grafana = server hardware metrics (gauges)
Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
Container logs: Docker restarts, failures
Security logs: fail2ban, UFW, Cloudflare security events
VM inventory scan: scan all apps, software, dependencies, packages installed
Cloudflare telemetry: edge telemetry from CF.
Email telemetry: listmonk and Postmark data on emails: opens, bounces, clicks, etc.
AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
Google Analytics telemetry
Business event telemetry: payments, subscriptions, signups
Uptime monitoring: uptimemonitor for URL-based apps; non-URL apps, such as ops monitor, need alerts set up separately.
Internet-down monitoring: Raspberry Pi does this through uptime monitor
Electricity monitoring: is grid down, is UPS down, solar production
Public network: Cloudflare down, internet down, etc.
Weather data
Calendar, browser history, location data, image metadata, Garmin FIT data

Any quantitative data to reason over, find correlations, patterns. Logging gets sent to a unified telemetry lake and is parsed as needed for analysis.

Layered Architecture

Layer – Hardware

Tier 1, 2 and 3 classification: see Physical Servers and PC SOP: C:\projects\infra\edge-data-center-main\physical-machine-tiers.md

Layer - Routing

Caddy runs on its own VM, which is public facing and the only publicly exposed VM
Proxmox only
No extra tooling installed
Treated like firmware
Clean, stable, minimal

Layer – Operating System Standard

Ubuntu 24.04 LTS (Noble) everywhere
No Debian/Ubuntu mixing
No version drift

Layer - VMs

VM Strategy

Two Apps Max Per VM (Default)

Each machine runs Proxmox and two VMs, each VM with one app. It could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending huge amounts of time reinstalling a bunch of VMs and apps on one machine.
Docker is not used unless an app requires it; in some cases a VM is dockerized to host additional apps.
Isolation handled at the VM level.
Services managed via systemd.

This reduces complexity and improves debuggability.

Core VM Roles

Primary Workstation/Control Planes

Windows Host Layer

Tailscale
Git For Windows - used as primary git
GitHub Desktop - used as primary git
Filezilla
AI coding agent installed via VS Code
Development only
Snapshottable
Windows Terminal
PowerShell 7
BitLocker: if Windows Pro is installed
Python for personal automation as needed

WSL Ubuntu Layer

Git
OpenSSH client
curl
build-essential
Python
Node
AI CLI tooling
Optional Docker (if required)
API keys
tmux (session persistence)
htop (local visibility)
jq (JSON processing)
yq (YAML processing)
direnv (env management)
gh CLI (GitHub CLI)
make (if using build pipelines)
ripgrep
fzf
Whisper STT for speech-to-text processing

VM Development Layer

Prometheus+Grafana+Loki+Promtail...logging layer
Git
OpenSSH client
curl
build-essential
Python
Node
Optional Docker (if required)
tmux (session persistence)
htop (local visibility)
jq (JSON processing)
yq (YAML processing)
direnv (env management)
gh CLI (GitHub CLI)
make (if using build pipelines)
ripgrep
fzf
unzip
Tailscale
fail2ban

VM Production Layer

openssh-server
tailscale
Prometheus+Grafana...logging layer
runtime dependencies for the app
systemd service
git

Model Gateway (control-only)

Lightweight API service
Auth/rate limiting for internal callers (apps)
Logging/metrics
Provider routing config
Tailscale
Prometheus+Grafana...logging layer

GPU Node (compute-only)

Ubuntu 24.04
NVIDIA drivers
Model server
Tailscale
No Git
Prometheus+Grafana
No reverse proxy
No public ports

SSH + Access Layer + Backup

Primary workstation/control planes must have:

SSH keys configured
known_hosts cleanly maintained
Key-based login only
No passwords
BitLocker enabled
Secure backup of SSH keys
Encrypted backup of WSL distro

Development/Production Machines

Prometheus+Grafana...logging layer
Tailscale installed

Old Machines as HTOP Monitoring Only

antiX Linux
openssh-server
tailscale
fail2ban
tmux
htop
curl
jq
net-tools

Git Strategy

Git flow from the primary workstation/control planes:

Primary Git working copy lives on the Windows workstation/control planes
Push to GitHub
Dev VM clones from Git
Prod pulls from Git
No manual SFTP deploy.

Monitoring Access

Primary workstation/control planes:

Access Prometheus+Grafana...logging layer dashboards via Tailscale
SSH into VMs for deeper inspection
Does NOT host monitoring
Keep monitoring on VMs.

VM For Email System

Useful for logs and mail utilities

sudo apt install -y \
mailutils \
logrotate \
ca-certificates

Networking Backbone

Tailscale Mesh Network

Purpose:

Secure private communication between all machines.
No port forwarding.
No exposed public services.

Characteristics:

WireGuard-based encryption.
Private 100.x.x.x addressing.
Works locally and remotely.
Same addressing scheme everywhere.

Installed on:

All hardware

API Access Model

Internal services (e.g., inference server) bind to:

100.x.x.x:port
Not exposed publicly.
Applications communicate securely over the Tailscale mesh.
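
Before an internal service binds, its address can be checked against the Tailscale range. This is a minimal sketch: the helper name `is_tailscale_address` is hypothetical, and it assumes the "100.x.x.x" addresses above are standard Tailscale IPv4 addresses, which come from the CGNAT block 100.64.0.0/10 rather than the whole of 100.0.0.0/8.

```python
import ipaddress

# Tailscale assigns node IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_address(addr: str) -> bool:
    """Return True if addr is an IPv4 address inside 100.64.0.0/10."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # not a valid IP address at all
    return ip.version == 4 and ip in TAILSCALE_NET


# Illustrative bind addresses, not real hosts.
for candidate in ("100.101.102.103", "0.0.0.0", "192.168.1.20"):
    print(candidate, is_tailscale_address(candidate))
```

A startup script could refuse to launch the inference server when the configured bind address fails this check, which guards against accidentally binding to `0.0.0.0`.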

AI Integration Strategy

Development-Only AI

AI tools installed only on:

AI Responsibilities:

Edit code
Run tests
Restart dev services
Assist in development

AI does NOT:

Access production directly
Hold production credentials
Modify live systems

Production AI

GPU nodes are never publicly exposed.
Inference APIs bind only to Tailscale IP.
Public traffic flows through reverse proxy and app layer.
App layer enforces auth, rate limits, billing, logging.
GPU nodes are compute-only and stateless.

Promotion Workflow

dev → git commit → git push → prod pulls → restart service

Production remains deterministic.
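
The production side of the promotion workflow (pull, then restart) can be sketched as below. This is an illustration under assumptions, not the deployed script: the function names, the app name, and the service name are hypothetical, and the app is assumed to live under /srv/apps per the Directory Standard.

```python
import subprocess
from pathlib import PurePosixPath


def promotion_commands(app_name: str, service: str) -> list[list[str]]:
    """Build the commands prod runs to pick up a pushed commit."""
    app_dir = PurePosixPath("/srv/apps") / app_name
    return [
        # --ff-only refuses to merge, so prod never diverges from the remote.
        ["git", "-C", str(app_dir), "pull", "--ff-only"],
        ["systemctl", "restart", service],
    ]


def promote(app_name: str, service: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping on failure."""
    for cmd in promotion_commands(app_name, service):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


promote("example-app", "example-app.service")  # dry run: prints only
```

Using a fast-forward-only pull keeps production deterministic: if someone edited files on prod by hand, the pull fails loudly instead of merging.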

Monitoring Strategy

Per-VM Monitoring

Install on each production VM:

Prometheus+Grafana...logging layer (real-time dashboard)
No central aggregation required initially.
htop running inside tmux

Monitor:

CPU
Memory
Load
Disk I/O
Network
Process count
Docker (if ever used)

Future Option

If scaling increases:

Add alerting layer (e.g., Uptime Kuma)
Add PostHog for more analysis

Standardization Framework

Canonical Ubuntu Base

All VMs must:

Run Ubuntu 24.04 LTS (Noble)
Use same apt repositories
Use same SSH configuration
Use same directory layout

Directory Standard

/srv/apps
/srv/backups
Consistent across all machines.
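
A bootstrap step could create this layout idempotently. The following is a minimal sketch: `ensure_layout` and `STANDARD_DIRS` are hypothetical names, and the `root` parameter exists only so the function can be tried in a scratch directory instead of on a real machine.

```python
import tempfile
from pathlib import Path

# Directories from the standard above, relative to the filesystem root.
STANDARD_DIRS = ("srv/apps", "srv/backups")


def ensure_layout(root: str = "/") -> list[Path]:
    """Create the standard directories under root if they are missing."""
    created = []
    for rel in STANDARD_DIRS:
        target = Path(root) / rel
        target.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(target)
    return created


# Try it in a throwaway directory rather than on the live system.
with tempfile.TemporaryDirectory() as scratch:
    for path in ensure_layout(scratch):
        print(path.relative_to(scratch), path.is_dir())
```

Because `exist_ok=True` makes the call idempotent, the same step can run on a fresh VM or an existing one without special-casing.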

Proxmox

Installed on multi-core machines to create and manage VMs

See Setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"

Proxmox VE
SSH (key only)
Tailscale
Proxmox firewall

Bootstrap Script

Create a reusable bootstrap.sh

Installs:

Baseline packages
Prometheus+Grafana...logging layer (prod only)
Tailscale
Directory structure
User setup

Ensures reproducibility.

Containerization Philosophy

Default:

No Docker.
Native systemd services.

Use Docker only when:

Multiple services per VM required.
Isolation becomes necessary.
CI/CD complexity grows.

Backup SOP

Router

Back up all configs and download them; note the router version number as well. Save to . Restore to the backup router as well. See C:\projects\infra\hardware\routers for specific instructions.

Workstation/Control Planes

Google Drive syncs in real time to OneDrive via BackupHQ

Back up WSL Ubuntu with new software changes/updates as tar to . Manual tar → copy to

See /path/to/infra\edge-\README.md

Proxmox

Back up config files only. See the end of the Caddy Installation Guide: https://chat.example.invalid/private-reference

VMs

Proxmox UI > VM > Backup: manual, see scripts file
Snapshots via Proxmox UI: manual, before server updates
VM-specific: from WSL workstation, see scripts file
Automated daily backup at 2 am: see https://docs.example.invalid/private-reference
- Back up lod dev, lod prod, community-indx-earth to promox-backup
- Real-time sync to ; real-time sync NAS to expansion drives

Real-time backup to and external drives

Offsite

Backups are copied to portable external drives. If no one will be at the location, the last person to leave takes the external drive with them.

SSH

SSH key backup location:

Gitea/projects

Disaster Recovery Definition

Failure Scenarios

Dell dies → travel laptop takes over
AVA dies → clone server takes over
dies → restore
GPU node dies → inference unavailable but no data loss

Environment Variable Policy

.env policy: never committed to Git; enforce via .gitignore
Secrets store on workstation/control plane: WSL, ~/.secrets/
Backup and store on
Dev: .env loaded by app
Production: apps receive secrets via systemd environment file: EnvironmentFile=
.env never world-readable
File permissions set to 600
AI agents must never read production.env
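
Two parts of this policy, the 600 permissions and the Git exclusion, can be audited with a short script. This is a minimal sketch for Linux/WSL: both helper names are hypothetical, and the .gitignore check only looks for a literal `.env` line rather than evaluating full gitignore pattern semantics.

```python
import os
import stat
import tempfile


def env_file_is_private(path: str) -> bool:
    """True if the file grants nothing to group or others (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


def gitignore_excludes_env(gitignore_text: str) -> bool:
    """True if the .gitignore text contains a literal `.env` pattern line."""
    patterns = {line.strip() for line in gitignore_text.splitlines()}
    return ".env" in patterns


# Demonstrate on a throwaway file with a placeholder value, never a real secret.
with tempfile.TemporaryDirectory() as scratch:
    env_path = os.path.join(scratch, ".env")
    with open(env_path, "w") as fh:
        fh.write("EXAMPLE_KEY=placeholder\n")
    os.chmod(env_path, 0o600)
    print(env_file_is_private(env_path))
    print(gitignore_excludes_env("node_modules/\n.env\n"))
```

The permission check reads only file metadata, so it can run without ever opening the secrets file.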

Logging Strategy

journald persistent storage enabled
SystemMaxUse=500M (or similar)
logrotate enabled for app logs
Application logs stored in: /logs/appname
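
The journald settings above could be written as a drop-in file by the bootstrap process. This is a minimal sketch that assumes the size cap refers to journald's `SystemMaxUse=` option and that persistence is set with `Storage=persistent`; the function name `journald_dropin` and the suggested file name are hypothetical.

```python
def journald_dropin(max_use: str = "500M") -> str:
    """Render a journald drop-in enabling persistent storage with a size cap."""
    lines = [
        "[Journal]",
        "Storage=persistent",         # keep logs across reboots
        f"SystemMaxUse={max_use}",    # cap disk space used by the journal
    ]
    return "\n".join(lines) + "\n"


# Intended destination (assumed): /etc/systemd/journald.conf.d/90-edge.conf
print(journald_dropin(), end="")
```

After the file is placed, `systemctl restart systemd-journald` applies the settings.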

VM Environments

https://docs.example.invalid/private-reference

API Keys

Stored in password vault/manager
Rotate if a workstation is compromised

Configurations

Tailscale

() @ windows

caddy-router

ssh -i ~/.ssh/edge_control_plane @

SSH

Windows Native SSH (PowerShell / Windows Terminal)

Main workstation: \.ssh backed up to NAS/Archive

In WSL, /.ssh/

Use the above when connecting over SSH from Windows directly, using FTP, PuTTY, etc.

WSL Ubuntu SSH (Edge Control Plane)

Main workstation: /.ssh/ backed up to NAS/Archive

Key created with Google Ed password to encrypt the backups using 7-zip

Use the above when running SSH from inside WSL, managing Proxmox VMs, or acting as the edge control plane

~/.ssh/edge_control_plane

~/.ssh/edge_control_plane.pub

Git

Edge-control-plane: git repository for workstations/control plane

https://docs.example.invalid/private-reference

Edge-pre01: git repository for pre01 server running proxmox and VMs

VM Template

VM 9000 → ubuntu-2404-base
Status → Converted to template
32GB disk
2GB RAM
Clean state
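
New VMs can be cloned from template 9000 on the Proxmox host with `qm clone`. This is a minimal sketch that only builds the command: the function name, the example VM ID, and the VM name are hypothetical, and the flags should be confirmed with `qm help clone` on the installed Proxmox version before use.

```python
def clone_from_template(new_vmid: int, name: str,
                        template_vmid: int = 9000,
                        full: bool = True) -> list[str]:
    """Build a `qm clone` command for a new VM based on the template."""
    if new_vmid < 100:
        # Proxmox VM IDs start at 100.
        raise ValueError("Proxmox VM IDs must be 100 or higher")
    cmd = ["qm", "clone", str(template_vmid), str(new_vmid), "--name", name]
    if full:
        # Full clone: independent disk instead of a linked clone of the template.
        cmd.append("--full")
    return cmd


# Illustrative values only; run the resulting command on the Proxmox host.
print(" ".join(clone_from_template(101, "example-app-prod")))
```

A full clone costs more disk than a linked clone but keeps the new VM independent of the template, which fits the isolation goal of this architecture.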

Proxmox Setup

https://docs.example.invalid/private-reference

VM Template Setup

Use this document

VM Caddy-router setup

C:\projects\infra\software\caddy-router

Caddy SOP

C:\projects\infra\software\caddy-router

VM Setup Using Template Process

https://docs.example.invalid/private-reference

Install Discourse on ProxMox VM

"/path/to/infra\edge-community-indx-earth\discourse-export-topics-csv\docs\Discourse Install on ProxMox VM.md"

Ops Monitor Setup And SOP

https://docs.example.invalid/private-reference
