# **Edge Data Center Architecture**

## Table of Contents

- [Admin](#admin)
- [Foundational Philosophy](#foundational-philosophy)
  - [Core Methodologies](#core-methodologies)
  - [Security Principles](#security-principles)
  - [Primary And Backup Machines](#primary-and-backup-machines)
  - [Design Intent](#design-intent)
  - [Dev vs Production Process](#dev-process)
  - [Traffic Flow](#traffic-flow)
  - [Email/Comms](#emailcomms)
  - [Logging](#logging)
  - [Layered Architecture](#layered-architecture)
    - [Layer – Hardware](#layer-hardware)
    - [Layer - Routing](#layer-routing)
    - [Layer – Operating System Standard](#layer-operating-system-standard)
    - [Layer - VMs](#layer-vms)
- [VM Strategy](#vm-strategy)
  - [Two Apps Max Per VM (Default)](#two-apps-max-per-vm-default)
  - [Core VM Roles](#core-vm-roles)
    - [Primary Workstation/Control Planes](#primary-workstationcontrol-planes)
      - [Windows Host Layer](#windows-host-layer)
      - [WSL Ubuntu Layer](#wsl-ubuntu-layer)
    - [VM Development Layer](#vm-development-layer)
    - [VM Production Layer](#vm-production-layer)
    - [Model Gateway (control-only)](#model-gateway-control-only)
    - [GPU Node (compute-only)](#gpu-node-compute-only)
    - [SSH + Access Layer + Backup](#ssh-access-layer-backup)
    - [Development/Production Machines](#developmentproduction-machines)
    - [Old Machines as HTOP Monitoring Only](#old-machines-as-htop-monitoring-only)
    - [Git Strategy](#git-strategy)
    - [Monitoring Access](#monitoring-access)
    - [VM For Email System](#vm-for-email-system)
- [Networking Backbone](#networking-backbone)
  - [Tailscale Mesh Network](#tailscale-mesh-network)
  - [API Access Model](#api-access-model)
- [AI Integration Strategy](#ai-integration-strategy)
  - [Development-Only AI](#development-only-ai)
  - [Production AI](#production-ai)
  - [Promotion Workflow](#promotion-workflow)
- [Monitoring Strategy](#monitoring-strategy)
  - [Per-VM Monitoring](#per-vm-monitoring)
  - [Future Option](#future-option)
- [Standardization Framework](#standardization-framework)
  - [Canonical Ubuntu Base](#canonical-ubuntu-base)
  - [Directory Standard](#directory-standard)
  - [ProxMox](#proxmox)
  - [Bootstrap Script](#bootstrap-script)
  - [Containerization Philosophy](#containerization-philosophy)
- [Backup SOP](#backup-sop)
  - [Router](#router)
  - [Workstation/Control Planes](#workstationcontrol-planes)
  - [Proxmox](#proxmox-1)
  - [VMs](#vms)
  - [<primary-nas>](#nas0)
  - [Offsite](#offsite)
  - [SSH](#ssh)
  - [Gitea/projects](#giteaprojects)
- [Disaster Recovery Definition](#disaster-recovery-definition)
- [Environment Variable Policy](#environment-variable-policy)
- [Logging Strategy](#logging-strategy)
- [VM Environments](#vm-environments)
- [API Keys](#api-keys)
- [Configurations](#configurations)
  - [Tailscale](#tailscale)
  - [SSH](#ssh-1)
    - [Windows Native SSH (PowerShell / Windows Terminal)](#windows-native-ssh-powershell-windows-terminal)
    - [WSL Ubuntu SSH (Edge Control Plane)](#wsl-ubuntu-ssh-edge-control-plane)
  - [Git](#git)
  - [VM Template](#vm-template)
  - [Proxmox Setup](#proxmox-setup)
  - [VM Template Setup](#vm-template-setup)
  - [VM Caddy-router setup](#vm-caddy-router-setup)
  - [Caddy SOP](#caddy-sop)
  - [VM Setup Using Template Process](#vm-setup-using-template-process)
  - [Install Discourse on ProxMox VM](#install-discourse-on-proxmox-vm)
  - [Ops Monitor Setup And SOP](#ops-monitor-setup-and-sop)

# **Admin**

Original brainstorming chat, named "Edge Data Center Platform Apps Structuring": [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference)

Master list of projects, agents/workflows, tools, script/commands, hardware, software, VMs, email, cron, strategic assets, agent names, example agent models, validations: https://docs.example.invalid/private-reference 

Dashboards: C:\projects\infra\edge-data-center-main\dashboards.md

# **Foundational Philosophy**

## Core Methodologies

* Standardization over improvisation.  
* Reproducibility via templates and scripts.  
* VM-level isolation.  
* Git-first deployment.  
* Minimal dependencies.  
* Decentralize and isolate machines and processes to minimize blast radius if anything is compromised.  
* Gradual complexity scaling.

## Security Principles

* No public SSH exposure anywhere  
* No port forwarding.  
* Tailscale for private networking.  
* Separate dev and prod.  
* Keep Ubuntu/Proxmox/Installs clean.  
* Principle of minimal installed services.  
* AI agents never read production.env, never hold a production SSH key, and never hold production DB credentials. AI agents operate only in dev. Full AI security policy here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)   
* Production public access:  
  * Reverse proxy (Caddy) controls ports 80 and 443 and routes to all apps from the Internet  
  * HTTPS only  
* Domains: domain transfer lock, domain privacy  
* Firewalls on all systems  
* Dedicated infrastructure email account used for cloud service (Cloudflare, GoDaddy, etc) logins, operating in its own browser session, with minimal browser apps installed  
* MFA and YubiKey on infrastructure account logins  
* uBlock Origin/Lite  
* Backups:  
  * 3 copies of data (e.g., NAS)  
  * 2 different storage types (e.g., cloud)  
  * 1 copy stored offsite (e.g., external drive)  
* Full online security policies and practices here: [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)   
* Cloudflared could be added as well, but may not be necessary  
* Route domains through Cloudflare in proxied (orange-cloud) mode  
* Apps run in VMs, not Docker or other runtimes. VMs limit blast radius the most.  
* See C:\projects\infra\edge-data-center-main\online-security-practices.md for more
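
The 3-2-1 backup rule in the list above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical inventory of backup copies; the `BackupCopy` type and the sample entries are illustrative and not part of the actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical inventory record)."""
    name: str
    media: str      # storage type, e.g. "nas", "cloud", "external-drive"
    offsite: bool   # True if the copy lives away from the primary location


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True for >= 3 copies, >= 2 storage types, and >= 1 offsite copy."""
    media_types = {c.media for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite


# Illustrative inventory only; names are assumptions.
copies = [
    BackupCopy("primary-nas", "nas", offsite=False),
    BackupCopy("cloud-sync", "cloud", offsite=True),
    BackupCopy("portable-drive", "external-drive", offsite=True),
]
print(satisfies_3_2_1(copies))
```

A check like this could run after each backup job, so a missing copy shows up as a failed check and not during a restore.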

## Primary And Backup Machines

Every primary stand-alone VM has a secondary warm backup machine that takes over if the primary fails.  

## Design Intent

This architecture supports:

* Remote secure access.  
* AI-assisted development.  
* Production stability.  
* Monitoring visibility.  
* Future GPU expansion.  
* Sovereign model hosting.  
* Reduced SaaS dependency.  
* Long-term cost control.

## Dev vs Production Process

Production is always on its own machine, with a warm backup on standby in case the primary fails.

Dev is done on a separate machine: either an ephemeral dev machine or a dev machine with dev apps in Docker containers. If Docker is used, responsibilities split as follows.

Git transports:

- source code
- configs
- assets
- scripts

Docker is only:

- runtime packaging
- dependency isolation
- environment management

So if:

- your app code is portable
- dependencies are compatible
- paths/configs are sane

then:

Docker vs non-Docker is mostly irrelevant.

Whether Docker is used or not, Git is used to maintain the code.

The runtime, however, has to be the same across all machines. 
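
The requirement that the runtime match across machines can be checked by comparing version inventories. This is a minimal sketch, assuming each machine reports a mapping of component to version; the machine names and versions below are hypothetical placeholders.

```python
def find_runtime_drift(machines: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return components whose version differs between machines.

    `machines` maps a machine name to {component: version}.
    The result maps each drifting component to the set of versions seen.
    A component absent on some machine is reported as "missing".
    """
    components: set[str] = set()
    for versions in machines.values():
        components.update(versions)

    drift: dict[str, set[str]] = {}
    for component in sorted(components):
        seen = {versions.get(component, "missing") for versions in machines.values()}
        if len(seen) > 1:
            drift[component] = seen
    return drift


# Hypothetical inventory; real values would be collected from each machine.
inventory = {
    "dev": {"python": "3.12.3", "node": "20.11.0"},
    "prod": {"python": "3.12.3", "node": "18.19.1"},
}
print(find_runtime_drift(inventory))
```

An empty result means the inventories agree; any entry names a component to align before promoting code.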

## Traffic Flow

* All domains are routed through Cloudflare in proxied (orange-cloud) mode, with cache and security rules in place.  
* For inbound access, no open ports; Tailscale only.  
* Internal access on the LAN uses SSH keys only, never passwords

## Email/Comms

* Gmail is primary  
* Thunderbird connects to the Gmail accounts over IMAPS for offline backup  
* Secondary SMTP setup to send from the desktop if Gmail is down  
* mbsync Thunderbird to email backup  
* All Gmail backed up to OneDrive via CloudHQ, and OneDrive synced to NAS

## Logging
https://chat.example.invalid/private-reference

* Prometheus+Grafana = server hardware metrics (gauges)
* Loki+Promtail = what happened, who connected, what error occurred, what bot accessed what, what sequence of events occurred (black-box flight recorder).
* Reverse proxy/web access logs: Caddy/Nginx for human and AI traffic, attacks, scans, API usage, bandwidth patterns, response codes.
* Application logs: FastAPI, Python apps, Discourse, WordPress, listmonk, etc.: exceptions, internal events, business logic, workflow history.
* Container logs: Docker restarts, failures
* Security logs: fail2ban, UFW, Cloudflare security events
* VM inventory scan: scan all apps, software, dependencies, and packages installed
* Cloudflare telemetry: edge telemetry from Cloudflare.
* Email telemetry: listmonk and Postmark data on emails: opens, bounces, clicks, etc.
* AI-agent telemetry: AI access/retrievals, MCPO calls, chunk fetches, vector retrievals
* Google Analytics telemetry
* Business event telemetry: payments, subscriptions, signups
* Uptime monitoring: uptimemonitor for URL-based apps; apps without a URL, such as ops monitor, need alerts set up separately.
* Internet-down monitoring: a Raspberry Pi does this through uptime monitor
* Electricity monitoring: is the grid down, is the UPS down, solar production
* Public network: Cloudflare down, internet down, etc.
* Weather data
* Calendar, browser history, location data, image metadata, Garmin FIT data

Capture any quantitative data that can be reasoned over to find correlations and patterns. Logging gets sent to a unified telemetry lake and is parsed as needed for analysis. 
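
One way to make many log sources land in a single telemetry lake is to wrap each event in a shared envelope and keep the source-specific fields untouched. The sketch below is an assumption about how that could look; the envelope field names (`ts`, `source`, `kind`, `payload`) and the JSON-lines format are illustrative and not a format this document defines.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def to_lake_record(source: str, kind: str, payload: dict,
                   ts: Optional[datetime] = None) -> str:
    """Serialize one event as a JSON line with a shared envelope."""
    ts = ts or datetime.now(timezone.utc)
    record = {
        "ts": ts.astimezone(timezone.utc).isoformat(),  # always stored in UTC
        "source": source,    # e.g. "caddy", "fail2ban", "listmonk"
        "kind": kind,        # e.g. "access", "ban", "bounce"
        "payload": payload,  # source-specific fields, left as-is
    }
    return json.dumps(record, sort_keys=True)


line = to_lake_record(
    "caddy",
    "access",
    {"status": 200, "path": "/"},
    ts=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(line)
```

Keeping the envelope small means new sources can be added without changing the lake's schema; parsing of `payload` is deferred until analysis time.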

## Layered Architecture

### *Layer – Hardware*

Tier 1, 2 and 3 classification:  see Physical Servers and PC SOP:  C:\projects\infra\edge-data-center-main\physical-machine-tiers.md 

### *Layer \- Routing*

* Caddy on its own VM that is public facing and only publicly exposed VM  
* Proxmox only  
* No extra tooling installed  
* Treated like firmware  
* Clean, stable, minimal

### *Layer – Operating System Standard*

* Ubuntu 24.04 LTS (Noble) everywhere  
* No Debian/Ubuntu mixing  
* No version drift

### *Layer \- VMs*

# **VM Strategy**

## Two Apps Max Per VM (Default)

* Each machine runs Proxmox and two VMs, each VM with one app. It could be three VMs, or two dockerized apps in one VM. The goal is that if a machine fails, you are not spending a huge amount of time reinstalling many VMs and apps on one machine.  
* No Docker unless an app requires it; in some cases a VM is dockerized to host additional apps
* Isolation handled at the VM level.  
* Services managed via systemd.

This reduces complexity and improves debuggability.

## Core VM Roles

### *Primary Workstation/Control Planes*

#### Windows Host Layer

* Tailscale  
* Git For Windows \- used as primary git  
* GitHub Desktop \- used as primary git  
* Filezilla  
* AI coding agent installed via VS code  
* Development only  
* Snapshottable  
* Windows Terminal  
* PowerShell 7  
* BitLocker: if the Windows Pro edition is installed  
* Python for personal automation as needed

#### WSL Ubuntu Layer

* Git  
* OpenSSH client  
* curl  
* build-essential  
* Python  
* Node  
* AI CLI tooling  
* Optional Docker (if required)  
* API keys  
* tmux (session persistence)  
* htop (local visibility)  
* jq (JSON processing)  
* yq (YAML processing)  
* direnv (env management)  
* gh CLI (GitHub CLI)  
* make (if using build pipelines)  
* Ripgrep  
* Fzf  
* Whisper STT for speech-to-text processing

### *VM Development Layer*

* Prometheus+Grafana+Loki+Promtail...logging layer
* Git  
* OpenSSH client  
* curl  
* build-essential  
* Python  
* Node  
* Optional Docker (if required)  
* tmux (session persistence)  
* htop (local visibility)  
* jq (JSON processing)  
* yq (YAML processing)  
* direnv (env management)  
* gh CLI (GitHub CLI)  
* make (if using build pipelines)  
* Ripgrep  
* Fzf   
* Unzip  
* Tailscale  
* fail2ban

### *VM Production Layer*

* openssh-server  
* tailscale  
* Prometheus+Grafana...logging layer
* runtime dependencies for the app  
* systemd service  
* git 

### *Model Gateway (control-only)*

* Lightweight API service

* Auth/rate limiting for internal callers (apps)

* Logging/metrics

* Provider routing config

* Tailscale

* Prometheus+Grafana...logging layer

### *GPU Node (compute-only)*

* Ubuntu 24.04

* NVIDIA drivers

* Model server

* Tailscale

* No Git

* Prometheus+Grafana  
* No reverse proxy

* No public ports

### *SSH \+ Access Layer \+ Backup*

Primary workstation/control planes must have:

* SSH keys configured  
* Known\_hosts cleanly maintained  
* Key-based login only  
* No passwords  
* BitLocker enabled  
* Secure backup of SSH keys  
* Encrypted backup of WSL distro

### *Development/Production Machines*

* Prometheus+Grafana...logging layer 
* Tailscale installed

### *Old Machines as HTOP Monitoring Only*

* antiX Linux  
* openssh-server  
* tailscale  
* fail2ban  
* tmux  
* htop  
* curl  
* jq  
* net-tools

### *Git Strategy*

Primary workstation/control planes should be:

* Primary Git origin working copy on Windows workstation/control planes  
* Push to GitHub   
* Dev VM clones from Git  
* Prod pulls from Git  
* No manual SFTP deploy.

### *Monitoring Access*

Primary workstation/control planes:

* Access Prometheus+Grafana...logging layer dashboards via Tailscale  
* SSH into VMs for deeper inspection  
* Does NOT host monitoring  
* Keep monitoring on VMs.

### *VM For Email System*

Useful for logs and mail utilities

* `sudo apt install -y mailutils logrotate ca-certificates`

# **Networking Backbone**

## Tailscale Mesh Network

Purpose:

* Secure private communication between all machines.  
* No port forwarding.  
* No exposed public services.

Characteristics:

* WireGuard-based encryption.  
* Private 100.x.x.x addressing.  
* Works locally and remotely.  
* Same addressing scheme everywhere.

Installed on:

* All hardware

## API Access Model

Internal services (e.g., inference server) bind to:

* 100.x.x.x:port  
* Not exposed publicly.  
* Applications communicate securely over the Tailscale mesh.
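
Before an internal service starts, its bind address can be checked against the Tailscale address space. This is a minimal sketch, assuming IPv4 only; Tailscale assigns addresses from the CGNAT range 100.64.0.0/10, which is narrower than all of 100.x.x.x. The function name is hypothetical.

```python
import ipaddress

# Tailscale IPv4 addresses come from the CGNAT range 100.64.0.0/10
# (100.64.0.0 through 100.127.255.255).
TAILSCALE_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_address(host: str) -> bool:
    """Return True if `host` is an IPv4 literal inside the Tailscale range."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. a hostname or empty string).
        return False
    return addr.version == 4 and addr in TAILSCALE_V4


print(is_tailscale_address("100.101.102.103"))
print(is_tailscale_address("0.0.0.0"))
```

A service could refuse to start when its configured bind address fails this check, which guards against accidentally binding to `0.0.0.0` or a LAN address.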

# **AI Integration Strategy**

## Development-Only AI

AI tools installed only on:

* <primary-workstation> WSL

AI Responsibilities:

* Edit code  
* Run tests  
* Restart dev services  
* Assist in development

AI does NOT:

* Access production directly  
* Hold production credentials  
* Modify live systems

## Production AI

* GPU nodes are never publicly exposed.

* Inference APIs bind only to Tailscale IP.

* Public traffic flows through reverse proxy and app layer.

* App layer enforces auth, rate limits, billing, logging.

* GPU nodes are compute-only and stateless.

## Promotion Workflow

dev → git commit → git push → prod pulls → restart service

Production remains deterministic.
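
The production half of the workflow ("prod pulls, restart service") can be sketched as a fixed command sequence. This is an illustration under assumptions: the app directory, service name, and `main` branch are hypothetical, and the fast-forward-only merge is one possible way to keep production from diverging from Git, not a method this document prescribes. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def promotion_commands(app_dir: str, service: str, branch: str = "main") -> list[list[str]]:
    """Build the commands the prod side would run: fetch, fast-forward, restart."""
    return [
        ["git", "-C", app_dir, "fetch", "origin", branch],
        # --ff-only refuses to merge if prod has local commits or edits
        # that diverge from the remote branch.
        ["git", "-C", app_dir, "merge", "--ff-only", f"origin/{branch}"],
        ["sudo", "systemctl", "restart", service],
    ]


# Hypothetical app path and unit name, for illustration only.
for cmd in promotion_commands("/srv/apps/example-app", "example-app.service"):
    print(shlex.join(cmd))
```

Because every step is derived from Git state, the same commit always produces the same deployed code.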

# **Monitoring Strategy**

## Per-VM Monitoring

Install on each production VM:

* Prometheus+Grafana...logging layer (real-time dashboard)  
* No central aggregation required initially.  
* htop running inside a tmux session

Monitor:

* CPU  
* Memory  
* Load  
* Disk I/O  
* Network  
* Process count  
* Docker (if ever used)

## Future Option

If scaling increases:

* Add alerting layer (e.g., Uptime Kuma)  
* Add Posthog for more analysis 

# **Standardization Framework**

## Canonical Ubuntu Base

All VMs must:

* Run Ubuntu 24.04 LTS (Noble)  
* Use same apt repositories  
* Use same SSH configuration  
* Use same directory layout

## Directory Standard

* /srv/apps  
* <app-data-root>  
* /srv/backups  
* Consistent across all machines.
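
The directory standard can be applied idempotently as part of machine setup. This is a minimal sketch; it covers only `/srv/apps` and `/srv/backups`, because the app-data root path is not spelled out in this document. The `ensure_layout` helper is hypothetical, and the demo runs against a temporary directory instead of the real filesystem root.

```python
import tempfile
from pathlib import Path

# Directories named in the standard, relative to the filesystem root.
# The app-data root is omitted because its path is not given here.
STANDARD_DIRS = ("srv/apps", "srv/backups")


def ensure_layout(root: Path) -> list[Path]:
    """Ensure the standard directories exist under `root`; safe to re-run."""
    paths = []
    for rel in STANDARD_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


# Demo against a throwaway root so nothing on the real system is touched.
with tempfile.TemporaryDirectory() as tmp:
    for p in ensure_layout(Path(tmp)):
        print(p.relative_to(tmp), p.is_dir())
```

On a real machine the call would be `ensure_layout(Path("/"))`, run with sufficient privileges.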

## ProxMox

Installed on multi-core machines to create and manage VMs

See Setup instructions and notes here: "C:\projects\infra\software\proxmox-vm-setup\ProxMox-Setup.md"

* Proxmox VE  
* SSH (key only)  
* Tailscale  
* Proxmox firewall

## Bootstrap Script

Create a reusable bootstrap.sh

Installs:

* Baseline packages  
* Prometheus+Grafana...logging layer (prod only)  
* Tailscale  
* Directory structure  
* User setup

Ensures reproducibility.

## Containerization Philosophy

Default:

* No Docker.

* Native systemd services.

Use Docker only when:

* Multiple services per VM required.

* Isolation becomes necessary.

* CI/CD complexity grows.

# **Backup SOP**

## Router

Back up all configs and download them; note the router version number as well. Save to <primary-nas> and restore to the backup router as well. See C:\projects\infra\hardware\routers for specific instructions.

## Workstation/Control Planes

Google Drive is synced in real time to OneDrive via BackupHQ

Back up WSL Ubuntu with new software changes/updates as tar to <primary-nas>. Manual tar → copy to

See  /path/to/infra\\edge-<primary-workstation>\\README.md

## Proxmox

Backup config files only.  See [https://chat.example.invalid/private-reference](https://chat.example.invalid/private-reference) Caddy Installation Guide towards end

## VMs

* Proxmox UI\>VM\>Backup : Manual, see scripts file  
* Snapshots via proxmox UI \- Manual, before server updates  
* VM specific: from wsl workstation:   see scripts file  
* Automated backup daily 2 am:  see [https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)   
  * Backup lod dev, lod prod, community-indx-earth to <primary-nas> promox-backup  
  * Real time sync <primary-nas> to <secondary-nas>; real time sync nas to expansion drives

## <primary-nas>

* Real-time backup to <secondary-nas> and external drives

## Offsite

* Backups are copied to portable external drives. If the location will be left unattended, the last person to leave takes the external drive with them.

## SSH

* SSH key backup location: <primary-nas>

## Gitea/projects

# **Disaster Recovery Definition**

Failure Scenarios

* Dell dies → travel laptop takes over

* AVA dies → clone server takes over

* <primary-nas> dies → <secondary-nas> restore

* GPU node dies â†’ inference unavailable but no data loss

# **Environment Variable Policy**

* .env policy: never committed to Git; enforce via gitignore

* Secrets store on workstation/control plane: WSL, \~/.secrets/

* Backup and store on <primary-nas>

* Dev:  .env loaded by app

* Production: apps receive secrets via a systemd environment file: EnvironmentFile=<app-install-path>

* .env never world-readable

* File permissions set to 600

* AI agents must never read production.env
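
The "permissions set to 600" rule in the list above can be verified with a small check. This is a minimal sketch, assuming a POSIX filesystem; the helper name and the sample file are hypothetical, and the demo uses a temporary directory so no real secrets are touched.

```python
import os
import stat
import tempfile


def env_file_is_private(path: str) -> bool:
    """Return True if the file mode is exactly 600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


# Demo with a throwaway file; the key name and value are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w") as fh:
        fh.write("EXAMPLE_KEY=placeholder\n")

    os.chmod(env_path, 0o644)  # world-readable: violates the policy
    print(env_file_is_private(env_path))

    os.chmod(env_path, 0o600)  # owner-only: matches the policy
    print(env_file_is_private(env_path))
```

A check like this could run in a bootstrap or deploy step and abort if any `.env` file is readable by other users.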

# **Logging Strategy**

* journald persistent storage enabled  
* SystemMaxUse=500M (or similar)  
* logrotate enabled for app logs  
* Application logs stored in: <app-data-root>/logs/appname

# **VM Environments**

https://docs.example.invalid/private-reference

# **API Keys**

* Stored in password vault/manager  
* Rotate keys if a workstation is compromised

# **Configurations**

## Tailscale

<private-ip> <primary-workstation> (<primary-workstation>)  <user>@  windows

<private-ip> caddy-router 

 ssh \-i \~/.ssh/edge\_control\_plane <ssh-user>@<private-ip>

## SSH

### *Windows Native SSH (PowerShell / Windows Terminal)*

main workstation:  <windows-user-profile>\\.ssh backed up to NAS/Archive

In WSL,  <windows-wsl-user-dir>/.ssh/

Use the above when running SSH directly from Windows, using FTP, PuTTY, etc. 

### *WSL Ubuntu SSH (Edge Control Plane)*

Main workstation:  <linux-user-home>/.ssh/  backed up to NAS/Archive

Key created with Google Ed password to encrypt the backups using 7-zip

Use the above when running SSH from inside WSL, managing Proxmox VMs, or acting as the edge control plane

\~/.ssh/edge\_control\_plane

\~/.ssh/edge\_control\_plane.pub

## Git

Edge-control-plane:  git repository for workstations/control plane

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference) 

Edge-pre01: git repository for pre01 server running proxmox and VMs

## VM Template

* VM 9000 → ubuntu-2404-base  
* Status → Converted to template  
* 32GB disk  
* 2GB RAM  
* Clean state

## Proxmox Setup

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference) 

## VM Template Setup

Use this document

## VM Caddy-router setup

C:\projects\infra\software\caddy-router

## Caddy SOP

C:\projects\infra\software\caddy-router

## VM Setup Using Template Process

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference) 

## Install Discourse on ProxMox VM

"/path/to/infra\\edge-community-indx-earth\\discourse-export-topics-csv\\docs\\Discourse Install on ProxMox VM.md"

## Ops Monitor Setup And SOP

[https://docs.example.invalid/private-reference](https://docs.example.invalid/private-reference)