edge-data-center-main-OPEN/core-docs-for-app-portability-across-primary-standby-machines.md at c5bb9512816bc0ed9a2fcd16e7e09e07d05b508c

adminprojects/edge-data-center-main-OPEN

eddiesoehnel e36b87addf added

2026-06-06 16:36:52 -06:00

6.1 KiB


Those four documents are essentially the minimum viable operational memory for an application.

They are what prevent:

“How did I set this up again?”
“What breaks if this VM dies?”
“How do I rebuild this?”
“What exactly do I back up?”
“How do I restore fast?”

This becomes critically important in your architecture because:

you are modular
you are self-hosted
you are intentionally avoiding giant SaaS abstractions
you want rebuildability
you want warm failover
you want ephemeral dev environments

Without operational docs, infrastructure slowly becomes tribal knowledge trapped in your head.

That does not scale even for one person over time.

The Four Core Docs

Think of them as:

| Document | Purpose |
| --- | --- |
| setup.md | How to build the app from scratch |
| deploy.md | How code moves into production |
| backup.md | What must be preserved |
| restore.md | How to recover from disaster |
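As a small illustration, the four files in the table can be stubbed out in a repo with a helper like the one below. This is a minimal sketch rather than a prescribed layout: the function name `scaffold_docs`, the `docs/` location under the app directory, and the placeholder text are all assumptions made for the example.

```shell
# scaffold_docs APP_DIR
# Create APP_DIR/docs with a stub for each of the four core docs.
# Existing files are left untouched, so the helper is safe to re-run.
# (Hypothetical helper; names and stub text are illustrative.)
scaffold_docs() {
  local app_dir="$1" doc path
  mkdir -p "$app_dir/docs" || return 1
  for doc in setup deploy backup restore; do
    path="$app_dir/docs/$doc.md"
    if [ ! -e "$path" ]; then
      printf '%s\n\nTODO: fill in.\n' "$doc.md" > "$path" || return 1
    fi
  done
}
```

Calling `scaffold_docs /lod` would create `/lod/docs/setup.md` and its three siblings if they do not exist yet.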

1. /docs/setup.md

This is:

“How do I create this app/server from zero?”

If the VM vanished tomorrow:

how do you rebuild it?

This doc should assume:

blank Ubuntu install
no memory
no assumptions

What Goes Inside

Purpose of the app

Example:

LOD API backend for customer management system.
Runs FastAPI with PostgreSQL backend.

VM specs

Example:

Ubuntu 24.04
2 CPU
4GB RAM
50GB disk

Required software

Example:

Python 3.12
PostgreSQL 16
Nginx
Git

Install steps

Example:

sudo apt update
sudo apt install python3.12 python3-venv git

Repo cloning

git clone :yourorg/lod-api.git

Environment variables

Example:

DATABASE_URL=
API_KEY=
SECRET_KEY=

Never store the secret values themselves in Git. Document only the variable names.
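One way to make the documented variable list useful at runtime is a small pre-start check. The sketch below assumes the variables are already exported into the current shell (for example from a `.env` file kept outside Git); the function name `check_required_env` is hypothetical, and it reports only which names are missing, never their values.

```shell
# check_required_env NAME...
# Return non-zero if any named variable is unset or empty, listing the
# missing names on stderr. Values are never printed.
# (Hypothetical helper; pass only trusted variable names, since eval is used.)
check_required_env() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A start script could then run `check_required_env DATABASE_URL API_KEY SECRET_KEY || exit 1` before launching the app.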

Directory structure

Example:

/lod

systemd service

Example:

/etc/systemd/system/lod-api.service
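For illustration only, a service file at that path might resemble the sketch below, wrapped in a shell function so it can be reviewed before being written to disk. Every value inside it is an assumption rather than something the setup above confirms: the `lod` user, the `/lod/.venv` virtualenv, the `main:app` module name, running the app with uvicorn, and the bind address. Port 8000 is borrowed from the reverse proxy example in this document.

```shell
# print_unit
# Print a sketch of a systemd unit for the API to stdout.
# (Hypothetical helper; all paths, names and the bind address are assumptions.)
print_unit() {
  cat <<'EOF'
# Illustrative unit file; adjust user, paths and module name to the real app.
[Unit]
Description=LOD API (FastAPI)
After=network.target

[Service]
User=lod
WorkingDirectory=/lod
EnvironmentFile=/lod/.env
ExecStart=/lod/.venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}
```

It could be installed with `print_unit | sudo tee /etc/systemd/system/lod-api.service`, followed by `sudo systemctl daemon-reload` and `sudo systemctl enable --now lod-api`. If the reverse proxy runs on the same host, binding to `127.0.0.1` instead would be the more conservative choice.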

And include:

full service file
restart instructions

Reverse proxy config

Example:

Caddy route: lod.example.com -> :8000

Validation checklist

Example:

API reachable
DB connected
Logs functional
Backups running

Why setup.md Is Critical

Because eventually:

you WILL forget details
Ubuntu versions WILL change
dependencies WILL drift
a VM WILL die
you WILL rebuild something after months

This document becomes:

your infrastructure memory
your reproducibility layer

2. /docs/deploy.md

This is:

“How do changes safely move to production?”

This is operational workflow.

What Goes Inside

Branch strategy

Example:

main = production
dev = active development

Deployment flow

Example:

Dev VM -> Git push -> Production git pull

Production deployment steps

Example:

cd
git pull
sudo systemctl restart lod-api

Pre-deploy checklist

Example:

DB migrations tested
API endpoints verified
Backups confirmed

Rollback process

CRITICAL.

Example:

git checkout previous-tag
sudo systemctl restart lod-api

Version tagging

Example:

git tag v0.4.2
git push origin --tags

Downtime expectations

Example:

Expected restart interruption: 5-10 seconds

Why deploy.md Matters

Because deployment failures are where most operational stress happens.
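Part of that stress can be removed by scripting the deployment steps and the rollback process together, so a failed health check triggers the rollback automatically. The following is a minimal sketch under stated assumptions: the function name, its argument order, and the health URL are hypothetical, and the rollback target is whatever ref (such as the previous tag) the caller passes in. The curl retries roughly cover the restart interruption mentioned under downtime expectations.

```shell
# deploy_with_rollback APP_DIR SERVICE ROLLBACK_REF HEALTH_URL
# Pull, restart, then probe HEALTH_URL. If the probe fails, check out
# ROLLBACK_REF (for example the previous tag), restart again, and return 1.
# (Hypothetical helper; the health URL and argument order are assumptions.)
deploy_with_rollback() {
  local app_dir="$1" service="$2" rollback_ref="$3" health_url="$4"

  git -C "$app_dir" pull --ff-only || return 1
  sudo systemctl restart "$service" || return 1

  # Retry for a few seconds so a slow service start is not treated as a failure.
  if curl --fail --silent --max-time 5 \
          --retry 5 --retry-delay 2 --retry-connrefused \
          "$health_url" >/dev/null; then
    echo "deploy ok"
    return 0
  fi

  echo "health check failed, rolling back to $rollback_ref" >&2
  git -C "$app_dir" checkout "$rollback_ref" || return 1
  sudo systemctl restart "$service"
  return 1
}
```

A call might look like `deploy_with_rollback /lod lod-api v0.4.1 http://127.0.0.1:8000/`, where the path, tag, and URL are placeholders. The function returns non-zero whenever a rollback happened, so a wrapper script can alert on it.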

This doc prevents:

forgotten steps
risky deployments
panic during rollback
“what changed?”

3. /docs/backup.md

This is:

“What data matters and how is it protected?”

Many people back up the wrong things.

You need to know:

what is replaceable
what is irreplaceable

What Goes Inside

What needs backup

Example:

PostgreSQL database
Uploaded files
.env file
SSL certs

NOT:

node_modules
Python cache
temporary containers

Backup frequency

Example:

Database:

nightly full dump
hourly WAL archive

Backup locations

Example:

Primary NAS
Secondary NAS
Offsite encrypted copy

Retention policy

Example:

Daily: 14 days
Weekly: 8 weeks
Monthly: 12 months

Backup commands

Example:

pg_dump lod >

Validation process

VERY important.

Example:

Monthly restore test required.

Backups that are never tested are fake backups.
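The nightly dump and the daily retention window described above can be sketched as two small functions. This is only an illustration: the function names, the timestamped file naming, and the backup directory are assumptions, and it covers the plain `pg_dump` case only, not WAL archiving or the weekly and monthly tiers.

```shell
# backup_db DB_NAME BACKUP_DIR
# Write a timestamped plain-SQL dump and print its path.
# A failed dump is removed so it cannot be mistaken for a good one.
# (Hypothetical helper; file naming and directory layout are assumptions.)
backup_db() {
  local db="$1" dir="$2" out
  mkdir -p "$dir" || return 1
  out="$dir/$db-$(date +%Y%m%d-%H%M%S).sql"
  pg_dump "$db" > "$out" || { rm -f "$out"; return 1; }
  echo "$out"
}

# prune_daily BACKUP_DIR DAYS
# Delete .sql dumps in BACKUP_DIR that are more than DAYS full days old.
prune_daily() {
  find "$1" -maxdepth 1 -type f -name '*.sql' -mtime +"$2" -delete
}
```

A nightly job could run `backup_db lod /backups/lod && prune_daily /backups/lod 14`, with `/backups/lod` standing in for wherever the primary NAS is mounted.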

Why backup.md Matters

Because during a crisis, you do not want to THINK.

You want:

exact commands
exact locations
exact priorities

4. /docs/restore.md

This is the most important doc of all.

This is:

“The server is dead. Now what?”

This document should let:

future-you
tired-you
stressed-you

restore service rapidly.

What Goes Inside

Failure scenarios

Example:

VM corruption
accidental deletion
disk failure
ransomware
bad deployment

Recovery priority

Example:

Restore database
Restore uploads
Restore API service
Re-enable proxy routing

Restore procedure

Example:

createdb lod
psql lod <

DNS / routing changes

Example:

Update the Caddy upstream IP if failover is activated.

Validation after restore

Example:

login works
uploads visible
API healthy
monitoring active

Estimated recovery time

Example:

Expected restore: 15-30 minutes

Why restore.md Is The Most Important

Because backups are useless without restore procedures.

Most organizations discover this too late.

You are designing toward:

rapid rebuildability
warm failover
infrastructure resilience

This document becomes foundational.
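To make the recovery priority concrete, the first three steps (database, uploads, API service) can be chained so that each one runs only if the previous one succeeded. The sketch below is an assumption-laden illustration: the function names, the uploads tarball, and the argument order are hypothetical, the dump is assumed to be plain SQL, and proxy routing is left as a manual step.

```shell
# restore_db DB_NAME DUMP_FILE
# Create the database and load a plain-SQL dump, stopping on the first error.
restore_db() {
  local db="$1" dump="$2"
  [ -r "$dump" ] || { echo "cannot read dump: $dump" >&2; return 1; }
  createdb "$db" || return 1
  psql -v ON_ERROR_STOP=1 -d "$db" -f "$dump"
}

# restore_all DB_NAME DUMP_FILE UPLOADS_TAR UPLOADS_DIR SERVICE
# Restore in priority order: database, then uploads, then the service.
# (Hypothetical helper; the uploads archive format is an assumption.)
restore_all() {
  local db="$1" dump="$2" uploads_tar="$3" uploads_dir="$4" service="$5"
  restore_db "$db" "$dump" || return 1
  mkdir -p "$uploads_dir" || return 1
  tar -xzf "$uploads_tar" -C "$uploads_dir" || return 1
  sudo systemctl start "$service"
}
```

Running the same function against a scratch VM is also a cheap way to carry out a periodic restore test.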

The Hidden Benefit

These docs also become:

AI-readable operational intelligence

This is extremely important for your future direction.

Your future agents can:

read setup docs
analyze deployment flows
validate configs
detect drift
automate rebuilds
generate infrastructure maps

Because your infrastructure knowledge becomes structured.

You are effectively building:

operational memory
machine-readable infrastructure cognition

What I Would Add For You

You may also eventually want:

| File | Purpose |
| --- | --- |
| architecture.md | High-level design and dependencies |
| network.md | Ports, DNS, routing, Tailscale |
| security.md | Auth, firewall, secrets handling |
| monitoring.md | Metrics/logging/alerts |
| dependencies.md | External systems and APIs |
| dr.md | Full disaster recovery strategy |

The Most Important Principle

These docs should allow you to:

Rebuild the app from scratch without relying on memory.

That is the gold standard.

If future-you can:

rebuild
restore
redeploy
fail over

using only the repo and docs,

then your infrastructure is becoming professionally mature and operationally resilient.
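A first, very small check toward that standard is confirming that a repo actually carries the four core docs. The sketch below assumes they live under `docs/` in the app directory; the function name `check_docs` is hypothetical, and it verifies only that each file exists and is non-empty, not that its contents are current.

```shell
# check_docs APP_DIR
# Report any of the four core docs that is missing or empty under APP_DIR/docs.
# Returns non-zero if at least one is missing.
# (Hypothetical helper; the docs/ location is an assumption.)
check_docs() {
  local app_dir="$1" doc status=0
  for doc in setup deploy backup restore; do
    if [ ! -s "$app_dir/docs/$doc.md" ]; then
      echo "missing or empty: docs/$doc.md"
      status=1
    fi
  done
  return "$status"
}
```

Run as `check_docs /lod` by hand, or from a pre-deploy step so that a repo without operational docs cannot ship unnoticed.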
