Homelab Specs


Hardware

Dell OptiPlex 7070

  • Role: kube-node-1 (control-plane + worker), bare metal
  • IP: 192.168.2.100
  • SSH: dan@192.168.2.100
  • CPU: Intel Core i5-9500, 6c/6t, 3.0 GHz base / 4.4 GHz boost, 9 MB L3, 65W TDP, VT-x
  • RAM: 16 GB DDR4 2666 MT/s DIMM
  • Storage:
    • nvme0: Samsung PM991 256 GB — 1G EFI, 2G /boot, 235.4G LVM (100G → /)
    • sda: Seagate Expansion 2 TB → /data/photos (ext4)
    • sdb: Seagate Expansion+ 2 TB → /mnt/sdb-ro (ext4, READ-ONLY — never touch)
    • sdc1: Seagate Expansion 1 TB → /data/media (ext4)
    • sdc2: Seagate Expansion 788 GB → /data/games (ext4)
    • sdd: Samsung HD103SI 1 TB → /data/owncloud (ext4)
    • sde: Hitachi HTS545050 500 GB → /data/infra (ext4)
    • sdf: Seagate 1 TB → /data/ai (ext4)
    • Total: ~7 TB
  • Network: 1 Gbit/s
  • NFS server: exports /data/{games,media,photos,owncloud,infra,ai} to LAN
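That export list maps to /etc/exports entries of roughly this shape (a sketch; the actual options used on kube-node-1 may differ):

/data/games   192.168.2.0/24(rw,sync,no_subtree_check)
/data/media   192.168.2.0/24(rw,sync,no_subtree_check)
# ...one line each for /data/photos, /data/owncloud, /data/infra, /data/ai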

HP ProLiant DL360 G7

  • Role: Proxmox hypervisor (192.168.2.193)
  • SSH: root@192.168.2.193 (local id_rsa)
  • Web UI: https://proxmox.vandachevici.ro
  • Storage:
    • 2× HPE SAS 900 GB in RAID 1+0 → 900 GB usable (Proxmox OS)
    • 4× HPE SAS 900 GB in RAID 1+0 → 1.8 TB usable (VM disks)
    • Promise VTrak J830s: 2× 16 TB → media-pool (ZFS, ~14 TB usable)
    • Total: ~18 TB

Promise VTrak J830s

  • Connected to HP ProLiant via SAS
  • 2× 16 TB disks, ZFS pool media-pool
  • ZFS datasets mounted at /data/X on HP (matching Dell paths)
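Those mountpoints are set via the ZFS mountpoint property; for example (dataset names per the table in Storage Layout below):

zfs set mountpoint=/data/media media-pool/jellyfin
zfs get mountpoint media-pool/jellyfin   # verify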

Storage Layout

Dell /data drives (primary/local)

| Mount | Device | Size | Contents |
|---|---|---|---|
| /data/games | sdc2 | 788 GB | Game server worlds and kits |
| /data/media | sdc1 | 1.1 TB | Jellyfin media library |
| /data/photos | sda | 916 GB | Immich photo library |
| /data/owncloud | sdd | 916 GB | OwnCloud files |
| /data/infra | sde | 458 GB | Prometheus, infra data |
| /data/ai | sdf | 916 GB | Paperclip, Ollama models |
| /mnt/sdb-ro | sdb | 1.8 TB | READ-ONLY archive; never modify |

HP VTrak ZFS datasets (HA mirrors)

| ZFS Dataset | Mountpoint on HP / NFS export |
|---|---|
| media-pool/jellyfin | /data/media |
| media-pool/immich | /data/photos |
| media-pool/owncloud | /data/owncloud |
| media-pool/games | /data/games |
| media-pool/minecraft | /data/games/minecraft |
| media-pool/factorio | /data/games/factorio |
| media-pool/openttd | /data/games/openttd |
| media-pool/infra | /data/infra |
| media-pool/ai | /data/ai |

Legacy bind mounts at /media-pool/X/data/X preserved for K8s PV compatibility.
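In /etc/fstab terms, one such bind mount looks roughly like this (X is the placeholder used above; verify the actual legacy paths on the HP host before relying on this):

/data/X  /media-pool/X/data/X  none  bind  0 0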

Cross-mounts (HA access)

| From | Mount point | To |
|---|---|---|
| Dell | /mnt/hp/data-{games,media,photos,owncloud,infra,ai} | HP VTrak NFS |
| HP | /mnt/dell/data-{games,media,photos,owncloud,infra,ai} | Dell NFS |
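These are ordinary NFS client mounts; fstab sketches, with export paths assumed from the layout above:

# On the Dell, mounting the HP copy of /data/media:
192.168.2.193:/data/media  /mnt/hp/data-media    nfs  defaults,_netdev  0 0
# On the HP, mounting the Dell copy:
192.168.2.100:/data/media  /mnt/dell/data-media  nfs  defaults,_netdev  0 0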

VMs on HP ProLiant (Proxmox)

| VM ID | Name | IP | RAM | Role |
|---|---|---|---|---|
| 100 | kube-node-2 | 192.168.2.195 | 16 GB | K8s worker |
| 101 | kube-node-3 | 192.168.2.196 | 16 GB | K8s control-plane + worker |
| 103 | kube-arbiter | 192.168.2.200 | 6 GB | K8s control-plane (etcd + API server, NoSchedule) |
| 104 | local-ai | 192.168.2.88 | | Ollama + openclaw-gateway (Tesla P4 GPU passthrough) |
| 106 | ansible-control | 192.168.2.70 | | Ansible control node |
| 107 | remote-ai | 192.168.2.91 | | openclaw-gateway (remote, cloud AI) |

⚠️ kube-node-2, kube-node-3, and kube-arbiter are all VMs on the HP ProLiant. HP ProLiant failure = loss of 3/4 K8s nodes simultaneously. Mitigation: add a Raspberry Pi 4/5 (8 GB) as a 4th physical host.

SSH: dan@<ip> for all VMs


Kubernetes Cluster

  • Version: 1.32.13
  • CNI: Flannel
  • Dashboard: https://192.168.2.100:30443 (self-signed cert, token auth; see the token sketch after this list)
  • Token file: /home/dan/homelab/kube/cluster/DASHBOARD-ACCESS.txt
  • StorageClass: local-storage (hostPath on kube-node-1)
  • NFS provisioners: nfs-provisioners namespace (nfs-subdir-external-provisioner)
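The dashboard above uses token auth; a fresh token can be minted with kubectl create token (the service-account name here is an assumption, and the token already on disk is in DASHBOARD-ACCESS.txt):

kubectl -n kubernetes-dashboard create token admin-user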

Nodes

| Node | Role | IP | Host |
|---|---|---|---|
| kube-node-1 | control-plane + worker | 192.168.2.100 | Dell OptiPlex 7070 (bare metal) |
| kube-node-2 | worker | 192.168.2.195 | VM on HP ProLiant (16 GB RAM) |
| kube-node-3 | control-plane + worker | 192.168.2.196 | VM on HP ProLiant (16 GB RAM) |
| kube-arbiter | control-plane | 192.168.2.200 | VM on HP ProLiant (1c/6 GB, tainted NoSchedule) |

etcd: 3 members (kube-node-1 + kube-arbiter + kube-node-3); quorum survives one member failure.

controlPlaneEndpoint: 192.168.2.100:6443 ⚠️ SPOF: kube-vip (Phase 1b) is not yet deployed, so if kube-node-1 goes down, workers lose API access even though the kube-arbiter and kube-node-3 API servers are still running.
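Member health can be checked with etcdctl inside the static etcd pod; a sketch assuming kubeadm's default stacked-etcd certificate paths:

kubectl -n kube-system exec etcd-kube-node-1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster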


High Availability Status

Control Plane

| Component | Status | Notes |
|---|---|---|
| etcd | 3 members | kube-node-1 + kube-arbiter + kube-node-3; tolerates 1 failure |
| API server VIP | ⚠️ Not yet deployed | controlPlaneEndpoint hardcoded to 192.168.2.100; kube-vip (Phase 1b) pending |
| CoreDNS | Required anti-affinity | Pods spread across different nodes (kube-node-1 + kube-node-2) |

Workloads (replicas=2, required pod anti-affinity)

| Service | Replicas | PDB |
|---|---|---|
| authentik-server | 2 | |
| authentik-worker | 2 | |
| cert-manager | 2 | |
| cert-manager-webhook | 2 | |
| cert-manager-cainjector | 2 | |
| parts-api | 2 | |
| parts-ui | 2 | |
| ha-sync-ui | 2 | |
| games-console-backend | 2 | |
| games-console-ui | 2 | |
| ingress-nginx | DaemonSet (runs on all workers) | |
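A quick spot-check that replicas are actually spread across nodes and the PDBs exist:

kubectl get pods -A -o wide | grep -E 'authentik|cert-manager|parts-|games-console|ha-sync'
kubectl get pdb -A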

Storage

| PV | Type | Notes |
|---|---|---|
| paperclip-data-pv | NFS (192.168.2.252) | Migrated from hostPath; can schedule on any node |
| prometheus-storage-pv | hostPath on kube-node-1 | ⚠️ Still pinned to kube-node-1 (out of scope) |

Known Remaining SPOFs

| Risk | Description | Mitigation |
|---|---|---|
| HP ProLiant physical host | kube-node-2/3 + kube-arbiter are all HP VMs | Add Raspberry Pi 4/5 (8 GB) as 4th physical host |
| controlPlaneEndpoint | Hardcoded to kube-node-1 IP | Deploy kube-vip with VIP (e.g. 192.168.2.50) |
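For the kube-vip fix, upstream's image can generate the static-pod manifest itself; a hedged sketch (interface name and image tag are assumptions; check ip link on the node and pin a released version):

docker run --rm --network host ghcr.io/kube-vip/kube-vip:latest manifest pod \
  --interface eno1 --address 192.168.2.50 \
  --controlplane --arp --leaderElection \
  | sudo tee /etc/kubernetes/manifests/kube-vip.yaml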

games

| Service | NodePort | Storage |
|---|---|---|
| minecraft-home | 31112 | HP NFS /data/games/minecraft |
| minecraft-cheats | 31111 | HP NFS /data/games/minecraft |
| minecraft-creative | 31559 | HP NFS /data/games/minecraft |
| minecraft-johannes | 31563 | HP NFS /data/games/minecraft |
| minecraft-noah | 31560 | HP NFS /data/games/minecraft |
| Factorio | | HP NFS /data/games/factorio |
| OpenTTD | | HP NFS /data/games/openttd |

Minecraft operators: LadyGisela5, tomgates24, anutzalizuk, toranaga_samma

monitoring

  • Helm release: obs, chart prometheus-community/kube-prometheus-stack
  • Values file: /home/dan/homelab/deployment/helm/prometheus/prometheus-helm-values.yaml
  • Components: Prometheus, Grafana, AlertManager, Node Exporter, Kube State Metrics
  • Grafana: NodePort 31473 → http://192.168.2.100:31473
  • Storage: 100 Gi hostPath PV at /data/infra/prometheus on kube-node-1

infrastructure

  • General MySQL/MariaDB (StatefulSet) — HP NFS /media-pool/general-db
  • Speedtest Tracker — HP NFS /media-pool/speedtest
  • DNS updater (DaemonSet, tunix/digitalocean-dyndns) — updates DigitalOcean DNS
  • Proxmox ingress → 192.168.2.193:8006

storage

  • OwnCloud (owncloud/server:10.12) — drive.vandachevici.ro, admin: sefu
    • MariaDB (StatefulSet), Redis (Deployment), OwnCloud server (2 replicas)
    • Storage: HP NFS /data/owncloud

media

  • Jellyfin — media.vandachevici.ro, storage: HP NFS /data/media
  • Immich — photos.vandachevici.ro, storage: HP NFS /data/photos
    • Components: server (2 replicas), ML (2 replicas), valkey, postgresql

iot

  • IoT MySQL (StatefulSet, db: iot_db)
  • IoT API (iot-api:latest, NodePort 30800) — requires topology.homelab/server: dell label
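topology.homelab/server is a custom label; it would have been applied to the Dell node with something like:

kubectl label node kube-node-1 topology.homelab/server=dell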

ai

  • Paperclip — paperclip.vandachevici.ro
    • Embedded PostgreSQL at /data/ai/paperclip/instances/default/db
    • Config: /data/ai/paperclip/instances/default/config.json
    • NFS PV via keepalived VIP 192.168.2.252:/data/ai/paperclip (can schedule on any node)
    • Env: PAPERCLIP_AGENT_JWT_SECRET (in K8s secret)
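A sketch for setting or rotating that secret (the secret name is an assumption; the ai namespace matches deployment/ai/):

kubectl -n ai create secret generic paperclip-secrets \
  --from-literal=PAPERCLIP_AGENT_JWT_SECRET="$(openssl rand -hex 32)" \
  --dry-run=client -o yaml | kubectl apply -f -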

AI / OpenClaw

local-ai VM (192.168.2.88) — GPU instance

  • GPU: NVIDIA Tesla P4, 8 GB VRAM (PCIe passthrough from Proxmox)
    • VFIO: /etc/modprobe.d/vfio.conf ids=10de:1bb3, allow_unsafe_interrupts=1
    • initramfs updated for persistence
  • Ollama: listening on 0.0.0.0:11434, models at /data/ollama/models
    • Loaded: qwen3:8b (5.2 GB)
  • openclaw-gateway: ws://0.0.0.0:18789, auth mode: token
    • Token: in ~/.openclaw/openclaw.json, key gateway.auth.token
    • Systemd: openclaw-gateway.service (Type=simple, enabled)
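Quick health checks for this VM (Ollama's /api/tags is its standard model-list endpoint):

curl -s http://192.168.2.88:11434/api/tags   # from the LAN: list Ollama models
systemctl status openclaw-gateway            # on the VM: gateway service state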

remote-ai VM (192.168.2.91)

  • openclaw-gateway: installed (v2026.3.13), config at ~/.openclaw/openclaw.json
  • Uses cloud AI providers (Claude API key required)

Connecting Paperclip to openclaw

  • URL: ws://192.168.2.88:18789/
  • Auth: token from ~/.openclaw/openclaw.json, key gateway.auth.token
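With jq installed, the token can be pulled straight out of the config:

jq -r '.gateway.auth.token' ~/.openclaw/openclaw.json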

Network Endpoints

| Service | URL / Address |
|---|---|
| K8s Dashboard | https://192.168.2.100:30443 |
| Proxmox UI | https://proxmox.vandachevici.ro |
| Grafana | http://192.168.2.100:31473 |
| Jellyfin | https://media.vandachevici.ro |
| Immich (photos) | https://photos.vandachevici.ro |
| OwnCloud | https://drive.vandachevici.ro |
| Paperclip | https://paperclip.vandachevici.ro |
| IoT API | http://192.168.2.100:30800 |
| minecraft-home | 192.168.2.100:31112 |
| minecraft-cheats | 192.168.2.100:31111 |
| minecraft-creative | 192.168.2.100:31559 |
| minecraft-johannes | 192.168.2.100:31563 |
| minecraft-noah | 192.168.2.100:31560 |
| Ollama (local-ai) | http://192.168.2.88:11434 |
| openclaw gateway (local-ai) | ws://192.168.2.88:18789 |
| Ollama (Dell) | http://192.168.2.100:11434 |

DNS subdomains managed (DigitalOcean)

photos, backup, media, chat, openttd, excalidraw, prv, drive, grafana, paperclip, proxmox


Common Operations

Apply manifests

kubectl apply -f /home/dan/homelab/deployment/<namespace>/

Prometheus (Helm)

helm upgrade obs prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f /home/dan/homelab/deployment/helm/prometheus/prometheus-helm-values.yaml

NFS provisioners (Helm)

# Example: jellyfin
helm upgrade nfs-jellyfin nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  -n nfs-provisioners \
  -f /home/dan/homelab/deployment/helm/nfs-provisioners/values-jellyfin.yaml

Troubleshooting: Flannel CNI after reboot

If all pods are stuck in ContainerCreating after a reboot:

# 1. Check default route exists on kube-node-1
ip route show | grep default
# Fix: sudo ip route add default via 192.168.2.1 dev eno1
# Persist: check /etc/netplan/00-installer-config.yaml has routes section

# 2. Restart flannel pod on node-1
kubectl delete pod -n kube-flannel -l app=flannel --field-selector spec.nodeName=kube-node-1
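# 3. Generic follow-up: verify pods on the node progress past ContainerCreating
kubectl get pods -A -o wide --field-selector spec.nodeName=kube-node-1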

Troubleshooting: kube-node-3 NotReady after reboot

The likely cause is that swap was re-enabled; fix with:

ssh dan@192.168.2.196 "sudo swapoff -a && sudo sed -i 's|^/swap.img|#/swap.img|' /etc/fstab && sudo systemctl restart kubelet"

Workspace Structure

/home/dan/homelab/
├── HOMELAB.md              — this file
├── plan.md                 — original rebuild plan
├── step-by-step.md         — execution tracker
├── deployment/             — K8s manifests and Helm values
│   ├── 00-namespaces.yaml
│   ├── ai/                 — Paperclip
│   ├── default/            — DNS updater
│   ├── games/              — Minecraft, Factorio, OpenTTD
│   ├── helm/               — Helm values (prometheus, nfs-provisioners)
│   ├── infrastructure/     — ingress-nginx, cert-manager, general-db, speedtest, proxmox-ingress
│   ├── iot/                — IoT DB + API
│   ├── media/              — Jellyfin, Immich
│   ├── monitoring/         — (managed by Helm)
│   └── storage/            — OwnCloud
├── backups/                — K8s secrets backup (gitignored)
├── hardware/               — hardware spec docs
├── orchestration/
│   └── ansible/            — playbooks, inventory, group_vars, cloud-init
└── services/
    └── device-inventory/   — C++ CMake project: network device discovery