homelab/HOMELAB.md
Dan V deb6c38d7b chore: commit homelab setup — deployment, services, orchestration, skill
- Add .gitignore: exclude compiled binaries, build artifacts, and Helm
  values files containing real secrets (authentik, prometheus)
- Add all Kubernetes deployment manifests (deployment/)
- Add services source code: ha-sync, device-inventory, games-console,
  paperclip, parts-inventory
- Add Ansible orchestration: playbooks, roles, inventory, cloud-init
- Add hardware specs, execution plans, scripts, HOMELAB.md
- Add skills/homelab/SKILL.md + skills/install.sh to preserve Copilot skill
- Remove previously-tracked inventory-cli binary from git index

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 08:10:32 +02:00

# Homelab Specs
---
## Hardware
### Dell OptiPlex 7070
- **Role**: kube-node-1 (control-plane + worker), bare metal
- **IP**: 192.168.2.100
- **SSH**: `dan@192.168.2.100`
- **CPU**: Intel Core i5-9500, 6c/6t, 3.0 GHz base / 4.4 GHz boost, 9 MB L3, 65W TDP, VT-x
- **RAM**: 16 GB DDR4 2666 MT/s DIMM
- **Storage**:
  - `nvme0`: Samsung PM991 256 GB — 1G EFI, 2G /boot, 235.4G LVM (100G → /)
  - `sda`: Seagate Expansion 2 TB → `/data/photos` (ext4)
  - `sdb`: Seagate Expansion+ 2 TB → `/mnt/sdb-ro` (ext4, **READ-ONLY — never touch**)
  - `sdc1`: Seagate Expansion 1 TB → `/data/media` (ext4)
  - `sdc2`: Seagate Expansion 788 GB → `/data/games` (ext4)
  - `sdd`: Samsung HD103SI 1 TB → `/data/owncloud` (ext4)
  - `sde`: Hitachi HTS545050 500 GB → `/data/infra` (ext4)
  - `sdf`: Seagate 1 TB → `/data/ai` (ext4)
- **Total**: ~7 TB
- **Network**: 1 Gbit/s
- **NFS server**: exports `/data/{games,media,photos,owncloud,infra,ai}` to LAN
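The export list above corresponds to entries in `/etc/exports` on the Dell. A sketch — the export options shown are assumptions, verify against the real file:

```
# /etc/exports on kube-node-1 — options are assumed, check the live file
/data/games    192.168.2.0/24(rw,sync,no_subtree_check)
/data/media    192.168.2.0/24(rw,sync,no_subtree_check)
/data/photos   192.168.2.0/24(rw,sync,no_subtree_check)
/data/owncloud 192.168.2.0/24(rw,sync,no_subtree_check)
/data/infra    192.168.2.0/24(rw,sync,no_subtree_check)
/data/ai       192.168.2.0/24(rw,sync,no_subtree_check)
```

After editing, `sudo exportfs -ra` reloads the export table, and `showmount -e 192.168.2.100` confirms the exports from any LAN client.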
### HP ProLiant DL360 G7
- **Role**: Proxmox hypervisor (192.168.2.193)
- **SSH**: `root@192.168.2.193` (local id_rsa)
- **Web UI**: https://proxmox.vandachevici.ro
- **Storage**:
  - 2× HPE SAS 900 GB in RAID 1+0 → 900 GB usable (Proxmox OS)
  - 4× HPE SAS 900 GB in RAID 1+0 → 1.8 TB usable (VM disks)
  - Promise VTrak J830s: 2× 16 TB → `media-pool` (ZFS, ~14 TB usable)
- **Total**: ~18 TB
### Promise VTrak J830s
- Connected to HP ProLiant via SAS
- 2× 16 TB disks, ZFS pool `media-pool`
- ZFS datasets mounted at `/data/X` on HP (matching Dell paths)
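The pool layout can be inspected from the HP side with the stock ZFS tooling (run as root on 192.168.2.193):

```bash
# list every media-pool dataset with its mountpoint
zfs list -o name,mountpoint -r media-pool
# check pool health and the two 16 TB members
zpool status media-pool
```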
---
## Storage Layout
### Dell `/data` drives (primary/local)
| Mount | Device | Size | Contents |
|---|---|---|---|
| `/data/games` | sdc2 | 788 GB | Game server worlds and kits |
| `/data/media` | sdc1 | 1.1 TB | Jellyfin media library |
| `/data/photos` | sda | 916 GB | Immich photo library |
| `/data/owncloud` | sdd | 916 GB | OwnCloud files |
| `/data/infra` | sde | 458 GB | Prometheus, infra data |
| `/data/ai` | sdf | 916 GB | Paperclip, Ollama models |
| `/mnt/sdb-ro` | sdb | 1.8 TB | **READ-ONLY** archive — never modify |
### HP VTrak ZFS datasets (HA mirrors)
| ZFS Dataset | Mountpoint on HP | NFS export |
|---|---|---|
| media-pool/jellyfin | `/data/media` | ✅ |
| media-pool/immich | `/data/photos` | ✅ |
| media-pool/owncloud | `/data/owncloud` | ✅ |
| media-pool/games | `/data/games` | ✅ |
| media-pool/minecraft | `/data/games/minecraft` | ✅ |
| media-pool/factorio | `/data/games/factorio` | ✅ |
| media-pool/openttd | `/data/games/openttd` | ✅ |
| media-pool/infra | `/data/infra` | ✅ |
| media-pool/ai | `/data/ai` | ✅ |
Legacy bind mounts from `/media-pool/X` to `/data/X` are preserved for K8s PV compatibility.
### Cross-mounts (HA access)
| From | Mount point | To |
|---|---|---|
| Dell | `/mnt/hp/data-{games,media,photos,owncloud,infra,ai}` | HP VTrak NFS |
| HP | `/mnt/dell/data-{games,media,photos,owncloud,infra,ai}` | Dell NFS |
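Each cross-mount maps to an `/etc/fstab` line. A sketch for one Dell-side mount — the mount options and the use of the HP's .193 address are assumptions (the lab also has a keepalived VIP at 192.168.2.252):

```
# Dell /etc/fstab — one HP cross-mount (options assumed)
192.168.2.193:/data/media  /mnt/hp/data-media  nfs  defaults,_netdev  0  0
```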
---
## VMs on HP ProLiant (Proxmox)
| VM ID | Name | IP | RAM | Role |
|---|---|---|---|---|
| 100 | kube-node-2 | 192.168.2.195 | 16 GB | K8s worker |
| 101 | kube-node-3 | 192.168.2.196 | 16 GB | K8s control-plane + worker |
| 103 | kube-arbiter | 192.168.2.200 | 6 GB | K8s control-plane (etcd + API server, NoSchedule) |
| 104 | local-ai | 192.168.2.88 | — | Ollama + openclaw-gateway (Tesla P4 GPU passthrough) |
| 106 | ansible-control | 192.168.2.70 | — | Ansible control node |
| 107 | remote-ai | 192.168.2.91 | — | openclaw-gateway (remote, cloud AI) |
⚠️ kube-node-2, kube-node-3, and kube-arbiter are all VMs on the HP ProLiant. HP ProLiant failure = loss of 3/4 K8s nodes simultaneously. Mitigation: add a Raspberry Pi 4/5 (8 GB) as a 4th physical host.
SSH: `dan@<ip>` for all VMs

---
## Kubernetes Cluster
- **Version**: 1.32.13
- **CNI**: Flannel
- **Dashboard**: https://192.168.2.100:30443 (self-signed cert, token auth)
- **Token file**: `/home/dan/homelab/kube/cluster/DASHBOARD-ACCESS.txt`
- **StorageClass**: `local-storage` (hostPath on kube-node-1)
- **NFS provisioners**: `nfs-provisioners` namespace (nfs-subdir-external-provisioner)
### Nodes
| Node | Role | IP | Host |
|---|---|---|---|
| kube-node-1 | control-plane + worker | 192.168.2.100 | Dell OptiPlex 7070 (bare metal) |
| kube-node-2 | worker | 192.168.2.195 | VM on HP ProLiant (16 GB RAM) |
| kube-node-3 | control-plane + worker | 192.168.2.196 | VM on HP ProLiant (16 GB RAM) |
| kube-arbiter | control-plane | 192.168.2.200 | VM on HP ProLiant (1c/6GB, tainted NoSchedule) |
**etcd**: 3 members (kube-node-1 + kube-arbiter + kube-node-3) — quorum survives 1 member failure ✅
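Membership can be double-checked from inside the etcd static pod on kube-node-1. The pod name and cert paths below are the usual kubeadm defaults — adjust if this cluster deviates:

```bash
# list etcd members via the static pod's own client certs
kubectl -n kube-system exec etcd-kube-node-1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
```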
**controlPlaneEndpoint**: `192.168.2.100:6443` ⚠️ SPOF — kube-vip (Phase 1b) not yet deployed; if kube-node-1 goes down, workers lose API access even though kube-arbiter and kube-node-3 API servers are still running

---
## High Availability Status
### Control Plane
| Component | Status | Notes |
|---|---|---|
| etcd | ✅ 3 members | kube-node-1 + kube-arbiter + kube-node-3; tolerates 1 failure |
| API server VIP | ⚠️ Not yet deployed | controlPlaneEndpoint hardcoded to 192.168.2.100; kube-vip (Phase 1b) pending |
| CoreDNS | ✅ Required anti-affinity | Pods spread across different nodes (kube-node-1 + kube-node-2) |
### Workloads (replicas=2, required pod anti-affinity)
| Service | Replicas | PDB |
|---|---|---|
| authentik-server | 2 | ✅ |
| authentik-worker | 2 | ✅ |
| cert-manager | 2 | ✅ |
| cert-manager-webhook | 2 | ✅ |
| cert-manager-cainjector | 2 | ✅ |
| parts-api | 2 | ✅ |
| parts-ui | 2 | ✅ |
| ha-sync-ui | 2 | ✅ |
| games-console-backend | 2 | ✅ |
| games-console-ui | 2 | ✅ |
| ingress-nginx | DaemonSet | ✅ (runs on all workers) |
### Storage
| PV | Type | Notes |
|---|---|---|
| paperclip-data-pv | NFS (192.168.2.252) | ✅ Migrated from hostPath; can schedule on any node |
| prometheus-storage-pv | hostPath on kube-node-1 | ⚠️ Still pinned to kube-node-1 (out of scope) |
### Known Remaining SPOFs
| Risk | Description | Mitigation |
|---|---|---|
| HP ProLiant physical host | kube-node-2/3 + kube-arbiter are all HP VMs | Add Raspberry Pi 4/5 (8 GB) as 4th physical host |
| controlPlaneEndpoint | Hardcoded to kube-node-1 IP | Deploy kube-vip with VIP (e.g. 192.168.2.50) |
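Until kube-vip is deployed, the hardcoded endpoint can be confirmed (and the fix later verified) with:

```bash
# endpoint recorded in kubeadm's cluster configuration
kubectl -n kube-system get configmap kubeadm-config \
  -o jsonpath='{.data.ClusterConfiguration}' | grep controlPlaneEndpoint
# endpoint each kubelet actually dials (run on any node)
grep server: /etc/kubernetes/kubelet.conf
```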
---
## Services by Namespace
### games
| Service | NodePort | Storage |
|---|---|---|
| minecraft-home | 31112 | HP NFS `/data/games/minecraft` |
| minecraft-cheats | 31111 | HP NFS `/data/games/minecraft` |
| minecraft-creative | 31559 | HP NFS `/data/games/minecraft` |
| minecraft-johannes | 31563 | HP NFS `/data/games/minecraft` |
| minecraft-noah | 31560 | HP NFS `/data/games/minecraft` |
| Factorio | — | HP NFS `/data/games/factorio` |
| OpenTTD | — | HP NFS `/data/games/openttd` |
Minecraft operators: LadyGisela5, tomgates24, anutzalizuk, toranaga_samma
### monitoring
- **Helm release**: `obs`, chart `prometheus-community/kube-prometheus-stack`
- **Values file**: `/home/dan/homelab/deployment/helm/prometheus/prometheus-helm-values.yaml`
- **Components**: Prometheus, Grafana, AlertManager, Node Exporter, Kube State Metrics
- **Grafana**: NodePort 31473 → http://192.168.2.100:31473
- **Storage**: 100 Gi hostPath PV at `/data/infra/prometheus` on kube-node-1
### infrastructure
- General MySQL/MariaDB (StatefulSet) — HP NFS `/media-pool/general-db`
- Speedtest Tracker — HP NFS `/media-pool/speedtest`
- DNS updater (DaemonSet, `tunix/digitalocean-dyndns`) — updates DigitalOcean DNS
- Proxmox ingress → 192.168.2.193:8006
### storage
- **OwnCloud** (`owncloud/server:10.12`) — drive.vandachevici.ro, admin: sefu
- MariaDB (StatefulSet), Redis (Deployment), OwnCloud server (2 replicas)
- Storage: HP NFS `/data/owncloud`
### media
- **Jellyfin** — media.vandachevici.ro, storage: HP NFS `/data/media`
- **Immich** — photos.vandachevici.ro, storage: HP NFS `/data/photos`
- Components: server (2 replicas), ML (2 replicas), valkey, postgresql
### iot
- IoT MySQL (StatefulSet, db: `iot_db`)
- IoT API (`iot-api:latest`, NodePort 30800) — requires `topology.homelab/server: dell` label
### ai
- **Paperclip** — paperclip.vandachevici.ro
  - Embedded PostgreSQL at `/data/ai/paperclip/instances/default/db`
  - Config: `/data/ai/paperclip/instances/default/config.json`
  - NFS PV via keepalived VIP `192.168.2.252:/data/ai/paperclip` (can schedule on any node) ✅
  - Env: `PAPERCLIP_AGENT_JWT_SECRET` (in K8s secret)
---
## AI / OpenClaw
### local-ai VM (192.168.2.88) — GPU instance
- **GPU**: NVIDIA Tesla P4, 8 GB VRAM (PCIe passthrough from Proxmox)
  - VFIO: `/etc/modprobe.d/vfio.conf` ids=10de:1bb3, allow_unsafe_interrupts=1
  - initramfs updated for persistence
- **Ollama**: listening on `0.0.0.0:11434`, models at `/data/ollama/models`
  - Loaded: `qwen3:8b` (5.2 GB)
- **openclaw-gateway**: `ws://0.0.0.0:18789`, auth mode: token
  - Token: in `~/.openclaw/openclaw.json` under `gateway.auth.token`
  - Systemd: `openclaw-gateway.service` (Type=simple, enabled)
### remote-ai VM (192.168.2.91)
- **openclaw-gateway**: installed (v2026.3.13), config at `~/.openclaw/openclaw.json`
- Uses cloud AI providers (Claude API key required)
### Connecting Paperclip to openclaw
- URL: `ws://192.168.2.88:18789/`
- Auth: token from `~/.openclaw/openclaw.json` under `gateway.auth.token`
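A quick pre-flight from any LAN host: pull the token out with `jq`, then poke the port. The use of `websocat` here is illustrative — any WebSocket client works, and the handshake itself is openclaw-specific:

```bash
# read the shared token out of the gateway's config
TOKEN=$(jq -r '.gateway.auth.token' ~/.openclaw/openclaw.json)
# confirm the port answers at all before debugging auth
nc -zv 192.168.2.88 18789
# interactive WebSocket session (if websocat is installed)
websocat "ws://192.168.2.88:18789/"
```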
---
## Network Endpoints
| Service | URL / Address |
|---|---|
| K8s Dashboard | https://192.168.2.100:30443 |
| Proxmox UI | https://proxmox.vandachevici.ro |
| Grafana | http://192.168.2.100:31473 |
| Jellyfin | https://media.vandachevici.ro |
| Immich (photos) | https://photos.vandachevici.ro |
| OwnCloud | https://drive.vandachevici.ro |
| Paperclip | https://paperclip.vandachevici.ro |
| IoT API | http://192.168.2.100:30800 |
| minecraft-home | 192.168.2.100:31112 |
| minecraft-cheats | 192.168.2.100:31111 |
| minecraft-creative | 192.168.2.100:31559 |
| minecraft-johannes | 192.168.2.100:31563 |
| minecraft-noah | 192.168.2.100:31560 |
| Ollama (local-ai) | http://192.168.2.88:11434 |
| openclaw gateway (local-ai) | ws://192.168.2.88:18789 |
| Ollama (Dell) | http://192.168.2.100:11434 |
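Reachability of the AI endpoints is easy to probe; `/api/tags` is Ollama's standard model-listing route:

```bash
# models pulled on the GPU instance (per this doc, should include qwen3:8b)
curl -s http://192.168.2.88:11434/api/tags
# same check against the Dell's Ollama
curl -s http://192.168.2.100:11434/api/tags
```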
### DNS subdomains managed (DigitalOcean)
`photos`, `backup`, `media`, `chat`, `openttd`, `excalidraw`, `prv`, `drive`, `grafana`, `paperclip`, `proxmox`

---
## Common Operations
### Apply manifests
```bash
kubectl apply -f /home/dan/homelab/deployment/<namespace>/
```
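A server-side dry run validates the manifests against the live API before anything changes:

```bash
kubectl apply --dry-run=server -f /home/dan/homelab/deployment/<namespace>/
```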
### Prometheus (Helm)
```bash
helm upgrade obs prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f /home/dan/homelab/deployment/helm/prometheus/prometheus-helm-values.yaml
```
### NFS provisioners (Helm)
```bash
# Example: jellyfin
helm upgrade nfs-jellyfin nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  -n nfs-provisioners \
  -f /home/dan/homelab/deployment/helm/nfs-provisioners/values-jellyfin.yaml
```
### Troubleshooting: Flannel CNI after reboot
If all pods are stuck in `ContainerCreating` after a reboot:
```bash
# 1. Check default route exists on kube-node-1
ip route show | grep default
# Fix: sudo ip route add default via 192.168.2.1 dev eno1
# Persist: check /etc/netplan/00-installer-config.yaml has routes section
# 2. Restart flannel pod on node-1
kubectl delete pod -n kube-flannel -l app=flannel --field-selector spec.nodeName=kube-node-1
```
### Troubleshooting: kube-node-3 NotReady after reboot
Likely swap re-enabled:
```bash
ssh dan@192.168.2.196 "sudo swapoff -a && sudo sed -i 's|^/swap.img|#/swap.img|' /etc/fstab && sudo systemctl restart kubelet"
```
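After the fix, confirm swap stayed off and the node rejoined:

```bash
# swapon --show prints nothing when swap is fully disabled
ssh dan@192.168.2.196 "swapon --show && systemctl is-active kubelet"
kubectl get node kube-node-3   # expect STATUS=Ready after a minute or so
```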
---
## Workspace Structure
```
/home/dan/homelab/
├── HOMELAB.md — this file
├── plan.md — original rebuild plan
├── step-by-step.md — execution tracker
├── deployment/ — K8s manifests and Helm values
│   ├── 00-namespaces.yaml
│   ├── ai/ — Paperclip
│   ├── default/ — DNS updater
│   ├── games/ — Minecraft, Factorio, OpenTTD
│   ├── helm/ — Helm values (prometheus, nfs-provisioners)
│   ├── infrastructure/ — ingress-nginx, cert-manager, general-db, speedtest, proxmox-ingress
│   ├── iot/ — IoT DB + API
│   ├── media/ — Jellyfin, Immich
│   ├── monitoring/ — (managed by Helm)
│   └── storage/ — OwnCloud
├── backups/ — K8s secrets backup (gitignored)
├── hardware/ — hardware spec docs
├── orchestration/
│   └── ansible/ — playbooks, inventory, group_vars, cloud-init
└── services/
    └── device-inventory/ — C++ CMake project: network device discovery
```