# HA Sync — Execution Plan

## Problem Statement

Two servers (Dell OptiPlex 7070 at `192.168.2.100` and HP ProLiant at `192.168.2.193`) each export the same folder set over NFS. A Kubernetes-native tool must keep each folder pair in bidirectional sync: newest file wins, mtime is preserved on copy, delete propagation is strict (one-way per CronJob), and every operation is logged in the MySQL instance in the `infrastructure` namespace.

---
## Architecture Decisions (Agreed)

| Decision | Choice | Rationale |
|---|---|---|
| Language | **Go** | Single static binary, excellent async I/O, no runtime overhead |
| Sync direction | **Bidirectional via two one-way CronJobs** | Each folder pair gets `a→b` and `b→a` jobs; newest-mtime wins |
| Loop prevention | **Preserve mtime on copy + `--delete-missing` flag** | Mtime equality → skip; no extra DB state needed |
| Lock | **Kubernetes `Lease` object (coordination.k8s.io/v1)** | Native K8s TTL; survives MySQL outage; sync blocked only if K8s API is down (already required for CronJob) |
| Change detection | **mtime + size first; MD5 only on mtime/size mismatch** | Efficient for large datasets |
| Delete propagation | **Strict mirror — configurable per job via `--delete-missing`** | See ⚠️ note below |
| Volume access | **NFS mounts (both servers already export NFS)** | No HostPath or node-affinity needed |
| Audit logging | **Write to opslog file during run; flush to MySQL on completion** | MySQL outage does not block sync; unprocessed opslogs are retried on next run |
| Opslog storage | **Persistent NFS-backed PVC at `/var/log/ha-sync/`** | `/tmp` is ephemeral (lost on pod exit); NFS PVC persists across CronJob runs for 10-day retention |

### Locking: Kubernetes Lease

Each sync pair uses a `coordination.k8s.io/v1` Lease object named `ha-sync-<pair>` in the `infrastructure` namespace.

- `spec.holderIdentity` = `<pod-name>/<iteration-id>`
- `spec.leaseDurationSeconds` = `--lock-ttl` (default 3600)
- A background goroutine renews (`spec.renewTime`) every `leaseDurationSeconds / 3` seconds
- On normal exit or SIGTERM: Lease is deleted (released)
- Stale leases (holder crashed without release): expire automatically after `leaseDurationSeconds`
- Requires RBAC: `ServiceAccount` with `create/get/update/delete` on `leases` in `infrastructure`

### Audit Logging: Opslog + MySQL Flush

1. On sync start: open `/var/log/ha-sync/opslog-<pair>-<direction>-<RFC3339>.jsonl`
2. Each file operation: append one JSON line (all `sync_operations` fields)
3. On sync end: attempt flush to MySQL (`sync_iterations` + `sync_operations` batch INSERT)
4. On successful flush: delete the opslog file
5. On MySQL failure: leave the opslog; on next run, scan `/var/log/ha-sync/` for unprocessed opslogs and retry flush before starting new sync
6. Cleanup: after each run, delete opslogs older than 10 days (`os.Stat` mtime check)

### ⚠️ Delete Propagation Warning

With two one-way jobs per pair, ordering matters for deletes. If `dell→hp` runs before `hp→dell` and `--delete-missing` is ON for both, files that exist only on HP will be deleted before they are ever copied to Dell.

**Safe default**: `--delete-missing=false` for all jobs. Enable `--delete-missing=true` only on the **primary direction** (e.g., `dell→hp` for each pair) once the initial full sync has completed and both sides are known to be equal.

---
## NFS Sync Pairs

| Pair name | Dell NFS (192.168.2.100) | HP NFS (192.168.2.193) |
|---|---|---|
| `media` | `/data/media` | `/data/media` |
| `photos` | `/data/photos` | `/data/photos` |
| `owncloud` | `/data/owncloud` | `/data/owncloud` |
| `games` | `/data/games` | `/data/games` |
| `infra` | `/data/infra` | `/data/infra` |
| `ai` | `/data/ai` | `/data/ai` |

Each pair produces **two CronJobs** in the `infrastructure` namespace.

---
## CLI Interface (`ha-sync`)

```
ha-sync [flags]

Required:
  --src <path>            Source directory (absolute path inside pod)
  --dest <path>           Destination directory (absolute path inside pod)
  --pair <name>           Logical pair name (e.g. "media"); used as Lease name ha-sync-<pair>

Optional:
  --direction <str>       Label for logging, e.g. "dell-to-hp" (default: "fwd")
  --db-dsn <dsn>          MySQL DSN (default: from env HA_SYNC_DB_DSN)
  --lock-ttl <seconds>    Lease TTL before considered stale (default: 3600)
  --log-dir <path>        Directory for opslog files (default: /var/log/ha-sync)
  --log-retain-days <n>   Delete opslogs older than N days (default: 10)
  --mtime-threshold <s>   Seconds of tolerance for mtime equality (default: 2)
  --delete-missing        Delete dest files not present in src (default: false)
  --workers <n>           Concurrent file workers (default: 4)
  --dry-run               Compute what would sync, save to DB as dry_run rows, print plan; do not copy/delete (default: false)
  --verbose               Verbose output
  --help
```

---
## MySQL Schema (database: `general_db`)

```sql
-- One row per CronJob execution
CREATE TABLE IF NOT EXISTS sync_iterations (
  id                      BIGINT AUTO_INCREMENT PRIMARY KEY,
  sync_pair               VARCHAR(255) NOT NULL,
  direction               VARCHAR(64) NOT NULL,
  src                     VARCHAR(512) NOT NULL,
  dest                    VARCHAR(512) NOT NULL,
  started_at              DATETIME(3) NOT NULL,
  ended_at                DATETIME(3),
  status                  ENUM('running','success','partial_failure','failed') NOT NULL DEFAULT 'running',
  dry_run                 TINYINT(1) NOT NULL DEFAULT 0,
  files_created           INT DEFAULT 0,
  files_updated           INT DEFAULT 0,
  files_deleted           INT DEFAULT 0,
  files_skipped           INT DEFAULT 0,
  files_failed            INT DEFAULT 0,
  total_bytes_transferred BIGINT DEFAULT 0,
  error_message           TEXT,
  INDEX idx_pair (sync_pair),
  INDEX idx_started (started_at),
  INDEX idx_dry_run (dry_run)
);

-- One row per individual file operation (flushed from opslog on sync completion)
CREATE TABLE IF NOT EXISTS sync_operations (
  id            BIGINT AUTO_INCREMENT PRIMARY KEY,
  iteration_id  BIGINT NOT NULL,
  dry_run       TINYINT(1) NOT NULL DEFAULT 0,
  operation     ENUM('create','update','delete') NOT NULL,
  filepath      VARCHAR(4096) NOT NULL,
  size_before   BIGINT,
  size_after    BIGINT,
  md5_before    VARCHAR(32),
  md5_after     VARCHAR(32),
  started_at    DATETIME(3) NOT NULL,
  ended_at      DATETIME(3),
  status        ENUM('success','fail') NOT NULL,
  error_message VARCHAR(4096),
  INDEX idx_iteration (iteration_id),
  CONSTRAINT fk_iteration FOREIGN KEY (iteration_id) REFERENCES sync_iterations(id)
);
```

> No `sync_locks` table — locking is handled by Kubernetes Lease objects.

### Dry-run Idempotency Rules

1. **`--dry-run` mode**: walk source and dest, compute the full set of would-be operations (create/update/delete), save to DB with `dry_run = 1`, print the plan. **No files are copied or deleted.**
2. **Idempotency check**: before running a dry-run, query for the last successful dry-run iteration for `(pair, direction)`:

   ```sql
   SELECT id, started_at FROM sync_iterations
   WHERE sync_pair = ? AND direction = ? AND dry_run = 1 AND status = 'success'
   ORDER BY started_at DESC LIMIT 1;
   ```

   Then re-walk the source and dest and compute the would-be operation set. Compare it against the `sync_operations` rows from that previous dry-run iteration (same set of `filepath + operation + size_before`). If **identical** → print `"Dry-run already current as of <started_at>. Nothing has changed."` and exit without writing new rows.
3. **Production run (`--dry-run` not set)**: all queries for previous iterations use `WHERE dry_run = 0`. Dry-run rows are **never considered** for skip logic, idempotency, or status reporting in production runs.
4. **Lease is still acquired** during dry-run (prevents two dry-runs from racing each other).

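
The set comparison in rule 2 can be sketched as follows, keying on `filepath + operation + size_before` exactly as described. `PlannedOp` and `samePlan` are illustrative names:

```go
package main

import "fmt"

// PlannedOp is the identity the idempotency check compares on.
type PlannedOp struct {
	Filepath   string
	Operation  string // create | update | delete
	SizeBefore int64
}

// samePlan reports whether two would-be operation sets are identical,
// independent of ordering; duplicate entries are counted, not collapsed.
func samePlan(prev, cur []PlannedOp) bool {
	if len(prev) != len(cur) {
		return false
	}
	counts := make(map[PlannedOp]int, len(prev))
	for _, op := range prev {
		counts[op]++
	}
	for _, op := range cur {
		counts[op]--
		if counts[op] < 0 {
			return false
		}
	}
	return true
}

func main() {
	prev := []PlannedOp{{"/a", "create", 0}, {"/b", "update", 42}}
	cur := []PlannedOp{{"/b", "update", 42}, {"/a", "create", 0}}
	fmt.Println(samePlan(prev, cur)) // true: "already current", exit without new rows
	cur = append(cur, PlannedOp{"/c", "delete", 7})
	fmt.Println(samePlan(prev, cur)) // false: write a new dry-run iteration
}
```

Because the struct is comparable, it can key the map directly; no string concatenation of the three fields is needed.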
---
## Project Structure

```
services/ha-sync/
  cmd/ha-sync/
    main.go          # Sync CLI entry point
  cmd/ha-sync-ui/
    main.go          # Dashboard HTTP server entry point (serves ha-sync.vandachevici.ro)
  internal/
    config/
      config.go      # Config struct, defaults, validation (shared by both binaries)
    db/
      db.go          # MySQL connect, auto-migrate schema
      logging.go     # StartIteration, FinishIteration, BulkInsertOperations, LastDryRunOps
    lease/
      lease.go       # Acquire/release/heartbeat Kubernetes Lease object
    opslog/
      writer.go      # Append JSON lines to /var/log/ha-sync/opslog-<pair>-<direction>-<RFC3339>.jsonl
      flusher.go     # Scan for unprocessed opslogs, batch INSERT; cleanup logs >10 days
    sync/
      engine.go      # Main sync loop: walk, compare, dispatch; dryRun flag skips writes
      walker.go      # Recursive directory walk
      compare.go     # mtime+size comparison; conditional MD5
      copy.go        # File copy with os.Chtimes() mtime preservation
      delete.go      # Safe delete with pre-check
    ui/
      handler.go     # HTTP handlers: index, /api/iterations, /api/operations, /api/pairs
      templates/
        index.html   # Dashboard HTML; auto-refreshes every 10s via fetch(); vanilla JS only
  go.mod
  go.sum
  Dockerfile         # Multi-stage: golang:1.22-alpine builder (builds ha-sync + ha-sync-ui) → alpine:3.20
  Makefile           # build, docker-build IMAGE=<registry>/ha-sync:latest, docker-push targets

deployment/ha-sync/
  serviceaccount.yaml           # ServiceAccount: ha-sync, namespace: infrastructure
  rbac.yaml                     # Role + RoleBinding: leases (coordination.k8s.io) create/get/update/delete
  secret.yaml                   # NOTE: create manually — see Phase 3C instructions
  pv-logs.yaml                  # PersistentVolume: NFS 192.168.2.193:/data/infra/ha-sync-logs, 10Gi, RWX
  pvc-logs.yaml                 # PVC bound to pv-logs; all CronJobs mount at /var/log/ha-sync
  pv-dell-<pair>.yaml           # PersistentVolume: NFS 192.168.2.100:/data/<pair> (one per pair × 6)
  pv-hp-<pair>.yaml             # PersistentVolume: NFS 192.168.2.193:/data/<pair> (one per pair × 6)
  pvc-dell-<pair>.yaml          # PVC → pv-dell-<pair> (one per pair × 6)
  pvc-hp-<pair>.yaml            # PVC → pv-hp-<pair> (one per pair × 6)
  cron-<pair>-dell-to-hp.yaml   # --dry-run is DEFAULT; remove flag to enable production sync
  cron-<pair>-hp-to-dell.yaml   # same
  ui-deployment.yaml            # Deployment: ha-sync-ui, 1 replica, image: <registry>/ha-sync:latest, cmd: ha-sync-ui
  ui-service.yaml               # ClusterIP Service: port 8080 → ha-sync-ui pod
  ui-ingress.yaml               # Ingress: ha-sync.vandachevici.ro → ui-service:8080; cert-manager TLS
  kustomization.yaml            # Kustomize root listing all resources

scripts/cli/
  ha-sync.md                    # CLI reference doc
```

### UI Dashboard (`ha-sync.vandachevici.ro`)

- **Binary**: `ha-sync-ui` — Go HTTP server, port 8080
- **Routes**:
  - `GET /` — HTML dashboard; auto-refreshes via `setInterval` + `fetch`
  - `GET /api/pairs` — JSON: per-pair last iteration summary (dry_run=0 and dry_run=1 separately)
  - `GET /api/iterations?pair=&limit=20` — JSON: recent iterations
  - `GET /api/operations?iteration_id=` — JSON: operations for one iteration
- **Dashboard shows**: per-pair status cards (last real sync, last dry-run, files created/updated/deleted/failed), recent activity table, errors highlighted in red
- **Env vars**: `HA_SYNC_DB_DSN` (same secret as CronJobs)
- **K8s**: Deployment in `infrastructure` namespace, 1 replica, same ServiceAccount as CronJobs (read-only DB access only)

---
## Tasks

> **Parallelism key**: Tasks marked `[P]` can be executed in parallel by separate agents. Tasks marked `[SEQ]` must follow the listed dependency chain.

---
### Phase 0 — Scaffolding `[SEQ]`

Must complete before any code is written; all subsequent tasks depend on this.

| # | Task | Command / Notes |
|---|---|---|
| 0.1 | Create Go module | `cd services/ha-sync && go mod init github.com/vandachevici/homelab/ha-sync` |
| 0.2 | Create directory tree | `mkdir -p cmd/{ha-sync,ha-sync-ui} internal/{config,db,lease,opslog,sync,ui/templates}` |
| 0.3 | Create Dockerfile | Multi-stage: `FROM golang:1.22-alpine AS build` → `FROM alpine:3.20`; copy binary; `ENTRYPOINT ["/ha-sync"]` |
| 0.4 | Create Makefile | Targets: `build`, `docker-build IMAGE=<registry>/ha-sync:latest`, `docker-push IMAGE=...` |

---
### Phase 1 — Core Go packages `[P after Phase 0]`

Sub-tasks 1A, 1B, 1C, 1D, and 1E are **fully independent** — assign to separate agents simultaneously. 1F depends on all of them.

#### 1A — `internal/config` `[P]`

| # | Task | Notes |
|---|---|---|
| 1A.1 | Write `config.go` | Define `Config` struct with all CLI flags; use `flag` stdlib or `cobra`; set defaults from CLI Interface section above |

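
A sketch of 1A.1 using the `flag` stdlib with the documented defaults. The `Config` fields shown are a subset, and `parseConfig` is an illustrative name:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Config mirrors a subset of the CLI flags from the interface section.
type Config struct {
	Src, Dest, Pair string
	Direction       string
	LockTTL         int
	MtimeThreshold  int
	DeleteMissing   bool
	Workers         int
	DryRun          bool
}

// parseConfig wires the flags with the documented defaults and
// validates the three required ones.
func parseConfig(args []string) (*Config, error) {
	c := &Config{}
	fs := flag.NewFlagSet("ha-sync", flag.ContinueOnError)
	fs.StringVar(&c.Src, "src", "", "source directory")
	fs.StringVar(&c.Dest, "dest", "", "destination directory")
	fs.StringVar(&c.Pair, "pair", "", "logical pair name")
	fs.StringVar(&c.Direction, "direction", "fwd", "label for logging")
	fs.IntVar(&c.LockTTL, "lock-ttl", 3600, "lease TTL in seconds")
	fs.IntVar(&c.MtimeThreshold, "mtime-threshold", 2, "mtime tolerance in seconds")
	fs.BoolVar(&c.DeleteMissing, "delete-missing", false, "mirror deletes")
	fs.IntVar(&c.Workers, "workers", 4, "concurrent file workers")
	fs.BoolVar(&c.DryRun, "dry-run", false, "plan only, no copies/deletes")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if c.Src == "" || c.Dest == "" || c.Pair == "" {
		return nil, errors.New("--src, --dest and --pair are required")
	}
	return c, nil
}

func main() {
	c, err := parseConfig([]string{
		"--src=/mnt/dell/media", "--dest=/mnt/hp/media", "--pair=media", "--dry-run",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Pair, c.Direction, c.Workers, c.DryRun) // media fwd 4 true
}
```

Using a `flag.FlagSet` instead of the global `flag.CommandLine` keeps parsing testable, since argument slices can be passed in directly.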
#### 1B — `internal/db` `[P]`

| # | Task | Notes |
|---|---|---|
| 1B.1 | Write `db.go` | `Connect(dsn string) (*sql.DB, error)`; run `CREATE TABLE IF NOT EXISTS` for both tables (include `dry_run TINYINT(1) NOT NULL DEFAULT 0` column in both) on startup |
| 1B.2 | Write `logging.go` | `StartIteration(dryRun bool, ...) (id int64)` → INSERT with `dry_run` set; `FinishIteration(id, status, counts)` → UPDATE; `BulkInsertOperations(iterID int64, dryRun bool, []OpRecord)` → batch INSERT; `LastDryRunOps(db, pair, direction string) ([]OpRecord, error)` → fetch ops for last successful `dry_run=1` iteration for idempotency check |

#### 1C — `internal/lease` `[P]`

| # | Task | Notes |
|---|---|---|
| 1C.1 | Write `lease.go` | Use `k8s.io/client-go` in-cluster config; `Acquire(ctx, client, namespace, leaseName, holderID, ttlSec)` — create or update Lease if expired; `Release(ctx, client, namespace, leaseName, holderID)` — delete Lease; `Heartbeat(ctx, ...)` — goroutine that calls `Update` on `spec.renewTime` every `ttlSec/3` seconds |

#### 1D — `internal/opslog` `[P]`

| # | Task | Notes |
|---|---|---|
| 1D.1 | Write `writer.go` | `Open(logDir, pair, direction string) (*Writer, error)` — creates `/var/log/ha-sync/opslog-<pair>-<direction>-<RFC3339>.jsonl`; `Append(op OpRecord) error` — JSON-encode one line |
| 1D.2 | Write `flusher.go` | `FlushAll(logDir string, db *sql.DB) error` — scan dir for `*.jsonl`, for each: decode lines → call `BulkInsertOperations`, delete file on success; `CleanOld(logDir string, retainDays int)` — delete files with mtime older than N days |

#### 1E — `internal/sync` `[P]`

| # | Task | Notes |
|---|---|---|
| 1E.1 | Write `walker.go` | `Walk(root string) ([]FileInfo, error)` — returns slice of `{RelPath, AbsPath, Size, ModTime, IsDir}`; use `filepath.WalkDir` |
| 1E.2 | Write `compare.go` | `NeedsSync(src, dest FileInfo, threshold time.Duration) bool` — mtime+size check; `MD5File(path string) (string, error)` — streaming MD5; `MD5Changed(srcPath, destPath string) bool` |
| 1E.3 | Write `copy.go` | `CopyFile(src, dest string, srcModTime time.Time) error` — copy bytes, then `os.Chtimes(dest, srcModTime, srcModTime)` to preserve mtime |
| 1E.4 | Write `delete.go` | `DeleteFile(path string) error` — `os.Remove`; `DeleteDir(path string) error` — `os.RemoveAll` only if dir is empty after child removal |
| 1E.5 | Write `engine.go` | Walk src+dest, compare, dispatch create/update/delete via worker pool (`sync.WaitGroup` + buffered channel of `--workers` size); if `dryRun=true`, build op list but **do not call copy/delete** — return ops for caller to log; write each op to opslog.Writer (tagged with dry_run flag); return summary counts |

#### 1F — `cmd/ha-sync/main.go` `[SEQ, depends on 1A+1B+1C+1D+1E]`

| # | Task | Notes |
|---|---|---|
| 1F.1 | Write `main.go` | Parse flags → build config → connect DB → flush old opslogs → acquire Lease → **if `--dry-run`: call `LastDryRunOps`, walk src+dest, compute would-be ops, compare; if identical → print "already current" + exit; else run engine(dryRun=true)** → open opslog writer (tagged dry_run) → start iteration row (`dry_run` = true/false) → run engine → finish iteration → flush opslog to DB → release Lease; trap SIGTERM to release Lease before exit; **production queries always filter `dry_run = 0`** |

---
### Phase 2 — Build & Docker Image `[SEQ after Phase 1]`

| # | Task | Command |
|---|---|---|
| 2.1 | Fetch Go deps | `cd services/ha-sync && go mod tidy` |
| 2.2 | Build binary | `cd services/ha-sync && make build` |
| 2.3 | Build Docker image | `make docker-build IMAGE=192.168.2.100:5000/ha-sync:latest` *(replace registry if different)* |
| 2.4 | Push Docker image | `make docker-push IMAGE=192.168.2.100:5000/ha-sync:latest` |

---
### Phase 3 — Kubernetes Manifests `[P, can start during Phase 1]`

All manifest sub-tasks are **independent** and can be parallelized.

#### 3A — RBAC + Shared Resources `[P]`

| # | Task | Notes |
|---|---|---|
| 3A.1 | Create `serviceaccount.yaml` | `name: ha-sync`, `namespace: infrastructure` |
| 3A.2 | Create `rbac.yaml` | `Role` with rules: `apiGroups: [coordination.k8s.io]`, `resources: [leases]`, `verbs: [create, get, update, delete]`; `RoleBinding` binding `ha-sync` SA to the Role |
| 3A.3 | Create `pv-logs.yaml` + `pvc-logs.yaml` | PV: `nfs.server: 192.168.2.193`, `nfs.path: /data/infra/ha-sync-logs`, capacity `10Gi`, `accessModes: [ReadWriteMany]`; PVC: `storageClassName: ""`, `volumeName: pv-ha-sync-logs`, namespace `infrastructure` |

#### 3B — PVs and PVCs per pair `[P]`

| # | Task | Notes |
|---|---|---|
| 3B.1 | Create `pv-dell-<pair>.yaml` for each of 6 pairs | `spec.nfs.server: 192.168.2.100`, `spec.nfs.path: /data/<pair>`; capacity per pair: `media: 2Ti`, `photos: 500Gi`, `games: 500Gi`, `owncloud: 500Gi`, `infra: 100Gi`, `ai: 500Gi`; `accessModes: [ReadWriteMany]` |
| 3B.2 | Create `pv-hp-<pair>.yaml` for each of 6 pairs | Same structure; `spec.nfs.server: 192.168.2.193` |
| 3B.3 | Create `pvc-dell-<pair>.yaml` + `pvc-hp-<pair>.yaml` | `namespace: infrastructure`; `accessModes: [ReadWriteMany]`; `storageClassName: ""` (manual bind); `volumeName: pv-dell-<pair>` / `pv-hp-<pair>` |

#### 3C — CronJobs `[P, depends on 3A+3B for volume/SA names]`

| # | Task | Notes |
|---|---|---|
| 3C.1 | Create `cron-<pair>-dell-to-hp.yaml` for each pair | `namespace: infrastructure`; `serviceAccountName: ha-sync`; `schedule: "*/15 * * * *"`; image: `<registry>/ha-sync:latest`; args: `["--src=/mnt/dell/<pair>","--dest=/mnt/hp/<pair>","--pair=<pair>","--direction=dell-to-hp","--db-dsn=$(HA_SYNC_DB_DSN)","--log-dir=/var/log/ha-sync"]`; volumeMounts: `pvc-dell-<pair>` → `/mnt/dell/<pair>`, `pvc-hp-<pair>` → `/mnt/hp/<pair>`, `pvc-ha-sync-logs` → `/var/log/ha-sync`; envFrom: `ha-sync-db-secret` |
| 3C.2 | Create `cron-<pair>-hp-to-dell.yaml` for each pair | Same but src/dest swapped, `direction=hp-to-dell`; offset schedule by 7 min: `"7,22,37,52 * * * *"` |
| 3C.3 | Create `secret.yaml` | Comment-only file; actual secret created manually: `kubectl create secret generic ha-sync-db-secret --from-literal=HA_SYNC_DB_DSN='<user>:<pass>@tcp(general-purpose-db.infrastructure.svc.cluster.local:3306)/general_db' -n infrastructure` |
| 3C.4 | Create `kustomization.yaml` | Resources in order: `serviceaccount.yaml`, `rbac.yaml`, `pv-logs.yaml`, `pvc-logs.yaml`, all `pv-*.yaml`, all `pvc-*.yaml`, all `cron-*.yaml` |

---
### Phase 4 — CLI Documentation `[P, independent]`

| # | Task | Notes |
|---|---|---|
| 4.1 | Create `scripts/cli/ha-sync.md` | Document all flags, defaults, example invocations, env vars (`HA_SYNC_DB_DSN`); note `--dry-run` for safe first-run; note `--delete-missing` rollout guidance |

---
### Phase 5 — Deploy & Verify `[SEQ after Phase 2+3]`

| # | Task | Command |
|---|---|---|
| 5.1 | Create DB secret | `kubectl create secret generic ha-sync-db-secret --from-literal=HA_SYNC_DB_DSN='<user>:<pass>@tcp(general-purpose-db.infrastructure.svc.cluster.local:3306)/general_db' -n infrastructure` |
| 5.2 | Apply manifests | `kubectl apply -k deployment/ha-sync/` |
| 5.3 | Dry-run smoke test | `kubectl create job ha-sync-test --from=cronjob/ha-sync-media-dell-to-hp -n infrastructure`, then: `kubectl logs -l job-name=ha-sync-test -n infrastructure -f` |
| 5.4 | Verify Lease is created | `kubectl get lease ha-sync-media -n infrastructure -o yaml` |
| 5.5 | Verify DB rows | `kubectl exec -it <general-purpose-db-pod> -n infrastructure -- mysql -u<user> -p general_db -e "SELECT * FROM sync_iterations ORDER BY id DESC LIMIT 5;"` |
| 5.6 | Verify opslog flush | Check `/var/log/ha-sync/` on the logs PVC — no `.jsonl` files should remain after a successful run |
| 5.7 | Trigger real first run | Delete the test job; let the CronJob run on schedule; observe the `sync_operations` table |

---
## Open Questions / Future Work

- **MySQL HA**: `general-purpose-db` is a single-replica StatefulSet — no HA. Since locking is now handled by K8s Lease and MySQL is only used for audit logging (with opslog fallback), a MySQL outage won't block sync. If full MySQL HA is later desired, **MariaDB Galera Cluster (3 replicas)** is the recommended path for this homelab.
- **Conflict resolution**: currently "newest mtime wins". If clocks drift between nodes, a file could ping-pong. Consider NTP enforcement across all nodes, or use `--mtime-threshold` >= the observed clock skew.
- **Delete safety**: `--delete-missing` defaults to `false`. Staged rollout: run one full cycle disabled first → confirm parity → enable on the primary direction only.
- **Alerting**: add a Prometheus/Grafana alert on `sync_iterations.status = 'failed'` (query general_db directly, or expose a future `/metrics` endpoint).
- **DB retention**: `sync_operations` will grow large. Add a cleanup step: `DELETE FROM sync_operations WHERE started_at < NOW() - INTERVAL 30 DAY` as a weekly CronJob.
- **Registry**: the Dockerfile assumes a local registry at `192.168.2.100:5000`. Confirm the registry address before Phase 2.