HA Sync — Execution Plan
Problem Statement
Two servers (Dell OptiPlex 7070 at 192.168.2.100 and HP ProLiant at 192.168.2.193) each export the same folder set over NFS. A Kubernetes-native tool must keep each folder pair in bidirectional sync: newest file wins, mtime is preserved on copy, delete propagation is strict (one-way per CronJob), and every operation is logged in the MySQL instance in the infrastructure namespace.
Architecture Decisions (Agreed)
| Decision | Choice | Rationale |
| --- | --- | --- |
| Language | Go | Single static binary, excellent async I/O, no runtime overhead |
| Sync direction | Bidirectional via two one-way CronJobs | Each folder pair gets a→b and b→a jobs; newest-mtime wins |
| Loop prevention | Preserve mtime on copy + --delete-missing flag | Mtime equality → skip; no extra DB state needed |
| Lock | Kubernetes Lease object (coordination.k8s.io/v1) | Native K8s TTL; survives MySQL outage; sync blocked only if K8s API is down (already required for CronJob) |
| Change detection | mtime + size first; MD5 only on mtime/size mismatch | Efficient for large datasets |
| Delete propagation | Strict mirror — configurable per job via --delete-missing | See ⚠️ note below |
| Volume access | NFS mounts (both servers already export NFS) | No HostPath or node-affinity needed |
| Audit logging | Write to opslog file during run; flush to MySQL on completion | MySQL outage does not block sync; unprocessed opslogs are retried on next run |
| Opslog storage | Persistent NFS-backed PVC at /var/log/ha-sync/ | /tmp is ephemeral (lost on pod exit); NFS PVC persists across CronJob runs for 10-day retention |
Locking: Kubernetes Lease
Each sync pair uses a coordination.k8s.io/v1 Lease object named ha-sync-<pair> in the infrastructure namespace.
- spec.holderIdentity = <pod-name>/<iteration-id>
- spec.leaseDurationSeconds = --lock-ttl (default 3600)
- A background goroutine renews (spec.renewTime) every leaseDurationSeconds / 3 seconds
- On normal exit or SIGTERM: the Lease is deleted (released)
- Stale leases (holder crashed without release) expire automatically after leaseDurationSeconds
- Requires RBAC: a ServiceAccount with create/get/update/delete on leases in infrastructure
Audit Logging: Opslog + MySQL Flush
- On sync start: open /var/log/ha-sync/opslog-<pair>-<direction>-<RFC3339>.jsonl
- Each file operation: append one JSON line (all sync_operations fields)
- On sync end: attempt flush to MySQL (sync_iterations + sync_operations batch INSERT)
- On successful flush: delete the opslog file
- On MySQL failure: leave the opslog; on the next run, scan /var/log/ha-sync/ for unprocessed opslogs and retry the flush before starting a new sync
- Cleanup: after each run, delete opslogs older than 10 days (os.Stat mtime check)
⚠️ Delete Propagation Warning
With two one-way jobs per pair, ordering matters for deletes. If dell→hp runs before hp→dell and --delete-missing is ON for both, files that only exist on HP will be deleted before they're copied to Dell.
Safe default: --delete-missing=false for all jobs. Enable --delete-missing=true only on the primary direction (e.g., dell→hp for each pair) once the initial full sync has completed and both sides are known-equal.
NFS Sync Pairs
| Pair name |
Dell NFS (192.168.2.100) |
HP NFS (192.168.2.193) |
media |
/data/media |
/data/media |
photos |
/data/photos |
/data/photos |
owncloud |
/data/owncloud |
/data/owncloud |
games |
/data/games |
/data/games |
infra |
/data/infra |
/data/infra |
ai |
/data/ai |
/data/ai |
Each pair produces two CronJobs in the infrastructure namespace.
CLI Interface (ha-sync)
ha-sync [flags]
Required:
--src <path> Source directory (absolute path inside pod)
--dest <path> Destination directory (absolute path inside pod)
--pair <name> Logical pair name (e.g. "media"); used as Lease name ha-sync-<pair>
Optional:
--direction <str> Label for logging, e.g. "dell-to-hp" (default: "fwd")
--db-dsn <dsn> MySQL DSN (default: from env HA_SYNC_DB_DSN)
--lock-ttl <seconds> Lease TTL before considered stale (default: 3600)
--log-dir <path> Directory for opslog files (default: /var/log/ha-sync)
--log-retain-days <n> Delete opslogs older than N days (default: 10)
--mtime-threshold <s> Seconds of tolerance for mtime equality (default: 2)
--delete-missing Delete dest files not present in src (default: false)
--workers <n> Concurrent file workers (default: 4)
--dry-run Compute what would sync, save to DB as dry_run rows, print plan; do not copy/delete (default: false)
--verbose Verbose output
--help                 Print usage and exit
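The flag surface above maps directly onto the stdlib flag package. A sketch of the parser (Config field names are illustrative; defaults are taken from the list above):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config mirrors the CLI surface; names and defaults follow the flag list.
type Config struct {
	Src, Dest, Pair, Direction string
	DBDSN                      string
	LockTTL                    int
	LogDir                     string
	LogRetainDays              int
	MtimeThreshold             int
	DeleteMissing              bool
	Workers                    int
	DryRun, Verbose            bool
}

func parseFlags(args []string) (*Config, error) {
	c := &Config{}
	fs := flag.NewFlagSet("ha-sync", flag.ContinueOnError)
	fs.StringVar(&c.Src, "src", "", "source directory (required)")
	fs.StringVar(&c.Dest, "dest", "", "destination directory (required)")
	fs.StringVar(&c.Pair, "pair", "", "logical pair name (required)")
	fs.StringVar(&c.Direction, "direction", "fwd", "label for logging")
	fs.StringVar(&c.DBDSN, "db-dsn", os.Getenv("HA_SYNC_DB_DSN"), "MySQL DSN")
	fs.IntVar(&c.LockTTL, "lock-ttl", 3600, "lease TTL seconds")
	fs.StringVar(&c.LogDir, "log-dir", "/var/log/ha-sync", "opslog directory")
	fs.IntVar(&c.LogRetainDays, "log-retain-days", 10, "opslog retention days")
	fs.IntVar(&c.MtimeThreshold, "mtime-threshold", 2, "mtime tolerance seconds")
	fs.BoolVar(&c.DeleteMissing, "delete-missing", false, "delete dest files missing in src")
	fs.IntVar(&c.Workers, "workers", 4, "concurrent file workers")
	fs.BoolVar(&c.DryRun, "dry-run", false, "plan only; no copy/delete")
	fs.BoolVar(&c.Verbose, "verbose", false, "verbose output")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if c.Src == "" || c.Dest == "" || c.Pair == "" {
		return nil, fmt.Errorf("--src, --dest and --pair are required")
	}
	return c, nil
}

func main() {
	c, err := parseFlags([]string{"--src=/mnt/dell/media", "--dest=/mnt/hp/media", "--pair=media"})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Pair, c.Workers, c.DryRun)
}
```

Note the --db-dsn default falls back to HA_SYNC_DB_DSN at parse time, matching the flag description.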
MySQL Schema (database: general_db)
-- One row per CronJob execution
CREATE TABLE IF NOT EXISTS sync_iterations (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
sync_pair VARCHAR(255) NOT NULL,
direction VARCHAR(64) NOT NULL,
src VARCHAR(512) NOT NULL,
dest VARCHAR(512) NOT NULL,
started_at DATETIME(3) NOT NULL,
ended_at DATETIME(3),
status ENUM('running','success','partial_failure','failed') NOT NULL DEFAULT 'running',
dry_run TINYINT(1) NOT NULL DEFAULT 0,
files_created INT DEFAULT 0,
files_updated INT DEFAULT 0,
files_deleted INT DEFAULT 0,
files_skipped INT DEFAULT 0,
files_failed INT DEFAULT 0,
total_bytes_transferred BIGINT DEFAULT 0,
error_message TEXT,
INDEX idx_pair (sync_pair),
INDEX idx_started (started_at),
INDEX idx_dry_run (dry_run)
);
-- One row per individual file operation (flushed from opslog on sync completion)
CREATE TABLE IF NOT EXISTS sync_operations (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
iteration_id BIGINT NOT NULL,
dry_run TINYINT(1) NOT NULL DEFAULT 0,
operation ENUM('create','update','delete') NOT NULL,
filepath VARCHAR(4096) NOT NULL,
size_before BIGINT,
size_after BIGINT,
md5_before VARCHAR(32),
md5_after VARCHAR(32),
started_at DATETIME(3) NOT NULL,
ended_at DATETIME(3),
status ENUM('success','fail') NOT NULL,
error_message VARCHAR(4096),
INDEX idx_iteration (iteration_id),
CONSTRAINT fk_iteration FOREIGN KEY (iteration_id) REFERENCES sync_iterations(id)
);
No sync_locks table — locking is handled by Kubernetes Lease objects.
Dry-run Idempotency Rules
--dry-run mode: walk source and dest, compute the full set of would-be operations (create/update/delete), save to DB with dry_run = 1, print the plan. No files are copied or deleted.
- Idempotency check: before running a dry-run, query for the last successful dry-run iteration for (pair, direction):
SELECT id, started_at FROM sync_iterations
WHERE sync_pair = ? AND direction = ? AND dry_run = 1 AND status = 'success'
ORDER BY started_at DESC LIMIT 1;
Then re-walk the source and dest and compute the would-be operation set. Compare it against the sync_operations rows from that previous dry-run iteration (same set of filepath + operation + size_before). If identical → print "Dry-run already current as of <started_at>. Nothing has changed." and exit without writing new rows.
- Production run (--dry-run not set): all queries for previous iterations use WHERE dry_run = 0. Dry-run rows are never considered for skip logic, idempotency, or status reporting in production runs.
- Lease is still acquired during dry-run (prevents two dry-runs from racing each other).
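The "same set of filepath + operation + size_before" comparison is an order-insensitive set equality. A sketch using a comparable struct as the map key (PlannedOp and samePlan are illustrative names):

```go
package main

import "fmt"

// PlannedOp is the identity used for the idempotency comparison:
// filepath + operation + size_before, per the rule above.
type PlannedOp struct {
	Filepath   string
	Operation  string
	SizeBefore int64
}

// samePlan reports whether two would-be operation sets are identical,
// ignoring order; counting handles duplicate entries correctly.
func samePlan(a, b []PlannedOp) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[PlannedOp]int, len(a))
	for _, op := range a {
		seen[op]++
	}
	for _, op := range b {
		seen[op]--
		if seen[op] < 0 {
			return false
		}
	}
	return true
}

func main() {
	prev := []PlannedOp{{"/a", "create", 0}, {"/b", "update", 9}}
	next := []PlannedOp{{"/b", "update", 9}, {"/a", "create", 0}}
	fmt.Println(samePlan(prev, next)) // order-insensitive: true
}
```

If the plans match, the run prints the "already current" message and exits without new rows; any mismatch (even a single size_before change) produces a fresh dry-run iteration.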
Project Structure
services/ha-sync/
cmd/ha-sync/
main.go # Sync CLI entry point
cmd/ha-sync-ui/
main.go # Dashboard HTTP server entry point (serves ha-sync.vandachevici.ro)
internal/
config/
config.go # Config struct, defaults, validation (shared by both binaries)
db/
db.go # MySQL connect, auto-migrate schema
logging.go # StartIteration, FinishIteration, BulkInsertOperations, LastDryRunOps
lease/
lease.go # Acquire/release/heartbeat Kubernetes Lease object
opslog/
writer.go # Append JSON lines to /var/log/ha-sync/opslog-<pair>-<direction>-<RFC3339>.jsonl
flusher.go # Scan for unprocessed opslogs, batch INSERT; cleanup logs >10 days
sync/
engine.go # Main sync loop: walk, compare, dispatch; dryRun flag skips writes
walker.go # Recursive directory walk
compare.go # mtime+size comparison; conditional MD5
copy.go # File copy with os.Chtimes() mtime preservation
delete.go # Safe delete with pre-check
ui/
handler.go # HTTP handlers: index, /api/iterations, /api/operations, /api/pairs
templates/
index.html # Dashboard HTML; auto-refreshes every 10s via fetch(); vanilla JS only
go.mod
go.sum
Dockerfile # Multi-stage: golang:1.22-alpine builder (builds ha-sync + ha-sync-ui) → alpine:3.20
Makefile # build, docker-build IMAGE=<registry>/ha-sync:latest, docker-push targets
deployment/ha-sync/
serviceaccount.yaml # ServiceAccount: ha-sync, namespace: infrastructure
rbac.yaml # Role + RoleBinding: leases (coordination.k8s.io) create/get/update/delete
secret.yaml # NOTE: create manually — see Phase 3C instructions
pv-logs.yaml # PersistentVolume: NFS 192.168.2.193:/data/infra/ha-sync-logs, 10Gi, RWX
pvc-logs.yaml # PVC bound to pv-logs; all CronJobs mount at /var/log/ha-sync
pv-dell-<pair>.yaml # PersistentVolume: NFS 192.168.2.100:/data/<pair> (one per pair × 6)
pv-hp-<pair>.yaml # PersistentVolume: NFS 192.168.2.193:/data/<pair> (one per pair × 6)
pvc-dell-<pair>.yaml # PVC → pv-dell-<pair> (one per pair × 6)
pvc-hp-<pair>.yaml # PVC → pv-hp-<pair> (one per pair × 6)
cron-<pair>-dell-to-hp.yaml # --dry-run is DEFAULT; remove flag to enable production sync
cron-<pair>-hp-to-dell.yaml # same
ui-deployment.yaml # Deployment: ha-sync-ui, 1 replica, image: <registry>/ha-sync:latest, cmd: ha-sync-ui
ui-service.yaml # ClusterIP Service: port 8080 → ha-sync-ui pod
ui-ingress.yaml # Ingress: ha-sync.vandachevici.ro → ui-service:8080; cert-manager TLS
kustomization.yaml # Kustomize root listing all resources
scripts/cli/
ha-sync.md # CLI reference doc
UI Dashboard (ha-sync.vandachevici.ro)
- Binary: ha-sync-ui — Go HTTP server, port 8080
- Routes:
  - GET / — HTML dashboard; auto-refreshes via setInterval + fetch
  - GET /api/pairs — JSON: per-pair last iteration summary (dry_run=0 and dry_run=1 separately)
  - GET /api/iterations?pair=&limit=20 — JSON: recent iterations
  - GET /api/operations?iteration_id= — JSON: operations for one iteration
- Dashboard shows: per-pair status cards (last real sync, last dry-run, files created/updated/deleted/failed), recent activity table, errors highlighted in red
- Env vars: HA_SYNC_DB_DSN (same secret as CronJobs)
- K8s: Deployment in infrastructure namespace, 1 replica, same ServiceAccount as CronJobs (read-only DB access only)
Tasks
Parallelism key: Tasks marked [P] can be executed in parallel by separate agents. Tasks marked [SEQ] must follow the listed dependency chain.
Phase 0 — Scaffolding [SEQ]
Must complete before any code is written; all subsequent tasks depend on this.
| # | Task | Command / Notes |
| --- | --- | --- |
| 0.1 | Create Go module | cd services/ha-sync && go mod init github.com/vandachevici/homelab/ha-sync |
| 0.2 | Create directory tree | mkdir -p cmd/ha-sync cmd/ha-sync-ui internal/{config,db,lease,opslog,sync,ui} |
| 0.3 | Create Dockerfile | Multi-stage: FROM golang:1.22-alpine AS build → FROM alpine:3.20; copy binary; ENTRYPOINT ["/ha-sync"] |
| 0.4 | Create Makefile | Targets: build, docker-build IMAGE=<registry>/ha-sync:latest, docker-push IMAGE=... |
Phase 1 — Core Go packages [P after Phase 0]
Sub-tasks 1A, 1B, 1C, 1D, and 1E are fully independent — assign them to separate agents simultaneously. 1F depends on all of them.
1A — internal/config [P]
| # | Task | Notes |
| --- | --- | --- |
| 1A.1 | Write config.go | Define Config struct with all CLI flags; use flag stdlib or cobra; set defaults from CLI Interface section above |
1B — internal/db [P]
| # | Task | Notes |
| --- | --- | --- |
| 1B.1 | Write db.go | Connect(dsn string) (*sql.DB, error); run CREATE TABLE IF NOT EXISTS for both tables (include dry_run TINYINT(1) NOT NULL DEFAULT 0 column in both) on startup |
| 1B.2 | Write logging.go | StartIteration(dryRun bool, ...) (id int64) → INSERT with dry_run set; FinishIteration(id, status, counts) → UPDATE; BulkInsertOperations(iterID int64, dryRun bool, []OpRecord) → batch INSERT; LastDryRunOps(db, pair, direction string) ([]OpRecord, error) → fetch ops for last successful dry_run=1 iteration for idempotency check |
1C — internal/lease [P]
| # | Task | Notes |
| --- | --- | --- |
| 1C.1 | Write lease.go | Use k8s.io/client-go in-cluster config; Acquire(ctx, client, namespace, leaseName, holderID, ttlSec) — create or update Lease if expired; Release(ctx, client, namespace, leaseName, holderID) — delete Lease; Heartbeat(ctx, ...) — goroutine that calls Update on spec.renewTime every ttlSec/3 seconds |
1D — internal/opslog [P]
| # | Task | Notes |
| --- | --- | --- |
| 1D.1 | Write writer.go | Open(logDir, pair, direction string) (*Writer, error) — creates /var/log/ha-sync/opslog-<pair>-<direction>-<RFC3339>.jsonl; Append(op OpRecord) error — JSON-encode one line |
| 1D.2 | Write flusher.go | FlushAll(logDir string, db *sql.DB) error — scan dir for *.jsonl, for each: decode lines → call BulkInsertOperations, delete file on success; CleanOld(logDir string, retainDays int) — delete files with mtime older than N days |
1E — internal/sync [P]
| # | Task | Notes |
| --- | --- | --- |
| 1E.1 | Write walker.go | Walk(root string) ([]FileInfo, error) — returns slice of {RelPath, AbsPath, Size, ModTime, IsDir}; use filepath.WalkDir |
| 1E.2 | Write compare.go | NeedsSync(src, dest FileInfo, threshold time.Duration) bool — mtime+size check; MD5File(path string) (string, error) — streaming MD5; MD5Changed(srcPath, destPath string) bool |
| 1E.3 | Write copy.go | CopyFile(src, dest string, srcModTime time.Time) error — copy bytes, then os.Chtimes(dest, srcModTime, srcModTime) to preserve mtime |
| 1E.4 | Write delete.go | DeleteFile(path string) error — os.Remove; DeleteDir(path string) error — os.RemoveAll only if dir is empty after child removal |
| 1E.5 | Write engine.go | Walk src+dest, compare, dispatch create/update/delete via worker pool (sync.WaitGroup + buffered channel of --workers size); if dryRun=true, build op list but do not call copy/delete — return ops for caller to log; write each op to opslog.Writer (tagged with dry_run flag); return summary counts |
1F — cmd/ha-sync/main.go [SEQ, depends on 1A+1B+1C+1D+1E]
| # | Task | Notes |
| --- | --- | --- |
| 1F.1 | Write main.go | Parse flags → build config → connect DB → flush old opslogs → acquire Lease → if --dry-run: call LastDryRunOps, walk src+dest, compute would-be ops, compare; if identical → print "already current" + exit; else run engine(dryRun=true) → open opslog writer (tagged dry_run) → start iteration row (dry_run = true/false) → run engine → finish iteration → flush opslog to DB → release Lease; trap SIGTERM to release Lease before exit; production queries always filter dry_run = 0 |
Phase 2 — Build & Docker Image [SEQ after Phase 1]
| # | Task | Command |
| --- | --- | --- |
| 2.1 | Fetch Go deps | cd services/ha-sync && go mod tidy |
| 2.2 | Build binary | cd services/ha-sync && make build |
| 2.3 | Build Docker image | make docker-build IMAGE=192.168.2.100:5000/ha-sync:latest (replace registry if different) |
| 2.4 | Push Docker image | make docker-push IMAGE=192.168.2.100:5000/ha-sync:latest |
Phase 3 — Kubernetes Manifests [P, can start during Phase 1]
All manifest sub-tasks are independent and can be parallelized.
3A — RBAC + Shared Resources [P]
| # | Task | Notes |
| --- | --- | --- |
| 3A.1 | Create serviceaccount.yaml | name: ha-sync, namespace: infrastructure |
| 3A.2 | Create rbac.yaml | Role with rules: apiGroups: [coordination.k8s.io], resources: [leases], verbs: [create, get, update, delete]; RoleBinding binding ha-sync SA to the Role |
| 3A.3 | Create pv-logs.yaml + pvc-logs.yaml | PV: nfs.server: 192.168.2.193, nfs.path: /data/infra/ha-sync-logs, capacity 10Gi, accessModes: [ReadWriteMany]; PVC: storageClassName: "", volumeName: pv-ha-sync-logs, namespace infrastructure |
3B — PVs and PVCs per pair [P]
| # | Task | Notes |
| --- | --- | --- |
| 3B.1 | Create pv-dell-<pair>.yaml for each of 6 pairs | spec.nfs.server: 192.168.2.100, spec.nfs.path: /data/<pair>; capacity per pair: media: 2Ti, photos: 500Gi, games: 500Gi, owncloud: 500Gi, infra: 100Gi, ai: 500Gi; accessModes: [ReadWriteMany] |
| 3B.2 | Create pv-hp-<pair>.yaml for each of 6 pairs | Same structure; spec.nfs.server: 192.168.2.193 |
| 3B.3 | Create pvc-dell-<pair>.yaml + pvc-hp-<pair>.yaml | namespace: infrastructure; accessModes: [ReadWriteMany]; storageClassName: "" (manual bind); volumeName: pv-dell-<pair> / pv-hp-<pair> |
3C — CronJobs [P, depends on 3A+3B for volume/SA names]
| # | Task | Notes |
| --- | --- | --- |
| 3C.1 | Create cron-<pair>-dell-to-hp.yaml for each pair | namespace: infrastructure; serviceAccountName: ha-sync; schedule: "*/15 * * * *"; image: <registry>/ha-sync:latest; args: ["--src=/mnt/dell/<pair>","--dest=/mnt/hp/<pair>","--pair=<pair>","--direction=dell-to-hp","--db-dsn=$(HA_SYNC_DB_DSN)","--log-dir=/var/log/ha-sync"]; volumeMounts: pvc-dell-<pair> → /mnt/dell/<pair>, pvc-hp-<pair> → /mnt/hp/<pair>, pvc-ha-sync-logs → /var/log/ha-sync; envFrom: ha-sync-db-secret |
| 3C.2 | Create cron-<pair>-hp-to-dell.yaml for each pair | Same but src/dest swapped, direction=hp-to-dell; offset schedule by 7 min: "7,22,37,52 * * * *" |
| 3C.3 | Create secret.yaml | Comment-only file; actual secret created manually: kubectl create secret generic ha-sync-db-secret --from-literal=HA_SYNC_DB_DSN='<user>:<pass>@tcp(general-purpose-db.infrastructure.svc.cluster.local:3306)/general_db' -n infrastructure |
| 3C.4 | Create kustomization.yaml | Resources in order: serviceaccount.yaml, rbac.yaml, pv-logs.yaml, pvc-logs.yaml, all pv-*.yaml, all pvc-*.yaml, all cron-*.yaml |
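For reference, 3C.1 could render to a manifest roughly like the following sketch for the media pair. Names, mount paths, and the registry are taken from this plan; concurrencyPolicy: Forbid is an extra suggestion (belt-and-braces alongside the Lease), not something the plan mandates:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ha-sync-media-dell-to-hp
  namespace: infrastructure
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid     # suggested: skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ha-sync
          restartPolicy: Never
          containers:
            - name: ha-sync
              image: 192.168.2.100:5000/ha-sync:latest
              args:
                - --src=/mnt/dell/media
                - --dest=/mnt/hp/media
                - --pair=media
                - --direction=dell-to-hp
                - --log-dir=/var/log/ha-sync
                - --dry-run             # default-on per this plan; remove to go live
              envFrom:
                - secretRef:
                    name: ha-sync-db-secret
              volumeMounts:
                - { name: dell-media, mountPath: /mnt/dell/media }
                - { name: hp-media, mountPath: /mnt/hp/media }
                - { name: logs, mountPath: /var/log/ha-sync }
          volumes:
            - name: dell-media
              persistentVolumeClaim: { claimName: pvc-dell-media }
            - name: hp-media
              persistentVolumeClaim: { claimName: pvc-hp-media }
            - name: logs
              persistentVolumeClaim: { claimName: pvc-ha-sync-logs }
```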
Phase 4 — CLI Documentation [P, independent]
| # | Task | Notes |
| --- | --- | --- |
| 4.1 | Create scripts/cli/ha-sync.md | Document all flags, defaults, example invocations, env vars (HA_SYNC_DB_DSN); note --dry-run for safe first-run; note --delete-missing rollout guidance |
Phase 5 — Deploy & Verify [SEQ after Phase 2+3]
| # | Task | Command |
| --- | --- | --- |
| 5.1 | Create DB secret | kubectl create secret generic ha-sync-db-secret --from-literal=HA_SYNC_DB_DSN='<user>:<pass>@tcp(general-purpose-db.infrastructure.svc.cluster.local:3306)/general_db' -n infrastructure |
| 5.2 | Apply manifests | kubectl apply -k deployment/ha-sync/ |
| 5.3 | Dry-run smoke test | kubectl create job ha-sync-test --from=cronjob/ha-sync-media-dell-to-hp -n infrastructure then: kubectl logs -l job-name=ha-sync-test -n infrastructure -f |
| 5.4 | Verify Lease is created | kubectl get lease ha-sync-media -n infrastructure -o yaml |
| 5.5 | Verify DB rows | kubectl exec -it <general-purpose-db-pod> -n infrastructure -- mysql -u<user> -p general_db -e "SELECT * FROM sync_iterations ORDER BY id DESC LIMIT 5;" |
| 5.6 | Verify opslog flush | Check /var/log/ha-sync/ on the logs PVC — no .jsonl files should remain after a successful run |
| 5.7 | Trigger real first run | Delete the test job; let CronJob run on schedule; observe sync_operations table |
Open Questions / Future Work
- MySQL HA: general-purpose-db is a single-replica StatefulSet — no HA. Since locking is handled by a K8s Lease and MySQL is only used for audit logging (with opslog fallback), a MySQL outage won't block sync. If full MySQL HA is later desired, MariaDB Galera Cluster (3 replicas) is the recommended path for this homelab.
- Conflict resolution: currently "newest mtime wins". If clocks drift between nodes, a file could ping-pong. Consider NTP enforcement across all nodes or set --mtime-threshold >= the observed clock skew.
- Delete safety: --delete-missing defaults to false. Staged rollout: run one full cycle disabled first → confirm parity → enable on the primary direction only.
- Alerting: add a Prometheus/Grafana alert on sync_iterations.status = 'failed' (query general_db directly or expose a future /metrics endpoint).
- DB retention: sync_operations will grow large. Add a cleanup step: DELETE FROM sync_operations WHERE started_at < NOW() - INTERVAL 30 DAY as a weekly CronJob.
- Registry: the Dockerfile assumes a local registry at 192.168.2.100:5000. Confirm the registry address before Phase 2.