Disaster Recovery
This topic describes how to back up a KubeSlice controller cluster and restore it to a new Kubernetes cluster, then reconnect worker clusters and restore SliceConfigs.
Prerequisites
Before setting up disaster recovery, ensure that the following prerequisites are met to enable a smooth and reliable backup and restore process.
Clone the Disaster Recovery Repository
Clone the disaster recovery repository that contains the scripts to help you back up and restore the controller cluster.
- Go to the repository folder that contains the scripts: https://github.com/kubeslice-ent/egs-disaster-recovery/tree/kubeslice-backup-script.
- Clone the repository using the following command:
  git clone /egs-disaster-recovery.git

Note: If you are running the commands on a Windows terminal, run dos2unix <script> before running the script.
This repository provides the following scripts.
| Script | What it does |
|---|---|
| backup-controller.sh | This script creates a controller backup artifact. Run it as frequently as needed; for most systems, every 30 minutes is sufficient. |
| restore-controller.sh | This script restores the controller from a backup to a new cluster. Recommended disaster recovery restore steps: 1. Create a new cluster, or if using an existing cluster, run cleanup-controller.sh. 2. Run restore-controller.sh. 3. Run reconnect-workers.sh. 4. Run restore-sliceconfigs.sh. |
| cleanup-controller.sh | This script cleans a cluster so you can retry a restore. |
| reconnect-workers.sh | This script reconnects workers to the restored controller. |
| restore-sliceconfigs.sh | This script applies SliceConfigs after workers reconnect. |
Required Tools on Your Jumpbox
- kubectl
- helm
- jq
- yq (strongly recommended; required for best sanitization/validation)
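A quick way to confirm that the tools above are on the jumpbox PATH is sketched below. The function name check_tools is illustrative and is not part of the disaster recovery repository.

```shell
# Report every tool from the argument list that is not found on PATH.
# Returns 0 when all tools are present, 1 otherwise.
# NOTE: check_tools is a hypothetical helper, not a script from the repository.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run manually on the jumpbox):
#   check_tools kubectl helm jq yq
```

Because yq is described as strongly recommended rather than mandatory, you may prefer to check it separately and treat its absence as a warning.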
Required Access
- Controller kubeconfig (admin)
- Worker clusters' kubeconfig (admin)
- Restore cluster kubeconfig (admin)
Restore Model Assumptions
- Your controller uses embedded PostgreSQL in the kt-postgresql namespace.
- You restore into a new cluster (or you run cleanup-controller.sh first).
Directory Layout For Your Run
Pick a folder to store backups and logs, for example:
- Backup base directory: /data/backups/kubeslice
- A single backup run creates:
  - Extracted backup directory: <backup_base>/<YYYYMMDD_HHMMSS>/
  - Tarball archive: <backup_base>/kubeslice_controller_backup_<YYYYMMDD_HHMMSS>.tar.gz
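The layout above can be expressed as a few small path helpers, which is convenient when later steps need the backup directory for a given timestamp. This is a minimal sketch; the function names are hypothetical and only the path patterns come from this topic.

```shell
# Print the extracted backup directory for a base directory and timestamp.
backup_dir_for() {
  printf '%s/%s\n' "${1%/}" "$2"
}

# Print the tarball path for the same backup run.
backup_tarball_for() {
  printf '%s/kubeslice_controller_backup_%s.tar.gz\n' "${1%/}" "$2"
}

# Print the newest extracted backup directory under the base directory.
# Timestamps in YYYYMMDD_HHMMSS form sort chronologically, so the last
# glob match is the latest run. Returns 1 if no backup directory exists.
latest_backup_dir() {
  latest=
  for d in "${1%/}"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]; do
    if [ -d "$d" ]; then
      latest=$d
    fi
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage (run manually):
#   backup_dir_for /data/backups/kubeslice 20260420_021500
#   latest_backup_dir /data/backups/kubeslice
```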
Initial Step: Verify the Controller PostgreSQL Health
This is a recommended step.
If PostgreSQL is not Ready, the backup skips the database dump and a full restore later fails at
the database restore step.
Use the following command:
KUBECONFIG=<controller_kubeconfig> kubectl get pod -n kt-postgresql -l app.kubernetes.io/name=postgresql
Proceed when the pod shows READY 1/1 and STATUS Running in the command output.
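If you want to script this check, for example ahead of a scheduled backup, the READY and STATUS columns can be tested automatically. The sketch below assumes the default kubectl get pod table output (NAME, READY, STATUS, ...); pods_all_ready is a hypothetical helper name.

```shell
# Read `kubectl get pod` table output on stdin. Succeed only if at least one
# pod row exists and every row shows READY 1/1 and STATUS Running.
# ASSUMPTION: default table output, where column 2 is READY and column 3 is STATUS.
pods_all_ready() {
  awk 'NR > 1 { rows++; if ($2 != "1/1" || $3 != "Running") bad++ }
       END { exit !(rows > 0 && bad + 0 == 0) }'
}

# Usage (run manually against the controller cluster):
#   KUBECONFIG=<controller_kubeconfig> kubectl get pod -n kt-postgresql \
#     -l app.kubernetes.io/name=postgresql | pods_all_ready \
#     && echo "PostgreSQL is Ready" || echo "PostgreSQL is NOT Ready"
```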
Step 1: Create A Controller Backup
- Go to the egs-disaster-recovery/ directory and run the following command:
  ./backup-controller.sh <backup_base_dir> <controller_kubeconfig>
  Example:
  ./backup-controller.sh /data/backups/kubeslice /path/to/controller-kubeconfig.yaml
- Record the printed timestamp, for example: 20260420_021500.
Quick Backup Validation
Run the following commands to validate the backup:
ls -la /data/backups/kubeslice/20260420_021500
ls -la /data/backups/kubeslice/kubeslice_controller_backup_20260420_021500.tar.gz
ls -la /data/backups/kubeslice/20260420_021500/postgres/
In the command output, confirm that postgres/kubetally_dump.sql.gz is present and non-empty. A full
database-backed restore requires this file.
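The same checks can be automated so that an incomplete backup is caught right away. The following is a sketch under the directory layout described in this topic; validate_backup is a hypothetical helper, and the gzip -t integrity test is an extra check that the backup script itself is not known to perform.

```shell
# Validate one backup run: extracted directory, tarball, and PostgreSQL dump.
# Arguments: <backup_base_dir> <timestamp>
validate_backup() {
  base=${1%/}
  ts=$2
  dump="$base/$ts/postgres/kubetally_dump.sql.gz"

  [ -d "$base/$ts" ] \
    || { echo "missing backup directory: $base/$ts" >&2; return 1; }
  [ -f "$base/kubeslice_controller_backup_$ts.tar.gz" ] \
    || { echo "missing tarball for $ts" >&2; return 1; }
  # -s is true only if the file exists and is larger than zero bytes.
  [ -s "$dump" ] \
    || { echo "PostgreSQL dump missing or empty: $dump" >&2; return 1; }
  # Extra check (an addition, not from the backup script): gzip integrity.
  gzip -t "$dump" 2>/dev/null \
    || { echo "PostgreSQL dump is not a valid gzip file: $dump" >&2; return 1; }
}

# Usage (run manually):
#   validate_backup /data/backups/kubeslice 20260420_021500 && echo "backup looks complete"
```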
Step 2: Prepare the Restore Cluster
This step cleans up the restore cluster. You can skip it for a newly created cluster.
If there were previous restore attempts on the cluster, clean it first using the following command:
./cleanup-controller.sh <restore_cluster_kubeconfig>
Example:
./cleanup-controller.sh /path/to/restore-cluster.yaml
This script is destructive to KubeSlice resources on that cluster.
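Because the cleanup is destructive, you may want a guard that requires explicit confirmation before the script runs. The sketch below is one possible wrapper; confirm_destructive is a hypothetical helper and is not part of the repository.

```shell
# Ask for explicit confirmation on stdin before a destructive action.
# Succeeds only if the operator types the exact word "yes".
confirm_destructive() {
  printf '%s\nType "yes" to continue: ' "$1" >&2
  read -r answer || return 1
  [ "$answer" = "yes" ]
}

# Usage (run manually; KCFG is a placeholder for the restore cluster kubeconfig):
#   kubectl --kubeconfig "$KCFG" config current-context
#   confirm_destructive "cleanup-controller.sh will remove KubeSlice resources using $KCFG" \
#     && ./cleanup-controller.sh "$KCFG"
```

Printing the current context first makes it easier to notice when the kubeconfig points at the wrong cluster.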
Step 3: Run Restore Preflight
This step makes no changes to the cluster. It only verifies the backup structure and chart availability.
Local Charts Mode
This mode is recommended for airgapped/pinned installs. Use the following command for the local charts mode:
./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
--enable-remote-helm=false \
--local-chart-folder=<charts_dir> \
--dry-run=true
Remote Helm Repository Mode
Use the following command for the remote Helm repository mode:
./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
--enable-remote-helm=true \
--helm-repo-url=<helm_repo_url> \
--dry-run=true
Proceed only after the preflight passes. If the preflight fails, resolve the reported issue and rerun it.
Step 4: Restore the Controller
Local Charts
Use the following command for local charts:
./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
--enable-remote-helm=false \
--local-chart-folder=<charts_dir> \
--dry-run=false
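The preflight and the restore can be chained so that the restore never starts after a failed preflight. This sketch uses only the flags shown in this topic and assumes that restore-controller.sh exits with a non-zero status when the preflight fails, which this topic does not confirm. The function name and the RESTORE_SCRIPT variable are illustrative.

```shell
# Path to the restore script; override RESTORE_SCRIPT if it lives elsewhere.
RESTORE_SCRIPT="${RESTORE_SCRIPT:-./restore-controller.sh}"

# Run the dry-run preflight, then the real restore only if the preflight succeeds.
# Arguments: <backup_dir> <restore_cluster_kubeconfig> <charts_dir>
# ASSUMPTION: the script returns a non-zero exit status on preflight failure.
preflight_then_restore() {
  backup_dir=$1
  kubeconfig=$2
  charts_dir=$3

  if ! "$RESTORE_SCRIPT" "$backup_dir" "$kubeconfig" \
      --enable-remote-helm=false \
      --local-chart-folder="$charts_dir" \
      --dry-run=true; then
    echo "preflight failed; restore not started" >&2
    return 1
  fi

  "$RESTORE_SCRIPT" "$backup_dir" "$kubeconfig" \
    --enable-remote-helm=false \
    --local-chart-folder="$charts_dir" \
    --dry-run=false
}

# Usage (run manually from the egs-disaster-recovery/ directory):
#   preflight_then_restore <backup_dir> <restore_cluster_kubeconfig> <charts_dir>
```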
Validate Controller Pods
Validate the controller pods using the following commands:
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kt-postgresql
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kubeslice-controller
At this stage:
- Controller should be up.
- Clusters may show not-ready until workers are reconnected.
- SliceConfigs are not applied yet.
Step 5: Restore SliceConfigs
Perform this step only after you have reconnected the workers to the restored controller with reconnect-workers.sh and the workers are healthy.
- Use the following command to restore SliceConfigs:
  ./restore-sliceconfigs.sh <backup_dir> <restore_cluster_kubeconfig>
- Validate the configuration by using the following command:
  KUBECONFIG=<restore_cluster_kubeconfig> kubectl get sliceconfig -A
Step 6: End-To-End Verification Checklist
Run the following commands for basic validation:
# Controller pods
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kubeslice-controller
# PostgreSQL pod
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kt-postgresql
# Worker clusters
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get clusters -A
# SliceConfigs
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get sliceconfig -A
Optional validation: Submit a small test request (if your environment supports it) to confirm full control plane/data plane behavior.
Common Issues and Resolutions
Backup Missing PostgreSQL Dump
| Symptom | Solution |
|---|---|
| Restore fails at the database restore step. | Ensure kt-postgresql is Ready on the controller, then rerun backup. |
Workers Stay Unknown/Not Healthy
| Symptom | Solution |
|---|---|
| kubectl get clusters -A command output shows Unknown or doesn't transition. | 1. Rerun reconnect-workers.sh for the affected worker. 2. Check worker pods in kubeslice-system. 3. Ensure worker secrets and kubeslice-hub point to the new controller endpoints. |
SliceConfigs Would Not Apply
| Symptom | Solution |
|---|---|
| Error message: "cluster registration not completed" | Ensure workers are healthy first, then retry restore-sliceconfigs.sh. |
Logs Location
- Backup creates: <backup_base_dir>/backup_<timestamp>.log
- Restore writes logs into: <backup_dir>/restore_<timestamp>.log
Always attach these logs when escalating an issue.
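To make attaching the logs easier, the backup and restore logs can be bundled into a single archive. This is a minimal sketch based on the log locations listed above; collect_dr_logs and the output file name are hypothetical.

```shell
# Bundle backup and restore logs into one gzip-compressed tar archive.
# Arguments: <backup_base_dir> <backup_dir> <output_archive>
collect_dr_logs() {
  base=${1%/}
  backup_dir=${2%/}
  out=$3

  staging=$(mktemp -d) || return 1
  # Unmatched globs stay literal, so the -f test filters them out.
  for f in "$base"/backup_*.log "$backup_dir"/restore_*.log; do
    if [ -f "$f" ]; then
      cp "$f" "$staging/"
    fi
  done

  tar -czf "$out" -C "$staging" .
  rc=$?
  rm -rf "$staging"
  return "$rc"
}

# Usage (run manually):
#   collect_dr_logs /data/backups/kubeslice /data/backups/kubeslice/20260420_021500 dr-logs.tar.gz
```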