Version: 1.17.0

Disaster Recovery

This topic describes how to back up a KubeSlice controller cluster and restore it to a new Kubernetes cluster, then reconnect worker clusters and restore SliceConfigs.

Prerequisites

Before setting up disaster recovery, ensure that the following prerequisites are met to enable a smooth and reliable backup and restore process.

Clone the Disaster Recovery Repository

Clone the disaster recovery repository, which contains the scripts that help you back up and restore the controller cluster.

Go to the repository folder that contains the scripts: https://github.com/kubeslice-ent/egs-disaster-recovery/tree/kubeslice-backup-script.
Clone the repository using the following command:
```
git clone /egs-disaster-recovery.git
```
note
If you are running the commands on a Windows terminal, run `dos2unix <script>` on each script before you run it.

This repository provides the following scripts.

| Script | What it does |
| --- | --- |
| `backup-controller.sh` | Creates a controller backup artifact. Run it as frequently as needed; for most systems, every 30 minutes is sufficient. |
| `restore-controller.sh` | Restores the controller from a backup to a new cluster. Recommended disaster recovery restore steps: 1. Create a new cluster, or if using an existing cluster, run `cleanup-controller.sh`. 2. Run `restore-controller.sh`. 3. Run `reconnect-workers.sh`. 4. Run `restore-sliceconfigs.sh`. |
| `cleanup-controller.sh` | Cleans a cluster so that you can retry a restore. |
| `reconnect-workers.sh` | Reconnects workers to the restored controller. |
| `restore-sliceconfigs.sh` | Applies SliceConfigs after workers reconnect. |
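
The 30-minute cadence suggested for `backup-controller.sh` could be scheduled with cron. The following is a minimal sketch, not part of the repository: `backup_cron_line` is a hypothetical helper, and the repository and log file locations passed to it are assumptions to adjust for your jumpbox.

```shell
# Print a crontab entry that runs backup-controller.sh every 30 minutes.
# repo_dir and log_file are hypothetical locations chosen by the operator.
backup_cron_line() {
  repo_dir=$1
  backup_base_dir=$2
  controller_kubeconfig=$3
  log_file=$4
  printf '*/30 * * * * cd %s && ./backup-controller.sh %s %s >> %s 2>&1\n' \
    "$repo_dir" "$backup_base_dir" "$controller_kubeconfig" "$log_file"
}

# Example (appends the entry to the current user's crontab):
#   ( crontab -l 2>/dev/null
#     backup_cron_line /opt/egs-disaster-recovery /data/backups/kubeslice \
#       /path/to/controller-kubeconfig.yaml /var/log/kubeslice-backup.log
#   ) | crontab -
```

Printing the entry first makes it easy to review before it is installed.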

Required Tools on Your Jumpbox

- kubectl
- helm
- jq
- yq (strongly recommended; required for best sanitization/validation)
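
A quick way to confirm that these tools are installed is to look each one up on `PATH`. This sketch uses the POSIX `command -v` lookup; `missing_tools` is a hypothetical helper name.

```shell
# Print the name of each tool that is not on PATH.
# Prints nothing when every tool is found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example:
#   missing=$(missing_tools kubectl helm jq yq)
#   [ -z "$missing" ] || echo "Install before continuing: $missing" >&2
```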

Required Access

- Controller kubeconfig (admin)
- Worker clusters' kubeconfig (admin)
- Restore cluster kubeconfig (admin)

Restore Model Assumptions

- Your controller uses embedded PostgreSQL in the `kt-postgresql` namespace.
- You restore into a new cluster (or you run `cleanup-controller.sh` first).

Directory Layout for Your Run

Pick a folder to store backups and logs, for example:

Backup base directory: /data/backups/kubeslice
A single backup run creates:
- Extracted backup directory: <backup_base>/<YYYYMMDD_HHMMSS>/
- Tarball archive: <backup_base>/kubeslice_controller_backup_<YYYYMMDD_HHMMSS>.tar.gz

Initial Step: Verify the Controller PostgreSQL Health

This is a recommended step.

If PostgreSQL is not Ready, the backup skips the database dump and a full restore later fails at the database restore step.

Use the following command:

KUBECONFIG=<controller_kubeconfig> kubectl get pod -n kt-postgresql -l app.kubernetes.io/name=postgresql

Proceed when the pod shows READY 1/1 and STATUS Running in the command output.
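
For scripted runs, the same READY and STATUS check can be applied to the command output. This is a sketch that assumes the default `kubectl get pod` column order (NAME, READY, STATUS, ...) and that every matching pod runs a single container; `pods_all_ready` is a hypothetical helper.

```shell
# Read "kubectl get pod --no-headers" output on stdin.
# Succeed only if at least one pod is listed and every pod is 1/1 Running.
pods_all_ready() {
  awk 'NF { seen++; if ($2 != "1/1" || $3 != "Running") bad++ }
       END { exit !(seen > 0 && bad == 0) }'
}

# Example:
#   KUBECONFIG=<controller_kubeconfig> kubectl get pod -n kt-postgresql \
#     -l app.kubernetes.io/name=postgresql --no-headers | pods_all_ready \
#     || echo "PostgreSQL is not Ready; the backup would skip the database dump" >&2
```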

Step 1: Create a Controller Backup

Go to the egs-disaster-recovery/ directory and run the following command:

./backup-controller.sh <backup_base_dir> <controller_kubeconfig>

Example:

./backup-controller.sh /data/backups/kubeslice /path/to/controller-kubeconfig.yaml

Record the printed timestamp, for example: 20260420_021500.
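
If the timestamp was not recorded, it can usually be recovered from the directory names. This minimal sketch assumes that each backup run leaves a `<YYYYMMDD_HHMMSS>` directory directly under the backup base directory, as in the layout described earlier; `latest_backup_timestamp` is a hypothetical helper, not part of the repository.

```shell
# Print the newest <YYYYMMDD_HHMMSS> directory name under the backup base directory.
# Prints nothing if no such directory exists.
latest_backup_timestamp() {
  base_dir=$1
  for dir in "$base_dir"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]/; do
    [ -d "$dir" ] && basename "$dir"
  done | sort | tail -n 1
}

# Example:
#   latest_backup_timestamp /data/backups/kubeslice
```

Because the timestamp format sorts lexicographically in chronological order, a plain `sort` is enough to find the latest run.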

Quick Backup Validation

Run the following commands to validate the backup:

ls -la /data/backups/kubeslice/20260420_021500
ls -la /data/backups/kubeslice/kubeslice_controller_backup_20260420_021500.tar.gz
ls -la /data/backups/kubeslice/20260420_021500/postgres/

In the command output, confirm that postgres/kubetally_dump.sql.gz is present and non-empty. A full database-backed restore requires this file.
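
The three `ls` checks can be folded into one pass/fail test. The sketch below only checks that the paths exist and that the files are non-empty; it does not inspect their contents. `validate_backup` is a hypothetical helper.

```shell
# Succeed only if the extracted directory exists and both the tarball and
# the PostgreSQL dump for one backup run exist and are non-empty.
validate_backup() {
  backup_base_dir=$1
  timestamp=$2
  status=0
  [ -d "$backup_base_dir/$timestamp" ] \
    || { echo "missing backup directory" >&2; status=1; }
  [ -s "$backup_base_dir/kubeslice_controller_backup_${timestamp}.tar.gz" ] \
    || { echo "missing or empty tarball" >&2; status=1; }
  [ -s "$backup_base_dir/$timestamp/postgres/kubetally_dump.sql.gz" ] \
    || { echo "missing or empty PostgreSQL dump" >&2; status=1; }
  return "$status"
}

# Example:
#   validate_backup /data/backups/kubeslice 20260420_021500 && echo "backup looks complete"
```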

Step 2: Prepare the Restore Cluster

This step applies only when you reuse an existing cluster as the restore cluster.

If the cluster holds KubeSlice resources from a previous installation or restore attempt, clean it first using the following command:

./cleanup-controller.sh <restore_cluster_kubeconfig>

Example:

./cleanup-controller.sh /path/to/restore-cluster.yaml

caution

This script deletes KubeSlice resources on the target cluster. Verify that the kubeconfig points to the restore cluster before you run it.
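
Because the cleanup is destructive, a confirmation step can help catch a wrong kubeconfig. The sketch below is an assumption about how an operator might guard the call, not something the repository provides: `confirm_cleanup` is a hypothetical helper that shows the kubeconfig's current context (using `kubectl config current-context`) and continues only if the operator types it back.

```shell
# Ask the operator to retype the current context name before cleanup.
# KUBECTL can be overridden for testing; it defaults to kubectl.
confirm_cleanup() {
  kubeconfig=$1
  context=$(KUBECONFIG=$kubeconfig ${KUBECTL:-kubectl} config current-context) || return 1
  printf 'About to clean KubeSlice resources on context "%s".\n' "$context" >&2
  printf 'Type the context name to continue: ' >&2
  IFS= read -r answer
  [ "$answer" = "$context" ]
}

# Example:
#   confirm_cleanup /path/to/restore-cluster.yaml \
#     && ./cleanup-controller.sh /path/to/restore-cluster.yaml
```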

Step 3: Run Restore Preflight

This step makes no changes to the cluster. It only verifies the backup structure and chart availability.

Local Charts Mode

This mode is recommended for air-gapped or version-pinned installs. Use the following command for the local charts mode:

./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
  --enable-remote-helm=false \
  --local-chart-folder=<charts_dir> \
  --dry-run=true

Remote Helm Repository Mode

Use the following command for the remote Helm repository mode:

./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
  --enable-remote-helm=true \
  --helm-repo-url=<helm_repo_url> \
  --dry-run=true

info

Proceed to the restore only after the preflight passes. If the preflight fails, resolve the reported problem and rerun it.

Step 4: Restore the Controller

Local Charts

Use the following command for local charts:

./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
  --enable-remote-helm=false \
  --local-chart-folder=<charts_dir> \
  --dry-run=false
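
The preflight and the restore differ only in the `--dry-run` value, so the flag sets can be generated from one place. This is a sketch: `restore_flags` is a hypothetical helper, and the remote form with `--dry-run=false` is an assumption by analogy with the preflight command, since only the local charts form is shown for this step.

```shell
# Print the restore-controller.sh flags for one chart mode, one flag per line.
# mode: "local" (chart_source is a chart folder) or "remote" (chart_source is a Helm repo URL).
restore_flags() {
  mode=$1
  chart_source=$2
  dry_run=$3
  case "$mode" in
    local)  printf '%s\n' "--enable-remote-helm=false" "--local-chart-folder=$chart_source" ;;
    remote) printf '%s\n' "--enable-remote-helm=true" "--helm-repo-url=$chart_source" ;;
    *)      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  printf '%s\n' "--dry-run=$dry_run"
}

# Example (values without spaces, so plain word splitting is safe):
#   ./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> $(restore_flags local <charts_dir> true)
#   ./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> $(restore_flags local <charts_dir> false)
```

Generating both invocations from the same helper keeps the restore on exactly the chart source that the preflight validated.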

Validate Controller Pods

Validate the controller pods using the following commands:

KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kt-postgresql
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kubeslice-controller

At this stage:

- The controller should be up.
- Clusters may show not-ready until workers are reconnected.
- SliceConfigs are not applied yet.

Step 5: Restore SliceConfigs

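Perform this step only after you have reconnected the worker clusters with `reconnect-workers.sh` and they are healthy. SliceConfigs applied before the workers finish registering fail with a "cluster registration not completed" error.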

Use the following command to restore SliceConfigs:

./restore-sliceconfigs.sh <backup_dir> <restore_cluster_kubeconfig>

Validate the configuration by using the following command:

KUBECONFIG=<restore_cluster_kubeconfig> kubectl get sliceconfig -A

Step 6: End-to-End Verification Checklist

Run the following commands for basic end-to-end validation:

# Controller pods
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kubeslice-controller

# PostgreSQL pod
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kt-postgresql

# Worker clusters
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get clusters -A

# SliceConfigs
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get sliceconfig -A
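
The four checks can be run as one pass, as in the sketch below. `verify_restore` is a hypothetical helper. It only reports whether each `kubectl get` call succeeds, so the printed output still needs to be read for pod and cluster status.

```shell
# Run the four checklist queries in order and count the ones that fail.
# KUBECTL can be overridden for testing; it defaults to kubectl.
verify_restore() {
  kubeconfig=$1
  failed=0
  for target in "pods -n kubeslice-controller" "pods -n kt-postgresql" \
                "clusters -A" "sliceconfig -A"; do
    echo "== kubectl get $target"
    # $target is left unquoted on purpose so it splits into separate arguments.
    KUBECONFIG=$kubeconfig ${KUBECTL:-kubectl} get $target || failed=$((failed + 1))
  done
  [ "$failed" -eq 0 ]
}

# Example:
#   verify_restore <restore_cluster_kubeconfig> || echo "one or more checks failed" >&2
```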

note

Optional validation: Submit a small test request (if your environment supports it) to confirm full control plane/data plane behavior.

Common Issues and Resolutions

Backup Missing PostgreSQL Dump

| Symptom | Solution |
| --- | --- |
| Restore fails at the database restore step. | Ensure `kt-postgresql` is `Ready` on the controller, then rerun the backup. |

Workers Stay Unknown/Not Healthy

| Symptom | Solution |
| --- | --- |
| The `kubectl get clusters -A` output shows `Unknown`, or the cluster status never transitions to healthy. | 1. Rerun `reconnect-workers.sh` for the affected worker. 2. Check worker pods in `kubeslice-system`. 3. Ensure worker secrets and `kubeslice-hub` point to the new controller endpoints. |

SliceConfigs Do Not Apply

| Symptom | Solution |
| --- | --- |
| Error message: `"cluster registration not completed"` | Ensure workers are healthy first, then retry `restore-sliceconfigs.sh`. |

Logs Location

Backup writes its log to: <backup_base_dir>/backup_<timestamp>.log
Restore writes its log to: <backup_dir>/restore_<timestamp>.log

Always attach these logs when escalating an issue.
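
To gather the logs for an escalation, the two documented locations can be listed with a small helper. This is a sketch: `list_dr_logs` is a hypothetical name, and the archive name in the example is arbitrary.

```shell
# Print every backup and restore log found in the documented locations,
# one path per line. Prints nothing if no logs exist.
list_dr_logs() {
  backup_base_dir=$1
  backup_dir=$2
  for log in "$backup_base_dir"/backup_*.log "$backup_dir"/restore_*.log; do
    [ -f "$log" ] && printf '%s\n' "$log"
  done
  return 0
}

# Example (bundles the logs into one archive; -T - reads the file list from stdin):
#   list_dr_logs /data/backups/kubeslice /data/backups/kubeslice/20260420_021500 \
#     | tar -czf kubeslice-dr-logs.tar.gz -T -
```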
