Version: 1.17.0

Disaster Recovery

This topic describes how to back up a KubeSlice controller cluster and restore it to a new Kubernetes cluster, then reconnect worker clusters and restore SliceConfigs.

Prerequisites

Before you set up disaster recovery, make sure the following prerequisites are met so that backup and restore run reliably.

Clone the Disaster Recovery Repository

Clone the disaster recovery repository, which contains the scripts that help you back up and restore the controller cluster.

  1. Go to the repository folder that contains the scripts: https://github.com/kubeslice-ent/egs-disaster-recovery/tree/kubeslice-backup-script.

  2. Clone the repository using the following command:

    git clone /egs-disaster-recovery.git

    Note: If you run the commands in a Windows terminal, run dos2unix <script> on each script before you run it.

This repository provides the following scripts.

| Script | What it does |
| --- | --- |
| backup-controller.sh | Creates the controller backup artifact. Run it as frequently as needed; for most systems, every 30 minutes is sufficient. |
| restore-controller.sh | Restores the controller from a backup to a new cluster. Recommended disaster recovery restore steps: 1. Create a new cluster, or if using an existing cluster, run cleanup-controller.sh. 2. Run restore-controller.sh. 3. Run reconnect-workers.sh. 4. Run restore-sliceconfigs.sh. |
| cleanup-controller.sh | Cleans a cluster so you can retry a restore. |
| reconnect-workers.sh | Reconnects workers to the restored controller. |
| restore-sliceconfigs.sh | Applies SliceConfigs after workers reconnect. |

Required Tools on Your Jumpbox

  • kubectl
  • helm
  • jq
  • yq (strongly recommended; required for best sanitization/validation)
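
Before running any of the scripts, it can help to confirm that the tools listed above are on the PATH. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of the repository.

```shell
# Print the name of every tool in the argument list that is not on the PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage on the jumpbox:
#   missing="$(missing_tools kubectl helm jq yq)"
#   [ -z "$missing" ] || echo "Install before continuing: $missing" >&2
```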

Required Access

  • Controller kubeconfig (admin)
  • Worker clusters' kubeconfig (admin)
  • Restore cluster kubeconfig (admin)

Restore Model Assumptions

  • Your controller uses embedded PostgreSQL in the kt-postgresql namespace.
  • You restore into a new cluster (or you run cleanup-controller.sh first).

Directory Layout For Your Run

Pick a folder to store backups and logs, for example:

  • Backup base directory: /data/backups/kubeslice

  • A single backup run creates:

    • Extracted backup directory: <backup_base>/<YYYYMMDD_HHMMSS>/
    • Tarball archive: <backup_base>/kubeslice_controller_backup_<YYYYMMDD_HHMMSS>.tar.gz
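
The layout above can be expressed as two small path helpers, which may be convenient when later commands need the extracted directory or the tarball for a given run. This is only a sketch; the function names are hypothetical.

```shell
# Extracted backup directory for one run: <backup_base>/<YYYYMMDD_HHMMSS>
backup_dir_for() {
  printf '%s/%s\n' "$1" "$2"
}

# Tarball archive for one run:
# <backup_base>/kubeslice_controller_backup_<YYYYMMDD_HHMMSS>.tar.gz
backup_tarball_for() {
  printf '%s/kubeslice_controller_backup_%s.tar.gz\n' "$1" "$2"
}

# Usage:
#   backup_dir_for /data/backups/kubeslice 20260420_021500
#   backup_tarball_for /data/backups/kubeslice 20260420_021500
```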

Initial Step: Verify the Controller PostgreSQL Health

This is a recommended step.

If PostgreSQL is not Ready, the backup skips the database dump and a full restore later fails at the database restore step.

Check the PostgreSQL pod status with the following command:

KUBECONFIG=<controller_kubeconfig> kubectl get pod -n kt-postgresql -l app.kubernetes.io/name=postgresql

Proceed when the pod shows READY 1/1 and STATUS Running in the command output.
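
For unattended backups, the READY and STATUS check can be scripted by parsing the default kubectl get pod table. The sketch below assumes the default column order (NAME, READY, STATUS, ...) and a single-container pod that reports 1/1; the function name pods_all_ready is hypothetical.

```shell
# Read `kubectl get pod` output on stdin and succeed only if at least one
# pod is listed and every listed pod shows READY 1/1 and STATUS Running.
pods_all_ready() {
  awk '
    NR == 1 { next }                            # skip the header row
    { seen = 1 }                                # at least one pod row exists
    $2 != "1/1" || $3 != "Running" { bad = 1 }  # any pod that is not ready
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage:
#   KUBECONFIG=<controller_kubeconfig> kubectl get pod -n kt-postgresql \
#     -l app.kubernetes.io/name=postgresql | pods_all_ready \
#     || echo "PostgreSQL is not Ready; the backup would skip the database dump" >&2
```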

Step 1: Create A Controller Backup

  1. Go to the egs-disaster-recovery/ directory and run the following command:
    ./backup-controller.sh <backup_base_dir> <controller_kubeconfig>
    Example:
    ./backup-controller.sh /data/backups/kubeslice /path/to/controller-kubeconfig.yaml
  2. Record the printed timestamp, for example: 20260420_021500.
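
If the printed timestamp was not recorded, it can usually be recovered from the directory names, assuming the extracted backup directories follow the <YYYYMMDD_HHMMSS> naming described under Directory Layout. A minimal sketch (the function name is hypothetical):

```shell
# Print the newest <YYYYMMDD_HHMMSS> directory name under the backup base
# directory, or nothing if no such directory exists. Names in this format
# sort chronologically when sorted as plain text.
latest_backup_timestamp() {
  base="$1"
  for d in "$base"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]/; do
    [ -d "$d" ] || continue   # the pattern stays literal when nothing matches
    basename "$d"
  done | sort | tail -n 1
}

# Usage:
#   latest_backup_timestamp /data/backups/kubeslice
```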

Quick Backup Validation

Run the following commands to validate the backup:

ls -la /data/backups/kubeslice/20260420_021500
ls -la /data/backups/kubeslice/kubeslice_controller_backup_20260420_021500.tar.gz
ls -la /data/backups/kubeslice/20260420_021500/postgres/

In the command output, confirm that postgres/kubetally_dump.sql.gz is present and non-empty. A full database-backed restore requires this file.
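
The presence and size check can also be scripted. The sketch below additionally runs gzip -t to catch a truncated archive; that integrity test is an extra suggestion, not something the backup script is known to do, and the function name is hypothetical.

```shell
# Succeed only if the PostgreSQL dump exists in the extracted backup
# directory, is non-empty, and passes a gzip integrity test.
has_usable_db_dump() {
  dump="$1/postgres/kubetally_dump.sql.gz"
  [ -s "$dump" ] && gzip -t "$dump" 2>/dev/null
}

# Usage:
#   has_usable_db_dump /data/backups/kubeslice/20260420_021500 \
#     || echo "No usable database dump; a full restore would fail" >&2
```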

Step 2: Prepare the Restore Cluster

This step applies only when you restore into an existing cluster; skip it for a new cluster.

If the cluster has KubeSlice resources left over from previous restore attempts, clean it first using the following command:

./cleanup-controller.sh <restore_cluster_kubeconfig>

Example:

./cleanup-controller.sh /path/to/restore-cluster.yaml

Caution: This script is destructive to KubeSlice resources on that cluster.
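
Because the cleanup is destructive, one possible safeguard is to show the kubeconfig's current context and require the operator to type it back before the script runs. This wrapper is a sketch, not part of the repository; the function name and the CLEANUP_SCRIPT variable are hypothetical, while kubectl config current-context is a standard kubectl command.

```shell
# Ask for confirmation by context name before running the cleanup script.
# CLEANUP_SCRIPT (hypothetical) overrides the script path, mainly for testing.
confirm_and_cleanup() {
  kubeconfig="$1"
  script="${CLEANUP_SCRIPT:-./cleanup-controller.sh}"
  context="$(kubectl --kubeconfig "$kubeconfig" config current-context)" || return 1
  printf 'About to clean KubeSlice resources on context "%s".\n' "$context" >&2
  printf 'Type the context name to continue: ' >&2
  read -r answer || answer=""
  if [ "$answer" != "$context" ]; then
    echo "Confirmation did not match; nothing was changed." >&2
    return 1
  fi
  "$script" "$kubeconfig"
}

# Usage:
#   confirm_and_cleanup /path/to/restore-cluster.yaml
```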

Step 3: Run Restore Preflight

This step makes no changes to the cluster; it only verifies the backup structure and chart availability.

Local Charts Mode

This mode is recommended for air-gapped or pinned installs. Use the following command for the local charts mode:

./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
--enable-remote-helm=false \
--local-chart-folder=<charts_dir> \
--dry-run=true

Remote Helm Repository Mode

Use the following command for the remote Helm repository mode:

./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
--enable-remote-helm=true \
--helm-repo-url=<helm_repo_url> \
--dry-run=true

Info: Proceed only after the preflight passes. If the preflight fails, fix the reported problem and rerun it.

Step 4: Restore the Controller

Local Charts

Use the following command for local charts:

./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
--enable-remote-helm=false \
--local-chart-folder=<charts_dir> \
--dry-run=false
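
Steps 3 and 4 can be chained so that the real restore starts only when the preflight succeeds. The sketch below assumes that restore-controller.sh exits with a non-zero status when the preflight fails, which this document does not state; verify that behavior before relying on it. The function name is hypothetical.

```shell
# Run the given restore command once with --dry-run=true and, only if that
# succeeds, again with --dry-run=false. Pass the command WITHOUT a --dry-run flag.
restore_with_preflight() {
  if ! "$@" --dry-run=true; then
    echo "Preflight failed; the restore was not started." >&2
    return 1
  fi
  "$@" --dry-run=false
}

# Usage (local charts mode):
#   restore_with_preflight ./restore-controller.sh <backup_dir> <restore_cluster_kubeconfig> \
#     --enable-remote-helm=false --local-chart-folder=<charts_dir>
```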

Validate Controller Pods

Validate the controller pods using the following commands:

KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kt-postgresql
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kubeslice-controller

At this stage:

  • Controller should be up.
  • Clusters may show not-ready until workers are reconnected.
  • SliceConfigs are not applied yet.
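
Instead of polling kubectl get pods by hand, kubectl wait can block until the pods in both namespaces report Ready. This is a sketch with a hypothetical function name; note that kubectl wait with --all also waits on any completed job pods, which never become Ready, so it may time out on clusters that have them.

```shell
# Wait for every pod in the PostgreSQL and controller namespaces to become
# Ready. The second argument is an optional timeout (default 300s).
wait_for_controller_pods() {
  kubeconfig="$1"
  timeout="${2:-300s}"
  for ns in kt-postgresql kubeslice-controller; do
    kubectl --kubeconfig "$kubeconfig" wait --for=condition=Ready pods --all \
      -n "$ns" --timeout="$timeout" || return 1
  done
}

# Usage:
#   wait_for_controller_pods <restore_cluster_kubeconfig> 600s
```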

Step 5: Restore SliceConfigs

Perform this step only after the workers have been reconnected with reconnect-workers.sh and are healthy.

  1. Use the following command to restore SliceConfigs:

    ./restore-sliceconfigs.sh <backup_dir> <restore_cluster_kubeconfig>
  2. Validate the configuration by using the following command:

    KUBECONFIG=<restore_cluster_kubeconfig> kubectl get sliceconfig -A

Step 6: End-To-End Verification Checklist

Run the following commands for basic validation:

# Controller pods
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kubeslice-controller

# PostgreSQL pod
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get pods -n kt-postgresql

# Worker clusters
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get clusters -A

# SliceConfigs
KUBECONFIG=<restore_cluster_kubeconfig> kubectl get sliceconfig -A

Note: Optional validation: Submit a small test request (if your environment supports it) to confirm full control plane/data plane behavior.
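
The checklist commands can be bundled into one function that runs all four and reports how many failed. This sketch only detects commands that exit with an error (for example, a missing resource type or an unreachable cluster); it does not interpret pod or cluster health, so the output still needs to be read. The function name is hypothetical.

```shell
# Run each checklist command against the restore cluster and return
# non-zero if any of them exits with an error.
verify_restore() {
  kubeconfig="$1"
  failed=0
  for check in \
    "get pods -n kubeslice-controller" \
    "get pods -n kt-postgresql" \
    "get clusters -A" \
    "get sliceconfig -A"; do
    echo "== kubectl $check"
    # $check is intentionally unquoted so it splits into separate arguments
    kubectl --kubeconfig "$kubeconfig" $check || failed=$((failed + 1))
  done
  echo "$failed check(s) failed"
  [ "$failed" -eq 0 ]
}

# Usage:
#   verify_restore <restore_cluster_kubeconfig>
```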

Common Issues and Resolutions

Backup Missing PostgreSQL Dump

| Symptom | Solution |
| --- | --- |
| Restore fails at the database restore step. | Ensure kt-postgresql is Ready on the controller, then rerun backup. |

Workers Stay Unknown/Not Healthy

| Symptom | Solution |
| --- | --- |
| The kubectl get clusters -A output shows Unknown, or the status does not transition. | 1. Rerun reconnect-workers.sh for the affected worker. 2. Check worker pods in kubeslice-system. 3. Ensure worker secrets and kubeslice-hub point to the new controller endpoints. |
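
To check worker pods across several worker clusters, a field selector can narrow the listing to pods whose phase is not Running. This is a sketch with a hypothetical function name. A pod in the Running phase can still be unready (for example, while crash-looping), and completed pods are listed too, so treat the output as a starting point only.

```shell
# For each worker kubeconfig given, list pods in kubeslice-system whose
# phase is anything other than Running.
worker_pods_not_running() {
  for kubeconfig in "$@"; do
    echo "== $kubeconfig"
    kubectl --kubeconfig "$kubeconfig" get pods -n kubeslice-system \
      '--field-selector=status.phase!=Running'
  done
}

# Usage:
#   worker_pods_not_running /path/to/worker-1.yaml /path/to/worker-2.yaml
```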

SliceConfigs Would Not Apply

| Symptom | Solution |
| --- | --- |
| Error message: "cluster registration not completed" | Ensure workers are healthy first, then retry restore-sliceconfigs.sh. |

Logs Location

  • Backup writes its log to: <backup_base_dir>/backup_<timestamp>.log
  • Restore writes its log to: <backup_dir>/restore_<timestamp>.log

Always attach these logs when escalating an issue.
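
To gather the logs for an escalation, the two locations above can be scanned for matching file names. A minimal sketch, assuming the log names follow the backup_<timestamp>.log and restore_<timestamp>.log patterns exactly; the function name is hypothetical.

```shell
# Print the path of every backup and restore log found in the two
# documented locations, one per line. Prints nothing if none exist.
list_dr_logs() {
  backup_base="$1"
  backup_dir="$2"
  for f in "$backup_base"/backup_*.log "$backup_dir"/restore_*.log; do
    [ -f "$f" ] && echo "$f"
  done
  return 0
}

# Usage (paths without spaces assumed for the unquoted substitution):
#   tar -czf kubeslice-dr-logs.tar.gz \
#     $(list_dr_logs /data/backups/kubeslice /data/backups/kubeslice/20260420_021500)
```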