Install EGS using Script
This topic describes the steps to install EGS on the cluster using the script provided in the egs-installation
repository.
Prerequisites
-
Kubernetes cluster with GPU nodes.
- If fractional GPUs must be supported, the GPU node must have MIG capability. For example, the NVIDIA A100.
-
GPU Operator is installed on the cluster.
- If MIG capability is supported and if shared CPU features are required, the GPU Operator must be configured correctly. For more information, see GPU Operator with MIG.
-
On the cluster, verify the following:
- NVIDIA GPU Operator is installed and the nvidia-dcgm-exporter pod is running.
- Prometheus is running and uses a PVC for data persistence.
- Prometheus is configured to collect the metrics from the DCGM exporter.
- Grafana is configured with the NVIDIA DCGM dashboard.
-
The Admin Kubeconfig is required to access the Kubernetes cluster.
-
Outbound Internet connectivity from the Kubernetes cluster to several image repositories.
-
You must have the privilege to create the kubeslice-controller and kubeslice-system namespaces to deploy EGS, and the kubeslice-<PROJECT NAME> namespace for a given project name.
-
You must have permission to create the load balancer service.
-
The PostgreSQL database must be backed by a PVC for data persistence.
-
An ingress controller, such as nginx, must be installed on the cluster.
-
The following command line tools are required for installation:
- bash version higher than 5.0.0
- helm
- kubectl
- jq
- yq
For more information, see required tools to install EGS.
-
To receive tokens for image pull secrets, you must first register. To register, visit the KubeSlice registration page.
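The command line tool requirements above can be checked before running any of the scripts. The following is a minimal sketch and is not part of the egs-installation repository: the helper names `missing_tools`, `version_gt`, and `check_required_tools` are hypothetical, and `sort -V` assumes GNU coreutils or a compatible `sort`.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative, not part of egs-installation.

# Print each tool from the argument list that is not found in PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Return 0 if version $1 is strictly higher than version $2.
# Assumes a sort that supports -V (version sort).
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Report missing tools, then check the version of the running bash shell.
check_required_tools() {
  local missing bash_version
  missing="$(missing_tools bash helm kubectl jq yq)"
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
  # BASH_VERSION looks like "5.2.15(1)-release"; keep only the numeric prefix.
  bash_version="${BASH_VERSION%%[!0-9.]*}"
  if ! version_gt "$bash_version" "5.0.0"; then
    printf 'bash %s found; a version higher than 5.0.0 is required\n' "$bash_version" >&2
    return 1
  fi
}
```

After sourcing the snippet, run `check_required_tools`. It returns a non-zero status and lists the missing tools if any are absent from PATH.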
Clone the Repository
Clone the repository using the following command:
git clone https://github.com/kubeslice-ent/egs-installation.git
Ensure the YAML configuration file is correctly formatted and contains all necessary fields.
The script will exit with an error if any critical steps fail unless configured to skip on failure.
Paths specified in the YAML file must be relative to the base_path unless absolute paths are used.
Check for Prerequisites
Use the egs-preflight-check.sh script to verify the prerequisites for installing EGS.
-
Navigate to the cloned repository and use the following command to change the file permission:
chmod +x egs-preflight-check.sh
-
Use the following command to run the script:
./egs-preflight-check.sh --kubeconfig <ADMIN KUBECONFIG> --kubecontext-list <KUBECTX>
After the script passes all of the necessary checks, proceed to install EGS on the cluster.
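A wrong kubeconfig path or context name is an easy mistake to make when running the preflight script. The following sketch checks both before invoking it. The helpers `context_exists` and `run_preflight` are hypothetical names, and the sketch assumes it is run from the root of the cloned repository with a single context.

```shell
#!/usr/bin/env bash
# Sketch only: context_exists and run_preflight are illustrative helpers.

# Return 0 if context name $1 appears in the newline-separated list on stdin.
context_exists() {
  grep -Fxq -- "$1"
}

# Check that the kubeconfig file and the context exist, then run the preflight script.
run_preflight() {
  local kubeconfig="$1" context="$2"
  if [ ! -f "$kubeconfig" ]; then
    printf 'kubeconfig not found: %s\n' "$kubeconfig" >&2
    return 1
  fi
  # "kubectl config get-contexts -o name" prints one context name per line.
  if ! kubectl --kubeconfig "$kubeconfig" config get-contexts -o name |
      context_exists "$context"; then
    printf 'context not found in %s: %s\n' "$kubeconfig" "$context" >&2
    return 1
  fi
  ./egs-preflight-check.sh --kubeconfig "$kubeconfig" --kubecontext-list "$context"
}
```

For example, `run_preflight <ADMIN KUBECONFIG> <KUBECTX>` stops with a message if either value is wrong, and otherwise runs the same command shown in the steps above.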
Modify the Configuration File
-
Gather the following information required for installation:
- Prometheus endpoint
- Grafana endpoint
- PostgreSQL connection configuration
- Admin kubeconfig/context to the cluster with GPU
# from the email received after registering
IMAGE_REPOSITORY="https://index.docker.io/v1/"
USERNAME="xxx"
PASSWORD="xxx"
KUBECONFIG="kubeconfig" #location of kubeconfig file
KUBECONTEXT="kubecontext" # cluster context
# Define required variables
PROMETHEUS_ENDPOINT="http://prometheus.monitoring.svc.cluster.local:9090"
GRAFANA_DASHBOARD_BASE_URL="http://grafana.egs-monitoring.svc.cluster.local:8088"
INGRESS_CLASS_NAME="nginx"
CONTROLLER_ENDPOINT="$(kubectl cluster-info | awk '/control plane/ {print $NF}')"
#set helm version
EGS_VERSION="1.11.0"
-
Navigate to the cloned repository and locate the input configuration file, egs-only-config.yaml.
-
Update the egs-only-config.yaml file using the information from Step 1.
-
Update the following mandatory parameters in the egs-only-config.yaml file:
a. Set all the Prometheus URL values
-
kubeslice_controller_egs:
  inline_values:
    global:
      KubeTally:
        prometheusUrl: <set-prometheus-url>
-
kubeslice_ui_egs:
  inline_values:
    kubeslice:
      prometheus:
        url: <set-prometheus-url>
-
kubeslice_worker_egs:
  inline_values:
    egs:
      prometheusEndpoint: <set-prometheus-url>
-
cluster_registration:
  - cluster_name:
    telemetry:
      endpoint: <set-telemetry-endpoint>
b. Set the Grafana URL values
kubeslice_worker_egs:
  inline_values:
    egs:
      grafanaDashboardBaseUrl: <set-grafana-url>
c. Set all use_local_charts to false
use_local_charts: false
d. Set the helm repo URL
global_helm_repo_url: "https://smartscaler.nexus.aveshalabs.io/repository/kubeslice-egs-helm-ent-prod"
-
You can add the kubeslice.io/managed-by-egs=false label to GPU nodes.
This label excludes or filters the associated GPU nodes from the EGS inventory.
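Before installing, it can help to confirm that no placeholder was left unedited in the configuration file. The sketch below assumes the file uses the same `<set-...>` placeholder style shown in this topic, which may not match the file in the repository. The helper names `find_unset_placeholders` and `exclude_node_from_egs` are hypothetical. The second helper wraps the standard `kubectl label node` command for the optional label described in the last step.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the <set-...> placeholder pattern are assumptions.

# Print every line (with its line number) that still contains a <set-...>
# placeholder. Returns 0 when none remain, 1 otherwise.
find_unset_placeholders() {
  local file="$1"
  if grep -nE '<set-[a-z-]+>' "$file"; then
    return 1
  fi
  return 0
}

# Label a GPU node so that it is excluded from the EGS inventory.
exclude_node_from_egs() {
  kubectl label node "$1" kubeslice.io/managed-by-egs=false
}
```

For example, `find_unset_placeholders egs-only-config.yaml` prints the lines that still need a value, and `exclude_node_from_egs <NODE NAME>` applies the label to one node.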
Install EGS
The installation script creates a default project workspace and registers a worker cluster.
To register additional worker clusters, use the k8s Clusters page on the Admin Portal after running this script. For more information, see Register Clusters.
Use the following command to install EGS:
./egs-installer.sh --input-yaml egs-only-config.yaml
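After the installer finishes, a quick look at the pods in the namespaces it uses can confirm that the components started. This is a minimal sketch with hypothetical helper names (`pods_not_running`, `check_egs_pods`). It assumes the default `kubectl get pods` column layout, where STATUS is the third column, and it only checks the two namespaces named in the prerequisites.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of egs-installation.

# Read "kubectl get pods --no-headers" output on stdin and print the name of
# every pod whose STATUS column is neither Running nor Completed.
pods_not_running() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# For each namespace used by EGS, list the pods that are not yet running.
check_egs_pods() {
  local ns
  for ns in kubeslice-controller kubeslice-system; do
    printf '%s:\n' "$ns"
    kubectl get pods -n "$ns" --no-headers | pods_not_running
  done
}
```

Running `check_egs_pods` prints each namespace followed by any pod that is still pending or failing. A namespace with no pod names under it has nothing outstanding.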
Uninstall EGS
Use the following command to uninstall EGS:
./egs-uninstall.sh --input-yaml egs-only-config.yaml
Troubleshooting
- For missing binaries, ensure all required binaries are installed and accessible in your system's PATH.
- For cluster access issues, verify that the kubeconfig files are correctly configured so that the script can access the clusters specified in the YAML configuration.
- For timeout issues, if a component fails to install within the specified timeout, increase the verify_install_timeout value in the YAML file.
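The cluster access and timeout checks above can be sketched as two small helpers. The names `can_access_cluster` and `show_install_timeouts` are hypothetical. The first helper uses `kubectl get nodes` as a simple reachability test, which is an assumption about what counts as access for the script.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of egs-installation.

# Return 0 if the cluster is reachable with the given kubeconfig and context.
can_access_cluster() {
  local kubeconfig="$1" context="$2"
  [ -f "$kubeconfig" ] || return 1
  kubectl --kubeconfig "$kubeconfig" --context "$context" get nodes >/dev/null 2>&1
}

# Print each verify_install_timeout line in the YAML file with its line number.
show_install_timeouts() {
  grep -n 'verify_install_timeout' "$1"
}
```

For example, `can_access_cluster <ADMIN KUBECONFIG> <KUBECTX> || echo "no access"` reports an access problem, and `show_install_timeouts egs-only-config.yaml` shows where the timeout values are set.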