Version: 1.15.0

Clone the Repository to Install EGS Using the Script

This topic describes how to install EGS on a Kubernetes cluster using the script provided in the egs-installation repository. The installation process involves cloning the repository, checking prerequisites, modifying the configuration file, and running the installation script.

Clone the Repository

Clone the EGS installation repository using the following command:

git clone https://github.com/kubeslice-ent/egs-installation.git
note
  • Ensure the YAML configuration file is properly formatted and includes all required fields.
  • The installation script will terminate with an error if any critical step fails, unless explicitly configured to skip on failure.
  • All file paths specified in the YAML must be relative to the base_path, unless absolute paths are provided.

Check for Prerequisites

Use the egs-preflight-check.sh script to verify the prerequisites for installing EGS.

  • Navigate to the cloned repository and make the script executable using the following command:

    chmod +x egs-preflight-check.sh    
  • Use the following command to run the script:

    ./egs-preflight-check.sh --kubeconfig <ADMIN KUBECONFIG> --kubecontext-list <KUBECTX>
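Before running the preflight script, it can help to confirm that every context you intend to pass actually exists in the kubeconfig. The following is a minimal sketch, not part of the EGS repository: missing_contexts is a hypothetical helper, and it assumes that --kubecontext-list accepts a comma-separated list of context names.

```shell
#!/bin/sh
# Sketch: print each context from a comma-separated list that is absent
# from a newline-separated list of available context names.
# missing_contexts is a hypothetical helper, not part of the EGS scripts.
missing_contexts() {
  wanted=$1      # for example "context1,context2"
  available=$2   # for example the output of: kubectl config get-contexts -o name
  printf '%s\n' "$wanted" | tr ',' '\n' | while IFS= read -r ctx; do
    [ -z "$ctx" ] && continue
    # -x matches the whole line, -F treats the name as a fixed string
    printf '%s\n' "$available" | grep -qxF -- "$ctx" || printf '%s\n' "$ctx"
  done
}

# Example wiring (needs kubectl and a real kubeconfig, so it is commented out):
# available=$(kubectl --kubeconfig "$KUBECONFIG_PATH" config get-contexts -o name)
# missing=$(missing_contexts "context1,context2" "$available")
# [ -z "$missing" ] || echo "contexts not found: $missing" >&2
```

An empty result means all listed contexts were found; any printed name is a context to fix before invoking egs-preflight-check.sh.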

Create Namespaces

If your cluster enforces namespace creation policies, pre-create the namespaces required for installation before running the script. This step is optional and only necessary if your cluster has such policies in place.

You can use the create-namespaces.sh script to create the required namespaces. Use the following command to create namespaces:

./create-namespaces.sh \
--input-yaml namespace-input.yaml \
--kubeconfig ~/.kube/config \
--kubecontext-list context1,context2

You must ensure that all required annotations and labels for policy enforcement are correctly configured in the namespace-input.yaml file.

Prerequisites for Controller and Worker Clusters

You can use the egs-install-prerequisites.sh script to configure and install prerequisites for EGS.

  • Controller cluster prerequisites: Verify EGS Controller Prerequisites for:

    • Prometheus configuration
    • PostgreSQL configuration
  • Worker cluster prerequisites: Verify EGS Worker Prerequisites for:

    • GPU Operator installation and configuration
    • Prometheus configuration and Monitoring requirements
  • Before running the prerequisites installer, configure the egs-installer-config.yaml file to enable installation of the additional applications:

    # Enable or disable specific stages of the installation
    enable_install_controller: true # Enable the installation of the Kubeslice controller
    enable_install_ui: true # Enable the installation of the Kubeslice UI
    enable_install_worker: true # Enable the installation of Kubeslice workers

    # Enable or disable the installation of additional applications (prometheus, gpu-operator, postgresql)
    enable_install_additional_apps: true # Set to true to enable additional apps installation

    # Enable custom applications
    # Set this to true if you want to allow custom applications to be deployed.
    # This is specifically useful for enabling NVIDIA driver installation on your nodes.
    enable_custom_apps: false

    # Command execution settings
    # Set this to true to allow the execution of commands for configuring NVIDIA MIG.
    # This includes modifications to the NVIDIA ClusterPolicy and applying node labels
    # based on the MIG strategy defined in the YAML (e.g., single or mixed strategy).
    run_commands: false
    note

    Important configuration in the egs-installer-config.yaml file:

    • Set enable_install_additional_apps to true: This enables the installation of GPU Operator, Prometheus, and PostgreSQL.
    • Set enable_custom_apps to true if you need NVIDIA driver installation on your nodes.
    • Set run_commands to true if you need NVIDIA MIG configuration and node labeling.
  • After configuring the YAML file, run the egs-install-prerequisites.sh script to set up GPU Operator, Prometheus, and PostgreSQL:

    ./egs-install-prerequisites.sh --input-yaml egs-installer-config.yaml

    This step installs the required infrastructure components before the main EGS installation.
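Since the additional applications are installed only when the corresponding flag is enabled, a quick check of the configuration file before running the prerequisites installer can save a wasted run. The following is a minimal sketch: flag_enabled is a hypothetical helper that does a plain-text match instead of parsing YAML, so it assumes the key is written at the start of a line with an unquoted true value.

```shell
#!/bin/sh
# Sketch: succeed when a top-level boolean key is set to true in a config file.
# flag_enabled is a hypothetical helper; it does not parse YAML.
flag_enabled() {
  # $1 = config file, $2 = key name (for example enable_install_additional_apps)
  grep -Eq "^$2:[[:space:]]*true([[:space:]]|#|\$)" "$1"
}

# Example wiring (commented out because it runs the real installer):
# if flag_enabled egs-installer-config.yaml enable_install_additional_apps; then
#   ./egs-install-prerequisites.sh --input-yaml egs-installer-config.yaml
# else
#   echo "enable_install_additional_apps is not true" >&2
# fi
```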

Modify the Configuration File

  1. Navigate to the cloned egs-installation repository and locate the input configuration file named egs-installer-config.yaml.

  2. Edit the egs-installer-config.yaml file with the global kubeconfig and kubecontext parameters:

    global_kubeconfig: ""  # Relative path to global kubeconfig file from base_path default is script directory (MANDATORY)
    global_kubecontext: "" # Global kubecontext (MANDATORY)
    use_global_context: true # If true, use the global kubecontext for all operations by default
  3. (AirGap installation only) If you are performing an AirGap installation, update the image_pull_secrets section in the config file with appropriate registry credentials or secret references. You can skip this step if you are not performing AirGap installation.

    # From the email received after registration with Avesha 
    IMAGE_REPOSITORY="https://index.docker.io/v1/"
    USERNAME="xxx"
    PASSWORD="xxx"
  4. (Optional) These settings are required only if you are not using local Helm charts and are instead pulling them from a remote Helm repository:

    1. Set use_local_charts to false

      use_local_charts: false
    2. Set the Global Helm repository URL

      global_helm_repo_url: "https://smartscaler.nexus.aveshalabs.io/repository/kubeslice-egs-helm-ent-prod"
  5. (Optional) You can customize the installation process by enabling or disabling specific components and additional applications. The following configuration options are available:

    # Enable or disable specific stages of the installation
    enable_install_controller: true # Enable the installation of the Kubeslice controller
    enable_install_ui: true # Enable the installation of the Kubeslice UI
    enable_install_worker: true # Enable the installation of Kubeslice workers

    # Enable or disable the installation of additional applications (prometheus, gpu-operator, postgresql)
    enable_install_additional_apps: false # Set to true to enable additional apps installation

    # Enable custom applications
    # Set this to true if you want to allow custom applications to be deployed.
    # This is specifically useful for enabling NVIDIA driver installation on your nodes.
    enable_custom_apps: false

    # Command execution settings
    # Set this to true to allow the execution of commands for configuring NVIDIA MIG.
    # This includes modifications to the NVIDIA ClusterPolicy and applying node labels
    # based on the MIG strategy defined in the YAML (e.g., single or mixed strategy).
    run_commands: false
  6. Update the KubeSlice Controller (EGS Controller) configuration in the egs-installer-config.yaml file:

    #### Kubeslice Controller Installation Settings ####
    kubeslice_controller_egs:
      skip_installation: false # Do not skip the installation of the controller
      use_global_kubeconfig: true # Use global kubeconfig for the controller installation
      specific_use_local_charts: true # Override to use local charts for the controller
      kubeconfig: "" # Path to the kubeconfig file specific to the controller; if empty, uses the global kubeconfig
      kubecontext: "" # Kubecontext specific to the controller; if empty, uses the global context
      namespace: "kubeslice-controller" # Kubernetes namespace where the controller will be installed
      release: "egs-controller" # Helm release name for the controller
      chart: "kubeslice-controller-egs" # Helm chart name for the controller
      #### Inline Helm Values for the Controller Chart ####
      inline_values:
        global:
          imageRegistry: docker.io/aveshasystems # Docker registry for the images
          namespaceConfig: # Labels or annotations that the EGS Controller namespaces should have
            labels: {}
            annotations: {}
          kubeTally:
            enabled: false # Enable KubeTally in the controller
            #### Postgresql Connection Configuration for Kubetally ####
            postgresSecretName: kubetally-db-credentials # Secret name in kubeslice-controller namespace for PostgreSQL credentials. If all the values below are specified,
            # the installer creates a secret with the specified name.
            # Alternatively, leave all the values below empty and provide the name of a pre-created secret that uses the connection details format below.
            postgresAddr: "kt-postgresql.kt-postgresql.svc.cluster.local" # Change this address to your PostgreSQL endpoint
            postgresPort: 5432 # Change this port for the PostgreSQL service to your value
            postgresUser: "postgres" # Change this PostgreSQL username to your value
            postgresPassword: "postgres" # Change this PostgreSQL password to your value
            postgresDB: "postgres" # Change this PostgreSQL database name to your value
            postgresSslmode: disable # Change this SSL mode for the PostgreSQL connection to your value
            prometheusUrl: http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090 # Prometheus URL for monitoring
        kubeslice:
          controller:
            endpoint: "" # Endpoint of the controller API server; auto-fetched if left empty
      #### Helm Flags and Verification Settings ####
      helm_flags: "--wait --timeout 5m --debug" # Additional Helm flags for the installation
      verify_install: false # Verify the installation of the controller
      verify_install_timeout: 30 # Timeout for the controller installation verification (in seconds)
      skip_on_verify_fail: true # If verification fails, skip this step and continue
      #### Troubleshooting Settings ####
      enable_troubleshoot: false # Enable troubleshooting mode for additional logs and checks
  7. (Optional) Configure PostgreSQL to use the KubeTally (Cost Management) feature. The PostgreSQL connection details required by the controller are stored in a Kubernetes Secret in the kubeslice-controller namespace. You can configure the secret in one of the following ways:

    • To use your own Kubernetes Secret, enter only the secret name in the configuration file and leave other fields blank. Confirm the secret exists in the kubeslice-controller namespace and uses the required key-value format.

      postgresSecretName: kubetally-db-credentials   # Existing secret in kubeslice-controller namespace

      postgresAddr: ""
      postgresPort: ""
      postgresUser: ""
      postgresPassword: ""
      postgresDB: ""
      postgresSslmode: ""
    • To automatically create a secret, provide all connection details and the secret name. The installer will then create a Kubernetes Secret in the kubeslice-controller namespace.

      postgresSecretName: kubetally-db-credentials   # Secret to be created in kubeslice-controller namespace

      postgresAddr: "kt-postgresql.kt-postgresql.svc.cluster.local" # PostgreSQL service endpoint
      postgresPort: 5432 # PostgreSQL service port (default 5432)
      postgresUser: "postgres" # PostgreSQL username
      postgresPassword: "postgres" # PostgreSQL password
      postgresDB: "postgres" # PostgreSQL database name
      postgresSslmode: disable # SSL mode for PostgreSQL connection (for example, disable or require)
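The two PostgreSQL options above imply that the connection fields should be either all empty or all filled. The following is a minimal sketch of that rule as a pre-flight check: postgres_secret_mode is a hypothetical helper, and the mode names it prints are illustrative labels, not values the installer uses.

```shell
#!/bin/sh
# Sketch: classify KubeTally PostgreSQL settings into the two documented modes.
# postgres_secret_mode is a hypothetical helper; the installer performs its
# own validation, which may differ.
postgres_secret_mode() {
  # Arguments: addr port user password db sslmode
  filled=0
  total=0
  for value in "$@"; do
    total=$((total + 1))
    [ -n "$value" ] && filled=$((filled + 1))
  done
  if [ "$filled" -eq 0 ]; then
    echo "existing-secret"   # only postgresSecretName is set
  elif [ "$filled" -eq "$total" ]; then
    echo "create-secret"     # the installer creates the secret
  else
    echo "invalid"           # partially filled: neither mode applies
  fi
}

# Example:
# postgres_secret_mode "" "" "" "" "" ""
```

A result of invalid suggests that some fields were filled in and others left blank, which matches neither of the two options.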
info

You can add the kubeslice.io/managed-by-egs=false label to GPU nodes. This label excludes or filters the associated GPU nodes from the EGS inventory.

Required Inline Configuration for Multi-Cluster Deployments

In multi-cluster deployments, you must configure the global_auto_fetch_endpoint parameter in the egs-installer-config.yaml file. This configuration is essential for proper monitoring and dashboard URL management across multiple clusters.

In single-cluster deployments, this step is not required, as the controller and worker are in the same cluster.

In multi-cluster setups, the controller and worker components may run in different namespaces or in different clusters. For the controller to reach the monitoring endpoints of the worker clusters, set the global_auto_fetch_endpoint parameter appropriately and ensure that the Grafana and Prometheus services are accessible from the controller cluster.

note

In a multi-cluster deployment, the controller cluster must be able to reach the Prometheus endpoint running on the worker clusters.

warning

If the Prometheus endpoints are not configured, you may experience issues with the dashboards (for example, missing or incomplete metric displays).

For multi-cluster setups, follow these steps to update the inline values in your egs-installer-config.yaml file:

  1. Set the global_auto_fetch_endpoint parameter to true.

    • This parameter enables the automatic fetching of monitoring endpoints from the worker clusters.
    • If you set this parameter to true, you must ensure that the worker clusters are properly configured to expose their monitoring endpoints.
  2. By default, global_auto_fetch_endpoint is set to false. If you set it to true, verify the following configuration:

    • Worker Cluster Service Details: Provide the service details for each worker cluster to fetch the correct monitoring endpoints.
    • Multiple Worker Clusters: Ensure the service endpoints (for example, Grafana and Prometheus) are accessible from the controller cluster.

    Update the egs-installer-config.yaml file with the following inline values:

    # Global monitoring endpoint settings
    global_auto_fetch_endpoint: true # Enable automatic fetching of monitoring endpoints globally
    global_grafana_namespace: egs-monitoring # Namespace where Grafana is globally deployed
    global_grafana_service_type: ClusterIP # Service type for Grafana (accessible only within the cluster)
    global_grafana_service_name: prometheus-grafana # Service name for accessing Grafana globally
    global_prometheus_namespace: egs-monitoring # Namespace where Prometheus is globally deployed
    global_prometheus_service_name: prometheus-kube-prometheus-prometheus # Service name for accessing Prometheus globally
    global_prometheus_service_type: ClusterIP # Service type for Prometheus (accessible only within the cluster)
  3. If you set global_auto_fetch_endpoint to true, the script will automatically fetch the Grafana and Prometheus endpoints from the worker clusters.

    note

    If you set global_auto_fetch_endpoint to false, you must manually specify the Grafana and Prometheus endpoints in the inline_values object of your egs-installer-config.yaml file.

    • Use the following commands to get the Prometheus and Grafana LoadBalancer external IPs:

      kubectl get svc prometheus-grafana -n monitoring
      kubectl get svc prometheus -n monitoring

      Example Output

      NAME                         TYPE           CLUSTER-IP   EXTERNAL-IP       PORT(S)          AGE
      prometheus-grafana           LoadBalancer   10.96.0.1    <grafana-lb>      80:31380/TCP     5d
      prometheus-kube-prometheus   LoadBalancer   10.96.0.2    <prometheus-lb>   9090:31381/TCP   5d
    • Update the Prometheus and Grafana LoadBalancer IPs or NodePorts in the inline_values section of your egs-installer-config.yaml file:

      inline_values:  # Inline Helm values for the worker chart
        kubesliceNetworking:
          enabled: false # Disable Kubeslice networking for this worker
        egs:
          prometheusEndpoint: "http://<prometheus-lb>" # Prometheus endpoint
          grafanaDashboardBaseUrl: "http://<grafana-lb>/d/Oxed_c6Wz" # Replace <grafana-lb> with the actual External IP
        metrics:
          insecure: true # Allow insecure connections for metrics
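When the endpoints are set manually, the two inline values are simple URLs built from the LoadBalancer addresses. The following is a minimal sketch: both helper names are hypothetical, the dashboard path /d/Oxed_c6Wz is copied from the example above, and the optional port argument is an assumption for setups where Prometheus is exposed on a port other than 80.

```shell
#!/bin/sh
# Sketch: build the worker inline values from LoadBalancer addresses.
# prometheus_endpoint and grafana_dashboard_base_url are hypothetical helpers.
prometheus_endpoint() {
  # $1 = address, $2 = optional port
  if [ -n "${2:-}" ]; then
    printf 'http://%s:%s\n' "$1" "$2"
  else
    printf 'http://%s\n' "$1"
  fi
}

grafana_dashboard_base_url() {
  # $1 = address; the dashboard path comes from the example configuration
  printf 'http://%s/d/Oxed_c6Wz\n' "$1"
}

# Example wiring (needs cluster access, so it is commented out). The jsonpath
# reads the first LoadBalancer ingress IP of a Service; some providers
# populate .hostname instead of .ip.
# grafana_lb=$(kubectl get svc prometheus-grafana -n monitoring \
#   -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# grafana_dashboard_base_url "$grafana_lb"
```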

Install EGS

note
  • The installation script creates a default project workspace and registers a worker cluster.
  • To register an additional worker cluster, use the Admin Portal. For more information, see Register Worker Clusters.

Use the following command to install EGS:

./egs-installer.sh --input-yaml egs-installer-config.yaml
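After the script finishes, a quick look at the controller release and its pods can confirm that the installation landed where expected. The following is a minimal sketch, assuming the default release name egs-controller and namespace kubeslice-controller from the configuration shown earlier: run and check_controller are hypothetical helpers, and DRY_RUN is an illustrative variable that makes the sketch print the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: post-install sanity check for the controller.
# run and check_controller are hypothetical helpers, not part of the EGS scripts.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # print the command instead of executing it
  else
    "$@"
  fi
}

check_controller() {
  # Release and namespace are the defaults from egs-installer-config.yaml
  run helm status egs-controller -n kubeslice-controller
  run kubectl get pods -n kubeslice-controller
}

# Example:
# DRY_RUN=1; check_controller
```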

Register a Worker Cluster

If you have already installed EGS and want to register an additional worker cluster, you can update the egs-installer-config.yaml file. The installation script allows you to register multiple worker clusters at the same time.

To register worker clusters:

  1. Add worker cluster configuration under:
    • kubeslice_worker_egs array
    • cluster_registration array
  2. Repeat the configuration for each worker cluster you want to register.

Example Configuration for Registering a Worker Cluster

To update the configuration file:

  1. Add a new worker configuration to the kubeslice_worker_egs array in your configuration file:

    kubeslice_worker_egs:
      - name: "worker-1" # Existing worker
        # ... existing configuration ...

      - name: "worker-2" # New worker
        use_global_kubeconfig: true # Use global kubeconfig for this worker
        kubeconfig: "" # Path to the kubeconfig file specific to the worker; if empty, uses the global kubeconfig
        kubecontext: "" # Kubecontext specific to the worker; if empty, uses the global context
        skip_installation: false # Do not skip the installation of the worker
        specific_use_local_charts: true # Override to use local charts for this worker
        namespace: "kubeslice-system" # Kubernetes namespace for this worker
        release: "egs-worker-2" # Helm release name for the worker (must be unique)
        chart: "kubeslice-worker-egs" # Helm chart name for the worker
        inline_values: # Inline Helm values for the worker chart
          global:
            imageRegistry: docker.io/aveshasystems # Docker registry for worker images
          egs:
            prometheusEndpoint: "http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090" # Prometheus endpoint
            grafanaDashboardBaseUrl: "http://<grafana-lb>/d/Oxed_c6Wz" # Grafana dashboard base URL
          egsAgent:
            secretName: egs-agent-access
            agentSecret:
              endpoint: ""
              key: ""
          metrics:
            insecure: true # Allow insecure connections for metrics
          kserve:
            enabled: true # Enable KServe for the worker
            kserve: # KServe chart options
              controller:
                gateway:
                  domain: kubeslice.com
                  ingressGateway:
                    className: "nginx" # Ingress class name for the KServe gateway
        helm_flags: "--wait --timeout 5m --debug" # Additional Helm flags for the worker installation
        verify_install: true # Verify the installation of the worker
        verify_install_timeout: 60 # Timeout for the worker installation verification (in seconds)
        skip_on_verify_fail: false # Do not skip if worker verification fails
        enable_troubleshoot: false # Enable troubleshooting mode for additional logs and checks
  2. Add a worker cluster registration configuration in the cluster_registration array in your configuration file:

    cluster_registration:
      - cluster_name: "worker-1" # Existing cluster
        project_name: "avesha" # Name of the project to associate with the cluster
        telemetry:
          enabled: true # Enable telemetry for this cluster
          endpoint: "http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090" # Telemetry endpoint
          telemetryProvider: "prometheus" # Telemetry provider (Prometheus in this case)
        geoLocation:
          cloudProvider: "" # Cloud provider for this cluster (e.g., GCP)
          cloudRegion: "" # Cloud region for this cluster (e.g., us-central1)

      - cluster_name: "worker-2" # New cluster
        project_name: "avesha" # Name of the project to associate with the cluster
        telemetry:
          enabled: true # Enable telemetry for this cluster
          endpoint: "http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090" # Telemetry endpoint
          telemetryProvider: "prometheus" # Telemetry provider (Prometheus in this case)
        geoLocation:
          cloudProvider: "" # Cloud provider for this cluster (e.g., GCP)
          cloudRegion: "" # Cloud region for this cluster (e.g., us-central1)
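Because each worker needs its own Helm release name, a copy-and-paste slip when adding a worker entry is easy to make. The following is a minimal sketch of a duplicate check: duplicate_releases is a hypothetical helper that matches release: lines as plain text instead of parsing YAML, and it flags duplicates anywhere in the file, which is stricter than the per-worker requirement.

```shell
#!/bin/sh
# Sketch: print any Helm release name that appears more than once in a file.
# duplicate_releases is a hypothetical helper, not part of the EGS scripts.
duplicate_releases() {
  # Extract the value after "release:", dropping optional quotes and comments,
  # then keep only the names that occur more than once.
  sed -n 's/^[[:space:]]*release:[[:space:]]*"\{0,1\}\([^"#[:space:]]*\).*/\1/p' "$1" |
    sort | uniq -d
}

# Example:
# duplicate_releases egs-installer-config.yaml
```

No output means every release name in the file is distinct.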

Run the Installation Script

After adding the new worker configuration, run the installation script to register an additional worker cluster:

./egs-installer.sh --input-yaml egs-installer-config.yaml

Access the Admin Portal

After a successful installation, the script displays the LoadBalancer external IP address and the admin access token to log in to the Admin Portal.

Make a note of the LoadBalancer external IP address and the admin access token required to log in to the Admin Portal. In the script output, the KubeSlice UI Proxy LoadBalancer URL value is your Admin Portal URL, and the token for project avesha (username: admin) value is your login token.

Use the URL and the admin access token from the previous step to log in to the Admin Portal.

Retrieve Admin Credentials Using kubectl

If you missed the LoadBalancer external IP address or the admin access token displayed after installation, you can retrieve them using kubectl commands.

Perform the following steps to retrieve the admin access token and the Admin Portal URL:

  1. Use the following command to retrieve the admin access token:

    kubectl get secret kubeslice-rbac-rw-admin -o jsonpath="{.data.token}" -n kubeslice-avesha | base64 --decode

    Example Output:

    eyJhbGciOiJSUzI1NiIsImtpZCI6IjE2YjY0YzYxY2E3Y2Y0Y2E4YjY0YzYxY2E3Y2Y0Y2E4YjYiLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2UtYWNjb3VudCIsImt1YmVybmV0ZXM6c2VydmljZS1hY2NvdW50Om5hbWUiOiJrdWJlc2xpY2UtcmJhYy1ydy1hZG1pbiIsImt1YmVybmV0ZXM6c2VydmljZS1hY2NvdW50OnVpZCI6Ijg3ZjhiZjBiLTU3ZTAtMTFlYS1iNmJlLTRmNzlhZTIyMWI4NyIsImt1YmVybmV0ZXM6c2VydmljZS1hY2NvdW50OnNlcnZpY2UtYWNjb3VudC51aWQiOiI4N2Y4YmYwYi01N2UwLTExZWEtYjZiZS00Zjc5YWUyMjFiODciLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZXNsaWNlLXJiYWMtcnctYWRtaW4ifQ.MEYCIQDfXoX8v7b8k7c3
    4mJpXHh3Zk5lYzVtY2Z0eXlLQAIhAJi0r5c1v6vUu8mJxYv1j6Kz3p7G9y4nU5r8yX9fX6c
  2. Use the following command to retrieve the LoadBalancer external IP:

    Example

    kubectl get svc -n kubeslice-controller | grep kubeslice-ui-proxy

    Example Output

    NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)         AGE
    kubeslice-ui-proxy   LoadBalancer   10.96.2.238   172.18.255.201   443:31751/TCP   24h

Note down the LoadBalancer external IP of the kubeslice-ui-proxy service. In the above example, 172.18.255.201 is the external IP. The EGS Portal URL will be https://<ui-proxy-ip>.
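The portal URL can also be derived from the service listing with a short script. The following is a minimal sketch: portal_url_from_svc_line is a hypothetical helper that assumes the default kubectl get svc column layout, where EXTERNAL-IP is the fourth field.

```shell
#!/bin/sh
# Sketch: turn one line of default `kubectl get svc` output into a portal URL.
# portal_url_from_svc_line is a hypothetical helper, not part of the EGS scripts.
portal_url_from_svc_line() {
  # Field 4 of the default output is EXTERNAL-IP
  ip=$(printf '%s\n' "$1" | awk '{print $4}')
  case $ip in
    ''|'<pending>'|'<none>') return 1 ;;   # no external address assigned yet
    *) printf 'https://%s\n' "$ip" ;;
  esac
}

# Example wiring (needs cluster access, so it is commented out):
# line=$(kubectl get svc kubeslice-ui-proxy -n kubeslice-controller --no-headers)
# portal_url_from_svc_line "$line"
```

The helper returns a non-zero status when the external address is still pending, so a caller can retry later.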

Upload Custom Pricing for Cloud Resources

To upload custom pricing for cloud resources, you can use the custom-pricing-upload.sh script provided in the EGS installation repository. This script allows you to upload custom pricing data for various cloud resources, which can be used for cost estimation and budgeting. Ensure that curl is installed, because it is required to upload the CSV file.

To upload custom pricing data:

  1. Navigate to the cloned egs-installation repository and change the file permission using the following command:

    chmod +x custom-pricing-upload.sh
  2. Use the customer-pricing-data.yaml file to specify the custom pricing data. The file should contain the following structure:

    kubernetes:
      kubeconfig: "" # Absolute path of the kubeconfig
      kubecontext: "" # Kubecontext name
      namespace: "kubeslice-controller"
      service: "kubetally-pricing-service"

    # You can add as many cloud providers and instance types as needed
    cloud_providers:
      - name: "gcp"
        instances:
          - region: "us-east1"
            component: "Compute Instance"
            instance_type: "a2-highgpu-2g"
            vcpu: 1
            price: 20
            gpu: 1
          - region: "us-east1"
            component: "Compute Instance"
            instance_type: "e2-standard-8"
            vcpu: 1
            price: 5
            gpu: 0
  3. Run the script to upload the custom pricing data:

    ./custom-pricing-upload.sh 

This script automates the process of loading custom cloud pricing data into the pricing API running inside a Kubernetes cluster.

Script Workflow:

  • Reads the cluster connection details (kubeconfig, context) from the YAML input file.

  • Identifies the target service and its exposed port (for example, kubetally-pricing-service:80).

  • Selects a random available local port on the host machine.

  • Establishes a port-forwarding tunnel from the selected local port to the Kubernetes service. The tunnel runs in the background so that it stays active during the upload.

  • Converts the pricing data from YAML format into CSV format for API ingestion.

  • Uploads the generated CSV file to the pricing API at:

    http://localhost:<random_port>/api/v1/prices
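The port selection and URL steps of this workflow can be sketched as follows. This is an illustration under stated assumptions, not the script's actual implementation: pick_local_port and pricing_api_url are hypothetical helpers, the 20000-29999 port range is arbitrary, and the sketch does not verify that the chosen port is free.

```shell
#!/bin/sh
# Sketch: choose a local port and build the pricing API URL.
# pick_local_port and pricing_api_url are hypothetical helpers.
pick_local_port() {
  # Pseudo-random port in 20000-29999; does not check that the port is free
  awk 'BEGIN { srand(); print int(20000 + rand() * 10000) }'
}

pricing_api_url() {
  # $1 = local port that is forwarded to the pricing service
  printf 'http://localhost:%s/api/v1/prices\n' "$1"
}

# Example wiring (needs cluster access, so it is commented out). Service name,
# namespace, and port 80 follow the examples in this topic.
# port=$(pick_local_port)
# kubectl port-forward svc/kubetally-pricing-service "$port:80" -n kubeslice-controller &
# pf_pid=$!
# ... upload the generated CSV to "$(pricing_api_url "$port")" with curl ...
# kill "$pf_pid"
```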

Uninstall EGS

The uninstallation script removes all resources associated with EGS, including:

  • Workspaces
  • GPU Provision Requests (GPRs)
  • All custom resources provisioned by EGS
warning

Before running the uninstallation script, ensure that you have backed up any important data or configurations. The script will remove all EGS-related resources, and this action cannot be undone.

Use the following command to uninstall EGS:

./egs-uninstall.sh --input-yaml egs-installer-config.yaml
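Before running the uninstall command, the backup recommended in the warning could start with the configuration file itself. The following is a minimal sketch: backup_config is a hypothetical helper, the timestamp suffix is an arbitrary naming choice, and the commented helm command assumes the default controller release name and namespace from the configuration shown earlier.

```shell
#!/bin/sh
# Sketch: keep a timestamped copy of a configuration file.
# backup_config is a hypothetical helper, not part of the EGS scripts.
backup_config() {
  src=$1
  dest="$src.bak.$(date +%Y%m%d%H%M%S)"
  # Print the path of the copy only when the copy succeeded
  cp -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Example wiring (commented out because the helm command needs cluster access):
# backup_config egs-installer-config.yaml
# helm get values egs-controller -n kubeslice-controller -o yaml > controller-values.yaml
```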

Troubleshooting

  • Missing Binaries

    Ensure all required binaries are installed and available in your system’s PATH.

  • Cluster Access Issues

    Verify that your kubeconfig files are correctly configured so the script can access the clusters defined in the YAML configuration.

  • Timeout Issues

    If a component fails to install within the specified timeout, increase the verify_install_timeout value in the YAML file.
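For the missing binaries case, a short check can report everything that is absent in one pass. The following is a minimal sketch: require_bins is a hypothetical helper, and the binaries in the usage line (kubectl, helm, curl) are only the tools mentioned in this topic, so the full list the scripts need is an assumption.

```shell
#!/bin/sh
# Sketch: report every listed binary that is not available on PATH.
# require_bins is a hypothetical helper, not part of the EGS scripts.
require_bins() {
  missing=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      printf 'missing binary: %s\n' "$bin" >&2
      missing=1
    fi
  done
  return "$missing"   # 0 when everything was found, 1 otherwise
}

# Example:
# require_bins kubectl helm curl || exit 1
```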