Version: 1.15.0

Register Worker Clusters

In a single-cluster or multi-cluster deployment, the installation script automates the registration of the worker cluster with the EGS Controller. During this process, the Slice Operator is installed on the worker cluster, and the necessary configurations are applied.

In a multi-cluster deployment, the EGS Controller is installed on the controller cluster, while the Slice Operator is installed on each worker cluster. The EGS Controller manages the worker clusters and their resources, including the GPU resources installed on each worker cluster.

warning

Limit the cluster name and workspace name to 15 characters or fewer, as exceeding the limit results in a service export error.
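
The length limit in the warning above can be checked before you type a name into the Admin Portal. The following is a minimal sketch, not part of EGS: the function name `check_name_length` and the variable `MAX_NAME_LENGTH` are hypothetical, and the empty-name check is an added assumption.

```shell
# Maximum length stated in the warning above.
MAX_NAME_LENGTH=15

# check_name_length NAME (hypothetical helper, not an EGS command)
# Returns 0 if NAME is non-empty and at most MAX_NAME_LENGTH characters long.
# Otherwise prints a message to stderr and returns 1.
check_name_length() {
  name_to_check=$1
  if [ -z "$name_to_check" ]; then
    echo "name must not be empty" >&2
    return 1
  fi
  if [ "${#name_to_check}" -gt "$MAX_NAME_LENGTH" ]; then
    echo "'$name_to_check' is ${#name_to_check} characters; the limit is $MAX_NAME_LENGTH" >&2
    return 1
  fi
  return 0
}
```

For example, `check_name_length worker-1` succeeds, while a 21-character name such as `this-name-is-too-long` is rejected.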

Register a Worker Cluster

The Admin can register additional worker clusters with the EGS Controller from the Admin Portal to manage the GPU resources on those clusters.

You can register a worker cluster in the following ways:

  • Automated Method: Upload the kubeconfig file of the worker cluster to register it with the EGS Controller. The Slice Operator is automatically installed on the worker cluster.

  • Manual Method: Enter the cluster name, cloud name, and the cluster Kube API endpoint parameters during cluster registration. The Slice Operator is not automatically installed on the worker cluster. You must install the Slice Operator using the values file that you download during cluster registration.

Prerequisites

  • Ensure that the worker cluster is up and running and is reachable from the controller cluster.

  • Ensure that the worker cluster has the required Kubernetes version (1.20 or later).

  • Ensure you have the kubeconfig file of the worker cluster that you want to register.

  • Ensure you have the EGS endpoint URL and the Admin token. The EGS Agent, which handles Auto GPR Create, Read, Update, and Delete (CRUD) operations, needs both values.

    • Use the following command to access the Portal URL (endpoint):

      Example

      kubectl get svc -n kubeslice-controller | grep kubeslice-ui-proxy

      Example Output

      NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)         AGE
      kubeslice-ui-proxy   LoadBalancer   10.96.2.238   172.18.255.201   443:31751/TCP   24h

      Note down the LoadBalancer external IP of the kubeslice-ui-proxy service. In the above example, 172.18.255.201 is the external IP. The EGS Portal URL is https://<ui-proxy-ip>.

    • Use the following command to get the Admin token:

      kubectl --kubeconfig <KUBECONFIG> --context <KUBECONTEXT> -n kubeslice-avesha describe secret kubeslice-rbac-rw-admin | tail -n 1

      Example

      kubectl --kubeconfig kubeconfig.yaml --context context-cs24taiewqa -n kubeslice-avesha describe secret kubeslice-rbac-rw-admin | tail -n 1

      Example Output

      token:      xxXXXXciOiJSUzI1NiIsImtpZCI6IjYwaGl4V2RMVGhjcHB0ZXpjSHJaQWtycVhRNkdkQ3dwc3lRbHN4SEJ5N3MifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlc2xpY2UtYXZlc2hhIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Imt1YmVzbGljZS1yYmFjLXJ3LWFkbWluIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6Imt1YmVzbGljZS1yYmFjLXJ3LWFkbWluIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYjZmZTRiMGYtYTc1Ny00MjBmLTg4NDEtMTVhY2I3ZjBhYzA2Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmVzbGljZS1hdmVzaGE6a3ViZXNsaWNlLXJiYWMtcnctYWRtaW4ifQ.W6-B9Ly0cf2aN5b1gSmgPoFdh6bK4SeWBfk3A0eNLtddv3jOBHF-66W_9rzN2AoNZYjqRz5hRxBRCfarA0dmP5xKc2JiKCRrbDmCjW4z_88lc4dlyUsVv9B-q_1Cq25q21OXWoQ4HurnCWijL7di8U2HL1d9ur_ke466gYsOAm3Y8Zf8_psuLRKIYMN5J95t6AkOy5INjH0UQm0TfaposyzPD0lw9vVCKJQJKJaT76MLZPy0ZqFVHLIBfwWk2UwE1hxdJyoXENpPwOnFvjIOaaOIm-bnfZVX4C9zSWovfxHo5xHEszz-hWAI2ROkCcOeZDgP4hxxxxxxxxx-XX
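
The prerequisite checks above can be scripted. The following is a minimal sketch that only post-processes text you already have. The helper names `version_at_least`, `portal_url_from_svc`, and `token_from_describe` are hypothetical and not part of EGS or kubectl. The sketch assumes the column layout and the `token:` line shown in the example outputs above.

```shell
# version_at_least VERSION MIN_MAJOR MIN_MINOR (hypothetical helper)
# Succeeds if VERSION (for example "v1.27.3" or "1.20") is MIN_MAJOR.MIN_MINOR or later.
version_at_least() {
  ver=${1#v}                # drop an optional leading "v"
  major=${ver%%.*}          # text before the first dot
  rest=${ver#*.}            # text after the first dot
  minor=${rest%%[!0-9]*}    # leading digits of the minor version
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# portal_url_from_svc (hypothetical helper)
# Reads `kubectl get svc` output on stdin and prints https://<EXTERNAL-IP> for the
# kubeslice-ui-proxy row. EXTERNAL-IP is assumed to be the fourth column.
# Placeholder values such as <pending> are skipped.
portal_url_from_svc() {
  awk '$1 == "kubeslice-ui-proxy" && $4 !~ /^</ { print "https://" $4 }'
}

# token_from_describe (hypothetical helper)
# Reads the "token:      <value>" line on stdin and prints only the token value.
token_from_describe() {
  awk '$1 == "token:" { print $2 }'
}
```

To use the helpers, pipe the output of the two kubectl commands shown above into `portal_url_from_svc` and `token_from_describe`. Pass the worker cluster's Kubernetes version string to `version_at_least` with `1 20` as the minimum.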

Automated Method

To register a new worker cluster with the EGS Controller using the automated method:

  1. Go to k8s Clusters on the left sidebar.

  2. On the Clusters page, click the Add Cluster button on the top-right corner.

  3. On the Register Cluster pane, select the Automated mode.

  4. Click Next to add the cluster details.

  5. In the Add KubeConfig file section, enter the following information:

    • Enter the name of the cluster in the Name of the Cluster text box.

    • Drag and drop the kubeconfig file or Click here to upload the kubeconfig file.

    • The Enable auto eviction toggle is turned off by default. Turn it on to automatically evict low-priority GPRs. When you enable auto eviction at the cluster level, it applies to all GPRs that you create on the cluster.
  6. (Optional) Click Show advanced options and enter the following information:

    • Under EGS Agent:
      • Enter the EGS endpoint URL in the EGS_ENDPOINT text box.
      • Enter the Admin token in the EGS_API_KEY text box. For EGS agent functionality, you must enter the URL and admin token.
    • Enter the Node IP of the worker cluster in the Node IP text box. The Node IP is used to identify the worker cluster and is required for Slice Operator installation. If you do not enter the Node IP, it will be detected automatically during cluster registration.
    • Enter the URL of Prometheus that is installed on your cluster in the Prometheus URL text box.
    • Enter the URL of Grafana that is installed on your cluster in the Grafana URL text box.

  7. Click the Import Cluster button to register a cluster.

    The status of the cluster changes from In progress to Registered after all the Slice Operator components are up and running. You can view the progress of the cluster registration by clicking the logs (file) icon.

Manual Method

You can register a worker cluster manually by entering the cluster details, such as the cluster name, cloud name, and the cluster Kube API endpoint. The Slice Operator is not automatically installed on the worker cluster. You must install the Slice Operator using the values file that you download during cluster registration.

To register a new worker cluster with the EGS Controller using the manual method:

  1. Go to k8s Clusters on the left sidebar.

  2. On the Clusters page, click the Add Cluster button, and then select the Manual mode on the Register Cluster pane.

  3. Click Next to add the cluster details.

  4. In the Add Cluster Details section, enter the following information:

    • Select the cloud from the Name of the Cloud drop-down list. The saved value is immutable.

    • Enter a name for a worker cluster in the Name of the cluster text box. The saved value is immutable.

    • Enter the control plane's kube-apiserver endpoint of the controller cluster in the Cluster Kube API Endpoint text box. Run this command on the cluster to get the endpoint: kubectl cluster-info.

    • The Enable auto eviction toggle is turned off by default. Turn it on to automatically evict low-priority GPRs.

    info

    You can skip the optional advanced settings in Step 5 and proceed directly to Download the Slice Operator Values File to generate the values file.

  5. (Optional) Click Show advanced options and enter the following information:

    • Under EGS Agent:
      • Enter the EGS endpoint URL in the EGS_ENDPOINT text box.
      • Enter the Admin token in the EGS_API_KEY text box.
    • Enter the Node IP of the worker cluster in the Node IP text box. The Node IP is used to identify the worker cluster and is required for Slice Operator installation. If you do not enter the Node IP, it will be detected automatically during cluster registration.
    • Enter the URL of Prometheus that is installed on your cluster in the Prometheus URL text box.
    • Enter the URL of Grafana that is installed on your cluster in the Grafana URL text box.
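
The value for the Cluster Kube API Endpoint text box comes from the `kubectl cluster-info` output. The following is a minimal sketch for extracting it. The helper name `api_endpoint_from_cluster_info` is hypothetical, and the sketch assumes that the first URL in the output is the control plane endpoint. It strips ANSI color codes in case the output contains them.

```shell
# api_endpoint_from_cluster_info (hypothetical helper)
# Reads `kubectl cluster-info` output on stdin and prints the first URL found,
# assumed to be the control plane (kube-apiserver) endpoint.
api_endpoint_from_cluster_info() {
  esc=$(printf '\033')                      # literal ESC character
  sed "s/${esc}\[[0-9;]*m//g" |             # remove ANSI color sequences
    grep -Eo 'https?://[^[:space:]]+' |     # keep only URLs
    head -n 1                               # the first URL
}
```

To use it, pipe the output of `kubectl cluster-info` into `api_endpoint_from_cluster_info`, and then paste the printed URL into the text box.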

Download the Slice Operator Values File

  1. Click Generate Credentials to generate the values file. The values file is downloaded automatically. Save the file for later use.

    note

    The values file contains the worker secrets from the controller cluster, and the file is created with the cluster name that you entered in the Add Cluster Details section.

    The following is an example values file:

    controllerSecret:
      namespace:
      endpoint:
      ca.crt:
      token:
    cluster:
      name:
      endpoint:
    egsAgent:
      agentSecret:
        endpoint: <kubeslice manager endpoint url>
        key: <kubeslice manager access token>
    egs:
      prometheusEndpoint:
      grafanaDashboardBaseUrl:
    metrics:
      insecure: false
    kserve:
      enabled: true
      kserve:
        controller:
          gateway:
            domain: ""
            ingressGateway:
              className: nginx
    global:
      imageRegistry: harbor.saas1.smart-scaler.io/avesha/aveshasystems
    # imagePullSecrets: # Provide the secrets if the registry requires imagePullSecrets
    #   repository: https://index.docker.io/v1/
    #   username: ""
    #   password: ""
    #   email: ""
  2. (Optional) In the Cluster Registration Procedure section, click the download link if the values file does not download automatically.

  3. Copy the Helm command to install the Slice Operator and click Done.
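
Before you run the Helm command, a quick check can confirm that the downloaded values file is present and is not truncated. The following is a minimal sketch. The helper name `check_values_file` is hypothetical, and the choice of `controllerSecret` and `cluster` as the keys to look for is an assumption based on the example values file above.

```shell
# check_values_file FILE (hypothetical helper, not an EGS command)
# Succeeds if FILE exists and contains each expected top-level key at column 0.
# The key list is an assumption taken from the example values file.
check_values_file() {
  values_file=$1
  if [ ! -f "$values_file" ]; then
    echo "values file not found: $values_file" >&2
    return 1
  fi
  status=0
  for key in controllerSecret cluster; do
    if ! grep -Eq "^${key}:" "$values_file"; then
      echo "missing top-level key: $key" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_values_file worker-2-secret.yaml` returns a non-zero status and names the missing key if either section is absent.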

Registration Status

The worker cluster's status on the Clusters page will be Awaiting User Action until the Slice Operator is installed on it. The status changes to Registered after you install the Slice Operator on the worker cluster, which shows that the cluster has been successfully registered.

Install the Slice Operator

You must install the Slice Operator on the worker cluster to register it with the EGS Controller. Install the Slice Operator using the values (secrets) file that you downloaded in Download the Slice Operator Values File.

To install the Slice Operator:

  1. Switch the context to the worker cluster using the following command:

    kubectx <cluster name>
  2. Run the Helm command that you copied in the last step of Download the Slice Operator Values File.

    Example

    helm upgrade -i egs-worker kubeslice-egs/kubeslice-worker-egs --namespace kubeslice-system --create-namespace -f worker-2-secret.yaml
  3. Wait for the installation to complete. The installation might take a few minutes, depending on the cluster resources.

Validate the Installation

To validate the Slice Operator installation on a cluster, check the status of the pods that belong to the kubeslice-system namespace.

Use the following command to check if the pods are running:

kubectl get pods -n kubeslice-system

Example Output

NAME                                         READY   STATUS      RESTARTS   AGE
forwarder-kernel-94c8q                       1/1     Running     0          8h
kubeslice-dns-679966fd4c-4ppdb               1/1     Running     0          8h
kubeslice-netop-plz52                        1/1     Running     0          8h
kubeslice-operator-77fc84cb54-9j2jm          2/2     Running     0          4h36m
nsm-admission-webhook-k8s-864c87f5d4-cqlxn   1/1     Running     0          8h
nsm-install-crds-lbvrx                       0/1     Completed   0          2m35s
nsmgr-zqzzg                                  2/2     Running     0          8h
registry-k8s-84f468f675-g9hzg                1/1     Running     0          8h
spire-install-clusterid-cr-488p6             0/1     Completed   0          2m21s
spire-install-crds-dcm75                     0/1     Completed   0          2m28s

The status changes to Registered after all the Slice Operator components are up and running.
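
The pod check above can be automated by filtering the `kubectl get pods` output. The following is a minimal sketch. The helper name `unhealthy_pods` is hypothetical, and it assumes the column layout shown in the example output, with STATUS in the third column. It treats Running and Completed as healthy and does not inspect the READY column.

```shell
# unhealthy_pods (hypothetical helper)
# Reads `kubectl get pods` output on stdin and prints the name and status of
# every pod whose STATUS is neither Running nor Completed.
# Returns 1 if at least one such pod is found, otherwise 0.
unhealthy_pods() {
  awk '$1 != "NAME" && $3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}
```

To use it, pipe the output of `kubectl get pods -n kubeslice-system` into `unhealthy_pods`. No output and a zero exit status mean that every listed pod is Running or Completed.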

If the Node IP is not detected during cluster registration, the Clusters page displays an error icon for that cluster. You can enter the correct Node IP by editing the cluster.

Edit a Cluster

To edit a cluster:

  1. Go to k8s Clusters on the left sidebar.

  2. On the Clusters page, click the edit icon for the cluster to change any configuration.

    info

    The names of the cluster and the cloud are immutable.

  3. Update the values. You can only edit the Cluster Kube API Endpoint and the Node IP under advanced options.

  4. Click Edit Cluster to save the settings.

Detach a Worker Cluster

To detach a worker cluster from a workspace:

  1. Go to Workspaces on the left sidebar.

  2. Click the > icon at the right for the workspace from which you want to detach a cluster.

  3. Click the edit icon at the right.

  4. Click the Edit Workspace button.

  5. In the Connect Clusters tab, under Workspace Clusters, click the minus icon for the cluster you want to detach.

  6. Enter DETACH, and then click the Detach Cluster button.

    note

    Detaching a cluster from a workspace might take some time, depending on the underlying resources.

Deregister a Cluster

warning

You must detach a cluster from its connected workspaces before deregistering or deleting it.

To delete or deregister a worker cluster:

  1. Go to k8s Clusters on the left sidebar.

  2. On the Clusters page, click the delete icon for the cluster that you want to delete.

  3. Enter DELETE to confirm, and then click the Delete Cluster button.