Install EGS Worker Prerequisites
This topic describes the prerequisites to install Elastic GPU Service (EGS) Worker on a Kubernetes cluster.
The EGS worker requires the following components:
- NVIDIA GPU Operator for GPU management and monitoring
- Kube-Prometheus-Stack for metrics collection and visualization
- Proper monitoring configuration to scrape GPU metrics from GPU Operator components
- GPU-enabled nodes with NVIDIA drivers
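To see whether your nodes already expose GPU resources to Kubernetes, you can run a quick check such as the one below. This is a sketch only; the nvidia.com/gpu resource appears only after the NVIDIA device plugin is running, so on a fresh cluster it may be null even though GPUs and drivers are present:

# List each node with the number of allocatable NVIDIA GPUs
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'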
Installation Options
You can install the prerequisites using either of the following methods:
- Use the installation script provided in the egs-installation repository.
- Install the components manually using Helm charts.
The following steps for NVIDIA GPU Operator installation are provided for reference only. For a production system, always refer to NVIDIA's latest GPU Operator installation instructions: NVIDIA GPU Operator Getting Started Guide.
Install GPU Operator and Prometheus Stack Using the Script
You can use the egs-install-prerequisites.sh script to install GPU Operator and Prometheus stack on the worker cluster.
Modify the egs-installer-config.yaml file to add the GPU Operator and Prometheus parameters.
The following is an example configuration YAML file to install the GPU Operator and Prometheus stack using the script:
# Enable additional applications installation
enable_install_additional_apps: true

# Enable custom applications
enable_custom_apps: true

# Command execution settings
run_commands: false

# Additional applications configuration
additional_apps:
  - name: "gpu-operator"
    skip_installation: false
    use_global_kubeconfig: true
    namespace: "egs-gpu-operator"
    release: "gpu-operator"
    chart: "gpu-operator"
    repo_url: "https://helm.ngc.nvidia.com/nvidia"
    version: "v24.9.1"
    specific_use_local_charts: true
    inline_values:
      hostPaths:
        driverInstallDir: "/home/kubernetes/bin/nvidia"
      toolkit:
        installDir: "/home/kubernetes/bin/nvidia"
      cdi:
        enabled: true
        default: true
      driver:
        enabled: false
    helm_flags: "--debug"
    verify_install: false
    verify_install_timeout: 600
    skip_on_verify_fail: true
    enable_troubleshoot: false

  - name: "prometheus"
    skip_installation: false
    use_global_kubeconfig: true
    namespace: "egs-monitoring"
    release: "prometheus"
    chart: "kube-prometheus-stack"
    repo_url: "https://prometheus-community.github.io/helm-charts"
    version: "v45.0.0"
    specific_use_local_charts: true
    inline_values:
      prometheus:
        service:
          type: ClusterIP
        prometheusSpec:
          storageSpec: {}
          additionalScrapeConfigs:
            - job_name: tgi
              kubernetes_sd_configs:
                - role: endpoints
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: pod_name
                - source_labels: [__meta_kubernetes_pod_container_name]
                  target_label: container_name
            - job_name: gpu-metrics
              scrape_interval: 1s
              metrics_path: /metrics
              scheme: http
              kubernetes_sd_configs:
                - role: endpoints
                  namespaces:
                    names:
                      - egs-gpu-operator
              relabel_configs:
                - source_labels: [__meta_kubernetes_endpoints_name]
                  action: drop
                  regex: .*-node-feature-discovery-master
                - source_labels: [__meta_kubernetes_pod_node_name]
                  action: replace
                  target_label: kubernetes_node
      grafana:
        enabled: true
        grafana.ini:
          auth:
            disable_login_form: true
            disable_signout_menu: true
          auth.anonymous:
            enabled: true
            org_role: Viewer
        service:
          type: ClusterIP
        persistence:
          enabled: false
          size: 1Gi
    helm_flags: "--debug"
    verify_install: false
    verify_install_timeout: 600
    skip_on_verify_fail: true
    enable_troubleshoot: false
To apply the configuration, use the following command:
./egs-install-prerequisites.sh --input-yaml egs-installer-config.yaml
- The script installs the GPU Operator (v24.9.1) in the egs-gpu-operator namespace and the Prometheus stack (v45.0.0) in the egs-monitoring namespace.
- Access Grafana at http://localhost:3000. The default username is admin and the password is prom-operator (see the port-forward example after these notes).
- The NVIDIA dashboard ID is 12239.
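The Grafana service created by this configuration is of type ClusterIP, so it is not directly reachable from outside the cluster. A minimal way to reach it is a port-forward; the service name below assumes the release name prometheus in the egs-monitoring namespace, as configured above:

# Forward the Grafana service to localhost:3000
kubectl port-forward svc/prometheus-grafana 3000:80 -n egs-monitoring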
Install GPU Operator Manually
The NVIDIA GPU Operator is essential for managing GPU resources and exposing GPU metrics that EGS Worker needs for GPU slicing operations. For existing infrastructure, you can manually install the GPU Operator.
Prerequisites
Before installing the GPU Operator, ensure the following prerequisites are met:
- Container Runtime: Nodes must be configured with a container engine such as CRI-O or containerd.

- Operating System: All worker nodes running GPU workloads must run the same OS version.

- Pod Security: If you are using Pod Security Admission (PSA), label the namespace for privileged access:

kubectl create ns egs-gpu-operator
kubectl label --overwrite ns egs-gpu-operator pod-security.kubernetes.io/enforce=privileged

- Node Feature Discovery: Check if NFD is already running:

kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

If the output is true, NFD is already running and should be disabled during the GPU Operator installation.

- GPU Node Labeling: Label GPU nodes to enable the GPU Operator operands:

kubectl label node <gpu-node-name> nvidia.com/gpu.deploy.operands=true

Replace <gpu-node-name> with the actual name of your GPU-enabled node.

- NVIDIA Driver Installation: It is strongly recommended to follow the official NVIDIA driver installation documentation for your specific platform and operating system. The GPU Operator can manage drivers, but pre-installing drivers following NVIDIA's official guidelines ensures optimal compatibility and performance. A quick way to spot-check a node for an existing driver is shown after this list.
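If you choose to pre-install drivers, one way to spot-check a node is to run nvidia-smi against the host from a debug pod. This is a sketch only: it assumes your cluster supports kubectl debug for nodes and that the driver binaries are on the node's standard path, which can differ on managed platforms.

# Launch an ephemeral debug pod on the GPU node and run nvidia-smi against the host filesystem
kubectl debug node/<gpu-node-name> -it --image=ubuntu -- chroot /host nvidia-smi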
Install the GPU Operator
To install the GPU Operator:
- Add the Helm repository using the following commands:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

- Create the egs-gpu-operator namespace using the following command:

kubectl create namespace egs-gpu-operator

- Install the GPU Operator using the following command:

helm install --wait --generate-name \
  -n egs-gpu-operator \
  nvidia/gpu-operator \
  --version=v25.3.4 \
  --set hostPaths.driverInstallDir="/home/kubernetes/bin/nvidia" \
  --set toolkit.installDir="/home/kubernetes/bin/nvidia" \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set driver.enabled=false

- Verify the installation using the following commands:

# Check if all the GPU Operator pods are running
kubectl get pods -n egs-gpu-operator

# Check the GPU Operator components
kubectl get daemonset -n egs-gpu-operator
kubectl get deployment -n egs-gpu-operator

# Verify that GPU nodes are labeled
kubectl get nodes --show-labels | grep nvidia.com/gpu

# Check if NVIDIA drivers are installed (applies only when the GPU Operator manages the driver; the example above sets driver.enabled=false)
kubectl get pods -n egs-gpu-operator -l app=nvidia-driver-daemonset

For any issues during GPU Operator installation, refer to the NVIDIA GPU Operator Getting Started Guide.
The GPU Operator installs the following components that expose metrics:
- NVIDIA Driver DaemonSet: Manages GPU drivers on nodes
- NVIDIA Device Plugin: Exposes GPU resources to Kubernetes
- Node Feature Discovery: Labels nodes with GPU capabilities
- DCGM Exporter: Exposes GPU metrics (if enabled)
- GPU Feature Discovery: Discovers GPU features and capabilities
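After the operator components are running, you can confirm that GPU metrics are actually being emitted by querying the DCGM exporter directly. This is a minimal sketch that assumes the default service name nvidia-dcgm-exporter listening on port 9400 in the egs-gpu-operator namespace:

# Forward the DCGM exporter service locally
kubectl port-forward -n egs-gpu-operator svc/nvidia-dcgm-exporter 9400:9400

# In a second terminal, look for a known DCGM metric
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL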
Install Kube Prometheus Stack Manually
The kube-prometheus-stack provides comprehensive monitoring capabilities for the EGS Worker cluster.
To install the Prometheus stack:

- Add the Helm repository using the following commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

- Install kube-prometheus-stack with the GPU metrics configuration. Use the following example configuration to create a gpu-monitoring-values.yaml file:

# gpu-monitoring-values.yaml
prometheus:
  service:
    type: ClusterIP              # Service type for Prometheus
  prometheusSpec:
    storageSpec: {}              # Placeholder for storage configuration
    additionalScrapeConfigs:
      - job_name: tgi
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod_name
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container_name
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - egs-gpu-operator
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: drop
            regex: .*-node-feature-discovery-master
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node
grafana:
  enabled: true                  # Enable Grafana
  grafana.ini:
    auth:
      disable_login_form: true
      disable_signout_menu: true
    auth.anonymous:
      enabled: true
      org_role: Viewer
  service:
    type: ClusterIP              # Service type for Grafana
  persistence:
    enabled: false               # Disable persistence
    size: 1Gi                    # Default persistence size

- Create the egs-monitoring namespace using the following command:

kubectl create namespace egs-monitoring

- Apply the values file using the following command:
# Install kube-prometheus-stack with GPU metrics configuration
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace egs-monitoring \
--values gpu-monitoring-values.yaml \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
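Before creating the ServiceMonitor and PodMonitor objects later in this topic, you can confirm that the chart installed the Prometheus Operator monitoring CRDs. This is a simple check using standard CRD names:

# The servicemonitors and podmonitors CRDs come from the Prometheus Operator bundled in the chart
kubectl get crd servicemonitors.monitoring.coreos.com podmonitors.monitoring.coreos.com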
Verify the Installation
Verify the installation using the following commands:
# Check if all monitoring pods are running
kubectl get pods -n egs-monitoring
# Check Prometheus service
kubectl get svc -n egs-monitoring | grep prometheus
# Verify additional scrape configs are loaded
kubectl port-forward svc/prometheus-operated 9090:9090 -n egs-monitoring
# Visit http://localhost:9090/config to verify gpu-metrics job is configured
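Alternatively, with the port-forward from the previous step still active, you can check for the gpu-metrics scrape job from the command line using the standard Prometheus HTTP API (a sketch; it assumes the port-forward on localhost:9090 shown above):

# Prints gpu-metrics if the additional scrape config was loaded
curl -s http://localhost:9090/api/v1/status/config | grep -o 'gpu-metrics' | head -1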
GPU Metrics Monitoring Configuration
The GPU Operator exposes metrics on several endpoints that need to be monitored:
- DCGM Exporter: GPU performance and health metrics
- NVIDIA Device Plugin: GPU resource allocation metrics
- Node Feature Discovery: GPU capability labels
- GPU Feature Discovery: GPU feature metrics
Create a ServiceMonitor to scrape GPU metrics from the DCGM exporter. Use the following example configuration to create a gpu-servicemonitor.yaml file:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-metrics-monitor
  namespace: egs-monitoring
  labels:
    app.kubernetes.io/instance: kube-prometheus-stack
    release: prometheus
spec:
  endpoints:
    - interval: 30s
      port: metrics
      path: /metrics
      scrapeTimeout: 10s
      scheme: http
  namespaceSelector:
    matchNames:
      - egs-gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
To apply the configuration, use the following command:
kubectl apply -f gpu-servicemonitor.yaml
To verify that the ServiceMonitor is created, use the following command:
kubectl get servicemonitor -n egs-monitoring
Example Output
NAME AGE
kubeslice-controller-manager-monitor 38m
prometheus-grafana 40m
prometheus-kube-prometheus-alertmanager 40m
prometheus-kube-prometheus-apiserver 40m
prometheus-kube-prometheus-coredns 40m
prometheus-kube-prometheus-kube-controller-manager 40m
prometheus-kube-prometheus-kube-etcd 40m
prometheus-kube-prometheus-kube-proxy 40m
prometheus-kube-prometheus-kube-scheduler 40m
prometheus-kube-prometheus-kubelet 40m
prometheus-kube-prometheus-operator 40m
prometheus-kube-prometheus-prometheus 40m
prometheus-kube-state-metrics 40m
prometheus-prometheus-node-exporter 40m
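To confirm that Prometheus has turned the ServiceMonitor into an active scrape target, you can also query the targets API. This is a sketch that assumes the prometheus-operated port-forward on localhost:9090 shown earlier and that jq is installed:

# Show the health and scrape URL of targets discovered through the gpu-metrics-monitor ServiceMonitor
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool | test("gpu-metrics-monitor")) | {scrapeUrl, health}'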
Pod Monitor Configuration
Create a Pod Monitor for direct pod metrics collection. Use the following example configuration to create a gpu-podmonitor.yaml file:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gpu-pod-metrics-monitor
  namespace: egs-monitoring
  labels:
    app.kubernetes.io/instance: kube-prometheus-stack
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
      - egs-gpu-operator
  podMetricsEndpoints:
    - interval: 30s
      port: "9400"
      path: /metrics
      scrapeTimeout: 10s
      scheme: http
To apply the configuration, use the following command:
kubectl apply -f gpu-podmonitor.yaml
To verify that the PodMonitor is created, use the following command:
kubectl get podmonitor -n egs-monitoring
GPU Metrics Dashboard
To import a GPU monitoring dashboard into Grafana, first port-forward the Grafana service using the following command:
# Port forward to Grafana
kubectl port-forward svc/prometheus-grafana 3000:80 -n egs-monitoring
Then access Grafana at http://localhost:3000 and import dashboard ID 14574 (NVIDIA GPU Exporter Dashboard).
Verify the Deployment
Verify the GPU Operator Installation
# Check GPU Operator status
kubectl get pods -n egs-gpu-operator
# Verify GPU resources are available
kubectl get nodes -o json | jq '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | .metadata.name'
# Check GPU device plugin
kubectl get pods -n egs-gpu-operator -l app=nvidia-device-plugin-daemonset
# Test GPU allocation
kubectl run gpu-test --rm -it --restart=Never \
--image=nvcr.io/nvidia/cuda:12.6.3-base-ubuntu22.04 \
--overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvcr.io/nvidia/cuda:12.6.3-base-ubuntu22.04","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
-- nvidia-smi
Verify the Prometheus Configuration
# Check if GPU metrics job is configured
kubectl port-forward svc/prometheus-operated 9090:9090 -n egs-monitoring
# Visit http://localhost:9090/targets and look for gpu-metrics job
# Check if GPU metrics are being scraped
# Visit http://localhost:9090/graph and query: up{job="gpu-metrics"}
# For more comprehensive checks, see the GPU metrics verification steps in the next section
Verify the GPU Metrics Collection
# Check if GPU metrics are available
kubectl port-forward svc/prometheus-operated 9090:9090 -n egs-monitoring
# Query GPU metrics in Prometheus:
# - GPU Utilization: DCGM_FI_DEV_GPU_UTIL{job="gpu-metrics"}
# - GPU Memory Usage: DCGM_FI_DEV_FB_USED{job="gpu-metrics"}
# - GPU Temperature: DCGM_FI_DEV_GPU_TEMP{job="gpu-metrics"}
Verify the GPU Worker Readiness
# Check if EGS Worker can access GPU resources
kubectl get nodes --show-labels | grep nvidia.com/gpu