Version: 1.15.0

Prerequisites to Install EGS Worker

This topic describes the prerequisites to install Elastic GPU Service (EGS) Worker on a Kubernetes cluster.

The EGS worker requires the following components:

  • NVIDIA GPU Operator for GPU management and monitoring
  • Kube-Prometheus-Stack for metrics collection and visualization
  • Proper monitoring configuration to scrape GPU metrics from GPU Operator components
  • GPU-enabled nodes with NVIDIA drivers

Install GPU Operator and Prometheus Stack Using the Script

You can use the egs-install-prerequisites.sh script to install GPU Operator and Prometheus stack on the worker cluster. Modify the egs-installer-config.yaml file to add the GPU Operator and Prometheus parameters.

The following is an example configuration YAML file to install the GPU Operator and the Prometheus stack using the script:

# Enable additional applications installation
enable_install_additional_apps: true

# Enable custom applications
enable_custom_apps: true

# Command execution settings
run_commands: false

# Additional applications configuration
additional_apps:
  - name: "gpu-operator"
    skip_installation: false
    use_global_kubeconfig: true
    namespace: "egs-gpu-operator"
    release: "gpu-operator"
    chart: "gpu-operator"
    repo_url: "https://helm.ngc.nvidia.com/nvidia"
    version: "v24.9.1"
    specific_use_local_charts: true
    inline_values:
      hostPaths:
        driverInstallDir: "/home/kubernetes/bin/nvidia"
      toolkit:
        installDir: "/home/kubernetes/bin/nvidia"
      cdi:
        enabled: true
        default: true
      driver:
        enabled: false
    helm_flags: "--debug"
    verify_install: false
    verify_install_timeout: 600
    skip_on_verify_fail: true
    enable_troubleshoot: false

  - name: "prometheus"
    skip_installation: false
    use_global_kubeconfig: true
    namespace: "egs-monitoring"
    release: "prometheus"
    chart: "kube-prometheus-stack"
    repo_url: "https://prometheus-community.github.io/helm-charts"
    version: "v45.0.0"
    specific_use_local_charts: true
    inline_values:
      prometheus:
        service:
          type: ClusterIP
        prometheusSpec:
          storageSpec: {}
          additionalScrapeConfigs:
            - job_name: tgi
              kubernetes_sd_configs:
                - role: endpoints
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: pod_name
                - source_labels: [__meta_kubernetes_pod_container_name]
                  target_label: container_name
            - job_name: gpu-metrics
              scrape_interval: 1s
              metrics_path: /metrics
              scheme: http
              kubernetes_sd_configs:
                - role: endpoints
                  namespaces:
                    names:
                      - egs-gpu-operator
              relabel_configs:
                - source_labels: [__meta_kubernetes_endpoints_name]
                  action: drop
                  regex: .*-node-feature-discovery-master
                - source_labels: [__meta_kubernetes_pod_node_name]
                  action: replace
                  target_label: kubernetes_node
      grafana:
        enabled: true
        grafana.ini:
          auth:
            disable_login_form: true
            disable_signout_menu: true
          auth.anonymous:
            enabled: true
            org_role: Viewer
        service:
          type: ClusterIP
        persistence:
          enabled: false
          size: 1Gi
    helm_flags: "--debug"
    verify_install: false
    verify_install_timeout: 600
    skip_on_verify_fail: true
    enable_troubleshoot: false

To apply the configuration, use the following command:

./egs-install-prerequisites.sh --input-yaml egs-installer-config.yaml

This script installs the GPU Operator (v24.9.1) in the egs-gpu-operator namespace and the Prometheus stack (v45.0.0) in the egs-monitoring namespace.

Install GPU Operator

The NVIDIA GPU Operator is essential for managing GPU resources and exposing the GPU metrics that EGS Worker needs for GPU slicing operations. On existing infrastructure, you can install the GPU Operator manually.

  1. Add the Helm repository using the following command:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
  2. Create an egs-gpu-operator namespace using the following command:

    kubectl create namespace egs-gpu-operator
  3. Install the GPU Operator using the following command:

    helm install gpu-operator nvidia/gpu-operator \
    --namespace egs-gpu-operator \
    --set nfd.enabled=true \
    --set nfd.nodefeaturerules=false \
    --set driver.enabled=true \
    --set driver.version="550.144.03" \
    --set driver.useOpenKernelModules=false \
    --set driver.upgradePolicy.autoUpgrade=true \
    --set driver.upgradePolicy.maxParallelUpgrades=1 \
    --set driver.upgradePolicy.maxUnavailable="25%" \
    --set mig.strategy=single \
    --set node-feature-discovery.enableNodeFeatureApi=true \
    --set node-feature-discovery.master.config.extraLabelNs="{nvidia.com}" \
    --set daemonsets.tolerations[0].key="nvidia.com/gpu" \
    --set daemonsets.tolerations[0].operator="Exists" \
    --set daemonsets.tolerations[0].effect="NoSchedule" \
    --set daemonsets.tolerations[1].key="kubeslice.io/egs" \
    --set daemonsets.tolerations[1].operator="Exists" \
    --set daemonsets.tolerations[1].effect="NoSchedule"
  4. Verify the installation using the following commands:

    • Check if all the GPU Operator pods are running:

      kubectl get pods -n egs-gpu-operator
    • Check the GPU Operator components:

      kubectl get daemonset -n egs-gpu-operator
      kubectl get deployment -n egs-gpu-operator
    • Verify that the GPU nodes are labeled:

      kubectl get nodes --show-labels | grep nvidia.com/gpu
    • Check if the NVIDIA drivers are installed:

      kubectl get pods -n egs-gpu-operator -l app=nvidia-driver-daemonset
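The long chain of `--set` flags above can also be kept in a values file, which is easier to review and keep in version control. A sketch that mirrors the same settings (the file name gpu-operator-values.yaml is arbitrary):

```yaml
# gpu-operator-values.yaml -- mirrors the --set flags shown above
nfd:
  enabled: true
  nodefeaturerules: false
driver:
  enabled: true
  version: "550.144.03"
  useOpenKernelModules: false
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    maxUnavailable: "25%"
mig:
  strategy: single
node-feature-discovery:
  enableNodeFeatureApi: true
  master:
    config:
      extraLabelNs:
        - nvidia.com
daemonsets:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: kubeslice.io/egs
      operator: Exists
      effect: NoSchedule
```

You would then install with `helm install gpu-operator nvidia/gpu-operator --namespace egs-gpu-operator --values gpu-operator-values.yaml`.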

The GPU Operator installs the following components that expose metrics:

  • NVIDIA Driver DaemonSet: Manages GPU drivers on nodes
  • NVIDIA Device Plugin: Exposes GPU resources to Kubernetes
  • Node Feature Discovery: Labels nodes with GPU capabilities
  • DCGM Exporter: Exposes GPU metrics (if enabled)
  • GPU Feature Discovery: Discovers GPU features and capabilities
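Because EGS relies on the DCGM metrics, it is worth confirming that the DCGM Exporter is enabled in your GPU Operator release. A hedged sketch of the relevant chart values (field names as used in recent gpu-operator chart versions; verify against the chart version you install):

```yaml
# Deploy the DCGM exporter and, where the Prometheus Operator is present,
# create a ServiceMonitor for it so its metrics are scraped automatically.
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s
```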

Install Kube Prometheus Stack

The kube-prometheus-stack provides comprehensive monitoring capabilities for the EGS Worker cluster.

To install Prometheus stack:

  1. Add the Helm repository using the following command:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
  2. Install Kube-Prometheus-Stack with the GPU metrics configuration. Use the following example configuration to create a gpu-monitoring-values.yaml file (note that, unlike the installer script's egs-installer-config.yaml, a Helm values file has no inline_values wrapper):

    # gpu-monitoring-values.yaml
    prometheus:
      service:
        type: ClusterIP # Service type for Prometheus
      prometheusSpec:
        storageSpec: {} # Placeholder for storage configuration
        additionalScrapeConfigs:
          - job_name: tgi
            kubernetes_sd_configs:
              - role: endpoints
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_name]
                target_label: pod_name
              - source_labels: [__meta_kubernetes_pod_container_name]
                target_label: container_name
          - job_name: gpu-metrics
            scrape_interval: 1s
            metrics_path: /metrics
            scheme: http
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - egs-gpu-operator
            relabel_configs:
              - source_labels: [__meta_kubernetes_endpoints_name]
                action: drop
                regex: .*-node-feature-discovery-master
              - source_labels: [__meta_kubernetes_pod_node_name]
                action: replace
                target_label: kubernetes_node
    grafana:
      enabled: true # Enable Grafana
      grafana.ini:
        auth:
          disable_login_form: true
          disable_signout_menu: true
        auth.anonymous:
          enabled: true
          org_role: Viewer
      service:
        type: ClusterIP # Service type for Grafana
      persistence:
        enabled: false # Disable persistence
        size: 1Gi # Default persistence size
  3. Create an egs-monitoring namespace using the following command:

    kubectl create namespace egs-monitoring
  4. Apply the values file using the following command:

    # Install kube-prometheus-stack with GPU metrics configuration
    helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace egs-monitoring \
    --values gpu-monitoring-values.yaml \
    --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
  5. Verify the installation using the following commands:

    # Check if all monitoring pods are running
    kubectl get pods -n egs-monitoring

    # Check Prometheus service
    kubectl get svc -n egs-monitoring | grep prometheus

    # Verify additional scrape configs are loaded
    kubectl port-forward svc/prometheus-operated 9090:9090 -n egs-monitoring
    # Visit http://localhost:9090/config to verify gpu-metrics job is configured
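The storageSpec: {} placeholder above means Prometheus keeps its data on ephemeral storage, so metrics are lost when the pod restarts. If you need retention across restarts, a sketch of a persistent volume configuration (the storage class name is an assumption; use one that exists in your cluster):

```yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard  # assumption: adjust to your cluster
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
```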

GPU Metrics Monitoring Configuration

The GPU Operator exposes metrics on several endpoints that need to be monitored:

  • DCGM Exporter: GPU performance and health metrics
  • NVIDIA Device Plugin: GPU resource allocation metrics
  • Node Feature Discovery: GPU capability labels
  • GPU Feature Discovery: GPU feature metrics
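For orientation, these endpoints serve metrics in the Prometheus text exposition format. A minimal sketch with hypothetical sample values, showing the format and a quick way to pull out the numbers (against a live cluster you would fetch the exporter endpoint, for example via kubectl port-forward, instead of the sample file):

```shell
# Hypothetical sample of DCGM exporter output, written to a file for illustration.
cat > /tmp/dcgm-sample.txt <<'EOF'
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 81
EOF

# Print the per-GPU utilization values, skipping the HELP/TYPE comment lines.
awk '/^DCGM_FI_DEV_GPU_UTIL/ {print $2}' /tmp/dcgm-sample.txt
# Prints one value per GPU: 37 then 81
```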

Create a Service Monitor to scrape metrics from the GPU Operator components. Use the following example configuration to create a gpu-servicemonitor.yaml file:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-metrics-monitor
  namespace: egs-monitoring
  labels:
    app.kubernetes.io/instance: kube-prometheus-stack
    release: prometheus
spec:
  endpoints:
    - interval: 30s
      port: metrics
      path: /metrics
      scrapeTimeout: 10s
      scheme: http
  namespaceSelector:
    matchNames:
      - egs-gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter

To apply the configuration, use the following command:

kubectl apply -f gpu-servicemonitor.yaml

To verify that the ServiceMonitor is created, use the following command:

kubectl get servicemonitor -n egs-monitoring

Pod Monitor Configuration

Create a Pod Monitor for direct pod metrics collection. Use the following example configuration to create a gpu-podmonitor.yaml file:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gpu-pod-metrics-monitor
  namespace: egs-monitoring
  labels:
    app.kubernetes.io/instance: kube-prometheus-stack
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
      - egs-gpu-operator
  podMetricsEndpoints:
    - interval: 30s
      port: "9400"
      path: /metrics
      scrapeTimeout: 10s
      scheme: http

To apply the configuration, use the following command:

kubectl apply -f gpu-podmonitor.yaml

To verify that the PodMonitor is created, use the following command:

kubectl get podmonitor -n egs-monitoring

GPU Metrics Dashboard

To import a GPU monitoring dashboard into Grafana, first port-forward to the Grafana service:

# Port forward to Grafana
kubectl port-forward svc/prometheus-grafana 3000:80 -n egs-monitoring

Then access Grafana at http://localhost:3000 and import dashboard ID 14574 (NVIDIA GPU Exporter Dashboard).

Verify the Deployment

Verify the GPU Operator installation

# Check GPU Operator status
kubectl get pods -n egs-gpu-operator

# Verify GPU resources are available
kubectl get nodes -o json | jq '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | .metadata.name'

# Check GPU device plugin
kubectl get pods -n egs-gpu-operator -l app=nvidia-device-plugin-daemonset

# Test GPU allocation
kubectl run gpu-test --rm -it --restart=Never \
--image=nvcr.io/nvidia/cuda:12.6.3-base-ubuntu22.04 \
--overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvcr.io/nvidia/cuda:12.6.3-base-ubuntu22.04","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
-- nvidia-smi

Verify the Prometheus Configuration


# Check if GPU metrics job is configured
kubectl port-forward svc/prometheus-operated 9090:9090 -n egs-monitoring
# Visit http://localhost:9090/targets and look for gpu-metrics job

# Check if GPU metrics are being scraped
# Visit http://localhost:9090/graph and query: up{job="gpu-metrics"}

# For comprehensive verification, use the Universal Metrics Verification Steps (section 2.5)
# and GPU Metrics Verification (section 2.6) above

Verify the GPU Metrics collection

# Check if GPU metrics are available
kubectl port-forward svc/prometheus-operated 9090:9090 -n egs-monitoring

# Query GPU metrics in Prometheus:
# - GPU utilization: DCGM_FI_DEV_GPU_UTIL
# - GPU memory usage: DCGM_FI_DEV_FB_USED
# - GPU temperature: DCGM_FI_DEV_GPU_TEMP

Verify the GPU Worker Readiness

# Check if EGS Worker can access GPU resources
kubectl get nodes --show-labels | grep nvidia.com/gpu

# Verify GPU operator tolerations are working
kubectl get pods -n egs-gpu-operator -o wide