Configure Workload Placement
This topic describes how to enable automatic workload placement across clusters in a workspace using the Workload Placement feature in EGS.
Overview
Workload Placement enables elastic, cross-cluster scaling in EGS. It automatically deploys additional workload replicas to other clusters within the same workspace when the primary cluster runs out of GPU capacity.
A GPU Provision Request (GPR) secures baseline GPUs in the primary cluster. When demand increases through Horizontal Pod Autoscaler (HPA) actions or manual replica updates and additional pods cannot be scheduled, Workload Placement evaluates available clusters and bursts the incremental replicas to the optimal targets.
When a scaled deployment bursts to a new cluster, the burst replicas are created there through a helm install; the replica count of the source cluster deployment is not mirrored all at once.
Existing replicas remain on the primary cluster. Networking between replicas is seamless, and burst replicas are automatically removed when they are no longer required. Cluster selection considers resource availability, wait time, and policy or priority settings defined in GPR templates, ensuring compliance with governance policies.
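For reference, a standard Kubernetes HorizontalPodAutoscaler such as the following can provide the scale-up signal that triggers bursting. This is a minimal sketch; the Deployment name, namespace, and CPU target are illustrative and are not part of the Workload Placement feature itself:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa            # illustrative name
  namespace: vllm-demo      # workload namespace on the primary cluster
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm              # illustrative name of the workload served by the GPR-reserved GPUs
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80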
Custom Resource Definitions
Workload Placement introduces two Custom Resource Definitions (CRDs) that operate together.
- WorkloadTemplate: a reusable blueprint that defines what to deploy, such as Helm, manifests, or commands, and the deployment steps. It is prepared ahead of time.
- WorkloadPlacement: an execution request that deploys the blueprint to a specific cluster. It is created automatically during scale events by Auto-GPR or manually by a user.
Together, these CRDs support automated, policy-driven workload bursting while still allowing manual, on-demand deployments when required.
The following table summarizes the differences between WorkloadTemplate and WorkloadPlacement:
| Dimension | WorkloadTemplate | WorkloadPlacement |
|---|---|---|
| Purpose | Define workload specifications | Execute workload deployments |
| Creation | Created manually by users | Created automatically by EGS during scaling events |
| Reusability | Can be reused across multiple deployments | Specific to a single deployment instance |
| Target Cluster | Not tied to any specific cluster | Deployed to a designated target cluster; requires the spec.clusterName parameter |
| Content | Contains deployment details (Helm - helmConfig, manifests - manifestResources, commands - cmdExe). Optional ordered steps, burstDuration, deletionPolicy, gprTemplates. | References a WorkloadTemplate and includes deployment parameters |
| Typical Trigger | HPA/manual scale-up that needs extra GPUs | Auto-GPR creates it, or a user triggers it directly |
| Lifecycle | Template object persists and can be re-used | Has phases (Running/Succeeded/Failed/Completed) and cleanup of burst resources per policy |
| Customization | Highly customizable using parameters | Limited customization, focused on deployment |
Workflow
- The auto-GPR feature must be enabled in the GPR Template to allow automatic workload placement across clusters. While creating a GPR Template, ensure the Auto-GPR option is selected. This setting allows EGS to automatically create Workload Placements when scaling events occur.
- Create a Workload Template describing your workload configuration and optional GPR template references.
- When a scaling event exceeds the GPU capacity of the primary cluster, Auto-GPR evaluates the available clusters in the workspace.
- Auto-GPR selects an appropriate target cluster and creates a Workload Placement based on the template.
- The Workload Placement deploys only the additional replicas to the target cluster. These replicas are cleaned up when the burst duration expires or when the workload scales down.
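As an illustration, a manual scale-up on the primary cluster such as the following can exceed the GPR-reserved GPU capacity and cause the additional replicas to be placed on another cluster. The deployment name and namespace below are placeholders for your own workload:
# Scale the workload beyond the GPUs reserved on the primary cluster (names are placeholders)
kubectl scale deployment <your-deployment> --replicas=4 -n <workload-namespace>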
Prerequisites
Before you begin, ensure you meet the following prerequisites:
1. EGS Controller and Worker Components
EGS Controller and EGS Worker components are installed and running. For installation steps, see Install EGS.
2. Registered Clusters
At least two clusters (for example, worker-1, worker-2) are registered with the controller cluster. For more information on registering clusters, see Register Clusters.
3. Primary and Target Clusters
Designate one cluster as the primary cluster where the initial workload is deployed. The other clusters serve as target clusters for workload bursting. At least one cluster in the workspace must have available GPU inventory for auto-placement to trigger successfully.
4. Workspace Setup
Create a workspace across the clusters. For more information on how to create a workspace, see Manage Workspaces.
5. Namespace Onboarding
Onboard namespaces onto a workspace. For more information, see Onboard Namespaces.
6. GPR Templates
Create a GPR Template for each worker cluster that is part of the workspace, ensuring the Auto-GPR option is enabled in the GPR Template.
For more information on how to create a GPR Template, see Manage GPR Templates.
7. Service Account and RBAC Permissions
Ensure access to the destination worker cluster where workloads will be deployed, and verify permissions to create ServiceAccounts in the kubeslice-system namespace and Roles and RoleBindings in the workload namespaces.
For more information, see ServiceAccount and RBAC Setup for Workload Templates.
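Before continuing, you can confirm which worker clusters are registered with the controller. The following check is a sketch that assumes the project namespace kubeslice-avesha used in the examples in this topic; substitute your own project namespace:
# List the clusters registered with the controller (project namespace is an assumption)
kubectl get clusters.controller.kubeslice.io -n kubeslice-avesha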
Configuration Parameters
The following sections describe the configuration parameters for WorkloadTemplate and WorkloadPlacement resources.
Workload Template Configuration Parameters
apiVersion: gpr.kubeslice.io/v1alpha1
kind: WorkloadTemplate
metadata:
  name: <template-name>
  namespace: <namespace>
spec:
  # List of GPU Provisioning Request (GPR) template names that this workload template can use
  gprTemplates: []string
  # Kubernetes resources to be deployed
  manifestResources: []ManifestResource # Array of ManifestResource objects. The list of Kubernetes resources to be deployed on the managed cluster
  # Helm chart configurations
  helmConfig: []HelmConfig
  # Duration for which the workload will be bursted (for example, "3m", "1h")
  burstDuration: string
  # Service account for workload execution
  # IMPORTANT: The ServiceAccount must be created in the kubeslice-system namespace in the destination cluster before deployment
  serviceAccount: string # The name of the ServiceAccount to be used for the workload deployment
  # kubectl commands to execute before deployment
  cmdExec: []CmdExec
  # Deletion policy: "Delete" (default) or "Retain"
  deletionPolicy: DeletionPolicyType
  # Ordered list of execution steps
  steps: []WorkloadStep
  # Workspace name for multi-tenant environments
  workspaceName: string # The name of the workspace this template belongs to
  # Namespaces to create/manage
  namespaces: []string # The list of namespaces associated with this workload template
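For orientation, the following is a minimal sketch of a WorkloadTemplate that deploys a single ConfigMap through a manifest step. The template name, namespaces, and GPR Template name are illustrative and assume a GPR Template with Auto-GPR enabled already exists:
apiVersion: gpr.kubeslice.io/v1alpha1
kind: WorkloadTemplate
metadata:
  name: demo-config-template        # illustrative template name
  namespace: kubeslice-avesha       # project namespace on the controller (illustrative)
spec:
  gprTemplates:
    - worker-2-gpr-template         # illustrative GPR Template name with Auto-GPR enabled
  manifestResources:
    - name: "demo-config"
      manifest:
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: demo-config
          namespace: demo           # illustrative workload namespace
        data:
          mode: "burst"
  steps:
    - name: "demo-config"
      type: "manifest"
  burstDuration: "30m"              # burst resources are removed after 30 minutes
  deletionPolicy: "Delete"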
Workload Placement Configuration Parameters
apiVersion: worker.kubeslice.io/v1alpha1
kind: WorkloadPlacement
metadata:
  name: <placement-name>
  namespace: <namespace>
spec:
  # REQUIRED: Target cluster name for deployment
  clusterName: string
  # Kubernetes resources to be deployed
  manifestResources: []ManifestResource
  # Helm chart configurations
  helmConfig: []HelmConfig
  # Duration for which the workload will be bursted
  burstDuration: string
  # Service account for workload execution
  # IMPORTANT: The ServiceAccount must be created in the kubeslice-system namespace
  # in the destination cluster before deployment. See ServiceAccount and RBAC Setup
  # documentation for details on creating custom Roles and RoleBindings.
  serviceAccount: string
  # kubectl commands to execute
  cmdExec: []CmdExec
  # Deletion policy: "Delete" (default) or "Retain"
  deletionPolicy: DeletionPolicyType
  # Ordered list of execution steps
  steps: []WorkloadStep
ServiceAccount and RBAC Setup for Workload Templates
When deploying workloads using Workload Template or Workload Placement, you may need to configure custom ServiceAccounts with appropriate RBAC (Role-Based Access Control) permissions.
The ServiceAccount must be created in the kubeslice-system namespace in the destination cluster, because the workload deployment is managed through the EGS control plane. This also ensures namespace isolation and security boundaries.
The following example YAML configurations create a custom ServiceAccount and set up RBAC for Workload Templates.
1. Create a ServiceAccount
In the following example, we create a ServiceAccount named vllm-sa for a Virtual Large Language Model (vLLM) workload.
Create a YAML file named serviceaccount.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-sa # ServiceAccount name
  namespace: kubeslice-system
  labels:
    app: vllm-app # Application label
    purpose: workload-execution # Purpose of the ServiceAccount
Use the following command to apply the YAML file in the destination cluster (in the kubeslice-system namespace):
kubectl apply -f serviceaccount.yaml
2. Create a Role
In the following example, we create a Role named vllm-role that grants permissions to manage ConfigMaps, Secrets, Pods, and Jobs.
Create a YAML file named role.yaml:
apiVersion: rbac.authorization.k8s.io/v1 # RBAC API version
kind: Role
metadata:
  name: vllm-role # Role name
  namespace: vllm-demo # Workload namespace
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
Use the following command to apply the YAML file in the destination cluster (in the workload namespace):
kubectl apply -f role.yaml
3. Create a RoleBinding
In the following example, we create a RoleBinding named vllm-rolebinding that binds the vllm-role to the vllm-sa ServiceAccount.
Create a YAML file named rolebinding.yaml:
apiVersion: rbac.authorization.k8s.io/v1 # RBAC API version
kind: RoleBinding
metadata:
  name: vllm-rolebinding # RoleBinding name
  namespace: vllm-demo # Workload namespace
subjects:
  - kind: ServiceAccount
    name: vllm-sa
    namespace: kubeslice-system
roleRef:
  kind: Role
  name: vllm-role
  apiGroup: rbac.authorization.k8s.io # RBAC API group
Use the following command to apply the YAML file in the destination cluster (in the workload namespace):
kubectl apply -f rolebinding.yaml
4. Verify a ServiceAccount in the Destination Cluster
After creating these resources, you can verify that the ServiceAccount and RBAC permissions are correctly set up in the destination cluster using the following commands:
# Connect to destination cluster
kubectl get serviceaccount vllm-sa -n kubeslice-system
# Verify RoleBinding
kubectl get rolebinding vllm-rolebinding -n vllm-demo
# Check permissions
kubectl auth can-i create pods --as=system:serviceaccount:kubeslice-system:vllm-sa -n vllm-demo
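If the ServiceAccount and RBAC resources are set up correctly, the permission check returns the following; a no response usually means the Role or RoleBinding is missing or was created in the wrong namespace:
Example Output
yes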
Auto Workload Placement Using the Workload Template
This section provides an example of creating a Workload Template to enable automatic Workload Placement across clusters in a workspace.
Create and Apply a Workload Template
Create a Workload Template that defines the workload configuration to be deployed across clusters.
- Create a YAML file called workload-template.yaml. The following is an example Workload Template that deploys a vLLM Helm chart:

apiVersion: gpr.kubeslice.io/v1alpha1
kind: WorkloadTemplate
metadata:
  name: vllm-workload-template
  namespace: kubeslice-avesha
spec:
  # Reference the ServiceAccount created in the destination cluster
  serviceAccount: vllm-sa
  # Define Helm configurations
  helmConfig:
    - name: "vllm-app"
      chart: vllm/vllm-stack
      releaseName: vllm
      releaseNamespace: vllm-demo
      repoName: vllm
      repoURL: https://vllm-project.github.io/production-stack
      helmFlags: "--debug"
      values:
        servingEngineSpec:
          modelSpec:
            - name: "llama3"
              repository: "vllm/vllm-openai"
              tag: "v0.10.1"
              modelURL: "meta-llama/Llama-3.2-1B-Instruct" # alternatives: "TheBloke/deepseek-llm-7B-chat-AWQ", "meta-llama/Llama-3.1-8B-Instruct"
              replicaCount: 1
              requestCPU: 4
              requestMemory: "8Gi"
              requestGPU: 1
              pvcStorage: "100Gi"
              storageClass: "standard"
              #pvcMatchLabels:
              #  model: "llama3-pv"
              vllmConfig:
                maxModelLen: 4096
                #quantization: 'awq'
                #extraArgs:
              env:
                - name: VLLM_FLASHINFER_DISABLED
                  value: "1"
              hf_token: <your-huggingface-token>
        routerSpec:
          resources:
            requests:
              cpu: "2"
              memory: "8G"
            limits:
              cpu: "8"
              memory: "32G"
  # Define command executions
  cmdExec:
    - name: "get-namespace"
      cmd: "kubectl get pods -n vllm-demo"
  # Define the order of execution (steps)
  steps:
    - name: "get-namespace"
      type: "command"
    - name: "vllm-app"
      type: "helm"
  # Duration after which helm releases should be uninstalled
  burstDuration: "10m"

- Apply the workload-template.yaml file on the controller cluster:

kubectl apply -f workload-template.yaml
Verify the Workload Template Creation
- To verify that the Workload Template has been created successfully, run the following command on the controller cluster:

Example

kubectl get workloadtemplates -n kubeslice-avesha

Example Output

NAME                     AGE
vllm-workload-template   44h

- To verify the Workload Template details, run the following command on the controller cluster:

Example

kubectl get workloadtemplates -n kubeslice-avesha vllm-workload-template -o yaml

Example Output
apiVersion: gpr.kubeslice.io/v1alpha1
kind: WorkloadTemplate
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"gpr.kubeslice.io/v1alpha1","kind":"WorkloadTemplate","metadata":{"annotations":{},"name":"vllm-workload-template","namespace":"kubeslice-avesha"},"spec":{"burstDuration":"10m","cmdExec":[{"cmd":"kubectl get pods -n vllm-demo","name":"get-namespace"}],"helmConfig":[{"chart":"vllm/vllm-stack","helmFlags":"--debug","name":"vllm-app","releaseName":"vllm","releaseNamespace":"vllm-demo","repoName":"vllm","repoURL":"https://vllm-project.github.io/production-stack","values":{"routerSpec":{"resources":{"limits":{"cpu":"8","memory":"32G"},"requests":{"cpu":"2","memory":"8G"}}},"servingEngineSpec":{"modelSpec":[{"env":[{"name":"VLLM_FLASHINFER_DISABLED","value":"1"}],"hf_token":"<your-huggingface-token>","maxModelLen":4096,"modelURL":"meta-llama/Llama-3.1-1B-Instruct","name":"llama3","pvcStorage":"100Gi","replicaCount":1,"repository":"vllm/vllm-openai","requestCPU":4,"requestGPU":1,"requestMemory":"8Gi","storageClass":"standard","tag":"v0.10.1","vllmConfig":null}]}}}],"steps":[{"name":"get-namespace","type":"command"},{"name":"vllm-app","type":"helm"}]}}
  creationTimestamp: "2025-11-04T09:25:26Z"
  generation: 8
  name: vllm-workload-template
  namespace: kubeslice-avesha
  resourceVersion: "1762286904037935011"
  uid: 06c04428-a041-4424-86c5-af6d4c490892
spec:
  burstDuration: 10m
  cmdExec:
  - cmd: kubectl get pods -n vllm-demo
    name: get-namespace
  deletionPolicy: Delete
  helmConfig:
  - chart: vllm/vllm-stack
    helmFlags: --debug
    name: vllm-app
    releaseName: vllm
    releaseNamespace: vllm-demo
    repoName: vllm
    repoURL: https://vllm-project.github.io/production-stack
    values:
      routerSpec:
        resources:
          limits:
            cpu: "8"
            memory: 32G
          requests:
            cpu: "2"
            memory: 8G
      servingEngineSpec:
        modelSpec:
        - env:
          - name: VLLM_FLASHINFER_DISABLED
            value: "1"
          hf_token: <your-huggingface-token>
          modelURL: meta-llama/Llama-3.2-1B-Instruct
          name: llama3
          pvcStorage: 100Gi
          replicaCount: 1
          repository: vllm/vllm-openai
          requestCPU: 4
          requestGPU: 1
          requestMemory: 8Gi
          storageClass: standard
          tag: v0.10.1
          vllmConfig:
            maxModelLen: 4096
  steps:
  - name: get-namespace
    type: command
  - name: vllm-app
    type: helm
status:
  workloadSelector:
    name: vllm
    namespace: vllm-demo
Verify the GPR Creation
When the workload scales beyond the GPU capacity of the primary cluster, EGS automatically creates GPU Provision Requests (GPRs).
- Verify the GPRs created by running the following command on the worker clusters:

Example

kubectl get gprs -n kubeslice-avesha

Example Output

NAME                       AGE
gpr-03a815ee-c2dc-460f-a   38h

- To get the list of GPU Provisioning Requests (GPRs) created as part of workload placement, run the following command on the controller cluster:

Example

kubectl get gpuprovisioningrequests.gpr.kubeslice.io -n kubeslice-avesha

Example Output

NAME                       AGE
gpr-03a815ee-c2dc-460f-a   38h
gpr-05adeb5c-61b6-4dc0-9   37h
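You can also list the WorkloadPlacement objects that Auto-GPR creates for each burst. The following command is a sketch that assumes the CRD's plural resource name is workloadplacements and uses the same project namespace as the examples above:

# List Workload Placements created during bursting (resource plural and namespace are assumptions)
kubectl get workloadplacements -n kubeslice-avesha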
Examples
The following are examples of WorkloadPlacement configurations using different deployment methods.
Manifest Deployment Example
apiVersion: aiops.kubeslice.io/v1alpha1
kind: WorkloadPlacement
metadata:
  name: workload-placement-example
spec:
  # REQUIRED: target cluster for the deployment
  clusterName: <target-cluster-name>
  manifestResources:
    - name: "sample-configmap"
      manifest:
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: sample-configmap
          namespace: default
        data:
          ui.properties: |
            color=purple
            theme=dark
            language=en
          database.properties: |
            host=localhost
            port=5432
            database=myapp
  steps:
    - name: "sample-configmap"
      type: "manifest"
Helm Deployment Example
apiVersion: aiops.kubeslice.io/v1alpha1
kind: WorkloadPlacement
metadata:
  name: workload-placement-helm-example
spec:
  # REQUIRED: target cluster for the deployment
  clusterName: <target-cluster-name>
  helmConfig:
    - name: "gpu-operator"
      chart: "nvidia/gpu-operator"
      repoName: "nvidia"
      repoURL: "https://helm.ngc.nvidia.com/nvidia"
      releaseName: "gpu-operator"
      releaseNamespace: "gpu-operator"
      version: "1.0.0"
      helmFlags: "--create-namespace --wait --timeout 5m"
    - name: "hello-world"
      chart: "examples/hello-world"
      repoName: "examples"
      repoURL: "https://helm.github.io/examples"
      releaseName: "ahoy"
      releaseNamespace: "xyz"
      helmFlags: "--wait --create-namespace --timeout 5m"
  steps:
    - name: "gpu-operator"
      type: "helm"
    - name: "hello-world"
      type: "helm"
  deletionPolicy: "Delete" # or "Retain"
Command Execution Example
apiVersion: aiops.kubeslice.io/v1alpha1
kind: WorkloadPlacement
metadata:
  name: workload-placement-cmd-example
spec:
  # REQUIRED: target cluster for the deployment
  clusterName: <target-cluster-name>
  cmdExec:
    - name: "create-namespace"
      cmd: "kubectl create ns test-ns --dry-run=client"
    - name: "label-nodes"
      cmd: "kubectl label node worker-node1 gpu=true"
  steps:
    - name: "create-namespace"
      type: "command"
    - name: "label-nodes"
      type: "command"