Version: 1.14.0

Concepts

This topic describes the concepts related to Elastic GPU Service (EGS).

| Concept | Description |
| --- | --- |
| EGS Workspace | Provides isolated tenant environments mapped to Kubernetes namespaces. Supports network and workload isolation, granular RBAC, and secure multi-tenancy. |
| GPU Cluster Time-Slicing | Dynamically allocates and reallocates GPUs among workspaces and workloads so that capacity does not sit idle. Enables multiple workloads to share GPUs efficiently and improves overall cluster utilization. |
| Dynamic GPU Provisioning in a Workspace | Assigns GPUs on demand within a workspace based on workload needs. Allocates GPUs only when needed, supports heterogeneous GPU types, and works within workspace quotas and policies. |
| GPU Bursting | Temporarily expands GPU capacity by borrowing from other clusters during demand spikes. Handles sudden workload spikes, utilizes spare capacity from multiple clusters, and supports flexible scaling. |
| Preemption, Idle Timeout, Priority/Fairness Allocations | Applies scheduling rules to reclaim and redistribute GPUs for fairness and efficiency. Reclaims idle or low-priority resources, supports workload prioritization, and enforces fairness across tenants. |
| GPU Inventory Schedule Management | Tracks, schedules, and manages available GPU assets across clusters and clouds in real time. Maintains live GPU availability data, assists in capacity planning, and supports automated allocation. |
| Dynamic Node Pools and Nodes | Enables agile creation and management of GPU-equipped node pools and individual nodes. Scales GPU nodes automatically, supports mixed GPU types, and works with multi-cluster environments. |
| Multi-Cloud Multi-Cluster Workspace | Extends workspaces and GPU resources across multiple cloud and on-prem environments. Enables hybrid cloud GPU deployments, cross-cluster resource sharing, and single-pane management. |
| Slice/Workspace Overlay Network | Connects workloads securely across clusters with low-latency access. Offers secure tenant-aware networking and simplified service discovery while maintaining isolation. |
| Multi-Tenant Control Plane | Manages multiple tenants in shared clusters with unified governance. Supports tenant-level isolation, centralized policy enforcement, and shared infrastructure. |
| GPU Provision Requests (GPR) Management | Handles formal requests and approvals for GPU allocations for workspaces. Automates resource requests, supports approval flows, and integrates with templates. |
| Workspace Provision Requests | Automates creation of tenant workspaces with governance settings. Includes GPU quota settings, integrates with onboarding, and automates workspace setup. |
| AI Workload/GPU Observability | Monitors workloads and GPU performance in real time. Provides real-time metrics, framework-level monitoring, and NVIDIA DCGM integration. |
| Scalable Inference Endpoints | Deploys and manages production-ready inference services. Supports autoscaling, versioning, and multi-backend inference using KServe, vLLM, and Dynamo. |
| Smart Scaler for Inference | Predictively scales inference workloads to balance cost and latency. Learns workload patterns, reduces cost, and improves latency using RL-based scaling. |
| Workspace Policies | Applies policies and compliance rules across workspaces. Ensures policy enforcement, compliance monitoring, and RBAC governance. |
| Self-Service Portals | Allows admins and users to manage resources via intuitive interfaces. Features an intuitive UI, low operational overhead, and self-provisioning capabilities. |
| EGS Core API/SDK | Provides APIs and SDKs for integrating EGS into workflows. Enables programmatic management, integration, and automation support. |
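To make the scheduling concepts above concrete, the following is a minimal toy sketch of priority/fairness allocation with idle-timeout reclamation. All names (`ToyAllocator`, `GPURequest`) are hypothetical illustrations of the general technique, not EGS's actual scheduler:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class GPURequest:
    # Lower number = higher priority; ties broken by submission order.
    priority: int
    order: int
    workspace: str = field(compare=False)
    gpus: int = field(compare=False)

class ToyAllocator:
    """Toy allocator: grants GPUs in priority order and reclaims
    allocations that have sat idle past a timeout (preemption)."""

    def __init__(self, total_gpus: int, idle_timeout: float = 300.0):
        self.free = total_gpus
        self.idle_timeout = idle_timeout
        self.pending: list[GPURequest] = []
        self.granted: dict[str, int] = {}    # workspace -> GPUs held
        self.last_used: dict[str, float] = {}  # workspace -> last activity
        self._order = 0

    def request(self, workspace: str, gpus: int, priority: int) -> None:
        self._order += 1
        heapq.heappush(self.pending,
                       GPURequest(priority, self._order, workspace, gpus))

    def reclaim_idle(self, now: float) -> None:
        # Free GPUs held by workspaces idle longer than the timeout.
        for ws, last in list(self.last_used.items()):
            if now - last > self.idle_timeout and ws in self.granted:
                self.free += self.granted.pop(ws)
                del self.last_used[ws]

    def schedule(self, now: float) -> list[str]:
        self.reclaim_idle(now)
        granted = []
        while self.pending and self.pending[0].gpus <= self.free:
            req = heapq.heappop(self.pending)
            self.free -= req.gpus
            self.granted[req.workspace] = (
                self.granted.get(req.workspace, 0) + req.gpus)
            self.last_used[req.workspace] = now
            granted.append(req.workspace)
        return granted
```

For example, with 8 GPUs a priority-0 request for 6 is granted ahead of a priority-1 request for 4; once the first workspace idles past the timeout, its GPUs are reclaimed and the waiting request fits. EGS layers quotas, fairness across tenants, and approval flows on top of this basic reclaim-and-redistribute loop.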
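A GPU Provision Request (GPR) submitted through the Core API/SDK can be pictured as a structured payload like the one below. The field names here are illustrative assumptions only; consult the EGS Core API reference for the actual request schema:

```python
import json

def build_gpr(workspace: str, gpu_type: str, count: int,
              priority: int = 100, duration_hours: int = 4) -> dict:
    """Assemble an illustrative GPR payload (field names hypothetical)."""
    if count < 1:
        raise ValueError("a GPR must request at least one GPU")
    return {
        "workspace": workspace,           # target tenant workspace
        "gpuType": gpu_type,              # requested accelerator type
        "count": count,                   # number of GPUs requested
        "priority": priority,             # input to priority/fairness scheduling
        "durationHours": duration_hours,  # lease length before reclaim
    }

payload = build_gpr("ml-research", "nvidia-a100", count=2)
print(json.dumps(payload, indent=2))
```

A request like this would then flow through GPR management's approval and templating steps before GPUs are actually assigned to the workspace.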