Version: 1.14.0

Concepts

This topic describes the concepts related to Elastic GPU Service (EGS).

| Concept | Description |
| --- | --- |
| EGS Workspace | Provides isolated tenant environments mapped to Kubernetes namespaces. Supports network and workload isolation, granular RBAC, and secure multi-tenancy. |
| GPU Cluster Time-Slicing | Dynamically allocates and reallocates GPUs among workspaces and workloads so that capacity does not sit idle. Enables multiple workloads to share GPUs efficiently and improves overall cluster utilization. |
| Dynamic GPU Provisioning in a Workspace | Assigns GPUs on demand within a workspace based on workload needs. Allocates GPUs only when needed, supports heterogeneous GPU types, and works within workspace quotas and policies. |
| GPU Bursting | Temporarily expands GPU capacity by borrowing from other clusters during demand spikes. Handles sudden workload spikes, utilizes spare capacity from multiple clusters, and supports flexible scaling. |
| Preemption, Idle Timeout, Priority/Fairness Allocations | Applies scheduling rules to reclaim and redistribute GPUs for fairness and efficiency. Reclaims idle or low-priority resources, supports workload prioritization, and enforces fairness across tenants. |
| GPU Inventory Schedule Management | Tracks, schedules, and manages available GPU assets across clusters and clouds in real time. Maintains live GPU availability data, assists in capacity planning, and supports automated allocation. |
| Dynamic Node Pools and Nodes | Enables agile creation and management of GPU-equipped node pools and individual nodes. Scales GPU nodes automatically, supports mixed GPU types, and works with multi-cluster environments. |
| Multi-Cloud Multi-Cluster Workspace | Extends workspaces and GPU resources across multiple cloud and on-prem environments. Enables hybrid cloud GPU deployments, cross-cluster resource sharing, and single-pane management. |
| Slice/Workspace Overlay Network | Connects workloads securely across clusters with low-latency access. Offers secure tenant-aware networking and simplified service discovery while maintaining isolation. |
| Multi-Tenant Control Plane | Manages multiple tenants in shared clusters with unified governance. Supports tenant-level isolation, centralized policy enforcement, and shared infrastructure. |
| GPU Provision Requests (GPR) Management | Handles formal requests and approvals for GPU allocations for workspaces. Automates resource requests, supports approval flows, and integrates with templates. |
| Workspace Provision Requests | Automates creation of tenant workspaces with governance settings. Includes GPU quota settings, integrates with onboarding, and automates workspace setup. |
| AI Workload/GPU Observability | Monitors workloads and GPU performance in real time. Provides real-time metrics, framework-level monitoring, and NVIDIA DCGM integration. |
| Scalable Inference Endpoints | Deploys and manages production-ready inference services. Supports autoscaling, versioning, and multi-backend inference using KServe, vLLM, and Dynamo. |
| Smart Scaler for Inference | Predictively scales inference workloads to balance cost and latency. Learns workload patterns, reduces cost, and improves latency using RL-based scaling. |
| Workspace Policies | Applies policies and compliance rules across workspaces. Ensures policy enforcement, compliance monitoring, and RBAC governance. |
| Self-Service Portals | Allows admins and users to manage resources via intuitive interfaces. Features an intuitive UI, low operational overhead, and self-provisioning capabilities. |
| EGS Core API/SDK | Provides APIs and SDKs for integrating EGS into workflows. Enables programmatic management, integration, and automation support. |
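To make the scheduling concepts above concrete, the following is a minimal toy sketch of priority/fairness allocation with idle-timeout reclamation. All names (`ToyAllocator`, `GPURequest`) are hypothetical illustrations of the general technique, not EGS's actual scheduler:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class GPURequest:
    # Lower number = higher priority; ties broken by submission order.
    priority: int
    order: int
    workspace: str = field(compare=False)
    gpus: int = field(compare=False)

class ToyAllocator:
    """Toy allocator: grants GPUs in priority order and reclaims
    allocations that have sat idle past a timeout (preemption)."""

    def __init__(self, total_gpus: int, idle_timeout: float = 300.0):
        self.free = total_gpus
        self.idle_timeout = idle_timeout
        self.pending: list[GPURequest] = []
        self.granted: dict[str, int] = {}    # workspace -> GPUs held
        self.last_used: dict[str, float] = {}  # workspace -> last activity
        self._order = 0

    def request(self, workspace: str, gpus: int, priority: int) -> None:
        self._order += 1
        heapq.heappush(self.pending,
                       GPURequest(priority, self._order, workspace, gpus))

    def reclaim_idle(self, now: float) -> None:
        # Free GPUs held by workspaces idle longer than the timeout.
        for ws, last in list(self.last_used.items()):
            if now - last > self.idle_timeout and ws in self.granted:
                self.free += self.granted.pop(ws)
                del self.last_used[ws]

    def schedule(self, now: float) -> list[str]:
        self.reclaim_idle(now)
        granted = []
        while self.pending and self.pending[0].gpus <= self.free:
            req = heapq.heappop(self.pending)
            self.free -= req.gpus
            self.granted[req.workspace] = (
                self.granted.get(req.workspace, 0) + req.gpus)
            self.last_used[req.workspace] = now
            granted.append(req.workspace)
        return granted
```

For example, with 8 GPUs a priority-0 request for 6 is granted ahead of a priority-1 request for 4; once the first workspace idles past the timeout, its GPUs are reclaimed and the waiting request fits. EGS layers quotas, fairness across tenants, and approval flows on top of this basic reclaim-and-redistribute loop.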
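A GPU Provision Request (GPR) submitted through the Core API/SDK can be pictured as a structured payload like the one below. The field names here are illustrative assumptions only; consult the EGS Core API reference for the actual request schema:

```python
import json

def build_gpr(workspace: str, gpu_type: str, count: int,
              priority: int = 100, duration_hours: int = 4) -> dict:
    """Assemble an illustrative GPR payload (field names hypothetical)."""
    if count < 1:
        raise ValueError("a GPR must request at least one GPU")
    return {
        "workspace": workspace,           # target tenant workspace
        "gpuType": gpu_type,              # requested accelerator type
        "count": count,                   # number of GPUs requested
        "priority": priority,             # input to priority/fairness scheduling
        "durationHours": duration_hours,  # lease length before reclaim
    }

payload = build_gpr("ml-research", "nvidia-a100", count=2)
print(json.dumps(payload, indent=2))
```

A request like this would then flow through GPR management's approval and templating steps before GPUs are actually assigned to the workspace.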