Concepts
This topic describes the key concepts of Elastic GPU Service (EGS).
Concept | Description |
---|---|
EGS Workspace | Provides isolated tenant environments mapped to Kubernetes namespaces. Supports network and workload isolation, granular RBAC, and secure multi-tenancy. |
GPU Cluster Time-Slicing | Dynamically allocates and reallocates GPUs among workspaces and workloads. Enables multiple workloads to share GPUs efficiently, reduces idle GPU capacity, and improves overall cluster utilization. |
Dynamic GPU Provisioning in a Workspace | Assigns GPUs on demand within a workspace based on workload needs. Allocates GPUs only when needed, supports heterogeneous GPU types, and works within workspace quotas and policies. |
GPU Bursting | Temporarily expands GPU capacity by borrowing from other clusters during demand spikes. Handles sudden workload spikes, utilizes spare capacity from multiple clusters, and supports flexible scaling. |
Preemption, Idle Timeout, Priority/Fairness Allocations | Applies scheduling rules to reclaim and redistribute GPUs for fairness and efficiency. Reclaims idle or low-priority resources, supports workload prioritization, and enforces fairness across tenants. |
GPU Inventory Schedule Management | Tracks, schedules, and manages available GPU assets across clusters/clouds in real time. Maintains live GPU availability data, assists in capacity planning, and supports automated allocation. |
Dynamic Node Pools and Nodes | Enables agile creation and management of GPU-equipped node pools and individual nodes. Scales GPU nodes automatically, supports mixed GPU types, and works with multi-cluster environments. |
Multi-Cloud Multi-Cluster Workspace | Extends workspaces and GPU resources across multiple cloud and on-prem environments. Enables hybrid cloud GPU deployments, cross-cluster resource sharing, and single-pane management. |
Slice/Workspace Overlay Network | Connects workloads securely across clusters with low-latency access. Offers secure tenant-aware networking, simplified service discovery, and maintains isolation. |
Multi-Tenant Control Plane | Manages multiple tenants in shared clusters with unified governance. Supports tenant-level isolation, centralized policy enforcement, and shared infrastructure. |
GPU Provision Requests (GPR) Management | Handles formal requests and approvals for GPU allocations to workspaces. Automates resource requests, supports approval workflows, and integrates with templates. |
Workspace Provision Requests | Automates creation of tenant workspaces with governance settings. Includes GPU quota settings, integrates with onboarding, and automates workspace setup. |
AI Workload/GPU Observability | Monitors workloads and GPU performance in real time. Provides real-time metrics, framework-level monitoring, and NVIDIA DCGM integration. |
Scalable Inference Endpoints | Deploys and manages production-ready inference services. Supports autoscaling and versioning, multi-backend inference, and uses KServe/vLLM/Dynamo. |
Smart Scaler for Inference | Predictively scales inference workloads to balance cost and latency. Learns workload patterns, reduces cost, and improves latency using RL-based scaling. |
Workspace Policies | Applies policies and compliance rules across workspaces. Ensures policy enforcement, compliance monitoring, and RBAC governance. |
Self-Service Portals | Allows administrators and users to manage resources through an intuitive interface. Reduces operational overhead and enables self-provisioning. |
EGS Core API/SDK | Provides APIs and SDKs for integrating EGS into existing workflows. Enables programmatic management and automation. |
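To make GPU time-slicing concrete: when each physical GPU is split into several schedulable time slices, the scheduler can pack multiple workloads onto one device. The sketch below is a minimal first-fit placement of slice requests onto time-sliced GPUs; the function name and placement policy are illustrative, not EGS's actual scheduler.

```python
def place(requests: list[int], gpus: int, slices_per_gpu: int) -> dict[int, list[int]]:
    """First-fit placement of workload slice requests onto time-sliced GPUs.
    Returns {gpu_index: [workload indices]}; raises if a request cannot fit."""
    free = [slices_per_gpu] * gpus          # remaining slices per GPU
    placement = {g: [] for g in range(gpus)}
    for w, need in enumerate(requests):
        for g in range(gpus):
            if free[g] >= need:             # first GPU with enough free slices
                free[g] -= need
                placement[g].append(w)
                break
        else:
            raise RuntimeError(f"workload {w} does not fit")
    return placement

# Four workloads needing 2, 1, 3, and 1 slices on 2 GPUs with 4 slices each.
print(place([2, 1, 3, 1], gpus=2, slices_per_gpu=4))
# → {0: [0, 1, 3], 1: [2]}
```

Workloads 0, 1, and 3 share GPU 0 (filling all four of its slices), while workload 2 takes three slices of GPU 1 — no GPU sits fully idle.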
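Since an EGS workspace maps to a Kubernetes namespace with quotas and policies, the underlying enforcement object is a standard Kubernetes `ResourceQuota` capping `requests.nvidia.com/gpu`. A minimal sketch of building such a manifest, assuming the workspace namespace name (`team-a`) is illustrative and that EGS generates equivalent objects itself:

```python
import json

def gpu_resource_quota(namespace: str, gpu_limit: int) -> dict:
    """Build a Kubernetes ResourceQuota manifest capping GPU requests in a
    workspace namespace. nvidia.com/gpu is the extended resource exposed by
    the NVIDIA device plugin; quota values are strings in Kubernetes."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }

# "team-a" is a hypothetical workspace namespace, not an EGS default.
print(json.dumps(gpu_resource_quota("team-a", 4), indent=2))
```

Once applied, the namespace rejects any pod whose aggregate GPU requests would exceed the workspace's quota of 4.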
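The preemption, idle-timeout, and priority rules can be sketched as a single reclamation check: a workload's GPUs are reclaimable if the workload has been idle past a timeout, or if its priority is lower than an incoming request's. The data shapes and thresholds below are illustrative assumptions, not EGS's actual policy engine.

```python
import time
from dataclasses import dataclass

@dataclass
class Allocation:
    workload: str
    priority: int       # higher wins
    last_active: float  # unix timestamp of last GPU activity

def reclaimable(allocs, now, idle_timeout, incoming_priority):
    """Return workloads whose GPUs can be reclaimed: idle past the
    timeout, or lower priority than the incoming request."""
    victims = []
    for a in allocs:
        idle = (now - a.last_active) > idle_timeout
        preemptible = a.priority < incoming_priority
        if idle or preemptible:
            victims.append(a.workload)
    return victims

now = time.time()
allocs = [
    Allocation("batch-train", priority=1, last_active=now - 7200),  # idle 2 h
    Allocation("notebook",    priority=2, last_active=now - 60),
    Allocation("inference",   priority=9, last_active=now - 5),
]
# A priority-5 request arrives; the idle timeout is 1 hour.
print(reclaimable(allocs, now, idle_timeout=3600, incoming_priority=5))
# → ['batch-train', 'notebook']
```

`batch-train` is reclaimed for being idle, `notebook` for being lower priority than the incoming request, while the high-priority `inference` workload is protected.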
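A GPU Provision Request (GPR) moves through an approval flow before GPUs are attached to a workspace. The request shape and state names below are hypothetical, chosen only to illustrate the request-then-approve lifecycle; EGS's actual GPR schema may differ.

```python
from dataclasses import dataclass

# Hypothetical lifecycle states for illustration.
STATES = ("Pending", "Approved", "Provisioned", "Released")

@dataclass
class GpuProvisionRequest:
    workspace: str
    gpu_type: str
    count: int
    state: str = "Pending"

    def advance(self) -> str:
        """Move the request to its next lifecycle state (no-op at the end)."""
        i = STATES.index(self.state)
        if i < len(STATES) - 1:
            self.state = STATES[i + 1]
        return self.state

gpr = GpuProvisionRequest(workspace="team-a", gpu_type="nvidia-a100", count=2)
gpr.advance()      # Pending -> Approved (e.g., after admin review)
gpr.advance()      # Approved -> Provisioned (GPUs attached to the workspace)
print(gpr.state)   # → Provisioned
```

An approval flow built on templates would pre-fill `gpu_type` and `count`, leaving only the approval step to an administrator or an automated policy.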