Concepts
This topic describes the concepts related to Elastic GPU Service (EGS).
GPU Cluster Time Slicing
GPU cluster time slicing allows the dynamic allocation and reallocation of GPU resources across multiple users and workloads, ensuring that GPU clusters are utilized more efficiently.
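The sketch below illustrates the idea with a simple round-robin scheduler that rotates a single GPU among submitted workloads in fixed time slices. The class, method names, and slice length are hypothetical illustrations, not part of EGS.

```python
from collections import deque

class TimeSliceScheduler:
    """Illustrative round-robin time slicing of one GPU across workloads."""

    def __init__(self, slice_ms: int = 100):
        self.slice_ms = slice_ms          # length of each time slice (hypothetical unit)
        self.workloads = deque()          # workloads waiting for GPU time

    def submit(self, workload_id: str) -> None:
        """Add a workload to the rotation."""
        self.workloads.append(workload_id)

    def next_allocation(self) -> tuple[str, int] | None:
        """Return (workload, slice_ms) for the next time slice, rotating fairly."""
        if not self.workloads:
            return None
        workload = self.workloads.popleft()
        self.workloads.append(workload)   # re-queue so every workload gets a turn
        return workload, self.slice_ms
```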
EGS Slice VPC
An EGS slice VPC is a logical boundary for a user's (or a team's) workspace. The slice can be viewed as a VPC that spans one or more Kubernetes clusters. The slice VPC workspace is associated with one or more namespaces where users can deploy their AI workloads.
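As a rough illustration of this data model, a slice could be represented as a named workspace mapped to the clusters it spans and the namespaces it owns. The class and field names below are assumptions for illustration only, not the EGS schema.

```python
from dataclasses import dataclass, field

@dataclass
class SliceVPC:
    """Illustrative model of a slice VPC: a team workspace spanning clusters."""
    name: str                                              # slice name, e.g. "team-a-slice"
    clusters: list[str] = field(default_factory=list)      # Kubernetes clusters the slice spans
    namespaces: list[str] = field(default_factory=list)    # namespaces where AI workloads run

# Example: a slice spanning two clusters with one workload namespace each.
slice_vpc = SliceVPC(
    name="team-a-slice",
    clusters=["cluster-east", "cluster-west"],
    namespaces=["team-a-train", "team-a-infer"],
)
```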
GPU Request
Users create GPU requests for the slice using the EGS User portal. The EGS control plane manages the queue of GPU requests across slices, users (teams), and clusters.
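A GPU request might carry fields such as the owning slice, the requesting user, the GPU count and type, the duration, and a priority. The record below is an illustrative sketch, not the actual EGS request schema.

```python
from dataclasses import dataclass

@dataclass
class GPURequest:
    """Illustrative GPU request (GPR) as submitted through a user portal."""
    request_id: str
    slice_name: str       # slice VPC the request belongs to
    user: str             # requesting user or team
    gpu_count: int        # number of GPUs requested
    gpu_type: str         # e.g. a hypothetical "A100-80GB"
    duration_hours: int   # how long the GPUs are needed
    priority: int         # lower value = higher priority (assumption)

request = GPURequest("gpr-001", "team-a-slice", "alice", 4, "A100-80GB", 12, priority=1)
```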
GPU Requests Management
The EGS control plane GPU request (GPR) manager manages the life cycle of GPU requests. GPU requests are inserted into the pending requests queue based on a number of considerations, such as the priority assigned to them.
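One way to picture the life cycle is as a small state machine. The states and transitions below are hypothetical and meant only to illustrate how a GPR manager could track a request from submission to completion.

```python
from enum import Enum, auto

class GPRState(Enum):
    """Hypothetical life-cycle states a GPR manager might track."""
    PENDING = auto()       # queued, waiting for GPUs to become available
    PROVISIONING = auto()  # GPUs identified, worker cluster being configured
    ACTIVE = auto()        # GPUs allocated to the slice and in use
    COMPLETED = auto()     # duration elapsed or request released
    FAILED = auto()        # provisioning or allocation error

# A simple allowed-transition table the manager could enforce.
TRANSITIONS = {
    GPRState.PENDING: {GPRState.PROVISIONING, GPRState.FAILED},
    GPRState.PROVISIONING: {GPRState.ACTIVE, GPRState.FAILED},
    GPRState.ACTIVE: {GPRState.COMPLETED},
}
```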
Priority Queue
A priority queue is used to store GPRs based on the priorities assigned to them. The priority queue supports time- and space-efficient operations to get, insert, update, delete, and rearrange elements in the queue.
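A minimal sketch of such a queue, assuming a min-heap keyed on (priority, arrival order) with lazy deletion to keep updates and removals efficient; the class and method names are illustrative only.

```python
import heapq
import itertools

class GPRPriorityQueue:
    """Min-heap of GPRs keyed on (priority, arrival order), with lazy deletion."""

    _REMOVED = object()                       # sentinel marking deleted entries

    def __init__(self):
        self._heap = []                       # heap entries: [priority, seq, request_id]
        self._entries = {}                    # request_id -> heap entry, for O(1) lookup
        self._seq = itertools.count()         # tie-breaker preserves arrival order

    def insert(self, request_id: str, priority: int) -> None:
        if request_id in self._entries:
            self.delete(request_id)           # update = delete old entry + re-insert
        entry = [priority, next(self._seq), request_id]
        self._entries[request_id] = entry
        heapq.heappush(self._heap, entry)

    update = insert                           # same mechanics as insert

    def delete(self, request_id: str) -> None:
        entry = self._entries.pop(request_id)
        entry[-1] = self._REMOVED             # mark lazily; skipped when popped

    def pop_next(self) -> str | None:
        """Remove and return the highest-priority (lowest value) pending GPR."""
        while self._heap:
            priority, seq, request_id = heapq.heappop(self._heap)
            if request_id is not self._REMOVED:
                del self._entries[request_id]
                return request_id
        return None
```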
Dynamic GPU Provisioning in a Slice
The GPR manager periodically checks with the Inventory and Queue managers to get the next GPR allocation for provisioning. After the GPR is allocated, the resources needed to provision the request are identified. The GPR manager works with the worker cluster EGS component, the aiOps Operator, to complete the provisioning: it creates the appropriate CRs for the aiOps Operator to provision the GPU nodes into the worker cluster Slice VPC.
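The control flow could be sketched roughly as the loop below. The queue and inventory interfaces, the CR kind, and the field names are all assumptions for illustration, not the actual EGS CRDs.

```python
import time

def provisioning_loop(queue, inventory, create_custom_resource, poll_seconds=30):
    """Illustrative reconcile loop: take the next pending GPR, match it to free
    GPU nodes from the inventory, and hand off a CR-like payload for the
    worker-side operator to act on. All interfaces here are assumptions."""
    while True:
        request = queue.pop_next()                       # next pending GPR, or None
        if request is not None:
            nodes = inventory.find_free_nodes(request)   # hypothetical capacity lookup
            if nodes:
                # Field names below are illustrative, not an actual EGS CRD schema.
                create_custom_resource({
                    "kind": "GPUProvisioningRequest",
                    "spec": {
                        "requestId": request.request_id,
                        "sliceName": request.slice_name,
                        "nodes": nodes,
                    },
                })
            else:
                queue.requeue(request)                   # no capacity yet; try again later
        time.sleep(poll_seconds)
```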
In single-cluster EGS deployments, the EGS control plane components and the EGS worker components (the aiOps Operator) are deployed in the same cluster in different namespaces.
GPU Inventory Schedule Management
The EGS inventory schedule management service maintains details about the GPU nodes and GPU devices and their attributes, along with the network and other related configuration information. It also maintains a detailed schedule of the GPU nodes.
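The records below sketch what such an inventory entry might hold, pairing node attributes with a simple reservation schedule. The field names and the overlap check are illustrative assumptions, not the EGS inventory format.

```python
from dataclasses import dataclass, field

@dataclass
class ScheduleEntry:
    """A reservation window for a GPU node (hypothetical fields)."""
    request_id: str
    start_hour: int
    end_hour: int

@dataclass
class GPUNode:
    """Illustrative inventory record for a GPU node and its schedule."""
    node_name: str
    cluster: str
    gpu_type: str                     # e.g. "A100-80GB"
    gpu_count: int
    network: str                      # network/configuration detail kept with the node
    schedule: list[ScheduleEntry] = field(default_factory=list)

    def is_free(self, start_hour: int, end_hour: int) -> bool:
        """True if no existing reservation overlaps the requested window."""
        return all(e.end_hour <= start_hour or e.start_hour >= end_hour
                   for e in self.schedule)
```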
AI Workload/GPU Observability
The EGS User portal offers a detailed view of the AI workloads that are running in the user workspace (namespaces). In addition, a user-focused dashboard shows key metrics across slices, workloads, and GPUs.
GPU Monitoring and Remediation
The EGS worker aiOps Operator continuously monitors key metrics, such as power, temperature, and utilization, for all GPU nodes across all slices. It generates events, alerts, and notifications for any metric threshold violations.
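A simplified version of such a threshold check might look like the following. The metric names and limits are placeholder assumptions, not actual EGS defaults.

```python
# Hypothetical thresholds; real limits depend on the GPU model and site policy.
THRESHOLDS = {"power_watts": 350, "temperature_c": 85, "utilization_pct": 95}

def check_gpu_metrics(node_name: str, metrics: dict) -> list[str]:
    """Compare sampled metrics against thresholds and return alert messages."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = metrics.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{node_name}: {metric}={value} exceeds threshold {limit}")
    return alerts

# Example: a single sample that violates the temperature threshold.
print(check_gpu_metrics("gpu-node-3",
                        {"power_watts": 300, "temperature_c": 90, "utilization_pct": 60}))
```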
Multi-Cloud, Multi-Cluster Slice VPC
The EGS Control Plane provides workflows to register one or more public cloud (or edge or data center) clusters with a project. EGS Admins can then create a slice that spans multiple clusters, and users can take advantage of a workspace (namespaces) that spans those clusters. Users can then decide, based on their requirements, to submit GPU provisioning requests in any of the associated clusters.
EGS provides GPU resource management and visibility across multiple clusters.
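As a sketch of that cross-cluster visibility, free GPU capacity could be summarized per cluster before a user decides where to place a request. The node shape assumed below is hypothetical, not an EGS API.

```python
def gpu_availability_by_cluster(nodes) -> dict[str, int]:
    """Aggregate free GPU counts per cluster so a user can pick where to place a GPR.

    `nodes` is an iterable of objects with `cluster`, `gpu_count`, and `allocated`
    attributes (an illustrative shape, not an actual EGS interface)."""
    summary: dict[str, int] = {}
    for node in nodes:
        free = node.gpu_count - node.allocated
        summary[node.cluster] = summary.get(node.cluster, 0) + free
    return summary
```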