Understand the User Personas
EGS supports two personas: Admin and User.
Admin
The Admin is responsible for installing, configuring, and managing the Elastic GPU Service (EGS) platform. They primarily use the Admin Portal to perform operations but can also leverage YAML-based (manifest) workflows for automation and integration into CI/CD or MLOps pipelines.
An admin's responsibilities include:
- Platform Deployment and Registration
  - Deploy EGS on existing GPU clusters using Helm charts.
  - Configure EGS to work with existing Kubernetes clusters.
  - Register existing GPU clusters with EGS.
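The Helm-based deployment above is typically customized through a values file. The sketch below is purely illustrative: the keys, registry, and endpoint are assumptions, not the actual EGS chart schema, so consult the chart's own `values.yaml` for the real fields.

```yaml
# Illustrative Helm values for deploying EGS on an existing GPU cluster.
# Every key below is an assumption for illustration only; the real EGS
# chart defines its own schema.
global:
  imageRegistry: docker.io/example-registry   # hypothetical image registry
controller:
  endpoint: https://egs-controller.example.com  # hypothetical controller endpoint
cluster:
  name: gpu-cluster-01    # name under which this GPU cluster registers with EGS
  cloudProvider: on-prem
```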
- Workspace (Slice) Management
  - Create, update, or delete workspaces (slices) to manage GPU resources.
  - Define workspace-specific GPU quotas, allowed GPU types, and policies (for example, preemption and max idle time).
  - Assign workspaces to teams, departments, or individual users.
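A workspace definition of this kind can be expressed as a manifest. The sketch below is an assumption for illustration: the API group, kind, and field names are hypothetical, not the exact EGS custom resource schema.

```yaml
# Hypothetical workspace (slice) manifest -- field names are
# illustrative, not the exact EGS CRD schema.
apiVersion: egs.example.io/v1alpha1
kind: Workspace
metadata:
  name: ml-research
spec:
  teams:
    - research-team        # team assigned to this workspace
  gpuQuota: 8              # maximum GPUs the workspace may hold at once
  allowedGpuTypes:
    - nvidia-a100
    - nvidia-h100
  policies:
    preemption: enabled    # lower-priority workloads can be preempted
    maxIdleTime: 30m       # reclaim GPUs idle longer than this
```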
- GPU Resource Provisioning
  - Manage GPU inventory across all registered clusters.
  - View available GPUs, their specifications, and current allocations.
- Monitoring and Maintenance
  - Monitor GPU resource usage, performance metrics, and workload statistics.
  - Set up alerts for resource utilization thresholds or performance issues.
  - Perform routine maintenance tasks such as updating EGS versions, applying security patches, and managing cluster health.
Admins can also use APIs for workspace creation, GPR management, and other administrative tasks. For more information, see the API Reference.
User
The User is typically a Data Scientist, ML Engineer, or Researcher who needs GPU resources for training, fine-tuning, or running inference workloads. They work within the limits and permissions defined by the Admin.
A user's capabilities include:
- Workspace Access
  - Access assigned workspaces to utilize GPU resources.
- GPU Inventory
  - Check available GPU inventory and specifications (model, memory, architecture).
- GPU Provisioning Requests (GPRs)
  - Submit GPRs to reserve GPU resources for workloads.
  - Specify workload details such as the number of GPUs, GPU type (for example, A100 or H100), and runtime requirements.
  - Track approval status and resource allocation in real time.
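A GPR submission along these lines might look like the following manifest. The kind and field names here are assumptions for illustration, not the exact EGS API.

```yaml
# Hypothetical GPU Provisioning Request -- field names are
# illustrative, not the exact EGS API.
apiVersion: egs.example.io/v1alpha1
kind: GpuProvisioningRequest
metadata:
  name: llm-finetune-gpr
  namespace: ml-research     # submitted within the user's assigned workspace
spec:
  numberOfGpus: 4            # how many GPUs the workload needs
  gpuType: nvidia-h100       # requested GPU model
  duration: 6h               # runtime window after which GPUs are released
  priority: high
```

Once applied, a request like this would be tracked through its approval and allocation status, matching the real-time tracking described above.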
- Workload Monitoring
  - View AI workload performance dashboards.
  - Analyze GPU metrics such as utilization, memory consumption, temperature, and running jobs.
  - Troubleshoot workload performance issues using the provided telemetry.
- Lifecycle Management
  - Start, stop, or release GPU resources as needed.
  - Modify or cancel GPRs as workload requirements evolve.
  - Deploy and manage inference endpoints for trained models.
Users can also interact with EGS programmatically through APIs to submit GPRs or retrieve workload metrics. For more information, see the API Reference.