Version: 1.17.0

Release Notes for EGS Version 1.17.2

Release Date: 11th June 2026

The Elastic GPU Service (EGS) platform is designed to optimize GPU utilization and efficiency for your AI projects. EGS builds on Kubernetes to provide GPU resource management, GPU provisioning, and GPU fault identification.

We continue to add new features and enhancements to EGS.

These release notes describe the changes and enhancements in this version.

info
  • Across our documentation, we refer to the workspace as the slice workspace. The two terms are used interchangeably.
  • The EGS Controller is also referred to as the KubeSlice Controller in some diagrams and in the YAML files.
  • The EGS Admin Portal is also referred to as the KubeSlice Manager (UI) in some diagrams and in the YAML files.

What's New 🔈

GPU Scheduling and Capacity Management

GPU scheduling and capacity management now provide improved resource allocation, workload placement, and overall platform efficiency.

Key enhancements:

  • GPU wait-time estimates are now more accurate and reliable. Enhancements in capacity tracking, resource forecasting, and request prioritization provide better visibility into when GPU resources are expected to become available.

  • GPU reservation handling has been enhanced to improve reliability in environments with high levels of concurrent activity. This helps ensure GPU resources are allocated consistently and accurately, even during periods of heavy demand.

  • GPU availability reporting has been improved to provide a more accurate view of available resources across the platform. These enhancements support better scheduling decisions and more predictable workload placement.

Workload Placement and Redistribution

Workload placement and redistribution now provide improved stability, reliability, and visibility when managing workloads across clusters and GPU resources.

Key enhancements:

  • Workload redistribution is now more stable, and workloads are less likely to become stuck during migration or recovery operations.

  • Workload placement now includes a history of recent redistribution events, providing greater visibility into workload movement and helping operators troubleshoot placement decisions more effectively.

  • Resource reclamation and workload eviction now more consistently honor workload priorities, helping ensure that critical workloads receive access to GPU resources when capacity is constrained.

  • Workloads that are moved between clusters now recover more reliably, reducing the need for manual intervention and improving overall workload availability.
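
To illustrate what honoring workload priorities during reclamation can look like, here is a minimal sketch of one possible selection policy. It is not the EGS implementation. The Workload record, its fields, the "higher number means more important" convention, and the select_evictions function are all hypothetical names assumed for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Hypothetical record; the real EGS workload objects are not described here.
    name: str
    priority: int  # assumption: a higher value means a more important workload
    gpus: int      # number of GPUs the workload currently holds


def select_evictions(running, needed_gpus, requester_priority):
    """Pick workloads to evict so that at least `needed_gpus` GPUs are freed.

    Only workloads with a strictly lower priority than the requester are
    eligible, and the lowest-priority ones are chosen first. If the demand
    cannot be met without touching equal or higher priorities, nothing is
    evicted and an empty list is returned.
    """
    if needed_gpus <= 0:
        return []
    eligible = sorted(
        (w for w in running if w.priority < requester_priority),
        # Lowest priority first; among equals, prefer the larger GPU holder
        # so that fewer workloads are disturbed.
        key=lambda w: (w.priority, -w.gpus),
    )
    chosen, freed = [], 0
    for workload in eligible:
        chosen.append(workload)
        freed += workload.gpus
        if freed >= needed_gpus:
            return chosen
    return []  # not enough lower-priority capacity; evict nothing
```

In this sketch, a request never displaces a workload of equal or higher priority, which is one straightforward way to keep critical workloads running when capacity is constrained.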

GPU Node Management

GPU node management now provides improved resource utilization, reduced idle time, and optimized GPU provisioning and cleanup processes.

Key enhancements:

  • GPU node idle detection has been enhanced to more accurately identify idle resources, helping organizations optimize GPU utilization and reclaim resources.

  • Resource cleanup processes have been enhanced to ensure that GPU nodes are promptly and consistently returned to the available resource pool.

MIG Enhancements

MIG (Multi-Instance GPU) support has been enhanced to improve scheduling, resource matching, and policy enforcement for MIG-enabled workloads.

Key enhancements:

  • MIG-enabled workloads now benefit from more accurate resource matching and scheduling decisions, resulting in improved placement consistency and resource utilization.

  • GPU shape validation is now applied consistently across all scheduling paths, improving policy enforcement and workload placement accuracy.

Workload Scheduling Improvements

Scheduling controls are now applied more precisely, ensuring workload scheduling actions affect only the intended resources and workloads.

Automatic Use of Latest Helm Chart Versions

Platform deployments now automatically use the latest available Helm chart versions, simplifying upgrades and ensuring access to the most recent fixes and enhancements.

API and Platform Reliability

This release improves the reliability and stability of the platform's APIs and core services:

  • The platform now provides clearer responses for temporary provisioning delays, enabling automated systems to recover more effectively and reducing operational overhead.

  • Cleanup workflows have been improved to better handle previously completed or released resources, reducing unnecessary errors and improving operational reliability.

  • Additional validation checks help ensure workloads are configured correctly before deployment, reducing configuration-related issues and improving deployment success rates.
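
As an example of how an automated client might take advantage of clearer responses for temporary provisioning delays, the sketch below retries an operation with capped exponential backoff and random jitter. It makes no claim about the actual EGS API: the TransientProvisioningDelay exception is a hypothetical stand-in for whatever "temporary delay" signal the platform returns, and call_with_backoff and its parameters are names assumed for this example.

```python
import random
import time


class TransientProvisioningDelay(Exception):
    """Hypothetical error standing in for a 'temporary delay' response."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Run `operation`, retrying only on TransientProvisioningDelay.

    The wait before each retry is drawn uniformly from 0 up to an
    exponentially growing bound capped at `max_delay`. Any other exception
    propagates immediately, and the transient error is re-raised once
    `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientProvisioningDelay:
            if attempt == max_attempts:
                raise
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
```

The sleep parameter is injectable so that the retry logic can be exercised in tests without actually waiting.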

Security Updates

This release includes multiple security enhancements and dependency upgrades across the platform, including:

  • Upgrade to the latest supported Go runtime.
  • Updated gRPC libraries across platform services.
  • Hardened API Gateway container images.
  • Remediation of multiple Critical, High, and Medium severity vulnerabilities.

These updates strengthen the platform's overall security posture and reduce exposure to known vulnerabilities.

Known Issues

The platform currently includes several moderate-severity vulnerabilities inherited through third-party Kubernetes client dependencies. Resolving these vulnerabilities requires upgrading to a newer major version of the Kubernetes client library that introduces breaking API changes. This upgrade is planned for a future major release.

We are working on a comprehensive upgrade plan to address these vulnerabilities while minimizing disruption to existing users. In the meantime, we recommend that users follow best practices for securing their Kubernetes environments and monitor for potential security issues related to these dependencies.