Version: 1.11.0

Frequently Asked Questions

This topic answers frequently asked questions about EGS installation and configuration.

How can you prevent a job from one EGS workspace from accessing the GPU memory of another workspace?

Existing Protection Available from Kernel

Each major GPU vendor provides APIs that allow a job to interact securely with the GPUs assigned to it. API toolkits such as the NVIDIA Container Toolkit and AMD ROCm enable jobs to use different GPUs without interfering with each other, even if the GPUs are on the same node. These APIs allow a job in a container to interface only with the GPUs assigned to that container, as enforced by Kubernetes and the GPU driver.
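As a rough illustration of that scoping, the NVIDIA Container Toolkit reads the `NVIDIA_VISIBLE_DEVICES` environment variable to decide which GPUs to expose inside a container. The sketch below parses that variable from inside a job to report what the container was given. The helper name `parse_visible_devices` is hypothetical, and whether the variable is populated this way depends on how the device plugin is configured on your cluster.

```python
import os


def parse_visible_devices(value):
    """Return the GPU identifiers listed in an NVIDIA_VISIBLE_DEVICES value.

    Hypothetical helper for illustration only. Returns None when the value
    is "all" (every GPU on the node is exposed) and an empty list when no
    GPU is exposed ("", "none", "void", or the variable is unset).
    """
    if value is None:
        return []
    value = value.strip()
    if value == "all":
        return None
    if value in ("", "none", "void"):
        return []
    # Otherwise the value is a comma-separated list of GPU indices or UUIDs.
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    devices = parse_visible_devices(os.environ.get("NVIDIA_VISIBLE_DEVICES"))
    if devices is None:
        print("All GPUs on the node are exposed to this container")
    else:
        print(f"{len(devices)} GPU(s) exposed: {devices}")
```

Running this inside two jobs on the same node should report different device lists if each job was assigned its own GPUs.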

In addition, EGS influences how Kubernetes assigns GPUs to containers. EGS applies taints to nodes and tolerations to jobs to control which jobs can run on which GPU nodes. Kubernetes is responsible for binding unique GPUs to a container based on its resource specification. After a GPU is assigned to a job, the application interacts with it through the same driver API described above: the job is given a GPU context within the vendor's API toolkit and is scoped to interact only with the GPUs assigned to it.
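The taint, toleration, and resource-specification mechanics can be sketched as a plain Kubernetes Pod manifest. The example below builds one as a Python dictionary and prints it as JSON. EGS applies its own taints and tolerations, so the taint key and value, the image, and the function name `build_gpu_pod` are all hypothetical placeholders. The resource name `nvidia.com/gpu` is the one advertised by the NVIDIA device plugin and would differ for other vendors.

```python
import json


def build_gpu_pod(name, image, gpu_count, taint_key, taint_value):
    """Build a Pod manifest (as a dict) that requests GPUs and tolerates a taint.

    Illustrative only: taint_key and taint_value are placeholders, not the
    taints that EGS itself applies.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Kubernetes binds this many unique GPUs to the container.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
            # Allows scheduling onto nodes that carry the matching taint.
            "tolerations": [
                {
                    "key": taint_key,
                    "operator": "Equal",
                    "value": taint_value,
                    "effect": "NoSchedule",
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = build_gpu_pod(
        name="train-job",
        image="example.com/train:latest",  # hypothetical image
        gpu_count=2,
        taint_key="example.com/workspace",  # hypothetical taint key
        taint_value="team-a",  # hypothetical taint value
    )
    print(json.dumps(pod, indent=2))
```

A pod without the matching toleration is not scheduled onto a node that carries the `NoSchedule` taint, which is the general mechanism that keeps jobs off GPU nodes they are not meant to use.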

Additional Protection Available from EGS

EGS provides an option to isolate nodes to specific workspaces. This prevents jobs from other workspaces from being scheduled on these nodes.

Can EGS help with defragmentation of GPUs allocated to a specific job?

If a job is using GPUs across multiple nodes, EGS can defragment the allocation by consolidating the job onto a single node when possible.

Can EGS help with defragmentation of GPUs across nodes?

EGS supports consolidating GPUs onto fewer nodes when possible.

Where can I find the API documentation for EGS?

For detailed API documentation, see the EGS API reference.

Why does EGS installation fail on a cluster where KServe is already installed?

Installing EGS with the installer script fails on a cluster where KServe is already installed, because the script also attempts to install KServe in an EGS namespace. To avoid the conflict, disable KServe in the EGS installer YAML file before running the script on that cluster. Afterward, if you want EGS to use the existing KServe, you must configure it accordingly.