Version: 1.13.0

Manage GPU Requests

This topic outlines the steps to create a GPU provision request (GPR), manage it, and early-release the GPU nodes.

GPUs are not assigned to a workspace by default. Use the portal to create a GPU provision request and to run AI workloads (in the namespaces that are associated with the workspace) that require one or more GPUs.

The following are the GPU Provision Request (GPR) features:

Users can create one or more GPU provision requests as needed.
Only one GPR can be provisioned to a workspace at any given time.
Each GPR includes defined entry and exit times for GPU nodes from a workspace.
GPU nodes are isolated per workspace to ensure dedicated access.
GPUs assigned to a workspace cannot be accessed by other users or workspaces.
Users can manage GPRs through the portal and see the estimated wait time for GPUs.
Users can delete or edit a GPR before it is provisioned.
If GPUs are no longer needed, users can early-release their GPR to free up resources.

info

Across our documentation, we refer to the workspace as the slice workspace. The two terms are used interchangeably.

Create a GPU Request

Go to GPU Requests on the left sidebar.
On the GPU Requests page, select the workspace for which you want to request GPU allocation.
On the top-right corner, click the Create GPU Request button.

note
Users can create a GPU request using the available GPR templates or by manually entering the GPR configuration. To manually configure a GPU, skip Step 4 and Step 5, and proceed to [Manual GPU Configuration](#manual-gpu-configuration).

Auto GPU Configuration

On the Create GPU Request pane, click Select Template on the top-right corner to select the available GPR template (GPR configuration) to create a GPU request.
Select the available template and click Apply Template.

info

To view the templates assigned to a workspace, see View GPR Templates. If the available templates do not have the required configuration, you can manually configure a GPU.

Manual GPU Configuration

On the Create GPU Request pane, enter the following information to configure the GPU request:
1. For cluster selection, select the cluster from the Cluster drop-down list.
2. For GPU configuration, enter the following information:
  1. Enter the GPR request name in the Request Name text box.
  2. Select the node type from the Node Type drop-down list.
  3. The GPU Shape, Memory (GB) per GPU, GPU Per Node, and GPU Nodes values are populated automatically after you select the node type. These values cannot be changed.
3. For priority configuration of the requesting GPU, enter the following information:
  1. The default value for User is the user name. The user name is the name of the workspace user.
  2. The Priority and the Priority Number are automatically assigned and cannot be modified by the user.
  3. Enter the amount of time for which you want to reserve the GPU resource (in days/hours/mins) in the Reserve for text boxes. The values must be greater than zero for days/hours/mins.
    
    note
    The admin chooses the priority for each workspace. As the user, you are not allowed to modify the priority for a request. If you are seeing a long wait time and would like your request to be processed earlier, you may request your admin to promote the current GPU request priority to a higher priority.

Advanced GPR Configuration

info

If the available GPR templates have the advanced configuration, skip Step 7.

For advanced configuration, enter the following information:
1. Enter the idle timeout duration (in days/hours/mins) in the Idle Timeout text boxes for the reserved resource. The values must be greater than zero for days/hours/mins. The idle timeout duration must be less than the duration of the reservation.
2. (Optional) Enforce Idle Timeout: After the idle timeout is set, the Enforce Idle Timeout check box is enabled. Select this check box to enforce the idle timeout setting.
3. (Optional) Select the Requeue on Failure check box if you want to queue this GPR in case it fails.
4. Only the admin can select Evict low priority GPRs to configure auto eviction of a low-priority GPR.
Click the Get Wait Time button. EGS shows the estimated wait time for provisioning the GPU nodes.
Click the Request GPUs button.
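
The duration rules in the steps above can be checked before you fill in the form. The following is a minimal sketch only, not part of EGS or the portal: the function names `to_duration` and `check_gpr_durations` are hypothetical, and it assumes that "greater than zero" applies to the combined days/hours/mins duration.

```python
from datetime import timedelta
from typing import List, Optional


def to_duration(days: int = 0, hours: int = 0, minutes: int = 0) -> timedelta:
    """Combine the days/hours/mins text-box values into one duration."""
    if min(days, hours, minutes) < 0:
        raise ValueError("days, hours, and minutes cannot be negative")
    return timedelta(days=days, hours=hours, minutes=minutes)


def check_gpr_durations(reserve_for: timedelta,
                        idle_timeout: Optional[timedelta] = None) -> List[str]:
    """Return a list of problems; an empty list means the values look valid.

    Assumption: the combined duration (not each field) must exceed zero.
    """
    problems = []
    zero = timedelta(0)
    if reserve_for <= zero:
        problems.append("Reserve for must be greater than zero")
    if idle_timeout is not None:
        if idle_timeout <= zero:
            problems.append("Idle Timeout must be greater than zero")
        elif idle_timeout >= reserve_for:
            # The idle timeout must be shorter than the reservation.
            problems.append("Idle Timeout must be less than Reserve for")
    return problems
```

For example, a 4-hour reservation with a 30-minute idle timeout passes both checks, while a 2-hour idle timeout on a 1-hour reservation is reported as a problem.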

info

The status of the GPR changes to Queued while the GPU node allocation is in the queue.
The status of the GPR changes to Running after the queued GPUs are provisioned successfully.
The status of the GPR changes to Released Early if the GPR is released earlier than the scheduled time.
The status of the GPR changes to Completed when the GPR completes its scheduled time.
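
The four statuses above can be read as a simple lifecycle. The sketch below is illustrative only: the status names come from this topic, but the `GprStatus` enum, the `TRANSITIONS` table, and the exact set of transitions are assumptions inferred from the descriptions, and the portal may expose other states that are not covered here.

```python
from enum import Enum


class GprStatus(Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    RELEASED_EARLY = "Released Early"
    COMPLETED = "Completed"


# Assumed transitions, inferred from the status descriptions in this topic.
TRANSITIONS = {
    GprStatus.QUEUED: {GprStatus.RUNNING},
    GprStatus.RUNNING: {GprStatus.RELEASED_EARLY, GprStatus.COMPLETED},
    GprStatus.RELEASED_EARLY: set(),  # terminal
    GprStatus.COMPLETED: set(),       # terminal
}


def can_transition(current: GprStatus, target: GprStatus) -> bool:
    """Return True if the assumed lifecycle allows current -> target."""
    return target in TRANSITIONS[current]
```

Under these assumptions, a Queued GPR can only move to Running, and a Running GPR ends as either Released Early or Completed.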

View GPU Requests

The user can manage the GPRs in their workspaces.

To view a GPU request:

Go to GPU Requests on the left sidebar.
On the GPU Requests page, select the workspace whose GPU requests you want to view.
For the selected workspace, select the GPR to view GPR details.

Manage GPR Queues

The user can manage the GPRs that are on their workspace GPR queue.

The following operations can be performed:

The user can delete a pending GPR. This will remove the GPR from the queue.
The user can early-release a provisioned GPR. This will end the GPR early (early exit of GPU nodes).
The user can edit a pending GPR.
The user can extend a GPR with a small grace period.
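
The operations above depend on the state of the GPR. As a rough sketch, the mapping can be written as a lookup table. The names `ALLOWED_OPERATIONS` and `allowed_operations` are hypothetical, and the table assumes that a "pending" GPR has the Queued status, a "provisioned" GPR has the Running status, and extension applies to a Running GPR; this topic does not state the last point explicitly.

```python
from typing import Dict, FrozenSet

# Assumed mapping from GPR status to the operations a user can perform.
ALLOWED_OPERATIONS: Dict[str, FrozenSet[str]] = {
    "Queued": frozenset({"delete", "edit"}),
    # Assumption: extending with a grace period applies to a running GPR.
    "Running": frozenset({"early-release", "extend"}),
}


def allowed_operations(status: str) -> FrozenSet[str]:
    """Return the user operations assumed to be valid for a GPR status."""
    return ALLOWED_OPERATIONS.get(status, frozenset())
```

Statuses that are not in the table, such as Completed or Released Early, return an empty set because the GPR has already ended.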

Edit GPU Requests

To edit the GPU request:

On the GPU Requests page, select the GPU request you want to edit.
On the top-right, click the Actions button and select Edit. You can edit only the GPU request name.

Edit the request name and click the Update button.

Early Release the GPU Nodes

If you no longer need the GPU nodes associated with the workspace, you can early-release the provisioned GPR from the portal.

To early-release the provisioned GPU nodes:

On the GPU Requests page, select the request you want to release.
On the top-right, click the Actions button and select Early Release.
Enter RELEASE to confirm the early release of the nodes.

warning

After the GPR is early-released, the GPU nodes are no longer available for any AI workloads running on the workspace. Any running workloads (such as pods) that use GPUs on those nodes go into a pending state.
