Version: 1.11.0

Manage GPU Requests

This topic describes how to create a GPU provision request (GPR), manage GPRs, and early-release GPU nodes.

GPUs are not assigned to a slice workspace by default. Use the portal to create GPU provision requests so that AI workloads that require one or more GPUs can run in the namespaces associated with the slice workspace.

The following are the GPU Provision Request (GPR) features:

  • The user can create one or more GPU provision requests
  • Only one GPR is provisioned onto a slice workspace at a given time
  • A GPR has strict entry and exit times for GPU nodes on a slice workspace
  • GPU nodes are isolated per slice workspace
  • Other slice workspaces (or users) cannot inadvertently use the GPUs allocated to the user's slice workspace
  • It provides a self-service mechanism for GPU provision requests
  • It provides visibility into the wait time for GPUs
  • The user can delete or edit a GPR before it is provisioned
  • The user can early-release a GPR if they no longer need the GPUs in their slice workspace
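
Taken together, these features describe a GPR as a named reservation with timing, priority, and status fields. The following minimal Python sketch models that shape for orientation only; the class and field names are hypothetical and are not the EGS API.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class GPUProvisionRequest:
    """Illustrative model of a GPR as described above (hypothetical names, not the EGS API)."""
    request_name: str
    cluster: str
    node_type: str                # GPU shape, memory, and node counts derive from this
    reserve_for: timedelta        # must be greater than zero
    idle_timeout: timedelta       # must be less than reserve_for
    priority: int                 # assigned by the admin; not user-editable
    status: str = "Queued"        # Queued -> Provisioned -> Completed / Released Early
```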

Create a GPU Request

  1. Go to GPU Requests on the left sidebar.

  2. On the GPU Requests page, go to the GPU Requests Per Slice tab and select the slice workspace for which you want to create a GPU request.

  3. On the top-right corner, click the Create GPU Request button.

  4. On the Create GPU Request page, enter the following information to configure the GPU request:

    1. For cluster selection, select the cluster from the Cluster drop-down list.

    2. For GPU configuration, enter the following information:

      1. Enter the GPR name in the Request Name text box.

      2. Select the node type from the Node Type drop-down list.

      3. The GPU shape, Memory (GB) per GPU, GPU Per Node, and GPU Nodes values are populated automatically after you select the node type. These values are immutable.

    3. For the priority configuration of the GPU request, enter the following information:

      1. The User field defaults to the user name, which is the name of the slice workspace user.

      2. The Priority and the Priority Number are automatically assigned and cannot be modified by the user.

      3. Enter the amount of time for which you want to reserve the GPU resource (in days/hours/mins) in the Reserve for text boxes. The value must be greater than zero for days/hours/mins.

        note

        The admin chooses the priority for each slice workspace. As the user, you are not allowed to modify the priority for a request. If you are seeing a long wait time and would like your request to be processed earlier, you may request your admin to promote the current GPU request priority to a higher priority.


    4. For advanced configuration, enter the following information:

      1. Enter the idle timeout duration (in days/hours/mins) for the reserved resource in the Idle Timeout text boxes. The value must be greater than zero for days/hours/mins and is always less than the duration of the reservation, as shown in the validation sketch after this procedure.

      2. (Optional) Enforce Idle Timeout: After the idle timeout is set, the Enforce Idle Timeout check box is enabled. Select this check box to enforce the idle timeout setting.

      3. (Optional) Select the Requeue on Failure check box if you want this GPR to be requeued in case it fails.

      4. Only the admin can select Evict low priority GPRs to configure auto eviction of a low-priority GPR.


  5. Click the Get Wait Time button. EGS shows the estimated wait time for provisioning the GPU nodes.


  6. Under Requested GPUs, select the GPU whose estimated wait time is acceptable.

  7. Click the Request GPUs button.
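
The Reserve for and Idle Timeout steps above impose two timing rules: each duration must be greater than zero, and the idle timeout must be shorter than the reservation. A minimal Python sketch of those checks (illustrative only; the portal performs this validation itself, and the function name is hypothetical):

```python
from datetime import timedelta

def validate_gpr_timings(reserve_for: timedelta, idle_timeout: timedelta) -> None:
    """Check the two timing rules from the steps above (illustrative)."""
    if reserve_for <= timedelta(0):
        raise ValueError("Reserve for must be greater than zero (days/hours/mins).")
    if idle_timeout <= timedelta(0):
        raise ValueError("Idle Timeout must be greater than zero (days/hours/mins).")
    if idle_timeout >= reserve_for:
        raise ValueError("Idle Timeout must be less than the reservation duration.")

# Example: a 2-day reservation with a 6-hour idle timeout passes both checks.
validate_gpr_timings(timedelta(days=2), timedelta(hours=6))
```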

info
  • The status of the GPR changes to Queued if the GPU node allocation is queued.
  • The status of the GPR changes to Provisioned if the queued GPUs are provisioned successfully.
  • The status of the GPR changes to Released Early if the GPR is released earlier than the scheduled time.
  • The status of the GPR changes to Completed if the GPR completes its scheduled time.
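
These four statuses trace the GPR lifecycle. A compact summary of the transitions implied above (an illustrative table, not an EGS API):

```python
# Illustrative summary of the GPR status transitions described above.
GPR_TRANSITIONS = {
    "Queued":         ["Provisioned"],                  # waiting for GPU node allocation
    "Provisioned":    ["Completed", "Released Early"],  # GPU nodes are on the slice workspace
    "Released Early": [],                               # released before the scheduled time (terminal)
    "Completed":      [],                               # reservation ran to its scheduled time (terminal)
}
```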

View GPU Requests

The user can manage the GPRs in their slice workspaces.

To view a GPU request:

  1. Go to GPU Requests on the left sidebar.

  2. On the GPU Requests page, go to the All GPR Requests tab and select the GPU request to view the details.


Manage GPR Queues

The user can manage the GPRs that are on their slice workspace GPR queue.

The following operations can be performed, as summarized in the sketch after this list:

  • The user can delete a pending GPR. This will remove the GPR from the queue.
  • The user can early-release a provisioned GPR. This will end the GPR early (early exit of GPU nodes).
  • The user can edit a pending GPR.
  • The user can extend a GPR with a small grace period.
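
A minimal sketch of which of these operations applies in which GPR state, under the assumption that pending covers GPRs that are not yet provisioned (the helper below is hypothetical, not part of EGS):

```python
# Which queue operations the list above allows, keyed by GPR state (illustrative).
ALLOWED_OPERATIONS = {
    "pending":     {"delete", "edit"},           # not yet provisioned
    "provisioned": {"early-release", "extend"},  # extend applies only within a small grace period
}

def can_perform(state: str, operation: str) -> bool:
    """Return True if the list above allows the operation for a GPR in the given state."""
    return operation in ALLOWED_OPERATIONS.get(state, set())

# Example: a pending GPR can be edited, but a provisioned GPR cannot.
assert can_perform("pending", "edit")
assert not can_perform("provisioned", "edit")
```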

Edit GPU Requests

To edit a GPU request:

  1. On the GPU Requests page, select the GPU request you want to edit.

  2. On the top-right, click the Actions button and select Edit. You can edit only the GPU request name.


  3. Edit the request name and click the Update button.

Early Release the GPU Nodes

If you no longer need the GPU nodes associated with the slice workspace, you can early-release the GPR. From the portal, you can perform an early release on a provisioned GPR.

To early-release the provisioned GPU nodes:

  1. On the GPU Requests page, select the request you want to release.

  2. On the top-right, click the Actions button and select Early Release.

  3. Enter RELEASE to confirm the early release of the nodes.
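
Step 3 follows the common type-to-confirm pattern for destructive actions: the literal string RELEASE must be typed before the release proceeds. A minimal command-line sketch of the same guard (hypothetical; the portal implements this in its own dialog):

```python
def confirm_early_release() -> bool:
    """Require the exact string RELEASE, mirroring the portal's confirmation dialog."""
    typed = input("Type RELEASE to confirm early release of the GPU nodes: ")
    return typed.strip() == "RELEASE"

if confirm_early_release():
    print("Early release confirmed.")  # the portal would now release the GPR
else:
    print("Confirmation text did not match; nothing was released.")
```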

warning

After the GPR is early-released, the GPU nodes are no longer available to any AI workloads running on the slice workspace. Any running workloads (pods, and so on) that were using GPUs on those nodes go into a Pending state.
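
To see which workloads were affected after an early release, you can list the pods stuck in the Pending phase. The sketch below uses the official Kubernetes Python client; it assumes your kubeconfig points at the worker cluster that hosted the GPU nodes, and you should filter the output to the namespaces associated with your slice workspace.

```python
from kubernetes import client, config  # pip install kubernetes

# Assumes your kubeconfig points at the worker cluster that hosted the GPU nodes.
config.load_kube_config()
v1 = client.CoreV1Api()

# Workloads that lost their GPU node after the early release sit in the Pending phase.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    # Narrow this output to the namespaces associated with your slice workspace.
    print(f"{pod.metadata.namespace}/{pod.metadata.name} is Pending")
```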