Version: 1.14.0

Manage an Inference Endpoint

This topic describes the steps to view, deploy, and delete Inference Endpoints for your slice workspaces.

info

Across our documentation, we refer to the workspace as the slice workspace. The two terms are used interchangeably.

View the Inference Endpoint

To view the Inference Endpoint:

  1. Go to Inference Endpoints on the left sidebar.

  2. On the Workspaces page, select the workspace whose Inference Endpoints you want to view.


  3. On the Inference Endpoints page, you see a list of Inference Endpoints for that workspace.


    [Figure: Inference Endpoint deployment for the workspace]

Deploy an Inference Endpoint

note

Only the admin has the privilege to add namespaces to a workspace. A non-admin user with access to a workspace must create the Inference Endpoint using a namespace name that the admin has already added to that workspace.

For example, if the admin adds the inference-1-gpu and inference-1-cpu namespaces to a workspace, then a non-admin user can only create Inference Endpoints named inference-1-gpu or inference-1-cpu.
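
If you have kubectl access to the worker cluster, a quick check such as the following confirms that the required namespaces already exist before you create the Inference Endpoint (the namespace names below are the illustrative ones from the example above):

    kubectl get namespace inference-1-gpu inference-1-cpu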

To deploy an Inference Endpoint on your workspace:

  1. Go to Inference Endpoints on the left sidebar.

  2. On the Workspaces page, go to the workspace on which you want to deploy an Inference Endpoint.

  3. On the Inference Endpoints page, click Deploy Inference Endpoint.


  4. On the Create Inference Endpoint pane, under Basic Specifications:

    1. Enter a name for the Inference Endpoint in the Endpoint Name text box.
    2. Select a standard model name from the Select Model drop-down menu. To populate the Select Model drop-down menu with standard model names, you must configure a ConfigMap that stores the list of standard models. For details, see Configure the ConfigMap for Selecting Standard Models.
  5. Under Cluster Specifications:

    1. (Optional) The Burst to available clusters checkbox is selected by default. Clear the checkbox to disable bursting. For more information, see Bursting Scenarios.

    2. Under Available Clusters, select the clusters, in your order of preference, on which you want to deploy the Inference Endpoint.

      warning

      If you try to create an Inference Endpoint without ensuring that a namespace with the same name exists, you get an error that says Failed to create namespace.

  6. Under Advanced Options, enter the specifications for model deployment. Under Model Specifications, enter the following:

    note

    The following are standard parameters for most model deployments. However, if these parameters do not meet your model requirements, select the Specify your own model configuration checkbox and enter your own model configuration.


    1. Enter a name in the Model Format Name text box.
    2. Add the storage URI in the Storage URI text box.
    3. Add the CPU value in the CPU text box.
    4. Add the Memory value in the Memory text box.
    5. Add the arguments in the Args text box.
    6. To add secret key-value pairs, click the plus sign next to Secret and add them.

    Own Model Configuration

    To add your own model configuration:

    1. Select the Specify your own model configuration checkbox.

    2. On the terminal screen, enter your model configuration YAML file from KServe. For more information, see KServe.
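
    A minimal sketch of the kind of KServe InferenceService YAML you might paste here; it reuses the sklearn example model that also appears in the ConfigMap examples later in this topic:

      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: sklearn-iris
      spec:
        predictor:
          model:
            modelFormat:
              name: sklearn
            storageUri: gs://kfserving-examples/models/sklearn/1.0/model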

  7. Under GPU Specifications, enter the following:

    info

    To deploy a model with a CPU-only specification, select the Create CPU-only Inference checkbox.


    1. Select a node type from the Node Type drop-down list. After you select the node type:

      • The GPU Shape and Memory per GPU values are auto-populated. These values are immutable.
      • The GPU Nodes and GPUs Per Node fields have default values. Edit these values as per your requirements.
      • The Reserve For field defaults to 365 days. The duration is in days/hours/mins format. Edit the number of days as per your requirements; the number of days must be less than 365.
      • The Priority and Priority Number fields have default values. Edit the values as per your requirements.
    2. Click the Create Inference Endpoint button. The status shows Pending before it changes to Ready.

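      If you have kubectl access to the worker cluster, you can also confirm that the underlying KServe InferenceService becomes Ready (the namespace below is the illustrative endpoint name from the note above):

        kubectl get inferenceservice -n inference-1-gpu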

Delete an Inference Endpoint

To delete an Inference Endpoint:

  1. On the Workspaces page, select a workspace for which you want to delete an Inference Endpoint.

  2. On the Inference Endpoints page, select the deployment name or click the right arrow next to the Inference Endpoint.

  3. Click the Delete button.

  4. Enter the name of the Inference Endpoint in the text box and click Delete.


Configure the ConfigMap for Selecting Standard Models

To ensure the model selection dropdown is populated correctly, follow these steps:

  1. Create the ConfigMap in the project namespace. For example, kubeslice-avesha is the project namespace.

  2. To identify the ConfigMap as a standard inference model specification, the label egs.kubeslice.io/type: model-specs must be added to the ConfigMap metadata.

  3. The data section of the ConfigMap must contain the key model-specs.yaml. This key holds the specification of the model in YAML format.

    The following are the three example ConfigMaps:

    Example 1: llama3-8b with Custom InferenceService and GPU Specifications

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: llama3-8b
      namespace: kubeslice-avesha
      labels:
        egs.kubeslice.io/type: model-specs
    data:
      model-specs.yaml: |
        specsType: "CUSTOM"
        rawModelSpecs: |
          kind: InferenceService
          metadata:
            name: huggingface-llama3
          spec:
            predictor:
              model:
                modelFormat:
                  name: huggingface
                args:
                  - --model_name=llama3
                  - --model_id=meta-llama/meta-llama-3-8b-instruct
                resources:
                  limits:
                    cpu: "6"
                    memory: 24Gi
                    nvidia.com/gpu: "1"
                  requests:
                    cpu: "6"
                    memory: 24Gi
                    nvidia.com/gpu: "1"
                env:
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: HF_TOKEN
                        optional: false
        gpuSpecs:
          memory: 24
          totalNodes: 1
          gpusPerNode: 1
    Example 2: sklearn-iris with GPU Specifications

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: sklearn-iris
      namespace: kubeslice-avesha
      labels:
        egs.kubeslice.io/type: model-specs
    data:
      model-specs.yaml: |
        specsType: "STANDARD"
        modelSpecs:
          modelFormat: "sklearn"
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
          args:
            - --model_name=sklearn-iris
            - --model_id=sklearn/iris
          secrets:
            SK_LEARN_TOKEN: xxx-yyyy-zzz
          cpu: "1"
          memory: "2Gi"
        gpuSpecs:
          memory: 6
          totalNodes: 1
          gpusPerNode: 1
    Example 3: sklearn without GPU

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: sklearn-iris-cpu
      namespace: kubeslice-avesha
      labels:
        egs.kubeslice.io/type: model-specs
    data:
      model-specs.yaml: |
        specsType: "STANDARD"
        modelSpecs:
          modelFormat: "sklearn"
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
  4. Use the following command to apply the ConfigMap:

    kubectl apply -f example-model-configmap.yaml
  5. Use the following command to verify that the ConfigMap is created correctly:

    kubectl get configmap -n kubeslice-avesha -l egs.kubeslice.io/type=model-specs

    This command lists all ConfigMaps with the required label in the specified namespace. If your ConfigMap appears, it is correctly configured for use in the model selection dropdown.
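
    To inspect the contents of a specific ConfigMap (using the Example 1 name), you can also run:

    kubectl get configmap llama3-8b -n kubeslice-avesha -o yaml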

Bursting Scenarios

  • When the Burst to Available Clusters option is enabled and no worker clusters are selected, the workload is assigned to the cluster with sufficient inventory and the shortest wait time.

  • When the Burst to Available Clusters option is disabled:

    • At least one cluster must be selected to process the workload.
    • For a single selected cluster, the user must specify the GPU Node Type.
    • For multiple selected clusters, the workload is assigned to the cluster with sufficient inventory and the shortest wait time among the selected clusters.