Version: 2.18.0

Bring Your Own Model

EGS Serverless is a cloud-based service where you can deploy your AI or ML model without managing any infrastructure. It lets you focus on your model, while the service handles all operational overhead. You simply deploy the model and start sending inference requests through a ready-to-use endpoint.

info

This feature is currently in Beta, and its performance and behavior may change without prior notice.

Prerequisites

  • To deploy a model, a valid payment method is required. For more information, see Add a Payment Method.

    info

    You will be prompted to add a payment method when you enter your model details if none is on file.

Deploy a Model from the Library

caution

To deploy a model, a valid payment method is required. For more information, see Add a Payment Method.

  1. Go to EGS Serverless on the left sidebar and click Bring Your Own Model.

  2. On the Bring Your Own Model page, choose a model from the library.


    The library provides the following models:

    • Llama3
    • MistralAI
    • YOLO-26
  3. On the chosen model tile, click Deploy.

  4. On the Deploy Model pane:

    1. Enter an endpoint in the Endpoint Name text box.

    2. Under Deployment Options, all options are enabled by default. You can disable or re-enable only Spot Deployment.

      The deployment options are:

      • Spot Deployment: Uses discounted but interruptible compute for cost-effective model hosting. The spot deployment is shared among users.
      • Auto Capacity Chasing: Automatically selects the best available compute (GPU/CPU) across instance types or zones to maximize availability and minimize cost.
      • Auto Scaling: Dynamically adjusts the number of model replicas based on traffic demand.
    3. Click Deploy Model.

      caution

      If you see Add payment method instead of Deploy Model, you must add a payment method.

      To deploy a model, a valid payment method is required. For more information, see Add a Payment Method.

    4. On the left side of the Bring Your Own Model page, go to Inference Endpoints.

    5. On the Inference Endpoints tab, under Inference Models, confirm the newly created inference endpoint for the model you just deployed.

Deploy Your Own Model

caution

To deploy a model, a valid payment method is required. For more information, see Add a Payment Method.

  1. Go to EGS Serverless on the left sidebar and click Bring Your Own Model.

  2. On the Bring Your Own Model page, click Deploy your own model on the top-right side of the page.


  3. On the Deploy your own model pane:

    1. Enter your endpoint name in the Endpoint Name text box.

    2. Choose your model framework from the Model Framework drop-down list. The available options are:

      • LLM (vLLM)
      • Video (Triton)
      • Custom Model
    3. Under Deployment Options, all options are enabled by default. You can disable or re-enable only Spot Deployment.

      The deployment options are:

      • Spot Deployment: Uses discounted but interruptible compute for cost-effective model hosting. The spot deployment is shared among users.
      • Auto Capacity Chasing: Automatically selects the best available compute (GPU/CPU) across instance types or zones to maximize availability and minimize cost.
      • Auto Scaling: Dynamically adjusts the number of model replicas based on traffic demand.
    4. Select a model source from the Model Source drop-down list. The supported model sources are:

      • Hugging Face (default)
      • S3
      • GitHub
    5. Enter your AI model name in the Model Name text box.

    6. Based on your model source, the next few steps vary:

      For Hugging Face


      1. [Optional for public models] Enter the access token used to download the model from Hugging Face in the HF Token text box. For private models, the HF token is mandatory.

      For S3


      1. Enter the S3 bucket location of your AI Model in the URL - for S3 bucket text box.
      2. Enter the AWS access key ID for S3 bucket authentication in the Access Key text box.
      3. Enter the AWS secret access key for S3 bucket authentication in the Secret Access text box.
      4. Enter the AWS region for your S3 bucket in the Region text box.

      For GitHub


      1. Enter the GitHub URL location of your model in the URL text box.
      2. Enter the GitHub account or organization that owns the repository in the Account Name text box.
      3. Enter the personal access token or secret for GitHub access in the Secret Key text box.
    7. Under Memory, 16 GB GPU memory per instance is selected by default. Based on your model size, choose a higher GPU memory. You can also click Custom to set a custom GPU memory per instance.

    8. In the Additional Configuration text box, enter one argument per line.

    9. Click Create Inference Endpoint to deploy your model.

      caution

      If you only see Add payment method instead of Create Inference Endpoint, you must add a payment method.

      To deploy a model, a valid payment method is required. For more information, see Add a Payment Method.

    10. On the left side of the Bring Your Own Model page, go to Inference Endpoints.

    11. On the Inference Endpoints tab, under Inference Models, confirm the newly created inference endpoint for the model you just deployed.
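
    The Additional Configuration box in step 8 takes one command-line argument per line. Assuming the arguments are forwarded to the model server, and taking the LLM (vLLM) framework as an example, an entry might look like the following — these are standard vLLM server flags, shown only as an illustration; adapt the values to your model:

    ```
    --max-model-len 4096
    --dtype bfloat16
    --gpu-memory-utilization 0.90
    ```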

Manage Inference Endpoints

After you deploy an AI model, an inference endpoint is created.

info

To learn how you are billed for GPU-minutes consumed by inference endpoints, see Billing Cycle.

To manage inference endpoints:

  1. On the Bring Your Own Model page, go to Inference Endpoints.


  2. On the Inference Endpoints tab, under Inference Models, the created inference endpoints along with their details are provided in a table.

  3. The table provides each inference model's deployment name, model name, status, endpoint, and API token. In the top-left of the table, use the Deployment Name search field to look for a specific deployment.

    In the top-right of the table, click Columns to select or clear the columns you want to view.

  4. To activate a timed-out inference endpoint, click the Activate icon in the corresponding column.


  5. You can use an inference endpoint for inference. For more information, see Perform Inference Through Managed Endpoints.

  6. To copy an endpoint, click the copy icon next to it in the Endpoint column.

  7. To delete an endpoint, click the delete icon next to it in the Delete column. After you click the delete icon:


    1. On the Delete Inference Endpoint dialog, enter DELETE in the text box.
    2. Click Yes, Delete to confirm.

Perform Inference Through Managed Endpoints

  1. On the Bring Your Own Model page, go to Inference Endpoints.

  2. Under Inference Endpoints, click an inference endpoint (anywhere on that row) that you want to use.

    info

    You can use an inference endpoint only when its status is Ready.

  3. On the Endpoint Details pane, in the Inference Try Command text box, click the copy icon to copy the command, and then run it in a terminal to view the output.

    info

    An inference endpoint contains its own API token.


    This is an example command:

    curl -v \
      -X POST \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer 3d8b4933-83b4-498c-ada1-72a25721a3df" \
      "https://llama-model-llama-model.inference.smartscaler.io/openai/v1/chat/completions" \
      -d '{
            "model": "llama3",
            "messages": [
              { "role": "user", "content": "Explain Kubernetes in simple terms" }
            ],
            "temperature": 0.7,
            "max_tokens": 150
          }'
  4. To use a different query, change the content value under messages in the command and run it.
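
The same chat completion call can also be issued from Python. The sketch below uses only the standard library to rebuild the request from the curl example above; the endpoint URL and API token are the placeholder values from that example, so substitute the values shown on your own Endpoint Details pane:

```python
import json
import urllib.request

# Placeholder values copied from the curl example above; replace both with
# the endpoint and API token shown on your Endpoint Details pane.
ENDPOINT = "https://llama-model-llama-model.inference.smartscaler.io/openai/v1/chat/completions"
API_TOKEN = "3d8b4933-83b4-498c-ada1-72a25721a3df"

def build_chat_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build the same OpenAI-style chat completion request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 150,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )

req = build_chat_request("Explain Kubernetes in simple terms")
# To send it when the endpoint status is Ready:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode())
```

To ask a different question, pass a different prompt to `build_chat_request` — this mirrors step 4's instruction to change the content value under messages.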