Inference Endpoints
An Inference Endpoint is a hosted service that performs inference tasks, such as making predictions or generating outputs, using a pre-trained AI model.
info
Across our documentation, the terms workspace and slice workspace are used interchangeably.
Create an Inference Endpoint
Use this API to create an Inference Endpoint.
Method and URL
POST /api/v1/inference-endpoint
Parameters
Parameter | Parameter Type | Description | Required |
---|---|---|---|
clusterName | String | The name of the cluster. | Mandatory |
endpointName | String | The name of the endpoint. | Mandatory |
workspace | String | The name of the slice workspace. | Mandatory |
modelSpec | Object | This contains the AI model details. | Mandatory |
gpuSpec | Object | This contains the GPU specifications. For CPU-based inference, specify the value as None. | Mandatory |
Model Spec Parameters
Parameter | Parameter Type | Description | Required |
---|---|---|---|
modelName | String | The AI model name. | Mandatory |
storageURI | String | The URI (Uniform Resource Identifier) of the model object exposed via an HTTP or HTTPS endpoint. | Mandatory |
GPU Spec Parameters
Parameter | Parameter Type | Description | Required |
---|---|---|---|
gpuShape | String | The name of the GPU type, as listed in the Inventory details. | Mandatory |
instanceType | String | The type of instance requested. | Mandatory |
memoryPerGPU | Integer | The memory requirement in GB per GPU. | Mandatory |
numberOfGPUs | Integer | The number of GPUs requested. | Mandatory |
numberOfGPUNodes | Integer | The number of GPU nodes requested. | Mandatory |
Example Request
POST /api/v1/inference-endpoint
curl --location --globoff '{{host}}/api/v1/inference-endpoint' \
--data '{
"clusterName": "{{clusterName}}",
"endpointName": "{{endpointName}}",
"workspace": "{{workspaceName}}",
"modelSpec": {
"modelName": "sklearn",
"storageURI": "gs://kfserving-examples/models/sklearn/1.0/model"
}
}'
For GPU-based inference, include the gpuSpec object in the request body:
"gpuSpec": {
"gpuShape": "hello",
"instanceType": "world",
"memoryPerGPU": 1024,
"numberOfGPUNodes": 1,
"numberOfGPUs": 1
},
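Putting the pieces together, a complete create request for GPU-based inference might look like the sketch below. The gpuShape and instanceType values are illustrative (taken from the describe response later on this page); use the shapes reported by your Inventory details.

# Create a GPU-backed Inference Endpoint (illustrative gpuShape/instanceType values)
curl --location --globoff '{{host}}/api/v1/inference-endpoint' \
--data '{
    "clusterName": "{{clusterName}}",
    "endpointName": "{{endpointName}}",
    "workspace": "{{workspaceName}}",
    "modelSpec": {
        "modelName": "sklearn",
        "storageURI": "gs://kfserving-examples/models/sklearn/1.0/model"
    },
    "gpuSpec": {
        "gpuShape": "nVidia A-10",
        "instanceType": "VM.GPU.A10.2",
        "memoryPerGPU": 1024,
        "numberOfGPUNodes": 1,
        "numberOfGPUs": 1
    }
}'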
Example Responses
Response: Success
{
"statusCode": 200,
"status": "OK",
"message": "Success",
"data": {
"endpointName": "deepankar-hf-llm3"
}
}
Response: Bad Request
{
"statusCode": 422,
"status": "UNKNOWN",
"message": "UNKNOWN",
"data": null,
"error": {
"errorKey": "UNKNOWN",
"message": "UNKNOWN",
"data": "{\"error\":{\"Message\":\"Error while fetching GPR wait time\",\"Status\":422,\"DetailedError\":{\"Errormessage\":\"\\\"exitDuration\\\" is not allowed to be empty\",\"Reason\":\"\",\"Details\":\"\"},\"StatusCode\":422,\"ReferenceDocumentation\":\"NA\"}}"
}
}
Create an Inference Endpoint with Raw Specs
Use this API to create an Inference Endpoint with raw specs.
Method and URL
POST /api/v1/inference-endpoint
Parameters
Parameter | Parameter Type | Description | Required |
---|---|---|---|
clusterName | String | The name of the cluster. | Mandatory |
endpointName | String | The name of the endpoint. | Mandatory |
gpuSpec | Object | This contains the GPU specifications. For CPU-based inference, specify the value as None. | Mandatory |
workspace | String | The name of the slice workspace. | Mandatory |
rawModelSpec | String | The custom model specification, passed as a raw YAML string (for example, a KServe InferenceService manifest). | Mandatory |
Example Request
POST /api/v1/inference-endpoint
curl --location --globoff '{{host}}/api/v1/inference-endpoint' \
--data '{
"clusterName": "worker-1",
"endpointName": "sklearn-iris",
"gpuSpec": {
"gpuShape": "hello",
"instanceType": "world",
"memoryPerGPU": 1024,
"numberOfGPUNodes": 1,
"numberOfGPUs": 1
},
"workspace": "team-beta",
"rawModelSpec": "apiVersion: \"serving.kserve.io/v1beta1\"\nkind: \"InferenceService\"\nmetadata:\n name: \"sklearn-iris\"\nspec:\n predictor:\n model:\n modelFormat:\n name: sklearn\n storageUri: \"gs://kfserving-examples/models/sklearn/1.0/model\"\n"
}'
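Escaping a multi-line YAML manifest into a JSON string by hand is error-prone. One way to build the body is to let jq do the quoting, as in the sketch below; it assumes jq is installed and that the KServe InferenceService manifest is saved locally as inference-service.yaml (a hypothetical filename), and it omits gpuSpec for brevity.

# Embed the YAML manifest as a properly escaped JSON string (jq assumed)
raw_spec=$(cat inference-service.yaml)
curl --location --globoff '{{host}}/api/v1/inference-endpoint' \
--data "$(jq -n --arg spec "$raw_spec" '{
    clusterName: "worker-1",
    endpointName: "sklearn-iris",
    workspace: "team-beta",
    rawModelSpec: $spec
}')"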
Example Responses
Response: Success
{
"statusCode": 200,
"status": "OK",
"message": "Success",
"data": {
"endpointName": "deepankar-hf-llm3"
}
}
Response: Bad Request
{
"statusCode": 422,
"status": "UNKNOWN",
"message": "UNKNOWN",
"data": null,
"error": {
"errorKey": "UNKNOWN",
"message": "UNKNOWN",
"data": "{\"error\":{\"Message\":\"Error while fetching GPR wait time\",\"Status\":422,\"DetailedError\":{\"Errormessage\":\"\\\"exitDuration\\\" is not allowed to be empty\",\"Reason\":\"\",\"Details\":\"\"},\"StatusCode\":422,\"ReferenceDocumentation\":\"NA\"}}"
}
}
List Inference Endpoints
Use this API to list the Inference Endpoints in a workspace.
Method and URL
GET /api/v1/inference-endpoint/list?workspace
Example Request
GET /api/v1/inference-endpoint/list?workspace=team-omega
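For scripting, the response can be piped through jq to tabulate endpoint names and statuses; a small sketch, assuming jq is available:

# Print one "name <tab> status" line per endpoint (jq assumed)
curl -s --location --globoff '{{host}}/api/v1/inference-endpoint/list?workspace=team-omega' \
    | jq -r '.data.endpoints[] | "\(.endpointName)\t\(.status)"'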
Example Responses
Response: Success
{
"statusCode": 200,
"status": "OK",
"message": "Success",
"data": {
"endpoints": [
{
"endpointName": "deepankar-hf-llm3",
"modelName": "huggingface",
"status": "Running",
"endpoint": "https://deepankar-hf-llm3.team-omega.aveshalabs.io/v1/model/huggingface-llama3:predict",
"clusterName": "worker-1",
"namespace":"team-omega-deepankar-hf-ah1cex"
},
{
"endpointName": "eric-gpt4",
"modelName": "gpt4-o-preview",
"status": "Running",
"endpoint": "https://eric-gpt4.team-omega.aveshalabs.io/v1/model/gpt4-o-preview:predict",
"clusterName": "worker-1",
"namespace":"team-omega-eric-gpt4-p8j7fa"
},
{
"endpointName": "eric-gpt4-2",
"modelName": "gpt4-o-preview",
"status": "Image Pull Backoffice",
"endpoint": "",
"clusterName": "worker-1",
"namespace":"team-omega-eric-gpt4-2-y5n6ka"
},
{
"endpointName": "mini-gpt4",
"modelName": "gpt4-mini",
"status": "Pending",
"endpoint": "",
"clusterName": "worker-2",
"namespace":"team-omega-mini-gpt4-l1uh5g"
}
]
}
}
Describe an Inference Endpoint
Use this API to describe an Inference Endpoint.
Method and URL
GET /api/v1/inference-endpoint?workspace&endpoint
Example Request
GET /api/v1/inference-endpoint?workspace=team-omega&endpoint=deepankar-hf-llm3
Example Responses
Response: Success
{
"statusCode": 200,
"status": "OK",
"message": "Success",
"data": {
"endpoint": {
"endpointName": "deepankar-hf-llm3",
"modelName": "huggingface",
"status": "Running",
"endpoint": "https://deepankar-hf-llm3.team-omega.aveshalabs.io/v1/model/huggingface-llama3:predict",
"clusterName": "worker-1",
"namespace": "team-omega-deepankar-hf-ah1cex",
"predictStatus": "Ready",
"ingressStatus": "Ready",
"tryCommand": [
"curl -v \\",
"-H \"Host: deepankar-hf-llm3.team-omega.example.com\" \\",
"-H \"Content-Type: application/json\" \\",
"\"http://192.168.1.130/v1/models/sklearn-iris:predict\" \\",
"-d '{...<request body>...}'"
],
"dnsRecords": [
{
"dns": "deepankar-hf-llm3.team-omega.example.com",
"type": "A",
"value": "192.168.1.130"
},
{
"dns": "deepankar-hf-llm3.team-omega.dev-example.com",
"type": "A",
"value": "192.168.1.130"
}
],
"gpuRequests": [
{
"gprName": "deepankar-hf-llm3-gpr-gh6aj7",
"gprId": "yyyyyyy",
"instanceType": "VM.GPU.A10.2",
"gpuShape": "nVidia A-10",
"numberOfGPUs": 1,
"numberOfGPUNodes": 1,
"memoryPerGPU": "24576",
"status": "Provisioned"
},
{
"gprName": "deepankar-hf-llm3-gpr-lo9ju7",
"gprId": "xxxxxxxxx",
"instanceType": "VM.GPU.A10.2",
"gpuShape": "nVidia A-10",
"numberOfGPUs": 1,
"numberOfGPUNodes": 1,
"memoryPerGPU": "24576",
"status": "Released"
}
]
}
}
}
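Because GPU provisioning is asynchronous (note the Pending status in the list response above), a client typically polls this API until the endpoint reports Running. A minimal polling sketch, assuming jq is available:

# Poll the describe API every 10 seconds until the endpoint is Running (jq assumed)
while true; do
    status=$(curl -s --location --globoff \
        '{{host}}/api/v1/inference-endpoint?workspace=team-omega&endpoint=deepankar-hf-llm3' \
        | jq -r '.data.endpoint.status')
    echo "current status: ${status}"
    [ "${status}" = "Running" ] && break
    sleep 10
done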
Delete an Inference Endpoint
Use this API to delete an Inference Endpoint.
Method and URL
DELETE /api/v1/inference-endpoint
Example Request
DELETE /api/v1/inference-endpoint
curl --location --globoff --request DELETE '{{host}}/api/v1/inference-endpoint' \
--data '{
"endpoint": "{{inferenceEndpointName}}",
"workspace": "team-omega"
}'
Example Responses
Response: Success
{
"statusCode": 200,
"status": "OK",
"message": "Success",
"data": {}
}
Response: Cannot Delete
{
"statusCode": 500,
"status": "UNKNOWN",
"message": "UNKNOWN",
"data": null,
"error": {
"errorKey": "UNKNOWN",
"message": "UNKNOWN",
"data": "{\"error\":{\"Message\":\"Error while deleting GPR\",\"Status\":500,\"DetailedError\":{\"Errormessage\":\"Cannot delete GPR in Successful state\",\"Reason\":\"\",\"Details\":\"\"},\"StatusCode\":500,\"ReferenceDocumentation\":\"NA\"}}"
}
}