Frequently Asked Questions
This topic provides help with the most common questions about Smart Scaler.
Installation and Configuration
These questions cover the installation and configuration of Smart Scaler.
How do I check the logs?
Check the logs using the following command:
kubectl logs <pod name> -n <namespace>
where <pod name> is the name of the pod and <namespace> is its namespace. For example, inference-agent-r9wbf is the pod name and customer-1 is the namespace. For an Inference Agent pod, look for the inference-agent prefix.
Example
kubectl logs -f inference-agent-r9wbf -n customer-1
Expected Output
{"level":"info","ts":1688372847.6437705,"msg":"Starting Smart Scaler Inference Agent"}
{"level":"info","ts":1688372847.643794,"msg":"reading configuration"}
{"level":"info","ts":1688372847.6438043,"msg":"Unable to find Scrapper configuration file path in env.SMART_SCALER_CONFIGURATION_PATH. Using /app/configuration.yaml"}
{"level":"info","ts":1688372847.6441996,"msg":"Using startTime 2023-05-08 19:00:00 +0000 UTC and end time 2023-05-09 19:00:00 +0000 UTC to set duration 24.000000"}
{"level":"info","ts":1688372847.6442132,"msg":"bootstrapping saas apis service"}
{"level":"info","ts":1688372847.6442165,"msg":"exchanging credentials for access token"}
How do I validate the communication between the Smart Scaler Agent and the Smart Scaler cloud?
Communication is established when the job starts running. Check the logs using the following command:
kubectl logs <pod name> -n <namespace>
where <pod name> is the name of the pod and <namespace> is its namespace.
In the log output, if the communication is successful, the agent logs the following statement:
{"level":"info","ts":1687944475.416642,"msg":"Successfully connected to the Smart Scaler APIs"}
How do I validate that the Smart Scaler Agent sources data from the correct data source?
The Smart Scaler Agent logs a message for the data source it is using. Check the logs for the message using datasource <datasource-name> with the following command:
kubectl logs <pod name> -n <namespace>
where <pod name> is the name of the pod and <namespace> is its namespace.
Example
kubectl logs inference-agent-tl4d9 | egrep 'data source|datasource'
Expected Output
{"level":"info","ts":1689953376.0342093,"msg":"initializing datasource"}
{"level":"info","ts":1689953376.0342476,"msg":"init prometheus data source with name my-prometheus"}
{"level":"info","ts":1689953376.0342827,"msg":"init datadog data source with name us5-datadog"}
{"level":"info","ts":1689953376.0343823,"msg":"using datasource my-prometheus"}
How should I handle secrets/keys if I am using a vault?
Enter the secrets/keys into the vault/keystore. Save the path to the secret/key and use it in the ss-agent-values.yaml file.
You will need a third-party tool, such as a helm resolver, to take the path, retrieve the secret/key from the vault, and apply it to the cluster.
See the example below.
dataSources:
  datadog:
    existingSecret: ""
    apiKey: path:avesha/data/infrastructure/smart-scaler/#datadog_api_key
    appkey: path:avesha/data/infrastructure/smart-scaler/#datadog_app_key
How do I securely pass the Smart Scaler clientSecret?
To securely pass the Smart Scaler clientSecret, use one of the following methods:
- Configure clientSecret to reference an existing, securely stored secret in the ss-agent-values.yaml file. For more information, see Configure an Existing Secret. A minimal sketch of creating such a secret follows this list.
- Configure clientSecret using the External Secrets Operator in the ss-agent-values.yaml file. For more information, see Configure an External Secrets Operator.
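As a minimal sketch of the first method, you can store the clientSecret in a standard Kubernetes Secret and then reference that secret from the values file. The secret name smart-scaler-client-secret and the key clientSecret below are illustrative assumptions; use the names that your chart and Configure an Existing Secret expect.
kubectl create secret generic smart-scaler-client-secret \
  --from-literal=clientSecret=<your client secret> \
  -n smart-scaler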
Can I change the values under the agentConfiguration section of the downloaded ss-agent-values.yaml file?
It is important that you change the clusterDisplayName field to match the cluster name you defined in Datadog/Prometheus/New Relic or another metrics data source.
Other than that, do not change the values under agentConfiguration, as they are set specifically for your setup. Changing these values could lead to errors in the behavior of the Smart Scaler agent.
How can I change the clusterDisplayName in the ss-agent-values.yaml file?
If you change the clusterDisplayName under the agentConfiguration section of the ss-agent-values.yaml file, then:
- Run the following command:
helm upgrade --install smartscaler smart-scaler/smartscaler-agent -f <valuesfilename>.yaml -n smart-scaler --create-namespace --set configHelper.enabled=true
- Confirm that the cluster display name change is reflected in the management console's Deployed Agents page.
How do I get the values for smartscalerApi and smartscalerMetrics in the ss-agent-values.yaml file?
The values for smartscalerApi and smartscalerMetrics are provided to you through email from Smart Scaler support. Wait until you receive this email before proceeding with the installation. If you haven't received it, contact Avesha support at support@avesha.io.
After installing the Smart Scaler Agent, why is it not showing up in the Smart Scaler management console?
Make sure that the agentConfiguration section in the values file is not changed (apart from clusterDisplayName); changing these values will result in a handshake failure between the Smart Scaler agent and the Avesha SaaS.
Carefully read and follow the instructions in the downloaded ss-agent-values.yaml file.
After the Smart Scaler Agent is configured, why don't I see any data in the management console's Dashboard?
Check that the smartscalerApi and smartscalerMetrics values provided in the email are correctly added to the ss-agent-values.yaml file.
How do I obtain the URL for the Prometheus metrics data source?
For Prometheus, it is the external IP of the prometheus-server service. If it is a ClusterIP service, use the DNS name of the service in this format: http://my-svc.my-namespace.svc.cluster-domain.example:port. For example, http://prometheus.istio-system.svc.cluster.local:9090.
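For example, you can inspect the Prometheus service to find its type, address, and port (the service name and namespace below are placeholders for your installation):
kubectl get svc <prometheus service name> -n <namespace>
If the TYPE column shows LoadBalancer, use the EXTERNAL-IP with the listed port; if it shows ClusterIP, build the DNS name from the service name, namespace, and port as described above.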
What are the recommendations for configuring metrics server replicas on a Kubernetes cluster with a large number of nodes and services?
It is recommended to configure one replica for every 100 deployments in an application. For example, configure four replicas for monitoring an application containing 400 deployments.
For more information, see Kubernetes Metrics Server.
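As a hedged sketch, assuming the metrics server runs as the metrics-server deployment in the kube-system namespace (names vary by installation), you can adjust the replica count with kubectl:
kubectl -n kube-system scale deployment metrics-server --replicas=4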
Does Smart Scaler use customer data to train Avesha models? What specific data (if any) leaves customer site, and what is it used for?
No. Avesha's models are pre-trained. We do not use customer data to train the models. We only use application and Kubernetes metrics data to train the models. For example, we use data such as application RPS, response latency, number of pods, and CPU and memory usage metrics.
We do not need, collect, use, or transmit customer, user, or account details.
The model uses metrics data to understand application behavior in the cluster under user traffic, detect anomalies, correct configurations, and proactively autoscale applications.
Cluster Autoscaler
How does Smart Scaler work with a Cluster Autoscaler?
Smart Scaler operates at the application/microservice layer of Kubernetes, scaling pods as required.
A Cluster Autoscaler manages node scaling based on the pod status. It adds nodes when there are pending pods and removes them when pods scale down. No changes are required for the Cluster Autoscaler.
Beyla Configuration
How can I ensure Beyla DaemonSets run on all nodes?
You can ensure Beyla DaemonSets run on all nodes by extracting the node taints and applying the necessary tolerations in the Helm chart values file. For more information, see Ensure Beyla DaemonSets run on all nodes.
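For example, you can list the taints on each node with the following generic kubectl command (not specific to Smart Scaler) and then add matching tolerations to the Beyla values file as described in that topic:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'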
Karpenter
How does Karpenter select OCI images?
Karpenter launches self-managed nodes.
OL7 and OL8 images are required for self-managed nodes. Karpenter deploys the images as described in the following steps:
- Karpenter uses the first image found on the existing OKE node pools.
- Override this behavior by setting the IMAGE_OCID environment variable to the required image (a sketch of one way to set it follows this list). The image that you configure must be OL7 or OL8 with Kubernetes installed.
- Override the ImageOCID property in the OciNodeClass definition as shown in the following example:
apiVersion: karpenter.multicloud.sh/v1alpha1
kind: OciNodeClass
metadata:
  name: ocinodeclass1
spec:
  #ImageOCID: <image.ocid>
  #BootVolumeSizeGB: <size in GB>
  #SubnetOCID: <subnet OCID>
  #NetworkSgOCID: <nsgOCid1,nsgOCid2>
  #PodsSubnetOCID: <PODs subnet OCID> #For OCI VCN-Native Pod Networking
  #PodsNetworkSgOCIDs: <PODs nsgOCid1,PODs nsgOCid2> #For OCI VCN-Native Pod Networking
  #SSHKeys: "ssh-rsa ********"
- Updating ImageOCID in the OciNodeClass definition marks all nodes as drifted and replaces them with new nodes.
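A minimal sketch of setting the IMAGE_OCID environment variable, assuming Karpenter runs as a deployment named karpenter in the karpenter namespace (both names are assumptions; your installation may set this through Helm values instead):
kubectl set env deployment/karpenter -n karpenter IMAGE_OCID=<image.ocid>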
How can I disable disruption on a NodePool?
On a NodePool definition, replace the following disruption object parameters:
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
With the following disruption object parameters:
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 1m
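For context, the disruption block sits under spec in the NodePool definition. The following fragment is a sketch using the upstream karpenter.sh/v1 API and a NodePool named default; the apiVersion and name may differ in your OCI setup, and other required NodePool fields are omitted:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  # ... other NodePool fields such as template are omitted
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1m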
How do I make sure a Pod is not disrupted by Karpenter?
If you are using Karpenter and want to ensure a pod is not disrupted during scale-down or node consolidation, set the karpenter.sh/do-not-disrupt: "true" annotation on the pod. Add it under the metadata.annotations field in your Pod spec.
The following is an example YAML file:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
This annotation tells Karpenter not to consider the pod for disruption during node termination, consolidation, or deprovisioning operations.
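If a pod is already running, you can also apply the annotation directly with kubectl (pod name and namespace are placeholders); note that an annotation added this way is lost when the pod is recreated, so the Deployment template above remains the durable place to set it:
kubectl annotate pod <pod name> karpenter.sh/do-not-disrupt="true" -n <namespace>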
We receive additional discounts on E5 instance types, which could help with cost savings. Is there a way to prioritize E5 (or any specific instance type) when Karpenter scales up new nodes?
The following example values file represents the prices per OCPU and per GB of memory per hour, taken from the OCI Price List.
oci:
  prices:
    vm:
      - name: "VM.Standard.E3.Flex"
        ocpu: 0.025
        mem: 0.0015
      - name: "VM.Standard.E4.Flex"
        ocpu: 0.025
        mem: 0.0015
      - name: "VM.Standard.E5.Flex"
        ocpu: 0.03
        mem: 0.002
      - name: "VM.GPU.A10"
        gpu: 2
      - name: "VM.GPU2"
        gpu: 1.275
      - name: "VM.GPU3"
        gpu: 2.95
If you assign the same pricing values to E3, E4, and E5 instance types, Karpenter will treat them as equally cost-effective and may provision any of them interchangeably. However, if E5 is configured with a lower price, Karpenter will prioritize it to optimize for cost savings.
Another way to control which VM shapes Karpenter uses is by setting the karpenter.oci.sh/instance-family requirement in the node pool definition:
- key: "karpenter.oci.sh/instance-family"
  operator: In
  values: ["E3","E4"]
This acts as a restrictive filter. For example:
- If you specify ["E3", "E4", "E5"] and the pricing for all is equal, Karpenter will likely favor E3 and E4 due to internal preferences or availability.
- If you set it to only ["E5"], then only E5 instances will be provisioned regardless of price, because you have restricted the selection explicitly.