Version: 1.1.0

Supported Anomaly Detection Types

This topic describes the various types of anomalies detected by the SRE anomaly detection system across different platforms and services.

Overview

The anomaly detection system monitors multiple platforms and services to identify issues that could impact system reliability and performance.

Kubernetes Anomalies

Node Anomalies

| Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
|---------|----------|-------------|--------------------|----------|----------|
| node_status_anomaly | High | Detects unhealthy Kubernetes nodes | Node health status is unhealthy; node Ready status is not True | Node Ready status is not True | Node name, ready status, resource ID |
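
As a concrete illustration, the sketch below shows how a node_status_anomaly check could look using the official Kubernetes Python client. The client calls are standard, but the function name, the shape of the returned records, and the use of load_kube_config are illustrative assumptions rather than the system's actual implementation.

```python
from kubernetes import client, config

def find_unhealthy_nodes():
    """Flag nodes whose Ready condition is not 'True' (illustrative sketch)."""
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    anomalies = []
    for node in v1.list_node().items:
        # Find the node's Ready condition, if it is reported at all.
        ready = next(
            (c for c in (node.status.conditions or []) if c.type == "Ready"),
            None,
        )
        if ready is None or ready.status != "True":
            anomalies.append({
                "anomaly_type": "node_status_anomaly",
                "name": node.metadata.name,
                "symptoms": ["Node Ready status is not True"],
                "metadata": {"ready_status": ready.status if ready else "Unknown"},
            })
    return anomalies
```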

Pod Anomalies

| Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
|---------|----------|-------------|--------------------|----------|----------|
| pod_crashloopbackoff_anomaly | Critical | Detects containers stuck in CrashLoopBackOff state | Container waiting state reason is CrashLoopBackOff | Container in CrashLoopBackOff state; container restart count information; error messages from container | Container name, restart count, error details |
| pod_imagepull_anomaly | High | Detects image pull failures preventing container startup | Container waiting state reason is ImagePullBackOff or ErrImagePull | Container in ImagePullBackOff or ErrImagePull state; specific error messages | Container name, error details |
| pod_high_restart_anomaly | Medium | Detects containers with excessive restart counts indicating instability | Container restart count >= 10 | High restart count; container instability indicators | Container name, restart count |
| pod_container_not_ready_anomaly | Medium | Detects containers that are not ready without a clear waiting reason | Container ready status is false; no clear waiting state reason | Container not ready status | Container name, restart count |
| pod_pending_anomaly | Medium | Detects pods stuck in Pending state for extended periods | Pod phase is Pending; pending duration > 3 minutes from creation time | Pod stuck in Pending state with duration; container readiness information; recent restart information if applicable | Namespace, phase, node name, pending duration, creation timestamp |
| pod_status_anomaly | Medium | General pod health issues not covered by specific anomaly types | Pod is not in a healthy state (Running/Succeeded); pod is not ready; recent restarts detected | Pod status information; container readiness ratios; restart information | Namespace, phase, node name, container statuses |
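
The pod checks above can be illustrated with a single sketch that walks a pod's container statuses and applies the documented conditions. The thresholds come from this topic; the function name, the precedence of the checks, and returning anomaly names as strings are assumptions made for the example.

```python
from datetime import datetime, timezone

RESTART_THRESHOLD = 10          # pod_high_restart_anomaly
PENDING_TIMEOUT_SECONDS = 180   # pod_pending_anomaly (3 minutes)

def check_pod(pod):
    """Return anomaly type names for one V1Pod from the Kubernetes Python client.

    Illustrative sketch; the check ordering is an assumption, not the system's logic.
    """
    found = []
    for cs in (pod.status.container_statuses or []):
        waiting = cs.state.waiting if cs.state else None
        reason = waiting.reason if waiting else None
        if reason == "CrashLoopBackOff":
            found.append("pod_crashloopbackoff_anomaly")
        elif reason in ("ImagePullBackOff", "ErrImagePull"):
            found.append("pod_imagepull_anomaly")
        elif cs.restart_count >= RESTART_THRESHOLD:
            found.append("pod_high_restart_anomaly")
        elif not cs.ready and reason is None:
            found.append("pod_container_not_ready_anomaly")

    # Pods stuck in Pending longer than the documented timeout.
    if pod.status.phase == "Pending":
        age = (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds()
        if age > PENDING_TIMEOUT_SECONDS:
            found.append("pod_pending_anomaly")
    return found
```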

AWS Anomalies

EC2 Instance Anomalies

| Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
|---------|----------|-------------|--------------------|----------|----------|
| instance_status_anomaly | High | Detects EC2 instances not in a running state | Instance state is not running | Instance status information | Instance details, resource ID |
| cpu_utilization_anomaly | Medium-High | Detects CPU utilization issues (both high and low) | Current CPU utilization > 90% (high usage); 48-hour average CPU utilization < 15% (underutilization) | CPU utilization percentage with threshold information | CPU metrics, thresholds, resource ID |
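
A hedged sketch of the cpu_utilization_anomaly logic using boto3 and CloudWatch follows. The metric (AWS/EC2 CPUUtilization) and thresholds match this topic, while the function name, region default, sampling period, and taking the "current" utilization from the latest datapoint are illustrative assumptions.

```python
import boto3
from datetime import datetime, timedelta, timezone

CPU_HIGH_THRESHOLD = 90.0   # percent, current usage
CPU_LOW_THRESHOLD = 15.0    # percent, 48-hour average

def check_cpu_utilization(instance_id, region="us-east-1"):
    """Classify an instance's CPU usage against the documented thresholds (sketch)."""
    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=48),
        EndTime=end,
        Period=3600,               # hourly datapoints over the 48-hour window
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    if not points:
        return None
    current = points[-1]["Average"]
    avg_48h = sum(p["Average"] for p in points) / len(points)
    if current > CPU_HIGH_THRESHOLD:
        return ("cpu_utilization_anomaly", f"CPU at {current:.1f}% (> {CPU_HIGH_THRESHOLD}%)")
    if avg_48h < CPU_LOW_THRESHOLD:
        return ("cpu_utilization_anomaly", f"48h average CPU {avg_48h:.1f}% (< {CPU_LOW_THRESHOLD}%)")
    return None
```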

ECS Anomalies

| Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
|---------|----------|-------------|--------------------|----------|----------|
| ecs_orphaned_task | Medium | Detects ECS tasks not associated with any service | Task has no service name or an empty service name | Task is orphaned (no service name) | Task details, resource ID |
| ecs_task_status_anomaly | Medium | Detects ECS tasks in problematic states | Task status is STOPPED or DEACTIVATING | Task status information | Task details, resource ID |
| ecs_service_scaling_anomaly | Medium | Detects ECS service scaling issues | Desired task count != running task count | Desired vs. running count mismatch | Desired count, running count, resource ID |
| ecs_service_pending_tasks | Medium | Detects ECS tasks that cannot be scheduled | Service has a pending task count > 0 | Number of pending tasks that cannot be scheduled | Pending count, service details, resource ID |
| ecs_service_status_anomaly | Medium | Detects ECS services in abnormal states | Service status is not ACTIVE or DRAINING | Service status information | Service details, resource ID |
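
The service-level ECS checks can be sketched with boto3 as shown below. The describe_services fields used (desiredCount, runningCount, pendingCount, status) are standard, but the function name, region default, and evaluating all three conditions in one pass are assumptions for illustration.

```python
import boto3

def check_ecs_services(cluster_name, region="us-east-1"):
    """Flag scaling, pending-task, and status anomalies for ECS services (sketch).

    Pagination and the 10-service limit of describe_services are ignored here.
    """
    ecs = boto3.client("ecs", region_name=region)
    anomalies = []
    service_arns = ecs.list_services(cluster=cluster_name)["serviceArns"]
    if not service_arns:
        return anomalies
    services = ecs.describe_services(cluster=cluster_name, services=service_arns)["services"]
    for svc in services:
        if svc["desiredCount"] != svc["runningCount"]:
            anomalies.append(("ecs_service_scaling_anomaly", svc["serviceName"]))
        if svc["pendingCount"] > 0:
            anomalies.append(("ecs_service_pending_tasks", svc["serviceName"]))
        if svc["status"] not in ("ACTIVE", "DRAINING"):
            anomalies.append(("ecs_service_status_anomaly", svc["serviceName"]))
    return anomalies
```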

Load Balancer Anomalies

| Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
|---------|----------|-------------|--------------------|----------|----------|
| target_group_high_error_rate | High | Detects high 5XX error rates in target groups | HTTP 5XX error rate > 15% | High 5XX error rate with percentage; load balancer experiencing backend errors | Load balancer name, target group ARN, error metrics, threshold |
| target_group_low_health | High | Detects insufficient healthy targets in target groups | Healthy host count < 50% of total targets | Low healthy host count ratio; health percentage information | Load balancer name, target group ARN, health metrics, threshold |
| target_group_high_latency | Medium | Detects high response times in target groups | Target response time > 1.0 second | High response time with threshold information; slow backend response indicators | Load balancer name, target group ARN, response time metrics, threshold |
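
The target_group_high_error_rate condition can be computed from CloudWatch ALB metrics as sketched below. The namespace, metric names, and dimensions are standard for Application Load Balancers; the function name, the 15-minute window, and the dimension-value parameters are illustrative assumptions.

```python
import boto3
from datetime import datetime, timedelta, timezone

ERROR_RATE_THRESHOLD = 15.0  # percent of requests returning HTTP 5XX

def target_group_error_rate(lb_dimension, tg_dimension, region="us-east-1"):
    """Compute the backend 5XX error rate for one ALB target group (sketch).

    lb_dimension / tg_dimension are CloudWatch dimension values, for example
    'app/my-alb/50dc6c495c0c9188' and 'targetgroup/my-tg/73e2d6bc24d8a067'
    (hypothetical names used only for illustration).
    """
    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    dims = [
        {"Name": "LoadBalancer", "Value": lb_dimension},
        {"Name": "TargetGroup", "Value": tg_dimension},
    ]

    def metric_sum(name):
        # Sum the metric over the last 15 minutes.
        stats = cw.get_metric_statistics(
            Namespace="AWS/ApplicationELB",
            MetricName=name,
            Dimensions=dims,
            StartTime=end - timedelta(minutes=15),
            EndTime=end,
            Period=900,
            Statistics=["Sum"],
        )
        return sum(p["Sum"] for p in stats["Datapoints"])

    requests = metric_sum("RequestCount")
    errors = metric_sum("HTTPCode_Target_5XX_Count")
    rate = (errors / requests * 100) if requests else 0.0
    return rate, rate > ERROR_RATE_THRESHOLD
```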

Prometheus/Istio Anomalies

Service Mesh Anomalies

| Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
|---------|----------|-------------|--------------------|----------|----------|
| istio_error_rate | Medium | Detects high error rates in the Istio service mesh | Service error rate > 10% | Error rate exceeds threshold with percentage | Service details, error rate metrics, resource ID |
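
A sketch of the istio_error_rate check using the standard istio_requests_total metric and the Prometheus HTTP API is shown below; the PromQL grouping by destination_service, the Prometheus URL, and the function name are assumptions for illustration.

```python
import requests

ERROR_RATE_THRESHOLD = 10.0  # percent

# Ratio of 5xx responses to all requests per destination service over 5 minutes.
PROMQL = """
100 * sum by (destination_service) (rate(istio_requests_total{response_code=~"5.."}[5m]))
    / sum by (destination_service) (rate(istio_requests_total[5m]))
"""

def istio_error_rate_anomalies(prometheus_url="http://prometheus:9090"):
    """Query Prometheus and flag services above the error-rate threshold (sketch)."""
    resp = requests.get(f"{prometheus_url}/api/v1/query", params={"query": PROMQL})
    resp.raise_for_status()
    anomalies = []
    for result in resp.json()["data"]["result"]:
        service = result["metric"].get("destination_service", "unknown")
        rate = float(result["value"][1])
        if rate > ERROR_RATE_THRESHOLD:
            anomalies.append(("istio_error_rate", service, rate))
    return anomalies
```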

Anomaly Data Structure

Each detected anomaly contains the information described in the following table.

| Field | Description |
|-------|-------------|
| source_system | The system that generated the data (for example, kubernetes, aws). |
| affected_system | The system affected by the anomaly. |
| anomaly_type | Specific type of anomaly detected. |
| resource_type | Type of resource affected (for example, pod, node, instance). |
| name | Name/identifier of the affected resource. |
| timestamp | When the anomaly was detected. |
| metric | The metric that triggered the anomaly. |
| value | The value that triggered the anomaly. |
| symptoms | A list of human-readable symptoms describing the issue. |
| metadata | Additional context and details about the anomaly. |
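
A hypothetical example of a single anomaly record with these fields populated is shown below; all values (names, namespace, counts, timestamp) are invented for illustration and do not come from actual system output.

```python
# Hypothetical anomaly record; every field value here is illustrative only.
anomaly = {
    "source_system": "kubernetes",
    "affected_system": "kubernetes",
    "anomaly_type": "pod_crashloopbackoff_anomaly",
    "resource_type": "pod",
    "name": "checkout-service-7d9f8b6c5-x2k4q",
    "timestamp": "2024-01-01T12:00:00Z",
    "metric": "container_waiting_reason",
    "value": "CrashLoopBackOff",
    "symptoms": [
        "Container in CrashLoopBackOff state",
        "Container restarted 14 times",
    ],
    "metadata": {
        "namespace": "payments",
        "container": "checkout",
        "restart_count": 14,
    },
}
```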

Severity Levels

The system uses the following severity levels:

| Severity Level | Description |
|----------------|-------------|
| Critical | Immediate attention is required; system functionality is severely impacted. |
| High | Significant issues that need prompt attention. |
| Medium | Issues that should be addressed but are not immediately critical. |
| Low | Minor issues or optimization opportunities. |

Detection Thresholds

Key thresholds used across the system:

Kubernetes

  • Pod pending timeout: 3 minutes
  • Container restart count threshold: 10 restarts

AWS

  • CPU utilization high threshold: 90%
  • CPU utilization low threshold: 15% (48-hour average)
  • Target group error rate threshold: 15%
  • Target group health threshold: 50% healthy targets
  • Target group response time threshold: 1.0 second

Prometheus/Istio

  • Service error rate threshold: 10%
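
For reference, the sketch below gathers these thresholds into a single configuration mapping; the structure and key names are assumptions made for illustration, not the system's actual configuration.

```python
# Illustrative consolidation of the documented thresholds; key names are assumed.
DETECTION_THRESHOLDS = {
    "kubernetes": {
        "pod_pending_timeout_seconds": 180,        # 3 minutes
        "container_restart_count": 10,
    },
    "aws": {
        "cpu_high_percent": 90,
        "cpu_low_percent_48h_avg": 15,
        "target_group_error_rate_percent": 15,
        "target_group_healthy_percent": 50,
        "target_group_response_time_seconds": 1.0,
    },
    "prometheus_istio": {
        "service_error_rate_percent": 10,
    },
}
```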