Supported Anomaly Detection Types
This topic describes the types of anomalies that the SRE anomaly detection system detects across different platforms and services.
Overview
The anomaly detection system monitors multiple platforms and services to identify issues that could impact system reliability and performance.
Kubernetes Anomalies
Node Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
node_status_anomaly | High | Detects unhealthy Kubernetes nodes | - Node health status is unhealthy - Node ready status is not True | Node Ready status is not True | Node name, ready status, resource ID |
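The node check can be reproduced with the Kubernetes Python client. The following is a minimal sketch under that assumption; the function name and returned fields are illustrative, not the detector's actual code.

```python
from kubernetes import client, config

def find_unhealthy_nodes():
    """Flag nodes whose Ready condition is not True, mirroring node_status_anomaly."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside a cluster
    anomalies = []
    for node in client.CoreV1Api().list_node().items:
        # The Ready condition reports "True", "False", or "Unknown".
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            anomalies.append({
                "anomaly_type": "node_status_anomaly",
                "name": node.metadata.name,
                "symptoms": [f"Node Ready status is {ready}"],
            })
    return anomalies
```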
Pod Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
pod_crashloopbackoff_anomaly | Critical | Detects containers stuck in CrashLoopBackOff state | Container waiting state reason is CrashLoopBackOff | - Container in CrashLoopBackOff state - Container restart count information - Error messages from container | Container name, restart count, error details |
pod_imagepull_anomaly | High | Detects image pull failures preventing container startup | - Container waiting state reason is ImagePullBackOff - Container waiting state reason is ErrImagePull | - Container in ImagePullBackOff or ErrImagePull state - Specific error messages | Container name, error details |
pod_high_restart_anomaly | Medium | Detects containers with excessive restart counts indicating instability | Container restart count >= 10 | - High restart count - Container instability indicators | Container name, restart count |
pod_container_not_ready_anomaly | Medium | Detects containers that are not ready without clear waiting reasons | - Container ready status is false - No clear waiting state reason | Container not ready status | Container name, restart count |
pod_pending_anomaly | Medium | Detects pods stuck in Pending state for extended periods | - Pod phase is Pending - Pending duration > 3 minutes from creation time | - Pod stuck in Pending state with duration - Container readiness information - Recent restart information if applicable | Namespace, phase, node name, pending duration, creation timestamp |
pod_status_anomaly | Medium | General pod health issues not covered by specific anomaly types | - Pod is not in healthy state (Running/Succeeded) - Pod is not ready - Recent restarts detected | - Pod status information - Container readiness ratios - Restart information | Namespace, phase, node name, container statuses |
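The container- and pod-level conditions in the table above map to fields exposed by the Kubernetes Python client. A minimal sketch, assuming access to a kubeconfig; names and the tuple-based return format are illustrative only:

```python
from datetime import datetime, timezone
from kubernetes import client, config

RESTART_THRESHOLD = 10          # pod_high_restart_anomaly
PENDING_TIMEOUT_SECONDS = 180   # pod_pending_anomaly (3 minutes)

def find_pod_anomalies(namespace="default"):
    """Apply the container- and pod-level checks described in the table above."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    anomalies = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting.reason if cs.state and cs.state.waiting else None
            if waiting == "CrashLoopBackOff":
                anomalies.append(("pod_crashloopbackoff_anomaly", pod.metadata.name, cs.name))
            elif waiting in ("ImagePullBackOff", "ErrImagePull"):
                anomalies.append(("pod_imagepull_anomaly", pod.metadata.name, cs.name))
            elif cs.restart_count >= RESTART_THRESHOLD:
                anomalies.append(("pod_high_restart_anomaly", pod.metadata.name, cs.name))
            elif not cs.ready and waiting is None:
                anomalies.append(("pod_container_not_ready_anomaly", pod.metadata.name, cs.name))
        if pod.status.phase == "Pending":
            age = (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds()
            if age > PENDING_TIMEOUT_SECONDS:
                anomalies.append(("pod_pending_anomaly", pod.metadata.name, None))
    return anomalies
```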
AWS Anomalies
EC2 Instance Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
instance_status_anomaly | High | Detects EC2 instances not in running state | Instance state is not running | Instance status information | Instance details, resource ID |
cpu_utilization_anomaly | Medium-High | Detects CPU utilization issues (both high and low) | - Current CPU utilization > 90% (high usage) - 48-hour average CPU utilization < 15% (underutilization) | CPU utilization percentage with threshold information | CPU metrics, thresholds, resource ID |
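The EC2 checks can be approximated with boto3 and CloudWatch. A sketch under those assumptions; the function name is illustrative, and the "current" usage is approximated here by the most recent hourly average rather than a real-time reading:

```python
from datetime import datetime, timedelta, timezone
import boto3

HIGH_CPU_PCT = 90.0   # current utilization above this => high usage
LOW_CPU_PCT = 15.0    # 48-hour average below this => underutilization

def check_instance(instance_id, region="us-east-1"):
    """Sketch of the EC2 state and CPU utilization checks described above."""
    ec2 = boto3.client("ec2", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    anomalies = []

    # instance_status_anomaly: instance is not in the running state
    state = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]["State"]["Name"]
    if state != "running":
        anomalies.append(("instance_status_anomaly", state))

    # cpu_utilization_anomaly: hourly CPUUtilization averages over the last 48 hours
    now = datetime.now(timezone.utc)
    datapoints = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(hours=48),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )["Datapoints"]
    if datapoints:
        avg_48h = sum(d["Average"] for d in datapoints) / len(datapoints)
        latest = max(datapoints, key=lambda d: d["Timestamp"])["Average"]
        if latest > HIGH_CPU_PCT:
            anomalies.append(("cpu_utilization_anomaly", f"current {latest:.1f}% > {HIGH_CPU_PCT}%"))
        if avg_48h < LOW_CPU_PCT:
            anomalies.append(("cpu_utilization_anomaly", f"48h average {avg_48h:.1f}% < {LOW_CPU_PCT}%"))
    return anomalies
```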
ECS Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
ecs_orphaned_task | Medium | Detects ECS tasks not associated with any service | Task has no service name or empty service name | Task is orphaned (no service name) | Task details, resource ID |
ecs_task_status_anomaly | Medium | Detects ECS tasks in problematic states | Task status is STOPPED or DEACTIVATING | Task status information | Task details, resource ID |
ecs_service_scaling_anomaly | Medium | Detects ECS service scaling issues | Desired task count != running task count | Desired vs running count mismatch | Desired count, running count, resource ID |
ecs_service_pending_tasks | Medium | Detects ECS tasks that cannot be scheduled | Service has pending task count > 0 | Number of pending tasks that cannot be scheduled | Pending count, service details, resource ID |
ecs_service_status_anomaly | Medium | Detects ECS services in abnormal states | Service status is not ACTIVE or DRAINING | Service status information | Service details, resource ID |
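The service-level ECS conditions above correspond to fields returned by the ECS DescribeServices API. A minimal boto3 sketch, with illustrative names:

```python
import boto3

def check_ecs_services(cluster, region="us-east-1"):
    """Sketch of the ECS service checks from the table above."""
    ecs = boto3.client("ecs", region_name=region)
    anomalies = []
    service_arns = ecs.list_services(cluster=cluster)["serviceArns"]
    # describe_services accepts at most 10 services per call; batch accordingly
    for i in range(0, len(service_arns), 10):
        batch = service_arns[i:i + 10]
        for svc in ecs.describe_services(cluster=cluster, services=batch)["services"]:
            if svc["status"] not in ("ACTIVE", "DRAINING"):
                anomalies.append(("ecs_service_status_anomaly", svc["serviceName"], svc["status"]))
            if svc["desiredCount"] != svc["runningCount"]:
                anomalies.append(("ecs_service_scaling_anomaly", svc["serviceName"],
                                  f'desired={svc["desiredCount"]} running={svc["runningCount"]}'))
            if svc["pendingCount"] > 0:
                anomalies.append(("ecs_service_pending_tasks", svc["serviceName"], svc["pendingCount"]))
    return anomalies
```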
Load Balancer Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
target_group_high_error_rate | High | Detects high 5XX error rates in target groups | HTTP 5XX error rate > 15% | - High 5XX error rate with percentage - Load balancer experiencing backend errors | Load balancer name, target group ARN, error metrics, threshold |
target_group_low_health | High | Detects insufficient healthy targets in target groups | Healthy host count < 50% of total targets | - Low healthy host count ratio - Health percentage information | Load balancer name, target group ARN, health metrics, threshold |
target_group_high_latency | Medium | Detects high response times in target groups | Target response time > 1.0 second | - High response time with threshold information - Slow backend response indicators | Load balancer name, target group ARN, response time metrics, threshold |
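The 5XX error-rate check can be reproduced from CloudWatch Application Load Balancer metrics. A sketch under that assumption; the function name is illustrative, and the dimension values shown in the docstring are example formats, not real resources:

```python
from datetime import datetime, timedelta, timezone
import boto3

ERROR_RATE_PCT = 15.0  # target_group_high_error_rate threshold

def target_group_error_rate(lb_dimension, tg_dimension, region="us-east-1"):
    """Compute the 5XX error rate for one target group over the last 15 minutes.

    lb_dimension / tg_dimension are CloudWatch dimension values, e.g.
    'app/my-alb/50dc6c495c0c9188' and 'targetgroup/my-tg/943f017f100becff'.
    """
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    dims = [
        {"Name": "LoadBalancer", "Value": lb_dimension},
        {"Name": "TargetGroup", "Value": tg_dimension},
    ]

    def total(metric):
        points = cw.get_metric_statistics(
            Namespace="AWS/ApplicationELB",
            MetricName=metric,
            Dimensions=dims,
            StartTime=now - timedelta(minutes=15),
            EndTime=now,
            Period=900,
            Statistics=["Sum"],
        )["Datapoints"]
        return sum(p["Sum"] for p in points)

    requests = total("RequestCount")
    errors = total("HTTPCode_Target_5XX_Count")
    rate = (errors / requests * 100) if requests else 0.0
    return rate, rate > ERROR_RATE_PCT
```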
Prometheus/Istio Anomalies
Service Mesh Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
istio_error_rate | Medium | Detects high error rates in Istio service mesh | Service error rate > 10% | Error rate exceeds threshold with percentage | Service details, error rate metrics, resource ID |
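Per-service error rates in an Istio mesh can be derived from Istio's standard `istio_requests_total` metric via the Prometheus HTTP API. A minimal sketch; the Prometheus URL and function name are assumptions for illustration:

```python
import requests

PROM_URL = "http://prometheus.istio-system:9090"  # assumption: in-cluster Prometheus address
ERROR_RATE_THRESHOLD = 0.10                       # 10%

# Ratio of 5xx responses to all requests per destination service over 5 minutes.
QUERY = """
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total[5m])) by (destination_service)
"""

def high_error_rate_services():
    """Return {service: error_rate} for services exceeding the 10% threshold."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {
        r["metric"].get("destination_service", "unknown"): float(r["value"][1])
        for r in results
        if float(r["value"][1]) > ERROR_RATE_THRESHOLD
    }
```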
Anomaly Data Structure
Each detected anomaly contains the information described in the following table.
Field | Description |
---|---|
source_system | The system that generated the data (for example, kubernetes, aws). |
affected_system | The system affected by the anomaly. |
anomaly_type | Specific type of anomaly detected. |
resource_type | Type of resource affected (for example, pod, node, instance). |
name | Name/identifier of the affected resource. |
timestamp | When the anomaly was detected. |
metric | The metric that triggered the anomaly. |
value | The value that triggered the anomaly. |
symptoms | A list of human-readable symptoms describing the issue. |
metadata | Additional context and details about the anomaly. |
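The fields in the table above map naturally onto a single record type. A minimal sketch as a Python dataclass; the class name is illustrative and the field types are assumptions based on the descriptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Anomaly:
    """One detected anomaly, mirroring the fields described above."""
    source_system: str            # e.g. "kubernetes", "aws"
    affected_system: str          # the system affected by the anomaly
    anomaly_type: str             # e.g. "pod_crashloopbackoff_anomaly"
    resource_type: str            # e.g. "pod", "node", "instance"
    name: str                     # name/identifier of the affected resource
    timestamp: datetime           # when the anomaly was detected
    metric: str                   # the metric that triggered the anomaly
    value: Any                    # the value that triggered the anomaly
    symptoms: list[str] = field(default_factory=list)   # human-readable symptoms
    metadata: dict[str, Any] = field(default_factory=dict)  # additional context
```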
Severity Levels
The system uses the following severity levels:
Severity Level | Description |
---|---|
Critical | Immediate attention is required; system functionality is severely impacted. |
High | Significant issues that need prompt attention. |
Medium | Issues that should be addressed but are not immediately critical. |
Low | Minor issues or optimization opportunities. |
Detection Thresholds
Key thresholds used across the system:
Kubernetes
- Pod pending timeout: 3 minutes
- Container restart count threshold: 10 restarts
AWS
- CPU utilization high threshold: 90%
- CPU utilization low threshold: 15% (48-hour average)
- Target group error rate threshold: 15%
- Target group health threshold: 50% healthy targets
- Target group response time threshold: 1.0 second
Prometheus/Istio
- Service error rate threshold: 10%
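The thresholds above can be collected into a single configuration structure. A sketch only; the key names are illustrative and may differ from the detector's actual configuration:

```python
# Detection thresholds listed above, gathered in one place for reference.
THRESHOLDS = {
    "kubernetes": {
        "pod_pending_timeout_seconds": 180,     # 3 minutes
        "container_restart_count": 10,
    },
    "aws": {
        "cpu_utilization_high_pct": 90,
        "cpu_utilization_low_pct": 15,          # 48-hour average
        "target_group_error_rate_pct": 15,
        "target_group_healthy_pct": 50,
        "target_group_response_time_s": 1.0,
    },
    "prometheus_istio": {
        "service_error_rate_pct": 10,
    },
}
```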