Supported Anomaly Detection Types
This topic describes the types of anomalies that the SRE anomaly detection system detects across different platforms and services.
Overview
The anomaly detection system monitors multiple platforms and services to identify issues that could impact system reliability and performance.
Kubernetes Anomalies
Node Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
node_status_anomaly | High | Detects unhealthy Kubernetes nodes | - Node health status is unhealthy - Node ready status is not True | Node Ready status is not True | Node name, ready status, resource ID |
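The node check can be reproduced with the Kubernetes Python client. The following is a minimal sketch under that assumption; the function name and returned fields are illustrative, not the detector's actual code.

```python
from kubernetes import client, config

def find_unhealthy_nodes():
    """Flag nodes whose Ready condition is not True, mirroring node_status_anomaly."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside a cluster
    anomalies = []
    for node in client.CoreV1Api().list_node().items:
        # The Ready condition reports "True", "False", or "Unknown".
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            anomalies.append({
                "anomaly_type": "node_status_anomaly",
                "name": node.metadata.name,
                "symptoms": [f"Node Ready status is {ready}"],
            })
    return anomalies
```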
Pod Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
pod_crashloopbackoff_anomaly | Critical | Detects containers stuck in CrashLoopBackOff state | Container waiting state reason is CrashLoopBackOff | - Container in CrashLoopBackOff state - Container restart count information - Error messages from container | Container name, restart count, error details |
pod_imagepull_anomaly | High | Detects image pull failures preventing container startup | - Container waiting state reason is ImagePullBackOff - Container waiting state reason is ErrImagePull | - Container in ImagePullBackOff or ErrImagePull state - Specific error messages | Container name, error details |
pod_high_restart_anomaly | Medium | Detects containers with excessive restart counts indicating instability | Container restart count >= 10 | - High restart count - Container instability indicators | Container name, restart count |
pod_container_not_ready_anomaly | Medium | Detects containers that are not ready without clear waiting reasons | - Container ready status is false - No clear waiting state reason | Container not ready status | Container name, restart count |
pod_pending_anomaly | Medium | Detects pods stuck in Pending state for extended periods | - Pod phase is Pending - Pending duration > 3 minutes from creation time | - Pod stuck in Pending state with duration - Container readiness information - Recent restart information if applicable | Namespace, phase, node name, pending duration, creation timestamp |
pod_status_anomaly | Medium | General pod health issues not covered by specific anomaly types | - Pod is not in healthy state (Running/Succeeded) - Pod is not ready - Recent restarts detected | - Pod status information - Container readiness ratios - Restart information | Namespace, phase, node name, container statuses |
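The container- and pod-level conditions in the table above map to fields exposed by the Kubernetes Python client. A minimal sketch, assuming access to a kubeconfig; names and the tuple-based return format are illustrative only:

```python
from datetime import datetime, timezone
from kubernetes import client, config

RESTART_THRESHOLD = 10          # pod_high_restart_anomaly
PENDING_TIMEOUT_SECONDS = 180   # pod_pending_anomaly (3 minutes)

def find_pod_anomalies(namespace="default"):
    """Apply the container- and pod-level checks described in the table above."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    anomalies = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting.reason if cs.state and cs.state.waiting else None
            if waiting == "CrashLoopBackOff":
                anomalies.append(("pod_crashloopbackoff_anomaly", pod.metadata.name, cs.name))
            elif waiting in ("ImagePullBackOff", "ErrImagePull"):
                anomalies.append(("pod_imagepull_anomaly", pod.metadata.name, cs.name))
            elif cs.restart_count >= RESTART_THRESHOLD:
                anomalies.append(("pod_high_restart_anomaly", pod.metadata.name, cs.name))
            elif not cs.ready and waiting is None:
                anomalies.append(("pod_container_not_ready_anomaly", pod.metadata.name, cs.name))
        if pod.status.phase == "Pending":
            age = (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds()
            if age > PENDING_TIMEOUT_SECONDS:
                anomalies.append(("pod_pending_anomaly", pod.metadata.name, None))
    return anomalies
```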
AWS Anomalies
EC2 Instance Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
instance_status_anomaly | High | Detects EC2 instances not in running state | Instance state is not running | Instance status information | Instance details, resource ID |
cpu_utilization_anomaly | Medium-High | Detects CPU utilization issues (both high and low) | - Current CPU utilization > 90% (high usage) - 48-hour average CPU utilization < 15% (underutilization) | CPU utilization percentage with threshold information | CPU metrics, thresholds, resource ID |
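The EC2 checks can be approximated with boto3 and CloudWatch. A sketch under those assumptions; the function name is illustrative, and the "current" usage is approximated here by the most recent hourly average rather than a real-time reading:

```python
from datetime import datetime, timedelta, timezone
import boto3

HIGH_CPU_PCT = 90.0   # current utilization above this => high usage
LOW_CPU_PCT = 15.0    # 48-hour average below this => underutilization

def check_instance(instance_id, region="us-east-1"):
    """Sketch of the EC2 state and CPU utilization checks described above."""
    ec2 = boto3.client("ec2", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    anomalies = []

    # instance_status_anomaly: instance is not in the running state
    state = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]["State"]["Name"]
    if state != "running":
        anomalies.append(("instance_status_anomaly", state))

    # cpu_utilization_anomaly: hourly CPUUtilization averages over the last 48 hours
    now = datetime.now(timezone.utc)
    datapoints = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(hours=48),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )["Datapoints"]
    if datapoints:
        avg_48h = sum(d["Average"] for d in datapoints) / len(datapoints)
        latest = max(datapoints, key=lambda d: d["Timestamp"])["Average"]
        if latest > HIGH_CPU_PCT:
            anomalies.append(("cpu_utilization_anomaly", f"current {latest:.1f}% > {HIGH_CPU_PCT}%"))
        if avg_48h < LOW_CPU_PCT:
            anomalies.append(("cpu_utilization_anomaly", f"48h average {avg_48h:.1f}% < {LOW_CPU_PCT}%"))
    return anomalies
```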
ECS Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
ecs_orphaned_task | Medium | Detects ECS tasks not associated with any service | Task has no service name or empty service name | Task is orphaned (no service name) | Task details, resource ID |
ecs_task_status_anomaly | Medium | Detects ECS tasks in problematic states | Task status is STOPPED or DEACTIVATING | Task status information | Task details, resource ID |
ecs_service_scaling_anomaly | Medium | Detects ECS service scaling issues | Desired task count != running task count | Desired vs running count mismatch | Desired count, running count, resource ID |
ecs_service_pending_tasks | Medium | Detects ECS tasks that cannot be scheduled | Service has pending task count > 0 | Number of pending tasks that cannot be scheduled | Pending count, service details, resource ID |
ecs_service_status_anomaly | Medium | Detects ECS services in abnormal states | Service status is not ACTIVE or DRAINING | Service status information | Service details, resource ID |
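The service-level ECS conditions above correspond to fields returned by the ECS DescribeServices API. A minimal boto3 sketch, with illustrative names:

```python
import boto3

def check_ecs_services(cluster, region="us-east-1"):
    """Sketch of the ECS service checks from the table above."""
    ecs = boto3.client("ecs", region_name=region)
    anomalies = []
    service_arns = ecs.list_services(cluster=cluster)["serviceArns"]
    # describe_services accepts at most 10 services per call; batch accordingly
    for i in range(0, len(service_arns), 10):
        batch = service_arns[i:i + 10]
        for svc in ecs.describe_services(cluster=cluster, services=batch)["services"]:
            if svc["status"] not in ("ACTIVE", "DRAINING"):
                anomalies.append(("ecs_service_status_anomaly", svc["serviceName"], svc["status"]))
            if svc["desiredCount"] != svc["runningCount"]:
                anomalies.append(("ecs_service_scaling_anomaly", svc["serviceName"],
                                  f'desired={svc["desiredCount"]} running={svc["runningCount"]}'))
            if svc["pendingCount"] > 0:
                anomalies.append(("ecs_service_pending_tasks", svc["serviceName"], svc["pendingCount"]))
    return anomalies
```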
Load Balancer Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
target_group_high_error_rate | High | Detects high 5XX error rates in target groups | HTTP 5XX error rate > 15% | - High 5XX error rate with percentage - Load balancer experiencing backend errors | Load balancer name, target group ARN, error metrics, threshold |
target_group_low_health | High | Detects insufficient healthy targets in target groups | Healthy host count < 50% of total targets | - Low healthy host count ratio - Health percentage information | Load balancer name, target group ARN, health metrics, threshold |
target_group_high_latency | Medium | Detects high response times in target groups | Target response time > 1.0 second | - High response time with threshold information - Slow backend response indicators | Load balancer name, target group ARN, response time metrics, threshold |
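The 5XX error-rate check can be reproduced from CloudWatch Application Load Balancer metrics. A sketch under that assumption; the function name is illustrative, and the dimension values shown in the docstring are example formats, not real resources:

```python
from datetime import datetime, timedelta, timezone
import boto3

ERROR_RATE_PCT = 15.0  # target_group_high_error_rate threshold

def target_group_error_rate(lb_dimension, tg_dimension, region="us-east-1"):
    """Compute the 5XX error rate for one target group over the last 15 minutes.

    lb_dimension / tg_dimension are CloudWatch dimension values, e.g.
    'app/my-alb/50dc6c495c0c9188' and 'targetgroup/my-tg/943f017f100becff'.
    """
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    dims = [
        {"Name": "LoadBalancer", "Value": lb_dimension},
        {"Name": "TargetGroup", "Value": tg_dimension},
    ]

    def total(metric):
        points = cw.get_metric_statistics(
            Namespace="AWS/ApplicationELB",
            MetricName=metric,
            Dimensions=dims,
            StartTime=now - timedelta(minutes=15),
            EndTime=now,
            Period=900,
            Statistics=["Sum"],
        )["Datapoints"]
        return sum(p["Sum"] for p in points)

    requests = total("RequestCount")
    errors = total("HTTPCode_Target_5XX_Count")
    rate = (errors / requests * 100) if requests else 0.0
    return rate, rate > ERROR_RATE_PCT
```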
Prometheus/Istio Anomalies
Service Mesh Anomalies
Anomaly | Severity | Description | Trigger Conditions | Symptoms | Metadata |
---|---|---|---|---|---|
istio_error_rate | Medium | Detects high error rates in Istio service mesh | Service error rate > 10% | Error rate exceeds threshold with percentage | Service details, error rate metrics, resource ID |
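Per-service error rates in an Istio mesh can be derived from Istio's standard `istio_requests_total` metric via the Prometheus HTTP API. A minimal sketch; the Prometheus URL and function name are assumptions for illustration:

```python
import requests

PROM_URL = "http://prometheus.istio-system:9090"  # assumption: in-cluster Prometheus address
ERROR_RATE_THRESHOLD = 0.10                       # 10%

# Ratio of 5xx responses to all requests per destination service over 5 minutes.
QUERY = """
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total[5m])) by (destination_service)
"""

def high_error_rate_services():
    """Return {service: error_rate} for services exceeding the 10% threshold."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {
        r["metric"].get("destination_service", "unknown"): float(r["value"][1])
        for r in results
        if float(r["value"][1]) > ERROR_RATE_THRESHOLD
    }
```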
Anomaly Data Structure
Each detected anomaly contains the information described in the following table.
Field | Description |
---|---|
source_system | The system that generated the data (for example, kubernetes, aws). |
affected_system | The system affected by the anomaly. |
anomaly_type | Specific type of anomaly detected. |
resource_type | Type of resource affected (for example, pod, node, instance). |
name | Name/identifier of the affected resource. |
timestamp | When the anomaly was detected. |
metric | The metric that triggered the anomaly. |
value | The value that triggered the anomaly. |
symptoms | A list of human-readable symptoms describing the issue. |
metadata | Additional context and details about the anomaly. |
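The fields in the table above map naturally onto a single record type. A minimal sketch as a Python dataclass; the class name is illustrative and the field types are assumptions based on the descriptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Anomaly:
    """One detected anomaly, mirroring the fields described above."""
    source_system: str            # e.g. "kubernetes", "aws"
    affected_system: str          # the system affected by the anomaly
    anomaly_type: str             # e.g. "pod_crashloopbackoff_anomaly"
    resource_type: str            # e.g. "pod", "node", "instance"
    name: str                     # name/identifier of the affected resource
    timestamp: datetime           # when the anomaly was detected
    metric: str                   # the metric that triggered the anomaly
    value: Any                    # the value that triggered the anomaly
    symptoms: list[str] = field(default_factory=list)   # human-readable symptoms
    metadata: dict[str, Any] = field(default_factory=dict)  # additional context
```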
Severity Levels
The system uses the following severity levels:
Severity Level | Description |
---|---|
Critical | Immediate attention is required; system functionality is severely impacted. |
High | Significant issues that need prompt attention. |
Medium | Issues that should be addressed but are not immediately critical. |
Low | Minor issues or optimization opportunities. |
Detection Thresholds
Key thresholds used across the system:
Kubernetes
- Pod pending timeout: 3 minutes
- Container restart count threshold: 10 restarts
AWS
- CPU utilization high threshold: 90%
- CPU utilization low threshold: 15% (48-hour average)
- Target group error rate threshold: 15%
- Target group health threshold: 50% healthy targets
- Target group response time threshold: 1.0 second
Prometheus/Istio
- Service error rate threshold: 10%
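The thresholds above can be collected into a single configuration structure. A sketch only; the key names are illustrative and may differ from the detector's actual configuration:

```python
# Detection thresholds listed above, gathered in one place for reference.
THRESHOLDS = {
    "kubernetes": {
        "pod_pending_timeout_seconds": 180,     # 3 minutes
        "container_restart_count": 10,
    },
    "aws": {
        "cpu_utilization_high_pct": 90,
        "cpu_utilization_low_pct": 15,          # 48-hour average
        "target_group_error_rate_pct": 15,
        "target_group_healthy_pct": 50,
        "target_group_response_time_s": 1.0,
    },
    "prometheus_istio": {
        "service_error_rate_pct": 10,
    },
}
```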