Monitoring and alerting
Atlas uses AWS-native observability in both environment roots. The current implementation emphasizes fast signal capture over deep customization.
Signals by layer
| Layer | Signal | Current implementation |
|---|---|---|
| ECS | task count, CPU, memory | Container Insights on the shared ECS cluster |
| ALB | unhealthy targets and traffic health | CloudWatch metrics and the alb_unhealthy_hosts alarm |
| MSK | broker health and replication | enhanced monitoring plus the msk_under_replicated alarm |
| Logs | workload and connector logs | CloudWatch log groups with environment-specific retention |
| Cost | monthly spend thresholds | AWS Budget notifications to owner_email |
Provisioned alarms
| Alarm | What it means |
|---|---|
| ecs_no_running_tasks | the events ingestion ECS service has no running tasks |
| alb_unhealthy_hosts | the shared ALB sees unhealthy registered targets |
| msk_under_replicated | the Kafka cluster has under-replicated partitions |
All alarms publish to a shared SNS topic whose name is derived from the root prefix. The operator email configured in owner_email is subscribed to that topic.
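To make the alarm table more concrete, the sketch below builds keyword arguments in the shape that boto3's CloudWatch put_metric_alarm call accepts. It is an illustration under assumptions, not the Atlas configuration: the metric and namespace pairings are real CloudWatch metrics that plausibly back each alarm, but the thresholds, period, statistic, the prefix-suffix naming pattern, and the function name alarm_definitions are all hypothetical. Dimensions (cluster, service, target group, broker) are left out and would be required in practice.

```python
from typing import Any, Dict, List


def alarm_definitions(prefix: str, topic_arn: str) -> List[Dict[str, Any]]:
    """Return put_metric_alarm-style kwargs for the three documented alarms.

    Only the alarm suffixes come from the documentation; every other value
    is an assumption chosen for illustration.
    """
    # (suffix, namespace, metric, comparison, threshold): assumed mappings
    specs = [
        ("ecs_no_running_tasks", "ECS/ContainerInsights", "RunningTaskCount",
         "LessThanThreshold", 1),
        ("alb_unhealthy_hosts", "AWS/ApplicationELB", "UnHealthyHostCount",
         "GreaterThanThreshold", 0),
        ("msk_under_replicated", "AWS/Kafka", "UnderReplicatedPartitions",
         "GreaterThanThreshold", 0),
    ]
    return [
        {
            "AlarmName": f"{prefix}-{suffix}",  # hypothetical naming pattern
            "Namespace": namespace,
            "MetricName": metric,
            "Statistic": "Maximum",             # assumed statistic
            "Period": 60,                       # assumed 1-minute period
            "EvaluationPeriods": 1,             # assumed
            "ComparisonOperator": comparison,
            "Threshold": threshold,
            "AlarmActions": [topic_arn],        # the shared SNS topic
            # Dimensions are intentionally omitted in this sketch.
        }
        for suffix, namespace, metric, comparison, threshold in specs
    ]
```

Each returned dictionary could be passed as keyword arguments to a CloudWatch client once the appropriate Dimensions entry is added. Pointing every alarm at the same topic ARN mirrors the single shared SNS topic described above.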
Log groups to expect
- /ecs/<events-service-name>
- /ecs/<dashboard-service-name>
- /ecs/<kafka-ui-service-name>
- /msk-connect/<connector-name> when the sink is enabled
- /vpc/flow-logs
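The naming pattern above can be expressed as a small helper, which may be handy when checking that the expected groups exist. This is a minimal sketch: the function name expected_log_groups and its parameters are hypothetical, and the service and connector names must be supplied by the caller because they are not given here.

```python
from typing import List, Optional


def expected_log_groups(
    events_service: str,
    dashboard_service: str,
    kafka_ui_service: str,
    connector_name: Optional[str] = None,
) -> List[str]:
    """Build the log group names described above from caller-supplied names.

    Pass connector_name only when the sink is enabled; otherwise the
    MSK Connect log group is not expected.
    """
    groups = [
        f"/ecs/{events_service}",
        f"/ecs/{dashboard_service}",
        f"/ecs/{kafka_ui_service}",
    ]
    if connector_name is not None:
        groups.append(f"/msk-connect/{connector_name}")
    # The flow-log group name does not depend on any service name.
    groups.append("/vpc/flow-logs")
    return groups
```

Comparing this list against the log groups actually present in an account would show whether a service has never logged or whether the sink is disabled.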
Environment differences
- staging example values keep most log retention at 1 day for cost control.
- prod committed values raise application, connector, and flow-log retention to 7 days.
- Both roots keep the same alarm types and budget threshold model.
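The retention difference between the two roots can be captured as a lookup plus a comparison, for example to flag log groups that drifted from the documented values. This is a sketch under assumptions: the category keys (application, connector, flow_logs), the names RETENTION_DAYS and retention_mismatches, and the choice to treat every staging category as 1 day (the text says only "most") are all hypothetical simplifications.

```python
from typing import Dict

# Assumed category names. Staging is documented as keeping "most" retention
# at 1 day, so setting every category to 1 here is a simplification.
RETENTION_DAYS: Dict[str, Dict[str, int]] = {
    "staging": {"application": 1, "connector": 1, "flow_logs": 1},
    "prod": {"application": 7, "connector": 7, "flow_logs": 7},
}


def retention_mismatches(env: str, observed: Dict[str, int]) -> Dict[str, int]:
    """Return the observed categories whose retention differs from the table.

    Categories missing from the table are ignored rather than reported.
    """
    expected = RETENTION_DAYS[env]
    return {
        kind: days
        for kind, days in observed.items()
        if kind in expected and expected[kind] != days
    }
```

For instance, calling retention_mismatches with "prod" and an observed connector retention of 1 day would report only the connector category.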
Note
The current CloudWatch alarms focus on service availability, edge health, and Kafka replication. They do not replace service-level SLOs or application-specific dashboards.