Monitoring and alerting

Atlas uses AWS-native observability in both environment roots. The current implementation emphasizes fast signal capture over deep customization.

Signals by layer

| Layer | Signal | Current implementation |
| --- | --- | --- |
| ECS | task count, CPU, memory | Container Insights on the shared ECS cluster |
| ALB | unhealthy targets and traffic health | CloudWatch metrics and the `alb_unhealthy_hosts` alarm |
| MSK | broker health and replication | enhanced monitoring plus the `msk_under_replicated` alarm |
| Logs | workload and connector logs | CloudWatch log groups with environment-specific retention |
| Cost | monthly spend thresholds | AWS Budget notifications to `owner_email` |

Provisioned alarms

| Alarm | What it means |
| --- | --- |
| `ecs_no_running_tasks` | the events ingestion ECS service has no running tasks |
| `alb_unhealthy_hosts` | the shared ALB sees unhealthy registered targets |
| `msk_under_replicated` | the Kafka cluster has under-replicated partitions |

All alarms publish to a shared SNS topic whose name is derived from the root prefix. The operator email configured in `owner_email` is subscribed to that topic.
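
As a rough illustration of how these three alarms could be expressed, the sketch below builds keyword arguments in the shape accepted by boto3's `put_metric_alarm`, without calling AWS. `RunningTaskCount` (Container Insights), `UnHealthyHostCount`, and `UnderReplicatedPartitions` are standard CloudWatch metrics, but which metrics, statistics, thresholds, and periods Atlas actually uses is not stated here, so treat all of them as assumptions. The helper name `alarm_kwargs`, the alarm naming scheme, and the example ARN are hypothetical, and a real alarm would also need `Dimensions` for the specific cluster, load balancer, or service.

```python
# Sketch only: metric choices and thresholds below are assumptions,
# not the values Atlas provisions.
ALARM_SPECS = {
    "ecs_no_running_tasks": {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Statistic": "Minimum",
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": 1.0,  # fires when no task is running
    },
    "alb_unhealthy_hosts": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Statistic": "Maximum",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 0.0,  # fires on any unhealthy target
    },
    "msk_under_replicated": {
        "Namespace": "AWS/Kafka",
        "MetricName": "UnderReplicatedPartitions",
        "Statistic": "Maximum",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 0.0,  # fires on any under-replicated partition
    },
}


def alarm_kwargs(name, prefix, topic_arn):
    """Return put_metric_alarm-style kwargs for one alarm (hypothetical helper)."""
    return {
        "AlarmName": f"{prefix}-{name}",  # naming scheme is assumed
        "Period": 60,                     # seconds; assumed
        "EvaluationPeriods": 1,           # assumed
        "AlarmActions": [topic_arn],      # every alarm targets the shared SNS topic
        **ALARM_SPECS[name],
    }


example = alarm_kwargs(
    "alb_unhealthy_hosts",
    "atlas-staging",  # hypothetical root prefix
    "arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder ARN
)
```

Keeping the per-alarm differences in one table makes it easy to see that only the metric and comparison change, while the notification target is the same for all three.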

Log groups to expect

/ecs/<events-service-name>
/ecs/<dashboard-service-name>
/ecs/<kafka-ui-service-name>
/msk-connect/<connector-name> when the sink is enabled
/vpc/flow-logs
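
The naming pattern above can be captured in a small helper, which may be handy when checking that an environment has the log groups it should. This is a minimal sketch: the function name `expected_log_groups` and the service and connector names in the example are hypothetical, and the connector group is only included when a connector name is passed, mirroring the "when the sink is enabled" condition.

```python
def expected_log_groups(events, dashboard, kafka_ui, connector=None):
    """Build the list of log group names following the pattern above.

    Pass connector=None when the sink is disabled.
    """
    groups = [
        f"/ecs/{events}",
        f"/ecs/{dashboard}",
        f"/ecs/{kafka_ui}",
    ]
    if connector is not None:
        groups.append(f"/msk-connect/{connector}")
    groups.append("/vpc/flow-logs")
    return groups


# Hypothetical service names, for illustration only.
without_sink = expected_log_groups("events", "dashboard", "kafka-ui")
with_sink = expected_log_groups("events", "dashboard", "kafka-ui", "sink")
```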

Environment differences

The `staging` example values keep most log retention at 1 day for cost control.
The `prod` committed values raise application, connector, and flow-log retention to 7 days.
Both roots keep the same alarm types and budget threshold model.
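
One way to spot retention drift between the two roots is to compare observed retention against the figures above. The sketch below is illustrative only: `retention_mismatches` is a hypothetical helper, the observed values are made up, and because staging keeps "most" rather than all retention at 1 day, a staging mismatch may be intentional.

```python
# Retention figures taken from the text above; staging is "most", not "all".
EXPECTED_RETENTION_DAYS = {"staging": 1, "prod": 7}


def retention_mismatches(environment, observed):
    """Return the log groups whose retention differs from the expected value.

    `observed` maps log group name -> retention in days.
    """
    expected = EXPECTED_RETENTION_DAYS[environment]
    return {group: days for group, days in observed.items() if days != expected}


# Made-up observations for illustration.
drift = retention_mismatches(
    "prod",
    {"/vpc/flow-logs": 7, "/ecs/events": 1},
)
```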

Note: The current CloudWatch alarms focus on service availability, edge health, and Kafka replication. They do not replace service-level SLOs or application-specific dashboards.
