Monitoring and alerting

Atlas uses AWS-native observability in both environment roots. The current implementation emphasizes fast signal capture over deep customization.

Signals by layer

Layer | Signal | Current implementation
ECS | task count, CPU, memory | Container Insights on the shared ECS cluster
ALB | unhealthy targets and traffic health | CloudWatch metrics and the alb_unhealthy_hosts alarm
MSK | broker health and replication | enhanced monitoring plus the msk_under_replicated alarm
Logs | workload and connector logs | CloudWatch log groups with environment-specific retention
Cost | monthly spend thresholds | AWS Budget notifications to owner_email

Provisioned alarms

Alarm | What it means
ecs_no_running_tasks | the events ingestion ECS service has no running tasks
alb_unhealthy_hosts | the shared ALB sees unhealthy registered targets
msk_under_replicated | the Kafka cluster has under-replicated partitions

All alarms publish to a shared SNS topic. The topic name is derived from the root prefix, and the operator email configured in owner_email is subscribed to it.
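As a rough illustration of how the three alarms relate to CloudWatch metrics, the sketch below builds alarm definitions as plain dictionaries whose keys mirror the parameter names of the CloudWatch PutMetricAlarm API. It does not call AWS. The metric names (RunningTaskCount, UnHealthyHostCount, UnderReplicatedPartitions) are standard AWS metrics, but the thresholds, period, evaluation count, the `<prefix>-<alarm>` naming pattern, and the function name build_alarm_definitions are assumptions made for this example, not values taken from the Atlas roots. Metric dimensions are omitted for brevity.

```python
from typing import Any


def build_alarm_definitions(prefix: str, sns_topic_arn: str) -> list[dict[str, Any]]:
    """Return illustrative definitions for the three provisioned alarms."""
    # Shared settings. Period and evaluation count are assumptions.
    common: dict[str, Any] = {
        "Period": 60,
        "EvaluationPeriods": 2,
        "AlarmActions": [sns_topic_arn],  # every alarm notifies the shared topic
    }
    # (alarm name, namespace, metric, statistic, comparison, threshold)
    specs = [
        ("ecs_no_running_tasks", "ECS/ContainerInsights", "RunningTaskCount",
         "Minimum", "LessThanThreshold", 1),
        ("alb_unhealthy_hosts", "AWS/ApplicationELB", "UnHealthyHostCount",
         "Maximum", "GreaterThanThreshold", 0),
        ("msk_under_replicated", "AWS/Kafka", "UnderReplicatedPartitions",
         "Maximum", "GreaterThanThreshold", 0),
    ]
    return [
        {
            "AlarmName": f"{prefix}-{name}",  # hypothetical naming pattern
            "Namespace": namespace,
            "MetricName": metric,
            "Statistic": statistic,
            "ComparisonOperator": comparison,
            "Threshold": threshold,
            **common,
        }
        for name, namespace, metric, statistic, comparison, threshold in specs
    ]
```

Each dictionary could be passed to a CloudWatch client once the relevant dimensions (cluster, service, load balancer, or Kafka cluster name) are filled in.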

Log groups to expect

  • /ecs/<events-service-name>
  • /ecs/<dashboard-service-name>
  • /ecs/<kafka-ui-service-name>
  • /msk-connect/<connector-name> when the sink is enabled
  • /vpc/flow-logs
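The naming pattern above can be sketched as a small helper that lists the log groups to expect for a given root. The function name expected_log_groups and the service and connector names used with it are hypothetical. The path prefixes come from the list above, and the connector group is included only when a connector name is supplied, mirroring the "when the sink is enabled" condition.

```python
from typing import Optional


def expected_log_groups(
    events_service: str,
    dashboard_service: str,
    kafka_ui_service: str,
    connector_name: Optional[str] = None,
) -> list[str]:
    """List the CloudWatch log groups expected for one environment root."""
    groups = [
        f"/ecs/{events_service}",
        f"/ecs/{dashboard_service}",
        f"/ecs/{kafka_ui_service}",
    ]
    # The MSK Connect log group exists only when the sink is enabled.
    if connector_name is not None:
        groups.append(f"/msk-connect/{connector_name}")
    groups.append("/vpc/flow-logs")
    return groups
```

Comparing this list against the log groups actually present in an account is one quick way to confirm that a root applied cleanly.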

Environment differences

  • The staging example values keep most log retention at 1 day for cost control.
  • The prod committed values raise application, connector, and flow-log retention to 7 days.
  • Both roots keep the same alarm types and budget threshold model.
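The retention difference can be pictured as a simple per-environment lookup. This is a minimal sketch, not the actual variable layout of either root: the names RETENTION_DAYS and retention_days are hypothetical, and the staging entry simplifies "most log retention" to a single value, so any staging exceptions are not modeled.

```python
# Log retention in days per environment, as described above.
# Assumption: one value per environment; staging exceptions are not modeled.
RETENTION_DAYS: dict[str, int] = {
    "staging": 1,
    "prod": 7,
}


def retention_days(environment: str) -> int:
    """Return the log retention for an environment, or raise if unknown."""
    try:
        return RETENTION_DAYS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```

Failing loudly on an unknown environment avoids silently applying the wrong retention to a new root.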
Note: The current CloudWatch alarms focus on service availability, edge health, and Kafka replication. They do not replace service-level SLOs or application-specific dashboards.