The Complete Amazon CloudWatch Tutorial (2025): Metrics, Logs, Alarms, Dashboards & Real-World Monitoring Use Cases

By Srikanth Ch, Senior DevOps Engineer | Last updated: January 2025

Introduction to Amazon CloudWatch

Picture this scenario: You deploy a critical application to production on Friday evening. Everything looks fine, so you head home for the weekend. Monday morning arrives, and you discover the application crashed Saturday night due to memory exhaustion. Customers couldn’t access the service for over 30 hours. No alerts fired. No one knew.

This nightmare happens more often than you’d think, and it’s entirely preventable.

Amazon CloudWatch exists precisely to stop these situations from happening. It’s your always-on monitoring system that watches every heartbeat of your AWS infrastructure, captures logs from your applications, triggers alerts when things go wrong, and gives you visual CloudWatch dashboards to understand system health at a glance.

Think of CloudWatch as having a dedicated 24×7 operations team watching your AWS environment. Except this team never sleeps, never misses a metric spike, and can automatically take corrective action before problems escalate.

Whether you’re preparing for the AWS Solutions Architect (SAA-C03), DevOps Professional, or AWS Certified CloudOps Engineer – Associate certification exams, or simply want to build production-grade AWS monitoring for your workloads, understanding CloudWatch deeply is non-negotiable.

Real Incident: The Friday Deployment Nightmare

Let me share the full story behind that Friday deployment scenario because it illustrates exactly why CloudWatch matters.

Timeline of the incident:

Friday, 6:15 PM — Team deploys v2.3.0 of the checkout service. Smoke tests pass. Everyone heads home.
Saturday, 2:47 AM — A memory leak in the new code causes heap exhaustion. The application crashes silently. EC2 status checks still show “healthy” because the instance itself is fine.
Saturday, 2:48 AM → Sunday, 11:30 PM — Customers see 502 errors. Support tickets pile up, but the weekend on-call engineer doesn’t check email until Sunday night.
Monday, 8:15 AM — Engineering discovers 30+ hours of downtime. Revenue loss estimated at $47,000.

What went wrong? The team monitored only default EC2 metrics (CPU, network). They had no memory metrics, no application-level error monitoring, and no alarms configured for HTTP 5xx responses.

How CloudWatch would have prevented this:

CloudWatch Agent collecting memory utilization would have shown the leak
A CloudWatch alarm on memory exceeding 85% would have fired at 2:30 AM
An alarm on ALB 5xx error rate would have triggered within 3 minutes of the crash
SNS integration would have paged the on-call engineer immediately

The fix took 15 minutes once someone knew about it. The detection gap cost 30 hours. That’s the difference CloudWatch makes.

CloudWatch Architecture Overview

Before diving into individual components, let’s understand how CloudWatch’s architecture works. This mental model will help everything else click into place.

CloudWatch operates as a centralized monitoring hub that receives data from multiple sources and enables multiple outputs.

Data Sources

AWS services automatically publish CloudWatch metrics (EC2, RDS, Lambda, ELB, S3, and over 80 others). The CloudWatch Agent installed on EC2 or on-premises servers collects OS-level metrics and logs. Your applications can push custom metrics via the PutMetricData API. Additionally, VPC Flow Logs, CloudTrail, and other AWS logging sources feed into CloudWatch.

Core Components

Metrics are time-series data points organized by namespaces. CloudWatch Logs contains text-based log data organized into Log Groups and Log Streams. CloudWatch Alarms provide threshold-based notifications tied to metrics. Dashboards offer visual displays combining metrics, logs, and alarm states. Events (now Amazon EventBridge) enable event-driven triggers for automation.

Output Actions

When conditions are met, CloudWatch can send SNS notifications (email, SMS, webhooks), trigger Auto Scaling policies, invoke Lambda functions, execute Systems Manager automation documents, or integrate with third-party tools via SNS or EventBridge.

The authentication model is straightforward. AWS services have built-in permissions to publish metrics to CloudWatch. For custom metrics and logs from EC2 instances, you attach an IAM role with CloudWatch permissions. The CloudWatch Agent uses these credentials to authenticate before pushing data.

CloudWatch Metrics Deep Dive

Metrics are the foundation of CloudWatch monitoring. Every metric is a time-ordered set of data points published to CloudWatch.

Default AWS Metrics

Most AWS services automatically publish metrics without any configuration. When you launch an EC2 instance, CloudWatch immediately starts collecting CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, and similar metrics. For RDS databases, you get DatabaseConnections, ReadLatency, FreeStorageSpace, and dozens more.

These default metrics use standard resolution with 1-minute granularity for most services (5-minute for basic EC2 monitoring).

Custom Metrics

When default metrics aren’t enough, you publish custom metrics using the PutMetricData API. Common custom metrics include application-specific measurements like orders processed per minute, cache hit ratios, queue depths, and business KPIs like revenue per hour.

Here’s a simple example using the AWS CLI:

aws cloudwatch put-metric-data \
  --namespace "MyApplication" \
  --metric-name "OrdersProcessed" \
  --value 42 \
  --unit Count

For complete API reference, see AWS PutMetricData Documentation.
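The same data point can be published from application code. The sketch below only builds the request payload, so it runs without AWS credentials. The helper name `build_metric_datum` and the `Environment` dimension are illustrative assumptions, not part of the CLI example above. With boto3 installed, the resulting dictionary could be passed to the CloudWatch client's `put_metric_data` call.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        # Dimensions are name/value pairs that identify one series of the metric.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


request = {
    "Namespace": "MyApplication",
    "MetricData": [
        # "Environment" is a hypothetical dimension added for illustration.
        build_metric_datum("OrdersProcessed", 42, dimensions={"Environment": "prod"})
    ],
}

# With boto3 available, this request could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
print(request["MetricData"][0]["MetricName"], request["MetricData"][0]["Value"])
```

Keeping payload construction separate from the API call makes the metric shape easy to unit test before anything is published.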

Metric Namespaces

Namespaces are containers that isolate metrics from different sources. AWS services use namespaces like AWS/EC2, AWS/RDS, and AWS/Lambda. Your custom metrics should use your own namespace like MyCompany/OrderService to avoid confusion.

High-Resolution Metrics

For workloads requiring sub-minute visibility, CloudWatch supports high-resolution metrics with 1-second granularity. This is particularly valuable for real-time trading systems, gaming backends with strict latency requirements, and IoT applications processing rapid sensor data.

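High-resolution metrics typically cost more to run than standard resolution, largely because of more frequent PutMetricData calls and higher-priced high-resolution alarms (see CloudWatch Pricing for current rates), so use them selectively for workloads where the granularity genuinely matters.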

Reflection Prompt: Which workloads in your environment would benefit from 1-second metric granularity versus standard 1-minute resolution? Consider the cost-benefit tradeoff before enabling high-resolution metrics everywhere.

CloudWatch Logs Explained

If metrics tell you what happened, logs tell you why it happened. CloudWatch Logs provides centralized log management for all your AWS workloads.

Log Groups and Log Streams

The organizational hierarchy is simple. A Log Group is a collection of log streams that share the same retention and access control settings; think of it as a folder. A Log Stream represents a sequence of log events from a single source, like one EC2 instance or one Lambda execution environment (a single stream usually holds many invocations).

For example, you might have a Log Group called /ecs/checkout-service containing Log Streams for each container task running that service.

Log Retention

By default, CloudWatch Logs retains data indefinitely, which gets expensive fast. Always configure retention policies appropriate for your compliance and troubleshooting needs. Options range from 1 day to 10 years, or you can keep logs forever if required.

A common pattern: retain application logs for 30–90 days in CloudWatch for active troubleshooting, then export older logs to S3 for long-term archival at much lower cost.

CloudWatch Logs Insights

This is where CloudWatch Logs becomes genuinely powerful. CloudWatch Logs Insights provides a purpose-built query language for searching and analyzing log data at scale. See the CloudWatch Logs Insights Query Syntax documentation for complete reference.

Say your Lambda function is throwing errors. Instead of scrolling through thousands of log lines, you run a query:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

This returns the 50 most recent error messages with timestamps. You can get more sophisticated with aggregations:

filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)

This shows error counts bucketed into 5-minute intervals, instantly revealing whether errors are increasing or decreasing.
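To make the bucketing concrete, here is a rough local equivalent of filtering for ERROR and counting per 5-minute bin. It is only a sketch of the idea, not how Logs Insights is implemented. The function name and the sample log lines are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

BIN_SECONDS = 5 * 60  # same width as bin(5m)


def count_errors_by_bin(events, bin_seconds=BIN_SECONDS):
    """Count messages containing ERROR per time bucket.

    events: iterable of (timezone-aware datetime, message) pairs.
    Returns a dict mapping each bucket's start time to its error count.
    """
    counts = Counter()
    for timestamp, message in events:
        if "ERROR" not in message:
            continue  # plays the role of: filter @message like /ERROR/
        epoch = int(timestamp.timestamp())
        bucket_start = epoch - (epoch % bin_seconds)  # floor to the bucket start
        counts[datetime.fromtimestamp(bucket_start, tz=timezone.utc)] += 1
    return dict(sorted(counts.items()))


# Hypothetical log events for illustration.
sample = [
    (datetime(2025, 1, 6, 14, 1, 10, tzinfo=timezone.utc), "ERROR timeout calling payments"),
    (datetime(2025, 1, 6, 14, 3, 55, tzinfo=timezone.utc), "INFO request completed"),
    (datetime(2025, 1, 6, 14, 4, 59, tzinfo=timezone.utc), "ERROR timeout calling payments"),
    (datetime(2025, 1, 6, 14, 5, 0, tzinfo=timezone.utc), "ERROR connection reset"),
]
for bucket, count in count_errors_by_bin(sample).items():
    print(bucket.isoformat(), count)
```

Note that the filter runs before the aggregation: once events are reduced to counts, the original message text is no longer available to filter on.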

Log Subscriptions

For real-time log processing, you can subscribe Log Groups to destinations like Lambda functions, Kinesis Data Streams, or OpenSearch Service. This enables use cases like real-time security alerting on suspicious patterns, streaming logs to third-party SIEM platforms, and building custom log analytics pipelines.

Real-World Troubleshooting Example

A Lambda function processing SQS messages started failing intermittently. The team used CloudWatch Logs Insights to query for errors:

fields @timestamp, @message
| filter @message like /Task timed out/
| stats count(*) by bin(1h)

They discovered timeout errors spiking every day at 2 PM. Correlating with business data revealed a daily batch job was flooding the queue. The fix: increase Lambda concurrency limits and timeout settings during peak processing windows.

Without centralized logs and CloudWatch Logs Insights, this investigation would have taken hours instead of minutes.

CloudWatch Alarms Configuration

CloudWatch Alarms transform passive monitoring into active incident response. When metrics cross thresholds you define, alarms change state and trigger actions.

Alarm States

Every alarm exists in one of three states. OK means the metric is within the defined threshold. ALARM indicates the metric has breached the threshold. INSUFFICIENT_DATA means not enough data points exist to determine state, which is common when first creating alarms or if metric reporting stops.

Standard Alarms

A standard alarm watches a single metric against a threshold. For example, you might alarm when CPUUtilization exceeds 80% for 3 consecutive 5-minute periods, when healthy host count drops below 2 for 1 minute, or when 4XX errors exceed 100 per minute.

You configure the evaluation period (how long before triggering), the number of data points required, and the comparison operator.
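How these three settings interact is easier to see in code. The following is a deliberately simplified model of "M out of N" evaluation. It assumes exactly one data point per period and ignores missing-data handling, which the real service treats with more nuance. The function names are illustrative; the comparison operator strings match the values CloudWatch accepts.

```python
def breaches(value, threshold, operator="GreaterThanThreshold"):
    """Return True if a single data point breaches the threshold."""
    comparisons = {
        "GreaterThanThreshold": value > threshold,
        "GreaterThanOrEqualToThreshold": value >= threshold,
        "LessThanThreshold": value < threshold,
        "LessThanOrEqualToThreshold": value <= threshold,
    }
    return comparisons[operator]


def alarm_state(datapoints, threshold, evaluation_periods,
                datapoints_to_alarm=None, operator="GreaterThanThreshold"):
    """Simplified M-out-of-N evaluation over the most recent periods.

    Assumes one data point per period and no missing data.
    """
    if datapoints_to_alarm is None:
        datapoints_to_alarm = evaluation_periods  # "N out of N" by default
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    breaching = sum(breaches(v, threshold, operator) for v in recent)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical 5-minute CPU averages, oldest first.
print(alarm_state([45.0, 82.5, 91.0, 88.3], threshold=80, evaluation_periods=3))
print(alarm_state([82.5, 60.0, 88.3], threshold=80, evaluation_periods=3))
print(alarm_state([82.5, 60.0, 88.3], threshold=80, evaluation_periods=3,
                  datapoints_to_alarm=2))
```

Requiring 3 out of 3 filters out single spikes, while 2 out of 3 reacts sooner at the cost of more noise.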

Composite Alarms

When you need more sophisticated alerting logic, composite alarms combine multiple alarms using AND/OR logic. For instance, you might only page the on-call engineer if BOTH CPU is high AND error rate is elevated (indicating genuine application stress, not just high traffic). Or alert if ANY of three database replicas shows high latency.

Composite alarms reduce alert noise by requiring multiple conditions before triggering.
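As a rough sketch of the AND/OR logic, the snippet below builds a rule expression in the `ALARM("name")` form that composite alarms use and evaluates the same logic locally against a set of child alarm states. The alarm names `cpu-high` and `error-rate-high` are hypothetical, and the local evaluation is only a model of the behavior described above.

```python
def composite_rule(operator, *alarm_names):
    """Build a rule expression such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_state(operator, child_states):
    """Evaluate the composite locally from a dict of child alarm states."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    triggered = all(in_alarm) if operator == "AND" else any(in_alarm)
    return "ALARM" if triggered else "OK"


rule = composite_rule("AND", "cpu-high", "error-rate-high")
print(rule)

# High CPU alone does not page anyone when the rule uses AND.
print(composite_state("AND", {"cpu-high": "ALARM", "error-rate-high": "OK"}))
print(composite_state("AND", {"cpu-high": "ALARM", "error-rate-high": "ALARM"}))
```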

Metric Math

Alarms can use metric math to compute derived metrics on the fly. Common patterns include calculating error rate as a percentage (errors / requests * 100), anomaly detection bands, and sum of metrics across multiple instances.
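The error-rate pattern is plain arithmetic applied period by period. Here is a minimal sketch with made-up per-period sums. Returning 0.0 for periods with no requests is a choice made for this example, not a statement about how CloudWatch handles division by zero.

```python
def error_rate_percent(errors, requests):
    """Error rate as a percentage; 0.0 when there were no requests."""
    if requests == 0:
        return 0.0  # avoid division by zero for idle periods
    return errors / requests * 100


# Hypothetical per-period sums: 5xx count and total request count.
errors_per_period = [0, 3, 12, 0]
requests_per_period = [1200, 1500, 800, 0]

rates = [
    round(error_rate_percent(e, r), 2)
    for e, r in zip(errors_per_period, requests_per_period)
]
print(rates)
```

Alarming on the percentage rather than the raw error count keeps the threshold meaningful as traffic grows or shrinks.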

Alarm Actions

When an alarm triggers, it can send notifications via SNS (email, SMS, PagerDuty, Slack via webhooks), trigger Auto Scaling policies (scale out when load increases), execute EC2 actions (stop, terminate, reboot, or recover instances), create OpsItems in Systems Manager OpsCenter, or trigger EventBridge rules for custom automation.

Quiz Prompt: What happens when a metric has no data points and remains in INSUFFICIENT_DATA state? Does it trigger the alarm action?

Answer: By default, no. INSUFFICIENT_DATA is a separate state from ALARM. However, you can configure alarms to treat missing data as “breaching” if you want alerts when metric reporting stops.

How to Create a CloudWatch Alarm (Step-by-Step)

Let’s walk through creating a practical alarm that monitors EC2 CPU utilization and sends an SNS notification when it exceeds 80%.

Step 1: Select the Metric

Via AWS Console:
Navigate to CloudWatch → Alarms → Create alarm → Select metric → EC2 → Per-Instance Metrics → Choose your instance → Select CPUUtilization.

Via AWS CLI:

# First, verify the metric exists
aws cloudwatch list-metrics \
  --namespace "AWS/EC2" \
  --metric-name "CPUUtilization" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0

Step 2: Define the Threshold

Configure when the alarm should trigger. For this example, we want the alarm to fire when CPU exceeds 80% for 3 consecutive 5-minute periods.

Console: Set “Threshold type” to Static, “Whenever CPUUtilization is” to Greater than 80, “Datapoints to alarm” to 3 out of 3.

CLI command to create the alarm:

aws cloudwatch put-metric-alarm \
  --alarm-name "High-CPU-Production-WebServer" \
  --alarm-description "Alarm when CPU exceeds 80% for 15 minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --unit Percent

Step 3: Configure Actions

Specify what happens when the alarm triggers. In the console, select an existing SNS topic or create a new one. The CLI example above uses --alarm-actions to specify the SNS topic ARN.

Common action patterns include sending notifications to your ops team SNS topic, triggering an Auto Scaling policy to add capacity, and invoking a Lambda function for custom remediation.
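If you prefer to keep alarm definitions in code, the CLI flags from Step 2 map onto keyword arguments one-to-one. The sketch below only builds that argument dictionary and derives the total evaluation window; it makes no AWS call. The function name `cpu_alarm_params` is an assumption for illustration, and the instance ID and topic ARN are the placeholder values used in the CLI example.

```python
def cpu_alarm_params(instance_id, topic_arn, threshold=80.0,
                     period=300, evaluation_periods=3):
    """Keyword arguments mirroring the put-metric-alarm flags from Step 2."""
    window_minutes = period * evaluation_periods // 60
    return {
        "AlarmName": "High-CPU-Production-WebServer",
        "AlarmDescription": (
            f"Alarm when CPU exceeds {threshold:g}% for {window_minutes} minutes"
        ),
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Statistic": "Average",
        "Period": period,                          # seconds per data point
        "EvaluationPeriods": evaluation_periods,   # consecutive periods checked
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "AlarmActions": [topic_arn],
        "Unit": "Percent",
    }


params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
# With boto3 available: boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmDescription"])
```

Deriving the description from the period and evaluation count keeps the human-readable text from drifting out of sync with the actual settings.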

Step 4: Test the Alarm

Never deploy alarms without testing. You can force an alarm state change using:

aws cloudwatch set-alarm-state \
  --alarm-name "High-CPU-Production-WebServer" \
  --state-value ALARM \
  --state-reason "Testing alarm notification"

Verify your SNS notification arrives, then reset the alarm:

aws cloudwatch set-alarm-state \
  --alarm-name "High-CPU-Production-WebServer" \
  --state-value OK \
  --state-reason "Test complete"

Building CloudWatch Dashboards

CloudWatch Dashboards transform raw metrics into actionable visual intelligence. For DevOps teams, well-designed dashboards accelerate incident response and enable proactive capacity planning.

Dashboard Components

CloudWatch Dashboards support multiple widget types including line graphs for time-series metrics (CPU over time), stacked area charts for aggregate views (requests by status code), numbers for single statistics (current healthy host count), gauges for threshold visualization, text widgets for documentation and links, alarm status widgets showing OK/ALARM states, and log widgets displaying recent log entries.
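Dashboards are defined by a JSON body containing a list of widgets, each with a type, a grid position, and type-specific properties. The sketch below assembles a small body with one metric widget and one text widget. The helper names, widget sizes, instance ID, and runbook URL are placeholders chosen for illustration. With boto3, the resulting string could be supplied as `DashboardBody` to `put_dashboard`.

```python
import json


def metric_widget(title, namespace, metric, dimension, value,
                  x=0, y=0, region="us-east-1"):
    """A line-graph widget for a single metric."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, dimension, value]],
            "stat": "Average",
            "period": 300,
            "region": region,
        },
    }


def text_widget(markdown, x=0, y=6):
    """A text widget for documentation and links."""
    return {
        "type": "text",
        "x": x, "y": y, "width": 12, "height": 3,
        "properties": {"markdown": markdown},
    }


dashboard_body = json.dumps(
    {
        "widgets": [
            metric_widget("Web server CPU", "AWS/EC2", "CPUUtilization",
                          "InstanceId", "i-0123456789abcdef0"),
            # Placeholder runbook link for illustration.
            text_widget("Runbook: https://example.com/runbooks/web"),
        ]
    },
    indent=2,
)
print(dashboard_body)
```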

Building Effective Dashboards

The key to useful dashboards is organizing them around operational questions.

A Service Health Dashboard answers “Is the service healthy right now?” by showing request count and error rate, latency percentiles (p50, p95, p99), active alarms for this service, and healthy/unhealthy host count.

An Infrastructure Dashboard answers “How are the underlying resources performing?” with EC2 CPU, memory, and disk across the fleet; RDS connections, storage, and replication lag; and network throughput and packet drops.

A Business Metrics Dashboard answers “How is the business performing?” through orders per minute, revenue tracking, user signups, and conversion rates.

Cross-Account Dashboards

For organizations with multiple AWS accounts, CloudWatch supports cross-account dashboards that aggregate metrics from different accounts into a single view. This requires setting up cross-account access using IAM roles and CloudWatch cross-account observability.

Real-World DevOps Use Case

A SaaS company built a “Leadership Dashboard” displayed on monitors in the engineering area. It showed real-time error rates, API latency, and customer-impacting incidents. During weekly leadership meetings, the same dashboard drove discussions about system reliability. This visibility created shared accountability between engineering and business teams.

CloudWatch Agent and Advanced Observability

Default AWS metrics don’t include everything you need. The CloudWatch Agent extends observability to OS-level metrics and custom log files.

What the CloudWatch Agent Provides

By default, EC2 metrics don’t include memory utilization, disk space, or other OS-level measurements because AWS can’t see inside your operating system. The CloudWatch Agent runs inside your instance and reports memory utilization and available memory, disk space utilization by mount point, network statistics beyond basic throughput, process-level metrics, and custom application logs.

Unified vs. Legacy Agent

AWS previously offered separate agents for logs and metrics. The Unified CloudWatch Agent combines both capabilities and is the recommended choice for all new deployments. It supports both Linux and Windows, EC2 and on-premises servers.

Configuration uses a JSON file that specifies which metrics to collect and which log files to ship:

{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/", "/data"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "myapp-logs"
          }
        ]
      }
    }
  }
}
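When many hosts need similar but not identical settings, it can help to generate this file rather than hand-edit it. Below is a minimal sketch that produces the same structure as the JSON above from a list of mount points and a mapping of log paths to Log Groups. The function name `agent_config` is an assumption, and only keys already shown above are used.

```python
import json


def agent_config(mount_points, log_files):
    """Build a CloudWatch Agent config dict.

    mount_points: list of paths to report disk_used_percent for.
    log_files: dict mapping a file path (globs allowed) to a Log Group name.
    """
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["disk_used_percent"],
                    "resources": list(mount_points),
                },
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {"file_path": path, "log_group_name": group}
                        for path, group in log_files.items()
                    ]
                }
            }
        },
    }


config = agent_config(["/", "/data"], {"/var/log/myapp/*.log": "myapp-logs"})
print(json.dumps(config, indent=2))
```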

ServiceLens and Distributed Tracing

For microservices architectures, CloudWatch ServiceLens combines metrics, logs, and traces into a unified view. It integrates with AWS X-Ray for distributed tracing, allowing you to follow requests across service boundaries.

When a user request fails, you can trace it from the API Gateway through Lambda functions to DynamoDB calls, identifying exactly where latency or errors occurred.

Related reading: AWS X-Ray Distributed Tracing Guide

OpenTelemetry Support

AWS now supports the AWS Distro for OpenTelemetry (ADOT), allowing you to instrument applications using vendor-neutral OpenTelemetry SDKs while sending data to CloudWatch. This provides flexibility if you later want to switch observability platforms.

Reflection Prompt: Are you currently collecting OS-level metrics like memory and disk utilization, or relying only on default AWS metrics? Many production outages stem from memory exhaustion or disk space issues that default metrics would never catch.

CloudWatch Events and Amazon EventBridge

CloudWatch Events (now largely superseded by Amazon EventBridge) enables event-driven automation triggered by AWS resource changes or scheduled expressions.

Event-Driven Automation

Events are emitted when AWS resources change state. Examples include EC2 instance state changes (running, stopped, terminated), ECS task state changes, CodePipeline execution status updates, and Security Hub findings.

You create rules that match specific event patterns and route them to targets like Lambda functions, SNS topics, or Step Functions.

Scheduled Events (Cron Jobs)

EventBridge supports cron and rate expressions for time-based triggers:

# Run every day at 6 AM UTC
cron(0 6 * * ? *)

# Run every 5 minutes
rate(5 minutes)
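Schedule expressions are easy to get subtly wrong, so a small helper can be handy. This sketch builds the two forms shown above. The function names are illustrative, and the helpers cover only daily schedules and simple rates.

```python
def daily_cron(hour, minute=0):
    """EventBridge cron expression for a daily run at the given UTC time."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    # One of the two day fields must be "?" because both cannot be set at once.
    return f"cron({minute} {hour} * * ? *)"


def rate(value, unit):
    """EventBridge rate expression; the unit is singular only when value is 1."""
    if value < 1 or unit not in ("minute", "hour", "day"):
        raise ValueError("value must be >= 1 and unit one of minute, hour, day")
    return f"rate({value} {unit if value == 1 else unit + 's'})"


print(daily_cron(6))        # daily at 6 AM UTC
print(daily_cron(19))       # daily at 7 PM UTC
print(rate(5, "minute"))
print(rate(1, "hour"))
```

Keep in mind that these cron expressions are evaluated in UTC, so a "7 PM" rule needs the hour adjusted for your local time zone.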

Common Use Cases

Auto-Stop Development Instances — A rule triggers at 7 PM daily, invoking a Lambda function that stops all EC2 instances tagged Environment=Development. This saves significant costs on non-production workloads.

Security Event Notifications — When GuardDuty detects a threat, an event triggers SNS notification to the security team.

Automated Remediation — When an EC2 instance fails status checks, automatically trigger Systems Manager to run recovery automation.
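For the auto-stop use case above, the core of the Lambda function is picking out which instances to stop. Here is a sketch of that selection step, operating on a dictionary shaped like an EC2 DescribeInstances response. The sample data is fabricated for illustration, and the actual describe and stop calls (for example through boto3) are intentionally left out.

```python
def running_instances_with_tag(describe_response, key, value):
    """Return IDs of running instances that carry the given tag."""
    instance_ids = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if instance["State"]["Name"] == "running" and tags.get(key) == value:
                instance_ids.append(instance["InstanceId"])
    return instance_ids


# Fabricated response for illustration; the IDs are placeholders.
sample_response = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "State": {"Name": "running"},
             "Tags": [{"Key": "Environment", "Value": "Development"}]},
            {"InstanceId": "i-0bbb", "State": {"Name": "stopped"},
             "Tags": [{"Key": "Environment", "Value": "Development"}]},
        ]},
        {"Instances": [
            {"InstanceId": "i-0ccc", "State": {"Name": "running"},
             "Tags": [{"Key": "Environment", "Value": "Production"}]},
            {"InstanceId": "i-0ddd", "State": {"Name": "running"}},  # untagged
        ]},
    ]
}

to_stop = running_instances_with_tag(sample_response, "Environment", "Development")
print(to_stop)
```

Filtering on both state and tag means the function never touches production or already-stopped instances, which matters for an automation that runs unattended every evening.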

Integrating CloudWatch Across AWS Services

CloudWatch’s power multiplies when integrated with other AWS services.

Auto Scaling Integration

CloudWatch Alarms can trigger Auto Scaling policies. When CPUUtilization exceeds 70% for 5 minutes, scale out by adding instances. When it drops below 30% for 15 minutes, scale in. This creates self-healing, cost-optimized infrastructure.

RDS Enhanced Monitoring

Standard RDS metrics update every minute. Enhanced Monitoring provides OS-level metrics at intervals as short as 1 second, viewable directly in the RDS console, with data stored in CloudWatch Logs. This helps diagnose performance issues caused by OS-level resource constraints.

EKS Container Insights

For Kubernetes workloads on EKS, Container Insights collects cluster-level, node-level, and pod-level metrics. It automatically discovers containers and collects metrics like CPU, memory, network, and disk utilization for each pod.

Cross-Account Observability

Large organizations often use multiple AWS accounts for different environments or teams. CloudWatch cross-account observability lets you configure a central monitoring account that aggregates metrics, logs, and traces from linked source accounts. This provides unified visibility without logging into each account separately.

CloudWatch Best Practices for Production

After years of working with CloudWatch in production environments, these practices consistently prove valuable.

Enable Detailed Monitoring for Critical Workloads. The difference between 5-minute and 1-minute metrics can mean the difference between catching a problem early or discovering it after customers complain. Enable detailed monitoring for production EC2 instances.

Set Appropriate Log Retention. Storing logs indefinitely is wasteful. Define retention policies based on actual needs — 30 days for application logs, 90 days for security logs, 1 year for compliance-required audit logs. Export older logs to S3 if long-term retention is required.

Use Metric Filters for Security. Create metric filters that detect security-relevant patterns in logs such as failed SSH login attempts, IAM policy changes, root account usage, and unauthorized API calls. When these metrics exceed thresholds, alarm immediately.

Adopt CloudWatch Synthetics. CloudWatch Synthetics canaries are configurable scripts that run on a schedule to monitor endpoints. Use them to test API endpoints from customer perspectives, monitor website availability and page load times, and validate multi-step user workflows.

Infrastructure as Code. Define CloudWatch dashboards, alarms, and log groups in Terraform, CDK, or CloudFormation. Version control your monitoring configuration alongside your infrastructure code.

🚀 Pro Tip: Centralize logs from all accounts into a single logging account. This simplifies security analysis, reduces costs through consolidated retention policies, and provides a single pane of glass for troubleshooting cross-account issues.

Related reading: S3 Lifecycle Policies for Log Archival

CloudWatch Security Considerations

Monitoring data often contains sensitive information. Secure it appropriately.

Encryption

CloudWatch Logs supports encryption at rest using AWS KMS. For sensitive workloads, create a customer-managed KMS key and associate it with your Log Groups:

aws logs associate-kms-key \
  --log-group-name "/app/sensitive-service" \
  --kms-key-id "arn:aws:kms:us-east-1:123456789012:key/your-key-id"

IAM Permissions

Follow least-privilege principles. CloudWatch administrators need different permissions than read-only dashboard viewers. Common managed policies include CloudWatchReadOnlyAccess for viewing metrics, logs, and dashboards, CloudWatchAgentServerPolicy for EC2 instances running the agent, and CloudWatchFullAccess for complete administrative access.

Security Monitoring Workflow

A powerful pattern combines CloudTrail, CloudWatch Logs, and metric filters. CloudTrail logs all API activity to CloudWatch Logs, metric filters detect suspicious patterns (root login, IAM changes, security group modifications), alarms trigger on these metrics, and SNS notifies the security team.

This provides near-real-time security alerting for your AWS environment.
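As one concrete instance of this workflow, root account usage is commonly detected with a metric filter pattern along the lines of `{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }`. The sketch below expresses the same condition in Python against CloudTrail-shaped records, which can be useful for testing the logic offline. The sample events are fabricated and heavily trimmed.

```python
def is_root_activity(event):
    """True when a CloudTrail record looks like direct root account usage."""
    identity = event.get("userIdentity", {})
    return (
        identity.get("type") == "Root"
        and "invokedBy" not in identity           # not an AWS service acting on your behalf
        and event.get("eventType") != "AwsServiceEvent"
    )


# Fabricated, trimmed CloudTrail records for illustration.
events = [
    {"eventName": "ConsoleLogin", "eventType": "AwsConsoleSignIn",
     "userIdentity": {"type": "Root"}},
    {"eventName": "PutObject", "eventType": "AwsApiCall",
     "userIdentity": {"type": "IAMUser", "userName": "deploy"}},
    {"eventName": "Decrypt", "eventType": "AwsApiCall",
     "userIdentity": {"type": "Root", "invokedBy": "s3.amazonaws.com"}},
]

root_event_count = sum(is_root_activity(e) for e in events)
print(root_event_count)
```

In the real pipeline, the metric filter would increment a metric for each match, and an alarm with a threshold of 1 or more would notify the security team through SNS.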

Reflection Prompt: Do you currently monitor failed login attempts, unauthorized API calls, or IAM policy changes using CloudWatch metric filters? If not, you’re missing critical security visibility.

Common CloudWatch Mistakes to Avoid

Learn from others’ mistakes rather than making your own.

Storing Logs Forever. Without retention policies, log storage costs grow indefinitely. A single chatty application can generate hundreds of gigabytes monthly.

Ignoring Application Metrics. Monitoring only infrastructure metrics (CPU, memory) without application metrics (error rates, latency, business KPIs) leaves you blind to problems until they become severe.

Alarm Threshold Mistakes. Setting thresholds too sensitive creates alert fatigue. Setting them too lenient means problems escalate before anyone notices. Base thresholds on historical data and adjust over time.

Missing Metric Filters. CloudWatch Logs contain valuable signals buried in text. Without metric filters, you’re manually searching logs instead of being alerted automatically.

No Dashboards. If you don’t have dashboards, you’re reacting to problems instead of proactively monitoring. Build dashboards for every production service.

CloudWatch Pricing and Cost Optimization

CloudWatch pricing has multiple dimensions. Understanding them helps control costs. For current pricing, always refer to the official CloudWatch Pricing page.

What You Pay For (Illustrative Examples)

Metrics: The free tier includes 10 custom metrics. Beyond that, pricing varies by volume (approximately $0.30/metric/month at lower volumes, with tiered discounts at scale).

Logs Ingested: Approximately $0.50/GB ingested (varies by region).

Logs Stored: Approximately $0.03/GB/month.

Alarms: Approximately $0.10/alarm/month for standard resolution alarms.

Dashboards: First 3 dashboards free, then approximately $3/dashboard/month.

Note: Prices shown are illustrative examples for US East region as of early 2025. Always verify current pricing on the AWS pricing page.
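To see how these dimensions add up, here is a back-of-the-envelope estimator using the illustrative prices and free-tier allowances quoted in this article. It is a sketch for rough planning only. It ignores tiered discounts, regional differences, API request charges, and any free allowance on storage, and the workload numbers in the example are invented.

```python
# Illustrative USD figures from this article, not an official price sheet.
PRICES = {
    "metric_month": 0.30,
    "log_ingest_gb": 0.50,
    "log_storage_gb_month": 0.03,
    "alarm_month": 0.10,
    "dashboard_month": 3.00,
}
FREE_TIER = {"metrics": 10, "ingest_gb": 5, "alarms": 10, "dashboards": 3}


def billable(quantity, free):
    """Quantity left to pay for after the free allowance."""
    return max(quantity - free, 0)


def estimate_monthly_cost(custom_metrics, ingest_gb, stored_gb, alarms, dashboards):
    """Rough monthly estimate in USD, broken down by pricing dimension."""
    items = {
        "metrics": billable(custom_metrics, FREE_TIER["metrics"]) * PRICES["metric_month"],
        "log_ingestion": billable(ingest_gb, FREE_TIER["ingest_gb"]) * PRICES["log_ingest_gb"],
        "log_storage": stored_gb * PRICES["log_storage_gb_month"],
        "alarms": billable(alarms, FREE_TIER["alarms"]) * PRICES["alarm_month"],
        "dashboards": billable(dashboards, FREE_TIER["dashboards"]) * PRICES["dashboard_month"],
    }
    items["total"] = sum(items.values())
    return {name: round(cost, 2) for name, cost in items.items()}


# Hypothetical workload: 50 custom metrics, 105 GB ingested, 300 GB stored,
# 30 alarms, 5 dashboards.
estimate = estimate_monthly_cost(50, 105, 300, 30, 5)
for name, cost in estimate.items():
    print(f"{name}: ${cost:.2f}")
```

In this invented workload, log ingestion dominates the bill, which is typical: reducing log volume at the source usually saves more than trimming alarms or dashboards.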

Cost Optimization Strategies

Reduce Log Retention. At a steady ingest rate, keeping 6 months of logs costs roughly 6× as much in storage as keeping 1 month. Be realistic about retention needs.

Export to S3. For long-term archival, export logs to S3 where storage costs are significantly lower than CloudWatch Logs storage.

Use Metric Math. Instead of publishing 10 custom metrics and creating 10 alarms, use metric math to derive calculated metrics, reducing both metric and alarm costs.

Avoid Unnecessary High-Resolution Metrics. High-resolution (1-second) metrics cost more than standard resolution. Use them only where genuinely needed.

Real-World Cost Example

A development team enabled 1-second resolution metrics for a chatty microservice publishing 50 custom metrics. Monthly CloudWatch bill: over $400 for metrics alone. After switching to standard resolution, costs dropped to under $50, with no meaningful loss in observability. This illustrates how metric resolution choices significantly impact costs.

What to Monitor First (Quick Reference Table)

AWS Service	Recommended Metrics	Suggested Retention	Example Alarm Threshold
EC2	CPUUtilization, Memory* (agent), DiskUsed* (agent), StatusCheckFailed	30 days	CPU > 80% for 15 min
RDS	DatabaseConnections, FreeStorageSpace, ReadLatency, CPUUtilization	60 days	Storage < 20% free
Lambda	Errors, Duration, Throttles, ConcurrentExecutions	14 days	Error rate > 5%
ALB	TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount	30 days	5XX > 10/min
ECS/EKS	CPUUtilization, MemoryUtilization, RunningTaskCount	30 days	Memory > 85%
S3	BucketSizeBytes, NumberOfObjects, 4xxErrors, 5xxErrors	7 days	5XX errors > 0
SQS	ApproximateAgeOfOldestMessage, NumberOfMessagesVisible	14 days	Message age > 300 sec
API Gateway	Latency, 4XXError, 5XXError, Count	30 days	p99 latency > 3 sec

*Requires CloudWatch Agent

How CloudWatch Maps to AWS Certification Exams

Understanding CloudWatch is essential for multiple AWS certifications. Here’s how this tutorial maps to exam domains.

AWS Solutions Architect Associate (SAA-C03)

Domain 2: Design Resilient Architectures — CloudWatch Alarms trigger Auto Scaling for self-healing infrastructure. Understand alarm states, evaluation periods, and scaling policy integration.

Domain 4: Design Cost-Optimized Architectures — Know CloudWatch pricing dimensions, log retention cost impacts, and when to export logs to S3.

AWS SysOps Administrator (SOA-C02)

Domain 1: Monitoring, Logging, and Remediation — This is the core CloudWatch domain. Expect questions on metric filters, CloudWatch Logs Insights queries, alarm configuration, and dashboard creation.

Domain 2: Reliability and Business Continuity — CloudWatch enables automated recovery through alarms triggering EC2 recovery actions and Systems Manager automation.

AWS DevOps Engineer Professional

Domain 1: SDLC Automation — Integrate CloudWatch metrics into CI/CD pipelines for deployment health checks.

Domain 4: Monitoring and Logging — Deep knowledge expected on CloudWatch Agent configuration, cross-account observability, and EventBridge integration for automation.

Key Exam Topics to Master

IAM roles required for CloudWatch Agent (exam frequently tests IAM integration)
Difference between CloudWatch Logs and CloudTrail (common confusion point)
Composite alarms vs. standard alarms
High-resolution metrics use cases and cost implications
CloudWatch Logs Insights query syntax basics

30-Minute Action Checklist

Put your CloudWatch knowledge into practice immediately. Complete these six actions in 30–60 minutes.

☐ Action 1: Enable Detailed Monitoring (5 min)

Select one production EC2 instance. Enable detailed monitoring to get 1-minute metrics instead of 5-minute.

aws ec2 monitor-instances --instance-ids i-0123456789abcdef0

☐ Action 2: Create Three Essential Alarms (10 min)

Create alarms for CPU > 80%, StatusCheckFailed > 0, and one application-specific metric. Use the CLI commands from the “How to Create a CloudWatch Alarm (Step-by-Step)” section.

☐ Action 3: Set Log Retention Policy (3 min)

Find one Log Group with indefinite retention and set a 30-day policy.

aws logs put-retention-policy \
  --log-group-name "/aws/lambda/my-function" \
  --retention-in-days 30

☐ Action 4: Run a CloudWatch Logs Insights Query (5 min)

Open CloudWatch Logs Insights, select a Log Group, and run this query:

fields @timestamp, @message
| sort @timestamp desc
| limit 20

☐ Action 5: Create Your First Dashboard (5 min)

Create a dashboard with three widgets: EC2 CPU line graph, alarm status widget, and a text widget with your team’s runbook link.

☐ Action 6: Install CloudWatch Agent on One EC2 (10 min)

Follow the CloudWatch Agent installation guide to collect memory metrics from one instance.

Frequently Asked Questions

What is Amazon CloudWatch?

Amazon CloudWatch is AWS’s native monitoring and observability service. It collects metrics and logs from AWS resources and applications, enables alerting through alarms, and provides dashboards for visualization. CloudWatch serves as the central monitoring hub for most AWS workloads.

What are CloudWatch metrics?

CloudWatch metrics are time-series data points representing measurements of your resources and applications. AWS services automatically publish metrics like EC2 CPUUtilization or Lambda Duration. You can also publish custom metrics from your applications using the PutMetricData API.

How do CloudWatch Logs work?

CloudWatch Logs collects and stores log data from AWS services, EC2 instances, containers, and applications. Logs are organized into Log Groups (collections sharing retention settings) and Log Streams (sequences from individual sources). You can search logs using CloudWatch Logs Insights, create metric filters, and subscribe logs to downstream services.

Is CloudWatch free?

CloudWatch has a free tier that includes 10 custom metrics, 5 GB of log data ingestion, 3 dashboards, and 10 alarms (see current free tier details). Beyond the free tier, you pay for metrics, logs ingested/stored, alarms, dashboards, and API calls. Most production workloads exceed the free tier.

What is the difference between CloudWatch and CloudTrail?

CloudWatch monitors operational metrics and logs such as CPU utilization, application errors, and performance data. CloudTrail audits API activity, recording who did what, when, and from where. They’re complementary: CloudTrail logs often flow into CloudWatch Logs for analysis and alerting on security events.

Wrapping Up

Monitoring isn’t optional in modern cloud operations — it’s the backbone of reliability, security, and cost management. Amazon CloudWatch provides the foundation for observability across your AWS environment, from basic metrics collection through advanced distributed tracing.

The concepts covered in this Amazon CloudWatch tutorial prepare you for real-world DevOps work and AWS certification exams alike. CloudWatch questions appear frequently on the SAA-C03, DVA-C02, SOA-C02, and DevOps Professional exams because understanding monitoring is fundamental to operating AWS effectively.

Start with the basics: enable detailed monitoring for production EC2 instances, set up log retention policies, create alarms for critical metrics, and build dashboards for your most important services. As you gain experience, expand into advanced patterns like metric filters, composite alarms, cross-account observability, and event-driven automation.

The investment in solid monitoring pays dividends every time you catch a problem before customers notice, troubleshoot an incident in minutes instead of hours, or demonstrate system reliability to stakeholders with clear dashboards.

Published on thedevopstooling.com by Srikanth Ch, Senior DevOps Engineer
