Canary Deployment Implementation

Overview

This project now supports gradual traffic shifting via API Gateway canary deployments. When enabled, new releases start with a small percentage of traffic (e.g., 10%) and automatically increment to 100% if the canary is healthy.

Architecture

Lambda Functions

1. CanaryConfigurator Lambda

Purpose: CloudFormation custom resource to configure API Gateway canary settings
Location: src/lambda/canaryConfigurator/
Why: CDK Stage construct doesn't support canarySettings natively

Function:

  • Called during CloudFormation Create/Update/Delete operations
  • Uses API Gateway UpdateStage API with patchOperations
  • Configures:
    • /canarySettings/percentTraffic (e.g., 10)
    • /canarySettings/useStageCache (false)
    • /canarySettings/stageVariableOverrides/activeEnvironment (blue)
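
The three settings above map onto patch operations of roughly the following shape. This is a minimal sketch, not the project's actual code: `buildCanaryPatchOperations` and `CanarySettingsInput` are hypothetical names, and the `op` type lists only a subset of the operations API Gateway accepts. The one detail it relies on is that UpdateStage patch operations carry an `op`, a `path`, and a string `value`.

```typescript
// Hypothetical input type for illustration; not the project's actual code.
interface CanarySettingsInput {
  percentTraffic: number;    // e.g., 10
  useStageCache: boolean;    // false in this project
  activeEnvironment: string; // e.g., "blue"
}

// One API Gateway patch operation. Values are passed as strings.
interface PatchOperation {
  op: "replace" | "remove";
  path: string;
  value?: string;
}

function buildCanaryPatchOperations(input: CanarySettingsInput): PatchOperation[] {
  if (input.percentTraffic < 0 || input.percentTraffic > 100) {
    throw new RangeError(`percentTraffic must be between 0 and 100, got ${input.percentTraffic}`);
  }
  return [
    { op: "replace", path: "/canarySettings/percentTraffic", value: String(input.percentTraffic) },
    { op: "replace", path: "/canarySettings/useStageCache", value: String(input.useStageCache) },
    {
      op: "replace",
      path: "/canarySettings/stageVariableOverrides/activeEnvironment",
      value: input.activeEnvironment,
    },
  ];
}
```

The resulting array would be passed as `patchOperations` to the UpdateStage call for the target REST API and stage.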

Deployed: Automatically invoked by CDK custom resource in ReleaseDeploymentStack

2. CanaryMonitor Lambda

Purpose: Automated health checking and traffic increments
Location: src/lambda/canaryMonitor/
Construct: lib/lambda/CanaryMonitorLambda.ts

Functions:

  • increment: Checks canary health, then increments traffic by the configured step (default: 10%)
  • rollback: Emergency rollback to 0% if unhealthy
  • promote: Promote canary to 100%, update base environment
  • status: Get current canary deployment state

Monitoring:

  • Queries CloudWatch metrics for base vs canary error rates
  • Compares error rate against threshold (default: 5%)
  • Requires a minimum of 5 invocations in the monitoring window before taking any action
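
The monitoring rules above can be sketched as a single decision function. The name `evaluateCanaryHealth`, the three result labels, and the choice to treat a rate exactly equal to the threshold as healthy are assumptions made for illustration; the actual CanaryMonitor code may differ.

```typescript
type HealthDecision = "healthy" | "unhealthy" | "insufficient-data";

// Hypothetical helper: decides what the monitor should conclude from one window.
function evaluateCanaryHealth(
  errors: number,
  invocations: number,
  errorThresholdPercent: number = 5, // default threshold from this document
  minInvocations: number = 5,        // default minimum sample size from this document
): HealthDecision {
  // Too few samples: take no action rather than react to noise.
  if (invocations < minInvocations) {
    return "insufficient-data";
  }
  // Compare errors/invocations > threshold/100 without dividing,
  // which avoids floating-point rounding at the boundary.
  const exceedsThreshold = errors * 100 > errorThresholdPercent * invocations;
  return exceedsThreshold ? "unhealthy" : "healthy";
}
```

With the defaults, 10 errors in 100 invocations yields "unhealthy", 1 error in 100 yields "healthy", and any window with fewer than 5 invocations yields "insufficient-data".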

EventBridge Rule:

  • Runs every 15 minutes (configurable)
  • Initially disabled, enabled when canary starts
  • Automatically disables after promotion/rollback
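
For reference, a configurable interval in minutes translates into an EventBridge rate expression as sketched below. The helper name `rateExpression` is hypothetical; the only rule it encodes is that EventBridge expects the singular unit when the value is 1.

```typescript
// Builds an EventBridge rate expression such as "rate(15 minutes)".
function rateExpression(minutes: number): string {
  if (!Number.isInteger(minutes) || minutes < 1) {
    throw new RangeError(`minutes must be a positive integer, got ${minutes}`);
  }
  // EventBridge requires "minute" for 1 and "minutes" for anything larger.
  return minutes === 1 ? "rate(1 minute)" : `rate(${minutes} minutes)`;
}
```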

Deployed: Conditionally created in MonitoringStack when canaryEnabled=true

3. Existing Monitoring Lambdas

ErrorAggregator: Real-time error tracking via Lambda Destinations
ReleaseMonitor: Release-level rollback for instant deployments

These continue to work alongside canary deployments for comprehensive monitoring.

How Canary Works

Traffic Routing

API Gateway Stage has two routing targets:

  • Base (e.g., 90% traffic): Routes to green environment via stage variable activeEnvironment=green
  • Canary (e.g., 10% traffic): Routes to blue environment via canary settings override activeEnvironment=blue
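
As a mental model only, the split can be pictured as below: a share of requests equal to the canary percentage sees the stage variables with the canary overrides applied on top. This is a simplified sketch of the behaviour described above, not API Gateway's implementation; `resolveStageVariables` and its parameters are hypothetical.

```typescript
type StageVariables = Record<string, string>;

// Simplified model of one request.
function resolveStageVariables(
  roll: number,          // uniform random number in [0, 100)
  canaryPercent: number, // e.g., 10
  stageVariables: StageVariables,
  canaryOverrides: StageVariables,
): StageVariables {
  const isCanaryRequest = roll < canaryPercent;
  // Canary requests get the overrides merged over the base stage variables.
  return isCanaryRequest
    ? { ...stageVariables, ...canaryOverrides }
    : { ...stageVariables };
}
```

With `canaryPercent` at 10, a roll of 5 resolves `activeEnvironment` to blue, while a roll of 42 leaves it at green.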

Gradual Rollout

  1. Deploy new release with CANARY_ENABLED=true INITIAL_CANARY_PERCENT=10
  2. Custom resource configures canary at 10%
  3. CanaryMonitor runs every 15 minutes:
    • Checks error rate for canary traffic
    • If healthy: Increment to 20%, 30%, 40%...
    • If unhealthy: Rollback to 0% (instant switch to base)
  4. Promotion at 100%: Update base to blue, disable canary
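
The sequence of traffic percentages in the steps above can be computed as follows. This is a sketch under the assumption that every health check passes and that the last step is capped at the target; `rolloutSchedule` is a hypothetical helper, not part of the project.

```typescript
// Returns every canary percentage the rollout passes through, in order.
function rolloutSchedule(
  initialPercent: number,
  targetPercent: number,
  incrementPercent: number,
): number[] {
  if (incrementPercent <= 0) {
    throw new RangeError("incrementPercent must be greater than 0");
  }
  let current = Math.min(initialPercent, targetPercent);
  const steps: number[] = [current];
  while (current < targetPercent) {
    // Never overshoot the target on the final increment.
    current = Math.min(current + incrementPercent, targetPercent);
    steps.push(current);
  }
  return steps;
}

// Default settings: 10% start, 100% target, 10% steps, one check every 15 minutes.
const defaultSchedule = rolloutSchedule(10, 100, 10);
const minutesToTarget = (defaultSchedule.length - 1) * 15;
```

With the defaults the schedule has ten entries (10 through 100), so reaching the target takes nine increments, or at least 135 minutes if every check is healthy.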

CloudWatch Metrics

API Gateway separates metrics by dimension:

  • prod: Base traffic metrics
  • prod/canary: Canary traffic metrics

This allows precise error rate comparison.

Deployment

Standard Blue-Green Deployment (Instant Switch)

# Deploy without canary - instant 100% switch
RELEASE_VERSION=v2.0.0 npm run deploy

Canary Deployment (Gradual Rollout)

# Deploy with canary - start at 10%, auto-increment to 100%
CANARY_ENABLED=true \
INITIAL_CANARY_PERCENT=10 \
CANARY_TARGET_PERCENT=100 \
CANARY_INCREMENT_PERCENT=10 \
CANARY_INCREMENT_INTERVAL=15 \
CANARY_ERROR_THRESHOLD=5.0 \
CANARY_MONITORING_WINDOW=5 \
RELEASE_VERSION=v2.0.0 npm run deploy

Environment Variables:

  • CANARY_ENABLED: Enable canary deployment (true/false)
  • INITIAL_CANARY_PERCENT: Starting traffic percentage (default: 10)
  • CANARY_TARGET_PERCENT: Target traffic before promotion (default: 100)
  • CANARY_INCREMENT_PERCENT: Traffic increment per check (default: 10)
  • CANARY_INCREMENT_INTERVAL: Minutes between increments (default: 15)
  • CANARY_ERROR_THRESHOLD: Error rate (%) that triggers rollback (default: 5.0)
  • CANARY_MONITORING_WINDOW: Minutes for error rate calculation (default: 5)
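
One way to read these variables into a typed configuration with the documented defaults is sketched below. `CanaryConfig`, `loadCanaryConfig`, and the decision to reject non-numeric values are assumptions for illustration; treating an unset CANARY_ENABLED as false is also an assumption, consistent with the standard deployment omitting it.

```typescript
interface CanaryConfig {
  enabled: boolean;
  initialPercent: number;
  targetPercent: number;
  incrementPercent: number;
  incrementIntervalMinutes: number;
  errorThresholdPercent: number;
  monitoringWindowMinutes: number;
}

type Env = Record<string, string | undefined>;

// Reads a numeric variable, falling back to the documented default when unset.
function numberFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) {
    throw new Error(`${key} must be a number, got "${raw}"`);
  }
  return parsed;
}

function loadCanaryConfig(env: Env): CanaryConfig {
  return {
    enabled: env["CANARY_ENABLED"] === "true",
    initialPercent: numberFromEnv(env, "INITIAL_CANARY_PERCENT", 10),
    targetPercent: numberFromEnv(env, "CANARY_TARGET_PERCENT", 100),
    incrementPercent: numberFromEnv(env, "CANARY_INCREMENT_PERCENT", 10),
    incrementIntervalMinutes: numberFromEnv(env, "CANARY_INCREMENT_INTERVAL", 15),
    errorThresholdPercent: numberFromEnv(env, "CANARY_ERROR_THRESHOLD", 5.0),
    monitoringWindowMinutes: numberFromEnv(env, "CANARY_MONITORING_WINDOW", 5),
  };
}
```

In a Node.js entry point this would typically be called as `loadCanaryConfig(process.env)`.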

Manual Controls

Check Canary Status

npm run canary-status

Shows:

  • Current canary percent
  • Base environment (green)
  • Canary environment (blue)
  • Release versions

Manual Gradual Rollout

# Increment from 10% to 100% in 10% steps every 15 minutes
npm run gradual-rollout 10 100 10 15

Emergency Rollback

# Instantly set canary to 0% (all traffic to base)
npm run canary-rollback

Cost Analysis

Canary Deployment: ~$0.01/month

  • CanaryMonitor Lambda: up to 96 invocations/day × 30 days = 2,880/month if the rule stays enabled all month (within 1M free tier)
  • CloudWatch GetMetricStatistics: 2,880 API calls = ~$0.01/month
  • API Gateway: No additional cost (same request volume)
  • CanaryConfigurator Lambda: Only runs during deployment (~1 invocation = free tier)
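
The invocation count in the list above follows directly from the check interval. The sketch below reproduces that arithmetic; `monitorInvocationsPerMonth` is a hypothetical helper, and the result is an upper bound because the rule only runs while a canary is in progress.

```typescript
// Number of scheduled CanaryMonitor runs if the rule stays enabled for the whole period.
function monitorInvocationsPerMonth(intervalMinutes: number, days: number = 30): number {
  if (intervalMinutes <= 0) {
    throw new RangeError("intervalMinutes must be greater than 0");
  }
  const invocationsPerDay = (24 * 60) / intervalMinutes; // 1,440 minutes per day
  return invocationsPerDay * days;
}

const monthlyInvocations = monitorInvocationsPerMonth(15); // 96 per day × 30 days
const withinFreeTier = monthlyInvocations <= 1_000_000;    // free tier figure from the list above
```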

Total Infrastructure Cost

  • Lambda: ~$0/month (within free tier)
  • DynamoDB: ~$0/month (within free tier)
  • CloudWatch: ~$0.01/month
  • EventBridge: ~$0/month (within free tier)
  • API Gateway: Pay per request (same as before)

Total: ~$0.01/month for full blue-green + canary monitoring

Configuration

Adjust Canary Parameters

Edit bin/app.ts:

const monitoringStack = new MonitoringStack(app, 'ReleaseMonitoringStack', {
  // ...
  canaryEnabled: process.env.CANARY_ENABLED === 'true',
  canaryTargetPercent: 100,    // Promote after reaching this %
  canaryIncrementPercent: 10,  // Increment step size
  canaryErrorThreshold: 5.0,   // % error rate for rollback
  canaryMonitoringWindow: 5,   // Minutes for error rate calculation
  canaryIncrementInterval: 15, // Minutes between health checks
});

Adjust Lock-In Period

const monitoringStack = new MonitoringStack(app, 'ReleaseMonitoringStack', {
  // ...
  lockInPeriodMinutes: 60, // After 60 min, automatic rollback is disabled
});

After the lock-in period, automatic rollback is disabled, because errors that first appear that late are more likely environmental than caused by the release. Manual rollback is still available: npm run rollback-release
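
The lock-in check amounts to comparing the time since deployment against the configured period. The sketch below is illustrative only: `isAutoRollbackAllowed` is a hypothetical name, and disabling rollback at exactly the 60-minute mark (rather than just after it) is an assumption.

```typescript
// Returns true while the release is still inside its lock-in period.
function isAutoRollbackAllowed(
  deployedAt: Date,
  now: Date,
  lockInPeriodMinutes: number = 60,
): boolean {
  const elapsedMinutes = (now.getTime() - deployedAt.getTime()) / 60000;
  return elapsedMinutes < lockInPeriodMinutes;
}
```

A release deployed at 10:00 would still qualify for automatic rollback at 10:30, but not at 11:00 or later.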

Testing Canary Rollback

  1. Introduce an error:

    // In src/lambda/helloWorld/src/helloWorld.ts
    export async function handler() {
      throw new Error('Simulated canary error');
    }
  2. Deploy with canary:

    CANARY_ENABLED=true INITIAL_CANARY_PERCENT=10 RELEASE_VERSION=v2.0.0 npm run deploy
  3. Generate traffic:

    for i in {1..100}; do
      curl https://YOUR-API-ID.execute-api.YOUR-REGION.amazonaws.com/prod/hello
    done
  4. Wait for health check (every 15 minutes):

    • CanaryMonitor detects high error rate in canary traffic
    • Automatically sets canary to 0%
    • All traffic routes to base (green, previous release)
  5. Check logs:

    aws logs tail /aws/lambda/CanaryMonitorFunction --follow

Benefits

Canary vs Instant Switch

Instant Blue-Green:

  • ✅ Fastest deployment (< 1 second switch)
  • ✅ Simple rollback mechanism
  • ❌ All users affected if errors occur
  • ❌ No gradual validation

Canary Deployment:

  • ✅ Only the initial canary share of traffic (10% by default) is affected at first
  • ✅ Gradual validation with automatic increments
  • ✅ Early error detection before full rollout
  • ✅ Same instant rollback capability
  • ❌ Longer deployment time (about 2-3 hours for a full rollout with the default settings)

When to Use Each

Use Instant Blue-Green:

  • High confidence in release (e.g., bug fixes, minor updates)
  • Low-risk changes
  • Need immediate rollout
  • After-hours deployments with monitoring

Use Canary:

  • Major releases or architecture changes
  • New features with uncertain performance
  • High-traffic production systems
  • Business-hours deployments
  • When blast radius matters

Monitoring

CloudWatch Metrics

# Base traffic metrics (ApiName is the REST API's name; Stage selects base vs canary)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 5XXError \
  --dimensions Name=ApiName,Value=YOUR-API-NAME Name=Stage,Value=prod \
  --start-time 2025-01-25T10:00:00Z \
  --end-time 2025-01-25T11:00:00Z \
  --period 300 \
  --statistics Sum

# Canary traffic metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 5XXError \
  --dimensions Name=ApiName,Value=YOUR-API-NAME Name=Stage,Value=prod/canary \
  --start-time 2025-01-25T10:00:00Z \
  --end-time 2025-01-25T11:00:00Z \
  --period 300 \
  --statistics Sum
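
To turn the datapoints these commands return into an error rate, the 5XXError sums are divided by the request count for the same dimensions and time range (the Count metric in the AWS/ApiGateway namespace). The sketch below assumes datapoints shaped like the CLI's JSON output; `errorRatePercent` is a hypothetical helper, not the CanaryMonitor's actual code.

```typescript
// Minimal shape of one entry in the "Datapoints" array of the CLI output.
interface Datapoint {
  Timestamp: string;
  Sum: number;
}

function sumDatapoints(datapoints: Datapoint[]): number {
  return datapoints.reduce((total, point) => total + point.Sum, 0);
}

// Returns the error rate in percent, or null when there was no traffic.
function errorRatePercent(
  errorDatapoints: Datapoint[], // 5XXError, statistic Sum
  countDatapoints: Datapoint[], // Count, statistic Sum
): number | null {
  const requests = sumDatapoints(countDatapoints);
  if (requests === 0) return null;
  return (sumDatapoints(errorDatapoints) * 100) / requests;
}
```

Running this once for the base datapoints and once for the canary datapoints gives the two rates to compare.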

Lambda Logs

# Canary Monitor logs
aws logs tail /aws/lambda/CanaryMonitorFunction --follow

# Canary Configurator logs
aws logs tail /aws/lambda/ReleaseDeploymentStack-CanaryConfigurator --follow

SSM Parameters

# Check canary state
aws ssm get-parameter --name /release/canary-enabled
aws ssm get-parameter --name /release/canary-percent
aws ssm get-parameter --name /release/canary-deployment-timestamp

# Check release versions
aws ssm get-parameter --name /release/active-environment
aws ssm get-parameter --name /release/blue-version
aws ssm get-parameter --name /release/green-version

Troubleshooting

Issue: Canary not incrementing

Check:

  1. EventBridge rule enabled: aws events describe-rule --name CanaryMonitorRule
  2. CloudWatch metrics exist: Verify base and canary traffic metrics
  3. Error rate threshold: Check if error rate is below 5%
  4. Minimum invocations: Need at least 5 invocations in monitoring window
  5. CanaryMonitor logs: aws logs tail /aws/lambda/CanaryMonitorFunction --follow

Issue: Canary settings not applied

Check:

  1. Custom resource invoked: Check CloudFormation events
  2. CanaryConfigurator logs: Look for API Gateway UpdateStage calls
  3. API Gateway stage: aws apigateway get-stage --rest-api-id YOUR-ID --stage-name prod
  4. IAM permissions: Verify CanaryConfigurator has apigateway:UpdateStage permission

Issue: False rollback with low traffic

Solution: Increase monitoring window for low-traffic functions

canaryMonitoringWindow: 15, // Use 15-minute window instead of 5

This accumulates more samples before calculating error rate.
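
The arithmetic behind this is sketched below: the fewer invocations in the window, the fewer errors it takes to cross the threshold. `errorsNeededToTrip` is a hypothetical helper, and it assumes rollback triggers only when the error rate is strictly above the threshold.

```typescript
// Smallest number of errors that pushes the error rate strictly above the threshold.
function errorsNeededToTrip(invocations: number, thresholdPercent: number): number {
  return Math.floor((invocations * thresholdPercent) / 100) + 1;
}

const atMinimumSample = errorsNeededToTrip(5, 5);  // 1 error out of 5 is already 20%
const atLargerSample = errorsNeededToTrip(100, 5); // 6 errors out of 100 are needed
```

At the minimum sample size of 5 invocations, a single transient error is enough to trigger a rollback, which is why a longer window helps on low-traffic APIs.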

Future Enhancements

  • CloudWatch Alarms: Integrate with CloudWatch Alarms for advanced monitoring
  • Custom Metrics: Track business metrics (e.g., conversion rate) during canary
  • Multi-Region: Extend canary to multi-region deployments
  • A/B Testing: Use canary settings for feature flags and A/B tests
  • Automated Load Testing: Trigger load tests during canary deployment
