Canary Deployment Implementation
Overview
This project now supports gradual traffic shifting via API Gateway canary deployments. When enabled, new releases start with a small percentage of traffic (e.g., 10%) and automatically increment to 100% if the canary is healthy.
Architecture
Three Lambda Functions
1. CanaryConfigurator Lambda
Purpose: CloudFormation custom resource to configure API Gateway canary settings
Location: src/lambda/canaryConfigurator/
Why: CDK Stage construct doesn't support canarySettings natively
Function:
- Called during CloudFormation Create/Update/Delete operations
- Uses the API Gateway UpdateStage API with patchOperations
- Configures:
  - /canarySettings/percentTraffic (e.g., 10)
  - /canarySettings/useStageCache (false)
  - /canarySettings/stageVariableOverrides/activeEnvironment (blue)
Deployed: Automatically invoked by CDK custom resource in ReleaseDeploymentStack
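As a rough illustration of the patch operations listed above, the sketch below builds the array that could be passed as patchOperations to UpdateStage. The helper name buildCanaryPatchOperations and the PatchOperation interface are hypothetical and not taken from src/lambda/canaryConfigurator/. The actual handler may structure this differently.

```typescript
// Shape of a single API Gateway patch operation. Values are sent as strings.
interface PatchOperation {
  op: "replace" | "remove";
  path: string;
  value?: string;
}

// Hypothetical helper: builds the three canary patch operations described above.
function buildCanaryPatchOperations(
  percentTraffic: number,
  canaryEnvironment: string,
): PatchOperation[] {
  if (percentTraffic < 0 || percentTraffic > 100) {
    throw new RangeError(`percentTraffic must be 0-100, got ${percentTraffic}`);
  }
  return [
    { op: "replace", path: "/canarySettings/percentTraffic", value: String(percentTraffic) },
    { op: "replace", path: "/canarySettings/useStageCache", value: "false" },
    {
      op: "replace",
      path: "/canarySettings/stageVariableOverrides/activeEnvironment",
      value: canaryEnvironment,
    },
  ];
}

const ops = buildCanaryPatchOperations(10, "blue");
console.log(JSON.stringify(ops, null, 2));
```

Keeping the construction in a pure function makes it possible to unit test the canary settings without calling API Gateway.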
2. CanaryMonitor Lambda
Purpose: Automated health checking and traffic increments
Location: src/lambda/canaryMonitor/
Construct: lib/lambda/CanaryMonitorLambda.ts
Functions:
- increment: Checks canary health, increments traffic by 10%
- rollback: Emergency rollback to 0% if unhealthy
- promote: Promote canary to 100%, update base environment
- status: Get current canary deployment state
Monitoring:
- Queries CloudWatch metrics for base vs canary error rates
- Compares error rate against threshold (default: 5%)
- Requires minimum 5 invocations before triggering actions
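The monitoring rules above can be sketched as a small decision function. This is a minimal sketch, not the code in src/lambda/canaryMonitor/: the name evaluateCanaryHealth and the verdict strings are hypothetical, and it assumes the canary counts as unhealthy only when the error rate is strictly above the threshold.

```typescript
type HealthVerdict = "healthy" | "unhealthy" | "insufficient-data";

// Hypothetical helper: applies the minimum-invocation guard, then the threshold.
function evaluateCanaryHealth(
  invocations: number,
  errors: number,
  errorThresholdPercent = 5, // default threshold from this document
  minInvocations = 5,        // default minimum sample size from this document
): HealthVerdict {
  if (invocations < minInvocations) {
    return "insufficient-data"; // too few samples to act on
  }
  // Multiply before dividing to keep small integer cases exact.
  const errorRatePercent = (errors * 100) / invocations;
  // Assumption: strictly greater than the threshold means unhealthy.
  return errorRatePercent > errorThresholdPercent ? "unhealthy" : "healthy";
}

console.log(evaluateCanaryHealth(4, 4));   // below the minimum sample size
console.log(evaluateCanaryHealth(100, 3)); // 3% error rate
console.log(evaluateCanaryHealth(100, 6)); // 6% error rate
```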
EventBridge Rule:
- Runs every 15 minutes (configurable)
- Initially disabled, enabled when canary starts
- Automatically disables after promotion/rollback
Deployed: Conditionally created in MonitoringStack when canaryEnabled=true
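Since the check interval is configurable, the EventBridge schedule has to be derived from it. The sketch below shows one way to turn a minute count into a rate expression. The helper name toRateExpression is hypothetical. EventBridge rate expressions use the singular unit when the value is 1.

```typescript
// Hypothetical helper: converts an interval in minutes to an EventBridge rate expression.
function toRateExpression(intervalMinutes: number): string {
  if (!Number.isInteger(intervalMinutes) || intervalMinutes < 1) {
    throw new RangeError(`intervalMinutes must be a positive integer, got ${intervalMinutes}`);
  }
  // EventBridge requires "minute" for 1 and "minutes" for larger values.
  const unit = intervalMinutes === 1 ? "minute" : "minutes";
  return `rate(${intervalMinutes} ${unit})`;
}

console.log(toRateExpression(15));
```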
3. Existing Monitoring Lambdas
ErrorAggregator: Real-time error tracking via Lambda Destinations
ReleaseMonitor: Release-level rollback for instant deployments
These continue to work alongside canary deployments for comprehensive monitoring.
How Canary Works
Traffic Routing
API Gateway Stage has two routing targets:
- Base (e.g., 90% traffic): Routes to green environment via stage variable activeEnvironment=green
- Canary (e.g., 10% traffic): Routes to blue environment via canary settings override activeEnvironment=blue
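API Gateway performs the traffic split itself, so the project does not contain routing code. The following is only a toy model of the split, with a hypothetical pickEnvironment function, to show how a percentTraffic value maps requests to the two environments.

```typescript
// Toy model of the split. rollPercent stands in for a per-request random draw in [0, 100).
function pickEnvironment(
  canaryPercent: number,
  rollPercent: number,
  baseEnvironment = "green",
  canaryEnvironment = "blue",
): string {
  return rollPercent < canaryPercent ? canaryEnvironment : baseEnvironment;
}

// Evenly spaced draws instead of Math.random() keep the demonstration deterministic.
let canaryHits = 0;
const total = 1000;
for (let i = 0; i < total; i++) {
  if (pickEnvironment(10, i % 100) === "blue") {
    canaryHits++;
  }
}
console.log(`${canaryHits} of ${total} requests reached the canary`);
```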
Gradual Rollout
- Deploy new release with CANARY_ENABLED=true INITIAL_CANARY_PERCENT=10
- Custom resource configures canary at 10%
- CanaryMonitor runs every 15 minutes:
  - Checks error rate for canary traffic
  - If healthy: Increment to 20%, 30%, 40%...
  - If unhealthy: Rollback to 0% (instant switch to base)
- Promotion at 100%: Update base to blue, disable canary
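The rollout steps above amount to a small state machine. The sketch below is one possible reading of it. The decideNextAction function and the CanaryAction type are hypothetical, and it assumes promotion happens on the first healthy check after the canary has reached the target percent.

```typescript
type CanaryAction =
  | { kind: "increment"; nextPercent: number }
  | { kind: "promote" }
  | { kind: "rollback"; nextPercent: 0 };

// Hypothetical helper: decides what a single health check should do.
function decideNextAction(
  currentPercent: number,
  healthy: boolean,
  incrementPercent = 10,
  targetPercent = 100,
): CanaryAction {
  if (!healthy) {
    return { kind: "rollback", nextPercent: 0 }; // all traffic back to base
  }
  if (currentPercent >= targetPercent) {
    return { kind: "promote" }; // assumption: promote once the target was already reached
  }
  // Never overshoot the target, even with an uneven step size.
  const nextPercent = Math.min(currentPercent + incrementPercent, targetPercent);
  return { kind: "increment", nextPercent };
}

console.log(decideNextAction(10, true));
console.log(decideNextAction(100, true));
console.log(decideNextAction(40, false));
```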
CloudWatch Metrics
API Gateway separates metrics by dimension:
- prod: Base traffic metrics
- prod/canary: Canary traffic metrics
This allows precise error rate comparison.
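To make the comparison concrete, here is a minimal sketch that turns CloudWatch datapoints into an error rate for each side. It assumes the datapoints come from the 5XXError and Count metrics with the Sum statistic. The Datapoint interface and the function names are hypothetical, and the sample numbers are made up for illustration.

```typescript
// Minimal view of a CloudWatch datapoint. Only the Sum statistic is used here.
interface Datapoint {
  Sum?: number;
}

function sumOf(points: Datapoint[]): number {
  return points.reduce((total, point) => total + (point.Sum ?? 0), 0);
}

// Returns undefined when there was no traffic, so callers cannot mistake "no data" for 0%.
function errorRatePercent(errorPoints: Datapoint[], countPoints: Datapoint[]): number | undefined {
  const requests = sumOf(countPoints);
  if (requests === 0) {
    return undefined;
  }
  return (sumOf(errorPoints) * 100) / requests;
}

// Illustrative numbers only.
const baseRate = errorRatePercent([{ Sum: 1 }], [{ Sum: 200 }]);
const canaryRate = errorRatePercent([{ Sum: 1 }, { Sum: 2 }], [{ Sum: 20 }, { Sum: 30 }]);
console.log(`base ${baseRate}% vs canary ${canaryRate}%`);
```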
Deployment
Standard Blue-Green Deployment (Instant Switch)
# Deploy without canary - instant 100% switch
RELEASE_VERSION=v2.0.0 npm run deploy
Canary Deployment (Gradual Rollout)
# Deploy with canary - start at 10%, auto-increment to 100%
CANARY_ENABLED=true \
INITIAL_CANARY_PERCENT=10 \
CANARY_TARGET_PERCENT=100 \
CANARY_INCREMENT_PERCENT=10 \
CANARY_INCREMENT_INTERVAL=15 \
CANARY_ERROR_THRESHOLD=5.0 \
CANARY_MONITORING_WINDOW=5 \
RELEASE_VERSION=v2.0.0 npm run deploy
Environment Variables:
- CANARY_ENABLED: Enable canary deployment (true/false)
- INITIAL_CANARY_PERCENT: Starting traffic percentage (default: 10)
- CANARY_TARGET_PERCENT: Target traffic before promotion (default: 100)
- CANARY_INCREMENT_PERCENT: Traffic increment per check (default: 10)
- CANARY_INCREMENT_INTERVAL: Minutes between increments (default: 15)
- CANARY_ERROR_THRESHOLD: % error rate for rollback (default: 5.0)
- CANARY_MONITORING_WINDOW: Minutes for error rate calculation (default: 5)
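The variables and defaults listed above could be read into a typed object along these lines. This is a sketch only: parseCanaryConfig, numberOrDefault, and the CanaryConfig field names are hypothetical and may not match how bin/app.ts reads its configuration.

```typescript
interface CanaryConfig {
  enabled: boolean;
  initialPercent: number;
  targetPercent: number;
  incrementPercent: number;
  incrementIntervalMinutes: number;
  errorThresholdPercent: number;
  monitoringWindowMinutes: number;
}

type Env = Record<string, string | undefined>;

// Falls back to the documented default when the variable is unset or blank.
function numberOrDefault(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) {
    throw new Error(`${name} must be numeric, got "${raw}"`);
  }
  return parsed;
}

function parseCanaryConfig(env: Env): CanaryConfig {
  return {
    enabled: env["CANARY_ENABLED"] === "true",
    initialPercent: numberOrDefault(env, "INITIAL_CANARY_PERCENT", 10),
    targetPercent: numberOrDefault(env, "CANARY_TARGET_PERCENT", 100),
    incrementPercent: numberOrDefault(env, "CANARY_INCREMENT_PERCENT", 10),
    incrementIntervalMinutes: numberOrDefault(env, "CANARY_INCREMENT_INTERVAL", 15),
    errorThresholdPercent: numberOrDefault(env, "CANARY_ERROR_THRESHOLD", 5.0),
    monitoringWindowMinutes: numberOrDefault(env, "CANARY_MONITORING_WINDOW", 5),
  };
}

const config = parseCanaryConfig({ CANARY_ENABLED: "true", INITIAL_CANARY_PERCENT: "25" });
console.log(config);
```

In a Node.js entry point, process.env could be passed as the env argument.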
Manual Controls
Check Canary Status
npm run canary-status
Shows:
- Current canary percent
- Base environment (green)
- Canary environment (blue)
- Release versions
Manual Gradual Rollout
# Increment from 10% to 100% in 10% steps every 15 minutes
npm run gradual-rollout 10 100 10 15
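The four arguments (start, target, step, interval) fully determine the rollout timeline. The sketch below computes that timeline under the assumption that the first step applies immediately and each later step follows one interval after the previous one. planRollout and RolloutStep are hypothetical names, not the script behind npm run gradual-rollout.

```typescript
interface RolloutStep {
  percent: number;
  atMinute: number; // minutes after the rollout starts
}

// Hypothetical helper: lists every traffic level and when it would be applied.
function planRollout(
  startPercent: number,
  targetPercent: number,
  stepPercent: number,
  intervalMinutes: number,
): RolloutStep[] {
  if (stepPercent <= 0) throw new RangeError("stepPercent must be positive");
  if (intervalMinutes <= 0) throw new RangeError("intervalMinutes must be positive");

  const steps: RolloutStep[] = [{ percent: startPercent, atMinute: 0 }];
  let percent = startPercent;
  let minute = 0;
  while (percent < targetPercent) {
    percent = Math.min(percent + stepPercent, targetPercent); // clamp the final step
    minute += intervalMinutes;
    steps.push({ percent, atMinute: minute });
  }
  return steps;
}

const plan = planRollout(10, 100, 10, 15);
const last = plan[plan.length - 1];
console.log(`${plan.length - 1} increments, reaching ${last.percent}% after ${last.atMinute} minutes`);
```

With the arguments from the command above, this yields nine increments over 135 minutes.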
Emergency Rollback
# Instantly set canary to 0% (all traffic to base)
npm run canary-rollback
Cost Analysis
Canary Deployment: ~$0.01/month
- CanaryMonitor Lambda: 96 invocations/day × 30 days = 2,880/month (within 1M free tier)
- CloudWatch GetMetricStatistics: 2,880 API calls = ~$0.01/month
- API Gateway: No additional cost (same request volume)
- CanaryConfigurator Lambda: Only runs during deployment (~1 invocation = free tier)
Total Infrastructure Cost
- Lambda: ~$0/month (within free tier)
- DynamoDB: ~$0/month (within free tier)
- CloudWatch: ~$0.01/month
- EventBridge: ~$0/month (within free tier)
- API Gateway: Pay per request (same as before)
Total: ~$0.01/month for full blue-green + canary monitoring
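The invocation count used in the estimate follows directly from the check interval. A small worked calculation, with the hypothetical helper monthlyInvocations and the 1M-request free tier figure taken from the list above:

```typescript
// Hypothetical helper: scheduled invocations per month for a given check interval.
function monthlyInvocations(intervalMinutes: number, daysPerMonth = 30): number {
  if (intervalMinutes <= 0) throw new RangeError("intervalMinutes must be positive");
  const perDay = Math.floor((24 * 60) / intervalMinutes); // 96 for a 15-minute interval
  return perDay * daysPerMonth;
}

const LAMBDA_FREE_TIER_REQUESTS = 1000000; // figure quoted in this document
const invocations = monthlyInvocations(15);
const withinFreeTier = invocations <= LAMBDA_FREE_TIER_REQUESTS;
console.log(`${invocations} invocations/month, within free tier: ${withinFreeTier}`);
```

Shortening the interval to 5 minutes would triple the count to 8,640, which is still far below the free tier limit.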
Configuration
Adjust Canary Parameters
Edit bin/app.ts:
const monitoringStack = new MonitoringStack(app, 'ReleaseMonitoringStack', {
  // ...
  canaryEnabled: process.env.CANARY_ENABLED === 'true',
  canaryTargetPercent: 100,     // Promote after reaching this %
  canaryIncrementPercent: 10,   // Increment step size
  canaryErrorThreshold: 5.0,    // % error rate for rollback
  canaryMonitoringWindow: 5,    // Minutes for error rate calculation
  canaryIncrementInterval: 15,  // Minutes between health checks
});
Adjust Lock-In Period
const monitoringStack = new MonitoringStack(app, 'ReleaseMonitoringStack', {
  // ...
  lockInPeriodMinutes: 60, // After 60 min, automatic rollback is disabled
});
After the lock-in period, automatic rollback is disabled, on the assumption that errors appearing that late are more likely environmental than caused by the release. Manual rollback remains available: npm run rollback-release
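The lock-in check itself is a simple time comparison. A minimal sketch, assuming the deployment timestamp is available as a Date and that rollback stops being automatic once the full period has elapsed. The function name isAutoRollbackAllowed is hypothetical.

```typescript
// Hypothetical helper: true while the release is still inside its lock-in period.
function isAutoRollbackAllowed(
  deployedAt: Date,
  now: Date,
  lockInPeriodMinutes = 60,
): boolean {
  const elapsedMinutes = (now.getTime() - deployedAt.getTime()) / 60000;
  // Assumption: at exactly lockInPeriodMinutes, automatic rollback is already disabled.
  return elapsedMinutes < lockInPeriodMinutes;
}

const deployedAt = new Date("2025-01-25T10:00:00Z");
console.log(isAutoRollbackAllowed(deployedAt, new Date("2025-01-25T10:30:00Z"))); // 30 minutes in
console.log(isAutoRollbackAllowed(deployedAt, new Date("2025-01-25T11:05:00Z"))); // 65 minutes in
```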
Testing Canary Rollback
- Introduce an error:
// In src/lambda/helloWorld/src/helloWorld.ts
export async function handler() {
  throw new Error('Simulated canary error');
}
- Deploy with canary:
CANARY_ENABLED=true INITIAL_CANARY_PERCENT=10 RELEASE_VERSION=v2.0.0 npm run deploy
- Generate traffic:
for i in {1..100}; do
  curl https://YOUR-API-ID.execute-api.YOUR-REGION.amazonaws.com/prod/hello
done
- Wait for health check (every 15 minutes):
  - CanaryMonitor detects high error rate in canary traffic
  - Automatically sets canary to 0%
  - All traffic routes to base (green, previous release)
- Check logs:
aws logs tail /aws/lambda/CanaryMonitorFunction --follow
Benefits
Canary vs Instant Switch
Instant Blue-Green:
- ✅ Fastest deployment (< 1 second switch)
- ✅ Simple rollback mechanism
- ❌ All users affected if errors occur
- ❌ No gradual validation
Canary Deployment:
- ✅ Only 10% of users affected initially
- ✅ Gradual validation with automatic increments
- ✅ Early error detection before full rollout
- ✅ Same instant rollback capability
- ❌ Longer deployment time (2-3 hours for full rollout)
When to Use Each
Use Instant Blue-Green:
- High confidence in release (e.g., bug fixes, minor updates)
- Low-risk changes
- Need immediate rollout
- After-hours deployments with monitoring
Use Canary:
- Major releases or architecture changes
- New features with uncertain performance
- High-traffic production systems
- Business-hours deployments
- When blast radius matters
Monitoring
CloudWatch Metrics
# Base traffic metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--dimensions Name=ApiName,Value=prod \
--start-time 2025-01-25T10:00:00Z \
--end-time 2025-01-25T11:00:00Z \
--period 300 \
--statistics Sum
# Canary traffic metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--dimensions Name=ApiName,Value=prod/canary \
--start-time 2025-01-25T10:00:00Z \
--end-time 2025-01-25T11:00:00Z \
--period 300 \
--statistics Sum
Lambda Logs
# Canary Monitor logs
aws logs tail /aws/lambda/CanaryMonitorFunction --follow
# Canary Configurator logs
aws logs tail /aws/lambda/ReleaseDeploymentStack-CanaryConfigurator --follow
SSM Parameters
# Check canary state
aws ssm get-parameter --name /release/canary-enabled
aws ssm get-parameter --name /release/canary-percent
aws ssm get-parameter --name /release/canary-deployment-timestamp
# Check release versions
aws ssm get-parameter --name /release/active-environment
aws ssm get-parameter --name /release/blue-version
aws ssm get-parameter --name /release/green-version
Troubleshooting
Issue: Canary not incrementing
Check:
- EventBridge rule enabled: aws events describe-rule --name CanaryMonitorRule
- CloudWatch metrics exist: Verify base and canary traffic metrics
- Error rate threshold: Check if error rate is below 5%
- Minimum invocations: Need at least 5 invocations in monitoring window
- CanaryMonitor logs: aws logs tail /aws/lambda/CanaryMonitorFunction --follow
Issue: Canary settings not applied
Check:
- Custom resource invoked: Check CloudFormation events
- CanaryConfigurator logs: Look for API Gateway UpdateStage calls
- API Gateway stage: aws apigateway get-stage --rest-api-id YOUR-ID --stage-name prod
- IAM permissions: Verify CanaryConfigurator has apigateway:PATCH permission on the stage resource (the IAM action that authorizes UpdateStage calls)
Issue: False rollback with low traffic
Solution: Increase monitoring window for low-traffic functions
canaryMonitoringWindow: 15, // Use 15-minute window instead of 5
This accumulates more samples before calculating error rate.
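Some quick arithmetic shows why the window matters. The sketch below estimates how many canary requests fall into a monitoring window, and how many samples are needed before a given number of errors stays at or below the threshold. Both function names are hypothetical, the traffic figures are illustrative, and it assumes rollback triggers only when the error rate is strictly above the threshold.

```typescript
// Expected number of canary requests observed in one monitoring window.
function expectedCanaryInvocations(
  requestsPerMinute: number,
  canaryPercent: number,
  windowMinutes: number,
): number {
  return (requestsPerMinute * windowMinutes * canaryPercent) / 100;
}

// Smallest sample size at which `errors` failures do not exceed the threshold.
function samplesNeededToTolerate(errors: number, errorThresholdPercent: number): number {
  return Math.ceil((errors * 100) / errorThresholdPercent);
}

// Illustrative low-traffic function: 5 requests/minute with a 10% canary.
const shortWindow = expectedCanaryInvocations(5, 10, 5);
const longWindow = expectedCanaryInvocations(5, 10, 15);
console.log(`5-minute window: ${shortWindow} canary requests`);
console.log(`15-minute window: ${longWindow} canary requests`);
console.log(`samples needed to tolerate one error at 5%: ${samplesNeededToTolerate(1, 5)}`);
```

At this traffic level the 5-minute window sees fewer canary requests than the 5-invocation minimum, while the 15-minute window clears it. A single transient error still needs about 20 samples to stay within a 5% threshold, so very low-traffic functions may need an even longer window.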
Future Enhancements
- CloudWatch Alarms: Integrate with CloudWatch Alarms for advanced monitoring
- Custom Metrics: Track business metrics (e.g., conversion rate) during canary
- Multi-Region: Extend canary to multi-region deployments
- A/B Testing: Use canary settings for feature flags and A/B tests
- Automated Load Testing: Trigger load tests during canary deployment