Canary Deployment Enhancement for Release-Level Blue-Green
This document proposes adding gradual traffic shifting capabilities to the existing release-level blue-green deployment system using AWS API Gateway Canary Deployments.
Problem Statement
The current implementation switches 100% of traffic instantly between the blue and green environments. While fast (<1 second), this approach:
- ❌ Exposes all users to potential issues immediately
- ❌ Provides limited time to detect problems
- ❌ Requires aggressive monitoring within lock-in period
- ❌ All-or-nothing rollback affects entire user base
Proposed Solution: Canary Deployments
Add percentage-based traffic routing at API Gateway stage level, enabling gradual rollout of new releases.
Example Workflow
Release v2.0 deployment:
00:00 - Deploy v2.0 to blue environment
00:15 - 10% of traffic → v2.0 (canary)
00:30 - 25% of traffic → v2.0
00:45 - 50% of traffic → v2.0
01:00 - 75% of traffic → v2.0
01:15 - 100% of traffic → v2.0 (promotion)
If errors are detected at any stage:
- Set canary to 0% immediately
- Only users in the canary percentage are affected
- The majority of users are unaffected
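The timeline in the example workflow follows from a list of traffic stages and a fixed interval. The sketch below reproduces it; `buildTimeline` and `formatOffset` are hypothetical helper names, and the 15-minute interval is taken from the example rather than from any fixed system setting.

```typescript
interface TimelineEntry {
  time: string;    // "HH:MM" offset from the start of the deployment
  percent: number; // share of traffic routed to the new release
}

// Formats a minute offset as zero-padded "HH:MM".
function formatOffset(totalMinutes: number): string {
  const pad = (n: number): string => (n < 10 ? `0${n}` : `${n}`);
  return `${pad(Math.floor(totalMinutes / 60))}:${pad(totalMinutes % 60)}`;
}

// The deployment itself happens at 00:00; the first stage starts one interval later.
function buildTimeline(stages: number[], intervalMinutes: number): TimelineEntry[] {
  return stages.map((percent, index) => ({
    time: formatOffset((index + 1) * intervalMinutes),
    percent,
  }));
}

const timeline = buildTimeline([10, 25, 50, 75, 100], 15);
for (const entry of timeline) {
  console.log(`${entry.time} - ${entry.percent}% of traffic → v2.0`);
}
```

With these inputs the last entry lands at 01:15, matching the promotion step in the workflow.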
Architecture
Implementation
1. Update ReleaseDeploymentStack
// lib/ReleaseDeploymentStack.ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import {
RestApi,
Deployment,
Stage,
CfnStage,
MethodLoggingLevel,
} from 'aws-cdk-lib/aws-apigateway';
import { StringParameter } from 'aws-cdk-lib/aws-ssm';
export interface ReleaseDeploymentStackProps extends StackProps {
readonly releaseVersion: string;
readonly lambdaConfigs: LambdaConfig[];
readonly canaryEnabled?: boolean;
readonly initialCanaryPercent?: number;
}
export class ReleaseDeploymentStack extends Stack {
public readonly api: RestApi;
public readonly productionStage: Stage;
public readonly baseDeployment: Deployment;
public readonly canaryDeployment: Deployment;
constructor(scope: Construct, id: string, props: ReleaseDeploymentStackProps) {
super(scope, id, props);
const {
releaseVersion,
lambdaConfigs,
canaryEnabled = false,
initialCanaryPercent = 0,
} = props;
// ... existing Lambda deployment code ...
// Create base deployment (current stable)
this.baseDeployment = new Deployment(this, 'BaseDeployment', {
api: this.api,
description: `Base deployment for release ${releaseVersion}`,
});
// Create canary deployment (new release)
this.canaryDeployment = new Deployment(this, 'CanaryDeployment', {
api: this.api,
description: `Canary deployment for release ${releaseVersion}`,
});
// Canary settings (if enabled)
const canarySetting: CfnStage.CanarySettingProperty | undefined = canaryEnabled
? {
deploymentId: this.canaryDeployment.deploymentId, // Serve the canary from the canary deployment
percentTraffic: initialCanaryPercent,
useStageCache: false,
stageVariableOverrides: {
activeEnvironment: 'blue', // Canary uses blue (new release)
},
}
: undefined;
// Create production stage
this.productionStage = new Stage(this, 'ProductionStage', {
deployment: this.baseDeployment,
stageName: 'prod',
variables: {
activeEnvironment: 'green', // Base uses green (stable)
releaseVersion: releaseVersion,
},
loggingLevel: MethodLoggingLevel.INFO,
dataTraceEnabled: true,
metricsEnabled: true,
});
// The L2 Stage props do not appear to expose canary settings, so apply them
// to the underlying CfnStage through the escape hatch.
if (canarySetting) {
(this.productionStage.node.defaultChild as CfnStage).canarySetting = canarySetting;
}
// Store canary configuration in SSM
new StringParameter(this, 'CanaryEnabledParameter', {
parameterName: '/release/canary-enabled',
stringValue: canaryEnabled.toString(),
description: 'Whether canary deployments are enabled',
});
new StringParameter(this, 'CanaryPercentParameter', {
parameterName: '/release/canary-percent',
stringValue: initialCanaryPercent.toString(),
description: 'Current percentage of traffic routed to canary',
});
new StringParameter(this, 'CanaryDeploymentTimestampParameter', {
parameterName: '/release/canary-deployment-timestamp',
stringValue: Date.now().toString(),
description: 'Timestamp when canary deployment started',
});
}
}
2. Create Canary Monitor Lambda
// src/lambda/canaryMonitor/src/canaryMonitor.ts
import { Context } from 'aws-lambda';
import { APIGatewayClient, GetStageCommand, UpdateStageCommand } from '@aws-sdk/client-api-gateway';
import { SSMClient, GetParameterCommand, PutParameterCommand } from '@aws-sdk/client-ssm';
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';
const apigateway = new APIGatewayClient({});
const ssm = new SSMClient({});
const cloudwatch = new CloudWatchClient({});
const API_GATEWAY_ID = process.env.API_GATEWAY_ID!;
const STAGE_NAME = process.env.STAGE_NAME || 'prod';
const TARGET_PERCENT = parseInt(process.env.TARGET_PERCENT || '100', 10);
const INCREMENT_PERCENT = parseInt(process.env.INCREMENT_PERCENT || '10', 10);
const ERROR_THRESHOLD = parseFloat(process.env.ERROR_THRESHOLD || '5.0');
const MONITORING_WINDOW_MINUTES = parseInt(process.env.MONITORING_WINDOW_MINUTES || '15', 10);
export interface CanaryMonitorEvent {
action?: 'increment' | 'rollback' | 'promote' | 'status';
}
/**
* Canary Monitor Lambda
*
* Monitors canary deployment health and gradually shifts traffic
* Automatically rolls back if error rate exceeds threshold
*/
export async function handler(
event: CanaryMonitorEvent,
context: Context
): Promise<any> {
console.log('🔍 Canary Monitor triggered');
console.log('Event:', JSON.stringify(event, null, 2));
const action = event.action || 'increment';
try {
switch (action) {
case 'status':
return await getCanaryStatus();
case 'increment':
return await incrementCanary();
case 'rollback':
return await rollbackCanary();
case 'promote':
return await promoteCanary();
default:
throw new Error(`Unknown action: ${action}`);
}
} catch (error) {
console.error('❌ Canary Monitor failed:', error);
throw error;
}
}
async function getCanaryStatus(): Promise<any> {
const stage = await apigateway.send(new GetStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
}));
const currentPercent = stage.canarySettings?.percentTraffic || 0;
const canaryParam = await ssm.send(new GetParameterCommand({
Name: '/release/canary-percent',
}));
return {
currentPercent,
targetPercent: TARGET_PERCENT,
ssmPercent: canaryParam.Parameter?.Value,
canarySettings: stage.canarySettings,
};
}
async function incrementCanary(): Promise<any> {
// Get current canary percentage
const currentPercent = await getCurrentCanaryPercent();
console.log(`Current canary traffic: ${currentPercent}%`);
// 0% means no rollout is in progress (not started, rolled back, or already promoted).
// Skip, so that the schedule does not restart a canary that was just rolled back.
// An automatic rollout therefore needs a non-zero initial canary percentage.
if (currentPercent === 0) {
console.log('No active canary rollout; nothing to do.');
return { action: 'none', currentPercent };
}
// Check canary health before shifting more traffic or promoting
const canaryErrorRate = await getCanaryErrorRate();
console.log(`Canary error rate: ${canaryErrorRate.toFixed(2)}%`);
if (canaryErrorRate > ERROR_THRESHOLD) {
console.error(`⚠️ Canary error rate ${canaryErrorRate.toFixed(2)}% exceeds threshold ${ERROR_THRESHOLD}%`);
return await rollbackCanary();
}
if (currentPercent >= TARGET_PERCENT) {
console.log(`🎉 Target ${TARGET_PERCENT}% reached. Promoting canary to production.`);
return await promoteCanary();
}
// Increment canary percentage
const newPercent = Math.min(currentPercent + INCREMENT_PERCENT, TARGET_PERCENT);
console.log(`📈 Increasing canary traffic: ${currentPercent}% → ${newPercent}%`);
await updateCanaryPercent(newPercent);
return {
action: 'increment',
previousPercent: currentPercent,
newPercent,
targetPercent: TARGET_PERCENT,
canaryHealthy: true,
};
}
async function getCurrentCanaryPercent(): Promise<number> {
const stage = await apigateway.send(new GetStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
}));
return stage.canarySettings?.percentTraffic || 0;
}
async function updateCanaryPercent(percent: number): Promise<void> {
await apigateway.send(new UpdateStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
patchOperations: [
{
op: 'replace',
path: '/canarySettings/percentTraffic',
value: percent.toString(),
},
],
}));
await ssm.send(new PutParameterCommand({
Name: '/release/canary-percent',
Value: percent.toString(),
Overwrite: true,
}));
console.log(`✅ Updated canary traffic to ${percent}%`);
}
async function getCanaryErrorRate(): Promise<number> {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - MONITORING_WINDOW_MINUTES * 60 * 1000);
// CloudWatch's ApiName dimension takes the REST API's name, not its ID.
// API_NAME is an assumed extra environment variable; it falls back to the ID.
const apiName = process.env.API_NAME ?? API_GATEWAY_ID;
const dimensions = [
{ Name: 'ApiName', Value: apiName },
{ Name: 'Stage', Value: `${STAGE_NAME}/canary` },
];
// Fetch the Sum of one metric over the monitoring window. Period boundaries can
// split the window into several datapoints, so add them all up.
const sumMetric = async (metricName: string): Promise<number> => {
const result = await cloudwatch.send(new GetMetricStatisticsCommand({
Namespace: 'AWS/ApiGateway',
MetricName: metricName,
Dimensions: dimensions,
StartTime: startTime,
EndTime: endTime,
Period: MONITORING_WINDOW_MINUTES * 60,
Statistics: ['Sum'],
}));
return (result.Datapoints ?? []).reduce((total, point) => total + (point.Sum ?? 0), 0);
};
const [errors, invocations] = await Promise.all([
sumMetric('5XXError'),
sumMetric('Count'),
]);
if (invocations === 0) {
// No traffic is treated as healthy; tighten this if silence should block the rollout.
console.warn('⚠️ No canary invocations found in monitoring window');
return 0;
}
return (errors / invocations) * 100;
}
async function rollbackCanary(): Promise<any> {
console.log('🔄 Rolling back canary deployment');
const currentPercent = await getCurrentCanaryPercent();
await updateCanaryPercent(0);
console.log('✅ Canary rolled back to 0%');
return {
action: 'rollback',
previousPercent: currentPercent,
newPercent: 0,
reason: 'High error rate detected',
};
}
async function promoteCanary(): Promise<any> {
console.log('🎉 Promoting canary to production');
// Promotion steps:
// 1. Update stage variables to make canary the new base
// 2. Set canary percentage to 0%
// 3. Update SSM parameters
await apigateway.send(new UpdateStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
patchOperations: [
{
op: 'replace',
path: '/variables/activeEnvironment',
value: 'blue', // Make blue the new active
},
{
op: 'replace',
path: '/canarySettings/percentTraffic',
value: '0',
},
],
}));
await ssm.send(new PutParameterCommand({
Name: '/release/canary-percent',
Value: '0',
Overwrite: true,
}));
await ssm.send(new PutParameterCommand({
Name: '/release/active-environment',
Value: 'blue',
Overwrite: true,
}));
console.log('✅ Canary promoted to production');
return {
action: 'promote',
newActiveEnvironment: 'blue',
canaryPercent: 0,
};
}
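The monitor's control flow can be separated from the AWS calls, which makes it easy to unit test. The following is a minimal sketch of that decision logic under stated assumptions: `decideNextStep`, `CanaryPolicy`, and `CanaryDecision` are hypothetical names, and the sketch deliberately checks the error threshold before the promotion check so that an unhealthy canary at the target percentage is rolled back rather than promoted.

```typescript
type CanaryAction = 'rollback' | 'promote' | 'increment';

interface CanaryPolicy {
  targetPercent: number;         // e.g. TARGET_PERCENT
  incrementPercent: number;      // e.g. INCREMENT_PERCENT
  errorThresholdPercent: number; // e.g. ERROR_THRESHOLD
}

interface CanaryDecision {
  action: CanaryAction;
  newPercent: number; // canary traffic share after the action is applied
}

// Pure function: given the current state, decide what the monitor should do next.
function decideNextStep(
  currentPercent: number,
  errorRatePercent: number,
  policy: CanaryPolicy
): CanaryDecision {
  if (errorRatePercent > policy.errorThresholdPercent) {
    return { action: 'rollback', newPercent: 0 };
  }
  if (currentPercent >= policy.targetPercent) {
    // Promotion makes the canary the new base, so canary traffic returns to 0%.
    return { action: 'promote', newPercent: 0 };
  }
  const newPercent = Math.min(currentPercent + policy.incrementPercent, policy.targetPercent);
  return { action: 'increment', newPercent };
}

const policy: CanaryPolicy = { targetPercent: 100, incrementPercent: 10, errorThresholdPercent: 5 };
console.log(decideNextStep(10, 1.2, policy));
console.log(decideNextStep(50, 6.0, policy));
console.log(decideNextStep(100, 0.4, policy));
```

The handler would then map each `action` onto the corresponding `updateCanaryPercent`, `rollbackCanary`, or `promoteCanary` call.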
3. Create Canary Monitor Construct
// lib/lambda/CanaryMonitorLambda.ts
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Runtime } from 'aws-cdk-lib/aws-lambda';
import { LogGroup, RetentionDays } from 'aws-cdk-lib/aws-logs';
import { Rule, Schedule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
export interface CanaryMonitorLambdaProps {
readonly functionName: string;
readonly apiGatewayId: string;
readonly stageName: string;
readonly targetPercent: number;
readonly incrementPercent: number;
readonly errorThreshold: number;
readonly monitoringWindowMinutes: number;
readonly incrementIntervalMinutes: number;
}
export class CanaryMonitorLambda extends Construct {
public readonly lambda: NodejsFunction;
constructor(scope: Construct, id: string, props: CanaryMonitorLambdaProps) {
super(scope, id);
const {
functionName,
apiGatewayId,
stageName,
targetPercent,
incrementPercent,
errorThreshold,
monitoringWindowMinutes,
incrementIntervalMinutes,
} = props;
const logGroup = new LogGroup(this, 'LogGroup', {
logGroupName: `/aws/lambda/${functionName}`,
retention: RetentionDays.ONE_WEEK,
});
this.lambda = new NodejsFunction(this, 'Function', {
functionName,
runtime: Runtime.NODEJS_22_X,
entry: './src/lambda/canaryMonitor/src/canaryMonitor.ts',
handler: 'handler',
timeout: Duration.seconds(60),
memorySize: 512,
logGroup,
environment: {
API_GATEWAY_ID: apiGatewayId,
STAGE_NAME: stageName,
TARGET_PERCENT: targetPercent.toString(),
INCREMENT_PERCENT: incrementPercent.toString(),
ERROR_THRESHOLD: errorThreshold.toString(),
MONITORING_WINDOW_MINUTES: monitoringWindowMinutes.toString(),
},
bundling: {
minify: true,
sourceMap: true,
},
});
// Grant permissions
this.lambda.addToRolePolicy(
new PolicyStatement({
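actions: [
// API Gateway management permissions are expressed as HTTP verbs;
// UpdateStage is a PATCH on the stage resource.
'apigateway:GET',
'apigateway:PATCH',
],
resources: [
// Build the stage ARN from the function's own region
`arn:aws:apigateway:${this.lambda.env.region}::/restapis/${apiGatewayId}/stages/${stageName}`,
],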
})
);
this.lambda.addToRolePolicy(
new PolicyStatement({
actions: [
'ssm:GetParameter',
'ssm:PutParameter',
],
resources: [
`arn:aws:ssm:*:*:parameter/release/*`,
],
})
);
this.lambda.addToRolePolicy(
new PolicyStatement({
actions: [
'cloudwatch:GetMetricStatistics',
],
resources: ['*'],
})
);
// Create EventBridge rule for automatic increments
const incrementRule = new Rule(this, 'IncrementRule', {
schedule: Schedule.rate(Duration.minutes(incrementIntervalMinutes)),
description: 'Automatically increment canary percentage',
});
incrementRule.addTarget(
new LambdaFunction(this.lambda, {
event: { action: 'increment' },
})
);
}
}
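The API Gateway permission above is scoped to a single stage. As a small illustration of the ARN shape that the policy relies on, the sketch below builds it from its parts; `stageArn` is a hypothetical helper, the region and API ID are placeholder values, and the standard `aws` partition is assumed. API Gateway ARNs leave the account field empty, which is why two colons follow the region.

```typescript
// Builds the ARN of an API Gateway REST API stage for use in an IAM policy.
function stageArn(region: string, restApiId: string, stageName: string): string {
  return `arn:aws:apigateway:${region}::/restapis/${restApiId}/stages/${stageName}`;
}

// Placeholder values for illustration only.
console.log(stageArn('us-east-1', 'a1b2c3d4e5', 'prod'));
```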
4. Create Gradual Rollout CLI Script
#!/usr/bin/env node
// scripts/gradual-rollout.ts
import { APIGatewayClient, UpdateStageCommand, GetStageCommand } from '@aws-sdk/client-api-gateway';
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';
import { CloudFormationClient, DescribeStacksCommand } from '@aws-sdk/client-cloudformation';
const apigateway = new APIGatewayClient({});
const ssm = new SSMClient({});
const cfn = new CloudFormationClient({});
interface RolloutConfig {
startPercent: number;
targetPercent: number;
incrementPercent: number;
intervalMinutes: number;
}
async function getApiGatewayId(): Promise<string> {
const result = await cfn.send(new DescribeStacksCommand({
StackName: 'ReleaseDeploymentStack',
}));
const outputs = result.Stacks?.[0]?.Outputs || [];
const apiIdOutput = outputs.find(o => o.OutputKey === 'ApiId');
if (!apiIdOutput?.OutputValue) {
throw new Error('API Gateway ID not found in stack outputs');
}
return apiIdOutput.OutputValue;
}
async function getCurrentCanaryPercent(apiGatewayId: string): Promise<number> {
const stage = await apigateway.send(new GetStageCommand({
restApiId: apiGatewayId,
stageName: 'prod',
}));
return stage.canarySettings?.percentTraffic || 0;
}
async function updateCanaryPercent(apiGatewayId: string, percent: number): Promise<void> {
await apigateway.send(new UpdateStageCommand({
restApiId: apiGatewayId,
stageName: 'prod',
patchOperations: [
{
op: 'replace',
path: '/canarySettings/percentTraffic',
value: percent.toString(),
},
],
}));
await ssm.send(new PutParameterCommand({
Name: '/release/canary-percent',
Value: percent.toString(),
Overwrite: true,
}));
}
async function gradualRollout(config: RolloutConfig): Promise<void> {
const { startPercent, targetPercent, incrementPercent, intervalMinutes } = config;
console.log('\n🚀 Starting gradual canary rollout...');
console.log(` Start: ${startPercent}%`);
console.log(` Target: ${targetPercent}%`);
console.log(` Increment: ${incrementPercent}%`);
console.log(` Interval: ${intervalMinutes} minutes\n`);
const apiGatewayId = await getApiGatewayId();
const currentPercent = await getCurrentCanaryPercent(apiGatewayId);
console.log(`📊 Current canary traffic: ${currentPercent}%\n`);
if (currentPercent >= targetPercent) {
console.log(`Canary traffic is already at or above the ${targetPercent}% target; nothing to do.\n`);
return;
}
// Apply the starting percentage first (or resume from the current value if it is higher),
// then wait one interval before each further increment.
let nextPercent = Math.min(Math.max(startPercent, currentPercent), targetPercent);
for (;;) {
console.log(`⏳ Updating canary traffic to ${nextPercent}%...`);
await updateCanaryPercent(apiGatewayId, nextPercent);
console.log(`✅ Canary traffic updated to ${nextPercent}%`);
if (nextPercent >= targetPercent) {
break;
}
console.log(`⏸️ Waiting ${intervalMinutes} minutes before next increment...\n`);
await new Promise(resolve => setTimeout(resolve, intervalMinutes * 60 * 1000));
nextPercent = Math.min(nextPercent + incrementPercent, targetPercent);
}
console.log('\n🎉 Gradual rollout complete!');
console.log(` Final canary traffic: ${nextPercent}%\n`);
}
// Parse command-line arguments
const args = process.argv.slice(2);
if (args.length === 0 || args[0] === '--help') {
console.log(`
Usage: npm run gradual-rollout <start%> <target%> <increment%> <interval-minutes>
Examples:
npm run gradual-rollout 0 100 10 15
→ Rollout from 0% to 100% in 10% increments every 15 minutes
npm run gradual-rollout 5 50 5 30
→ Conservative rollout: 5% to 50% in 5% increments every 30 minutes
npm run gradual-rollout 10 100 25 5
→ Fast rollout: 10% to 100% in 25% increments every 5 minutes
`);
process.exit(0);
}
const config: RolloutConfig = {
startPercent: parseInt(args[0]) || 0,
targetPercent: parseInt(args[1]) || 100,
incrementPercent: parseInt(args[2]) || 10,
intervalMinutes: parseInt(args[3]) || 15,
};
gradualRollout(config).catch(error => {
console.error('❌ Gradual rollout failed:', error);
process.exit(1);
});
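The `parseInt(x) || default` pattern above is compact but lenient: it silently turns a target of `0` into `100` and accepts trailing garbage such as `10abc`. A stricter parser might look like the sketch below. `parseIntArg` and `parseRolloutArgs` are hypothetical helpers, and the 1440-minute ceiling on the interval is an arbitrary assumption, not a system limit.

```typescript
interface RolloutConfig {
  startPercent: number;
  targetPercent: number;
  incrementPercent: number;
  intervalMinutes: number;
}

// Parses one integer argument, using `fallback` only when the argument is absent.
function parseIntArg(
  raw: string | undefined,
  fallback: number,
  min: number,
  max: number,
  label: string
): number {
  if (raw === undefined) {
    return fallback;
  }
  const value = Number(raw);
  const isInteger = !isNaN(value) && Math.floor(value) === value;
  if (raw.trim() === '' || !isInteger || value < min || value > max) {
    throw new Error(`${label} must be an integer between ${min} and ${max}, got "${raw}"`);
  }
  return value;
}

function parseRolloutArgs(args: string[]): RolloutConfig {
  const config: RolloutConfig = {
    startPercent: parseIntArg(args[0], 0, 0, 100, 'start%'),
    targetPercent: parseIntArg(args[1], 100, 0, 100, 'target%'),
    // A zero increment would never reach the target, so require at least 1.
    incrementPercent: parseIntArg(args[2], 10, 1, 100, 'increment%'),
    intervalMinutes: parseIntArg(args[3], 15, 1, 1440, 'interval-minutes'),
  };
  if (config.startPercent > config.targetPercent) {
    throw new Error('start% must not exceed target%');
  }
  return config;
}

console.log(parseRolloutArgs(['5', '50', '5', '30']));
```

Failing fast on bad input is a design choice here; a rollout that runs for hours is a poor place to discover a mistyped argument.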
5. Update bin/app.ts
// bin/app.ts
const releaseStack = new ReleaseDeploymentStack(app, 'ReleaseDeploymentStack', {
description: `Release ${releaseVersion} with canary deployment`,
releaseVersion,
lambdaConfigs,
canaryEnabled: true, // Enable canary deployments
initialCanaryPercent: 10, // Start with 10% canary traffic
env: {
account: process.env.CDK_DEFAULT_ACCOUNT,
region: process.env.CDK_DEFAULT_REGION,
},
});
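// Deploy Canary Monitor for gradual rollout.
// A construct that creates resources must live inside a Stack, so scope it to releaseStack rather than app.
const canaryMonitor = new CanaryMonitorLambda(releaseStack, 'CanaryMonitor', {
functionName: 'CanaryMonitorFunction',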
apiGatewayId: releaseStack.api.restApiId,
stageName: 'prod',
targetPercent: 100, // Target: 100% traffic to canary
incrementPercent: 10, // Increment by 10% each time
errorThreshold: 5, // Rollback if error rate > 5%
monitoringWindowMinutes: 15, // Monitor over 15-minute windows
incrementIntervalMinutes: 15, // Increment every 15 minutes
});
6. Update package.json
{
"scripts": {
"gradual-rollout": "tsx ./scripts/gradual-rollout.ts",
"canary-status": "tsx ./scripts/canary-status.ts",
"canary-rollback": "tsx ./scripts/canary-rollback.ts"
},
"dependencies": {
"@aws-sdk/client-api-gateway": "^3.713.0",
"@aws-sdk/client-cloudformation": "^3.713.0",
"@aws-sdk/client-cloudwatch": "^3.713.0",
"@aws-sdk/client-ssm": "^3.713.0"
}
}
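The `canary-status` and `canary-rollback` scripts named in `package.json` are not listed here. As a rough sketch of what a rollback script might send, the helper below builds the same `replace` patch on `/canarySettings/percentTraffic` that the Canary Monitor uses. `buildCanaryPercentPatch` and the two interfaces are hypothetical; the returned object mirrors the input shape passed to `UpdateStageCommand` elsewhere in this document, and the API ID is a placeholder.

```typescript
interface PatchOperation {
  op: 'replace';
  path: string;
  value: string;
}

interface UpdateStageInput {
  restApiId: string;
  stageName: string;
  patchOperations: PatchOperation[];
}

// Builds the request that sets the canary traffic share on a stage.
function buildCanaryPercentPatch(
  restApiId: string,
  stageName: string,
  percent: number
): UpdateStageInput {
  if (isNaN(percent) || percent < 0 || percent > 100) {
    throw new Error(`percent must be between 0 and 100, got ${percent}`);
  }
  return {
    restApiId,
    stageName,
    patchOperations: [
      { op: 'replace', path: '/canarySettings/percentTraffic', value: percent.toString() },
    ],
  };
}

// An emergency rollback is the special case of 0%.
const rollbackInput = buildCanaryPercentPatch('YOUR-API-ID', 'prod', 0);
console.log(JSON.stringify(rollbackInput, null, 2));
```

In a real script the result would be passed to `apigateway.send(new UpdateStageCommand(rollbackInput))`, followed by the SSM parameter update.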
Usage
Automatic Gradual Rollout
When you deploy a new release with canary enabled:
RELEASE_VERSION=v2.0.0 npm run deploy
The system automatically:
- Deploys v2.0.0 to blue environment
- Sets canary to 10% traffic
- Monitors canary health every 15 minutes
- Increments by 10% if healthy (10% → 20% → … → 100%)
- Promotes to production at 100%
- Rolls back to 0% if error rate > 5%
Manual Gradual Rollout
# Conservative rollout: 0% → 100% in 10% increments every 15 min (total: 2.5 hours)
npm run gradual-rollout 0 100 10 15
# Fast rollout: 10% → 100% in 25% increments every 5 min (total: 20 minutes)
npm run gradual-rollout 10 100 25 5
# Test rollout: 5% → 20% in 5% increments every 10 min
npm run gradual-rollout 5 20 5 10
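The totals quoted in the comments above assume that the starting percentage is applied first and that the script then waits one interval before each further step. A short sketch of that arithmetic follows; `rolloutSteps` and `totalWaitMinutes` are hypothetical helper names.

```typescript
// Lists every traffic percentage the rollout passes through, start and target included.
function rolloutSteps(startPercent: number, targetPercent: number, incrementPercent: number): number[] {
  if (incrementPercent <= 0) {
    throw new Error('incrementPercent must be positive');
  }
  const steps: number[] = [];
  for (let percent = startPercent; percent < targetPercent; percent += incrementPercent) {
    steps.push(percent);
  }
  steps.push(targetPercent); // the final step is clamped to the target
  return steps;
}

// One wait between each pair of consecutive steps.
function totalWaitMinutes(steps: number[], intervalMinutes: number): number {
  return (steps.length - 1) * intervalMinutes;
}

const conservative = rolloutSteps(0, 100, 10);
console.log(conservative.join(' → '), `(${totalWaitMinutes(conservative, 15)} min)`);

const fast = rolloutSteps(10, 100, 25);
console.log(fast.join(' → '), `(${totalWaitMinutes(fast, 5)} min)`);
```

Under these assumptions the conservative rollout takes 150 minutes (2.5 hours) and the fast rollout 20 minutes. The test rollout (5 → 20 in steps of 5, every 10 minutes) comes to 30 minutes.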
Manual Control
# Check canary status
npm run canary-status
# Emergency rollback
npm run canary-rollback
# Set specific percentage
aws apigateway update-stage \
--rest-api-id YOUR-API-ID \
--stage-name prod \
--patch-operations op=replace,path=/canarySettings/percentTraffic,value=50
Monitoring
CloudWatch Metrics
API Gateway automatically emits separate metrics for base and canary:
# Canary metrics (tagged with /canary suffix)
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--dimensions Name=ApiName,Value=YOUR-API-NAME Name=Stage,Value=prod/canary \
--start-time 2026-01-25T00:00:00Z \
--end-time 2026-01-25T23:59:59Z \
--period 300 \
--statistics Sum
# Base metrics (standard); the ApiName dimension takes the API's name, not its ID
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--dimensions Name=ApiName,Value=YOUR-API-NAME Name=Stage,Value=prod \
--start-time 2026-01-25T00:00:00Z \
--end-time 2026-01-25T23:59:59Z \
--period 300 \
--statistics Sum
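Each query returns one `Sum` datapoint per period, so an error rate over a longer window means adding the datapoints up before dividing. The sketch below shows that arithmetic; `errorRateFromDatapoints` is a hypothetical helper, and the `Datapoint` interface models only the `Sum` field of the CloudWatch response. The sample numbers are made up for illustration.

```typescript
// Only the field used here; real CloudWatch datapoints carry more properties.
interface Datapoint {
  Sum?: number;
}

function sumDatapoints(points: Datapoint[]): number {
  return points.reduce((total, point) => total + (point.Sum ?? 0), 0);
}

// Returns the 5XX error rate as a percentage of all requests in the window.
// A window without traffic yields 0, since there is no evidence of failure.
function errorRateFromDatapoints(errorPoints: Datapoint[], countPoints: Datapoint[]): number {
  const errors = sumDatapoints(errorPoints);
  const requests = sumDatapoints(countPoints);
  return requests === 0 ? 0 : (errors * 100) / requests;
}

// Illustrative values: three 5-minute periods of 5XXError and Count sums.
const sampleErrors: Datapoint[] = [{ Sum: 1 }, { Sum: 0 }, { Sum: 2 }];
const sampleCounts: Datapoint[] = [{ Sum: 40 }, { Sum: 10 }, { Sum: 10 }];
console.log(`${errorRateFromDatapoints(sampleErrors, sampleCounts)}%`);
```

Running the same calculation for the `prod` and `prod/canary` dimensions gives two rates that can be compared directly.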
Canary Monitor Logs
# Watch canary monitor logs
aws logs tail /aws/lambda/CanaryMonitorFunction --follow
# Expected output during a healthy rollout (10% increments):
# Current canary traffic: 10%
# Canary error rate: 1.23%
# 📈 Increasing canary traffic: 10% → 20%
# ✅ Updated canary traffic to 20%
CloudWatch Dashboard
Create a dashboard to monitor canary vs base:
import { Dashboard, GraphWidget, Metric } from 'aws-cdk-lib/aws-cloudwatch';
const dashboard = new Dashboard(this, 'CanaryDashboard', {
dashboardName: 'ReleaseCanaryMonitoring',
});
dashboard.addWidgets(
new GraphWidget({
title: 'Error Rate: Base vs Canary',
left: [
new Metric({
namespace: 'AWS/ApiGateway',
metricName: '5XXError',
dimensionsMap: {
ApiName: apiGatewayId,
Stage: 'prod',
},
statistic: 'Sum',
label: 'Base Errors',
}),
new Metric({
namespace: 'AWS/ApiGateway',
metricName: '5XXError',
dimensionsMap: {
ApiName: apiGatewayId,
Stage: 'prod/canary',
},
statistic: 'Sum',
label: 'Canary Errors',
}),
],
})
);
Comparison: Instant vs Canary
| Aspect | Instant Switch | Canary Deployment |
|---|---|---|
| Risk | ⚠️ High - All users affected | ✅ Low - Gradual exposure |
| Rollback Impact | ⚠️ All users affected | ✅ Only canary % affected |
| Detection Time | ❌ Limited (lock-in period) | ✅ Extended (multiple stages) |
| Rollout Speed | ✅ <1 second | ⚠️ 1-2 hours |
| User Experience | ⚠️ All-or-nothing | ✅ Smooth transition |
| Cost | ✅ $0 additional | ✅ $0 additional |
| Complexity | ✅ Simple | ⚠️ Moderate |
| Best For | Low-risk releases | High-risk releases |
Recommended Rollout Strategy
For production systems with 20+ Lambda functions:
Phase 1: Initial Canary (0-30 minutes)
- Deploy to blue environment
- Set canary to 10%
- Monitor for 15 minutes
- Metrics: Error rate, latency, throughput
Phase 2: Progressive Rollout (30-75 minutes)
- Increment to 25% (if healthy)
- Monitor 15 minutes
- Increment to 50% (if healthy)
- Monitor 15 minutes
- Increment to 75% (if healthy)
- Monitor 15 minutes
Phase 3: Full Rollout (75-90 minutes)
- Increment to 100%
- Monitor 15 minutes
- Promote canary to production
- Lock in new release
Rollback Triggers
- Automatic: Error rate > 5% in canary
- Manual: CLI command or AWS Console
- Result: Canary → 0%, all traffic to stable base
Benefits
Safety
- ✅ Issues only affect small percentage of users
- ✅ Multiple checkpoints to detect problems
- ✅ Gradual validation before full rollout
Observability
- ✅ Side-by-side comparison of base vs canary metrics
- ✅ Extended monitoring period (1-2 hours vs instant)
- ✅ Real-world traffic validation
Control
- ✅ Manual override at any percentage
- ✅ Pause rollout for investigation
- ✅ Automated or manual rollback
Cost
- ✅ No additional AWS costs
- ✅ Uses built-in API Gateway features
- ✅ CloudWatch metrics included
Migration Path
Step 1: Test in Non-Production
# Enable canary in staging
CANARY_ENABLED=true RELEASE_VERSION=v2.0.0 npm run deploy -- --profile staging
Step 2: Conservative Production Rollout
# Small canary, long intervals
npm run gradual-rollout 5 25 5 30
Step 3: Confidence Building
# Standard rollout after confidence
npm run gradual-rollout 10 100 10 15
Step 4: Automated Rollouts
Enable automatic canary increments via EventBridge scheduled rules
Troubleshooting
Issue: Canary not receiving traffic
# Check canary settings
aws apigateway get-stage \
--rest-api-id YOUR-API-ID \
--stage-name prod \
--query 'canarySettings'
# Verify percentTraffic > 0
Issue: Cannot update canary percentage
# Check IAM permissions
# CanaryMonitor Lambda needs (API Gateway IAM actions are HTTP verbs; UpdateStage is a PATCH):
# - apigateway:GET
# - apigateway:PATCH
Issue: No canary metrics in CloudWatch
# Canary metrics appear with /canary suffix
aws cloudwatch list-metrics \
--namespace AWS/ApiGateway \
--dimensions Name=Stage,Value=prod/canary
Issue: Rollback not working
# Check Canary Monitor logs
aws logs tail /aws/lambda/CanaryMonitorFunction --follow
# Verify error threshold configuration
aws lambda get-function-configuration \
--function-name CanaryMonitorFunction \
--query 'Environment.Variables'
Conclusion
Adding canary deployment capabilities provides:
- Safer releases: Gradual validation with minimal user impact
- Better observability: Extended monitoring and comparison
- Flexible control: Manual and automated rollout options
- Zero additional cost: Uses native API Gateway features
Recommendation: Implement canary deployments for production systems with critical user bases or complex releases involving 10+ Lambda functions.
Next Steps
- Review proposed architecture
- Decide on rollout strategy (automatic vs manual)
- Configure monitoring thresholds
- Test in staging environment
- Deploy to production with conservative settings
- Iterate based on operational experience