
Canary Deployment Enhancement for Release-Level Blue-Green

This document proposes adding gradual traffic shifting capabilities to the existing release-level blue-green deployment system using AWS API Gateway Canary Deployments.

Problem Statement

The current implementation switches 100% of traffic from blue → green instantly. While fast (<1 second), this approach:

  • ❌ Exposes all users to potential issues immediately
  • ❌ Provides limited time to detect problems
  • ❌ Requires aggressive monitoring within lock-in period
  • ❌ All-or-nothing rollback affects entire user base

Proposed Solution: Canary Deployments

Add percentage-based traffic routing at API Gateway stage level, enabling gradual rollout of new releases.
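
As a rough mental model of percentage-based routing, each request can be thought of as landing on the canary with probability percentTraffic / 100. The sketch below illustrates only that idea. The names `routeRequest` and `simulate` are hypothetical, and how API Gateway actually selects requests is internal to the service.

```typescript
type Target = 'base' | 'canary';

// Hypothetical helper: choose a target for a single request.
// `random` is injectable so the behaviour can be checked deterministically.
function routeRequest(percentTraffic: number, random: () => number = Math.random): Target {
  if (percentTraffic < 0 || percentTraffic > 100) {
    throw new RangeError(`percentTraffic must be between 0 and 100, got ${percentTraffic}`);
  }
  // random() is in [0, 1), so 0% never selects the canary and 100% always does.
  return random() * 100 < percentTraffic ? 'canary' : 'base';
}

// Count how many of `requests` simulated requests land on the canary.
function simulate(percentTraffic: number, requests: number, random: () => number = Math.random): number {
  let canaryHits = 0;
  for (let i = 0; i < requests; i++) {
    if (routeRequest(percentTraffic, random) === 'canary') canaryHits++;
  }
  return canaryHits;
}
```

With a 10% canary, `simulate(10, 1000)` should return a count somewhere near 100, varying from run to run.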

Example Workflow

Release v2.0 deployment:
00:00 - Deploy v2.0 to blue environment
00:15 - 10% of traffic → v2.0 (canary)
00:30 - 25% of traffic → v2.0
00:45 - 50% of traffic → v2.0
01:00 - 75% of traffic → v2.0
01:15 - 100% of traffic → v2.0 (promotion)

If errors are detected at any stage:

  • Set canary to 0% immediately
  • Only users in the canary percentage are affected
  • The majority of users are unaffected
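
The timeline in the example workflow can be derived from a list of stages and a fixed interval. The following is a minimal sketch under that assumption; `buildSchedule` and `formatElapsed` are illustrative names, not part of the proposed implementation.

```typescript
interface ScheduleEntry {
  time: string;    // elapsed time since deployment, as HH:MM
  percent: number; // canary traffic percentage from that point on
}

// Format elapsed minutes as HH:MM.
function formatElapsed(minutes: number): string {
  const hh = String(Math.floor(minutes / 60)).padStart(2, '0');
  const mm = String(minutes % 60).padStart(2, '0');
  return `${hh}:${mm}`;
}

// Deployment happens at 00:00 with 0% canary traffic; each stage follows one interval later.
function buildSchedule(stages: number[], intervalMinutes: number): ScheduleEntry[] {
  const entries: ScheduleEntry[] = [{ time: formatElapsed(0), percent: 0 }];
  stages.forEach((percent, i) => {
    entries.push({ time: formatElapsed((i + 1) * intervalMinutes), percent });
  });
  return entries;
}

// Reproduces the workflow above: 00:15 → 10%, 00:30 → 25%, ... 01:15 → 100%.
const schedule = buildSchedule([10, 25, 50, 75, 100], 15);
```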

Architecture


Implementation

1. Update ReleaseDeploymentStack

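// lib/ReleaseDeploymentStack.ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import {
  RestApi,
  Deployment,
  Stage,
  CanarySettings,
  MethodLoggingLevel,
} from 'aws-cdk-lib/aws-apigateway';
import { StringParameter } from 'aws-cdk-lib/aws-ssm';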

export interface ReleaseDeploymentStackProps extends StackProps {
readonly releaseVersion: string;
readonly lambdaConfigs: LambdaConfig[];
readonly canaryEnabled?: boolean;
readonly initialCanaryPercent?: number;
}

export class ReleaseDeploymentStack extends Stack {
public readonly api: RestApi;
public readonly productionStage: Stage;
public readonly baseDeployment: Deployment;
public readonly canaryDeployment: Deployment;

constructor(scope: Construct, id: string, props: ReleaseDeploymentStackProps) {
super(scope, id, props);

const {
releaseVersion,
lambdaConfigs,
canaryEnabled = false,
initialCanaryPercent = 0,
} = props;

// ... existing Lambda deployment code ...

// Create base deployment (current stable)
this.baseDeployment = new Deployment(this, 'BaseDeployment', {
api: this.api,
description: `Base deployment for release ${releaseVersion}`,
});

// Create canary deployment (new release)
this.canaryDeployment = new Deployment(this, 'CanaryDeployment', {
api: this.api,
description: `Canary deployment for release ${releaseVersion}`,
});

// Canary settings (if enabled)
const canarySettings: CanarySettings | undefined = canaryEnabled
? {
percentTraffic: initialCanaryPercent,
useStageCache: false,
stageVariableOverrides: {
activeEnvironment: 'blue', // Canary uses blue (new release)
},
}
: undefined;

// Create production stage with canary support
this.productionStage = new Stage(this, 'ProductionStage', {
deployment: this.baseDeployment,
stageName: 'prod',
variables: {
activeEnvironment: 'green', // Base uses green (stable)
releaseVersion: releaseVersion,
},
canarySettings,
loggingLevel: MethodLoggingLevel.INFO,
dataTraceEnabled: true,
metricsEnabled: true,
});

// Store canary configuration in SSM
new StringParameter(this, 'CanaryEnabledParameter', {
parameterName: '/release/canary-enabled',
stringValue: canaryEnabled.toString(),
description: 'Whether canary deployments are enabled',
});

new StringParameter(this, 'CanaryPercentParameter', {
parameterName: '/release/canary-percent',
stringValue: initialCanaryPercent.toString(),
description: 'Current percentage of traffic routed to canary',
});

new StringParameter(this, 'CanaryDeploymentTimestampParameter', {
parameterName: '/release/canary-deployment-timestamp',
stringValue: Date.now().toString(),
description: 'Timestamp when canary deployment started',
});
}
}

2. Create Canary Monitor Lambda

// src/lambda/canaryMonitor/src/canaryMonitor.ts
import { Context } from 'aws-lambda';
import { APIGatewayClient, GetStageCommand, UpdateStageCommand } from '@aws-sdk/client-api-gateway';
import { SSMClient, GetParameterCommand, PutParameterCommand } from '@aws-sdk/client-ssm';
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';

const apigateway = new APIGatewayClient({});
const ssm = new SSMClient({});
const cloudwatch = new CloudWatchClient({});

const API_GATEWAY_ID = process.env.API_GATEWAY_ID!;
const STAGE_NAME = process.env.STAGE_NAME || 'prod';
const TARGET_PERCENT = parseInt(process.env.TARGET_PERCENT || '100', 10);
const INCREMENT_PERCENT = parseInt(process.env.INCREMENT_PERCENT || '10', 10);
const ERROR_THRESHOLD = parseFloat(process.env.ERROR_THRESHOLD || '5.0');
const MONITORING_WINDOW_MINUTES = parseInt(process.env.MONITORING_WINDOW_MINUTES || '15', 10);

export interface CanaryMonitorEvent {
action?: 'increment' | 'rollback' | 'promote' | 'status';
}

/**
* Canary Monitor Lambda
*
* Monitors canary deployment health and gradually shifts traffic
* Automatically rolls back if error rate exceeds threshold
*/
export async function handler(
event: CanaryMonitorEvent,
context: Context
): Promise<any> {
console.log('🔍 Canary Monitor triggered');
console.log('Event:', JSON.stringify(event, null, 2));

const action = event.action || 'increment';

try {
switch (action) {
case 'status':
return await getCanaryStatus();
case 'increment':
return await incrementCanary();
case 'rollback':
return await rollbackCanary();
case 'promote':
return await promoteCanary();
default:
throw new Error(`Unknown action: ${action}`);
}
} catch (error) {
console.error('❌ Canary Monitor failed:', error);
throw error;
}
}

async function getCanaryStatus(): Promise<any> {
const stage = await apigateway.send(new GetStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
}));

const currentPercent = stage.canarySettings?.percentTraffic || 0;
const canaryParam = await ssm.send(new GetParameterCommand({
Name: '/release/canary-percent',
}));

return {
currentPercent,
targetPercent: TARGET_PERCENT,
ssmPercent: canaryParam.Parameter?.Value,
canarySettings: stage.canarySettings,
};
}

async function incrementCanary(): Promise<any> {
// Get current canary percentage
const currentPercent = await getCurrentCanaryPercent();
console.log(`Current canary traffic: ${currentPercent}%`);

if (currentPercent >= TARGET_PERCENT) {
console.log(`🎉 Target ${TARGET_PERCENT}% reached. Promoting canary to production.`);
return await promoteCanary();
}

// Check canary health
const canaryErrorRate = await getCanaryErrorRate();
console.log(`Canary error rate: ${canaryErrorRate.toFixed(2)}%`);

if (canaryErrorRate > ERROR_THRESHOLD) {
console.error(`⚠️ Canary error rate ${canaryErrorRate.toFixed(2)}% exceeds threshold ${ERROR_THRESHOLD}%`);
console.log('🔄 Rolling back canary deployment');
return await rollbackCanary();
}

// Increment canary percentage
const newPercent = Math.min(currentPercent + INCREMENT_PERCENT, TARGET_PERCENT);
console.log(`📈 Increasing canary traffic: ${currentPercent}% → ${newPercent}%`);

await updateCanaryPercent(newPercent);

return {
action: 'increment',
previousPercent: currentPercent,
newPercent,
targetPercent: TARGET_PERCENT,
canaryHealthy: true,
};
}

async function getCurrentCanaryPercent(): Promise<number> {
const stage = await apigateway.send(new GetStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
}));

return stage.canarySettings?.percentTraffic || 0;
}

async function updateCanaryPercent(percent: number): Promise<void> {
await apigateway.send(new UpdateStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
patchOperations: [
{
op: 'replace',
path: '/canarySettings/percentTraffic',
value: percent.toString(),
},
],
}));

await ssm.send(new PutParameterCommand({
Name: '/release/canary-percent',
Value: percent.toString(),
Overwrite: true,
}));

console.log(`✅ Updated canary traffic to ${percent}%`);
}

async function getCanaryErrorRate(): Promise<number> {
  const endTime = new Date();
  const startTime = new Date(endTime.getTime() - MONITORING_WINDOW_MINUTES * 60 * 1000);

  // The ApiName dimension expects the REST API's name, not its ID.
  // API_NAME is an assumed extra environment variable; when it is not set,
  // this falls back to the ID as the original code did.
  const dimensions = [
    { Name: 'ApiName', Value: process.env.API_NAME ?? API_GATEWAY_ID },
    { Name: 'Stage', Value: `${STAGE_NAME}/canary` },
  ];

  // Same query for both metrics; only the metric name differs.
  const query = (metricName: string) =>
    cloudwatch.send(new GetMetricStatisticsCommand({
      Namespace: 'AWS/ApiGateway',
      MetricName: metricName,
      Dimensions: dimensions,
      StartTime: startTime,
      EndTime: endTime,
      Period: MONITORING_WINDOW_MINUTES * 60,
      Statistics: ['Sum'],
    }));

  const [errorsResult, invocationsResult] = await Promise.all([
    query('5XXError'),
    query('Count'),
  ]);

  // Sum every returned datapoint rather than reading only the first,
  // in case the window is split across more than one.
  const total = (points?: { Sum?: number }[]) =>
    (points ?? []).reduce((acc, p) => acc + (p.Sum ?? 0), 0);

  const errors = total(errorsResult.Datapoints);
  const invocations = total(invocationsResult.Datapoints);

  if (invocations === 0) {
    // No traffic is treated as healthy (0% errors), so the rollout continues.
    console.warn('⚠️ No canary invocations found in monitoring window');
    return 0;
  }

  return (errors / invocations) * 100;
}

async function rollbackCanary(): Promise<any> {
console.log('🔄 Rolling back canary deployment');

const currentPercent = await getCurrentCanaryPercent();
await updateCanaryPercent(0);

console.log('✅ Canary rolled back to 0%');

return {
action: 'rollback',
previousPercent: currentPercent,
newPercent: 0,
reason: 'High error rate detected',
};
}

async function promoteCanary(): Promise<any> {
console.log('🎉 Promoting canary to production');

// Promotion steps:
// 1. Update stage variables to make canary the new base
// 2. Set canary percentage to 0%
// 3. Update SSM parameters

await apigateway.send(new UpdateStageCommand({
restApiId: API_GATEWAY_ID,
stageName: STAGE_NAME,
patchOperations: [
{
op: 'replace',
path: '/variables/activeEnvironment',
value: 'blue', // Make blue the new active
},
{
op: 'replace',
path: '/canarySettings/percentTraffic',
value: '0',
},
],
}));

await ssm.send(new PutParameterCommand({
Name: '/release/canary-percent',
Value: '0',
Overwrite: true,
}));

await ssm.send(new PutParameterCommand({
Name: '/release/active-environment',
Value: 'blue',
Overwrite: true,
}));

console.log('✅ Canary promoted to production');

return {
action: 'promote',
newActiveEnvironment: 'blue',
canaryPercent: 0,
};
}
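
The decision made on each `increment` invocation can be isolated as a pure function, which makes it easy to unit test without AWS calls. The sketch below mirrors the order of checks in `incrementCanary` above (the target check comes first, so a canary already at the target is promoted without a further health check). `decideNextAction` and `CanaryPolicy` are hypothetical names introduced for illustration.

```typescript
type CanaryAction =
  | { action: 'promote' }
  | { action: 'rollback' }
  | { action: 'increment'; newPercent: number };

interface CanaryPolicy {
  targetPercent: number;
  incrementPercent: number;
  errorThreshold: number; // error rate in percent
}

// Pure decision step: no I/O, only the current state and the policy.
function decideNextAction(
  currentPercent: number,
  errorRatePercent: number,
  policy: CanaryPolicy
): CanaryAction {
  if (currentPercent >= policy.targetPercent) {
    return { action: 'promote' };
  }
  if (errorRatePercent > policy.errorThreshold) {
    return { action: 'rollback' };
  }
  // Never overshoot the target.
  const newPercent = Math.min(currentPercent + policy.incrementPercent, policy.targetPercent);
  return { action: 'increment', newPercent };
}

const policy: CanaryPolicy = { targetPercent: 100, incrementPercent: 10, errorThreshold: 5 };
const healthyStep = decideNextAction(10, 1.23, policy); // { action: 'increment', newPercent: 20 }
const failingStep = decideNextAction(30, 7.5, policy);  // { action: 'rollback' }
```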

3. Create Canary Monitor Construct

// lib/lambda/CanaryMonitorLambda.ts
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Runtime } from 'aws-cdk-lib/aws-lambda';
import { LogGroup, RetentionDays } from 'aws-cdk-lib/aws-logs';
import { Rule, Schedule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';

export interface CanaryMonitorLambdaProps {
readonly functionName: string;
readonly apiGatewayId: string;
readonly stageName: string;
readonly targetPercent: number;
readonly incrementPercent: number;
readonly errorThreshold: number;
readonly monitoringWindowMinutes: number;
readonly incrementIntervalMinutes: number;
}

export class CanaryMonitorLambda extends Construct {
public readonly lambda: NodejsFunction;

constructor(scope: Construct, id: string, props: CanaryMonitorLambdaProps) {
super(scope, id);

const {
functionName,
apiGatewayId,
stageName,
targetPercent,
incrementPercent,
errorThreshold,
monitoringWindowMinutes,
incrementIntervalMinutes,
} = props;

const logGroup = new LogGroup(this, 'LogGroup', {
logGroupName: `/aws/lambda/${functionName}`,
retention: RetentionDays.ONE_WEEK,
});

this.lambda = new NodejsFunction(this, 'Function', {
functionName,
runtime: Runtime.NODEJS_22_X,
entry: './src/lambda/canaryMonitor/src/canaryMonitor.ts',
handler: 'handler',
timeout: Duration.seconds(60),
memorySize: 512,
logGroup,
environment: {
API_GATEWAY_ID: apiGatewayId,
STAGE_NAME: stageName,
TARGET_PERCENT: targetPercent.toString(),
INCREMENT_PERCENT: incrementPercent.toString(),
ERROR_THRESHOLD: errorThreshold.toString(),
MONITORING_WINDOW_MINUTES: monitoringWindowMinutes.toString(),
},
bundling: {
minify: true,
sourceMap: true,
},
});

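// Grant permissions.
// API Gateway management permissions are expressed as HTTP verbs
// (GET, PATCH, ...), so UpdateStage is covered by apigateway:PATCH.
this.lambda.addToRolePolicy(
new PolicyStatement({
actions: [
'apigateway:GET',
'apigateway:PATCH',
],
resources: [
// Region comes from the enclosing stack; the API ID does not contain it.
`arn:aws:apigateway:${this.lambda.stack.region}::/restapis/${apiGatewayId}/stages/${stageName}`,
],
})
);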

this.lambda.addToRolePolicy(
new PolicyStatement({
actions: [
'ssm:GetParameter',
'ssm:PutParameter',
],
resources: [
`arn:aws:ssm:*:*:parameter/release/*`,
],
})
);

this.lambda.addToRolePolicy(
new PolicyStatement({
actions: [
'cloudwatch:GetMetricStatistics',
],
resources: ['*'],
})
);

// Create EventBridge rule for automatic increments
const incrementRule = new Rule(this, 'IncrementRule', {
schedule: Schedule.rate(Duration.minutes(incrementIntervalMinutes)),
description: 'Automatically increment canary percentage',
});

incrementRule.addTarget(
new LambdaFunction(this.lambda, {
event: { action: 'increment' },
})
);
}
}
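
The API Gateway policy statement above is scoped to a single stage. As a small sketch of how that resource ARN is composed, the helper below builds it from its parts. `stageArn` is a hypothetical function, and the default `aws` partition is an assumption that would not hold in other partitions.

```typescript
// Hypothetical helper: build the ARN of one stage of a REST API.
function stageArn(region: string, restApiId: string, stageName: string, partition = 'aws'): string {
  const parts = { region, restApiId, stageName };
  for (const [label, value] of Object.entries(parts)) {
    // Reject empty values and path separators that would change the ARN's meaning.
    if (!value || value.includes('/')) {
      throw new Error(`invalid ${label}: "${value}"`);
    }
  }
  // The account field is empty for API Gateway management ARNs.
  return `arn:${partition}:apigateway:${region}::/restapis/${restApiId}/stages/${stageName}`;
}

const prodStageArn = stageArn('eu-west-1', 'abc123', 'prod');
// prodStageArn === 'arn:aws:apigateway:eu-west-1::/restapis/abc123/stages/prod'
```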

4. Create Gradual Rollout CLI Script

#!/usr/bin/env node
// scripts/gradual-rollout.ts
import { APIGatewayClient, UpdateStageCommand, GetStageCommand } from '@aws-sdk/client-api-gateway';
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';
import { CloudFormationClient, DescribeStacksCommand } from '@aws-sdk/client-cloudformation';

const apigateway = new APIGatewayClient({});
const ssm = new SSMClient({});
const cfn = new CloudFormationClient({});

interface RolloutConfig {
startPercent: number;
targetPercent: number;
incrementPercent: number;
intervalMinutes: number;
}

async function getApiGatewayId(): Promise<string> {
const result = await cfn.send(new DescribeStacksCommand({
StackName: 'ReleaseDeploymentStack',
}));

const outputs = result.Stacks?.[0]?.Outputs || [];
const apiIdOutput = outputs.find(o => o.OutputKey === 'ApiId');

if (!apiIdOutput?.OutputValue) {
throw new Error('API Gateway ID not found in stack outputs');
}

return apiIdOutput.OutputValue;
}

async function getCurrentCanaryPercent(apiGatewayId: string): Promise<number> {
const stage = await apigateway.send(new GetStageCommand({
restApiId: apiGatewayId,
stageName: 'prod',
}));

return stage.canarySettings?.percentTraffic || 0;
}

async function updateCanaryPercent(apiGatewayId: string, percent: number): Promise<void> {
await apigateway.send(new UpdateStageCommand({
restApiId: apiGatewayId,
stageName: 'prod',
patchOperations: [
{
op: 'replace',
path: '/canarySettings/percentTraffic',
value: percent.toString(),
},
],
}));

await ssm.send(new PutParameterCommand({
Name: '/release/canary-percent',
Value: percent.toString(),
Overwrite: true,
}));
}

async function gradualRollout(config: RolloutConfig): Promise<void> {
const { startPercent, targetPercent, incrementPercent, intervalMinutes } = config;

console.log('\n🚀 Starting gradual canary rollout...');
console.log(` Start: ${startPercent}%`);
console.log(` Target: ${targetPercent}%`);
console.log(` Increment: ${incrementPercent}%`);
console.log(` Interval: ${intervalMinutes} minutes\n`);

const apiGatewayId = await getApiGatewayId();
const currentPercent = await getCurrentCanaryPercent(apiGatewayId);

console.log(`📊 Current canary traffic: ${currentPercent}%\n`);

let nextPercent = Math.max(startPercent, currentPercent);

while (nextPercent < targetPercent) {
nextPercent = Math.min(nextPercent + incrementPercent, targetPercent);

console.log(`⏳ Updating canary traffic to ${nextPercent}%...`);
await updateCanaryPercent(apiGatewayId, nextPercent);
console.log(`✅ Canary traffic updated to ${nextPercent}%`);

if (nextPercent < targetPercent) {
console.log(`⏸️ Waiting ${intervalMinutes} minutes before next increment...\n`);
await new Promise(resolve => setTimeout(resolve, intervalMinutes * 60 * 1000));
}
}

console.log('\n🎉 Gradual rollout complete!');
console.log(` Final canary traffic: ${nextPercent}%\n`);
}

// Parse command-line arguments
const args = process.argv.slice(2);

if (args.length === 0 || args[0] === '--help') {
console.log(`
Usage: npm run gradual-rollout <start%> <target%> <increment%> <interval-minutes>

Examples:
npm run gradual-rollout 0 100 10 15
→ Rollout from 0% to 100% in 10% increments every 15 minutes

npm run gradual-rollout 5 50 5 30
→ Conservative rollout: 5% to 50% in 5% increments every 30 minutes

npm run gradual-rollout 10 100 25 5
→ Fast rollout: 10% to 100% in 25% increments every 5 minutes
`);
process.exit(0);
}

// Missing or non-numeric arguments fall back to the defaults.
const config: RolloutConfig = {
startPercent: parseInt(args[0], 10) || 0,
targetPercent: parseInt(args[1], 10) || 100,
incrementPercent: parseInt(args[2], 10) || 10,
intervalMinutes: parseInt(args[3], 10) || 15,
};

gradualRollout(config).catch(error => {
console.error('❌ Gradual rollout failed:', error);
process.exit(1);
});
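
The argument handling in the script silently replaces invalid input with defaults. A stricter variant could reject bad values before any traffic is shifted. The following is a sketch of that alternative, not the script's current behaviour; `parseRolloutArgs` is a hypothetical name and the defaults are copied from the script above.

```typescript
interface RolloutConfig {
  startPercent: number;
  targetPercent: number;
  incrementPercent: number;
  intervalMinutes: number;
}

// Parse positional arguments, falling back to defaults only when an argument is absent.
function parseRolloutArgs(args: string[]): RolloutConfig {
  const num = (index: number, fallback: number): number => {
    const raw = args[index];
    if (raw === undefined) return fallback;
    const value = Number(raw);
    if (raw.trim() === '' || !Number.isFinite(value)) {
      throw new Error(`Argument ${index + 1} is not a number: "${raw}"`);
    }
    return value;
  };

  const config: RolloutConfig = {
    startPercent: num(0, 0),
    targetPercent: num(1, 100),
    incrementPercent: num(2, 10),
    intervalMinutes: num(3, 15),
  };

  if (config.startPercent < 0 || config.targetPercent > 100 || config.startPercent > config.targetPercent) {
    throw new Error('Percentages must satisfy 0 <= start <= target <= 100');
  }
  if (config.incrementPercent <= 0) {
    throw new Error('Increment must be greater than 0');
  }
  if (config.intervalMinutes <= 0) {
    throw new Error('Interval must be greater than 0');
  }
  return config;
}
```

In the script this would be called as `parseRolloutArgs(process.argv.slice(2))`, with the thrown error reported by the existing `catch` handler or a surrounding `try` block.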

5. Update bin/app.ts

// bin/app.ts
const releaseStack = new ReleaseDeploymentStack(app, 'ReleaseDeploymentStack', {
description: `Release ${releaseVersion} with canary deployment`,
releaseVersion,
lambdaConfigs,
canaryEnabled: true, // Enable canary deployments
initialCanaryPercent: 10, // Start with 10% canary traffic
env: {
account: process.env.CDK_DEFAULT_ACCOUNT,
region: process.env.CDK_DEFAULT_REGION,
},
});

// Deploy Canary Monitor for gradual rollout.
// Scoped to releaseStack because a construct that creates resources must live inside a Stack.
const canaryMonitor = new CanaryMonitorLambda(releaseStack, 'CanaryMonitor', {
functionName: 'CanaryMonitorFunction',
apiGatewayId: releaseStack.api.restApiId,
stageName: 'prod',
targetPercent: 100, // Target: 100% traffic to canary
incrementPercent: 10, // Increment by 10% each time
errorThreshold: 5, // Rollback if error rate > 5%
monitoringWindowMinutes: 15, // Monitor over 15-minute windows
incrementIntervalMinutes: 15, // Increment every 15 minutes
});
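
Because these settings drive automatic traffic shifts, it may be worth checking them before deployment. The sketch below is one possible pre-deploy check; `validateCanarySettings` is a hypothetical helper, and the rule about the monitoring window is a suggested heuristic rather than a requirement of API Gateway.

```typescript
interface CanaryMonitorSettings {
  targetPercent: number;
  incrementPercent: number;
  errorThreshold: number;
  monitoringWindowMinutes: number;
  incrementIntervalMinutes: number;
}

// Returns a list of problems; an empty list means the settings look sane.
function validateCanarySettings(s: CanaryMonitorSettings): string[] {
  const problems: string[] = [];
  const inRange = (n: number, min: number, max: number) =>
    Number.isFinite(n) && n >= min && n <= max;

  if (!inRange(s.targetPercent, 0, 100)) {
    problems.push('targetPercent must be between 0 and 100');
  }
  if (!inRange(s.incrementPercent, 0, 100) || s.incrementPercent === 0) {
    problems.push('incrementPercent must be greater than 0 and at most 100');
  }
  if (!inRange(s.errorThreshold, 0, 100)) {
    problems.push('errorThreshold must be between 0 and 100');
  }
  if (!Number.isInteger(s.monitoringWindowMinutes) || s.monitoringWindowMinutes < 1) {
    problems.push('monitoringWindowMinutes must be a positive whole number');
  }
  if (!Number.isInteger(s.incrementIntervalMinutes) || s.incrementIntervalMinutes < 1) {
    problems.push('incrementIntervalMinutes must be a positive whole number');
  }
  // Heuristic: a window longer than the interval also counts traffic from the previous step.
  if (s.monitoringWindowMinutes > s.incrementIntervalMinutes) {
    problems.push('monitoringWindowMinutes is longer than incrementIntervalMinutes');
  }
  return problems;
}

// The values used in bin/app.ts above pass every check.
const appSettings: CanaryMonitorSettings = {
  targetPercent: 100,
  incrementPercent: 10,
  errorThreshold: 5,
  monitoringWindowMinutes: 15,
  incrementIntervalMinutes: 15,
};
const appProblems = validateCanarySettings(appSettings); // []
```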

6. Update package.json

{
"scripts": {
"gradual-rollout": "tsx ./scripts/gradual-rollout.ts",
"canary-status": "tsx ./scripts/canary-status.ts",
"canary-rollback": "tsx ./scripts/canary-rollback.ts"
},
"dependencies": {
"@aws-sdk/client-api-gateway": "^3.713.0",
"@aws-sdk/client-cloudformation": "^3.713.0",
"@aws-sdk/client-cloudwatch": "^3.713.0",
"@aws-sdk/client-ssm": "^3.713.0"
}
}

Usage

Automatic Gradual Rollout

When you deploy a new release with canary enabled:

RELEASE_VERSION=v2.0.0 npm run deploy

The system automatically:

  1. Deploys v2.0.0 to blue environment
  2. Sets canary to 10% traffic
  3. Monitors canary health every 15 minutes
  4. Increments by 10% if healthy (10% → 20% → ... → 100%)
  5. Promotes to production at 100%
  6. Rolls back to 0% if error rate > 5%

Manual Gradual Rollout

# Conservative rollout: 0% → 100% in 10% increments every 15 min (10 updates, 9 waits: about 2 hours 15 minutes)
npm run gradual-rollout 0 100 10 15

# Fast rollout: 10% → 100% in 25% increments every 5 min (35% → 60% → 85% → 100%, 3 waits: about 15 minutes)
npm run gradual-rollout 10 100 25 5

# Test rollout: 5% → 20% in 5% increments every 10 min
npm run gradual-rollout 5 20 5 10
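
To estimate how long a manual rollout takes, the loop in `gradualRollout` can be replayed without touching AWS. The sketch below assumes the current canary percentage is not above the start value, and counts only the waits between updates (the script does not wait after the final one). Note that, as written, the script never applies the start value itself; the first update is start plus one increment. `rolloutSteps` and `totalWaitMinutes` are hypothetical names.

```typescript
// Percentages the script would apply, in order.
function rolloutSteps(start: number, target: number, increment: number): number[] {
  if (increment <= 0) throw new RangeError('increment must be positive');
  const steps: number[] = [];
  let next = start;
  while (next < target) {
    next = Math.min(next + increment, target);
    steps.push(next);
  }
  return steps;
}

// One wait between each pair of consecutive updates.
function totalWaitMinutes(steps: number[], intervalMinutes: number): number {
  return Math.max(steps.length - 1, 0) * intervalMinutes;
}

const conservative = rolloutSteps(0, 100, 10);  // [10, 20, ..., 100], 10 updates
const fast = rolloutSteps(10, 100, 25);         // [35, 60, 85, 100]
const testRun = rolloutSteps(5, 20, 5);         // [10, 15, 20]

const conservativeMinutes = totalWaitMinutes(conservative, 15); // 135
const fastMinutes = totalWaitMinutes(fast, 5);                  // 15
const testRunMinutes = totalWaitMinutes(testRun, 10);           // 20
```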

Manual Control

# Check canary status
npm run canary-status

# Emergency rollback
npm run canary-rollback

# Set specific percentage
aws apigateway update-stage \
--rest-api-id YOUR-API-ID \
--stage-name prod \
--patch-operations op=replace,path=/canarySettings/percentTraffic,value=50
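
The same patch can be built programmatically before being passed as `patchOperations` to `UpdateStageCommand`, as the Lambda and CLI script above do. This sketch only constructs and validates the payload; `canaryPercentPatch` is a hypothetical helper name.

```typescript
interface PatchOperation {
  op: 'replace';
  path: string;
  value: string;
}

// Build the patch that sets the canary traffic percentage on a stage.
function canaryPercentPatch(percent: number): PatchOperation[] {
  if (!Number.isFinite(percent) || percent < 0 || percent > 100) {
    throw new RangeError(`percent must be between 0 and 100, got ${percent}`);
  }
  // Patch values are sent as strings.
  return [{ op: 'replace', path: '/canarySettings/percentTraffic', value: percent.toString() }];
}

const halfTraffic = canaryPercentPatch(50);
// [{ op: 'replace', path: '/canarySettings/percentTraffic', value: '50' }]
```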

Monitoring

CloudWatch Metrics

API Gateway automatically emits separate metrics for base and canary:

# Canary metrics (tagged with /canary suffix)
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--dimensions Name=ApiName,Value=YOUR-API-ID Name=Stage,Value=prod/canary \
--start-time 2026-01-25T00:00:00Z \
--end-time 2026-01-25T23:59:59Z \
--period 300 \
--statistics Sum

# Base metrics (standard)
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--dimensions Name=ApiName,Value=YOUR-API-ID Name=Stage,Value=prod \
--start-time 2026-01-25T00:00:00Z \
--end-time 2026-01-25T23:59:59Z \
--period 300 \
--statistics Sum
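
Each of the commands above returns a list of datapoints with a `Sum` field. Turning the `5XXError` and `Count` results into an error rate is a matter of totalling both and dividing. The sketch below shows that arithmetic; `errorRatePercent` is a hypothetical helper, and returning `null` when there was no traffic is a design choice of this sketch (the Canary Monitor above returns 0 in that case).

```typescript
// Minimal shape of a CloudWatch datapoint as used here.
interface Datapoint {
  Sum?: number;
}

// Error rate in percent, or null when no requests were recorded.
function errorRatePercent(errors: Datapoint[], requests: Datapoint[]): number | null {
  const total = (points: Datapoint[]) => points.reduce((acc, p) => acc + (p.Sum ?? 0), 0);
  const requestCount = total(requests);
  if (requestCount === 0) return null;
  return (total(errors) * 100) / requestCount;
}

// 3 errors across 200 requests spread over two datapoints.
const rate = errorRatePercent([{ Sum: 3 }], [{ Sum: 120 }, { Sum: 80 }]); // 1.5
```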

Canary Monitor Logs

# Watch canary monitor logs
aws logs tail /aws/lambda/CanaryMonitorFunction --follow

# Expected output during healthy rollout (with a 10% increment):
# Current canary traffic: 10%
# Canary error rate: 1.23%
# 📈 Increasing canary traffic: 10% → 20%
# ✅ Updated canary traffic to 20%

CloudWatch Dashboard

Create a dashboard to monitor canary vs base:

import { Dashboard, GraphWidget, Metric } from 'aws-cdk-lib/aws-cloudwatch';

const dashboard = new Dashboard(this, 'CanaryDashboard', {
dashboardName: 'ReleaseCanaryMonitoring',
});

dashboard.addWidgets(
new GraphWidget({
title: 'Error Rate: Base vs Canary',
left: [
new Metric({
namespace: 'AWS/ApiGateway',
metricName: '5XXError',
dimensionsMap: {
ApiName: apiGatewayId,
Stage: 'prod',
},
statistic: 'Sum',
label: 'Base Errors',
}),
new Metric({
namespace: 'AWS/ApiGateway',
metricName: '5XXError',
dimensionsMap: {
ApiName: apiGatewayId,
Stage: 'prod/canary',
},
statistic: 'Sum',
label: 'Canary Errors',
}),
],
})
);

Comparison: Instant vs Canary

| Aspect | Instant Switch | Canary Deployment |
| --- | --- | --- |
| Risk | ⚠️ High - All users affected | ✅ Low - Gradual exposure |
| Rollback Impact | ⚠️ All users affected | ✅ Only canary % affected |
| Detection Time | ❌ Limited (lock-in period) | ✅ Extended (multiple stages) |
| Rollout Speed | ✅ <1 second | ⚠️ 1-2 hours |
| User Experience | ⚠️ All-or-nothing | ✅ Smooth transition |
| Cost | ✅ $0 additional | ✅ $0 additional |
| Complexity | ✅ Simple | ⚠️ Moderate |
| Best For | Low-risk releases | High-risk releases |

For production systems with 20+ Lambda functions, roll out in phases:

Phase 1: Initial Canary (0-30 minutes)

  • Deploy to blue environment
  • Set canary to 10%
  • Monitor for 15 minutes
  • Metrics: Error rate, latency, throughput

Phase 2: Progressive Rollout (30-90 minutes)

  • Increment to 25% (if healthy)
  • Monitor 15 minutes
  • Increment to 50% (if healthy)
  • Monitor 15 minutes
  • Increment to 75% (if healthy)
  • Monitor 15 minutes

Phase 3: Full Rollout (90-105 minutes)

  • Increment to 100%
  • Monitor 15 minutes
  • Promote canary to production
  • Lock in new release
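
The three phases can also be written down as data, which makes it simple to answer "which phase are we in?" from elapsed time. This is only an illustrative sketch; the `Phase` type and `phaseAt` function are hypothetical, and the minute boundaries are taken from the phase headings above.

```typescript
interface Phase {
  name: string;
  startMinute: number; // inclusive
  endMinute: number;   // exclusive
  percents: number[];  // canary percentages applied during the phase
}

const phases: Phase[] = [
  { name: 'Initial Canary', startMinute: 0, endMinute: 30, percents: [10] },
  { name: 'Progressive Rollout', startMinute: 30, endMinute: 90, percents: [25, 50, 75] },
  { name: 'Full Rollout', startMinute: 90, endMinute: 105, percents: [100] },
];

// Find the phase that covers a given number of minutes since deployment.
function phaseAt(elapsedMinutes: number): Phase | undefined {
  return phases.find(p => elapsedMinutes >= p.startMinute && elapsedMinutes < p.endMinute);
}

const midRollout = phaseAt(45); // the 'Progressive Rollout' phase
```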

Rollback Triggers

  • Automatic: Error rate > 5% in canary
  • Manual: CLI command or AWS Console
  • Result: Canary → 0%, all traffic to stable base

Benefits

Safety

  • ✅ Issues only affect small percentage of users
  • ✅ Multiple checkpoints to detect problems
  • ✅ Gradual validation before full rollout

Observability

  • ✅ Side-by-side comparison of base vs canary metrics
  • ✅ Extended monitoring period (1-2 hours vs instant)
  • ✅ Real-world traffic validation

Control

  • ✅ Manual override at any percentage
  • ✅ Pause rollout for investigation
  • ✅ Automated or manual rollback

Cost

  • ✅ No additional AWS costs
  • ✅ Uses built-in API Gateway features
  • ✅ CloudWatch metrics included

Migration Path

Step 1: Test in Non-Production

# Enable canary in staging
CANARY_ENABLED=true RELEASE_VERSION=v2.0.0 npm run deploy -- --profile staging

Step 2: Conservative Production Rollout

# Small canary, long intervals
npm run gradual-rollout 5 25 5 30

Step 3: Confidence Building

# Standard rollout after confidence
npm run gradual-rollout 10 100 10 15

Step 4: Automated Rollouts

Enable automatic canary increments via EventBridge scheduled rules.


Troubleshooting

Issue: Canary not receiving traffic

# Check canary settings
aws apigateway get-stage \
--rest-api-id YOUR-API-ID \
--stage-name prod \
--query 'canarySettings'

# Verify percentTraffic > 0

Issue: Cannot update canary percentage

# Check IAM permissions
# CanaryMonitor Lambda needs, on the stage resource:
# - apigateway:GET
# - apigateway:PATCH (covers UpdateStage calls)

Issue: No canary metrics in CloudWatch

# Canary metrics appear with /canary suffix
aws cloudwatch list-metrics \
--namespace AWS/ApiGateway \
--dimensions Name=Stage,Value=prod/canary

Issue: Rollback not working

# Check Canary Monitor logs
aws logs tail /aws/lambda/CanaryMonitorFunction --follow

# Verify error threshold configuration
aws lambda get-function-configuration \
--function-name CanaryMonitorFunction \
--query 'Environment.Variables'

Conclusion

Adding canary deployment capabilities provides:

  • Safer releases: Gradual validation with minimal user impact
  • Better observability: Extended monitoring and comparison
  • Flexible control: Manual and automated rollout options
  • Zero additional cost: Uses native API Gateway features

Recommendation: Implement canary deployments for production systems with critical user bases or complex releases involving 10+ Lambda functions.


Next Steps

  1. Review proposed architecture
  2. Decide on rollout strategy (automatic vs manual)
  3. Configure monitoring thresholds
  4. Test in staging environment
  5. Deploy to production with conservative settings
  6. Iterate based on operational experience
