Quick Definition
Flagger is an open-source progressive delivery tool that automates traffic shifting, analysis, and promotion or rollback of new versions in Kubernetes and similar environments.
Analogy: Flagger is like an automated traffic cop for feature rollouts that watches metrics and either waves a new release through or stops it at the curb.
Formal technical line: Flagger is a Kubernetes-native controller that performs automated canary and progressive delivery using metric-based analysis, integrations with service meshes or ingress providers, and orchestration of traffic routing changes.
The definition above refers to the most common meaning: the open-source progressive delivery controller. Other, less common meanings:
- A monitoring rule engine in custom platforms.
- A feature-flag coordination component in bespoke systems.
- A human role (flagger) in manual release procedures.
What is Flagger?
What it is / what it is NOT
- It is a controller that automates canary and progressive traffic shifts driven by metrics and policies.
- It is NOT a full continuous delivery platform; it focuses on traffic orchestration and analysis, not artifact builds or pipeline orchestration.
- It is NOT a generic feature flag management system, though it complements feature flags.
Key properties and constraints
- Kubernetes-native: operates via CRDs and watches Kubernetes resources.
- Pluggable: integrates with service meshes, ingress controllers, monitoring backends, and notification systems.
- Metric-driven: promotion decisions are based on telemetry and defined thresholds.
- Policy-bound: rollouts follow declarative Canary custom resources.
- Real-time control: can abort and rollback automatically based on analysis.
- Security: operates within cluster RBAC; some integrations require additional network/mesh configuration.
- Constraints: Requires supported traffic routing primitives (e.g., Istio, Linkerd, NGINX, Contour, ALB) and reliable telemetry.
Where it fits in modern cloud/SRE workflows
- Sits between CI and production traffic; receives a newly deployed revision and controls live traffic to it.
- Integrates with CI pipelines, which trigger canary analysis post-deployment by updating the workloads referenced in Canary CRDs.
- Works with observability stacks to evaluate health during rollouts; used by SREs to reduce blast radius.
- Automates what teams used to do manually: gradual traffic increases, metric checks, rollbacks.
Text-only “diagram description” readers can visualize
- A Kubernetes cluster with an ingress/service mesh controlling ingress and internal routing.
- Flagger controller watches Canary CRDs and a new deployment revision.
- Flagger configures routing rules to direct a small percentage of traffic to canary.
- Observability systems collect metrics; Flagger queries them via configured providers.
- If metrics are healthy, Flagger increases traffic increments until full promotion; otherwise it aborts and rolls back.
Flagger in one sentence
Flagger is an automated Kubernetes controller for progressive delivery that shifts traffic, evaluates metrics, and promotes or rolls back releases without manual intervention.
Flagger vs related terms
| ID | Term | How it differs from Flagger | Common confusion |
|---|---|---|---|
| T1 | Feature flag | Runtime toggling of code paths | People think it controls traffic levels |
| T2 | Argo Rollouts | Comprehensive rollout CRDs and UI | People use interchangeably |
| T3 | Spinnaker | Full CD system including pipelines | Flagger focuses on traffic, not pipelines |
| T4 | Service mesh | Network plane for traffic control | Flagger uses mesh APIs for routing |
| T5 | CI system | Builds and tests artifacts | Flagger acts after CI deploys |
| T6 | Load balancer | Network traffic distribution device | Flagger manipulates routing rules |
| T7 | Observability backend | Metrics, traces, logs stores | Flagger queries backends for signals |
| T8 | Chaos engineering tool | Introduces failures to test resilience | Flagger analyzes health rather than injecting faults |
Row Details
- T2: Argo Rollouts provides similar progressive delivery CRDs and has a richer UI and deeper ecosystem integrations in some environments.
- T7: Observability backends include Prometheus, Datadog, and Google Cloud Monitoring (formerly Stackdriver); Flagger requires per-provider configuration.
Why does Flagger matter?
Business impact
- Reduces release risk and outage blast radius, which often preserves revenue and customer trust.
- Enables predictable gradual exposure of features to users, protecting brand reputation.
- Helps teams iterate features faster while keeping risk bounded.
Engineering impact
- Typically reduces incidents during releases by catching regressions early in a partial-traffic window.
- Improves deployment velocity by automating promotion and rollback decisions.
- Lowers manual toil for runbooks that previously required human traffic shifts.
SRE framing
- SLIs/SLOs: Flagger helps maintain SLOs by preventing unhealthy revisions from receiving full traffic.
- Error budgets: Progressive promotion respects error budgets; teams can choose to pause canaries when budgets are low.
- Toil/on-call: Proper automation reduces manual steps during releases, but broken automation increases on-call toil if not observed.
3–5 realistic “what breaks in production” examples
- A new deployment leaks connections, driving up downstream latency; caught during the 5% canary step.
- A memory regression causes OOM kills under atypical load; Flagger aborts before full promotion.
- A configuration change bypasses auth for certain routes, discovered via increased error rates or security telemetry.
- A third-party API quota is exceeded when canary traffic triggers a different usage pattern, noticed by integration metrics.
Where is Flagger used?
| ID | Layer/Area | How Flagger appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge | Controls ingress routing for new release | Request rate latency status codes | Ingress controllers service mesh |
| L2 | Network | Manages canary routing inside mesh | Connection errors retries latencies | Istio Linkerd Envoy |
| L3 | Service | Routes partial traffic to new pods | Per-version latency error rate | Prometheus OpenTelemetry |
| L4 | Application | Validates business metrics during canary | Custom app metrics transactions | App metrics exporters |
| L5 | CI/CD | Triggered after deployment step | Deployment success events logs | Jenkins GitHub Actions |
| L6 | Security | Limits exposure of risky changes | Auth failure rates anomalous calls | WAF logs SIEM |
| L7 | Observability | Queries backends to decide outcomes | Time series anomalies dashboards | Prometheus Datadog |
Row Details
- L1: Edge integrations often use ingress controllers like NGINX or cloud load balancers; configuration varies by provider.
- L3: Service-level telemetry needs per-release tagging or label-aware metrics to compare canary vs primary.
- L7: Flagger connects to metric providers through configured providers; ensure query permissions.
When should you use Flagger?
When it’s necessary
- You run user-facing services with non-trivial traffic where a faulty release causes customer impact.
- You want automated rollback capabilities tied to real-time metrics.
- Your routing layer supports weighted traffic routing (service mesh or ingress).
When it’s optional
- For internal admin tools with low impact and few users.
- For teams with mature feature-flagging that fully isolates risky features from runtime changes.
When NOT to use / overuse it
- For tiny services with one developer and nightly redeploys where simple rollbacks suffice.
- For database schema changes that require careful migration strategies beyond traffic shift.
- Overuse: Applying canaries for every trivial config tweak adds complexity and monitoring cost.
Decision checklist
- If you have a supported mesh/ingress AND non-zero user traffic -> adopt Flagger.
- If you have robust feature flags for the exact behavior change -> consider feature flags instead.
- If you lack telemetry or reliable metrics -> fix observability first, then use Flagger.
Maturity ladder
- Beginner: Single service canary with Prometheus metrics and simple thresholds.
- Intermediate: Multi-service canaries, custom metrics, alerting tied to SLOs, automated notifications.
- Advanced: Cross-service progressive delivery, traffic shaping based on canary-specific telemetry, canary analysis ML augmentations.
Example decisions
- Small team: Use Flagger for high-risk public endpoints only; manual rollbacks for low-risk services.
- Large enterprise: Standardize Flagger CRDs across clusters, integrate with enterprise monitoring and CD pipelines, enforce RBAC and audit.
How does Flagger work?
Components and workflow
- Canary CRD: Declarative policy defining metrics and traffic steps.
- Flagger controller: Watches CRDs and deployment revisions; controls routing.
- Router/mesh adapter: Flagger programs traffic rules into service mesh or ingress.
- Metrics provider: Flagger queries observability backends for configured metrics.
- Notifications: Flagger emits events to Slack, PagerDuty, or other channels.
- Promotion/rollback: Based on analysis, Flagger updates the primary service or reverses deployment.
Data flow and lifecycle
- Deployment updated -> New revision observed -> Flagger creates canary analysis -> Routes small traffic to canary -> Polls metrics -> If metrics pass -> increases traffic in steps -> Final promotion updates primary -> Canary cleaned up.
- On metric violations -> Flagger reverts any routing and scales down canary -> Optionally notifies humans and creates a rollback report.
Edge cases and failure modes
- Metrics provider unresponsive: Flagger may stall or abort; ensure retries and timeouts.
- Flagger loses connectivity to router API: Promotion may not be applied; detect via controller events.
- Noisy metrics or low traffic: Statistical significance may be low; use longer observation windows or business metrics.
Short practical examples (pseudocode)
- Define Canary CRD with target service, traffic steps, and metric checks.
- Flagger observes new revision tag and increments traffic: 5% -> 25% -> 50% -> 100% with waits and checks.
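The pseudocode above can be made concrete as a minimal Canary resource. This is a hedged sketch: the app name (`podinfo`), port, and thresholds are illustrative, while `request-success-rate` and `request-duration` are Flagger's built-in metric checks; `stepWeights` (an explicit list of increments) is supported in recent Flagger versions.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo              # illustrative workload name
  namespace: test
spec:
  targetRef:                 # the Deployment Flagger watches for new revisions
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m             # wait between traffic steps
    threshold: 5             # failed metric checks before rollback
    stepWeights: [5, 25, 50] # explicit traffic increments before full promotion
    metrics:
    - name: request-success-rate   # built-in: % of non-5xx requests
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration       # built-in: latency in milliseconds
      thresholdRange:
        max: 500
      interval: 1m
```

If all checks pass at each step, Flagger promotes the canary to primary; a failed threshold five times in a row triggers an automatic rollback.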
Typical architecture patterns for Flagger
- Single-service canary – When: Isolated service with independent deploys. – Use for: Simple rollouts.
- Side-by-side canary (parallel versions) – When: Services with different versions concurrently running. – Use for: Isolated testing of a new runtime.
- Multi-service progressive rollout – When: Changes touch multiple services; orchestrated canaries across dependencies. – Use for: Coordinated feature launches.
- API gateway canary – When: Traffic control at the edge; implement for client-facing endpoints. – Use for: Versioned APIs with backward compatibility.
- Dark launches with metrics gating – When: Route traffic to canary but hide responses; use metrics-only evaluation. – Use for: Telemetry validation without user exposure.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Metrics not available | Analysis times out | Provider auth or query error | Validate provider creds retry | Flagger events metric errors |
| F2 | Insufficient traffic | No significant difference | Low request volume | Use longer test windows synthetic traffic | High variance in metric series |
| F3 | Router API failure | Traffic not shifted | Mesh or ingress API error | Retry or fail open with alert | Router error logs |
| F4 | False positives | Unnecessary rollback | Noisy metric thresholding | Use multiple metrics or smoothing | Spiky alerts on metric panels |
| F5 | Canary scaling issues | Canary pods crash | Resource limits or image bug | Resource tuning probe checks | Pod crashloop events |
| F6 | Permission errors | Flagger cannot update resources | RBAC misconfiguration | Grant proper RBAC roles | Kubernetes audit logs |
| F7 | Telemetry drift | Metrics differ due to label mismatch | Missing labels or instrumentation | Standardize labels add enrichers | Metric cardinality change |
| F8 | Long detection windows | Slow promotion | Too-long analysis intervals | Tune step durations | Extended pending analysis events |
Row Details
- F2: Synthetic traffic generation can be used for low-traffic services; ensure it does not violate business semantics.
- F4: Combine business and system metrics to reduce false positives; consider moving averages.
- F7: Label mismatches often occur after template changes; standardize metrics emitted per version.
Key Concepts, Keywords & Terminology for Flagger
- Canary release — Small percent of users routed to a new version — Validates behavior in prod — Pitfall: low sample size.
- Progressive delivery — Gradual rollout controlled by policy and metrics — Reduces blast radius — Pitfall: overcomplex policies.
- Canary CRD — Declarative Kubernetes resource for canaries — Primary policy for Flagger — Pitfall: misconfigured steps.
- MetricTemplate CRD — Custom metric query definition for canary analysis — Enables provider-specific checks — Pitfall: mis-specified queries.
- Mesh adapter — Component translating Flagger actions to service mesh APIs — Enables traffic control — Pitfall: adapter version mismatch.
- Ingress controller — Edge traffic component Flagger can program — Controls external traffic split — Pitfall: limited weight granularity.
- Metric provider — Backend Flagger queries for telemetry — Source of truth for analysis — Pitfall: query permission issues.
- Prometheus provider — Flagger adapter for Prometheus queries — Common for Kubernetes — Pitfall: missing selectors.
- Cloud metrics provider — Managed metrics like Cloud Monitoring — Integrates via provider — Pitfall: higher query latency.
- Alert policy — Reactive rule for incidents — Used with Flagger for notifications — Pitfall: alert noise during canaries.
- Rollback — Reversion when metrics fail — Stops bad release exposure — Pitfall: rollback not fully reverting external state.
- Promotion — Completing a rollout to primary — Finalizes new version — Pitfall: promotion without post-mortem.
- Analysis window — Time period Flagger uses to evaluate metrics — Controls sensitivity — Pitfall: too short yields noise.
- Statistical significance — Confidence in observed differences — Important for decisioning — Pitfall: ignored in low traffic.
- Error budget — Allowable error for SLOs — Used to gate rollouts — Pitfall: not integrated with promotions.
- Business metric — Application-level indicator like purchase success — Strong signal for rollouts — Pitfall: not instrumented.
- System metric — Infrastructure indicator like CPU or latency — Early warning signal — Pitfall: overreliance without context.
- Health check — Liveness/readiness probes — Ensures pod viability — Pitfall: unhealthy probes masking deeper issues.
- Mesh traffic shifting — Adjusting weights via mesh APIs — Core Flagger operation — Pitfall: inconsistent config across clusters.
- Istio adapter — Flagger integration for Istio mesh — Common enterprise integration — Pitfall: incompatible Istio versions.
- Linkerd adapter — Flagger integration for Linkerd — Lightweight mesh option — Pitfall: lacking Layer 7 features in some versions.
- Envoy — Proxy used by many meshes — Underpins traffic routing — Pitfall: config complexity for complex routing.
- NGINX ingress adapter — Flagger integration with NGINX ingress — Common for edge canaries — Pitfall: weight granularity.
- ALB controller integration — Cloud load balancer adapter — Used in managed clusters — Pitfall: cloud-specific limitations.
- Sidecar — Proxy deployed per pod for traffic control — Enables fine-grained routing — Pitfall: proxy resource overhead.
- Admission controller — K8s mechanism that can block or modify requests — Not directly Flagger but complementary — Pitfall: conflicts with Flagger changes.
- Feature flag — Runtime toggle separate from traffic routing — Works with Flagger for complex launches — Pitfall: drift between flags and routes.
- Traffic mirroring — Copying requests to canary without affecting response — Good for telemetry-only checks — Pitfall: double-write or side effects.
- Dark launch — Expose feature to internal metrics only — Low-risk validation — Pitfall: missing privacy constraints.
- Test traffic generator — Produces controlled traffic for canaries — Helps low-traffic services — Pitfall: synthetic patterns differ from real users.
- Canary analysis — Metric evaluation pass/fail algorithm — Determines promotion — Pitfall: single metric reliance.
- Confidence score — Composite score across metrics — Helps automation — Pitfall: opaque thresholds.
- Notification sink — Channel for alerts (Slack, PagerDuty) — Notifies stakeholders — Pitfall: misconfigured channels.
- RBAC — Kubernetes access control — Flagger needs permissions — Pitfall: least privilege not granted.
- Audit trail — Record of rollout decisions — Important for compliance — Pitfall: incomplete events.
- Observability pipeline — Metric collection and storage — Foundation for Flagger decisions — Pitfall: data loss or high latency.
- Synthetic transaction — Engineered workflow to test business paths — Useful during canaries — Pitfall: not representative.
- Canary weight — Percentage of traffic directed to canary — Primary knob for exposure — Pitfall: too large steps.
- Canary step — Individual increment in rollout plan — Controls cadence — Pitfall: steps too frequent or too long.
- Timeout policy — Wait and retry behavior for Flagger operations — Affects resilience — Pitfall: too aggressive timeouts.
- Telemetry cardinality — Number of label combinations in metrics — High cardinality can hinder queries — Pitfall: performance issues.
How to Measure Flagger (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Request success rate | Canaries don’t increase errors | (success / total) per version | 99.9% for critical APIs | Low traffic skews % |
| M2 | Request latency P95 | Performance regressions | P95 per version over window | Keep within 1.2x baseline | Outliers affect percentiles |
| M3 | Resource utilization | Canary resource correctness | CPU memory per pod | Under 70% average | Bursts may mislead |
| M4 | Error rate by type | New regressions per error class | Errors grouped per op | Maintain prior baseline | Granularity matters |
| M5 | Availability | SLO adherence during rollout | Uptime per service per version | 99.95% typical start | Depends on SLA |
| M6 | Business conversion | User impact direct metric | Conversion per cohort | No regression vs baseline | Needs tagging |
| M7 | Dependency error rate | Downstream regressions | Downstream errors per call | Similar to baseline | Cross-service taxonomy needed |
| M8 | Synthetic success | End-to-end path health | Synthetic transaction pass rate | 100% for critical paths | Synthetic may not reflect real traffic |
| M9 | Rollback frequency | Stability of releases | Rollbacks per week per service | Low and trending down | May be suppressed |
| M10 | Time-to-detect | Detection latency for regressions | Time from issue to abort | < MTTD target | Depends on polling interval |
Row Details
- M1: Ensure per-version metrics by labeling or using split-aware metrics.
- M6: Business conversion requires consistent user sampling and version identifiers.
- M9: Track why rollbacks occurred for remediation.
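M1's per-version success rate can be expressed as a custom Flagger MetricTemplate. A hedged sketch: it assumes a Prometheus counter named `http_requests_total` with an `app` label, and uses Flagger's built-in template variables (`{{ namespace }}`, `{{ target }}`, `{{ interval }}`), which Flagger fills in at analysis time.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: success-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090   # illustrative address
  query: |
    100 *
    sum(rate(http_requests_total{namespace="{{ namespace }}",
             app="{{ target }}", code!~"5.."}[{{ interval }}]))
    /
    sum(rate(http_requests_total{namespace="{{ namespace }}",
             app="{{ target }}"}[{{ interval }}]))
```

A Canary then references the template by name in its `metrics` list, with a `thresholdRange` such as `min: 99.9`.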
Best tools to measure Flagger
Tool — Prometheus
- What it measures for Flagger: System and application metrics, P95/P99 latencies, error rates.
- Best-fit environment: Kubernetes-native clusters with Prometheus operator.
- Setup outline:
- Install Prometheus operator configure scrape jobs.
- Instrument app with client libraries expose per-version labels.
- Configure Flagger Prometheus provider with query templates.
- Strengths:
- Flexible query language and alerting.
- Wide adoption in cloud-native stacks.
- Limitations:
- Requires maintenance for scaling.
- High-cardinality metrics can degrade performance.
Tool — Grafana
- What it measures for Flagger: Visualization and dashboarding across metrics used by Flagger.
- Best-fit environment: Teams needing visual dashboards and alerting.
- Setup outline:
- Connect Prometheus or other backends.
- Build executive and on-call dashboards.
- Wire alerts to notification channels.
- Strengths:
- Rich visualization and templating.
- Plug-ins and alerting support.
- Limitations:
- Alerting noise if dashboards not tuned.
- Requires permissions for data sources.
Tool — Datadog
- What it measures for Flagger: Metrics, traces, and RUM useful for canary comparison.
- Best-fit environment: Managed SaaS monitoring shops.
- Setup outline:
- Install Datadog agents configure tags per service.
- Use monitors for SLI thresholds.
- Configure Flagger provider via Datadog adapter if available.
- Strengths:
- Unified metrics traces logs and RUM.
- Built-in anomaly detection.
- Limitations:
- Cost scales with cardinality and hosts.
- Integration quality varies.
Tool — OpenTelemetry
- What it measures for Flagger: Traces and metrics instrumentation standardization.
- Best-fit environment: Multi-language polyglot stacks.
- Setup outline:
- Instrument apps with OpenTelemetry SDKs.
- Export to chosen backend.
- Ensure per-version resource attributes.
- Strengths:
- Vendor-agnostic and standard.
- Rich context propagation.
- Limitations:
- Requires backend for queries.
- Sampling decisions complicate comparisons.
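The per-version resource attributes mentioned above can be added centrally rather than per application. A hedged fragment of an OpenTelemetry Collector configuration using the `k8sattributes` processor (the resulting attribute name is illustrative):

```yaml
# OpenTelemetry Collector fragment: stamp exported telemetry with the
# pod's version label so canary and primary series can be compared.
processors:
  k8sattributes:
    extract:
      labels:
      - tag_name: app_version               # attribute name (illustrative)
        key: app.kubernetes.io/version      # standard Kubernetes label
        from: pod
```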
Tool — Managed cloud monitoring (varies by provider)
- What it measures for Flagger: Cloud-specific metrics and logs.
- Best-fit environment: Managed Kubernetes on cloud providers.
- Setup outline:
- Enable cluster metrics export.
- Configure Flagger provider with cloud credentials.
- Strengths:
- Managed scaling and retention.
- Limitations:
- Query latency and cost may vary.
Recommended dashboards & alerts for Flagger
Executive dashboard
- Panels:
- Overall service availability and SLO status.
- Recent promotions and rollbacks timeline.
- Business metric trend per release.
- Why: Stakeholders need quick health and release status.
On-call dashboard
- Panels:
- Active canaries with current step and elapsed time.
- Per-canary key SLIs: error rate, latency P95, resource usage.
- Flagger events and recent rollbacks.
- Why: Rapid triage and quick rollback visibility.
Debug dashboard
- Panels:
- Raw per-version traces and logs.
- Metric series used by canary analysis with overlays.
- Router config and weight history.
- Why: Deep dive to diagnose root cause during canary failures.
Alerting guidance
- Page vs ticket:
- Page on SLO breaches or automatic aborts for critical services.
- Create tickets for non-urgent degraded canaries or minor regressions.
- Burn-rate guidance:
- Pause rollouts if error budget burn-rate exceeds 2x planned.
- Escalate if burn continues beyond threshold window.
- Noise reduction tactics:
- Deduplicate alerts by grouping by service and canary ID.
- Suppress transient spikes with aggregated rules and rolling windows.
- Use composite alerts combining business and system metrics.
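The 2x burn-rate pause above can be encoded as a Prometheus alerting rule. A hedged sketch, assuming a 99.9% SLO (0.1% error budget), a counter named `http_requests_total`, and an illustrative job label:

```yaml
groups:
- name: canary-burn-rate
  rules:
  - alert: RolloutErrorBudgetBurn
    # 5xx ratio over 5m exceeds 2x the budgeted 0.1% error rate
    expr: |
      sum(rate(http_requests_total{job="podinfo", code=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{job="podinfo"}[5m]))
        > (2 * 0.001)
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Error budget burning at >2x planned rate; pause active rollouts"
```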
Implementation Guide (Step-by-step)
1) Prerequisites – Kubernetes cluster with network routing capable of weighted traffic (service mesh or supported ingress). – CI pipeline that can update Deployments (for example, by setting a new image tag). – Observability backend with required metrics and query access. – RBAC roles for Flagger to manage services and routing primitives.
2) Instrumentation plan – Add version labels or metrics that include pod revision. – Expose business metrics (purchase success, sign-ups) as SLIs. – Ensure readiness and liveness probes are meaningful.
3) Data collection – Configure Prometheus or alternative to scrape app metrics. – Tag metrics with version labels. – Create synthetic transactions for critical paths.
4) SLO design – Identify 1–3 SLIs per service (success rate, P95 latency, availability). – Define SLO targets and error budget policies for rollouts.
5) Dashboards – Build executive, on-call, and debug dashboards. – Include per-version drilldowns for canaries.
6) Alerts & routing – Create alerts for automatic aborts, promotion failures, and SLO breaches. – Route alerts to on-call, with escalation policies tied to severity.
7) Runbooks & automation – Document runbook for failed canary: commands to inspect Flagger CRD, view events, roll back manually. – Automate notification of escalation and ticket creation.
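The notification wiring from step 7 can be declared on the Canary itself. A hedged sketch using Flagger's AlertProvider CRD (webhook URL and channel are illustrative):

```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: on-call-slack
  namespace: flagger-system
spec:
  type: slack
  channel: releases
  address: https://hooks.slack.com/services/EXAMPLE   # illustrative webhook
---
# Canary analysis fragment referencing the provider; "error" severity
# limits notifications to failed analyses and rollbacks.
analysis:
  alerts:
  - name: on-call
    severity: error
    providerRef:
      name: on-call-slack
      namespace: flagger-system
```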
8) Validation (load/chaos/game days) – Run game days including synthetic traffic and controlled failures. – Validate promotion timing, rollback paths, and alerting.
9) Continuous improvement – Review rollouts and postmortems, iterate metric thresholds. – Add automation for common fixes discovered during incidents.
Checklists
Pre-production checklist
- Ingress/mesh supports weighted routing.
- Flagger controller installed and has RBAC.
- Metrics provider configured and returning queries.
- App exposes per-version metrics.
- Synthetic tests available for critical paths.
Production readiness checklist
- Dashboards and alerts in place and tested.
- Runbooks documented and accessible.
- On-call trained on Flagger behavior.
- Audit logging enabled for promotions and rollbacks.
- Rollout policies reviewed by SRE and security.
Incident checklist specific to Flagger
- Inspect Flagger Canary CRD status and events.
- Query metrics for canary vs primary over analysis windows.
- Check router configuration and weight history.
- If necessary, manually scale down canary and apply rollback.
- Create postmortem capturing metric triggers and mitigation steps.
Example for Kubernetes
- Prerequisite: Istio installed, Prometheus available.
- Instrumentation: Add version label via pod template metadata.
- Data collection: Configure Prometheus scrape and record rules.
- SLO: Request success rate 99.9% for API.
- Validation: Deploy canary, run synthetic load, verify metrics, promote.
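The validation step above can be automated with Flagger's optional load-tester webhook, which generates synthetic traffic during each analysis interval. A hedged Canary fragment for this Istio example (gateway, hostname, and load command are illustrative):

```yaml
# Canary fragment: expose the service through an Istio gateway and
# run synthetic load against the canary during analysis.
spec:
  service:
    port: 8080
    gateways:
    - istio-system/public-gateway   # illustrative gateway reference
    hosts:
    - api.example.com               # illustrative external host
  analysis:
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/   # Flagger's loadtester service
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://api-canary.test:8080/"
```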
Example for managed cloud service
- Prerequisite: Managed Kubernetes with ingress controller and cloud metrics enabled.
- Instrumentation: Use cloud agent to export metrics with version labels.
- Data collection: Configure Flagger provider for cloud metrics.
- SLO: Availability 99.95% for customer endpoints.
- Validation: Trigger canary via CD pipeline and monitor cloud metric queries.
Use Cases of Flagger
1) Public API version rollout – Context: Breaking API change to a customer endpoint. – Problem: High risk of breaking clients. – Why Flagger helps: Gradual traffic shift allows progressive exposure and quick rollback. – What to measure: Error rate 4xx/5xx latency P95. – Typical tools: Istio Prometheus Grafana.
2) Payment service upgrade – Context: New payment gateway library. – Problem: Small increase in failure has direct revenue impact. – Why Flagger helps: Business metric gating (successful payments). – What to measure: Payment success rate transaction errors. – Typical tools: Prometheus Datadog synthetic transactions.
3) A/B test with backend changes – Context: Algorithmic change affecting recommendation results. – Problem: Need safe exposure and comparison. – Why Flagger helps: Split traffic with telemetry comparison. – What to measure: Engagement conversion metrics. – Typical tools: Flagger plus analytics backend.
4) Dark launch of new UI path – Context: UI sends new telemetry calls to backend. – Problem: Want to validate backend behavior without user exposure. – Why Flagger helps: Mirror or partial routing for telemetry-only evaluation. – What to measure: Telemetry event counts error rates. – Typical tools: Traffic mirroring support, metrics backend.
5) Dependency version bump – Context: Library update might affect downstream latency. – Problem: Degradation could propagate. – Why Flagger helps: Canary scopes exposure and observes downstream calls. – What to measure: Downstream API latency failure rates. – Typical tools: Tracing with OpenTelemetry, Prometheus.
6) Serverless function change – Context: Managed function updated with new logic. – Problem: Cold-starts and performance regressions may occur. – Why Flagger helps: Progressive traffic control at API gateway layer. – What to measure: Invocation errors latency cost per invocation. – Typical tools: API gateway metrics cloud monitoring.
7) Database connection pooling change – Context: App code changes pooling behavior. – Problem: Increased connection churn may overload DB. – Why Flagger helps: Gradual exposure to detect connection errors. – What to measure: DB connection count errors latency. – Typical tools: App metrics DB performance monitors.
8) Multi-region rollout – Context: New version must be tested regionally. – Problem: Latency and config differences across regions. – Why Flagger helps: Region-based canaries with local metrics gating. – What to measure: Region-specific latency and error rates. – Typical tools: Multi-cluster Flagger setup, region metrics.
9) Cost-optimization flight – Context: Runtime change reduces instance usage but may affect latency. – Problem: Need to trade performance and cost safely. – Why Flagger helps: Compare resource utilization and latency per version. – What to measure: CPU memory cost per request latency. – Typical tools: Cloud cost metrics Prometheus.
10) Feature toggle deprecation – Context: Removing legacy code behind a flag. – Problem: Risk of regression across many services. – Why Flagger helps: Progressive removal with telemetry checks. – What to measure: Error rates user-impacting metrics. – Typical tools: Flagger plus centralized feature flag telemetry.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: Public API Canary with Istio
Context: A customer-facing REST API in Kubernetes needs a new handler version.
Goal: Safely roll out new version to 100% traffic only after verification.
Why Flagger matters here: Automates traffic shifts through Istio, reduces manual steps and speeds rollback if issues appear.
Architecture / workflow: Deployment with two revisions, Istio virtual service and destination rules, Flagger Canary CRD, Prometheus metrics.
Step-by-step implementation:
- Ensure Istio and Prometheus installed.
- Add version label in Deployment pod template.
- Create Canary CRD referencing service and metrics (P95 latency, error rate).
- Deploy new image tag to cluster; Flagger detects new revision.
- Flagger shifts 5% traffic and monitors metrics for analysis window.
- If checks pass, Flagger moves to next step; else abort and rollback.
What to measure: P95 latency, request success rate, pod OOMs.
Tools to use and why: Istio for traffic weights, Prometheus for metrics, Grafana for dashboards.
Common pitfalls: Missing per-version labels; Prometheus queries return aggregated metrics only.
Validation: Run synthetic transactions and compare canary vs primary; trigger rollback scenario for practice.
Outcome: Safe promotion and audit trail for release.
Scenario #2 — Serverless/Managed-PaaS: Lambda-like function via API Gateway
Context: Managed function upgrade behind API gateway with staged deployments.
Goal: Validate function performance under real traffic and roll back on regressions.
Why Flagger matters here: Controls stage traffic via gateway percentages or split routes and enforces metric checks.
Architecture / workflow: API gateway supports weighted backend targets; Flagger orchestrates traffic and queries cloud monitoring.
Step-by-step implementation:
- Ensure gateway supports weights and cloud metrics available.
- Deploy new function revision in parallel.
- Create a Flagger Canary specifying the gateway provider and metrics (errors, latency).
- Trigger Flagger analysis after deployment.
- Promote or rollback based on cloud metric responses.
What to measure: Invocation errors, cold-start latency, cost per invocation.
Tools to use and why: Cloud monitoring for function metrics, gateway adapter for traffic shifts.
Common pitfalls: Provider inability to shift small weights or slow metric propagation.
Validation: Run controlled traffic spikes and verify Flagger decision latency.
Outcome: Safer function updates with minimal customer impact.
Scenario #3 — Incident-response/Postmortem: Abort and Investigate
Context: During a canary, error rate spikes and Flagger aborts promotion.
Goal: Quickly diagnose root cause and remediate.
Why Flagger matters here: Early abort prevented a full-blown outage; provides events and rollback.
Architecture / workflow: Flagger sends abort event, scales down canary, and notifies on-call.
Step-by-step implementation:
- On-call inspects Flagger Canary status and event logs.
- Query metrics for canary vs primary across windows.
- Check logs and traces correlated to the canary revision.
- If bug found, roll back and create PR to fix code.
- Run postmortem documenting metrics that triggered abort and remediation steps.
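The first step above usually means reading the Canary status. A sketch of the shape an on-call engineer sees from `kubectl get canary myapp -n prod -o yaml`; the values shown are hypothetical.

```yaml
# Illustrative status after Flagger aborts a canary (values are examples)
status:
  phase: Failed              # Flagger aborted the rollout
  canaryWeight: 0            # traffic already routed back to primary
  failedChecks: 5            # threshold reached, triggering rollback
  lastTransitionTime: "2024-05-01T10:15:00Z"
  conditions:
    - type: Promoted
      status: "False"
      reason: Failed
      message: Canary analysis failed, rolling back
```

Pair this with `kubectl describe canary myapp -n prod`, whose Events section records each traffic step and the failing metric checks, giving the postmortem its timeline.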
What to measure: Time-to-detect, time-to-rollback, affected request count.
Tools to use and why: Tracing for root cause, logs for stack traces, Flagger events for audit.
Common pitfalls: Missing observability artifacts for canary revision.
Validation: Postmortem includes timeline and action items.
Outcome: Incident contained with learnings added to runbooks.
Scenario #4 — Cost/performance trade-off
Context: New version uses vectorized algorithms reducing CPU but raising memory slightly.
Goal: Ensure cost benefit without violating latency SLOs.
Why Flagger matters here: Can evaluate cost signals and latency simultaneously during rollout.
Architecture / workflow: Flagger configured to monitor resource usage and latency P95.
Step-by-step implementation:
- Capture baseline CPU, memory, and cost per request.
- Define Canary CRD with metrics for CPU seconds per request and P95 latency.
- Run the canary with incrementally increasing traffic and observe cost versus latency.
- Promote if cost savings and no latency regression; else rollback.
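The CPU-seconds-per-request check from the steps above can be sketched as a Prometheus-backed MetricTemplate. Assumptions: an in-cluster Prometheus at the address shown, cAdvisor's `container_cpu_usage_seconds_total`, and an app-exposed `http_requests_total` counter (the request metric name is hypothetical).

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cpu-seconds-per-request
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090   # assumed in-cluster URL
  query: |
    sum(rate(container_cpu_usage_seconds_total{namespace="{{ namespace }}",pod=~"{{ target }}-.*"}[2m]))
    /
    sum(rate(http_requests_total{namespace="{{ namespace }}",pod=~"{{ target }}-.*"}[2m]))
```

Reference it from the Canary with a `thresholdRange.max` set just above the baseline, alongside the usual P95 latency check, so promotion requires both the cost win and no latency regression.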
What to measure: CPU cost per request, memory usage, P95 latency.
Tools to use and why: Prometheus for resource metrics, Grafana for cost modeling.
Common pitfalls: Overlooking memory spikes that cause OOMs under load.
Validation: Load test at expected peak before final promotion.
Outcome: Data-driven decision balancing cost and customer experience.
Common Mistakes, Anti-patterns, and Troubleshooting
- Symptom: Canary analysis times out -> Root cause: Metrics provider auth or network error -> Fix: Validate provider credentials, increase timeouts.
- Symptom: No per-version metrics -> Root cause: Missing version label in instrumentation -> Fix: Add stable version label to metrics, redeploy.
- Symptom: Frequent false aborts -> Root cause: Single metric threshold too sensitive -> Fix: Use composite metrics and smoothing windows.
- Symptom: Canary never promotes -> Root cause: Flagger cannot update routing rules -> Fix: Check RBAC and router adapter health.
- Symptom: High alert noise during canaries -> Root cause: Alerts not gated by canary context -> Fix: Add canary-specific suppression or dedupe rules.
- Symptom: Promotion applied but external state not reverted on rollback -> Root cause: Side effects not idempotent -> Fix: Add compensating transactions and pre-flight checks.
- Symptom: Low sample sizes misleading metrics -> Root cause: Low traffic service -> Fix: Use synthetic traffic or longer observation windows.
- Symptom: Inconsistent metrics across regions -> Root cause: Telemetry collection inconsistent -> Fix: Standardize instrumentation and collection config.
- Symptom: Flagger events missing -> Root cause: Flagger controller crash loops -> Fix: Inspect pod logs, ensure resource limits and restarts.
- Symptom: Router weight granularity too coarse -> Root cause: Ingress lacks fractional weights -> Fix: Use mesh adapter or staged promotion strategy.
- Symptom: RBAC denies Flagger updates -> Root cause: Missing ClusterRoleBindings -> Fix: Apply documented roles for Flagger.
- Symptom: Flagger mis-evaluates business metric -> Root cause: Metric aligned to wrong tag -> Fix: Tag metrics with stable release ID.
- Symptom: Canary pods starved -> Root cause: Pod anti-affinity or resource quotas -> Fix: Adjust quotas and affinity rules.
- Symptom: Excessive metric cardinality -> Root cause: High label cardinality in app metrics -> Fix: Reduce labels and aggregate dimensions.
- Symptom: Flagger waits indefinitely -> Root cause: Undefined timeout or stuck step -> Fix: Define explicit timeouts and failure behavior.
- Symptom: Observability dashboards missing canary context -> Root cause: Dashboards not templated by version -> Fix: Add version variables and query per revision.
- Symptom: Alerts not routed correctly -> Root cause: Misconfigured notification sink -> Fix: Test channels and set correct keys.
- Symptom: Feature flags conflict with routing -> Root cause: Separate toggles not synchronized -> Fix: Coordinate flag and Flagger rollout via pipeline.
- Symptom: Data race in deployment ordering -> Root cause: CI deploys while Flagger is mid-step -> Fix: Gate CD to wait for Flagger completion.
- Symptom: Manual overrides ignored -> Root cause: Flagger controller reconciling desired state -> Fix: Use CRD edits or pause Flagger before manual change.
- Observability pitfall: No trace correlation between versions -> Root cause: Missing version fields in spans -> Fix: Add resource attributes to traces.
- Observability pitfall: Metrics delayed beyond analysis window -> Root cause: Scrape or ingestion lag -> Fix: Increase scrape frequency or extend window.
- Observability pitfall: Logs not tagged with revision -> Root cause: Logging pipeline lacks pod labels -> Fix: Enrich logs with pod metadata.
- Observability pitfall: Synthetic tests not representative -> Root cause: Synthetic flows too rigid -> Fix: Model realistic user patterns in synthetic scripts.
- Symptom: Canary affects downstream quotas -> Root cause: Canary hits third-party usage limits -> Fix: Throttle canary or mock external services.
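For the "frequent false aborts" row above, one mitigation is a smoothed custom metric. A sketch assuming Prometheus and an app-exposed `http_requests_total` counter with a `status` label (both assumptions); the 10-minute moving average dampens short spikes that would otherwise trip a per-minute threshold.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: smoothed-error-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090   # assumed address
  query: |
    avg_over_time(
      (sum(rate(http_requests_total{namespace="{{ namespace }}",app="{{ target }}",status=~"5.."}[1m]))
       / sum(rate(http_requests_total{namespace="{{ namespace }}",app="{{ target }}"}[1m])) * 100)[10m:1m]
    )
```

Combine this with a second, independent signal (e.g. latency) in the Canary's metrics list so a single noisy series cannot abort the rollout on its own.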
Best Practices & Operating Model
Ownership and on-call
- Assign service owners responsible for canary definitions and SLOs.
- On-call engineers should be trained to interpret Flagger events and dashboards.
- Ownership: Flagger release decisions remain with SRE/CD team jointly.
Runbooks vs playbooks
- Runbooks: Step-by-step recovery for known issues, commands to roll back, checks to run.
- Playbooks: High-level strategy for escalation and cross-service coordination.
Safe deployments (canary/rollback)
- Use small initial weight increments and require multiple metric passes.
- Automate rollbacks and notify post-mortem owners.
- Avoid manual promotions unless verified by SRE.
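The "verified by SRE" requirement above maps to Flagger's webhook gates: a `confirm-promotion` webhook makes Flagger hold at full canary weight until the gate is opened. A sketch assuming the standard flagger-loadtester gate endpoint (URL and namespace are illustrative):

```yaml
  analysis:
    webhooks:
      - name: gate-promotion
        type: confirm-promotion     # Flagger polls this gate before promoting
        url: http://flagger-loadtester.test/gate/check
```

An SRE opens the gate (e.g. via the loadtester's `/gate/open` endpoint) after reviewing dashboards, so promotion is still automated but human-approved.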
Toil reduction and automation
- Automate common analysis report generation after canary (diff metrics).
- Automate synthetic traffic for low-traffic services.
- Automate remediation for known transient failures (e.g., restart pods on OOM).
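Synthetic traffic for low-traffic services (second bullet above) is commonly driven by a `rollout`-type webhook that runs a load generator during each analysis interval. A sketch assuming the flagger-loadtester deployment and the `hey` load generator it ships with; the target URL is hypothetical.

```yaml
  analysis:
    webhooks:
      - name: load-test
        type: rollout               # executed during every analysis interval
        url: http://flagger-loadtester.test/
        metadata:
          # 1 minute of traffic at 10 req/s from 2 workers against the canary
          cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary.prod:8080/"
```

This guarantees a minimum sample size for the metric checks even when real traffic is scarce.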
Security basics
- Grant minimum RBAC for Flagger.
- Audit promotions and CRD updates.
- Ensure metric providers use least-privilege credentials.
Weekly/monthly routines
- Weekly: Review active canaries and any aborted rollouts.
- Monthly: Audit Flagger RBAC and router adapter versions; update playbooks.
- Quarterly: Run game days and validate SLO alignment.
What to review in postmortems related to Flagger
- Metric triggers that caused abort; whether thresholds were appropriate.
- Time-to-detect and time-to-rollback.
- Was Flagger behavior expected by operators? Any surprise automation?
- Action items: telemetry gaps, configuration changes, runbook updates.
What to automate first
- Synthetic traffic generation for low-traffic services.
- Automatic alerts for Flagger aborts and promotions.
- Basic canary CRD templates enforced via policy for standard services.
Tooling & Integration Map for Flagger
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Service mesh | Traffic routing and weight control | Istio, Linkerd, Envoy | Critical for L7 traffic shifting |
| I2 | Ingress controller | Edge routing and weighted backends | NGINX, Contour, ALB | Useful for edge-level canaries |
| I3 | Metrics backend | Stores and queries metrics | Prometheus, Datadog, cloud monitoring | Flagger queries it for analysis |
| I4 | Tracing | Request-level visibility | OpenTelemetry, Jaeger, Zipkin | Helps root-cause regressions |
| I5 | CI/CD | Triggers deployments and canaries | Jenkins, GitHub Actions | Orchestrates deploy, then Flagger |
| I6 | Notification | Sends alerts and events | Slack, PagerDuty, email | Notify on aborts/promotions |
| I7 | Feature flag system | Runtime feature toggles | LaunchDarkly, Unleash | Complementary to traffic canaries |
| I8 | Observability platform | Dashboards and alerts | Grafana, Datadog | Visualize canary comparisons |
| I9 | Chaos tools | Fault injection for resilience | Chaos Mesh, Litmus | Validate rollback and recovery |
| I10 | Policy engine | Enforce CRD templates | Kyverno, OPA Gatekeeper | Enforce safe rollout policies |
Row Details
- I1: Service mesh choice dictates available routing features and performance.
- I3: Prometheus commonly used; managed backends have higher latency.
- I5: CI pipelines should wait for Flagger completion when necessary.
Frequently Asked Questions (FAQs)
How do I integrate Flagger with Prometheus?
Use the Flagger Prometheus provider configuration with a query URL and appropriate RBAC; ensure metrics expose version labels.
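In practice the Canary mixes Flagger's built-in Prometheus-backed checks with custom queries via `templateRef`. A sketch; the custom metric name and thresholds are hypothetical, and the custom template's query must filter on the per-version labels mentioned above.

```yaml
  analysis:
    metrics:
      - name: request-success-rate   # built-in, served by the configured Prometheus
        thresholdRange:
          min: 99
        interval: 1m
      - name: custom-business-sli    # custom query defined in a MetricTemplate
        templateRef:
          name: custom-business-sli
          namespace: flagger-system
        thresholdRange:
          max: 1
        interval: 1m
```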
How do I pause or abort a canary manually?
Use a `confirm-rollout` or `confirm-promotion` webhook gate to hold the rollout, or edit the Canary spec to suspend/skip analysis; to force a rollback, revert the target Deployment's image, or scale the canary workload to zero so checks fail and Flagger rolls traffic back.
How do I add business metrics to Flagger analysis?
Instrument application to expose business SLIs and reference them in the Canary CRD metric checks.
What’s the difference between Flagger and Argo Rollouts?
Both support progressive delivery; Argo Rollouts provides a richer UI and ecosystem for canaries and experiments while Flagger focuses on metric-driven rollouts with wide adapter support.
What’s the difference between feature flags and Flagger?
Feature flags toggle code paths at runtime; Flagger shifts user traffic between versions. Use flags for behavior toggles and Flagger for deployment-level control.
What’s the difference between Flagger and a CI system?
CI handles builds and tests; Flagger sits post-deploy to manage traffic and promotions based on metrics.
How do I measure canary success?
Define SLIs such as success rate and latency, compute them per-version, and compare against baselines and SLO targets.
How much traffic should a canary receive initially?
Often 1–5% for public endpoints; use business risk and traffic volume to decide.
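When the ramp needs to be explicit rather than a fixed increment, Flagger accepts a `stepWeights` list. A sketch (the specific weights are illustrative and should reflect your risk tolerance and traffic volume):

```yaml
  analysis:
    interval: 2m
    threshold: 3
    # explicit ramp: start tiny, widen steps as confidence grows
    stepWeights: [1, 5, 10, 25, 50]
```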
How do I handle low-traffic services?
Use synthetic test traffic and longer analysis windows or group canaries with similar traffic patterns.
How does Flagger handle multi-service rollouts?
Configure Flagger for each service and coordinate via pipeline orchestration or use higher-level ProgressiveDelivery patterns.
How do I debug a failed canary?
Check Flagger Canary CRD status and events, inspect routing configuration, query metrics for canary vs primary, and review logs/traces for the canary revision.
How do I ensure security with Flagger?
Grant least privilege RBAC, audit actions, and ensure adapters use secure credentials and TLS for metric queries.
How long should analysis windows be?
It varies with traffic volume and metric propagation delay: the window must be long enough to collect a statistically meaningful sample and to outlast scrape/ingestion lag. High-traffic services often use intervals of a few minutes; low-traffic services need longer windows or synthetic traffic.
How many metrics should I use for analysis?
Use at least two complementary metrics (system and business). More metrics can reduce false positives but increase complexity.
How does Flagger decide to rollback?
It compares configured metric thresholds across configured analysis windows; failing checks trigger abort and rollback.
How is Flagger deployed in multi-cluster setups?
Flagger must be installed per-cluster; cross-cluster coordination is external to Flagger and often handled by CD pipelines.
How do I test Flagger changes safely?
Use staging clusters and synthetic traffic; run game days to validate behavior before production rollout.
Conclusion
Flagger provides a pragmatic, metric-driven approach to progressive delivery in Kubernetes and related environments. It reduces release risk by automating traffic shifting, analysis, and rollback while integrating with existing observability and routing stacks. Successful adoption depends on solid telemetry, appropriate RBAC, and clear runbooks.
Next 7 days plan
- Day 1: Inventory services and identify high-risk candidates for canaries.
- Day 2: Ensure Prometheus or a metrics backend is scraping per-version metrics.
- Day 3: Install Flagger in a staging cluster and validate RBAC and adapters.
- Day 4: Create a Canary CRD for a non-critical service and run a controlled rollout.
- Day 5: Build on-call dashboard panels and alert rules for Flagger events.
Appendix — Flagger Keyword Cluster (SEO)
- Primary keywords
- Flagger
- Flagger canary
- Flagger Kubernetes
- Flagger progressive delivery
- Flagger tutorial
- Flagger guide
- Flagger canary CRD
- Flagger prometheus
- Flagger istio
- Flagger ingress
- Related terminology
- Canary release
- Progressive delivery
- Canary analysis
- Traffic shifting
- Weighted routing
- Service mesh canary
- Istio canary
- Linkerd canary
- NGINX canary
- ALB canary
- Canary CRD example
- Prometheus provider Flagger
- Flagger adapters
- Flagger events
- Flagger rollback
- Flagger promotion
- Flagger metrics
- Flagger best practices
- Flagger troubleshooting
- Flagger failure modes
- Flagger security
- Flagger RBAC
- Flagger observability
- Flagger dashboards
- Flagger alerts
- Flagger synthetic traffic
- Canary step configuration
- Canary weight strategy
- Canary analysis window
- Business metrics canary
- System metrics canary
- Flagger vs Argo Rollouts
- Flagger vs feature flags
- Flagger vs Spinnaker
- Flagger installation
- Flagger configuration
- Flagger examples
- Flagger case studies
- Flagger for serverless
- Flagger multi-cluster
- Canary testing
- Canary automation
- Flagger runbook
- Flagger playbook
- Flagger incident response
- Flagger audit trail
- ProgressiveDelivery CRD
- Flagger adapters list
- Flagger plugin
- Flagger architecture
- Flagger monitoring
- Flagger metrics queries
- Flagger synthetic tests
- Flagger canary templates
- Flagger CI/CD integration
- Flagger deployment strategy
- Flagger rollback policy
- Flagger promotion policy
- Flagger troubleshooting guide
- Flagger observability pipeline
- Flagger load testing
- Flagger game days
- Flagger low traffic strategies
- Flagger cost optimization
- Flagger compliance
- Flagger audit logs
- Flagger telemetry best practices
- Flagger metric cardinality
- Flagger label conventions
- Flagger canary examples Kubernetes
- Flagger serverless examples
- Flagger production checklist
- Flagger pre-production checklist
- Flagger runbook examples
- Flagger automation tips
- Flagger maturity model
- Flagger adoption roadmap
- Flagger SLO integration
- Flagger error budget
- Flagger burn rate
- Flagger alerting guidance
- Flagger dedupe alerts
- Flagger incident checklist
- Flagger rollback scenarios
- Flagger canary analysis thresholds
- Flagger traffic mirroring
- Flagger dark launch
- Flagger side-by-side deployment
- Flagger multi-service rollout
- Flagger API gateway canary
- Flagger edge canary
- Flagger mesh adapter
- Flagger ingress adapter
- Flagger tracing integration
- Flagger OpenTelemetry
- Flagger Datadog integration
- Flagger Grafana dashboards
- Flagger synthetic transactions
- Flagger load generator
- Flagger observability pitfalls
- Flagger troubleshooting steps
- Flagger common mistakes
- Flagger anti-patterns
- Flagger success metrics
- Flagger SLI examples
- Flagger SLO guidance
- Flagger metric examples
- Flagger measurement
- Flagger KPIs
- Flagger business metrics
- Flagger deploy automation
- Flagger CI integration
- Flagger CD pipeline
- Flagger enterprise adoption