Quick Definition
Plain-English definition: Topology spread constraints are scheduling rules used to influence where workloads are placed across a cluster so replicas or instances are spread across failure domains to reduce correlated failures.
Analogy: Like seating guests at a wedding so family members are distributed across multiple tables instead of all at one table; if one table collapses, not everyone is affected.
Formal technical line: Topology spread constraints are declarative scheduling constraints that direct the cluster scheduler to distribute pods or instances evenly across specified topology domains (identified by topology keys), subject to label selectors and maxSkew limits.
Other meanings (if any):
- The most common meaning above refers to Kubernetes PodTopologySpread and topologySpreadConstraints.
- In some orchestration systems it can mean policy layers for multi-zone VM placement.
- In networking or storage contexts it can refer to data-plane replication distribution rules.
- In proprietary systems, similar concepts may be called anti-affinity, spread policies, or placement constraints.
What is topology spread constraints?
What it is / what it is NOT
- It is a scheduler-level policy that guides placement for high availability by spreading replicas across topology domains like nodes, zones, racks, or regions.
- It is NOT a hard guarantee of perfect distribution—scheduling is subject to resource availability, taints/tolerations, and affinity/anti-affinity.
- It is NOT a replacement for proper backup, replication, or cross-region disaster recovery plans.
Key properties and constraints
- Declarative: defined in workload specs (for example Kubernetes Pod spec).
- Topology key driven: works against labels on nodes or infrastructure (zone, hostname, rack).
- Balancing strategy: "maxSkew" defines how much imbalance between domains is tolerated, and "whenUnsatisfiable" defines whether the scheduler blocks placement or proceeds anyway when the constraint cannot be met.
- Selector-limited: applies to pods matched by a label selector.
- Dependent on scheduler: behavior can vary by orchestration platform and scheduler implementation.
- Resource-aware: spreading is best-effort within available capacity; the scheduler still honors CPU/memory requests and taints when choosing nodes.
- Compatibility: interacts with node affinity, pod affinity/anti-affinity, and other placement constraints.
Where it fits in modern cloud/SRE workflows
- Prevents correlated failures by spreading workload replicas across failure domains.
- Used in deployment and release plans to increase reliability for stateful and stateless services.
- Part of SRE practices to meet availability SLIs and reduce blast radius for incidents.
- Integrated into CI/CD for canary and staged rollouts where distribution matters.
- Often combined with autoscaling, topology-aware load balancers, and multi-cluster strategies.
Diagram description (text-only)
- Visualize a grid of nodes grouped by zones and racks; pods of a single deployment are colored the same.
- Topology spread constraints act like a set of lines dividing the grid; the scheduler tries to place one colored pod in each segment before placing a second in any one segment.
- If a zone has no capacity, the scheduler fills remaining choices with best-effort placements while trying to respect maxSkew limits.
topology spread constraints in one sentence
Topology spread constraints are scheduler directives that aim to distribute workload replicas across labeled topology domains to reduce correlated risk and improve availability.
topology spread constraints vs related terms (TABLE REQUIRED)
| ID | Term | How it differs from topology spread constraints | Common confusion |
|---|---|---|---|
| T1 | PodAntiAffinity | Expresses which pods should not colocate on same topology | Often conflated with spread |
| T2 | NodeAffinity | Targets nodes by labels rather than balancing across them | Mixes filtering with balancing |
| T3 | ReplicationController | Ensures replica count but not distribution | People assume it spreads replicas |
| T4 | StatefulSet | Manages stable identities and ordering not distribution | Some expect automatic cross-zone spread |
| T5 | TopologyAwareRouting | Load routing based on topology rather than placement | Not a placement guarantee |
| T6 | ZoneFailover | Operational plan for zone outage not scheduling logic | Sometimes thought to be auto-handled by constraints |
| T7 | PodDisruptionBudget | Controls voluntary evictions not initial placement | Confusion about preventing imbalance |
| T8 | VolumeTopology | Storage placement targeting nodes with volumes | Related but distinct from pod placement |
| T9 | SchedulerExtender | External plugins for scheduling decisions | May alter constraint behavior |
| T10 | MultiClusterController | Orchestrates across clusters not intra-cluster spread | Different scope |
Row Details
- T1: PodAntiAffinity only prevents colocating matching pods on same topology domain; topology spread tries to balance counts across domains.
- T4: StatefulSet ensures stable network IDs and persistent volumes; it does not by itself guarantee even spread across labels like zone.
- T8: VolumeTopology dictates where persistent volumes are accessible; pod spread must respect available volume topology, which can limit spread.
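To make the T1 distinction concrete, here is a side-by-side sketch (the `app: myapp` label is illustrative): anti-affinity forbids a second matching pod in a domain outright, while a spread constraint only bounds the count difference between domains.

```yaml
# PodAntiAffinity: never place two matching pods in the same zone (hard rule).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: topology.kubernetes.io/zone

# Topology spread: allow multiple pods per zone, but keep counts within maxSkew.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
```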
Why does topology spread constraints matter?
Business impact (revenue, trust, risk)
- Reduces correlated outages that can impact customer-facing features, preserving revenue during infrastructure incidents.
- Improves customer trust by reducing wide-scale failures that appear as systemic downtime.
- Lowers business risk by shrinking blast radius; fewer customers impacted per outage often means lower compliance and contractual penalties.
Engineering impact (incident reduction, velocity)
- Often reduces incident frequency and severity by preventing many replicas from being knocked out by single-node or single-zone failures.
- Enables faster recovery and simpler runbooks because fewer dependencies fail simultaneously.
- Can increase deployment complexity; teams may need to account for insufficient topology domains which can slow rollout velocity if not automated.
SRE framing (SLIs/SLOs/error budgets/toil/on-call)
- SLIs improved: higher availability and lower error rates during infrastructure-level failures.
- SLOs: topology spread constraints help maintain error budgets by reducing large-scale incidents.
- Toil reduction: reduces repetitive manual redistribution after failures, but initial setup requires engineering effort.
- On-call: fewer pages for zone/node failures; pages that occur are narrower in scope but possibly more complex due to placement interactions.
3–5 realistic “what breaks in production” examples
- A node pool upgrade drains every node in one zone; the displaced replicas are rescheduled onto the remaining nodes, causing resource exhaustion and evictions.
- A network partition isolates a rack; without spread, many replicas were on that rack causing service outage.
- Storage controller failure removes access to a specific topology domain; pods scheduled there lose persistent volumes.
- Autoscaler quickly removes underutilized nodes that happened to host many replicas for a low-traffic service, causing a cascading restart storm.
Where is topology spread constraints used? (TABLE REQUIRED)
| ID | Layer/Area | How topology spread constraints appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Infrastructure | Spread VMs or instances across zones and racks | Node counts per zone, instance churn | Cluster autoscaler, cloud API |
| L2 | Kubernetes | Pod topologySpreadConstraints in PodSpec | Pod distribution, scheduling failures | kube-scheduler, kubectl, policies |
| L3 | Storage | Volume placement and attachments across nodes | Volume attachments, IO errors per topology | CSI drivers, storage controllers |
| L4 | Network | Load balancer and routing affinity aware of topology | Traffic per AZ, failover events | LB controllers, service mesh |
| L5 | CI/CD | Deployment strategies that respect topology | Rollout success per zone | ArgoCD, Flux, pipelines |
| L6 | Serverless | Managed placement or zone awareness for functions | Invocation latency per region | Cloud provider platform metrics |
| L7 | Observability | Dashboards for placement and imbalance | Skew metrics, pod evictions | Prometheus, Grafana, telemetry pipelines |
| L8 | Security | Isolate sensitive workloads across domains | Audit of label/topology mapping | Policy engines, admission controllers |
Row Details
- L2: Kubernetes usage is declarative in PodSpec and commonly used for Deployments and StatefulSets; scheduler plugin behavior can vary by version.
- L3: Storage topology can constrain pod placement when volumes are bound to particular nodes or zones, limiting spread.
- L6: Serverless providers often hide placement, so topology awareness varies by provider; managed services may expose zone routing metrics.
When should you use topology spread constraints?
When it’s necessary
- When you run replicated stateful or stateless services and want to minimize correlated failures across labeled topology domains.
- When SLOs require partial-failure tolerance (for example tolerate one zone outage).
- When regulatory or compliance requirements demand distribution across fault domains.
When it’s optional
- For small, non-critical workloads where failure of all replicas is acceptable.
- When cluster resource scarcity means constraints cause frequent pending pods and manual intervention.
- For ephemeral dev/test clusters where simplicity outweighs availability.
When NOT to use / overuse it
- Don’t apply strict spread constraints to every pod; unnecessary constraints can cause scheduling pressure and resource fragmentation.
- Avoid overly granular topology keys (for example, labels that are unique to almost every node), which leave too few pods per domain for balancing to be meaningful.
- Don’t use spread constraints as the only HA mechanism for stateful systems that require replication or quorum across persistence layers.
Decision checklist
- If losing a single zone must not take the service down and the cluster spans at least two zones -> use spread constraints.
- If resource availability is low and pods are frequently Pending -> relax constraints or add capacity.
- If storage volume binding prevents cross-zone relocation -> verify volume topology before adding constraints.
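One way to satisfy the last check is a storage class with delayed volume binding, so volume placement follows pod scheduling rather than constraining it (a sketch; the provisioner name is a placeholder for your actual CSI driver):

```yaml
# Delay volume binding until a pod is scheduled, so the scheduler can pick a
# zone that satisfies both spread constraints and volume topology.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware
provisioner: example.com/csi-driver   # hypothetical; substitute your CSI driver
volumeBindingMode: WaitForFirstConsumer
```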
Maturity ladder
- Beginner: Apply basic zone-level spread for critical Deployments with simple maxSkew and whenUnsatisfiable set to “ScheduleAnyway” for flexibility.
- Intermediate: Use multiple constraints with node and zone keys, integrate with PDBs and deployment strategies.
- Advanced: Implement scheduler extenders, custom scoring plugins, and multi-cluster placement controllers with automatic remediation and autoscaler integration.
Example decisions
- Small team example: For a 3-node cluster in one zone, do NOT use zone-level spread; use node-level spread with ScheduleAnyway to reduce pending pods.
- Large enterprise example: For a global service, enforce cross-zone spread with hard constraints and integrate with multi-region failover and CI/CD gating.
How does topology spread constraints work?
Components and workflow
- Workload definition: The Pod or deployment includes topologySpreadConstraints (selector, topologyKey, maxSkew, whenUnsatisfiable).
- Node labels: Nodes are labeled with topologyKey values (zone, hostname, rack).
- Scheduler evaluation: When scheduling, the scheduler evaluates existing pod counts per topology domain that match the selector.
- Scoring and filtering: Scheduler chooses nodes that minimize skew respecting other constraints (resources, taints).
- Placement: Pod is placed; counters update and influence subsequent placements.
- Rebalance: constraints are evaluated only at scheduling time; existing pods are not moved when nodes change (drain, add, fail). Restoring spread afterward requires pods to be recreated, for example via rolling restarts or a descheduler, subject to disruption policies.
Data flow and lifecycle
- Input: Pod specs, node labels, existing pod placements.
- Processing: Scheduler computes skew per domain and applies balancing heuristics.
- Output: Pod assigned to node; metrics emitted (scheduling latencies, pending reasons, skew per topology).
Edge cases and failure modes
- Not enough topology domains to satisfy constraints -> pods remain Pending or are scheduled anyway depending on whenUnsatisfiable.
- Volume binding limits: If PVCs are bound to a specific topology, pods cannot be placed elsewhere.
- Affinity conflict: Node/pod affinity rules can conflict and make spread impossible.
- Autoscaler/Node pool draining: sudden node removals can temporarily violate spread during scale-down.
Short practical examples (pseudocode)
- Pod spec snippet: include topologySpreadConstraints with selector for app=myservice, topologyKey=topology.kubernetes.io/zone, maxSkew=1, whenUnsatisfiable=DoNotSchedule.
- Evaluate: scheduler counts pods with label app=myservice per zone and only schedules a new pod where count is minimal subject to maxSkew.
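The snippet described above might look like this as a Kubernetes manifest (a minimal sketch; the `myservice` name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myservice
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myservice
  template:
    metadata:
      labels:
        app: myservice
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                 # tolerate at most 1 pod difference
          topologyKey: topology.kubernetes.io/zone   # one domain per zone
          whenUnsatisfiable: DoNotSchedule           # leave pod Pending rather than violate
          labelSelector:
            matchLabels:
              app: myservice
      containers:
        - name: web
          image: myservice:latest   # illustrative image
```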
Typical architecture patterns for topology spread constraints
- Zone-balanced stateless web tier: Spread pods across zones with maxSkew 1; use ScheduleAnyway for flexible scaling.
- Node-level spread for densified clusters: Use hostname key to distribute across nodes for noise isolation on single-node failures.
- Hybrid: Zone + node dual constraints where zone prevents zone-level failure and node reduces per-node correlated risk.
- Stateful replicas with volume constraints: Combine StatefulSet with volume topology hints and spread constraints across racks to protect storage controller faults.
- Multi-cluster fanout: Use topology constraints within each cluster and a multi-cluster controller to distribute traffic across clusters.
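The hybrid zone-plus-node pattern can be sketched as two constraints in the same Pod spec, one strict and one best-effort (labels are illustrative):

```yaml
# Two constraints evaluated together: strict zone balance to survive a zone
# outage, best-effort node balance to limit per-node correlated risk.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule      # hard: never exceed zone skew
    labelSelector:
      matchLabels:
        app: myapp
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway     # soft: prefer node balance, don't block
    labelSelector:
      matchLabels:
        app: myapp
```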
Failure modes & mitigation (TABLE REQUIRED)
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Pending pods | Pods stuck Pending | Not enough domains or resources | Relax whenUnsatisfiable or add capacity | Pending pod count |
| F2 | Skew violation after scale down | Many pods concentrated | Scale down removed spread nodes | Pause scale down or graceful eviction | Skew per topology increased |
| F3 | Volume binding block | Pod can’t move to desired domain | PVC bound to specific zone | Use multi-zone storage or replicate volumes | Volume attach failure events |
| F4 | Affinity conflict | Scheduler rejects due to conflict | Conflicting affinity or anti-affinity | Review and simplify affinity rules | Scheduling failure reasons logs |
| F5 | Scheduler plugin bug | Unexpected placements | Custom scheduler extenders miscompute | Validate extender, upgrade or rollback | Unexpected node assignments metric |
| F6 | Node label drift | Spread not using intended keys | Node labels missing or inconsistent | Enforce label policies, use admission controller | Mismatch between node labels and desired topology |
| F7 | Eviction storms | Cascading restarts on drain | Poor interaction with PDB and spread | Adjust PDBs, scale up before drain | Eviction rate spike |
| F8 | Over-constraining | Reduced placement options | Too many constraints combined | Simplify constraints or use ScheduleAnyway | High pending and node utilization |
Row Details
- F1: Pending pods often occur when whenUnsatisfiable is DoNotSchedule and not enough distinct topology domains exist; resolution includes choosing ScheduleAnyway or increasing domains.
- F3: Volume binding block can be mitigated by using storage classes that support volume replication or dynamic provisioning across zones.
- F6: Node label drift happens when automated provisioning scripts fail to label nodes correctly; use admission controllers or node bootstrap to enforce labels.
Key Concepts, Keywords & Terminology for topology spread constraints
- topology spread constraints — Declarative scheduler rules to distribute pods across topology domains — Central to placement policy — Misconfiguring selectors can make them ineffective
- topology key — Node label used as a domain (e.g., zone, hostname) — Target for spread — Wrong keys lead to no effect
- maxSkew — Max allowed difference in pod counts across domains — Controls strictness — Too low causes pending pods
- whenUnsatisfiable — Behavior when constraint cannot be met (ScheduleAnyway or DoNotSchedule) — Governs scheduler flexibility — Setting DoNotSchedule may block pods
- labelSelector — Selector field that defines which pods participate in counting — Scope of spread — Wrong selector equals missed pods
- hostname — Node-level topology key representing node name — Fine-grained domain — Can cause fragmentation in dense clusters
- zone — Regional domain label representing availability zone — Typical fault domain — Not all clusters have multiple zones
- region — Larger domain grouping multiple zones — Used for cross-region spread — Higher latency trade-offs
- rack — On-prem topology domain representing rack locality — Important for hardware faults — Requires node labeling
- skew — Numeric difference in pod counts between domains — Indicator of imbalance — Needs monitoring
- Kubernetes scheduler — Core component deciding pod placement — Implements topology rules — Plugins can change behavior
- scheduler extender — External service to modify scheduling decisions — For advanced policies — Operational complexity
- node affinity — Preference or requirement to schedule onto nodes with labels — Filters nodes — Can conflict with spread
- pod affinity — Co-locate pods based on labels — Opposite of spread — Combine carefully
- pod anti-affinity — Prevent co-location of pods — Similar purpose but different semantics — Can be heavier to compute
- PodDisruptionBudget — Controls voluntary disruption of pods — Works with spread to avoid mass evictions — Misconfigured PDBs can block maintenance
- StatefulSet — Workload controller for stateful apps — Maintains identity and storage — Needs extra care for spread due to volumes
- Deployment — Controller for stateless apps — Common place to apply spread — Rolling update strategies interact
- DaemonSet — Ensures one pod runs on each node — Spread semantics do not apply to one-per-node placement — Different placement intent
- ReplicaSet — Underlying controller for Deployments — Controls replica count — Does not handle distribution
- PVC binding — PersistentVolumeClaim to PersistentVolume mapping — Can limit spread — Storage class choices matter
- CSI driver — Storage interface implementing topology-aware provisioning — Enables multi-zone volumes — Important for stateful spread
- topological locality — Concept of physical or logical proximity — Affects latency and failover trade-offs — Granularity impacts scheduling
- scheduling policy — Rules deciding placement priority — Spread is one such policy — Policy mismatch causes conflicts
- admission controller — Validates and mutates resources at creation — Can enforce node labels and topology policies — Useful to avoid label drift
- chaos testing — Practice of introducing failures to validate resilience — Validates spread constraints — Helps find hidden assumptions
- autoscaler — Scales nodes/pods automatically — Interacts with spread during scale events — Scale-down can break spread
- cluster autoscaler — Scales node pools based on pending pods — Behavior affects spread when scale down removes nodes unevenly — Policy tuning required
- load balancer topology — How traffic is routed across zones — Needs to align with spread to avoid hotspots — Mismatch causes uneven traffic
- failover plan — Operational steps to respond to zone/node outage — Complementary to spread — Not replaced by spread
- observability signal — Telemetry indicating imbalance or failures — Crucial for detection — Missing signals hide problems
- scheduling latency — Time to place a pod — Can increase with complex constraints — Monitor to detect regressions
- taints and tolerations — Prevent pods from scheduling on certain nodes — Combine with spread for exclusion — Incorrect use can reduce feasible nodes
- placement score — Numeric evaluation used by scheduler to rank nodes — Spread influences score — Custom plugins can change weighting
- cluster topology — Map of node labels and domains — Needed to design constraints — Keep updated
- admission policies — Policies to enforce constraints consistently — Avoids misconfigurations — Requires process for exceptions
- replication factor — Number of replicas in a workload — Determines spread goals — Mismatch with domain count causes imbalances
- error budget — SRE metric limiting acceptable failure — Spread helps preserve error budget — Not a substitute for correct incident response
- SLIs — Service Level Indicators measuring availability — Used to justify spread investments — Choose realistic SLIs
- SLOs — Service Level Objectives derived from SLIs — Guide placement requirements — Should reflect business needs
- rollout strategy — Canary, blue/green, rolling updates — Must coordinate with spread to avoid imbalance — Canary can concentrate replicas
- orchestration platform — Kubernetes or alternative scheduler — Determines exact semantics — Behavior varies by platform
- multi-cluster — Multiple clusters across regions — Spread is intra-cluster unless controlled by multi-cluster controller — Cross-cluster distribution is separate
- reconciliation loop — Controller loop that attempts to bring actual state to desired — Ensures spread after changes — Watch for loops causing churn
- label management — Process to maintain node labels — Critical for topology keys — Automation prevents drift
- topology-aware routing — Routing traffic towards nearby instances — Complements spread for performance — Align routing with placement
How to Measure topology spread constraints (Metrics, SLIs, SLOs) (TABLE REQUIRED)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Topology skew per key | Balance of replicas per domain | Count pods by selector per topology label | maxSkew <= 1 | Labels missing cause false zeros |
| M2 | Pending due to constraints | Pods pending because of constraints | Filter pending reasons for Unschedulable | <1% of replicas | Diagnostic granularity needed |
| M3 | Evictions per domain | Eviction spikes tied to topology | Eviction events grouped by label | Low steady state | Scale down may spike temporary |
| M4 | Scheduling latency | Time to bind pod to node | Scheduler binding time histogram | p95 < 5s for stateless | Complex constraints raise latency |
| M5 | Availability per domain | Error rates per topology domain | Request error rate by node/zone | Error budget aligned SLO | Sparse traffic causes noisy signals |
| M6 | Cross-domain traffic ratio | Percentage of traffic across domains | Trace/span aggregation by domain | Balanced across healthy domains | Service mesh sampling limits |
| M7 | Volume attach failures by domain | Storage placement issues | Volume attach error counts | Zero or rare | Driver retries mask failures |
| M8 | Node label drift rate | Frequency of node label changes | Label change events per day | Near zero | Automation may relabel unexpectedly |
| M9 | Rebalance frequency | How often pods move to rebalance | Count of reschedules due to placement | Low during steady state | Autoscaler actions can increase rate |
| M10 | Pod density per node | Number of matching pods per node | Pod count grouped by hostname | Avoid concentration beyond limits | DaemonSets skew counts |
Row Details
- M1: Compute per domain counts using a label selector for the app and node label for the topology key; monitor max and p95 of skew distribution.
- M2: Pending due to constraints requires parsing scheduler events; some schedulers expose specific Unschedulable reasons.
- M4: Scheduling latency can be instrumented via kube-scheduler metrics or custom scheduler instrumentation.
Best tools to measure topology spread constraints
Tool — Prometheus
- What it measures for topology spread constraints: Scheduler metrics, pod counts, pending reasons, custom instrumented counters.
- Best-fit environment: Kubernetes clusters with Prometheus scraping.
- Setup outline:
- Enable kube-state-metrics and scheduler metrics.
- Scrape metrics for pod counts grouped by node labels.
- Create recording rules to compute skew and pending rates.
- Build Grafana dashboards for visualization.
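A recording-rule sketch for skew, assuming kube-state-metrics exposes `kube_pod_info` and `kube_node_labels` (node labels must be allow-listed in kube-state-metrics v2+, and the exact zone label name varies by setup):

```yaml
groups:
  - name: topology-spread
    rules:
      # Pods per zone: join each pod's node onto that node's zone label.
      # Filtering to a single app requires a further join and is omitted here.
      - record: cluster:pods_per_zone:count
        expr: |
          count by (zone) (
            kube_pod_info
            * on (node) group_left (zone)
            label_replace(
              kube_node_labels, "zone", "$1",
              "label_topology_kubernetes_io_zone", "(.+)"
            )
          )
      # Skew: difference between the most and least loaded zones.
      - record: cluster:zone_skew:max
        expr: max(cluster:pods_per_zone:count) - min(cluster:pods_per_zone:count)
```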
- Strengths:
- Flexible queries and alerting.
- Wide ecosystem and exportability.
- Limitations:
- Requires careful query design for cardinality.
- Not opinionated; needs dashboards and alerts built.
Tool — Grafana
- What it measures for topology spread constraints: Visualization of Prometheus or other telemetry, dashboards for skew and availability.
- Best-fit environment: Teams using Prometheus, OpenTelemetry, or cloud metrics.
- Setup outline:
- Create dashboards for executive, on-call, and debug views.
- Use templating for topology keys.
- Configure alert rules integration.
- Strengths:
- Powerful visualization options.
- Multi-source panels.
- Limitations:
- Alerting depends on data source rules.
- May need plugins for advanced visualizations.
Tool — Cloud provider metrics (native)
- What it measures for topology spread constraints: Zone-level VM counts, instance health, and load balancer distribution.
- Best-fit environment: Managed Kubernetes or serverless on cloud.
- Setup outline:
- Enable provider metrics for node pools and zones.
- Map provider topology keys to cluster labels.
- Use provider dashboards for capacity planning.
- Strengths:
- Native visibility into underlying infrastructure.
- Limitations:
- Granularity and retention vary by provider.
Tool — Service mesh telemetry
- What it measures for topology spread constraints: Traffic distribution, latency across domains, locality-aware routing behavior.
- Best-fit environment: Mesh-enabled microservices in Kubernetes.
- Setup outline:
- Enable locality-aware load balancing and emit telemetry.
- Aggregate traces and metrics by node/zone labels.
- Use mesh dashboards for cross-domain traffic ratios.
- Strengths:
- High-resolution view of traffic and latency.
- Limitations:
- Instrumentation overhead; sampling may hide rare events.
Tool — Cluster autoscaler logs/metrics
- What it measures for topology spread constraints: Scaling events that affect placement; scale-down candidates and node removals.
- Best-fit environment: Clusters with autoscaling enabled.
- Setup outline:
- Collect events for scale operations.
- Correlate with scheduling and eviction telemetry.
- Alert on aggressive scale-downs that cause skew.
- Strengths:
- Directly ties placement to node lifecycle.
- Limitations:
- Interpretation requires context of workload patterns.
Recommended dashboards & alerts for topology spread constraints
Executive dashboard
- Panels:
- Global skew heatmap across zones: highlights imbalance.
- Availability SLI vs SLO trend: business view.
- Incident summary affecting topology domains: counts.
- Why: Gives leadership a concise view of health and risk.
On-call dashboard
- Panels:
- Pod pending due to constraints with top 10 failing selectors.
- Skew per topology key and worst domains.
- Recent evictions and drain events.
- Node label drift alerts and recent label changes.
- Why: Provides quick troubleshooting context during pages.
Debug dashboard
- Panels:
- Per-pod scheduling events and histories.
- Volume attach errors and bindings by domain.
- Affinity/conflict diagnostics from scheduler logs.
- Rebalance timeline showing node additions and removals.
- Why: Enables deep-dive root cause analysis.
Alerting guidance
- Page vs ticket:
- Page: When SLI breach is likely and error budget burn is rapid, or when large-scale Pending/eviction storms are detected.
- Ticket: Low severity imbalance or slow drift that can be resolved with planned ops.
- Burn-rate guidance:
- Use burn-rate thresholds to trigger escalating actions; for example, escalate to a page when the burn rate stays above 2x for 10 minutes.
- Noise reduction tactics:
- Deduplicate by grouping alerts per topology key.
- Suppress transient pending spikes during autoscaling.
- Use alert grouping and silences during known maintenance windows.
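Following the page/ticket guidance above, a minimal alerting-rule sketch using the kube-state-metrics series `kube_pod_status_unschedulable` (threshold, duration, and routing labels are illustrative):

```yaml
groups:
  - name: topology-spread-alerts
    rules:
      - alert: PodsUnschedulable
        # Sustained unschedulable pods often indicate spread constraints that
        # cannot be satisfied (DoNotSchedule with too few domains or capacity).
        expr: sum(kube_pod_status_unschedulable) > 0
        for: 10m   # suppress transient spikes during autoscaling
        labels:
          severity: ticket   # escalate to page on rapid error-budget burn
        annotations:
          summary: "Pods unschedulable; check spread constraints, node labels, and capacity"
```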
Implementation Guide (Step-by-step)
1) Prerequisites
- Cluster with labeled nodes (zone, hostname, rack as applicable).
- Workload controllers that support topology constraints (Kubernetes Deployment, StatefulSet).
- Observability stack capable of collecting pod, node, and scheduler metrics.
- Access to storage class and CSI capabilities if stateful.
2) Instrumentation plan
- Enable kube-state-metrics and scheduler metrics.
- Add node label change telemetry.
- Instrument apps to emit domain-tagged traces/metrics.
3) Data collection
- Scrape metrics for pod counts grouped by topology keys.
- Gather scheduling events and reasons.
- Collect volume attach and eviction logs.
4) SLO design
- Define SLIs affected by placement, such as zone availability and request error rate.
- Set SLOs that reflect tolerance to domain loss (e.g., 99.95% availability tolerating single-zone failure).
- Allocate error budget accordingly.
5) Dashboards
- Build executive, on-call, and debug dashboards described earlier.
- Template dashboards by namespace and app label.
6) Alerts & routing
- Implement alerts for skew thresholds, pending pods due to constraints, and eviction storms.
- Route topology-impacting incidents to platform on-call and app owners.
7) Runbooks & automation
- Runbook for Pending due to constraints: check node labels, capacity, PVC bindings, and affinity conflicts.
- Automation: automated node labeling on bootstrap, autoscaler safeguards, and admission policies.
8) Validation (load/chaos/game days)
- Conduct game days simulating node and zone failures.
- Run chaos experiments to ensure spread prevents correlated outage.
- Validate autoscaler behavior during load shifts.
9) Continuous improvement
- Review incidents for placement-related root causes.
- Tune maxSkew and whenUnsatisfiable values as the cluster evolves.
- Automate remediations for common failures.
Checklists
Pre-production checklist
- Nodes labeled for topology keys required by policies.
- Test workloads with varying replica counts and constraints.
- Observability configured for skew and pending reasons.
- Storage classes reviewed for multi-zone support.
Production readiness checklist
- Alerts tested with simulated failures.
- Autoscaler and drain policies validated against PDBs and spread.
- Runbooks published and on-call briefed.
- Capacity planning includes extra nodes for spread headroom.
Incident checklist specific to topology spread constraints
- Verify whether constraints are causing Pending pods.
- Check node label health and recent changes.
- Inspect PVC bindings and storage errors.
- Confirm autoscaler activity and recent scale-down events.
- Decide whether to relax constraints temporarily or add capacity.
Examples
- Kubernetes example: Add topologySpreadConstraints to Deployment PodSpec with selector app=myapp, topologyKey=topology.kubernetes.io/zone, maxSkew=1, whenUnsatisfiable=ScheduleAnyway. Verify with kubectl get pods -o wide and custom Prometheus query for skew.
- Managed cloud service example: For a managed database with zone-aware replicas, verify provider-managed placement settings and ensure provider labels are exposed to the cluster; create workload with matching topology keys and run failover game days.
Use Cases of topology spread constraints
1) Web front-end spread across AZs
- Context: Stateless web servers under heavy traffic.
- Problem: A zone outage should not take down the front-end.
- Why it helps: Ensures at least one pod remains in each zone.
- What to measure: Skew per zone, error rate per zone.
- Typical tools: Kubernetes topologySpreadConstraints, Prometheus, Grafana.
2) Database read replicas distribution
- Context: Read replicas in the same cluster across racks.
- Problem: Rack-level controller failure could remove many replicas.
- Why it helps: Ensures replicas are on separate racks.
- What to measure: Replica count per rack, replication lag.
- Typical tools: StatefulSet, CSI, cluster inventory.
3) Cache cluster high availability
- Context: In-memory cache with ephemeral data.
- Problem: Node failure causing many cache shards to be lost.
- Why it helps: Spreads shards to reduce cache miss storms.
- What to measure: Cache hit ratio by domain, re-shard events.
- Typical tools: Affinity config, cache orchestration.
4) Storage controller resilience
- Context: Storage controllers with local endpoints.
- Problem: Controller node failure impacts attached pods.
- Why it helps: Spreading scheduled pods avoids concentration on controller-affiliated nodes.
- What to measure: IO error rate by domain.
- Typical tools: CSI and storage class config.
5) CI runner distribution
- Context: Self-hosted CI runners across nodes.
- Problem: Runner pool concentrated on a few nodes during heavy builds.
- Why it helps: Spreads runners to prevent job pile-ups on node failure.
- What to measure: Job queue length by node, runner availability.
- Typical tools: Pod topology constraints, autoscaler.
6) Service mesh ingress locality
- Context: Ingress gateways in multiple zones.
- Problem: Traffic imbalance causing some gateways to be overloaded.
- Why it helps: Spreads gateways to match topology for routing locality.
- What to measure: Gateway latency and throughput by zone.
- Typical tools: Service mesh, load balancer tuning.
7) Multi-tenant isolation
- Context: Tenants with soft isolation requirements.
- Problem: Single-node failure impacts many tenants colocated accidentally.
- Why it helps: Spreads tenants' replicas across different nodes.
- What to measure: Tenant outage scope and affected pod counts.
- Typical tools: Namespaces with topology constraints.
8) Stateful app with local cache and state
- Context: Stateful app with local cache and persistent volumes.
- Problem: Storage or node failure can cause data loss and cache misses.
- Why it helps: Spreads instances across topologies with independent storage access.
- What to measure: Storage errors and cache rebuild time.
- Typical tools: StatefulSet, multi-zone PVs.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: Multi-AZ web service
Context: A global e-commerce site runs a stateless web service in a Kubernetes cluster spanning 3 AZs. Goal: Ensure rolling zone outage still leaves sufficient capacity to serve traffic. Why topology spread constraints matters here: Distributing pods by zone reduces chance all replicas in a traffic-serving tier fail together. Architecture / workflow: Deployment with topologySpreadConstraints using topology.kubernetes.io/zone, maxSkew=1, whenUnsatisfiable=ScheduleAnyway; Service with cross-zone LB. Step-by-step implementation:
- Label nodes with zone labels and verify.
- Update Deployment PodSpec with selector and topologySpreadConstraints.
- Instrument Prometheus to count pods per zone.
- Run canary and validate distribution.
What to measure: Skew per zone, request error rate per zone, scheduling latency. Tools to use and why: Kubernetes, Prometheus, Grafana, cluster-autoscaler; they provide placement and telemetry. Common pitfalls: The autoscaler may remove nodes in a zone, causing temporary imbalance; use scale-down safeguards. Validation: Simulate an AZ outage and verify service SLOs are maintained. Outcome: Successful zone outage test with SLOs preserved.
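The Prometheus instrumentation step above can be sketched as a skew query. Here `myapp:pods_per_zone:count` is a hypothetical recording rule that counts matching pods per zone; building it typically means joining kube-state-metrics series (such as kube_pod_info) with node zone labels, and exact label names vary by setup:

```promql
# Skew = most populated zone minus least populated zone.
# Alert when this exceeds the configured maxSkew for a sustained period.
max(myapp:pods_per_zone:count) - min(myapp:pods_per_zone:count)
```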
Scenario #2 — Serverless/Managed-PaaS: Managed function placement
Context: Team uses managed functions with provider-exposed zone metrics. Goal: Reduce latency variance and impact of provider zone incidents. Why topology spread constraints matters here: While placement is managed, topology-aware routing or multi-region deployment helps reduce correlated failures. Architecture / workflow: Deploy function versions across multiple regions or zones using provider settings; use traffic splitting. Step-by-step implementation:
- Confirm provider offers zone affinity or multi-zone deployment controls.
- Configure traffic split and monitor per-zone invocation latency.
- Create health checks and automatic failover routing.
What to measure: Invocation latency by zone, error rate per region. Tools to use and why: Provider metrics, CDN, global load balancer. Common pitfalls: The provider may not expose placement control; assume less control and use multi-region redundancy. Validation: Inject simulated region latency and verify failover. Outcome: Reduced single-zone impact despite managed placement.
Scenario #3 — Incident-response/Postmortem: Eviction storm from scale-down
Context: Production cluster experienced an eviction storm when scale-down removed nodes hosting many replicas. Goal: Prevent recurrence and document remediations. Why topology spread constraints matters here: Lack of proper spread and PDBs caused many pods to be evicted simultaneously. Architecture / workflow: Review Deployment spread settings, cluster autoscaler config, and PDBs. Step-by-step implementation:
- Recreate incident pattern in staging and capture metrics.
- Update Deployment with topologySpreadConstraints and set PDBs correctly.
- Configure the autoscaler to respect protected pods.
What to measure: Eviction rate, PDB violation attempts, skew before and after. Tools to use and why: Logs, Prometheus, autoscaler metrics. Common pitfalls: Overly strict PDBs blocking necessary maintenance. Validation: Run a simulated scale-down with safeguards and observe no eviction storms. Outcome: The autoscaler respects constraints and evictions are reduced.
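The PDB step above might look like the following fragment — the app label and minAvailable value are illustrative, and the right threshold depends on replica count and SLOs:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  # Keep at least two replicas available during voluntary disruptions
  # such as node drains and autoscaler scale-down.
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
```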
Scenario #4 — Cost/Performance trade-off: Dense cluster vs spread
Context: A cost-sensitive team runs many replicas on minimal nodes to save on cloud spend. Goal: Find balance between cost savings and availability. Why topology spread constraints matters here: Enforcing strict spread increases required nodes; team must weigh availability vs cost. Architecture / workflow: Evaluate cost of adding nodes to achieve desired spread against risk of outages. Step-by-step implementation:
- Measure current skew and failure impact.
- Estimate nodes required for desired maxSkew.
- Pilot spread with ScheduleAnyway to reduce Pending pods while still signaling imbalance.
What to measure: Cost per additional node, reduced incident frequency, SLO impact. Tools to use and why: Cost metrics, Prometheus, budgeting tools. Common pitfalls: Underestimating capacity headroom needs during scaling. Validation: Run a load test while simulating node failure to verify SLOs at the new capacity. Outcome: An informed trade-off decision balancing cost and availability.
Common Mistakes, Anti-patterns, and Troubleshooting
1) Symptom: Many pods Pending with Unschedulable reason -> Root cause: whenUnsatisfiable=DoNotSchedule with too few topology domains -> Fix: Set ScheduleAnyway or increase domains.
2) Symptom: Eviction storm during node pool scale-down -> Root cause: No PDBs and aggressive autoscaler -> Fix: Add PDBs and configure scale-down delays.
3) Symptom: PVC cannot attach when pod scheduled -> Root cause: Volume topology mismatch -> Fix: Use multi-zone storage classes or reconfigure PVC binding.
4) Symptom: High scheduling latency p95 -> Root cause: Complex affinity and spread rules -> Fix: Simplify affinity, use ScheduleAnyway, or tune scheduler performance.
5) Symptom: Imbalance appears over time -> Root cause: Node label drift or automated tooling relabeling -> Fix: Enforce label policies via admission controllers.
6) Symptom: Unexpected placement despite constraints -> Root cause: PodSelector mismatch -> Fix: Verify pod labels match the selector used by the constraint.
7) Symptom: Over-constrained combined with anti-affinity -> Root cause: Contradictory rules -> Fix: Reconcile rules; prefer a simpler spread policy.
8) Symptom: Observability blind spots for skew -> Root cause: No aggregation by topology key -> Fix: Add metrics and recording rules to compute skew.
9) Symptom: Alert noise when autoscaling -> Root cause: Alerts not silenced during scaling -> Fix: Suppress alerts for the known scale window or use alert suppression rules.
10) Symptom: StatefulSet stuck on a single node -> Root cause: PVC bound in the same zone -> Fix: Migrate storage or use replicated storage classes.
11) Symptom: Large blast radius in an outage -> Root cause: No spread constraint for a critical service -> Fix: Add topologySpreadConstraints with a reasonable maxSkew.
12) Symptom: Too many small topology keys -> Root cause: Using host-specific keys unnecessarily -> Fix: Use coarser keys like zone or rack.
13) Symptom: Metrics cardinality explosion -> Root cause: High label cardinality in metrics for each node -> Fix: Aggregate by topology label and limit per-pod labels.
14) Symptom: Scheduler plugin inconsistency across clusters -> Root cause: Different scheduler versions or extenders -> Fix: Standardize scheduler versions or test the differences.
15) Symptom: Increased toil for ops -> Root cause: Manual label management and ad-hoc fixes -> Fix: Automate node labeling and remediation.
16) Symptom: Spread constraints block emergency maintenance -> Root cause: Too-strict DoNotSchedule settings -> Fix: Have an admin override process and a temporary relaxation policy.
17) Symptom: Misleading dashboards showing balanced pods but outages still large -> Root cause: Traffic routed unevenly by the LB despite spread -> Fix: Align routing localities with spread.
18) Symptom: Prometheus queries expensive -> Root cause: Inefficient grouping by node label per pod -> Fix: Use recording rules to precompute counts.
19) Symptom: Tests pass but production fails -> Root cause: Test clusters not mirroring topology domain counts -> Fix: Mirror production topology in staging or simulate domain counts.
20) Symptom: Cluster autoscaler doesn’t scale up adequately -> Root cause: Constraints block scheduling on new nodes due to taints -> Fix: Ensure tolerations are present where appropriate.
Observability pitfalls (at least 5 included above)
- Missing aggregation by topology key.
- Overlooking volume attach errors in scheduler diagnostics.
- Not correlating autoscaler logs with evictions.
- High cardinality metric designs hiding skew trends.
- Lack of event-level data that explains Unschedulable reasons.
Best Practices & Operating Model
Ownership and on-call
- Platform team owns node labeling, scheduler config, and cluster-level policies.
- App teams own workload-level topology selectors and PDBs.
- Runbook ownership shared; platform handles remediation and app teams validate app-level behavior.
Runbooks vs playbooks
- Runbook: Detailed step-by-step operational procedures for recurring actions (e.g., relabel nodes).
- Playbook: Higher-level decision framework for non-routine incidents (e.g., choosing to relax constraints due to outbreak).
Safe deployments (canary/rollback)
- Coordinate canary placements to respect spread to avoid concentrating canary pods in a single domain.
- Implement automated rollback triggers on SLI regressions measured per topology domain.
Toil reduction and automation
- Automate node labeling at bootstrap and upgrades.
- Automate recording rules for skew metrics.
- Automate temporary relaxation of constraints via CI/CD when capacity increases.
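A recording-rule fragment for the skew automation above might look like this. The rule names, namespace, and join expression are illustrative; the per-zone count depends on which kube-state-metrics labels your cluster exports (node zone labels on kube_node_labels may require an allowlist on newer versions):

```yaml
groups:
  - name: topology-skew
    rules:
      # Pods per zone, joined through the node each pod runs on.
      # Assumes kube-state-metrics exposes kube_pod_info and kube_node_labels
      # with a zone label (label names vary by version and configuration).
      - record: myapp:pods_per_zone:count
        expr: |
          sum by (label_topology_kubernetes_io_zone) (
              kube_pod_info{namespace="prod"}
            * on (node) group_left (label_topology_kubernetes_io_zone)
              kube_node_labels
          )
      # Skew = max per-zone count minus min per-zone count.
      - record: myapp:pods_per_zone:skew
        expr: max(myapp:pods_per_zone:count) - min(myapp:pods_per_zone:count)
```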
Security basics
- Limit who can modify node labels and topology-related admission controllers.
- Audit topology policy changes and maintain change logs.
Weekly/monthly routines
- Weekly: Review skew trends and pending pods over last week.
- Monthly: Run a game day simulating domain loss and validate runbooks.
- Quarterly: Review topology keys and label hygiene.
What to review in postmortems
- Root cause relating to scheduling and topology decisions.
- Whether spread constraints prevented or exacerbated the issue.
- Any config drift or automation gaps.
- Action items: add tests, alerts, or automation.
What to automate first
- Node labeling and label drift detection.
- Recording rules for skew metrics.
- Alert grouping and suppression during planned scale operations.
Tooling & Integration Map for topology spread constraints
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Monitoring | Collects scheduler and pod metrics | kube-state-metrics, Prometheus | Central for skew metrics |
| I2 | Visualization | Dashboards for skew and alerts | Grafana, dashboards | Templated dashboards aid ops |
| I3 | Autoscaling | Scales nodes considering pending pods | Cluster autoscaler | Needs tuning to respect PDBs |
| I4 | Scheduler | Enforces topology constraints | kube-scheduler, scheduler extenders | Plugin behavior matters |
| I5 | Storage | Provides multi-zone PVs | CSI drivers, storage class | Limits pod movement when bound |
| I6 | Admission | Enforces node labels and policies | OPA Gatekeeper, MutatingWebhook | Prevents label drift |
| I7 | Service mesh | Locality-aware routing and telemetry | Istio, Linkerd | Helps align traffic with placement |
| I8 | CI/CD | Applies constraints via pipelines | ArgoCD, Flux | Ensures consistent rollout of constraints |
| I9 | Chaos tool | Simulates failures for validation | Chaos engine | Validates spread under failure |
| I10 | Cloud provider | Exposes zone metrics and topology | Provider APIs | Shows underlying infra constraints |
Row Details
- I1: Monitoring must include recording rules to compute skew to avoid expensive ad-hoc queries.
- I5: Storage integration often restricts where pods can run; verify CSI supports required topology features.
Frequently Asked Questions (FAQs)
H3: What exactly is maxSkew and how do I pick a value?
maxSkew is the maximum allowed difference in pod counts across domains; start with 1 for high availability and relax if pending pods occur.
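The maxSkew arithmetic can be illustrated with a small sketch (the zone names and pod counts are hypothetical):

```python
def skew(pods_per_domain):
    """Skew = difference between the most and least populated domains."""
    counts = list(pods_per_domain.values())
    return max(counts) - min(counts)

# Three zones with 3, 2, and 2 matching pods: skew is 1, so maxSkew=1 holds.
print(skew({"zone-a": 3, "zone-b": 2, "zone-c": 2}))  # -> 1

# Note: the scheduler also counts eligible domains that currently have
# zero matching pods, so an empty zone contributes a count of 0.
print(skew({"zone-a": 4, "zone-b": 0}))  # -> 4
```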
H3: How do I debug pods Pending due to topology constraints?
Check scheduler events, pending reasons, node labels, PVC bindings, and affinity rules; correlate with autoscaler and capacity.
H3: What’s the difference between pod anti-affinity and topology spread constraints?
Pod anti-affinity prevents colocating certain pods, while topology spread constraints aim to balance counts across domains; anti-affinity can be stricter.
H3: How do topology spread constraints interact with StatefulSets?
StatefulSets guarantee stable identities and volumes; topology constraints apply but volume binding can limit possible placements.
H3: How do I measure skew in production?
Use a recording rule to count pods per topology label grouped by selector and compute max-min counts; expose skew metric and alert on thresholds.
H3: How do I choose topology keys?
Choose coarse-grained, meaningful domain labels like zone or rack; avoid very high-cardinality labels like pod metadata.
H3: Can topology spread constraints prevent all outages?
No; they reduce correlated failures but do not replace backups, replication, or multi-cluster DR.
H3: What’s the difference between zone and region spread?
Zone is a smaller fault domain within a region; region spread provides protection against zone-level disasters but may increase latency.
H3: How do I use spread constraints with autoscaler?
Ensure autoscaler policies provide headroom and have scale-down delays; correlate scale events with skew telemetry.
H3: How do I test topology spread constraints safely?
Run game days and chaos tests in staging that mirror production topology; simulate node and zone failures.
H3: How do I avoid alert noise from spread-related alerts?
Use suppression during planned scaling, dedupe alerts by topology key, and set sensible thresholds tuned to traffic and scale patterns.
H3: How do I update constraints without disrupting running workloads?
Apply constraints incrementally, use ScheduleAnyway when starting, and run a staged rollout with monitoring.
H3: How do I ensure storage does not block spread?
Use multi-zone replicated storage classes or ensure PVCs can be provisioned in desired topologies.
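One commonly used safeguard for the PVC problem above is delayed volume binding, so the PV is provisioned in whichever zone the scheduler picks for the pod; the provisioner name below is a placeholder for your CSI driver:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: wait-for-pod
provisioner: example.csi.vendor.com  # placeholder CSI driver
# Delay PV provisioning until a pod using the PVC is scheduled, so the
# volume's zone follows the scheduler's topology-aware placement.
volumeBindingMode: WaitForFirstConsumer
```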
H3: How do I handle label drift?
Automate node labeling during bootstrap and use admission controllers to prevent drift.
H3: What tooling should small teams prioritize?
Start with Prometheus and Grafana for skew metrics and simple constraints on critical services; automate node labeling.
H3: How do I reconcile conflicting affinity and spread rules?
Simplify rules, prefer spread for availability, and convert complex affinity to tolerations or labels where possible.
H3: What are common scheduling failure reasons I should look for?
Insufficient resources, PVC attach errors, conflicting affinities, and lack of topology domains are common.
Conclusion
Topology spread constraints are a practical and powerful tool to reduce correlated failures by influencing scheduler placement across labeled topology domains. They are most effective when combined with storage strategies, autoscaler tuning, and robust observability. Careful design prevents over-constraining clusters and avoids unintended Pending or eviction storms.
Next 7 days plan
- Day 1: Inventory node labels and topology domains; document label hygiene.
- Day 2: Add recording rules for skew metrics and create a basic Grafana dashboard.
- Day 3: Apply topologySpreadConstraints to one non-critical deployment with ScheduleAnyway.
- Day 4: Run a small-scale failover test to validate distribution and SLI impact.
- Day 5: Adjust autoscaler and PDB settings based on findings.
- Day 6: Implement alerting for pending pods and skew threshold breaches.
- Day 7: Conduct a brief postmortem and schedule a game day for a critical service.
Appendix — topology spread constraints Keyword Cluster (SEO)
- Primary keywords
- topology spread constraints
- topologySpreadConstraints Kubernetes
- pod topology spread
- Kubernetes spread scheduling
- topology key scheduling
- maxSkew whenUnsatisfiable
- pod distribution across zones
- zone aware pod scheduling
- spread constraints guide
- topology spread tutorial
- Related terminology
- pod anti affinity
- node affinity
- pod disruption budget
- StatefulSet spread
- scheduler extender
- node label topology
- zone topology key
- rack awareness
- volume topology
- CSI multi zone
- scheduling skew metric
- scheduling latency monitoring
- pending due to constraints
- pod pending debug
- eviction storm mitigation
- cluster autoscaler interactions
- scale down safeguards
- admission controller label enforcement
- topology aware routing
- multi cluster placement
- chaos testing for placement
- storage class topology
- PV binding and topology
- affinity vs spread
- anti affinity vs spread
- replica distribution
- cross zone resilience
- zone failover planning
- topology-aware load balancing
- scheduling policy conflict
- recording rule topology skew
- Grafana skew dashboard
- Prometheus scheduler metrics
- kube state metrics topology
- topology drift detection
- label management automation
- orchestration placement policy
- high availability placement
- blast radius reduction
- placement decision checklist
- deployment spread best practices
- SLI for topology distribution
- SLO for domain failure tolerance
- error budget placement impact
- release strategy with spread
- canary placement and spread
- node labeling bootstrapping
- topology keys list
- topology spread anti pattern
- production readiness topology
- runbook for spread incidents
- game day topology testing
- observability for placement
- alerts for skew breaches
- dedupe topology alerts
- suppress alerts during scale
- topology label drift metric
- zone eviction monitoring
- PV attach failure monitoring
- topology-aware autoscaling
- cross region replication plan
- locality-aware routing policy
- rack level protection
- hostname topology key use
- multi zone storage replication
- topology spread constraints examples
- topology spread constraints scenarios
- topology spread constraints checklist
- topology spread constraints FAQs
- topology spread constraints glossary
- topology spread constraints implementation
- platform ownership topology
- topology spread constraints tools
- topology spread constraints metrics
- topology spread constraints failures
- topology spread constraints mitigations
- topology spread constraints security
- topology spread constraints automation
- topology spread constraints monitoring setup
- topology spread constraints troubleshooting
- topology spread constraints anti patterns
- topology spread constraints best practices
- topology spread constraints decision ladder
- topology spread constraints maturity model
- topology spread constraints SRE framing
- topology spread constraints observability signals
- topology spread constraints node labels
- topology spread constraints node label enforcement
- topology spread constraints cloud provider
- topology spread constraints managed services
- topology spread constraints serverless considerations
- topology spread constraints storage constraints
- topology spread constraints data layer use cases
- topology spread constraints application layer
- topology spread constraints infrastructure layer
- topology spread constraints CI CD integration
- topology spread constraints runbooks
- topology spread constraints incident checklist
- topology spread constraints production checklist
- topology spread constraints pre production checklist
- topology spread constraints dashboards
- topology spread constraints alerting guidance
- topology spread constraints burn rate
- topology spread constraints dedupe alerts
- topology spread constraints grouping
- topology spread constraints suppression
- topology spread constraints admission policies
- topology spread constraints label hygiene
- topology spread constraints label automation
- topology spread constraints node bootstrap labels
- topology spread constraints label drift prevention