What is NLB? Meaning, Examples, Use Cases & Complete Guide


Quick Definition

Network Load Balancer (NLB) in plain English: a high-performance, low-latency load balancer that forwards TCP/UDP (and sometimes TLS/SSL) traffic at Layer 4, designed for large volumes of connections with minimal protocol inspection.

Analogy: NLB is like a motorway toll gantry that directs cars to different lanes using vehicle weight and speed sensors without opening any trunks — fast, simple routing, no deep inspection.

Formal technical line: A layer-4 forwarding device that distributes network traffic by source/destination IP and port, often preserving client IPs and supporting millions of concurrent connections with minimal packet processing.

Other common meanings (brief):

  • Network Load Balancer (cloud vendor branded)
  • NetBIOS Load Balancing — legacy / context-specific
  • Non-Linear Backend — not common

What is NLB?

What it is / what it is NOT

  • It is a transport-layer (L4) balancer that forwards TCP/UDP (and sometimes TLS) traffic, in many deployments without terminating the protocol.
  • It is NOT a full application-layer (L7) proxy that can inspect HTTP headers, rewrite requests, or perform complex routing decisions based on content.
  • It is NOT a firewall, though it may provide basic filtering or integration points.
  • It is often provided as a managed cloud service or as an appliance in data centers.

Key properties and constraints

  • Low latency and high throughput for TCP/UDP traffic.
  • Preserves client IP and source port in many implementations.
  • Typically stateless or uses simple flow tables; connection affinity is often based on five-tuple hashing.
  • Limited or no application-layer health checks in some flavors; health checks may be at TCP level.
  • Can require additional components for TLS termination, WAF, or L7 routing.

Where it fits in modern cloud/SRE workflows

  • Edge load balancing for non-HTTP workloads (databases, proxy protocols, gRPC over raw TCP).
  • Fronting Kubernetes NodePorts, StatefulSets, or bare-metal clusters for high throughput.
  • Gateway for high-throughput ingress to serverless or managed PaaS that expects TCP or TLS pass-through.
  • Component in a layered load balancing strategy: Global DNS -> Edge NLB -> L7 ingress -> Service pods.

Text-only “diagram description”

  • Public client connects to a public IP owned by the NLB.
  • NLB receives packets and applies a hashing algorithm mapping the client five-tuple to a backend target (sketched in code below).
  • NLB forwards traffic to a healthy target instance preserving source IP.
  • Backend processes the request and responds; NLB forwards the response back to client.
  • Health checks run periodically; unhealthy targets are removed from rotation until healthy.
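
To make the hashing step concrete, here is a minimal Python sketch of five-tuple flow hashing. The backend list and hash choice are illustrative assumptions, not any vendor's actual algorithm:

```python
import hashlib

# Illustrative backend pool; a real NLB tracks health and updates this list.
backends = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 protocol: str) -> str:
    """Map a flow's five-tuple to a backend deterministically.

    The same five-tuple always hashes to the same target, which is how
    L4 affinity can work without per-connection state.
    """
    five_tuple = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}"
    digest = hashlib.sha256(five_tuple.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]

print(pick_backend("203.0.113.7", 56000, "198.51.100.1", 443, "tcp"))
```

Note that with plain modulo hashing, removing a backend remaps most existing flows; production balancers often use consistent hashing or flow tables to limit that churn.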

NLB in one sentence

A high-performance, low-latency transport-layer balancer that routes TCP/UDP connections to backend targets while minimizing processing and preserving client context.

NLB vs related terms

| ID | Term | How it differs from NLB | Common confusion |
|----|------|-------------------------|------------------|
| T1 | ALB — Application LB | L7, inspects HTTP headers and routes by path or host | Both called load balancers |
| T2 | GSLB — Global SLB | Cross-region DNS routing and health-based traffic steering | People assume single-box balancing |
| T3 | HAProxy | Software proxy that can operate L4 or L7 | HAProxy can be configured as NLB-like |
| T4 | IPVS | Kernel-level virtual server for L4 in Linux | Often used inside K8s for service proxying |
| T5 | Cloud Provider NLB | Managed, integrated into cloud networking stack | Feature set varies by vendor |


Why does NLB matter?

Business impact

  • Revenue continuity: For transactional services that require high throughput and low latency, NLB availability directly affects customer experience and revenue flow.
  • Trust and SLA delivery: Downtime or high latency in L4 services often breaks downstream SLAs and reduces user trust.
  • Risk containment: Correctly configured NLBs reduce risk of overload to application backends by distributing connections and enabling graceful degradation.

Engineering impact

  • Incident reduction: Proper use of NLBs with health checks and routing reduces cascading failures.
  • Velocity: A standardized, well-instrumented NLB allows teams to iterate on backend services without redesigning ingress every time.
  • Complexity trade-off: NLBs keep the edge simple but shift application-aware routing to L7 components when needed.

SRE framing

  • SLIs: Connection success rate, connection latency, backend health ratio.
  • SLOs: A typical starting SLO might target 99.9% connection success for critical services, but this varies with business needs.
  • Error budgets: Use NLB SLOs as part of system-level error budgets to determine safe release windows.
  • Toil: Automate health-driven instance replacement, autoscaling, and certificate rotation to reduce manual toil.
  • On-call: Network and platform teams must be on-call for NLB-level failures because app teams may not have sufficient visibility.

What commonly breaks in production (realistic examples)

  • Backend overload due to sticky flows mapping many clients to one instance.
  • Health check misconfiguration causing healthy instances to be removed.
  • Client IP preservation not enabled, breaking security rules or rate limits.
  • TLS offload mismatch: certificates not updated causing connection failures.
  • DNS TTL misalignment causing slow failover during global incidents.

Where is NLB used?

| ID | Layer/Area | How NLB appears | Typical telemetry | Common tools |
|----|------------|-----------------|-------------------|--------------|
| L1 | Edge — Network | Public IP balancing for TCP and UDP | Connection rate, bytes/sec, errors | Cloud NLB, F5, MetalLB |
| L2 | Service — Backend | Balancer before service instances | Backend health, conn per target | IPVS, kube-proxy, Nginx stream |
| L3 | Kubernetes | Service type LoadBalancer or MetalLB | Node traffic, target health | MetalLB, cloud LB integrations |
| L4 | Serverless/PaaS | Fronting managed endpoints for TCP | Request counts, connection latency | Managed NLB, platform gateway |
| L5 | Security | Minimal filtering and source-IP preservation | Dropped connections, rejected packets | Network ACLs, security groups |


When should you use NLB?

When it’s necessary

  • You need ultra-low latency TCP/UDP forwarding with minimal processing overhead.
  • You must preserve client source IP at the backend.
  • You require millions of concurrent connections or high packet throughput.
  • You are fronting non-HTTP protocols (databases, MQTT, SIP, raw TCP gRPC).

When it’s optional

  • When you have simple HTTP workloads that can be handled by an L7 ingress controller with advanced routing features.
  • For small services where a reverse proxy fits and TLS termination at L7 is desired.

When NOT to use / overuse it

  • Don’t use NLB for traffic that requires content-aware routing, header inspection, or WAF protections.
  • Avoid using NLB as a one-size-fits-all; layering L7 proxies is often necessary for modern web workloads.

Decision checklist

  • If you need L4 forwarding, client IP preservation, high throughput -> Use NLB.
  • If you need path-based routing, A/B testing, or header-based security -> Use ALB/Ingress.
  • If you need global load balancing across regions -> Combine GSLB with NLBs per region.

Maturity ladder

  • Beginner: Use cloud-managed NLB with basic health checks to front a small cluster or database.
  • Intermediate: Integrate NLB with Kubernetes Service LoadBalancer and basic autoscaling; add observability.
  • Advanced: Combine NLBs with regional GSLB, automated certificate and target lifecycle, canary rollouts across NLB target groups.

Example decision for small teams

  • Small SaaS with TCP proxying: Use provider-managed NLB, enable client IP preservation, configure TCP health checks, monitor connection errors.

Example decision for large enterprises

  • Multi-region microservices platform: Deploy NLBs per region for low-latency TCP ingress, combine with ALBs for L7 needs, manage via IaC, and automate detection-driven failover via GSLB.

How does NLB work?

Components and workflow

  • Public IP / VIP: NLB owns one or more virtual IPs that clients target.
  • Listener: Accepts connections on specific port(s) and protocol(s).
  • Target group / pool: Collection of backend endpoints (IPs, instances, pod IPs).
  • Health checks: Periodic probes (TCP, TLS, ICMP) to mark targets healthy/unhealthy.
  • Forwarding algorithm: Hash-based, round-robin, or flow table mapping.
  • Session affinity (optional): at L4 this is usually limited to source-IP-based affinity; cookie-based stickiness requires an L7 proxy.

Data flow and lifecycle

  1. Client resolves VIP and opens TCP connection to NLB.
  2. NLB maps connection to a target using hashing or flow tables.
  3. NLB forwards packets to target; may preserve source IP.
  4. Backend responds to client via NLB or directly if routing allows.
  5. Health checks continuously verify target liveness (a minimal prober sketch follows this list).
  6. On target failure, NLB removes target, rebalances connections for new flows.
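
Steps 5 and 6 can be illustrated with a minimal TCP health-check loop; the port, thresholds, and interval below are placeholder values to tune:

```python
import socket
import time

TARGETS = {"10.0.1.10": 0, "10.0.1.11": 0}  # target -> consecutive failures
UNHEALTHY_AFTER = 3       # consecutive failures before removal from rotation
PROBE_PORT = 8080         # hypothetical application port
PROBE_TIMEOUT_S = 2.0
INTERVAL_S = 10.0

def tcp_probe(ip: str) -> bool:
    """A TCP-level health check: success means the port accepts a connection."""
    try:
        with socket.create_connection((ip, PROBE_PORT), timeout=PROBE_TIMEOUT_S):
            return True
    except OSError:
        return False

while True:
    for ip in TARGETS:
        TARGETS[ip] = 0 if tcp_probe(ip) else TARGETS[ip] + 1
    healthy = [ip for ip, fails in TARGETS.items() if fails < UNHEALTHY_AFTER]
    print(f"in rotation: {healthy}")  # new flows only go to healthy targets
    time.sleep(INTERVAL_S)
```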

Edge cases and failure modes

  • Mid-connection backend failure: NLB may silently drop or reset the connection; graceful drain mechanisms are essential.
  • Long-lived connections: A small number of connections can saturate target capacity; connection draining and capacity planning needed.
  • Packet fragmentation or MTU mismatches: Can cause corruption or reset; ensure path MTU consistent.

Practical examples (pseudocode)

  • Configure a target group with TCP health check port 3306 for a managed MySQL cluster.
  • Update autoscaling policy to consider active connection count as a scaling metric.
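
As a hedged sketch of the first bullet, assuming an AWS-style environment with boto3 (all names, IDs, and ARNs are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Target group for a MySQL pool, health-checked at the TCP level on 3306.
tg = elbv2.create_target_group(
    Name="mysql-pool",                  # placeholder name
    Protocol="TCP",
    Port=3306,
    VpcId="vpc-0123456789abcdef0",      # placeholder VPC ID
    HealthCheckProtocol="TCP",
    HealthCheckPort="3306",
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=3,
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Forward TCP :3306 on an existing NLB to that target group.
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...",  # placeholder ARN
    Protocol="TCP",
    Port=3306,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)
```

The second bullet (scaling on active connection count) is then a matter of pointing the autoscaling policy at a connection metric instead of CPU.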

Typical architecture patterns for NLB

  • Single-tier edge NLB: Public NLB -> Backend instances (use for simple TCP services).
  • Two-tier edge NLB + ALB: NLB for TLS passthrough -> ALB for HTTP routing after decryption (when TLS needs to be terminated downstream).
  • K8s-integration pattern: NLB -> NodePort -> kube-proxy/IPVS -> Pod (good for performance and IP preservation).
  • Global failover: GSLB -> region NLBs -> regional backends (for DR and geo-traffic management).
  • Internal service mesh offload: NLB -> ingress gateway pods -> service mesh (when mesh needs L4 termination at edge).

Failure modes & mitigation

| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|----|--------------|---------|--------------|------------|----------------------|
| F1 | Health check flapping | Targets marked healthy/unhealthy | Misconfigured probe or intermittent app | Fix probe, increase timeout and retries | Health check failures/sec |
| F2 | Single target overload | High latency and errors | Poor hashing affinity or small pool | Autoscale, rebalance targets | Per-target conn count |
| F3 | TLS handshake failure | Clients see TLS errors | Cert mismatch or ALPN mismatch | Rotate certs, verify cipher support | TLS handshake errors |
| F4 | Connection resets | Clients see connection resets | MTU or network path issue | Verify MTU, check network devices | TCP resets/sec |
| F5 | Slow failover | Extended downtime during failover | Long DNS TTL or slow health detection | Lower TTL, speed health checks | Time to remove unhealthy target |


Key Concepts, Keywords & Terminology for NLB

(Note: condensed entries; each line contains term — short definition — why it matters — common pitfall)

  • 5-tuple — Source/dest IP and ports plus protocol — determines flow hashing — incorrect assumptions about affinity.
  • VIP — Virtual IP assigned to LB — client target address — misrouting when VIP overlaps.
  • Listener — Port/protocol definition on LB — entry point for traffic — mismatched port causes drop.
  • Target group — Pool of backends — health management and routing units — missing targets during deploys.
  • Health check — Probe to mark targets healthy — prevents routing to bad nodes — misconfigured thresholds cause flapping.
  • Connection tracking — State about active flows — needed for consistent routing — resets if state lost.
  • Preservation of client IP — Whether backend sees original IP — required for security/audit — lost when proxying without PROXY protocol or X-Forwarded-For.
  • Source NAT (SNAT) — Rewriting source IP — simplifies routing but breaks IP-based auth — unexpected logs on the backend.
  • Destination NAT (DNAT) — Rewriting destination IP — typical for VIP mapping — conflicts in port mappings possible.
  • Flow hashing — Algorithm to map flows to targets — balances statelessly — causes imbalance with skewed source distribution.
  • Sticky sessions — Affinity to particular backend — used for session state — prevents even load distribution.
  • TLS passthrough — Forward TLS without termination — reduces CPU overhead — limits visibility and WAF use.
  • TLS termination — LB decrypts TLS — allows L7 inspection — increases CPU and attack surface.
  • Round-robin — Simple distribution algorithm — easy to reason about — poor for long-lived connections.
  • Least-connections — Favors less loaded targets — better for variable load — harder to maintain in stateless LB.
  • IPVS — Linux kernel load balancer — high performance — kernel-level complexity.
  • kube-proxy — Kubernetes service proxy — can use IPVS or iptables — behaves as L4 distributor inside cluster.
  • MetalLB — L2/L3 load balancer for bare-metal K8s — provides LoadBalancer functionality — requires BGP or ARP config.
  • Flow stickiness timeout — Duration of session affinity — affects failover and capacity planning — too long causes hotspots.
  • Health check interval — Frequency of probes — faster detection vs more probe traffic — too aggressive probing can lead to false positives.
  • Connection draining — Allow existing flows to finish during removal — prevents mid-flight resets — requires timeout tuning.
  • Cross-zone load balancing — Distribute traffic across zones — improves capacity use — can increase inter-zone costs.
  • Global server load balancing (GSLB) — DNS-based cross-region steering — handles geo and failover — DNS TTL challenges.
  • Anycast — Advertise same IP from multiple locations — low-latency routing — complex routing control.
  • DDoS mitigation — Ability to absorb or filter attacks — critical for availability — misconfigured rules may block legit traffic.
  • Autoscaling metric — Metric used to scale targets — connection count, CPU, or custom — wrong metric causes thrashing.
  • Pod readiness gating — Only send traffic to ready pods — avoids sending to initializing pods — misconfigured readiness breaks deploys.
  • MTU — Maximum Transmission Unit — affects large payloads — mismatches cause fragmentation.
  • TCP keepalive — Detects idle dead peers — important for long-lived connections — too aggressive drops legit sessions.
  • TLS ALPN — Protocol negotiation extension — needed for HTTP/2 or gRPC — mismatch causes protocol fallback.
  • Client IP header — X-Forwarded-For or PROXY protocol — preserves original client IP — missing header breaks analytics.
  • PROXY protocol — Encapsulates client info at L4 — used by many backends — incorrect version breaks parsers.
  • Packet per second (PPS) — Throughput measure — important for DDoS sizing — neglecting it causes saturation.
  • Bytes per second (BPS) — Bandwidth measure — impacts network cost — wrong expectation can inflate bills.
  • Flow table size — Number of tracked flows LB can handle — affects scaling — exceeded table leads to drops.
  • Connection timeout — Idle timeout for connections — affects long-lived clients — too short breaks apps like websockets.
  • Backend draining grace — Graceful removal time — ensures smooth deploys — too short causes resets.
  • Observability tag — Metadata to correlate traffic — essential for debugging — missing tags complicate triage.
  • Circuit breaker — Mechanism to stop sending to failing backends — prevents cascading failure — incorrectly tripping reduces availability.
  • Rate limiting — Throttle per-client or global — protects backend — over-aggressive limits impact legit users.
  • Service mesh ingress — L7 ingress through mesh — NLB often used in front — mismatch between mesh and NLB behavior causes lost metrics.
  • L4 vs L7 — Layer 4 transport vs Layer 7 application — determines capabilities — wrong choice leads to missing features.

How to Measure NLB (Metrics, SLIs, SLOs)

| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|----|------------|-------------------|----------------|-----------------|---------|
| M1 | Connection success rate | Percent of attempted connections that succeed | Successful TCP accepts / attempted connects | 99.9% typical | Counting method may double-count retries |
| M2 | Connection latency | Time to establish TCP handshake | SYN->ACK time percentile | P95 < 50ms for regional | Network RTT dominates metric |
| M3 | Active connections | Concurrent flows through LB | Current conn count per target | Capacity-based threshold | Long-lived conns skew autoscaling |
| M4 | Bytes/sec | Throughput handled by LB | Sum of bytes in and out | Depends on workload | Burst traffic can exceed limits |
| M5 | Health check failures | Targets failing probes | Failures per minute | Minimal; investigate any sustained rise | Misconfigured checks cause false positives |
| M6 | TLS handshake errors | Failed TLS negotiations | TLS error counters on LB | Near zero for healthy systems | Cipher mismatch or cert expiry |
| M7 | TCP resets/sec | Reset events at LB | Reset count per time | Very low | Resets may be caused by backends |
| M8 | Flow table utilization | LB state table usage | Tracked flows / max flows | < 70% utilization | Sudden spikes can exhaust table |
| M9 | Time to failover | Time to stop sending traffic to bad target | Delta from fail to traffic drained | < health check period * retries | High DNS TTLs skew global failover |
| M10 | Error rate (downstream) | 5xx from backend as seen at LB | Backend error count / total requests | Depends on SLOs | L4 LBs may not see full HTTP codes |

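To make M1 and M2 concrete, here is a minimal sketch computing those SLIs from raw counters and handshake samples; the numbers are made up for illustration:

```python
def connection_success_rate(accepted: int, attempted: int) -> float:
    """M1: successful TCP accepts / attempted connects over an interval.
    Beware of retries: a client retrying three times is three attempts."""
    return accepted / attempted if attempted else 1.0

def p95_handshake_latency_ms(samples_ms: list[float]) -> float:
    """M2: 95th percentile of SYN->ACK times; network RTT dominates this."""
    ordered = sorted(samples_ms)
    return ordered[int(0.95 * (len(ordered) - 1))]

print(connection_success_rate(accepted=99_912, attempted=100_000))  # 0.99912
print(p95_handshake_latency_ms([3.1, 4.2, 5.0, 48.0, 7.7, 6.3]))
```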

Best tools to measure NLB

Choose tools that integrate with cloud APIs, instrument network telemetry, and capture flow-level stats.

Tool — Prometheus + Node exporters

  • What it measures for NLB: Exposes LB metrics if LB or exporter available; collects node-level network counters.
  • Best-fit environment: Kubernetes, on-prem Linux, self-managed stacks.
  • Setup outline:
  • Export LB metrics or deploy exporters on nodes.
  • Create scrape configs for Prometheus.
  • Instrument health checks and counters.
  • Build recording rules for SLIs.
  • Strengths:
  • Flexible queries and alerting.
  • Wide ecosystem for dashboards.
  • Limitations:
  • Requires exporters on LB or translation layer.
  • High cardinality can increase storage costs.

Tool — Cloud provider monitoring (native)

  • What it measures for NLB: Provider-specific LB metrics (conn count, bytes, health checks).
  • Best-fit environment: Managed cloud services.
  • Setup outline:
  • Enable LB metrics in cloud console.
  • Route metrics to monitoring workspace.
  • Create dashboards and alerts.
  • Strengths:
  • Accurate vendor-specific telemetry.
  • Integrated with IAM and autoscaling.
  • Limitations:
  • Metrics granularity and retention vary.
  • Cross-cloud visibility harder.

Tool — eBPF-based observability (e.g., Cilium, custom)

  • What it measures for NLB: Per-flow visibility, latency, packet drops at kernel level.
  • Best-fit environment: Linux kernels with eBPF support, Kubernetes.
  • Setup outline:
  • Deploy eBPF agents on nodes.
  • Configure probes for sockets and packet events.
  • Aggregate and visualize flows.
  • Strengths:
  • High-fidelity, low-overhead telemetry.
  • Visibility across L4 traffic.
  • Limitations:
  • Requires kernel compatibility and expertise.

Tool — Packet capture & analyzers (tcpdump, Wireshark)

  • What it measures for NLB: Full packet capture for debugging handshake and MTU issues.
  • Best-fit environment: Debugging sessions in staging or during incidents.
  • Setup outline:
  • Capture on LB or target interfaces.
  • Filter by IP/port and analyze captures.
  • Strengths:
  • Definitive evidence for low-level network issues.
  • Limitations:
  • Heavyweight, privacy and performance concerns.

Tool — Flow collectors (NetFlow, sFlow)

  • What it measures for NLB: Aggregate flow metrics like top talkers and byte counts.
  • Best-fit environment: Networking teams and large-scale deployments.
  • Setup outline:
  • Enable flow export on LB/network devices.
  • Collect in a flow analysis tool.
  • Strengths:
  • High-level traffic patterns and attribution.
  • Limitations:
  • Not packet-level; lacks payload insights.

Recommended dashboards & alerts for NLB

Executive dashboard

  • Panels:
  • Regional connection success rate — shows overall availability.
  • Total throughput (BPS) across regions — capacity insight.
  • Health target ratio — percentage healthy targets.
  • Why: High-level availability and capacity for stakeholders.

On-call dashboard

  • Panels:
  • Real-time active connections per target group.
  • Health check failures with trending.
  • TLS handshake errors and TCP resets.
  • Flow table utilization and per-target conn counts.
  • Why: Immediate actionable signals for responders.

Debug dashboard

  • Panels:
  • Per-target latency heatmap.
  • SYN->ACK times and packet drop counters.
  • Recent config changes / deploy timeline overlay.
  • Packet capture links or trace IDs.
  • Why: Deep-dive triage for resolving incidents.

Alerting guidance

  • Page vs ticket:
  • Page for high-severity: connection success rate below threshold for critical services or mass target failure.
  • Ticket for non-urgent: gradual increase in byte rates or non-critical errors.
  • Burn-rate guidance:
  • Use error budget burn rate to determine paging thresholds if SLO-driven (see the sketch after this list).
  • Noise reduction tactics:
  • Deduplicate alerts by target group and region.
  • Group alerts by cause (health check flaps) and suppress during known maintenance windows.
  • Use rate-limited alerting and temporary suppression during autoscaling events.
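
A small sketch of burn-rate-driven paging; the 14.4x/3x thresholds are common multiwindow starting points for a 30-day window, not a standard:

```python
SLO = 0.999             # 99.9% connection success objective
ERROR_BUDGET = 1 - SLO  # 0.1% of connections may fail over the SLO window

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.
    1.0 means the budget lasts exactly the SLO window."""
    return observed_error_rate / ERROR_BUDGET

# Example: 1.5% of connection attempts failing right now.
rate = burn_rate(0.015)
if rate >= 14.4:
    print(f"page: burn rate {rate:.1f}x")    # fast burn -> page
elif rate >= 3.0:
    print(f"ticket: burn rate {rate:.1f}x")  # slow burn -> ticket
```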

Implementation Guide (Step-by-step)

1) Prerequisites

  • Inventory of protocols and ports to balance.
  • Capacity planning: expected concurrent connections and BPS.
  • Certificate lifecycle plan if TLS termination needed.
  • IAM and network permissions for LB provisioning.

2) Instrumentation plan

  • Identify metrics to export (see Metrics table).
  • Ensure health checks and readiness probes for backends.
  • Add provenance: service tags and correlation IDs.

3) Data collection

  • Configure provider metrics export or deploy exporters.
  • Collect flow, error, and health metrics centrally.
  • Ensure logs capture config changes and control-plane events.

4) SLO design

  • Define SLIs from metrics (connection success, latency).
  • Choose SLO targets and error budgets per service class.

5) Dashboards

  • Build exec, on-call, and debug dashboards.
  • Add runbook links and recent deploy metadata.

6) Alerts & routing

  • Map SLO violations to alert rules.
  • Define notification routing (network team, service owner).
  • Configure suppression for planned maintenance.

7) Runbooks & automation

  • Create runbooks for common NLB incidents (health check flaps, certificate expiry).
  • Automate target replacement, rolling updates, and certificate rotation.

8) Validation (load/chaos/game days)

  • Load testing for expected concurrency and BPS (a load-test sketch follows this numbered list).
  • Chaos tests: kill targets, simulate network partition, observe failover.
  • Game days involving on-call and downstream teams.

9) Continuous improvement

  • Quarterly review of capacity and SLOs.
  • Post-incident action items automated into CI pipelines.
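
For step 8, a minimal asyncio sketch that opens many concurrent TCP connections and reports connect success and latency; the endpoint is a placeholder, and a real test should ramp gradually while watching ephemeral port limits:

```python
import asyncio
import time

HOST, PORT = "nlb.example.com", 443  # placeholder endpoint
CONCURRENCY = 1000                   # ramp this up gradually in practice

async def connect_once() -> float:
    start = time.monotonic()
    reader, writer = await asyncio.open_connection(HOST, PORT)
    latency = time.monotonic() - start
    writer.close()
    await writer.wait_closed()
    return latency

async def main():
    results = await asyncio.gather(
        *(connect_once() for _ in range(CONCURRENCY)), return_exceptions=True)
    ok = sorted(r for r in results if isinstance(r, float))
    if ok:
        p95_ms = ok[int(0.95 * (len(ok) - 1))] * 1000
        print(f"success: {len(ok)}/{CONCURRENCY}, p95 connect: {p95_ms:.1f} ms")
    else:
        print("all connection attempts failed")

asyncio.run(main())
```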

Pre-production checklist

  • Health checks configured and validated.
  • Readiness probes on all backends.
  • Observability pipelines ingest LB metrics.
  • Security groups and ACLs restrict access.
  • Certificate present and expiry monitored.

Production readiness checklist

  • Autoscaling triggers verified against connection metrics.
  • Backup/DR plan for LB (multi-region or failover).
  • IAM roles and audit logging enabled.
  • Runbooks published and tested.

Incident checklist specific to NLB

  • Confirm LB control plane status.
  • Check health check history and recent config changes.
  • Verify backend instance health and network connectivity.
  • If TLS issue: check cert expiry and ALPN/cipher settings.
  • Escalate to network/platform on persistent flow table exhaustion.

Example Kubernetes pre-production (actionable)

  • Create Service type LoadBalancer with annotation for target port (see the manifest sketch after this list).
  • Deploy MetalLB controller and configure address pool.
  • Verify pods readiness and kube-proxy IPVS mode.
  • Test connections from external client and verify client IP preservation.
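
A hedged sketch of that Service using the official Kubernetes Python client; the annotation key is a placeholder (real keys vary by provider/MetalLB), while externalTrafficPolicy: Local is the standard lever for preserving the client source IP:

```python
from kubernetes import client, config

config.load_kube_config()

service = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="chat-ingress",
        # Provider-specific annotations go here; keys vary by platform.
        annotations={"example.com/target-port": "8443"},  # placeholder key
    ),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "chat"},
        ports=[client.V1ServicePort(port=443, target_port=8443, protocol="TCP")],
        # "Local" avoids the extra node hop and preserves the client source IP.
        external_traffic_policy="Local",
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```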

Example managed cloud service production steps

  • Create cloud NLB and attach target group to instance IDs or IPs (an API sketch follows this list).
  • Configure TCP health check to application port.
  • Enable cross-zone balancing if desired.
  • Hook CloudWatch/metrics to central monitoring and set alerts for health check failures.
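
A hedged boto3 sketch of the alerting step; on AWS the AWS/NetworkELB namespace exposes UnHealthyHostCount, but the dimension values below are placeholders to adapt:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when any target in the group fails health checks for 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="nlb-unhealthy-targets",
    Namespace="AWS/NetworkELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "net/my-nlb/0123456789abcdef"},        # placeholder
        {"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"},  # placeholder
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:..."],  # placeholder SNS topic for routing
)
```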

Use Cases of NLB

1) Database proxying

  • Context: Cloud-hosted MySQL cluster for SaaS app.
  • Problem: High-concurrency TCP connections need load distribution.
  • Why NLB helps: Pass-through L4 balancing preserves client IP and low latency.
  • What to measure: Connection success, active conns per target, replication lag indirectly.
  • Typical tools: Cloud NLB, autoscaling groups, client-side connection pools.

2) gRPC services over TCP

  • Context: High-throughput microservices communicating via gRPC.
  • Problem: Need low latency and TLS passthrough for mTLS.
  • Why NLB helps: Preserves end-to-end TLS, supports long-lived connections.
  • What to measure: SYN-ACK latency, TLS handshakes, flow table utilization.
  • Typical tools: Cloud NLB, eBPF tracing, Prometheus.

3) Websocket gateway

  • Context: Real-time chat service using websockets.
  • Problem: Long-lived connections and high concurrency.
  • Why NLB helps: Efficient forwarding and minimal overhead.
  • What to measure: Active connections, idle timeouts, connection resets.
  • Typical tools: NLB, autoscale targets, connection-aware metrics.

4) IoT telemetry ingestion (UDP/TCP)

  • Context: Millions of IoT devices sending telemetry via UDP.
  • Problem: High PPS and simple protocol requiring L4 efficiency.
  • Why NLB helps: Handles packet throughput and simple routing at scale.
  • What to measure: Packets per second, dropped packets, bytes/sec.
  • Typical tools: Anycast NLB, flow collectors.

5) VPN concentrator

  • Context: Central VPN endpoints for remote employees.
  • Problem: High concurrent secure tunnels and NAT traversal.
  • Why NLB helps: Balances tunnels and preserves source IP for ACLs.
  • What to measure: Tunnel count, handshake failures, throughput per user.
  • Typical tools: NLB, managed VPN gateways.

6) SIP/VoIP load balancing

  • Context: Real-time voice systems needing UDP and TCP support.
  • Problem: Call setup latency and high concurrent streams.
  • Why NLB helps: Low-latency L4 routing and source IP preservation.
  • What to measure: Call setup time, RTP packet loss, jitter.
  • Typical tools: Hardware SBCs, NLB with UDP support.

7) Internal microservice TCP routing

  • Context: Non-HTTP internal services in K8s requiring stable endpoints.
  • Problem: Pod churn causing client reconnections.
  • Why NLB helps: Stable VIPs and health-check-based routing.
  • What to measure: Pod readiness, connection restarts, short-lived conn counts.
  • Typical tools: K8s Service LoadBalancer, IPVS.

8) Cross-region failover for critical TCP services

  • Context: Financial services needing regional redundancy.
  • Problem: Regional outage requires fast failover with minimal data loss.
  • Why NLB helps: Per-region NLBs combined with GSLB for DNS failover.
  • What to measure: Failover time, DNS TTL effectiveness, client reconnections.
  • Typical tools: NLB, GSLB, traffic steering automation.

9) Backend for ALB for TLS offload

  • Context: ALB terminates TLS for web while some endpoints need passthrough.
  • Problem: Mixed L7 and L4 requirements.
  • Why NLB helps: Forwards specific ports while ALB handles L7 tasks.
  • What to measure: TLS handshake errors and ALB-to-LB routing latency.
  • Typical tools: ALB + NLB combo.

10) Bare-metal Kubernetes ingress

  • Context: On-prem K8s cluster needing public LoadBalancer service.
  • Problem: Cloud-managed LB not available.
  • Why NLB helps: MetalLB or BGP-provisioned NLB gives public VIPs.
  • What to measure: ARP/BGP stability, address pool exhaustion.
  • Typical tools: MetalLB, BGP routers.


Scenario Examples (Realistic, End-to-End)

Scenario #1 — Kubernetes high-concurrency websocket service

Context: K8s cluster hosts a chat service requiring many concurrent websockets.
Goal: Support 200k concurrent connections with low reconnection rate.
Why NLB matters here: Websockets are long-lived TCP connections; L4 performance and IP preservation are essential.
Architecture / workflow: Public NLB -> NodePort -> kube-proxy IPVS -> Pod -> App handles websocket.
Step-by-step implementation:

  1. Enable IPVS mode in kube-proxy.
  2. Deploy MetalLB or cloud NLB with static VIP.
  3. Tune NodePort range and ensure security groups allow ports.
  4. Configure health checks to use TCP and readiness probes in pods.
  5. Set connection draining to allow close of existing websockets.

What to measure: Active connections per node, connection resets, SYN-ACK latency.
Tools to use and why: MetalLB/NLB for VIP; Prometheus for metrics; eBPF for per-socket traces.
Common pitfalls: Not tuning ephemeral port limits; using iptables mode causing performance drop.
Validation: Load test gradually to 200k connections; run a game day with pod and node kills.
Outcome: Stable long-lived connections, autoscaling based on connection metrics.

Scenario #2 — Serverless managed PaaS with internal TCP backend

Context: Managed PaaS exposes platform via HTTP but requires internal L4 balancing to stateful TCP services.
Goal: Ensure L4 routing to database proxies while keeping platform serverless.
Why NLB matters here: L4 pass-through reduces latency and keeps TLS endpoints secure.
Architecture / workflow: Serverless front -> ALB for HTTP -> NLB for DB proxy passthrough -> DB cluster.
Step-by-step implementation:

  1. Provision cloud NLB for DB ports.
  2. Configure target groups with private IPs of DB proxies.
  3. Ensure VPC peering and routing rules allow traffic.
  4. Add metrics collection for conn counts and bytes.

What to measure: Connection success, TLS handshake errors, DB proxy CPU.
Tools to use and why: Cloud provider NLB for managed scale; cloud monitoring for telemetry.
Common pitfalls: Route table misconfigurations blocking internal traffic.
Validation: Simulate serverless bursts creating many DB connections; observe autoscaler and LB behavior.
Outcome: Fast L4 routing with preserved client info and minimal latency.

Scenario #3 — Incident-response: sudden target group failures

Context: Production e-commerce app suffers elevated checkout failures and timeouts.
Goal: Rapidly identify LB vs backend issue and mitigate impact.
Why NLB matters here: NLB health-checks and connection metrics can show if backends are unhealthy.
Architecture / workflow: Public NLB -> target group of app instances -> downstream payment service.
Step-by-step implementation:

  1. Check LB health metrics and recent config changes.
  2. Verify health-check logs and last success times.
  3. Drill into per-target conn count and CPU/memory.
  4. If unhealthy targets found, roll back recent deploy or replace instances.
  5. If LB misconfiguration, apply corrected health-check settings.

What to measure: Health check failures/minute, connection resets, per-target errors.
Tools to use and why: Provider LB metrics, Prometheus, logging for deploy events.
Common pitfalls: Paging the wrong team due to lack of ACL context.
Validation: Postmortem identifies root cause; implement automation to detect and replace failing targets.
Outcome: Reduced MTTR and clearer ownership for LB-level incidents.

Scenario #4 — Cost/performance trade-off for TLS offload

Context: A company needs to reduce backend CPU usage from TLS costs while keeping observability.
Goal: Move TLS termination to LB while retaining request-level observability.
Why NLB matters here: Using NLB with TLS passthrough vs LB TLS termination has cost and visibility trade-offs.
Architecture / workflow: Option A: NLB TLS passthrough -> backend termination. Option B: ALB terminates TLS -> NLB not used for TLS.
Step-by-step implementation:

  1. Measure backend CPU before changes.
  2. Implement LB TLS termination in staging and feed logs to tracing.
  3. If using NLB passthrough, add PROXY protocol or X-Forwarded-For to preserve client IP.
  4. Compare CPU, latency, and observability metrics.

What to measure: Backend CPU, TLS handshake CPU time, request latency, trace completeness.
Tools to use and why: Cloud LB metrics, app-level tracing, and cost reporting.
Common pitfalls: Loss of client IP or missing headers breaking rate limits.
Validation: A/B test traffic and compare SLOs and cost delta.
Outcome: Informed decision balancing cost and observability with rollback plan.

Common Mistakes, Anti-patterns, and Troubleshooting

(Format: Symptom -> Root cause -> Fix)

1) Symptom: Targets flapping between healthy and unhealthy -> Root cause: Health check too strict -> Fix: Increase timeout and retries, validate probe endpoint.
2) Symptom: High P95 connection latency -> Root cause: Network path or overloaded target -> Fix: Check per-target CPU and inter-zone routing; redistribute traffic.
3) Symptom: Sudden spike in connection resets -> Root cause: MTU mismatch or abrupt backend termination -> Fix: Verify MTU across path and ensure graceful shutdown.
4) Symptom: Client IP missing at backend -> Root cause: SNAT enabled or missing PROXY header -> Fix: Enable client IP preservation or enable PROXY protocol.
5) Symptom: One backend receiving most traffic -> Root cause: Hashing skew due to few client IPs -> Fix: Use source-port hashing or implement stickiness timeout adjustments.
6) Symptom: Flow table exhaustion -> Root cause: Unexpected high short-lived connections -> Fix: Increase flow table limit, throttle traffic ingress, or implement rate limiting.
7) Symptom: TLS handshake failures -> Root cause: Expired cert or unsupported cipher -> Fix: Rotate certs and ensure cipher compatibility.
8) Symptom: Slow failover in DR -> Root cause: High DNS TTL and slow health checks -> Fix: Lower DNS TTL and tune health check aggressiveness.
9) Symptom: Increased operational toil for LB config -> Root cause: Manual processes and no IaC -> Fix: Move LB config to IaC pipelines.
10) Symptom: Observability blind spots -> Root cause: Not collecting LB metrics or missing correlation IDs -> Fix: Enable LB telemetry and add correlation propagation.
11) Symptom: Alerts during autoscale -> Root cause: Alert thresholds not aligned with scaling behavior -> Fix: Add suppression windows or scale-aware alert rules.
12) Symptom: Excess cost for cross-zone traffic -> Root cause: Cross-zone balancing enabled without need -> Fix: Evaluate architecture and disable if not beneficial.
13) Symptom: Backends not receiving health checks -> Root cause: Security group or ACL blocking probe sources -> Fix: Allow LB probe IP ranges.
14) Symptom: High error rate but LB shows green -> Root cause: L4 LB cannot see HTTP 5xx -> Fix: Add L7 insights via ALB/instrumentation.
15) Symptom: DNS pointing to decommissioned LB -> Root cause: Stale DNS records / TTL mismanagement -> Fix: Update DNS and reduce TTL for critical services.
16) Symptom: Missing observability during incidents -> Root cause: Log rotation removing recent logs -> Fix: Extend retention or stream to central logging.
17) Symptom: Backends overloaded during deploy -> Root cause: No connection draining -> Fix: Enable draining and coordinate deploys.
18) Symptom: Unexpected client blocking -> Root cause: Over-aggressive rate limiting at LB -> Fix: Review rate limits and add whitelisting if needed.
19) Symptom: Weird packet corruption -> Root cause: Middlebox altering packets or checksum offload mismatch -> Fix: Disable offloads for capture and validate network devices.
20) Symptom: Alerts noisy and unactionable -> Root cause: Too low thresholds and no dedupe -> Fix: Adjust thresholds, add grouping and suppression rules.
21) Symptom: Inconsistent metrics across regions -> Root cause: Different metric retention/granularity -> Fix: Standardize metrics collection and retention policy.
22) Symptom: Ingress service unavailable after scaling -> Root cause: Missing target registration automation -> Fix: Automate registration/deregistration on scale events.
23) Symptom: Application-level auth fails -> Root cause: IP change due to SNAT -> Fix: Enable client IP pass-through or adjust auth to use headers.
24) Symptom: High platform toil on TLS rotation -> Root cause: Manual cert renewal -> Fix: Automate cert rotation with ACME or provider integrations.
25) Symptom: Observability pitfall — metric cardinality explosion -> Root cause: High-tag cardinality on per-connection metrics -> Fix: Reduce labels, aggregate, or sample.


Best Practices & Operating Model

Ownership and on-call

  • Assign LB ownership to platform/network team with clear escalation to service owners.
  • Shared runbooks with contact lists and escalation paths.
  • On-call rotations should include platform engineers familiar with LB internals.

Runbooks vs playbooks

  • Runbooks: Step-by-step operational procedures for known incidents.
  • Playbooks: Higher-level strategic response for complex incidents; used in postmortem decisions.

Safe deployments

  • Use canary deployments or connection draining to avoid sudden traffic storms.
  • Validate health checks and scale behavior before shifting prod traffic.
  • Maintain rollback procedures and test them.

Toil reduction and automation

  • Automate target registration/deregistration, certificate rotation, and health-check tuning based on telemetry.
  • Use CI pipelines for LB config changes with validation tests.
  • Automate alert suppression for planned events.

Security basics

  • Limit LB exposure with security groups and allowlists.
  • Use TLS termination where needed and rotate certs proactively.
  • Monitor for anomalous traffic patterns for DDoS.

Weekly/monthly routines

  • Weekly: Check health check trends and active connection distributions.
  • Monthly: Capacity review and certificate expiry inventory.
  • Quarterly: Game day exercises and SLO review.

What to review in postmortems related to NLB

  • Timeline of health-check and target state changes.
  • Changes to LB configuration or control plane around incident.
  • Metrics correlation: connection counts, CPU, packet drops.
  • Automation failures: why automation didn’t remediate.

What to automate first

  • Certificate rotation and expiry alerts.
  • Target registration during autoscaling events.
  • Automated replacement of unhealthy targets.
  • Health-check tuning based on observed failure patterns.

Tooling & Integration Map for NLB

| ID | Category | What it does | Key integrations | Notes |
|----|----------|--------------|------------------|-------|
| I1 | Cloud LB | Managed L4 balancing | Cloud metrics, IAM, autoscaling | Vendor feature sets vary |
| I2 | MetalLB | Bare-metal LoadBalancer for K8s | BGP routers, k8s API | Good for on-prem clusters |
| I3 | IPVS/kube-proxy | Kernel L4 balancing in K8s | kubelet, kube-apiserver | High performance |
| I4 | eBPF observability | Per-flow telemetry | Prometheus, tracing | Kernel-level visibility |
| I5 | Prometheus | Metrics collection and alerting | Exporters, LB APIs | Flexible queries |
| I6 | Flow collectors | NetFlow/sFlow analysis | Network devices, SIEM | Top-talkers and trends |
| I7 | Packet capture | Deep packet debugging | Storage, analysis tools | Use in staging/incidents |
| I8 | GSLB/DNS | Global traffic steering | DNS providers, health APIs | DNS TTL management matters |
| I9 | WAF / ALB | L7 inspection and protection | LB chain, cloud logging | Use with NLB for mixed workloads |
| I10 | CI/CD / IaC | Manage LB config as code | Terraform, Cloud SDKs | Critical for repeatability |


Frequently Asked Questions (FAQs)

What is the difference between NLB and ALB?

NLB is L4 for TCP/UDP with minimal inspection and low latency; ALB is L7 and understands HTTP(S), enabling host/path routing and header-based policies.

How do I preserve client IP through an NLB?

Enable client IP preservation or use PROXY protocol or X-Forwarded-For at the backend; configuration depends on implementation.
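
For the PROXY protocol route, here is a minimal sketch of what a backend does with the v1 text header the balancer prepends to the stream (v2 is binary and needs a different parser):

```python
def parse_proxy_v1(first_line: bytes) -> dict:
    """Parse a PROXY protocol v1 header, e.g.
    b'PROXY TCP4 203.0.113.7 10.0.1.10 56324 443\r\n'."""
    parts = first_line.rstrip(b"\r\n").split(b" ")
    if parts[0] != b"PROXY":
        raise ValueError("not a PROXY v1 header")
    proto, src_ip, dst_ip, src_port, dst_port = parts[1:6]
    return {
        "protocol": proto.decode(),
        "client_ip": src_ip.decode(),   # the original client, not the LB
        "client_port": int(src_port),
        "server_ip": dst_ip.decode(),
        "server_port": int(dst_port),
    }

print(parse_proxy_v1(b"PROXY TCP4 203.0.113.7 10.0.1.10 56324 443\r\n"))
```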

How do I measure connection success rate?

Compute successful TCP accepts divided by attempted connects as recorded by LB metrics over an interval.

What’s the difference between NLB and a reverse proxy like HAProxy?

HAProxy operates at L7 (and L4) with richer routing and inspection; NLB focuses on fast L4 forwarding without deep request handling.

How do I choose health check parameters?

Start with conservative timeouts and retries; match health checks to application readiness probe semantics and tune after observing flaps.

How do I scale NLB capacity?

For managed NLBs autoscaling is often automatic; for self-managed increase pool size, tune flow tables, and add more VIPs/instances.

What’s the difference between client IP preservation and SNAT?

Client IP preservation keeps original source IP visible to backend; SNAT rewrites source to LB IP, which may simplify return path but breaks IP-based logic.

How do I debug TLS handshake failures?

Check cert validity, supported ciphers/ALPN, and capture handshake errors on LB logs or packet captures for failing clients.
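
A quick client-side check with Python's standard library can narrow whether the certificate, ciphers, or ALPN negotiation is at fault; the endpoint below is a placeholder:

```python
import socket
import ssl

HOST, PORT = "service.example.com", 443  # placeholder endpoint

context = ssl.create_default_context()
context.set_alpn_protocols(["h2", "http/1.1"])  # check ALPN negotiation too

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            cert = tls.getpeercert()
            print("negotiated:", tls.version(), tls.selected_alpn_protocol())
            print("cert expires:", cert["notAfter"])  # expiry is a top culprit
except ssl.SSLError as exc:
    print("handshake failed:", exc)  # e.g. cert mismatch or cipher issues
```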

How do I handle long-lived connections with autoscaling?

Use connection-aware autoscaling metrics, enable graceful draining, and provision capacity headroom to absorb spikes.

What’s the best way to test NLB changes?

Use a staging environment mirroring production, run load tests and game days, and validate rollback paths in CI.

How do I avoid alert fatigue for LB alerts?

Group alerts, use rate-limits, and route only high-severity SLO breaches to paging while less severe to tickets.

How do I handle cross-zone traffic cost concerns?

Evaluate need for cross-zone balancing; consider disabling if not required and architect to keep traffic local where possible.

How do I ensure observability for encrypted traffic?

Propagate correlation IDs and use TLS termination where feasible; for passthrough, rely on LB-level metrics and flow telemetry.

What’s the difference between L4 load balancer and IPVS?

L4 load balancer is a general concept; IPVS is a Linux kernel implementation that provides virtual server functionality at L4.

How do I protect an NLB from DDoS?

Use provider DDoS protections, enable rate limits, and cascade to WAFs for L7 attacks; use blackholing sparingly and with care.

How do I migrate from ALB to NLB for a service?

Validate protocol needs (no L7 features required), deploy NLB in parallel, switch DNS with low TTL, and monitor SLOs for regressions.

How do I handle certificate rotation with NLB?

Automate rotation with provider APIs or ACME, maintain overlap windows and monitor expiry metrics to avoid outages.

How do I collect per-client telemetry behind NLB passthrough?

Use PROXY protocol or add client-origin headers at the first termination point and centralize logs and traces.


Conclusion

Summary: Network Load Balancers are essential L4 primitives that provide fast, scalable TCP/UDP traffic distribution while preserving client context. They play a critical role for high-throughput services, long-lived connections, and non-HTTP protocols. Correct configuration, observability, and automation are key to reliable operation.

Next 7 days plan (5 bullets)

  • Day 1: Inventory all services using L4 traffic and capture current KPIs.
  • Day 2: Ensure health checks exist and instrument basic LB metrics ingestion.
  • Day 3: Implement IaC for NLB configuration and enable alerting for health-check failures.
  • Day 4: Run a controlled load test to validate connection limits and draining behavior.
  • Day 5–7: Run a game day simulating target failures, practice runbooks, and finalize postmortem actions.

Appendix — NLB Keyword Cluster (SEO)

Primary keywords

  • network load balancer
  • NLB
  • L4 load balancer
  • TCP load balancing
  • UDP load balancing
  • TLS passthrough
  • client IP preservation
  • cloud NLB
  • MetalLB
  • kube-proxy IPVS
  • load balancer health checks
  • connection draining
  • flow hashing
  • flow table utilization
  • long-lived connections

Related terminology

  • VIP address
  • listener port
  • target group
  • PROXY protocol
  • X-Forwarded-For header
  • SYN-ACK latency
  • connection success rate
  • active connections
  • bytes per second
  • packets per second
  • TLS handshake errors
  • connection resets
  • source NAT
  • destination NAT
  • cross-zone balancing
  • global server load balancing
  • GSLB
  • anycast routing
  • BGP load balancing
  • packet capture debugging
  • eBPF observability
  • NetFlow sFlow
  • packet per second sizing
  • MTU mismatch
  • TLS ALPN negotiation
  • autoscaling based on connections
  • readiness probe
  • health check interval
  • health check retries
  • slow failover
  • DNS TTL management
  • canary deployments
  • rolling deployments
  • certificate rotation automation
  • security groups ACLs
  • DDoS mitigation strategies
  • rate limiting at LB
  • WAF integration
  • observability tag correlation
  • SLI SLO for load balancer
  • error budget burn rate
  • alert deduplication
  • runbook automation
  • incident response playbook
  • postmortem for LB incidents
  • capacity planning for NLB
  • nodeport LoadBalancer pattern
  • serverless L4 fronting
  • websocket load balancing
  • gRPC over TCP load balancer
  • VPN concentrator load balancing
  • SIP UDP balancing
  • on-prem bare-metal load balancing
  • MetalLB BGP integration
  • IPVS kernel virtual server
  • kube-proxy mode IPVS
  • cloud provider load balancer limits
  • flow stickiness timeout
  • client IP header preservation
  • PROXY v1 v2 compatibility
  • backend draining grace period
  • flow table size limits
  • connection timeout tuning
  • high throughput packet forwarding
  • L4 vs L7 trade-offs
  • ALB vs NLB differences
  • reverse proxy vs load balancer
  • engineered observability for NLB
  • per-target connection metrics
  • load testing for NLB
  • game day testing load balancer
  • automated LB configuration as code
  • Terraform load balancer modules
  • cloud monitoring for NLB
  • Prometheus LB metrics
  • eBPF for connection tracing
  • flow collector for traffic analysis
  • packet capture for TLS debugging
  • TCP keepalive tuning
  • circuit breaker for backend saturation
  • rate limit whitelisting
  • cross-region failover strategies
  • DNS failover with NLB endpoints
  • health-check tuning best practices
  • ephemeral port limits and tuning
  • per-connection CPU overhead
  • TLS termination cost comparison
  • cost vs performance TLS offload
  • tracing and correlation behind TLS passthrough
  • sampling strategies for high-cardinality metrics
  • central logging of LB events
  • security logging for source IPs
  • admission policies for LB changes
  • CI/CD validation for load balancer updates
  • runbook testing and validation
  • ownership model for platform LBs
  • on-call rotation for network/platform teams
  • reducing toil with LB automation