TL;DR: A common haproxy-in-front-of-redis-sentinel setup will cascade-restart its haproxy pods during sentinel failover if the kubernetes liveness probe checks backend health. The probes need splitting: liveness reflects process aliveness, readiness reflects backend availability. I also had to add a preStop hook that runs kill -USR1 1 so haproxy can drain existing connections before kubelet sends SIGTERM, which is a hard stop in haproxy 3.x.

The setup

A small redis high-availability cluster:

  • 3 redis pods each with a redis-sentinel sidecar (quorum: 2)
  • 3 haproxy pods in front, routing client traffic to whichever redis is currently master
  • Clients connect to the haproxy Service, never to redis pods directly

The haproxy config picks the master by asking each sentinel for get-master-addr-by-name, and serves on :6379 only when it has a healthy master backend.
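
My actual config resolves the master address from sentinel; as a rough, self-contained sketch of the same frontend/backend shape, here is the common variant that instead health-checks each redis node's reported role, so only the current master counts as up in bk_redis_master (node addresses are placeholders):

frontend ft_redis_master
  bind :6379
  mode tcp
  default_backend bk_redis_master

backend bk_redis_master
  mode tcp
  option tcp-check
  # Only the node currently reporting role:master passes the check,
  # so nbsrv(bk_redis_master) is 1 in steady state and 0 mid-failover.
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  server redis-0 redis-0.redis:6379 check inter 1s
  server redis-1 redis-1.redis:6379 check inter 1s
  server redis-2 redis-2.redis:6379 check inter 1s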

How it kept failing

I was looking at recurring overnight pages. The on-call alert was Connection refused to a downstream service that relies on redis as part of its startup path. My first reading was that GCE spot nodes were getting preempted and dragging redis pods with them. I checked the GCP audit log and found something different:

  • compute.instances.preempted events in 24 hours: 1
  • compute.instances.delete events from the cluster autoscaler service account: about 50, all between 05:00 and 07:30 UTC

So the trigger was cluster autoscaler consolidating nodes during quiet hours, not spot preemption. Raising consolidationDelayMinutes on the ComputeClass from 5 to 30 reduced churn frequency by roughly 6x, and that alone took the alert volume down. But that only addresses how often redis pods get rescheduled, not what happens to clients when one does.
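
For reference, the delay is a field on the ComputeClass resource; a minimal sketch of the change, assuming the GKE custom compute class schema, with the class name and priorities as placeholders:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: redis-nodes             # placeholder
spec:
  priorities:
    - spot: true                # placeholder; whatever the class already uses
  autoscalingPolicy:
    # Wait 30 minutes (up from 5) before consolidating underutilized nodes.
    consolidationDelayMinutes: 30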

The cascading restart

I had each haproxy pod watching the same redis cluster and reporting its own k8s health on /healthz:

listen health_check_http_url
  bind :8888
  mode http
  monitor-uri /healthz
  monitor fail if { nbsrv(bk_redis_master) lt 1 }

That monitor fail if { nbsrv(bk_redis_master) lt 1 } line caused the cascade. /healthz returns 503 whenever bk_redis_master has zero healthy servers. Both my liveness and readiness probes hit the same endpoint.

Sentinel takes 5 to 15 seconds in my config to elect a new master. During that window, no backend looks like a master to haproxy. Every haproxy replica returns 503 on /healthz at the same time. Three replicas, all watching the same cluster, all failing in lockstep. The liveness probe fires failureThreshold: 3 × periodSeconds: 3 = 9s later and kubelet kills each of them. The Service drops to zero endpoints. New replicas have to come up from scratch. A 15 second backend outage turns into a 30 to 60 second total outage with no haproxy at all.
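
For concreteness, the probe config at that point looked roughly like this, both probes aimed at the one shared endpoint (a reconstruction; the readiness timing values are assumptions):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8888
  periodSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 8888
  periodSeconds: 3
  failureThreshold: 3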

The pattern shows up elsewhere too: any setup where multiple replicas check a shared external dependency, and tie k8s liveness to it, will wipe the fleet on a dependency blip.

Splitting liveness from readiness

I had collapsed two questions onto one endpoint. Liveness should answer “is the process alive (would restarting help)”. Readiness should answer “should this pod receive traffic”. My /healthz only answered the readiness question.

The fix is two listen blocks, one per probe semantic:

# Liveness: 200 if haproxy is alive. Decoupled from backend state.
listen health_alive
  bind :8888
  mode http
  monitor-uri /alive
  option dontlognull

# Readiness: 503 if there is no healthy redis master right now.
listen health_ready
  bind :8889
  mode http
  monitor-uri /ready
  monitor fail if { nbsrv(bk_redis_master) lt 1 }
  option dontlognull

And the deployment:

livenessProbe:
  httpGet:
    path: /alive
    port: 8888
readinessProbe:
  httpGet:
    path: /ready
    port: 8889

During a sentinel failover now, readiness fails on all three replicas at the same time. kube-proxy removes them from the Service while they stay alive. New connections from clients route nowhere for a few seconds, but the haproxy pods themselves do not restart. Once sentinel finishes the election, readiness recovers within a probe cycle and the Service comes back to full capacity. No long restart cycle to wait through.

I should name a small availability trade-off. With readiness failing on all replicas at once, the Service has zero endpoints during the failover window, so clients see Connection refused or routing-level errors. That is the same end state as the old setup during the same window, but without the cost of pod restarts. The cleaner answer would be a sentinel-aware client library, or a redis cluster with multi-master sharding, but neither was on the table for this setup.

Graceful shutdown in haproxy 3.x

After fixing the cascade, I wanted to confirm an individual pod could be removed cleanly during cluster autoscaler scale-down. That part took me down a longer detour than I had budgeted for.

I tested by deleting a haproxy pod and watching a probe client hit the Service. The probe saw no failures during sentinel failover, which was what I wanted. But during pod replacement, the probe was still seeing client errors. The cause is in the haproxy management docs (haproxy 3.0 management.html):

The hard stop is simple, when the SIGTERM signal is sent to the haproxy process, it immediately quits and all established connections are closed.

The graceful stop is triggered when the SIGUSR1 signal is sent to the haproxy process. It consists in only unbinding from listening ports, but continue to process existing connections until they close. Once the last connection is closed, the process leaves.

SIGTERM is what kubelet sends when a pod is terminated. With no preStop hook, haproxy hard-stops and RSTs every in-flight client connection.

SIGUSR1 is the signal I wanted: stop accepting new connections, let existing ones drain. To trigger it during pod termination, I needed a preStop hook.

The order matters

My first attempt was wrong, and a bot code reviewer caught it before I shipped:

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "kill -USR1 1; sleep 30"]

The thinking was “send the soft-stop signal, then wait for things to drain”. The problem: kube-proxy removes the pod from the Service asynchronously, 5 to 10 seconds after the pod is marked for deletion. If I send SIGUSR1 at t=0, haproxy stops accepting at t=0, but kube-proxy keeps routing new connections to this pod until it has propagated the EndpointSlice update. Those new connections see Connection refused.

The right order is: sleep first (so haproxy keeps accepting while kube-proxy propagates the EndpointSlice removal), then signal:

lifecycle:
  preStop:
    exec:
      # 1. sleep 10s so kube-proxy removes this pod from the Service while
      #    haproxy keeps accepting (any racing connection still gets served)
      # 2. SIGUSR1 starts the soft-stop, drains in-flight connections
      # 3. sleep 20s gives existing connections time to complete
      command: ["sh", "-c", "sleep 10; kill -USR1 1; sleep 20"]

And in haproxy.cfg, I added hard-stop-after so haproxy exits within the preStop window, before kubelet sends SIGTERM:

global
  # Bound the soft-stop so haproxy exits by t=30s at the latest,
  # just as preStop ends and kubelet sends SIGTERM.
  hard-stop-after 20s

On the deployment, I raised terminationGracePeriodSeconds from the default 30 to 60 so the 30-second preStop has headroom.
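
Put together, the termination-related parts of the pod spec come out to a few lines (only the fields discussed here):

spec:
  # Default is 30s; 60s leaves headroom for the 30s preStop plus SIGTERM handling.
  terminationGracePeriodSeconds: 60
  containers:
    - name: haproxy
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10; kill -USR1 1; sleep 20"]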

Sequence end to end

t=0       Pod marked for deletion. preStop starts.
t=0-10s   haproxy still listening. kube-proxy removes endpoint.
t=10s     kill -USR1 1 -> haproxy stops listening, drain begins.
          hard-stop-after 20s timer starts.
t=10-30s  Existing connections complete naturally.
t=30s     hard-stop-after fires. haproxy exits. Container gone.
t=30s     preStop sleep ends. kubelet would send SIGTERM
          but the container is already gone.

How I verified it

I ran a probe pod that loops redis-cli PING against the Service every 500ms, then deleted a haproxy pod from the new cohort and counted how many of the probe’s PINGs failed during the next minute.
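
The probe loop itself is nothing fancy, roughly this (the Service hostname is a placeholder), run from a throwaway pod with redis-cli installed:

# Log a line for every failed PING; a healthy minute logs nothing.
while true; do
  redis-cli -h haproxy.redis.svc.cluster.local -p 6379 ping > /dev/null \
    || echo "FAIL $(date +%T)"
  sleep 0.5
done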

Before the changes, a single pod deletion produced four Connection refused entries in the probe log over the 9-second kube-proxy lag window. After the changes, the same test produced zero. The fleet replacement during the rollout still produced some errors because the old pods predated the preStop config and got hard-stopped on terminate, but later pod deletions were clean.

The control test, hitting a running pod with kubectl exec -- kill -USR1 1, was useful too. It confirms the signal handling without changing the deployment:

$ kubectl exec haproxy-pod -- sh -c 'kill -USR1 1'
$ kubectl logs haproxy-pod | tail
Proxy health_alive stopped (cumulated conns: FE: 409, BE: 0).
Proxy health_ready stopped (cumulated conns: FE: 409, BE: 0).
Proxy ft_redis_master stopped (cumulated conns: FE: 879, BE: 0).
...

Existing connections stay ESTABLISHED, the pod is marked NotReady because readiness on :8889 can no longer connect, and new connections to the pod’s IP get refused, which matches the docs.
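
Checking that in-between state is quick (same placeholder pod name as above; netstat may need swapping for whatever socket tool your image has):

# Readiness condition should be False while the container keeps running.
kubectl get pod haproxy-pod -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# Listeners on 6379/8888/8889 should be gone; established connections remain.
kubectl exec haproxy-pod -- netstat -tn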

What I’d do differently

Liveness probes that depend on a shared external service create correlated failures across replicas. Once I named the pattern, I started seeing it elsewhere: webhook controllers checking the API server they front, sidecars checking the same upstream as the main container. The fix is the same shape each time: separate the “am I alive” question from the “can I serve” question.

The haproxy docs are explicit about SIGTERM, and I had read past them. My assumption was that graceful shutdown was the default on SIGTERM. In the default plain mode, SIGTERM is an immediate hard stop. Master-worker mode (opt-in via -W or the master-worker keyword) relays signals from the master process to its workers, but I was not running it.

A bot reviewer caught the preStop ordering bug. My first version sent SIGUSR1 before the sleep, which defeated the point of having a sleep at all, and I had missed it on my own read. The PR review made the catch, not me, on a change I felt confident shipping.

Further reading