
Zero-Downtime Kubernetes Deployments: The Complete Playbook

A production-tested guide to blue/green deployments, rolling updates, and automated rollback strategies on Kubernetes — without waking anyone up at 3am.

Shehan Induwara · 10 min read

Zero-downtime deployments aren't a feature you add at the end. They're an architectural constraint you design for from day one.

After deploying dozens of services to production Kubernetes clusters, I've developed a playbook that reliably achieves zero-downtime releases — even for stateful workloads.

The fundamental requirement: graceful shutdown

Every container in your deployment must handle SIGTERM correctly. This is step zero, and most guides skip it.

// main.go — proper graceful shutdown
package main

import (
  "context"
  "log"
  "net/http"
  "os"
  "os/signal"
  "syscall"
  "time"
)

func main() {
  router := http.NewServeMux() // register your routes on this mux
  server := &http.Server{Addr: ":8080", Handler: router}

  // Start the server in a goroutine so main can block on signals
  go func() {
    if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
      log.Fatal(err)
    }
  }()

  // Wait for an interrupt or termination signal
  quit := make(chan os.Signal, 1)
  signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
  <-quit

  // Give in-flight requests time to complete. This timeout plus the
  // preStop delay must fit within terminationGracePeriodSeconds, so
  // with a 30s grace period and a 5s preStop sleep, use 25s here.
  ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
  defer cancel()

  if err := server.Shutdown(ctx); err != nil {
    log.Fatal("Server forced to shut down:", err)
  }
}

Rolling update configuration

For most workloads, a properly tuned rolling update is all you need:

spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never kill old pods until new ones are ready
      maxSurge: 1         # Allow one extra pod during transition
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: api
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # Drain in-flight before SIGTERM

The preStop sleep is critical: it gives kube-proxy and external load balancers time to remove the pod from their endpoints before SIGTERM arrives, so new connections stop being routed to a pod that is about to drain. One caveat on timing: the preStop sleep plus your application's shutdown timeout must fit within terminationGracePeriodSeconds, or the kubelet will SIGKILL the process mid-drain.

Automated rollback with GitHub Actions

- name: Deploy and verify
  run: |
    kubectl set image deployment/api api=$IMAGE_TAG
    kubectl rollout status deployment/api --timeout=5m || {
      echo "Deployment failed — rolling back"
      kubectl rollout undo deployment/api
      exit 1
    }

- name: Run smoke tests
  run: |
    ./scripts/smoke-test.sh "$DEPLOYMENT_URL" || {
      echo "Smoke tests failed — rolling back"
      kubectl rollout undo deployment/api
      exit 1
    }
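For completeness, here is one way scripts/smoke-test.sh could be written — a hypothetical sketch, assuming the /healthz/ready endpoint from the Deployment above; a real smoke test would also exercise service-specific behavior, not just readiness:

```shell
#!/bin/sh
# Hypothetical sketch of scripts/smoke-test.sh: poll the readiness
# endpoint until it returns 200, failing after a bounded number of tries.
smoke_test() {
  base_url="$1"
  attempts="${2:-10}"   # how many polls before giving up
  delay="${3:-3}"       # seconds between polls
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints only the HTTP status code; "000" means connection failure
    status=$(curl -s -o /dev/null -w '%{http_code}' "$base_url/healthz/ready" || true)
    if [ "$status" = "200" ]; then
      echo "smoke test passed"
      return 0
    fi
    echo "attempt $i/$attempts: got HTTP $status"
    sleep "$delay"
    i=$((i + 1))
  done
  echo "smoke test failed"
  return 1
}
```

The pipeline step would then call smoke_test "$DEPLOYMENT_URL" and rely on the nonzero exit status to trigger the rollback branch.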

Conclusion

Zero-downtime deployments require discipline across the entire stack: application code, container configuration, Kubernetes settings, and your CI/CD pipeline. Get any layer wrong and you'll have incidents.

Build the habit of running kubectl rollout status and validating smoke tests on every deploy, and you'll build the confidence to deploy 20 times a day without hesitation.