k3s 1.29.3 upgrade — zero downtime, eventually
Rolling k3s from 1.28.x to 1.29.3 on a three-node cluster. The goal was zero downtime. The result was zero downtime with one two-second blip that I’m choosing to call acceptable.
Method: cordon and drain, upgrade the node, uncordon, wait five minutes, repeat. (Drain cordons the node for you, so there's no separate cordon step.) The five-minute wait matters — rescheduled pods need time to settle before you pull the rug out from the next node.
# from any machine with kubectl access; drain blocks until the node is empty
kubectl drain k3s-02 --ignore-daemonsets --delete-emptydir-data
# on k3s-02 itself (server node; agents also need K3S_URL/K3S_TOKEN in the env)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.3+k3s1 sh -
kubectl uncordon k3s-02
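Rather than a fixed five-minute sleep, the settle wait can be scripted as a gate that polls until every pod is Running or Completed. A minimal sketch — the `settle` function name and the 300-second budget are my own choices, and kubectl is assumed to already point at the cluster:

```shell
#!/bin/sh
# Gate: return 0 once no pod is in a transitional state, 1 on timeout.
settle() {
  deadline=$(( $(date +%s) + 300 ))   # 300s budget, tune to taste
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # STATUS is column 4 of `kubectl get pods -A --no-headers`;
    # count anything that is neither Running nor Completed.
    unsettled=$(kubectl get pods -A --no-headers 2>/dev/null \
      | awk '$4 != "Running" && $4 != "Completed"' | wc -l)
    [ "$unsettled" -eq 0 ] && return 0
    sleep 10
  done
  echo "pods did not settle within 300s" >&2
  return 1
}
```

Call `settle` between uncordoning one node and draining the next; a non-zero exit stops the loop before you take a second node down on top of an unhealthy workload.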
The tricky part was Ceph. With three OSDs across three nodes, draining any node temporarily drops redundancy from 3 replicas to 2. I paused between each node and waited for ceph -s to show HEALTH_OK before moving on.
watch ceph -s # wait for: health: HEALTH_OK, all PGs active+clean
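The watch works for an interactive run, but the same gate can be scripted so an unattended loop refuses to proceed while redundancy is degraded. A sketch, assuming `ceph health` is available wherever the script runs; the `ceph_gate` name and the 15-minute ceiling are mine:

```shell
#!/bin/sh
# Block until Ceph reports HEALTH_OK; give up after ~15 minutes.
ceph_gate() {
  tries=0
  until ceph health 2>/dev/null | grep -q HEALTH_OK; do
    tries=$((tries + 1))
    if [ "$tries" -gt 90 ]; then   # 90 * 10s = 15 minutes
      echo "ceph still not HEALTH_OK after 15m, aborting" >&2
      return 1
    fi
    sleep 10
  done
}
```

With only three OSDs there is no headroom: one node down means every PG is running undersized, so failing closed here is the right default.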
The blip: Traefik lingered on a stale informer after the last node came back, giving a two-second window of 502s. The fix is to bounce the ingress controller after the last node upgrade, before declaring the rollout done. This isn't documented anywhere obvious — I only caught it via Uptime Kuma alerts.
Next up: 1.30.x when it hits stable. The anxiety is gone once you know the restore actually works.