k3s 1.29.3 upgrade — zero downtime, eventually
Rolling k3s from 1.28.x to 1.29.3 on a three-node cluster. The goal was zero downtime. The result was zero downtime with one two-second blip that I’m choosing to call acceptable.
Method: cordon and drain, upgrade the node, uncordon, wait five minutes, repeat. (Drain cordons the node for you, so there's no separate cordon step.) The five-minute wait matters — rescheduled pods need time to settle before you pull the rug out from the next node.
# from any machine with kubectl access; drain blocks until the node is empty
kubectl drain k3s-02 --ignore-daemonsets --delete-emptydir-data
# on k3s-02 itself (server node; agents also need K3S_URL/K3S_TOKEN in the env)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.3+k3s1 sh -
kubectl uncordon k3s-02
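Rather than a fixed five-minute sleep, the settle wait can be scripted as a gate that polls until every pod is Running or Completed. A minimal sketch — the `settle` function name and the 300-second budget are my own choices, and kubectl is assumed to already point at the cluster:

```shell
#!/bin/sh
# Gate: return 0 once no pod is in a transitional state, 1 on timeout.
settle() {
  deadline=$(( $(date +%s) + 300 ))   # 300s budget, tune to taste
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # STATUS is column 4 of `kubectl get pods -A --no-headers`;
    # count anything that is neither Running nor Completed.
    unsettled=$(kubectl get pods -A --no-headers 2>/dev/null \
      | awk '$4 != "Running" && $4 != "Completed"' | wc -l)
    [ "$unsettled" -eq 0 ] && return 0
    sleep 10
  done
  echo "pods did not settle within 300s" >&2
  return 1
}
```

Call `settle` between uncordoning one node and draining the next; a non-zero exit stops the loop before you take a second node down on top of an unhealthy workload.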
The tricky part was Ceph. With three OSDs across three nodes, draining any node temporarily drops redundancy from 3 replicas to 2. I paused between each node and waited for ceph -s to show HEALTH_OK before moving on.
watch ceph -s # wait for: health: HEALTH_OK, all PGs active+clean
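The watch works for an interactive run, but the same gate can be scripted so an unattended loop refuses to proceed while redundancy is degraded. A sketch, assuming `ceph health` is available wherever the script runs; the `ceph_gate` name and the 15-minute ceiling are mine:

```shell
#!/bin/sh
# Block until Ceph reports HEALTH_OK; give up after ~15 minutes.
ceph_gate() {
  tries=0
  until ceph health 2>/dev/null | grep -q HEALTH_OK; do
    tries=$((tries + 1))
    if [ "$tries" -gt 90 ]; then   # 90 * 10s = 15 minutes
      echo "ceph still not HEALTH_OK after 15m, aborting" >&2
      return 1
    fi
    sleep 10
  done
}
```

With only three OSDs there is no headroom: one node down means every PG is running undersized, so failing closed here is the right default.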
The blip: Traefik lingered on a stale informer after the last node came back, giving a two-second window of 502s. The fix is to bounce the ingress controller after the last node upgrade, before declaring the rollout done. This isn't documented anywhere obvious — I only caught it via Uptime Kuma alerts.
Next up: 1.30.x when it hits stable. The anxiety is gone once you know the restore actually works.