
BCM 10/11 - etcd Upgrade Required for Kubernetes >= 1.31

Overview

This article provides instructions for resolving etcd version compatibility issues when upgrading to Kubernetes >= 1.31 in BCM environments.

Problem: During Kubernetes installations or upgrades to versions 1.31 and later, kubeadm commands such as init fail with an etcd version error. BCM 10.0 and 11.0 were released with etcd version 3.5.22.

Root Cause: Kubernetes tightened the minimum supported etcd version in patch releases to prevent clusters from upgrading into a known-bad state that can break control-plane rollouts. Older etcd 3.5.x releases had upgrade bugs (learner promotion and membership inconsistencies) that could cause upgrades to fail. To fail fast, kubeadm's version gate was backported across all supported release branches. Unfortunately, the etcd version shipped with BCM Kubernetes is 3.5.22, which is no longer accepted by the latest patch versions of Kubernetes >= 1.31.

Solution: Upgrade etcd to version 3.5.24 or later before attempting a Kubernetes installation or upgrade.
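To decide whether the upgrade is needed, the running etcd version can be compared against the 3.5.24 minimum. A minimal sketch using sort -V; it assumes `etcd --version` prints a line of the form "etcd Version: 3.5.22", which is the usual output format:

```shell
# Compare the installed etcd version against the 3.5.24 minimum enforced by kubeadm.
# Assumes `etcd --version` prints a line like "etcd Version: 3.5.22".
min=3.5.24
cur=$(etcd --version | awk '/etcd Version/ {print $3}')
if [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]; then
  echo "etcd $cur meets the minimum ($min); no upgrade needed"
else
  echo "etcd $cur is older than $min; upgrade required"
fi
```

Run this on a node where etcd is installed (on BCM nodes, `module load etcd` first so the binary is on the PATH).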

Prerequisites

  • BCM Version: 10 or 11
  • Target Kubernetes Version: >= 1.31
  • Current Etcd Version: any cm-etcd package providing etcd older than 3.5.24 (BCM 10 and 11 shipped 3.5.22) must be upgraded

Background

The etcd maintainers documented a critical upgrade failure path from etcd 3.5 to 3.6. Under certain sequences, a voting member can revert to a learner because membership changes were persisted only in the v2store (in etcd 3.5) but etcd 3.6 treats the v3store as the source of truth. This can strand upgrades with "too many learner members" errors or propagate incorrect membership information.

Fixes were implemented in etcd 3.5.20+ with additional backports in 3.5.24, including learner-promotion persistence fixes and Go toolchain bumps. Kubernetes raised the minimum etcd version requirement to ensure clusters don't upgrade into this problematic state.


Scenario 1: New Kubernetes Setup via cm-kubernetes-setup

When using the BCM Kubernetes setup wizard, it defaults to installing the latest patch version for the selected minor version (e.g., 1.32.10 for Kubernetes 1.32.x). Recent patch releases for versions >= 1.31 include the etcd version check that will cause installation to fail.

Symptom

The installation fails during the kubeadm initialization stage with the following error:

#### stage: kubernetes: Kubeadm Initialize First Node
Initializing kubeadm cluster on node001...
[init] Using Kubernetes version: v1.32.10
[preflight] Running pre-flight checks
[preflight] Some fatal errors occurred:
       [ERROR ExternalEtcdVersion]: this version of kubeadm only supports external etcd version >= 3.5.24-0. Current version: 3.5.22
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
error execution phase preflight

Solution

BCM will release updated cm-etcd packages to address this issue. At the time of writing, these packages are available for manual download.

Step 1: Remove the Failed Kubernetes Setup (If Needed)

Choose undo in the setup wizard to clean up the failed cm-kubernetes-setup run. If the setup was aborted instead, it can be cleaned up with:

cm-kubernetes-setup --remove --yes-i-really-mean-it

It may be necessary to provide the label of the Kubernetes cluster with an additional --cluster <label> flag, for example:

cm-kubernetes-setup --remove --cluster k8s-user --yes-i-really-mean-it

To list the labels of all installed clusters, run cmsh -c 'kubernetes list'.

There have been reported cases where the above instructions did not work. If the setup cannot be removed this way, use the following procedure instead.

Install the cleanup script on the head node:

root@rb-kube:~# wget https://support2.brightcomputing.com/cm-etcd/cleanup-old-install.sh
...
root@rb-kube:~# chmod +x cleanup-old-install.sh

Determine which Kube cluster should be removed:

root@rb-kube:~# cmsh -c 'kubernetes list'
Name (key)
------------------
default
k8s-user

In our case we want to remove the k8s-user cluster.

root@rb-kube:~# ./cleanup-old-install.sh k8s-user
firewall role on rb-kube
firewall role on rb-kube
apiserverproxy role on rb-kube
removing kubecluster from apiserverproxy
...

Verify that the Kube cluster was removed by repeating the cmsh command:

root@rb-kube:~# cmsh -c 'kubernetes list'
Name (key)
------------------
default
Step 2: Download the Updated cm-etcd Package

Download the appropriate package from: https://support2.brightcomputing.com/cm-etcd/

Step 3: Verify Package Integrity

Verify the downloaded package using the following MD5 checksums:

| BCM Version | Distribution | Package Filename | MD5 Checksum |
|-------------|--------------|------------------|--------------|
| BCM 10 | RHEL 8 | cm-etcd-3.5.25-100101_cm10.0_960146e38f.x86_64.rpm | 33ecb94d6b16d52dd204432ddbe5b2ac |
| BCM 10 | Ubuntu 24.04 | cm-etcd_3.5.25-100101-cm10.0-960146e38f_amd64.deb | ab9c5bc39f912eb722eab7eb90ddef41 |
| BCM 10 | Ubuntu 20.04 | cm-etcd_3.5.25-100101-cm10.0-960146e38f_amd64.deb | 7c8e78bd066ff5f8d09b5c541005fa76 |
| BCM 10 | Ubuntu 22.04 | cm-etcd_3.5.25-100101-cm10.0-960146e38f_amd64.deb | 4cb03d03eb6b0038c1bd77e673cee9e0 |
| BCM 10 | SLES 15 | cm-etcd-3.5.25-100101_cm10.0_960146e38f.x86_64.rpm | 51980d287d294b50f284c44fc751e488 |
| BCM 10 | RHEL 9 | cm-etcd-3.5.25-100101_cm10.0_960146e38f.x86_64.rpm | 7768f407db9f52e5b9e1ea7733f2997c |
| BCM 11 | RHEL 8 | cm-etcd-3.5.25-100104_cm11.0_5e7f36e727.x86_64.rpm | 644d14e350807f88bbb9f2e8cfefff8c |
| BCM 11 | Ubuntu 24.04 | cm-etcd_3.5.25-100104-cm11.0-5e7f36e727_amd64.deb | 45dd303abe3190cf407ec6c0d02995b4 |
| BCM 11 | Ubuntu 22.04 | cm-etcd_3.5.25-100104-cm11.0-5e7f36e727_amd64.deb | 9fa483e9f6d8b424d454f50755c25ff9 |
| BCM 11 | SLES 15 | cm-etcd-3.5.25-100104_cm11.0_5e7f36e727.x86_64.rpm | 6efa06afd7d5d9ea217e2ce7c0fa6c21 |
| BCM 11 | RHEL 9 | cm-etcd-3.5.25-100104_cm11.0_5e7f36e727.x86_64.rpm | 90f7f7d615bf9b151107c1c45c718cc9 |

Step 4: Install Package in Software Image

Install the package in the appropriate software image for the etcd nodes before executing the setup.

For example, if the etcd nodes are provisioned from /cm/images/k8s-control-image:

# Example for BCM 11 on Ubuntu 24.04

# first enter the software image chroot
cm-chroot /cm/images/k8s-control-image

# inside it, download and install the appropriate package
wget https://support2.brightcomputing.com/cm-etcd/bcm11/ubuntu2404/cm-etcd_3.5.25-100104-cm11.0-5e7f36e727_amd64.deb
apt install ./cm-etcd_3.5.25-100104-cm11.0-5e7f36e727_amd64.deb

# exit the chroot
exit

Step 5: Proceed with Setup

After installing the updated package, re-run or continue the Kubernetes setup as normal.

Scenario 2: Upgrading Existing Kubernetes Clusters

For existing clusters that need to be upgraded to Kubernetes >= 1.31, the etcd version must be updated to meet the minimum requirements.

Solution

Follow the same process as Scenario 1 to obtain and verify the updated cm-etcd package. However, for existing clusters, use a rolling update approach to minimize disruption.

Step 1: Update Software Image

Install the new cm-etcd package in the relevant software images as described in Scenario 1.

Step 2: Perform Rolling Update

Instead of updating all etcd nodes simultaneously, update them one at a time:

  1. Update First etcd Node:

    # On the etcd node, BCM 11 Ubuntu 24.04
    wget https://support2.brightcomputing.com/cm-etcd/bcm11/ubuntu2404/cm-etcd_3.5.25-100104-cm11.0-5e7f36e727_amd64.deb
    apt install ./cm-etcd_3.5.25-100104-cm11.0-5e7f36e727_amd64.deb
  2. Restart etcd Service:

    # Note: the package installation may already have restarted etcd automatically;
    # if so, this step can be skipped
    systemctl restart etcd
  3. Verify etcd Version and Health:

    module load etcd && etcdctl endpoint status --cluster --write-out=table

    Expected output showing mixed versions during rolling update:

    +-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://10.141.0.1:2379 | 4a336cbcb0bafdc0 |  3.5.25 |  249 MB |     false |      false |      8623 |    4355787 |            4355782 |        |
    | https://10.141.0.2:2379 | 10cee25dc156ff4a |  3.5.22 |  280 MB |      true |      false |      8623 |    4355778 |            4355778 |        |
    | https://10.141.0.3:2379 | bd786940e5446229 |  3.5.22 |  252 MB |     false |      false |      8623 |    4355788 |            4355772 |        |
    +-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    
  4. Verify Cluster Health:

    etcdctl endpoint health --cluster --write-out=table

    Expected output:

    +-------------------------+--------+--------------+-------+
    |        ENDPOINT         | HEALTH |     TOOK     | ERROR |
    +-------------------------+--------+--------------+-------+
    | https://10.141.0.2:2379 |   true |  20.617606ms |       |
    | https://10.141.0.3:2379 |   true |  23.436572ms |       |
    | https://10.141.0.1:2379 |   true | 327.724023ms |       |
    +-------------------------+--------+--------------+-------+
    
  5. Repeat for Remaining Nodes: Continue only if all nodes report healthy status. Repeat steps 1-4 for each remaining etcd node.

Step 3: Proceed with Kubernetes Upgrade

Once all etcd nodes are running version 3.5.24 or later, proceed with the Kubernetes upgrade.
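As a final gate before the Kubernetes upgrade, the member versions can be checked programmatically instead of reading the table by eye. A sketch that parses etcdctl's JSON status output (the `Endpoint` and `Status.version` field names are etcdctl v3's JSON output; verify against your etcdctl version):

```shell
# Print the etcd version reported by every cluster member; all entries should be
# 3.5.24 or newer before starting the Kubernetes upgrade.
module load etcd
etcdctl endpoint status --cluster --write-out=json \
  | python3 -c 'import json, sys
for e in json.load(sys.stdin):
    print(e["Endpoint"], e["Status"]["version"])'
```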