Upgrade a Ceph Cluster from Mimic to Nautilus using Ceph-Ansible

Nov 21, 2019 · 395 words · 2 minute read

Today I finally got around to upgrading our BorgBase.com Ceph cluster in the EU region from Mimic to Nautilus. The process is rather simple using Ceph-Ansible’s rolling_update.yml playbook, but I still want to document some gotchas and collect links to resources that may be useful for others.

First you should look at the official upgrade notes and release notes to get a feel for the process. Red Hat also has a page on upgrading Ceph. The main takeaways for me were:

You need to upgrade your services in a specific order, from MON to MGR to OSD to MDS. Ceph-Ansible already takes care of that.
There can only be one active MDS while upgrading. Ansible takes care of that too.

There is also some Ansible code to deal with MON and MGR services collocated on the same server. Great. The catch is that I also run MDS on the same servers, and I didn’t see any code in Ceph-Ansible to deal with this case. So I ended up temporarily using a dedicated MDS.
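Ceph-Ansible handles the ordering and the single active MDS for you, but it can be reassuring to check the cluster by hand before starting. The following is a minimal sketch, not part of the playbook: `cluster_healthy` and `single_active_mds` are hypothetical helper names, and the filesystem name you pass in is whatever your own CephFS is called.

```shell
# Hypothetical pre-flight helpers; they only wrap standard ceph CLI calls.

# Succeeds only when `ceph health` reports HEALTH_OK.
cluster_healthy() {
    status="$(ceph health 2>/dev/null)" || return 1
    case "$status" in
        HEALTH_OK*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reduce a CephFS filesystem to one active MDS.
# $1 is the filesystem name (an assumption, use your own).
single_active_mds() {
    ceph fs set "$1" max_mds 1
}
```

For example, `cluster_healthy || echo "fix the cluster first"` before you start, and `single_active_mds myfs` only if you want to lower the MDS count yourself instead of leaving it to the playbook.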

With the prerequisites taken care of, you can continue to prepare Ceph-Ansible:

First, check out the correct branch for the Nautilus release:
$ git checkout stable-4.0

Then install the correct Ansible version:
$ pip install -r requirements.txt

Next copy the playbook to the root folder:
$ cp infrastructure-playbooks/rolling_update.yml .

Then set ceph_stable_release to nautilus in your customized group_vars/all.yml.
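If you prefer to script that change, something like the sketch below would do it. `set_ceph_release` is a hypothetical helper, and it assumes the variable already sits on its own line at the start of a line in the file; if it does not, the file is left unchanged.

```shell
# Hypothetical helper: rewrite the ceph_stable_release line in a vars file.
# A copy of the original is kept next to it with a .bak suffix.
set_ceph_release() {
    vars_file="$1"
    release="$2"
    sed -i.bak "s/^ceph_stable_release:.*/ceph_stable_release: ${release}/" "$vars_file"
}
```

Called as `set_ceph_release group_vars/all.yml nautilus`, after which a quick `grep ceph_stable_release group_vars/all.yml` confirms the result.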

Last, run the playbook, which will take 1-2 hours depending on the cluster size.
$ ansible-playbook -i inventory.ini rolling_update.yml
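Once the playbook has finished, one way to confirm that no daemon was left behind is to look at `ceph versions`. The sketch below assumes its usual JSON output, where each key is a string such as "ceph version 14.2.4 (...) nautilus (stable)"; `all_nautilus` is a hypothetical helper name.

```shell
# Hypothetical check: reads `ceph versions` output on stdin and succeeds
# only if every reported version string mentions "nautilus".
all_nautilus() {
    # Collect the version lines; fail if there are none at all.
    versions="$(grep 'ceph version')" || return 1
    # Any version line without "nautilus" means a daemon is still on an older release.
    if printf '%s\n' "$versions" | grep -qv 'nautilus'; then
        return 1
    fi
    return 0
}
```

Usage would be along the lines of `ceph versions | all_nautilus && echo "all daemons on Nautilus"`.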

If there were no errors, you should now be running a shiny new Nautilus cluster. One thing you will notice in the aftermath is the warning "Legacy BlueStore stats reporting detected on 14 OSD(s)" or similar. This is expected for all OSDs that were set up pre-Nautilus. You can either hide the warning or run a repair on each OSD to switch it to the newer stats reporting, as mentioned here:

systemctl stop ceph-osd@$OSDID
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$OSDID
systemctl start ceph-osd@$OSDID
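With more than a handful of OSDs, the three commands above are easier to run in a loop. Here is a rough sketch of how that could look. `run` and `repair_osd` are hypothetical helper names, the OSD IDs in the usage comment are placeholders, and the path assumes the same /var/lib/ceph/osd/ceph-$OSDID layout as above. Setting DRY_RUN=1 only prints the commands, which is handy for a first pass.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop one OSD, repair its BlueStore, and start it again.
repair_osd() {
    osd_id="$1"
    run systemctl stop "ceph-osd@${osd_id}"
    run ceph-bluestore-tool repair --path "/var/lib/ceph/osd/ceph-${osd_id}"
    run systemctl start "ceph-osd@${osd_id}"
}

# Example usage on one host (IDs are placeholders):
#   ceph osd set noout      # avoid rebalancing while an OSD is down
#   for id in 0 1 2; do repair_osd "$id"; done
#   ceph osd unset noout
```

It is probably wise to wait for the cluster to settle between OSDs rather than repairing many at once. If you would rather hide the warning, the Nautilus release notes mention `ceph config set global bluestore_warn_on_legacy_statfs false`.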

While the OSDs were repairing, I ran Ansible to add some additional MDS daemons. Before doing so, you need to add some new variables to your own group_vars/all.yml:

dashboard_enabled: False: If you don’t plan to use the new dashboard right away.

And to avoid errors if you have existing filesystems set up:

cephfs: myfs
cephfs_data_pool:
  name: myfs_data
cephfs_metadata_pool:
  name: myfs_meta
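To find the names to plug into those CephFS variables, `ceph fs ls` lists the existing filesystem and its pools. The sketch below turns that output into the matching variables. It assumes the usual one-line format, "name: myfs, metadata pool: myfs_meta, data pools: [myfs_data ]", and only looks at the first filesystem and its first data pool; `fs_to_vars` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ceph fs ls` output on stdin and prints
# group_vars lines for the first filesystem listed.
fs_to_vars() {
    awk -F', ' 'NR == 1 {
        # Strip the labels from each comma-separated field.
        sub(/^name: /, "", $1)
        sub(/^metadata pool: /, "", $2)
        sub(/^data pools: \[/, "", $3)
        sub(/ *\].*$/, "", $3)
        # Use only the first data pool if several are listed.
        split($3, pools, " ")
        print "cephfs: " $1
        print "cephfs_data_pool:"
        print "  name: " pools[1]
        print "cephfs_metadata_pool:"
        print "  name: " $2
    }'
}
```

Running `ceph fs ls | fs_to_vars` and comparing the result with your group_vars/all.yml should catch a mistyped pool name before Ansible does.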

That should be all. Hope your upgrade goes well too!