
osd: fix activate failure when block device moves (backport #14374) #14377

Merged
merged 1 commit into release-1.13 from mergify/bp/release-1.13/pr-14374 on Jun 26, 2024

Conversation


@mergify mergify bot commented Jun 25, 2024

Block devices can move between reboots. In corner cases, an OSD's block device might move to a lower-indexed device while the previous device no longer exists. For example, an OSD on /dev/sde might move to /dev/sdd on reboot if the original /dev/sdd died. There would be no /dev/sde after that.

Users report that NVMe drives commonly change names, even when there are no disk failures.

For these cases, ensure the activate script properly handles the situation where the previous disk is not present on the node but the OSD is still available on a different disk.

Resolves #13564
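
In essence: if listing the OSD's previously recorded device fails or does not return this OSD, fall back to listing all raw devices and look the OSD up by its ID. Below is a minimal sketch of that flow, reconstructed from the trace in this description rather than copied from the patch; the ceph-volume commands and the DEVICE, OSD_LIST, and find_device names appear in the trace, while the exact control flow here is a simplification (find_device is the inline-Python helper shown unescaped after the trace):

# Sketch only; assumes DEVICE, OSD_ID, and find_device are already defined as in the trace.
OSD_LIST="$(mktemp)"

# Try the device recorded for this OSD first. Tolerate failure: the device
# may have been renamed or removed across the reboot.
ceph-volume raw list "$DEVICE" > "$OSD_LIST" || true

# If that listing does not contain this OSD, scan all devices and look the
# OSD up by its ID instead.
if ! DEVICE="$(find_device < "$OSD_LIST")"; then
    ceph-volume raw list > "$OSD_LIST"
    DEVICE="$(find_device < "$OSD_LIST")"
fi

# Give up if no device reports this OSD at all.
[[ -z "$DEVICE" ]] && exit 1

ceph-volume raw activate --device "$DEVICE" --no-systemd --no-tmpfs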


I tested this manually by editing one of my OSDs to use /dev/vdf in an environment with no /dev/vdf present. After upgrading to the patched version, ceph-volume still fails with the same error when the disk is not present, but the activate script is now able to move past that failure and continue successfully.

+ OSD_ID=3
+ CEPH_FSID=549c9978-d49a-4c79-bfbf-a1257e983194
+ OSD_UUID=7c154eff-9de6-4983-b882-5e123001669c
+ OSD_STORE_FLAG=--bluestore
+ OSD_DATA_DIR=/var/lib/ceph/osd/ceph-3
+ CV_MODE=raw
+ DEVICE=/dev/vdf
+ cp --no-preserve=mode /etc/temp-ceph/ceph.conf /etc/ceph/ceph.conf
+ python3 -c '
import configparser

config = configparser.ConfigParser()
config.read('\''/etc/ceph/ceph.conf'\'')

if not config.has_section('\''global'\''):
    config['\''global'\''] = {}

if not config.has_option('\''global'\'','\''fsid'\''):
    config['\''global'\'']['\''fsid'\''] = '\''549c9978-d49a-4c79-bfbf-a1257e983194'\''

with open('\''/etc/ceph/ceph.conf'\'', '\''w'\'') as configfile:
    config.write(configfile)
'
+ ceph -n client.admin auth get-or-create osd.3 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' -k /etc/ceph/admin-keyring-store/keyring
[osd.3]
	key = AQBInnlmmmJ1GRAAA2nGVvousbw1pSHzPA8fqA==
+ [[ raw == \l\v\m ]]
++ mktemp
+ OSD_LIST=/tmp/tmp.v0wIeVE7O3
+ ceph-volume raw list /dev/vdf
 stderr: lsblk: /dev/vdf: not a block device
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 11, in <module>
    load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
    self.main(self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 174, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 130, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 99, in generate
    info_devices.append(disk.lsblk(dev, abspath=True))
  File "/usr/lib/python3.6/site-packages/ceph_volume/util/disk.py", line 245, in lsblk
    abspath=abspath)
  File "/usr/lib/python3.6/site-packages/ceph_volume/util/disk.py", line 337, in lsblk_all
    raise RuntimeError(f"Error: {err}")
RuntimeError: Error: ['lsblk: /dev/vdf: not a block device']
+ echo ''
+ cat /tmp/tmp.v0wIeVE7O3

+ find_device
+ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 3:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 3'\'')
'
Traceback (most recent call last):
  File "<string>", line 3, in <module>
  File "/usr/lib64/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
+ ceph-volume raw list
+ cat /tmp/tmp.v0wIeVE7O3
{
    "1c0ce558-9a82-4ec6-bb16-062951f96d8a": {
        "ceph_fsid": "549c9978-d49a-4c79-bfbf-a1257e983194",
        "device": "/dev/vdc",
        "osd_id": 2,
        "osd_uuid": "1c0ce558-9a82-4ec6-bb16-062951f96d8a",
        "type": "bluestore"
    },
    "7c154eff-9de6-4983-b882-5e123001669c": {
        "ceph_fsid": "549c9978-d49a-4c79-bfbf-a1257e983194",
        "device": "/dev/vdb",
        "osd_id": 3,
        "osd_uuid": "7c154eff-9de6-4983-b882-5e123001669c",
        "type": "bluestore"
    }
}
++ find_device
++ python3 -c '
import sys, json
for _, info in json.load(sys.stdin).items():
	if info['\''osd_id'\''] == 3:
		print(info['\''device'\''], end='\'''\'')
		print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
		sys.exit(0)  # don'\''t keep processing once the disk is found
sys.exit('\''no disk found with OSD ID 3'\'')
'
found device: /dev/vdb
+ DEVICE=/dev/vdb
+ [[ -z /dev/vdb ]]
+ OSD_BLOCK_PATH=/var/lib/ceph/osd/ceph-3/block
++ readlink /var/lib/ceph/osd/ceph-3/block
+ '[' -L /var/lib/ceph/osd/ceph-3/block -a /dev/vdb '!=' /dev/vdb ']'
+ ceph-volume raw activate --device /dev/vdb --no-systemd --no-tmpfs
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-3 --no-mon-config --dev /dev/vdb
Running command: /usr/bin/chown -R ceph:ceph /dev/vdb
Running command: /usr/bin/ln -s /dev/vdb /var/lib/ceph/osd/ceph-3/block
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
--> ceph-volume raw activate successful for osd ID: 3
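
For readability, the find_device helper traced above (the python3 -c snippet with shell escaping) amounts to roughly the following, assuming the real script interpolates its OSD_ID variable into the snippet (the trace only shows the already-expanded value 3, so the exact quoting is an assumption):

find_device() {
    # Read `ceph-volume raw list` JSON from stdin and print the device whose
    # osd_id matches; exit nonzero if no entry matches or the input is not JSON.
    # Requires OSD_ID to be set in the calling shell.
    python3 -c "
import sys, json
for _, info in json.load(sys.stdin).items():
    if info['osd_id'] == ${OSD_ID}:
        print(info['device'], end='')
        print('found device: ' + info['device'], file=sys.stderr)  # log the disk we found to stderr
        sys.exit(0)  # don't keep processing once the disk is found
sys.exit('no disk found with OSD ID ${OSD_ID}')
"
}

The JSONDecodeError mid-trace is this helper failing on the empty output of the first ceph-volume raw list call, which is exactly what triggers the fallback to listing all devices.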

I don't believe we have a good way of guaranteeing this code path gets tested in unit or CI tests, so the manual testing will have to do for now.

Because this was reported by a user upgrading to 1.13, we plan to backport this to 1.14 and 1.13.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

This is an automatic backport of pull request #14374 done by [Mergify](https://mergify.com).

@BlaineEXE (Member) commented Jun 26, 2024

@Mergifyio rebase

Block devices can move between reboots. In corner cases, an OSD's block
device might move to a lower-indexed device while the previous device
does not exist. For example, an OSD on /dev/sde might move to /dev/sdd
on reboot if the original /dev/sdd died. There would be no /dev/sde
after that.

Users report that NVMe drives commonly change names, even when there are
no disk failures.

For these cases, ensure the activate script properly handles cases where
the previous disk is not present on the node and where the OSD is still
available on a different disk.

Signed-off-by: Blaine Gardner <[email protected]>
(cherry picked from commit f2304bf)

mergify bot commented Jun 26, 2024

rebase

✅ Branch has been successfully rebased

@BlaineEXE force-pushed the mergify/bp/release-1.13/pr-14374 branch from be009c0 to 309e0c9 on June 26, 2024 20:17
@BlaineEXE merged commit 9e9787c into release-1.13 on Jun 26, 2024
51 of 52 checks passed
@mergify mergify bot deleted the mergify/bp/release-1.13/pr-14374 branch June 26, 2024 22:03