Feature: Limit IOps for VMs #576

Open
sharnoff opened this issue Oct 21, 2023 · 5 comments · May be fixed by #693
Labels: c/autoscaling/neonvm (Component: autoscaling: NeonVM), t/feature (Issue type: feature, for new features or requests)

Comments

sharnoff (Member) commented Oct 21, 2023

Problem description / Motivation

This hasn't happened yet for VMs, but in theory a noisy tenant can saturate disk IO by itself, leading to significant degradation on the underlying k8s node (affecting other pods and the kubelet itself).

Recent inspiration: https://neondb.slack.com/archives/C061XEGSCE7/p1697733194985739?thread_ts=1697732054.624899&cid=C061XEGSCE7

This is also potentially affected by recently moving the file cache to disk.

Feature idea(s) / DoD

IO rate limiting for VMs should not be an accidental side-effect of the speed of QEMU; i.e. we should have intentional safeguards to cap the amount of IO a single VM can do.

This could be implemented as a compiled-in global constant, or we could make it part of the VM spec (with some default value); it could perhaps be combined with the settings from #547.

Implementation ideas

We're already running QEMU in a cgroup; we can additionally set limits there, e.g. io.max, scaled based on the VM's CPU.
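As a rough illustration (cgroup v2, with placeholder device numbers, limits, and cgroup path; not the actual neonvm layout), setting io.max on the QEMU cgroup could look like this:

# Cap the QEMU cgroup at 1000 read/write IOPS and 100 MiB/s on device 252:0.
# "252:0" and the cgroup path are placeholders; the real major:minor comes from
# the device backing the VM's disks, and the io controller must be enabled in
# the parent's cgroup.subtree_control.
echo "252:0 riops=1000 wiops=1000 rbps=104857600 wbps=104857600" > /sys/fs/cgroup/neonvm-qemu/io.max
cat /sys/fs/cgroup/neonvm-qemu/io.max    # verify the applied limits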

We should also consider how this looks from within the VM: if QEMU is blocked on disk, does the VM kernel observe that as the underlying device being slow, or does the VM get invisibly paused? Does time spent waiting on disk count towards the QEMU cgroup's cpu.max? (if so, do we need to change that?)

sharnoff added the t/feature and c/autoscaling/neonvm labels on Oct 21, 2023

cicdteam (Member) commented Oct 29, 2023

@sharnoff

Just FYI: we can manage IOPS limits through QEMU itself. There are parameters for the -drive option (used when QEMU starts), or we can use QMP to adjust limits at runtime. A simple example from the docs: link.
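For reference, a rough sketch of both approaches (the drive id, file name, and limit values here are placeholders, not what neonvm actually uses):

# At QEMU start: throttle a virtio-blk drive to ~1000 total IOPS / 100 MiB/s
qemu-system-x86_64 ... \
    -drive file=rootdisk.qcow2,if=virtio,id=rootdisk,throttling.iops-total=1000,throttling.bps-total=104857600

# At runtime, via the QMP monitor (fields that aren't being limited are passed as 0):
{ "execute": "block_set_io_throttle",
  "arguments": { "id": "rootdisk", "iops": 1000, "iops_rd": 0, "iops_wr": 0,
                 "bps": 104857600, "bps_rd": 0, "bps_wr": 0 } }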

lassizci (Contributor) commented Dec 7, 2023

By the way, we seem to have mq-deadline as the scheduler in the guests. I think we should use noop instead: the CPU cycles the guest spends on scheduling go to waste, because the host (or, in the case of NVMe drives, the storage controller) will do its own scheduling anyway.

cicdteam (Member) commented Dec 7, 2023

From one of the NeonVMs:

  • root disk IO scheduler
root@neonvm:~# cat /sys/block/vda/queue/scheduler 
[mq-deadline] kyber none
  • /neonvm/cache IO scheduler
root@neonvm:~# cat /sys/block/vdc/queue/scheduler 
[mq-deadline] kyber none

Disks in VMs are VirtIO-blk devices, so we can try the none IO scheduler to see whether it improves anything.
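For example, switching the root disk inside a guest (takes effect immediately; persisting it would need something like a udev rule or an init-time script in the guest image):

root@neonvm:~# echo none > /sys/block/vda/queue/scheduler
root@neonvm:~# cat /sys/block/vda/queue/scheduler
mq-deadline kyber [none]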

sharnoff linked a pull request on Dec 19, 2023 that will close this issue
sharnoff (Member, Author) commented

Blocked on design work. Moving back from "in progress" to "selected".
