
CORE-4600 - Quotas: disable produce quota by default #20142

Merged
merged 2 commits into redpanda-data:dev from quotas/disable-produce-default on Jun 27, 2024

Conversation

pgellert
Contributor

@pgellert pgellert commented Jun 25, 2024

During performance testing it became apparent that the overhead of quota management, while small, is non-negligible (~400ns/request). We believe that some of our latency-sensitive customers would prefer to avoid this overhead on the request path. Therefore, this PR adds the ability to disable the produce quota config and disables it by default.

This is implemented without changing the configuration type (from bounded_property<uint32_t> to property<std::optional<uint32_t>>); instead, I opted to use 0 as a sentinel value for disabled. Locally, I tested the change to property<std::optional<uint32_t>> as well, and it seems to work, since the values are serialized to ss::sstring in the controller log and the serialized values of uint32 appear to be a subset of those of optional<uint32>. However, it still seems lower risk to use a sentinel value here, because it is hard to pin down and test all the places where the config values are serialized into YAML/JSON, and how those serializations would change if we changed the config value to a std::optional<uint32_t>. Since these configs are deprecated and going to be removed in 2 major versions, I believe this is a reasonable trade-off.
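For illustration, here is a minimal sketch of the sentinel-value approach described above. The helper name and free-function shape are hypothetical and not part of the actual change; it only shows how a 0 value in the uint32_t property can be mapped to "no quota":

    // Hypothetical helper: interpret 0 in the uint32_t config as "quota disabled",
    // so callers receive std::nullopt instead of a byte-rate limit.
    #include <cstdint>
    #include <optional>

    std::optional<uint32_t>
    effective_produce_quota(uint32_t target_quota_byte_rate) {
        if (target_quota_byte_rate == 0) {
            return std::nullopt; // 0 is the sentinel for "disabled"
        }
        return target_quota_byte_rate;
    }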

Fixes https://redpandadata.atlassian.net/browse/CORE-4600

Benchmark results

As expected, all the benchmarks that have quotas off improve, which now also includes the default case for produce.

single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          32
random seed:              2540878669

test                                                            iterations      median         mad         min         max      allocs       tasks        inst
throughput_group.test_quota_manager_on_unlimited_shared           41943040    28.397ns     0.012ns    28.368ns    28.410ns       0.656       0.000         0.0
throughput_group.test_quota_manager_on_unlimited_unique           41943040    27.000ns     0.087ns    26.840ns    28.767ns       0.252       0.001         0.0
throughput_group.test_quota_manager_on_limited_shared             41943040    31.136ns     0.155ns    30.972ns    31.410ns       0.656       0.000         0.0
throughput_group.test_quota_manager_on_limited_unique             41943040    27.482ns     0.126ns    27.294ns    27.608ns       0.252       0.001         0.0
throughput_group.test_quota_manager_off_shared                   608174080     1.597ns     0.009ns     1.588ns     1.727ns       0.094       0.000         0.0
throughput_group.test_quota_manager_off_unique                   618659840     1.524ns     0.009ns     1.515ns     1.721ns       0.063       0.000         0.0
latency_group.existing_client_produce_100_others                    110100   305.827ns     2.455ns   300.770ns   370.914ns      11.000       0.000         0.0
latency_group.existing_client_fetch_100_others                      105000   607.491ns     1.238ns   600.661ns   724.629ns      22.000       0.000         0.0
latency_group.new_client_produce_100_others                         102800   739.801ns     5.008ns   734.794ns   836.278ns      12.040       0.000         0.0
latency_group.new_client_fetch_100_others                           101200     1.016us     0.782ns     1.011us     1.122us      17.040       0.002         0.0
latency_group.existing_client_produce_1000_others                    58700   169.915ns     0.214ns   169.287ns   170.129ns      11.000       0.000         0.0
latency_group.existing_client_fetch_1000_others                      60700   345.662ns     2.627ns   342.027ns   395.834ns      22.000       0.000         0.0
latency_group.new_client_produce_1000_others                         65300   531.249ns     1.291ns   529.958ns   608.963ns      12.020       0.001         0.0
latency_group.new_client_fetch_1000_others                           60100   706.559ns     0.445ns   706.114ns   780.256ns      17.020       0.001         0.0
latency_group.existing_client_produce_10000_others                   13100   172.368ns     0.335ns   172.033ns   173.614ns      11.000       0.000         0.0
latency_group.existing_client_fetch_10000_others                     14400   352.623ns     1.389ns   350.707ns   354.012ns      22.000       0.000         0.0
latency_group.new_client_produce_10000_others                        14300   507.215ns     1.931ns   505.284ns   516.292ns      12.000       0.001         0.0
latency_group.new_client_fetch_10000_others                          14400   696.061ns     4.296ns   691.765ns   705.216ns      17.000       0.001         0.0
latency_group.existing_client_produce_100_others_not_shard_0         71700   376.862ns     3.136ns   373.006ns   379.997ns      11.000       0.002         0.0
latency_group.existing_client_fetch_100_others_not_shard_0           82600   701.857ns     7.485ns   694.373ns   818.803ns      22.000       0.005         0.0
latency_group.new_client_produce_100_others_not_shard_0              70600     2.779us     3.321ns     2.776us     3.208us       6.020       3.000         0.0
latency_group.new_client_fetch_100_others_not_shard_0                68500     3.066us     3.891ns     3.062us     3.546us      11.020       3.001         0.0
latency_group.existing_client_produce_1000_others_not_shard_0        25300   341.737ns     0.975ns   340.613ns   343.275ns      11.000       0.001         0.0
latency_group.existing_client_fetch_1000_others_not_shard_0          25000   688.687ns     5.226ns   683.461ns   718.709ns      22.000       0.002         0.0
latency_group.new_client_produce_1000_others_not_shard_0             23300     3.067us    20.573ns     3.012us     3.087us       6.010       3.000         0.0
latency_group.new_client_fetch_1000_others_not_shard_0               23600     3.252us     2.168ns     3.247us     3.255us      11.010       3.001         0.0
latency_group.existing_client_produce_10000_others_not_shard_0        3400   339.762ns     1.056ns   337.747ns   342.591ns      11.000       0.001         0.0
latency_group.existing_client_fetch_10000_others_not_shard_0          3400   667.726ns     1.000ns   666.725ns   672.398ns      22.000       0.004         0.0
latency_group.new_client_produce_10000_others_not_shard_0             3300     2.815us     5.786ns     2.797us     2.829us       6.000       3.000         0.0
latency_group.new_client_fetch_10000_others_not_shard_0               3300     3.108us    19.237ns     3.076us     4.192us      11.000       3.001         0.0
latency_group.default_configs_produce_worst                         127200    26.415ns     0.197ns    26.003ns    33.435ns       1.000       0.000         0.0
latency_group.default_configs_fetch_worst                           126000    47.980ns     0.153ns    47.827ns    54.350ns       2.000       0.000         0.0

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • The produce client quota (target_quota_byte_rate) is now disabled by default. Previously this was enabled at 2GB/shard/client.id.
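    For operators who relied on the previous behaviour, the quota can be re-enabled by setting the property back to a non-zero byte rate. A hedged example via rpk cluster configuration, assuming the old 2GB default corresponds to 2147483648 bytes:

        rpk cluster config set target_quota_byte_rate 2147483648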

@pgellert
Contributor Author

/dt

@pgellert pgellert force-pushed the quotas/disable-produce-default branch 2 times, most recently from a22894d to cd79309 on June 27, 2024 08:23
@pgellert pgellert self-assigned this Jun 27, 2024
@pgellert pgellert changed the title from "[WIP] Quotas: disable produce quota by default" to "CORE-4600 - Quotas: disable produce quota by default" Jun 27, 2024
@pgellert pgellert requested review from travisdowns, StephanDollberg, a team, michael-redpanda and BenPope and removed request for a team June 27, 2024 09:14
@pgellert pgellert marked this pull request as ready for review June 27, 2024 09:14
@pgellert pgellert requested a review from a team as a code owner June 27, 2024 09:14
Member

@BenPope BenPope left a comment


Does what it says on the tin.

Can you add the benchmark?

src/v/config/configuration.cc (review thread, resolved)
@pgellert
Contributor Author

Can you add the benchmark?

I added it soon after raising the PR. I think if you refresh you should see it.

@BenPope
Member

BenPope commented Jun 27, 2024

Can you add the benchmark?

I added it soon after raising the PR. I think if you refresh you should see it.

I only see changes to src/v/kafka/server/tests/client_quota_translator_test.cc

@pgellert
Contributor Author

If you mean that I should add new benchmarks, I haven't added any because the existing ones cover the change. Specifically these lines from the PR cover letter:

throughput_group.test_quota_manager_off_shared                   608174080     1.597ns     0.009ns     1.588ns     1.727ns       0.094       0.000         0.0
throughput_group.test_quota_manager_off_unique                   618659840     1.524ns     0.009ns     1.515ns     1.721ns       0.063       0.000         0.0
latency_group.default_configs_produce_worst                         127200    26.415ns     0.197ns    26.003ns    33.435ns       1.000       0.000         0.0
latency_group.default_configs_fetch_worst                           126000    47.980ns     0.153ns    47.827ns    54.350ns       2.000       0.000         0.0

BenPope previously approved these changes Jun 27, 2024
@@ -496,12 +496,13 @@ configuration::configuration()
   , target_quota_byte_rate(
       *this,
       "target_quota_byte_rate",
-      "Target request size quota byte rate (bytes per second) - 2GB default",
+      "Target request size quota byte rate (bytes per second) - disabled "
Contributor

@Deflaimun Deflaimun Jun 27, 2024


What does "disabled" mean in this description? It's disabled by default? Or 0 = disabled?
Also, no need to explicitly mention the default number in the description, even if it's referenced in another object

Contributor Author


It's disabled by default? Or 0 = disabled?

Both yes.

Also, no need to explicitly mention the default number in the description, even if it's referenced in another object

I am happy to remove it, but I'm wondering what the best way is to describe that 0 is a sentinel value for disabled, since unfortunately target_quota_byte_rate is an integer and not an optional integer. Would rewriting the ending to "- 0 means disabled" make the most sense? What do you think @Deflaimun?

Contributor


I would simply remove the default value from the description. We can add this extra info on the docs website.

If you really want to have it in the description, maybe something like:
"Target request size quota byte rate (bytes per second) - (default: 0 - disabled)"

Contributor Author


Thank you! Sounds good, I've removed the ending now.

@@ -496,12 +496,13 @@ configuration::configuration()
   , target_quota_byte_rate(
       *this,
       "target_quota_byte_rate",
-      "Target request size quota byte rate (bytes per second) - 2GB default",
+      "Target request size quota byte rate (bytes per second) - disabled "
Contributor

@Deflaimun Deflaimun Jun 27, 2024


Suggested change
-      "Target request size quota byte rate (bytes per second) - disabled "
+      "Target request size quota byte rate (bytes per second)",

Contributor Author


Done

@@ -496,12 +496,13 @@ configuration::configuration()
   , target_quota_byte_rate(
       *this,
       "target_quota_byte_rate",
-      "Target request size quota byte rate (bytes per second) - 2GB default",
+      "Target request size quota byte rate (bytes per second) - disabled "
+      "default (= 0)",
Contributor


Suggested change
-      "default (= 0)",

Contributor Author


Done

* Reinterpret the 0 value of `target_quota_byte_rate` as disabled (this
  is safe as previously the minimum value was 1MB, so no one has this set
  to 0)
* Change the default value to be disabled
Minimize the impact of client quota management in the default (and
expectedly most common) case of having no quotas configured.
@pgellert pgellert dismissed stale reviews from BenPope and michael-redpanda via 19e138d June 27, 2024 15:22
@pgellert pgellert force-pushed the quotas/disable-produce-default branch from cd79309 to 19e138d on June 27, 2024 15:22
@pgellert
Contributor Author

Force-pushed to improve the config description as per feedback.

@pgellert pgellert requested a review from Deflaimun June 27, 2024 15:24
@pgellert pgellert merged commit dd3a163 into redpanda-data:dev Jun 27, 2024
18 checks passed
@pgellert pgellert deleted the quotas/disable-produce-default branch June 27, 2024 20:46