cloud_storage: add remote path provider #20149

andrwng · 2024-06-26T08:30:59Z

Redpanda's current hash-prefixing naming scheme for objects leaves us open to collisions when multiple clusters point at the same bucket. We intend to move away from this scheme in favor of one that includes the cluster UUID.

To this end, this PR adds various utility methods for generating paths for topic manifests, partition manifests, and remote segments. It also introduces a new remote_path_provider to be the centralized place where paths are created.

A later PR will plumb the path provider from partitions to the underlying archivers, purgers, anomaly detectors, etc.

Backports Required

Release Notes

none

Introduces the remote_label struct, which will be plugged into the topic properties in a later commit. The label carries the UUID of the cluster that originally created the topic. In the future, this label struct can be extended with a user-provided tag, but for now it will serve as an effective means to avoid collisions between objects from different clusters.

With the naming scheme changing, adds some utilities for generating object paths or parts of paths (useful for listing objects) with both the old ("prefixed") and new ("labeled") naming scheme.

Adds a new remote_path_provider class that will be the centralized location for determining paths of remote objects. This commit introduces the interface for topic manifests, to be used by topic recovery, topic manifest uploads, etc.
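To make the "prefixed" versus "labeled" distinction concrete, here is a minimal, self-contained sketch of a path provider that picks a scheme depending on whether a label is present. It is an illustration under stated assumptions, not the code from this PR: the simplified `topic_namespace` and `remote_label` structs, the `meta/<ns>/<tp>` root layout, the selection on an optional label, and the FNV-1a stand-in for `xxhash_32` are all hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>
#include <utility>

// Simplified stand-ins for model::topic_namespace and remote_label.
struct topic_namespace {
    std::string ns;
    std::string tp;
};

struct remote_label {
    std::string cluster_uuid;
};

// Stand-in for xxhash_32 (32-bit FNV-1a). NOT the hash Redpanda uses.
inline uint32_t standin_hash32(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Old "prefixed" scheme: <masked hash>/meta/<ns>/<tp>
inline std::string prefixed_topic_manifest_prefix(const topic_namespace& t) {
    constexpr uint32_t bitmask = 0xF0000000;
    const uint32_t hash = bitmask & standin_hash32(t.ns + "/" + t.tp);
    char buf[9];
    std::snprintf(buf, sizeof(buf), "%08x", static_cast<unsigned>(hash));
    return std::string(buf) + "/meta/" + t.ns + "/" + t.tp;
}

// New "labeled" scheme: <root>/<cluster_uuid>.
// The root layout "meta/<ns>/<tp>" is an assumption made for this sketch.
inline std::string labeled_topic_manifest_prefix(
  const remote_label& label, const topic_namespace& t) {
    return "meta/" + t.ns + "/" + t.tp + "/" + label.cluster_uuid;
}

// Hypothetical provider: one place that decides which scheme applies.
class remote_path_provider {
public:
    explicit remote_path_provider(std::optional<remote_label> label)
      : _label(std::move(label)) {}

    std::string topic_manifest_prefix(const topic_namespace& t) const {
        if (_label.has_value()) {
            return labeled_topic_manifest_prefix(*_label, t);
        }
        return prefixed_topic_manifest_prefix(t);
    }

private:
    std::optional<remote_label> _label;
};
```

Callers such as uploaders or recovery code would ask the provider for a path rather than formatting one themselves, which is the centralization this commit describes.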

Adds a filename to the partition manifest. This will be useful in allowing an external caller to decide the manifest path while still retaining the metadata referred to in the spillover manifest path.

Similar to the topic manifest paths, this adds some utilities for partition manifest paths, introducing the old "prefixed" names and the new "labeled" names.

Adds partition manifest path generation to the remote path provider. Note that in various places we try to get the path of a partition manifest but pass in a spillover manifest. This works today because the spillover manifest overrides the path generation method. This commit keeps that behavior by making the filename of the manifest overridable.
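The overridable-filename idea can be sketched as follows. This is a hypothetical illustration, not the classes from the PR: the filenames `manifest.bin` and `manifest.bin.<base>.<last>`, the constructor arguments, and the free function `partition_manifest_path` are assumptions chosen to show how a caller can own the directory part of the path while the manifest type still contributes its own filename.

```cpp
#include <cstdint>
#include <string>

// Hypothetical base manifest: contributes only a filename.
class partition_manifest {
public:
    virtual ~partition_manifest() = default;

    virtual std::string get_manifest_filename() const {
        return "manifest.bin";
    }
};

// Hypothetical spillover manifest: encodes its offset range in the filename.
class spillover_manifest : public partition_manifest {
public:
    spillover_manifest(int64_t base_offset, int64_t last_offset)
      : _base(base_offset)
      , _last(last_offset) {}

    std::string get_manifest_filename() const override {
        return "manifest.bin." + std::to_string(_base) + "."
               + std::to_string(_last);
    }

private:
    int64_t _base;
    int64_t _last;
};

// The caller decides the prefix; the manifest decides the filename.
inline std::string partition_manifest_path(
  const std::string& partition_prefix, const partition_manifest& m) {
    return partition_prefix + "/" + m.get_manifest_filename();
}
```

With this split, code that receives a spillover manifest through a `partition_manifest` reference still ends up with the spillover-specific name.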

Similar to manifest paths, this adds some utilities for segment paths, introducing the old "prefixed" names and the new "labeled" names.

Adds segment path generation to the remote path provider, and some tests as examples of the new paths.

dotnwat · 2024-06-26T20:07:17Z

Redpanda's current hash-prefixing naming scheme for objects leaves us open to collisions when multiple clusters point at the same bucket.

Is this an oversimplification, or am I just being pedantic (or wrong!)? Like, the hash-prefixing scheme AIUI is to help spread out load in an object storage system, so it's a component of naming that would be orthogonal to naming which partitions objects by cluster?

We intend on moving away from this scheme in favor of one that includes the cluster UUID.

If we have <cluster-id> as a prefix, does that negate the purpose of the hash prefixing when it comes to scalability? for example, would we want <cluster-id>/hash-prefix (or the opposite)?

I seem to recall that S3 is much better / completely better at not needing a randomized prefix, but is that true for other object storage systems?

andrwng · 2024-06-26T21:14:20Z

Like, the hash-prefixing scheme AIUI is to help spread out load in an object storage system, so it's a component of naming that would be orthogonal to naming which partitions objects by cluster?

It's true. Hash prefixing just doesn't serve as an effective means to avoid collisions, but that doesn't mean we should necessarily get rid of it.

If we have <cluster-id> as a prefix, does that negate the purpose of the hash prefixing when it comes to scalability? for example, would we want <cluster-id>/hash-prefix (or the opposite)?
I seem to recall that S3 is much better / completely better at not needing a randomized prefix, but is that true for other object storage systems?

Good callout.

Looking at others, GCS is much more concrete in its guidance: https://cloud.google.com/storage/docs/request-rate#randomness_after_a_common_prefix_is_effective_under_the_prefix. It suggests that <cluster-id>/<hash> is sufficient.

Note that the random string doesn’t necessarily need to be at the beginning of the object name. Adding a random string after a common prefix still allows auto-scaling to work, but the effect is limited to that prefix, with no consideration of the rest of the bucket.

For ABS, it's less clear whether their partitioning guidance requires randomization at the root or whether a subdirectory is sufficient...

https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-blob-storage#recommendations-4

Add a hash character sequence (such as three digits) as early as possible in the partition key of a blob. The partition key is the account name, container name, virtual directory name, and blob name. If you plan to use timestamps in names, then consider adding a seconds value to the beginning of that stamp. For more information, see Partitioning.

I'm unsure whether it's critical for the prefix to actually be near the front, or whether this is mainly to ensure good distribution within a broad key range.

Regardless, I'm thinking to update the names (at least for segments, and maybe for partition manifests) to use the hashes in some fashion. IMO <cluster-id>/<hash> would be preferred so it's easy to get objects of just a single cluster via listing with a prefix.
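As a rough illustration of why the ordering matters for listing, the sketch below builds keys in both orders and checks which ones share a single per-cluster list prefix. The key layouts, the example segment name, and the helper names are hypothetical; the hash is passed in directly so the example does not depend on any particular hash function.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 32-bit value as 8 lowercase hex digits.
inline std::string hex8(uint32_t v) {
    char buf[9];
    std::snprintf(buf, sizeof(buf), "%08x", static_cast<unsigned>(v));
    return buf;
}

// Cluster-first layout: <cluster-id>/<hash>/<rest>
inline std::string cluster_first_key(
  const std::string& cluster_id, uint32_t hash, const std::string& rest) {
    return cluster_id + "/" + hex8(hash) + "/" + rest;
}

// Hash-first layout: <hash>/<cluster-id>/<rest>
inline std::string hash_first_key(
  const std::string& cluster_id, uint32_t hash, const std::string& rest) {
    return hex8(hash) + "/" + cluster_id + "/" + rest;
}

// True if `key` begins with `prefix` (what a prefix-based list would match).
inline bool has_prefix(const std::string& key, const std::string& prefix) {
    return key.compare(0, prefix.size(), prefix) == 0;
}
```

With the cluster-first layout, every object of `cluster-a` matches the single list prefix `cluster-a/` regardless of its hash. With the hash-first layout, one list call per hash value would be needed to enumerate the same cluster.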

Topic manifests OTOH are uploaded and downloaded infrequently, and I think there's a benefit in having their naming scheme be pretty flat and simple. It becomes very easy for topic recovery to look for topic manifest candidates to restore from.

dotnwat · 2024-06-26T20:24:28Z

src/v/cloud_storage/topic_path_utils.cc

+ss::sstring labeled_topic_manifest_prefix(
+  const remote_label& label, const model::topic_namespace& topic) {
+    return fmt::format(
+      "{}/{}", labeled_topic_manifest_root(topic), label.cluster_uuid());


i probably missed something in an RFC, but I'm a bit surprised that the cluster uuid is positioned so far to the right in the naming. for example, wouldn't that mean that we could not efficiently ListObjects for the topics in a given cluster, since the cluster_uuid isn't a prefix of the namespace/topic component of the key?

maybe that isn't an operation we care about being efficient, just seemed like we could easily have a situation with 10's of 1000's of topics if we can have any number of clusters sharing a bucket.

Yea this was a late-arriving thought that came in after the initial discussions about naming scheme.

What makes topic manifests special is that we don't really care about listing the topics in a cluster; rather, we care about finding a topic manifest if one exists. For example, in the case of topic recovery, we have a topic in mind that we want to restore, so we must find its topic manifest, and we don't necessarily know the cluster UUID associated with it. Since listing is generally limited to prefixes rather than regex patterns, <ns>/<tp>/ is an efficient list prefix to find some <ns>/<tp>/<cluster uuid>.
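A small sketch of that lookup, using an in-memory vector of keys as a stand-in for an object store list call. The key shapes follow the `<ns>/<tp>/<cluster uuid>` shorthand used in this comment; the real keys may contain additional components, and the function names and the `topic_manifest.bin` suffix here are assumptions for illustration only.

```cpp
#include <string>
#include <vector>

// List prefix for a topic. The trailing '/' matters: without it,
// "kafka/orders" would also match "kafka/orders-v2/...".
inline std::string topic_list_prefix(
  const std::string& ns, const std::string& tp) {
    return ns + "/" + tp + "/";
}

// Returns every key under <ns>/<tp>/, i.e. manifest candidates from any
// cluster UUID, without knowing the UUIDs up front.
inline std::vector<std::string> find_topic_manifest_candidates(
  const std::vector<std::string>& keys,
  const std::string& ns,
  const std::string& tp) {
    const std::string prefix = topic_list_prefix(ns, tp);
    std::vector<std::string> out;
    for (const auto& key : keys) {
        if (key.compare(0, prefix.size(), prefix) == 0) {
            out.push_back(key);
        }
    }
    return out;
}
```

Topic recovery could then choose among the returned candidates, one per cluster that created a topic with that name.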

just seemed like we could easily have a situation with 10's of 1000's of topics

This is good to keep in mind. In principle the topic names should also act as potential prefixes, in which case this is only problematic if many clusters create many of the same topics. OTOH, topic names are not evenly distributed, and for most of these storage backends it's not well documented how much a narrow key distribution will suffer, or for how long, before auto-sharding kicks in.

makes sense!

dotnwat · 2024-06-26T20:25:31Z

src/v/cloud_storage/topic_path_utils.cc

+    constexpr uint32_t bitmask = 0xF0000000;
+    auto path = fmt::format("{}/{}", topic.ns(), topic.tp());
+    uint32_t hash = bitmask & xxhash_32(path.data(), path.size());
+    return fmt::format("{:08x}/meta/{}/{}", hash, topic.ns(), topic.tp());


presumably this is just copied from the old code?

Yea this is the same as the current path generation in

redpanda/src/v/cloud_storage/topic_manifest.cc

Lines 564 to 577 in 4cdf78d

remote_manifest_path topic_manifest::get_topic_manifest_path(
  model::ns ns, model::topic topic, manifest_format format) {
    // The path is <prefix>/meta/<ns>/<topic>/topic_manifest.json or
    // topic_manifest.bin depending on format
    constexpr uint32_t bitmask = 0xF0000000;
    auto path = fmt::format("{}/{}", ns(), topic());
    uint32_t hash = bitmask & xxhash_32(path.data(), path.size());
    // use format to decide if the path is json or bin
    return remote_manifest_path(fmt::format(
      "{:08x}/meta/{}/topic_manifest.{}",
      hash,
      path,
      format == manifest_format::json ? "json" : "bin"));
}
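One property of the snippet above worth spelling out: the `0xF0000000` mask keeps only the top four bits of the hash, so the `{:08x}` prefix can take just 16 distinct values (`00000000`, `10000000`, ..., `f0000000`). The helper below is a minimal sketch of that masking step; it takes the full hash as an argument instead of calling `xxhash_32`, and the name `hash_prefix` is hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Masks a 32-bit hash down to its top nibble and formats it as 8 hex digits,
// mirroring the bitmask and "{:08x}" formatting in the quoted code.
inline std::string hash_prefix(uint32_t full_hash) {
    constexpr uint32_t bitmask = 0xF0000000;
    char buf[9];
    std::snprintf(
      buf, sizeof(buf), "%08x", static_cast<unsigned>(bitmask & full_hash));
    return buf;
}
```

For example, a full hash of `0xDEADBEEF` maps to the prefix `d0000000`, and any hash below `0x10000000` maps to `00000000`.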

andrwng · 2024-06-26T23:59:24Z

CI failures:

CI Failure (BadLogLines: replacement segment has the same size as replaced segment) in STSRoleFetchTests.test_write #18386
CI Failure (expected consumer throttling) in EndToEndThrottlingTest.test_throttling #14225
CI Failure (Unexpected tx_error error: {tx_errc::stale}) in TxAdminTest.test_mark_transaction_expired #19322
CI Failure (key symptom) in ShadowIndexingCacheSpaceLeakTest.test_si_cache #19913

WillemKauf

LGTM!

WillemKauf · 2024-06-27T01:09:03Z

src/v/cloud_storage/topic_path_utils.h

+
+#include <seastar/core/sstring.hh>
+
+namespace cloud_storage {


Is it worth leaving a comment explaining/clarifying the distinction between the "labeled" (new) vs "prefixed" (old) style in the code here and elsewhere for topic/partition/segment path utils?

I can imagine that most people won't know what is legacy and what is new from these function names.

This is a great point. Will do this in a follow-up PR.

Yeh, I'd be into this as well. I was just kinda guessing about the distinction.

Adds some comments describing labeled and prefixed paths. This is review follow-up from redpanda-data#20149

andrwng added 3 commits June 26, 2024 01:10

cloud_storage: add path utils for topic manifests

6308e1d

With the naming scheme changing, adds some utilities for generating object paths or parts of paths (useful for listing objects) with both the old ("prefixed") and new ("labeled") naming scheme.

github-actions bot added the area/redpanda label Jun 26, 2024

andrwng added 5 commits June 26, 2024 01:35

cloud_storage: add filename to partition manifest interface

4acafff

Adds a filename to the partition manifest. This will be useful in allowing an external caller to decide the manifest path while still retaining the metadata referred to in the spillover manifest path.

cloud_storage: add path utils for partition manifests

ed6397b

Similar to the topic manifest paths, this adds some utilities for partition manifest paths, introducing the old "prefixed" names and the new "labeled" names.

cloud_storage: add segment path utils

479abe6

Similar to manifest paths, this adds some utilities for segment paths, introducing the old "prefixed" names and the new "labeled" names.

cloud_storage: add segment paths to path provider

3ac6e49

Adds segment path generation to the remote path provider, and some tests as examples of the new paths.

andrwng force-pushed the initial-path-provider branch from eacdf6b to 3ac6e49 Compare June 26, 2024 08:35

andrwng requested review from WillemKauf, Lazin and dotnwat June 26, 2024 08:39

dotnwat approved these changes Jun 26, 2024


andrwng mentioned this pull request Jun 26, 2024

cloud_storage: add topic manifest downloader #20311

Merged

7 tasks

WillemKauf approved these changes Jun 27, 2024


dotnwat merged commit 65f6b39 into redpanda-data:dev Jun 27, 2024
14 of 18 checks passed

andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 27, 2024

cloud_storage: add clarifying comments around naming and versions

dc1a021

Adds some comments describing labeled and prefixed paths. This is review follow-up from redpanda-data#20149

andrwng mentioned this pull request Jun 27, 2024

cloud_storage: add clarifying comments around naming and versions #20586

Merged

7 tasks

andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 27, 2024

cloud_storage: add clarifying comments around naming and versions

d72c312

Adds some comments describing labeled and prefixed paths. This is review follow-up from redpanda-data#20149


