Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: add remote path provider #20149

Merged
merged 8 commits into from
Jun 27, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Jun 26, 2024

Redpanda's current hash-prefixing naming scheme for objects leaves us open to collisions when multiple clusters point at the same bucket. We intend on moving away from this scheme in favor of one that includes the cluster UUID.

To this end, this PR adds various utility methods for generating paths for topic manifests, partition manifests, and remote segments. It also introduces a new remote_path_provider to be the centralized place where paths are created.

A later PR will plumb the path provider from partitions to the underlying archivers, purgers, anomaly detectors, etc.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

Introduces the remote_label struct, which will be plugged into the topic
properties in a later commit. This label includes a cluster UUID, meant
to indicate the cluster UUID of the cluster that originally created the
topic.

In the future, this label struct can be extended with a user-provided
tag, but for now it will serve as an effective means to avoid collisions
between objects from different clusters.
With the naming scheme changing, adds some utilities for generating
object paths or parts of paths (useful for listing objects) with both
the old ("prefixed") and new ("labeled") naming scheme.
Adds a new remote_path_provider class that will be the centralized
location for determining paths of remote objects.

This commit introduces the interface for topic manifests, to be used by
topic recovery, topic manifest uploads, etc.
Adds a filename to the partition manifest. This will be useful in
allowing an external caller to decide the manifest path while still
retaining the metadata referred in the spillover manifest path.
Similar to the topic manifest paths, this adds some utilities for
partition manifest paths, introducing the old "prefixed" names and the
new "labeled" names.
Adds partition manifest path generation to the remote path provider.

Note, in various places, we try to get the path of a partition manifest
but pass in a spillover manifest. This works today by having spillover
overriding the path generation method. This commit continues this, by
making the filename of the manifest overridable.
Similar to manifest paths, this adds some utilities for segment paths,
introducing the old "prefixed" names and the new "labeled" names.
Adds segment path generation to the remote path provider, and some tests
as examples of the new paths.
@dotnwat
Copy link
Member

dotnwat commented Jun 26, 2024

Redpanda's current hash-prefixing naming scheme for objects leaves us open to collisions when multiple clusters point at the same bucket.

Is this an over simplification, or am I just being pedantic (or wrong!)? Like, the hash-prefixing scheme AIUI is to help spread out load in an object storage system, so its a component of naming that would be orthogonal to naming which partitions objects by cluster?

We intend on moving away from this scheme in favor of one that includes the cluster UUID.

If we have <cluster-id> as a prefix, does that negate the purpose of the hash prefixing when it comes to scalability? for example, would we want <cluster-id>/hash-prefix (or the opposite)?

I seem to recall that S3 is much better / completely better at not needing a randomized prefix, but is that true for other object storage systems?

@andrwng
Copy link
Contributor Author

andrwng commented Jun 26, 2024

Like, the hash-prefixing scheme AIUI is to help spread out load in an object storage system, so its a component of naming that would be orthogonal to naming which partitions objects by cluster?

It's true, the hash prefixing just doesn't serve as an effective means to avoid collisions, but that doesn't mean we necessarily should get rid of them.

If we have as a prefix, does that negate the purpose of the hash prefixing when it comes to scalability? for example, would we want /hash-prefix (or the opposite)?
I seem to recall that S3 is much better / completely better at not needing a randomized prefix, but is that true for other object storage systems?

Good callout.

Looking at others, GCS is much more concrete in their guidance: https://cloud.google.com/storage/docs/request-rate#randomness_after_a_common_prefix_is_effective_under_the_prefix. They suggest the <cluster-id>/<hash> is sufficient.

Note that the random string doesn’t necessarily need to be at the beginning of the object name. Adding a random string after a common prefix still allows auto-scaling to work, but the effect is limited to that prefix, with no consideration of the rest of the bucket.

For ABS, it's a bit less clear about whether their partitioning guidance requires randomization at the root or if a subdirectory is sufficient...

https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-blob-storage#recommendations-4

Add a hash character sequence (such as three digits) as early as possible in the partition key of a blob. The partition key is the account name, container name, virtual directory name, and blob name. If you plan to use timestamps in names, then consider adding a seconds value to the beginning of that stamp. For more information, see Partitioning.

I'm unsure whether it's critical for the prefix to actually be near the front, or whether this is mainly to ensure good distribution within a broad key range.

Regardless, I'm thinking to update the names (at least for segments, and maybe for partition manifests) to use the hashes in some fashion. IMO <cluster-id>/<hash> would be preferred so it's easy to get objects of just a single cluster via listing with a prefix.

Topic manifests OTOH are uploaded and downloaded infrequently, and I think there's a benefit in having their naming scheme be pretty flat and simple. It's becomes very easy for topic recovery to look for topic manifest candidates to restore from.

ss::sstring labeled_topic_manifest_prefix(
const remote_label& label, const model::topic_namespace& topic) {
return fmt::format(
"{}/{}", labeled_topic_manifest_root(topic), label.cluster_uuid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i probably missed something in an RFC, but I'm a bit surprised that the cluster uuid is positioned so far to the right in the naming. for example, wouldn't that mean that we could not efficiently ListObjects for the topics in a given cluster, since the cluster_uuid isn't a prefix of the namespace/topic component of the key?

maybe that isn't an operation we care about being efficient, just seemed like we could have eaisly have a situation with 10's of 1000's of topics if we can have any number of clusters sharing a bucket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this was a late-arriving thought that came in after the initial discussions about naming scheme.

What makes topic manifests special is that we don't really care about listing the topics in cluster, but rather we care about finding a topic manifest if one exists (e.g. in the case of topic recovery, we have a topic in mind that we want to restore, and so we must find its topic manifest and we don't necessarily know about the desired cluster UUID associated with it). Since listing is generally limited to using prefixes instead of regex patterns, <ns>/<tp>/ is an efficient list prefix to find some <ns>/<tp>/<cluster uuid>.

just seemed like we could have eaisly have a situation with 10's of 1000's of topics

This is good to keep in mind. In principle the topic names should also act as potential prefixes, in which case this is only problematic if many clusters create many of the same topic. OTOH, topic names are not evenly distributed, and it's not well documented the degree to which a narrow distribution will suffer and for how long before auto-sharding will kick in for most of these storage backends.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense!

Comment on lines +39 to +42
constexpr uint32_t bitmask = 0xF0000000;
auto path = fmt::format("{}/{}", topic.ns(), topic.tp());
uint32_t hash = bitmask & xxhash_32(path.data(), path.size());
return fmt::format("{:08x}/meta/{}/{}", hash, topic.ns(), topic.tp());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably this is just copied from the old code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this is the same as the current path generation in

remote_manifest_path topic_manifest::get_topic_manifest_path(
model::ns ns, model::topic topic, manifest_format format) {
// The path is <prefix>/meta/<ns>/<topic>/topic_manifest.json or
// topic_manifest.bin depending on format
constexpr uint32_t bitmask = 0xF0000000;
auto path = fmt::format("{}/{}", ns(), topic());
uint32_t hash = bitmask & xxhash_32(path.data(), path.size());
// use format to decide if the path is json or bin
return remote_manifest_path(fmt::format(
"{:08x}/meta/{}/topic_manifest.{}",
hash,
path,
format == manifest_format::json ? "json" : "bin"));
}

Copy link
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


#include <seastar/core/sstring.hh>

namespace cloud_storage {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth leaving a comment explaining/clarifying the distinction between the "labeled" (new) vs "prefixed" (old) style in the code here and elsewhere for topic/partition/segment path utils?

I can imagine that most people won't know what is legacy and what is new from these function names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great point. Will do this in a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, I'd be into this as well. I was just kinda guessing about the distinction.

@dotnwat dotnwat merged commit 65f6b39 into redpanda-data:dev Jun 27, 2024
14 of 18 checks passed
andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 27, 2024
Adds some comments describing labeled and prefixed paths.

This is review follow-up from redpanda-data#20149
andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 27, 2024
Adds some comments describing labeled and prefixed paths.

This is review follow-up from redpanda-data#20149
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants