[CORE-2841] Transforms: Start consuming from an arbitrary offset (numeric from start/end or timestamp) #19975
base: dev
Conversation
Force-pushed b177085 → 78e09c0
Force-pushed 78e09c0 → 612cd16
/ci-repeat 1

Thin proxy for `kafka::partition_proxy::start_offset`
Signed-off-by: Oren Leiman <[email protected]>

Force-pushed 612cd16 → 2d34443
@@ -154,6 +155,7 @@ The --var flag can be repeated to specify multiple variables like so:
 	cmd.Flags().StringVar(&fc.functionName, "name", "", "The name of the transform")
 	cmd.Flags().Var(&fc.env, "var", "Specify an environment variable in the form of KEY=VALUE")
 	cmd.Flags().StringVar(&fc.compression, "compression", "", "Output batch compression type")
+	cmd.Flags().StringVar(&fc.from_offset, "from-offset", "", "Process an input topic partition from this offset")
This needs docs in the long form help text. How does this work since offsets are per partition and deploys are for topics?
There seems to be a string that is parsed on the backend. Generally I would recommend the parsing to be in rpk so the API is structured data only.
We should also note that this only works on the first deploy.
But also is there a use case for arbitrary start offsets? Or is it just beginning?
> docs in the long form help text

Good catch, that's my oversight.

> how does this work since offsets are per partition

What I have currently is the ability to specify a positive delta from the start or a negative delta from the end. I would expect the most common usage to be "+00", meaning "begin processing from the start offset of each partition", so we can always add a "start" alias or something.
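For illustration, a rough sketch (in C++, with hypothetical names; the real flag parsing lives in rpk, which is Go) of how a `--from-offset` value like `+N`, `-N`, or `@T` might be decomposed:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <variant>

// Hypothetical sketch only: decompose an rpk-style --from-offset value into
// a delta from the start ("+N"), a delta from the end ("-N"), or a unix
// timestamp in ms ("@T"). Names do not match the actual rpk code.
struct from_start { int64_t delta; };
struct from_end { int64_t delta; };
struct at_timestamp { int64_t ms; };
using start_spec = std::variant<from_start, from_end, at_timestamp>;

std::optional<start_spec> parse_from_offset(const std::string& s) {
    if (s.size() < 2) {
        return std::nullopt; // need a sign/format prefix plus digits
    }
    int64_t v = 0;
    try {
        v = std::stoll(s.substr(1));
    } catch (...) {
        return std::nullopt; // not a number
    }
    if (v < 0) {
        return std::nullopt; // magnitude only; the prefix carries the sign
    }
    switch (s[0]) {
    case '+': return from_start{v};
    case '-': return from_end{v};
    case '@': return at_timestamp{v};
    default: return std::nullopt;
    }
}
```

Under this sketch, `+00` parses as `from_start{0}`, i.e. "begin at each partition's start offset".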
> also is there a use case for arbitrary start offsets?

In practice, I don't know, though I suppose the same question applies to "start from timestamp". The motivating event for exposing this was user error leading to nominal data loss; so if the general vibe is around minor disaster recovery, I think the added flexibility is probably good?

> recommend the parsing to be in rpk

Fair enough. I had something more structured in a previous iteration. Moved to this while working through the timequery bug and sort of liked it, but if it's particularly unsavory I can wind it back.
put the structure back. looks similar to env vars parsing now.
Force-pushed 2d34443 → 6ba341a
nice
src/v/model/transform.h
Outdated
@@ -102,14 +102,15 @@ struct transform_offset_options
 	//
 	// When a timestamp is used, a timequery is used to resolve the offset for
 	// each partition.
-	serde::variant<latest_offset, model::timestamp> position;
+	serde::variant<latest_offset, model::timestamp, model::offset> position;
nit: offset_delta is a better fit? And please update the docs
Ya I'm not exactly sure. I balked initially because the on-label use for offset_delta is translating between logical and physical offsets. It's all ints at the end of the day, so it felt a bit better to use the more generic option. Will consider.
I think it's conceptually a `kafka::offset_delta` (rather than a `model::offset_delta` or `kafka|model::offset`) since we're adding it onto a kafka start/end offset.
Near as I can tell `model::offset_delta` is explicitly meant to hit overloads that perform the model::offset <-> kafka::offset conversion automatically, e.g.

```cpp
/// \brief conversion from kafka offset to redpanda offset
inline constexpr model::offset
operator+(kafka::offset o, model::offset_delta d) {
    return model::offset{o() + d()};
}
```

So yeah, I agree with Gellert. Maybe just add `kafka::offset_delta` for specifically the purpose of staying w/in kafka land.
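A minimal sketch of that idea, with simplified stand-in structs rather than the real strongly-typed offsets: a delta whose `operator+` keeps the result in kafka-offset land instead of converting to a model offset:

```cpp
#include <cstdint>

// Simplified stand-ins for kafka::offset and the proposed kafka::offset_delta.
// Unlike model::offset_delta (whose overloads convert kafka <-> model
// offsets), adding this delta yields the same offset type back.
struct kafka_offset { int64_t v; };
struct kafka_offset_delta { int64_t v; };

constexpr kafka_offset operator+(kafka_offset o, kafka_offset_delta d) {
    return kafka_offset{o.v + d.v}; // stays within "kafka land"
}
```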
    } else if (format == "from_end") {
        offset_opts.position = model::offset{-value};
    }
}
else 400?
Strictly speaking, covered by the validator, but ya probably wise
src/v/transform/api.cc
Outdated
// transform metadata (i.e. legacy deployments) won't traverse this code.
// Otherwise, respect whatever offset was included in the request.
if (
  std::get_if<model::transform_offset_options::latest_offset>(
holds_alternative?
yeah of course 🙃
// Set the transform to start processing new records starting now,
// this is the default expectation for developers, as once deploy
// completes, they should be able to produce without waiting for the
// vm to start. If we start from the end of the log, then records
// produced between now and the vm start would be skipped.
I'm wondering if by the same logic, we should be resolving the offset delta to a `model::offset` here instead of doing it in `transform::processor::load_latest_committed()`. For example, "offset from the end `-0`" is equivalent to `latest` yet they behave slightly differently. But maybe it's not a big deal since the processor is started "soon enough"?

Would it make sense to move the offset delta resolution and latest resolution up to the admin API handler and change the `transform_offset_options` to hold a fixed position `serde::variant<model::timestamp, model::offset> position`?
The trouble is that offsets are committed per transform + partition, so we actually need to compute a concrete offset when each vm spins up. So I think `load_latest_committed` is the right place for the calculation - "give me the last committed offset on this partition, and if none exists calculate one based on my config".

To the similarity between `latest` and `-0`, it's true, but `latest` isn't exposed at the API layer. As I understand it, it was meant primarily as a compatibility default for deployed transforms that predate the options field. So "-oo from end" is inclusive of that behavior but intended as a user-facing knob.
Ah, I didn't think this through carefully. `-0` will not give the same result as `latest`. `offset{"format": "from_end", "value": 0}` winds up as a named int64, so we will eventually interpret it as plain old `0` and start processing from the beginning of the partition.

We should cover for this somehow - probably just by rejecting `--from-offset -0` in rpk. I don't think it's ever particularly useful compared to the default behavior, i.e. "commit offset at deploy time" vs "commit offset at VM start time".
Force-pushed 6ba341a → 0fcb3a5
force push contents:
- model::offset_delta resolves arithmetic operator overloads that perform automatic conversions between kafka and model offsets. This commit introduces an offset_delta that is used specifically for applying a numeric delta to an existing kafka::offset without adjusting its type. Useful for transform start offset calculations.
- { "format": enum[timestamp, from_start, from_end], "value": int64 }
- --from-offset to start from this offset: @t: start from UNIX timestamp (ms from epoch); +oo: start offset + oo; -oo: latest offset - oo
- Tests: consume records that were produced before the deploy; specify offsets that run off the end of the input topic; ill-formed offsets

(Signed-off-by: Oren Leiman <[email protected]>)
Force-pushed 0fcb3a5 → 17b827b (empty diff, signoff only)
LGTM, I think we can clarify the data model and make this slightly easier to reason about
//
// When an offset_delta is used, offset resolution depends on its sign:
// - if delta < 0 - start at latest - abs(delta) (for each partition)
// - if delta >= 0 - start at earliest + delta (for each partition)
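The resolution rule in the comment can be sketched as a small helper (illustrative names; `earliest`/`latest` stand for the partition's log bounds):

```cpp
#include <cstdint>

// Sketch of the sign-based rule above, evaluated per partition:
//   delta <  0 -> latest   - abs(delta)
//   delta >= 0 -> earliest + delta
int64_t resolve_start(int64_t delta, int64_t earliest, int64_t latest) {
    if (delta < 0) {
        return latest - (-delta); // i.e. latest - abs(delta)
    }
    return earliest + delta;
}
```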
Not to be a pain but - we could get rid of the awkwardness in rpk with -0 if we represented things more similarly to the API (which is more verbose and leans into the sumtype more): make the latest_offset struct have a delta field meaning the delta from the end, and add a new struct for earliest offset with a delta field there too.
We could translate latest_offset delta=0 to freeze offset like we do now, or we could leave latest offset and just have two new variants for from_beginning and from_end with (positive) deltas
Not a pain...I thought about doing that, but it seemed a bit over complex at the time. I'm not opposed. You could argue that the current approach just pushes the complexity into code comments and across the api boundary anyway, which is not great.
auto offset = doc["offset"].GetObject();
auto format = ss::sstring{
  offset["format"].GetString(), offset["format"].GetStringLength()};
auto value = offset["value"].GetInt64();
Should we validate this is positive?
Yup
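The check being agreed on might look something like this hypothetical helper (not the actual handler, which parses the request with rapidjson and would respond with a 400 on failure):

```cpp
#include <cstdint>
#include <string>

// Hypothetical admin-side validation sketch: the value must be non-negative
// (direction is carried by the format, not the value's sign), and the format
// must be one of the known enumerators; anything else should yield a 400.
bool valid_offset_request(const std::string& format, int64_t value) {
    if (value < 0) {
        return false;
    }
    return format == "timestamp" || format == "from_start"
           || format == "from_end";
}
```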
This PR wires up the ability to configure the start offset of a transform at deploy time. This can be either a unix timestamp (ms since epoch) or an offset delta (+oo from the start offset or -oo from the end).
Includes rpk experience.
Backports Required
Release Notes
Improvements