Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability for SpanProcessor to mutate spans on end #4024

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

JonasKunz
Copy link
Contributor

@JonasKunz JonasKunz commented Apr 30, 2024

Fixes #1089.

In addition to the comments on the issue, this was discussed in the spec SIG Meeting on 2024/23/04:

  • The filtering use-case explained in this comment should rather be solved by the upcoming samplerV2 instead of SpanProcessors due to better conceptual fit
  • The buffering use-case also explained in this comment seems to be not relevant enough to influence the design decision
  • Apparently there was a discussion around building the SpanProcessors in a chaining fashion during the initial SDK spec design and it was actively decided against it. However, no one could recall the reason why.

Based on this input, I decided to not move the chaining-based solution forward and stay with the original proposal of adding a new callback to be invoked just before a span is ended.

I also decided to name the new callback OnEnding instead of BeforeEnd as suggested in this comment. The name OnEnding is more precise about when the callback is invoked.

A big discussion point in the SIG Meeting on 2024/23/04 also was the problem of evolving SDK plugin interfaces without breaking backwards compatibility. This PR contains a proposal to clarify how this should be handled: If the language allows it, interfaces should be directly extended. If not possible, implementations will need to introduce new interfaces and accept them in addition to the existing ones for backwards compatibility. I feel like this allow every language to implement changes to interfaces in the best way possible. Of course, changes to interfaces should still be kept to a necessary minimum.

I also wasn't sure whether this change warrants an addition to the spec-compliance-matrix, so I've left it out for now.
Please leave a comment if you think this is required, then I'll add it.

Changes

CHANGELOG.md Show resolved Hide resolved
@JonasKunz JonasKunz marked this pull request as ready for review April 30, 2024 09:18
@JonasKunz JonasKunz requested review from a team as code owners April 30, 2024 09:18
specification/trace/sdk.md Show resolved Hide resolved
specification/versioning-and-stability.md Outdated Show resolved Hide resolved
@Flarna
Copy link
Member

Flarna commented May 8, 2024

What is the benefit of onEnding() compared to pass the read/write span to onEnd()?
Both are sync, both are called at the same time (within the Span.End() api).

To my understanding both changes are breaking for SpanProcessor implementers. Either they have to add a method or change the type.

@jack-berg
Copy link
Member

What is the benefit of onEnding() compared to pass the read/write span to onEnd()?

One benefit is that while order of span processor registration matters for onEnding(), it doesn't today and wouldn't with this change with onEnd(). Implementers of onEnd() can continue to rely on semantics where the they get an immutable span which cannot be impacted by other span processors.

To my understanding both changes are breaking for SpanProcessor implementers. Either they have to add a method or change the type.

We're working out the details of that in #4030, but I disagree that adding a new method to an SDK plugin interface is a breaking change. Languages vary in terms of how they expose these plugin interfaces and the language features for evolving interfaces, but there it should be possible for all SDKs to accommodate this type of change, albeit with some creativity in some cases. For example, its simple in java to add a new default method to an interface which existing implementers don't need to implement. This doesn't exist in some language like go, but there you can offer a new ExtendedSpanProcessor interface which includes the new method, and provide a new mechanism for registering that new span processor interface with TracerProvider.

@jmacd jmacd self-assigned this May 20, 2024
@JonasKunz
Copy link
Contributor Author

JonasKunz commented May 22, 2024

@JonasKunz I am curious to know whether you think it is essential for the on-end handler being discussed in this PR to make changes that are visible to other processors. In my opinion, it would be better if the changes made on-end applied to only the exporter associated with the processor making the change.

@jmacd In my opinion, this would remove the benefits this PR is intending to add: Making it easier for users to enrich spans in span processors before they are exported.

There is already a workaround for doing this with the existing onEnd() callback: Users can decorate/wrap the SimpleSpanProcessor / BatchSpanProcessor and wrap the incoming ReadableSpans in order to pass the mutated ones to the span processor invoking the exporter.
We actually implemented this exact mechanism to provide a span processor which collects stacktraces as attributes for long running spans here. The problem with this approach that it requires a big amount of boilerplate code and doesn't play well with SDK autoconfiguration. This PR was an example use-case for why we wanted to drive #1089 forward for providing a simpler, SDK-native approach for enriching spans on end.

So the problem I see with making changes in onEnding not visible to other users is that it again pushes the responsibility to setup bootstrap pipelines to the user.

My take here is that there are two viable classes of solutions for allowing easy enrichment of spans via the SDK:

  1. Adding a callback which mutates the actual span (like onStart)
  2. Adding the capability to build pipelines which pass around immutable (but wrappable) span data where the terminal stage of the pipeline is some consumer (e.g. a span exporter)

This PR specs out solution category 1.
In this comment I discussed both solution classes and also provided backwards-compatible PoC implementations for Java for both.
However, the arguments for 2.) seemed to not be convincing enough (this was also discussed in a spec-SIG Meeting), I went for 1. in this PR.

We can definitely revisit this decision! However, if we do go for a pipelining approach, I'd propose to rather enhance the onEnd callback for it instead of introducing a new onEnding callback.

However, if we go for the pipeline approach, I'd propose to separate the construction of processors from the assembly of the pipeline by either:

  • Passing the next pipeline stage to call on invocation as a Function to onEnd
  • Introducing a separate init callback which receives the next pipeline stage as a function

The reason why I'm suggesting is that if we leave the pipeline assembly to the user via decoration/wrapping, this makes the pipeline structure a blackbox for the SDK.

If instead the pipeline assembly is managed by the SDK, it has more control over it: e.g. it could insert logging or telemetry between the pipeline stages if necessary. It also plays much better with autoconfiguration.

However, I'm wondering if this PR is the right place to have this discussion, as it seems to affect SpanProcessors in a broader way, given that, should we decide in #4010 to switch to processors with independent pipelines, wouldn't it mean that (for SpanProcessors) we should address OnStart() as well? At least if we wanted to keep consistency in the overall SDK behavior anyway.

I'm not sure whether onStart is a good candidate for pipelining. Pipelines make sense if you have some consumer at the end. While onStart might have such consumers (e.g. creating a metric about started spans), it usually is used to mutate spans before they are visible to the user/application code. I don't see how the second aspect fits nicely into a pipeline model vs just invoking onStart on all SpanProcessors in a loop.

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JonasKunz. I understand the motivation and agree with this change. Also, you've helped me understand how the change I'm looking for can be made orthogonally.

I added suggestions because I think OnEnd() becomes less-well specified without adding more text, given the new and similarly-named OnEnding(). Presently, nothing is said about the execution model for OnEnd()--for example whether or not the first call to OnEnd() is required to return before the second processor's OnEnd() is called.

Reading into the OnEnd() text below your changes here, I find that lines 613/614 (in your change, lines 592/593 prior) are problematic, given the question posed in #4010. The OnEnd() callback (which I'd like to be named OnExport()) should not receive a mutable Span, so "modifying is not allowed" is redundant. For us to add pipeline capabilities, the OnEnd() callback should be allowed to change the data, which is distinct from modifying it -- only its own exporter would see the changes.

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
Co-authored-by: Joshua MacDonald <[email protected]>
@JonasKunz
Copy link
Contributor Author

Thank you @JonasKunz. I understand the motivation and agree with this change. Also, you've helped me understand how the change I'm looking for can be made orthogonally

@jmacd so if my understanding is correct, you are planning to add some kind of pipelining / chaining capabilities to OnEnd (or renamed OnExport), right?

I wonder if the addition of such pipelines would render the OnEnding callback proposed here useless and whether we should flesh out the pipelining first before moving ahead with this PR.

The main differences in terms of capabilities between OnEnding and the pipelining I can think of are the following:

  • OnEnding mutates the original span, these changes are therefore visible to anyone holding a reference to that span. I don't know if this has any effect though: The user application might be holding onto API-Spans, but those are write-only anyway to my understanding. So there wouldn't be any way for the application to actually observe those changes made in OnEnding
  • OnEnding is guaranteed to be executed on the application thread ending the span. We might not guarantee this for the pipeline stages
  • OnEnding is guaranteed to be invoked for every span (unless there is a bug in the instrumentation with dangling spans of course), this might not be the case for pipeline stages if we allow filtering there

So what do you think? Should we move ahead with this PR and get it merged or should we wait and check the overlap with the pipeline approach you are coming up with?

@jmacd
Copy link
Contributor

jmacd commented May 24, 2024

@JonasKunz I assumed that you mean for OnEnding() to receive a "read/write span object" the same as is passed to OnStart(). I think it means that processors can read (the data object) and write (the span object).

OnEnding mutates the original span, these changes are therefore visible to anyone holding a reference to that span.

The requirements, as written, I think ensure that callers are not permitted to use the span reference after End() is called, though the processors are still permitted to via direct access from OnEnding(). There's a bit of conflict with:

It SHOULD be possible to keep a reference to this span object and updates to the span
SHOULD be reflected in it.

I'd probably tack on "SHOULD be reflected in it until End() is called on the Span".

For the last two bullets, I don't see a problem. I expect the OnEnd() callbacks all to execute before the export begins (a.k.a. OnEnd()). Nothing is stated about when the OnEnd happens, but after your change it should be clear that pipelining effects (whatever they are) begin with the OnEnd call. I think OnEnding() makes sense the way you have it -- and there are real use cases so we don't need to block for future design work.

@jmacd
Copy link
Contributor

jmacd commented May 28, 2024

Discussed this in the context of #4062

Copy link
Member

@pellared pellared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as blocked by #4030.

@jmacd
Copy link
Contributor

jmacd commented May 29, 2024

@pellared Would you say that if #4030 is accepted, this PR should be approached differently? (To me, it seems like the answer is yes.) I've speculated about what the solution might look like in #4062 (review), which is briefly for SDKs to support a "FanoutProcessor" that gives users control over whether mutations are private to their export pipeline and/or visible to the next processor.

@pellared
Copy link
Member

@pellared Would you say that if #4030 is accepted, this PR should be approached differently?

I think this PR is fine and it is a logical extension. We should only highlight that processor does not need to implement the newly added method for backwards compatibility.

Copy link
Member

@pellared pellared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions as #4061 is merged.

specification/trace/sdk.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@JonasKunz JonasKunz requested a review from pellared June 26, 2024 12:33
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
Comment on lines +609 to +611
The SDK MUST guarantee that the span can no longer be modified by any other thread
before invoking `OnEnding` of the first `SpanProcessor`. From that point on, modifications
are only allowed synchronously from within the invoked `OnEnding` callbacks. All registered SpanProcessor `OnEnding` callbacks are executed before any SpanProcessor's `OnEnd` callback is invoked.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how and SDK can provide such guarantee. It looks like more like a hint for the user.

Copy link
Contributor Author

@JonasKunz JonasKunz Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typical pattern to implement this would be to simply hold the lock for modifications of the span while executing the OnEnding callbacks. After the those are finished, the span is set to an immutable state. So if anyone concurrently tries to modify the span, they will wait on the lock and afterwards fail the modification due to the span now being immutable.

If you want concurrent modification attempts to fail eagerly, you can instead mark the span as immutable before executing the OnEndings. While OnEnding callbacks are executed a second lock is held and mutation are still allowed when a thread is holding that second lock.

I'd expect all languages dealing with concurrency to also have mutexes/locks, so I don't see why any SDK shouldn't be able to provide this guarantee? Everything else is likely to make the OnEnding callbacks prone to race conditions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they will wait on the lock and afterwards fail the modification due to the span now being immutable

As you mentioned, I think that most SDKs are only able to make such guarantee at runtime. We should have a way to give feedback.

What should be the output when a failed modification occur?

  • no-op
  • Log that the modification is ignored and that is an illegal operation. Tt would be good to recommend a log message. (my preference)
  • Exception (this would be rather against the OTel spec policy to not throw exception at runtime)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception (this would be rather against the OTel spec policy to not throw exception at runtime)

I think this would also violate the API-spec, because API methods would suddenly start throwing exceptions.

For the other two options: Do you think that this is required to be part of the spec and should not just be left to the SDK implementors to decide whats best/idiomatic? I don't really see the benefit in specifying this

Copy link
Member

@pellared pellared Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that this is required to be part of the spec and should not just be left to the SDK implementors to decide whats best/idiomatic? I don't really see the benefit in specifying this

It makes implementors think on how to do notify about giving feedback to the caller rather than just doing no-op.

Example from the same document:

There SHOULD be a message printed in the SDK's log to indicate to the user that an attribute, event, or link was discarded due to such a limit. To prevent excessive logging, the message MUST be printed at most once per span (i.e., not per discarded attribute, event, or link).

However, this portion of information should be probably out of scope of this PR as this about the behavior of the SDK span itself.

I also want to indicate that there are operations' definitions of the Span interface. Maybe we should update the definition of https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#updatename from

Updates the Span name.

To e.g.

A Span MUST have the ability to update its name.

All of this should be probably handled in a separate PR.

Copy link
Member

@pellared pellared Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the only way for the span to escape to be modified concurrently while any SpanProcessor.OnStart is running, would be to let it escape from a SpanProcessor.OnStart call.

Yes. If it is possible and if we are concerned about it then we should add the same restrictions in OnStart. Changing a span asynchronously is also rather something very uncommon.

I am also concerned that such restrictions on SDK would produce additional computational overhead.

At this point, I would rather add a general guideline that the processors and applications should not modify the spans asynchronously to avoid race conditions and indeterministic behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If it is possible and if we are concerned about it then we should add the same restrictions in OnStart.

I personally think it an edge case here not important enough. What we could do is state that for OnStart processors must not concurrently modify the spans. So impose a restriction on implementors, which however is not enforced through any code in the SDK. More like don't do this, or bad things may happen.

At this point, I would rather add a general guideline that the processors should not modify the spans asynchronously to avoid race conditions and indeterministic behavior.

For OnStart, I agree to go this route (like stated in my paragraph above) if we deem it important enough. For OnEnding, this doesn't solve the problem of concurrent modifications performed by application code and not SpanProcessors (which is impossible for OnStart).

Moreover, I am concerned that such restrictions on SDK would produce additional computational overhead.

I'm trying to understand what computational overhead you are thinking of: Isn't locking already involved anyway to make spans concurrency safe, so this shouldn't impose additional cost?

Copy link
Contributor Author

@JonasKunz JonasKunz Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing a span asynchronously is also rather something very uncommon.

I personally don't agree with that statement. Getting the current span (from the context) and adding an attribute to it or changing the name doesn't seem too uncommon to me. And the current span might have been started and ended on another thread. I think this shouldn't be to uncommen in go too, because you could easily pass the current context/span to a goroutine, right?

Copy link
Member

@pellared pellared Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concurrent modifications performed by application code

Is not an edge case as well? I never saw any application or instrumentation library doing asynchronous span modification that could run concurrently while ending a span.

Isn't locking already involved anyway to make spans concurrency safe, so this shouldn't impose additional cost?

It would be a more complex locking logic. I guess the overhead would be minimal, but never sure until you measure it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concurrent modifications performed by application code

Is not an edge case as well? I never saw any application or instrumentation library doing asynchronous span modification that could run concurrently while ending a span.

Looks like we commented at the same time, I think I answered that in my previous comment

Getting the current span (from the context) and adding an attribute to it or changing the name doesn't seem too uncommon to me. And the current span might have been started and ended on another thread. I think this shouldn't be to uncommen in go too, because you could easily pass the current context/span to a goroutine, right?

At least in Java it is very common that the current context (and therefore current span) is passed around to different threads / asynchronous tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add BeforeEnd to have a callback where the span is still writeable