Description
Is your feature request related to a problem? Please describe.
The observability requirements for stable components recommend emitting telemetry in a way that allows users to differentiate between errors originating from a component and errors propagated from downstream components. This is currently somewhat complicated to do in receivers that use `receiverhelper`, notably the OTLP receiver (see OTLP receiver telemetry review), for two reasons:
- All errors are surfaced as the same `otelcol_receiver_refused_x` metric;
- If an internal error happens before the telemetry payload has been fully received and parsed, we cannot determine the number of telemetry items involved, and thus cannot properly surface the error with `ObsReport.EndXOp`. This means that `StartXOp` may be delayed until everything is parsed (as in the OTLP receiver), which means internal failures are never surfaced through metrics.
Describe the solution you'd like
Following the precedent of the pipeline auto-instrumentation RFC, I believe we should differentiate between payloads that were "refused" by downstream components and requests that "failed".
Telemetry-wise, this would mean specializing the `otelcol_receiver_refused_x` metric to downstream errors (ones returned from `nextConsumer.ConsumeX`; this is already the case de facto in the OTLP receiver), and adding a new metric to account for internal errors:
- Either a simple `otelcol_receiver_failed_requests` metric (maybe `_operations` if we want to account for scrapers?);
- Or a generic `otelcol_receiver_requests` metric which counts all receiver operations, with an `outcome: success / failure / refused` attribute, following the convention in the above RFC.
API-wise, with the goal of avoiding breakage, I think the simplest way to implement this would be to add a new method to `ObsReport` that could be called in place of `EndXOp` and would emit a "failure" metric instead of a "refused" metric, and to encourage component authors to call `StartXOp` as early in processing as possible. (Note: this could also be used to improve the timing information provided by tracing, by adding a span event signifying the end of internal processing.) Under the assumption that most receivers behave like the OTLP receiver and mostly only wrap downstream processing in `Start/EndXOp`, components that haven't been updated would continue to behave as before.
Describe alternatives you've considered
We could also leave things as-is, and let receiver component authors add their own internal failure metrics.