Add SDK span telemetry metrics #1631
Conversation
Related #1580
Would a […]
@lzchen not in this PR, but I don't see why we wouldn't add something like this in the future. To me, this would fall in a […]. This PR is currently about tracking data loss (plus, as a bonus, tracking the effective sampling rate).
@lzchen Would HTTP and gRPC instrumentation be good enough to solve this use case? Or do you think explicit additional metrics in the exporters are needed?
For our use case in particular, tracking those things (request count, size and duration) is exactly what we need. Speaking separately though, would "duration" be a useful metric for exporters in general, even for those that don't wind up making network requests?
I believe certain implementations (like Python) have made it so that instrumentations do not track calls made by the SDK (and thus, the exporter) itself. I think explicit metrics related to SDK components are needed in that regard.
That makes sense for tracing (where it is easy to produce an infinite export loop) but, IMO, makes less sense for metrics, where that kind of feedback loop doesn't exist.
That's a good point. At least today, unfortunately, all our instrumentations behave that way. Hypothetically, if we were to change this behavior, the instrumentations wouldn't be able to differentiate between calls made from the SDK and ones made from the user's application, correct?
Yeah... People would need to use the […]
A few points on duration: […]
So I think duration is the first and most important choice.
So IIUC you are referring to a part of what I would call "pipeline latency", the total time a span takes from being ended to being successfully exported. The metric you are envisioning would be the portion of this latency taken in the exporter, ignoring e.g. batching span processor delay.
My main concern here would be storage overhead. Histograms are much more expensive than counters: at least 10x even with coarse buckets, and even more with exponential histograms or properly fine-grained buckets. That would make it hard to justify having the health metrics enabled by default, while having them enabled by default gives the best out-of-the-box experience for users. At the same time, I can't really see the general importance / usefulness of having the exporter durations: it feels more like a nice-to-have. What conclusions does this metric allow you to draw? Do you have concrete examples?
Since all other comments and discussions above are resolved: WDYT? @lmolkova @JonasKunz @dashpole @lzchen
Pipeline latency is cool, but the moment you have it, you need to also have a way to break it down into pieces (exporting part, processor queue).
Debugging connectivity with my backend: network issues, throttling, slow backend responses, retries, retry backoff interval optimizations. Counts are good, but they won't tell you that your P99 is 10 sec after all retries because your backoff interval is wrong; you'd just see fewer of them and have no idea. So I don't think it's a nice-to-have. As a cost mitigation strategy, we can always use a small number of buckets by default, and users can reconfigure them if they need fewer/more.
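To make that mitigation concrete, here is a minimal sketch in Java of how a user could override the default buckets through the SDK's View API. It assumes the current opentelemetry-java `SdkMeterProvider`/`View` builders, and the metric name used here is purely hypothetical (this PR does not define a duration metric):

```java
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.InstrumentType;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;
import java.util.Arrays;

public class CoarseDurationBuckets {
  public static void main(String[] args) {
    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder()
            .registerView(
                InstrumentSelector.builder()
                    .setType(InstrumentType.HISTOGRAM)
                    // Hypothetical metric name, for illustration only.
                    .setName("otel.sdk.exporter.operation.duration")
                    .build(),
                View.builder()
                    // A handful of coarse buckets keeps the storage overhead of an
                    // on-by-default histogram low; users who need finer resolution
                    // can register a View with more buckets instead.
                    .setAggregation(
                        Aggregation.explicitBucketHistogram(
                            Arrays.asList(0.005, 0.05, 0.5, 5.0)))
                    .build())
            .build();
  }
}
```

The same View mechanism would also let users drop the histogram entirely (e.g. via `Aggregation.drop()`) if they only care about the counters.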
Issue for follow-up discussions around adding duration: |
- Generate `semconv/v1.31.0`
- Stop generating deprecated metric semconv, similar to all other generation
- Fix acronyms:
  - `ReplicationController`
  - `ResourceQuota`

## [`v1.31.0` semantic convention release notes](https://github.com/open-telemetry/semantic-conventions/releases/tag/v1.31.0):

### 🛑 Breaking changes 🛑

- `code`: `code.function.name` value should contain the fully qualified function name, `code.namespace` is now deprecated (open-telemetry/semantic-conventions#1677)
- `gen-ai`: Introduce `gen_ai.output.type` and deprecate `gen_ai.openai.request.response_format` (open-telemetry/semantic-conventions#1757)
- `mobile`: Rework `device.app.lifecycle` mobile event. (open-telemetry/semantic-conventions#1880)
  The `device.app.lifecycle` event has been reworked to use attributes instead of event body fields. The `ios.app.state` and `android.app.state` attributes have been reintroduced to the attribute registry.
- `system`: Move CPU-related system.cpu.* metrics to CPU namespace (open-telemetry/semantic-conventions#1873)
- `k8s`: Change k8s.replication_controller metrics to k8s.replicationcontroller (open-telemetry/semantic-conventions#1848)
- `db`: Rename `db.system` to `db.system.name` in database metrics, and update the values to be consistent with database spans. (open-telemetry/semantic-conventions#1581)
- `session`: Move `session.id` and `session.previous_id` from body fields to event attributes, and yamlize `session.start` and `session.end` events. (open-telemetry/semantic-conventions#1845)
  As part of the ongoing migration of event fields from LogRecord body to extended/complex attributes, the `session.start` and `session.end` events have been redefined.

### 💡 Enhancements 💡

- `code`: Mark `code.*` semantic conventions as release candidate (open-telemetry/semantic-conventions#1377)
- `gen-ai`: Added AI Agent Semantic Convention (open-telemetry/semantic-conventions#1732, open-telemetry/semantic-conventions#1739)
- `db`: Add database-specific notes on db.operation.name and db.collection.name for Cassandra, Cosmos DB, HBase, MongoDB, and Redis, covering their batch/bulk terms and lack of cross-table queries. (open-telemetry/semantic-conventions#1863, open-telemetry/semantic-conventions#1573)
- `gen-ai`: Adds `gen_ai.request.choice.count` span attribute (open-telemetry/semantic-conventions#1888)
  Enables recording the target number of completions to generate
- `enduser`: Undeprecate `enduser.id` and introduce new attribute `enduser.pseudo.id` (open-telemetry/semantic-conventions#1104)
  The new attribute `enduser.pseudo.id` is intended to provide a unique identifier of a pseudonymous enduser.
- `k8s`: Add `k8s.hpa`, `k8s.resourcequota` and `k8s.replicationcontroller` attributes and resources (open-telemetry/semantic-conventions#1656)
- `k8s`: How to populate resource attributes based on attributes, labels and transformation (open-telemetry/semantic-conventions#236)
- `process`: Adjust the semantic expectations for `process.executable.name` (open-telemetry/semantic-conventions#1736)
- `otel`: Adds SDK self-monitoring metrics for span processing (open-telemetry/semantic-conventions#1631)
- `cicd`: Adds a new attribute `cicd.pipeline.run.url.full` and corrects the attribute description of `cicd.pipeline.task.run.url.full` (open-telemetry/semantic-conventions#1796)
- `user-agent`: Add `user_agent.os.name` and `user_agent.os.version` attributes (open-telemetry/semantic-conventions#1433)

### 🧰 Bug fixes 🧰

- `process`: Fix units of process.open_file_descriptor.count and process.context_switches (open-telemetry/semantic-conventions#1662)

Co-authored-by: Robert Pająk <[email protected]>
Changes
With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.
We checked the SDK implementations; it seems like only the Java SDK currently has health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.
I'd like to start out with just span-related metrics to keep this PR and the discussion simpler, but I would follow up with similar PRs for logs and metrics based on the outcome of the discussion here.
Prior work
This PR can be seen as a follow-up to the closed OTEP 259:
So we have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach that unified the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).
In my opinion, it is a good thing to separate the collector and SDK self-metrics:
Existing Metrics in Java SDK
For reference, here is what the existing health metrics currently look like in the Java SDK:
Batch Span Processor metrics
- `queueSize`, value is the current size of the queue
  - attribute `spanProcessorType=BatchSpanProcessor` (there was a former `ExecutorServiceSpanProcessor` which has been removed)
  - […] `BatchSpanProcessor` instances are used
- `processedSpans`, value is the number of spans submitted to the processor
  - attribute `spanProcessorType=BatchSpanProcessor`
  - attribute `dropped` (boolean), `true` for the number of spans which could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the `BatchLogRecordProcessor`, just with `span` replaced everywhere by `log`.
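As a side note on how these existing Java metrics are enabled today: the sketch below shows roughly how the processor is given a `MeterProvider`. It assumes the current opentelemetry-java builder APIs and is meant for illustration only:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class JavaSdkHealthMetricsWiring {
  public static void main(String[] args) {
    // MeterProvider that will receive the processor health metrics.
    SdkMeterProvider meterProvider = SdkMeterProvider.builder().build();

    // Passing the MeterProvider to the BatchSpanProcessor is what enables the
    // queueSize / processedSpans metrics described above.
    BatchSpanProcessor processor =
        BatchSpanProcessor.builder(OtlpGrpcSpanExporter.getDefault())
            .setMeterProvider(meterProvider)
            .build();

    SdkTracerProvider tracerProvider =
        SdkTracerProvider.builder().addSpanProcessor(processor).build();
  }
}
```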
Exporter metrics

Exporter metrics are the same for spans, metrics and logs; they are distinguishable based on a `type` attribute.

The metric names also depend on a "name" and a "transport" defined by the exporter. For OTLP those are:

- `exporterName=otlp`
- `transport` is one of `grpc`, `http` (= protobuf) or `http-json`

The transport is used just for the instrumentation scope name: `io.opentelemetry.exporters.<exporterName>-<transport>`

Based on that, the following metrics are exposed:
- Counter `<exporterName>.exporter.seen`: the number of records (spans, metrics or logs) submitted to the exporter
  - `type`: one of `span`, `metric` or `log`
- Counter `<exporterName>.exporter.exported`: the number of records (spans, metrics or logs) actually exported (or failed)
  - `type`: one of `span`, `metric` or `log`
  - `success` (boolean): `false` for exporter failures
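For illustration, here is a small sketch of how these exporter metrics could be collected and inspected in-process. It assumes the opentelemetry-java SDK testing artifact (`InMemoryMetricReader`) and the OTLP exporter builder's `setMeterProvider` hook; treat it as a sketch rather than a definitive recipe:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class InspectExporterMetrics {
  public static void main(String[] args) {
    // Reader that lets us pull the exporter's self-metrics without a backend.
    InMemoryMetricReader reader = InMemoryMetricReader.create();
    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder().registerMetricReader(reader).build();

    // The exporter reports its own metrics to the given MeterProvider.
    OtlpGrpcSpanExporter exporter =
        OtlpGrpcSpanExporter.builder().setMeterProvider(meterProvider).build();

    SdkTracerProvider tracerProvider =
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

    // After some spans have been exported, the collected metrics should include
    // otlp.exporter.seen and otlp.exporter.exported with the type / success
    // attributes described above.
    reader.collectAllMetrics().forEach(metric -> System.out.println(metric.getName()));
  }
}
```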