Skip to content

Conversation

@link04
Copy link
Contributor

@link04 link04 commented Sep 13, 2025

Summary of changes

This PR implements the OTLP Metrics Exporter for OpenTelemetry Metrics API support, completing the final piece of the RFC implementation. Key additions include:

  • OTLP Metrics Exporter: Implemented MetricExporter abstract class and OtlpExporter concrete implementation initially with only HTTP/protobuf transport support ATM.
  • Performance Optimizations: Custom protobuf serialization without external dependencies, async operations with ConfigureAwait(false)
  • Retry Logic: Exponential backoff with Retry-After header handling for HTTP 429, 502, 503, 504 responses
  • Telemetry Metrics: Internal metrics tracking (metrics_export_attempts, metrics_export_successes, metrics_export_failures) with protocol tag
  • Sample App Integration: Enhanced Samples.OpenTelemetrySdk to support both DD and OTel SDK approaches for comparative snapshot testing

This completes the series of 3 PRs for OpenTelemetry Metrics API support.

  1. [OTEL] Metrics API Support - Configurations & Telemetry #7420
  2. [OTEL] Metrics API Support - Collection Implementation #7511
  3. [OTEL] Metrics API Support - Metrics Reader & OTLP Exporter #7514 <- You are here.

Reason for change

The [RFC] OpenTelemetry Metrics API support requires the .NET tracer to act as a drop-in replacement for the OpenTelemetry SDK. This PR implements the OTLP export functionality, enabling customers to migrate from OTel SDK to Datadog SDK with zero friction.
Business Impact: Enables customers using OpenTelemetry Metrics API to switch to Datadog without code changes, supporting our strategic initiative to be OTel-native.

Implementation details

✅ OTLP Metrics Exporter Components:

  • MetricExporter - Abstract base class defining export contract
  • OtlpExporter - Concrete implementation with HTTP/protobuf transport
  • OtlpMetricsSerializer - Custom protobuf serialization for performance
  • MetricReader - Updated to use the new exporter architecture

✅ Key Features:

  • HTTP/protobuf transport support
  • Resource attribute precedence following RFC (DD_* → OTEL_* → defaults)
  • Retry logic with exponential backoff and proper HTTP status handling
  • Telemetry metrics with protocol tags (protocol:http/protobuf, protocol:grpc, protocol:http/json)
  • Performance optimizations with async operations and custom serialization

✅ Sample App Enhancements:

  • Conditional OTel SDK exporter via OTEL_METRICS_EXPORTER_ENABLED environment variable for testing
  • Comparative testing between DD and OTel SDK approaches
  • Proper MeterProvider lifecycle management with disposal

Test coverage

Added comprehensive integration test SubmitsOtlpMetrics that:

  • Tests both approaches: DD enabled/OTel disabled vs DD disabled/OTel enabled
  • Verifies identical results: Both approaches produce the same OTLP metric requests
  • Cross-version testing: Runs across all OpenTelemetry package versions
  • Snapshot verification: Ensures RFC compliance and data model compatibility

NOTE: Hoping to add Grpc and HTTP/Json tests along with the current SubmitsOtlpMetrics protobuf one and also benchmarks if possible to compare OTEL vs Datadog overhead.

Other details

Jira: APMAPI-1573

@datadog-official

This comment has been minimized.

@dd-trace-dotnet-ci-bot
Copy link

dd-trace-dotnet-ci-bot bot commented Sep 13, 2025

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (72ms)  : 71, 73
     .   : milestone, 72,
    master - mean (72ms)  : 71, 73
     .   : milestone, 72,

    section Baseline
    This PR (7514) - mean (68ms)  : 66, 71
     .   : milestone, 68,
    master - mean (68ms)  : 67, 69
     .   : milestone, 68,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (1,049ms)  : 1010, 1089
     .   : milestone, 1049,
    master - mean (1,054ms)  : 1000, 1108
     .   : milestone, 1054,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (107ms)  : 105, 108
     .   : milestone, 107,
    master - mean (106ms)  : 105, 108
     .   : milestone, 106,

    section Baseline
    This PR (7514) - mean (106ms)  : 103, 109
     .   : milestone, 106,
    master - mean (106ms)  : 104, 108
     .   : milestone, 106,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (745ms)  : 722, 769
     .   : milestone, 745,
    master - mean (746ms)  : 724, 769
     .   : milestone, 746,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (101ms)  : 100, 102
     .   : milestone, 101,
    master - mean (101ms)  : 100, 101
     .   : milestone, 101,

    section Baseline
    This PR (7514) - mean (100ms)  : 98, 102
     .   : milestone, 100,
    master - mean (100ms)  : 98, 102
     .   : milestone, 100,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (776ms)  : 730, 822
     .   : milestone, 776,
    master - mean (775ms)  : 727, 822
     .   : milestone, 775,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (93ms)  : 91, 94
     .   : milestone, 93,
    master - mean (93ms)  : 92, 94
     .   : milestone, 93,

    section Baseline
    This PR (7514) - mean (92ms)  : 90, 95
     .   : milestone, 92,
    master - mean (92ms)  : 90, 94
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (656ms)  : 642, 670
     .   : milestone, 656,
    master - mean (664ms)  : 648, 679
     .   : milestone, 664,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (198ms)  : 194, 202
     .   : milestone, 198,
    master - mean (200ms)  : 196, 205
     .   : milestone, 200,

    section Baseline
    This PR (7514) - mean (193ms)  : 189, 196
     .   : milestone, 193,
    master - mean (196ms)  : 190, 202
     .   : milestone, 196,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (1,170ms)  : 1096, 1244
     .   : milestone, 1170,
    master - mean (1,197ms)  : 1120, 1273
     .   : milestone, 1197,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (279ms)  : 274, 284
     .   : milestone, 279,
    master - mean (282ms)  : 276, 287
     .   : milestone, 282,

    section Baseline
    This PR (7514) - mean (278ms)  : 273, 283
     .   : milestone, 278,
    master - mean (281ms)  : 275, 287
     .   : milestone, 281,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (933ms)  : 899, 967
     .   : milestone, 933,
    master - mean (948ms)  : 904, 992
     .   : milestone, 948,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (281ms)  : 277, 286
     .   : milestone, 281,
    master - mean (285ms)  : 280, 291
     .   : milestone, 285,

    section Baseline
    This PR (7514) - mean (281ms)  : 275, 287
     .   : milestone, 281,
    master - mean (286ms)  : 279, 294
     .   : milestone, 286,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (985ms)  : 933, 1037
     .   : milestone, 985,
    master - mean (1,008ms)  : 965, 1051
     .   : milestone, 1008,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7514) - mean (269ms)  : 264, 274
     .   : milestone, 269,
    master - mean (276ms)  : 266, 286
     .   : milestone, 276,

    section Baseline
    This PR (7514) - mean (269ms)  : 265, 273
     .   : milestone, 269,
    master - mean (275ms)  : 269, 281
     .   : milestone, 275,

    section CallTarget+Inlining+NGEN
    This PR (7514) - mean (847ms)  : 826, 867
     .   : milestone, 847,
    master - mean (869ms)  : 841, 898
     .   : milestone, 869,

Loading

@link04 link04 added the docker_image_artifacts Use to label PRs for which you would need a Docker Image created for. label Sep 16, 2025
@link04 link04 marked this pull request as ready for review September 26, 2025 16:41
@link04 link04 requested a review from a team as a code owner September 26, 2025 16:41
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just leave a file-level comment to make sure it doesn't get lost, but as a follow-up task we should vendor this code so that it's easy to update how we emit protobuf if we need to update to a newer proto file to implement. In general it seems like we're creating a bunch of intermediate byte arrays which could be optimized away, but it's not worth spending time on this if we're going to change the code soon. Otherwise, as long as we are able to emit correct OTLP, this is good for this PR

Copy link
Contributor

@zacharycmontoya zacharycmontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last overall comment I have is that we should remove the gRPC emission from this PR and focus on the http/protobuf serialization, but nothing else major from me!

@link04 link04 force-pushed the maximo/otel-metrics-api-exporter branch 2 times, most recently from 8a89f6a to 57b1eb5 Compare October 3, 2025 20:02
Copy link
Contributor

@zacharycmontoya zacharycmontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! We've got some follow-up items so let's make sure to track all of them, and please make sure system-tests for the http/protobuf exporter tests pass.

@bouwkast
Copy link
Collaborator

bouwkast commented Oct 6, 2025

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines 216 to 227
private async Task<bool> SendWithRetry(HttpRequestMessage httpRequest)
{
const int maxRetries = 3;
var retryDelay = TimeSpan.FromMilliseconds(100);

for (var attempt = 0; attempt <= maxRetries; attempt++)
{
try
{
using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(_timeoutMs));
var response = await _httpClient.SendAsync(httpRequest, cts.Token).ConfigureAwait(false);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Recreate requests between retries

The retry loop reuses the same HttpRequestMessage instance across attempts. After _httpClient.SendAsync sends a request once, the framework throws InvalidOperationException: The request message was already sent on any subsequent call, so none of the configured retries ever execute and transient 5xx/429 failures immediately fail the export. Cloning the request or rebuilding it inside the loop is required for the retry policy to work.

Useful? React with 👍 / 👎.

Comment on lines 45 to +56
switch (_identity.InstrumentType)
{
case InstrumentType.Counter:
case InstrumentType.ObservableCounter:
case InstrumentType.UpDownCounter:
case InstrumentType.ObservableUpDownCounter:
point.UpdateCounter(value);
break;
case InstrumentType.ObservableCounter:
case InstrumentType.ObservableUpDownCounter:
point.UpdateObservableCounter(value);
break;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Delta exports for ObservableCounter report cumulative totals

For asynchronous counters the measurement callback writes the current cumulative value into _runningDoubleValue, and CreateSnapshotAndReset() zeroes that value whenever the temporality is delta (the default). Because no previous value is preserved, each export emits the full cumulative reading rather than the increment since the last export (e.g., 120 is exported instead of 20 when a counter rises from 100 to 120). This doubles counts every interval for any ObservableCounter<T> when OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta.

Useful? React with 👍 / 👎.

Comment on lines 339 to 346
if (metric.InstrumentType is InstrumentType.Counter
or InstrumentType.UpDownCounter
or InstrumentType.ObservableCounter
or InstrumentType.ObservableUpDownCounter)
{
WriteTag(writer, FieldNumbers.NumberDataPointAsInt, Fixed64);
writer.Write((long)metric.SnapshotSum);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Double counters are truncated to integers during OTLP serialization

When serializing sum metrics the code always writes counters to the protobuf as_int field and casts SnapshotSum to long. Any counter based on double (e.g., instruments incremented by 0.1 or 1.5) loses its fractional component, and large values can overflow the 64‑bit cast. The serializer should choose NumberDataPointAsDouble whenever the metric was recorded as a floating‑point value or, at minimum, avoid casting away precision.

Useful? React with 👍 / 👎.

link04 pushed a commit that referenced this pull request Oct 6, 2025
Reverts #7582

This is preventing us to run the action as expected as the regex appears
to be incorrect.
Since we must trigger this ourselves there really isn't a risk here at
the moment, will let the SINT team follow up to resolve.

This is blocking the testing of
#7514


https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=188595&view=logs&j=8299a7f3-61e2-5626-a12e-85b7b20deab1&t=e26f8d61-df24-5d93-564c-322f7221200d
@link04 link04 force-pushed the maximo/otel-metrics-api-exporter branch from ee91524 to 2c59b2c Compare October 7, 2025 14:00
Copy link
Collaborator

@bouwkast bouwkast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments but LGTM overall

Note that I didn't inspect the snapshots very closely as I'm not super sure on what they should be 😅

@link04 link04 enabled auto-merge (squash) October 7, 2025 21:03
@link04 link04 merged commit 33b6106 into master Oct 7, 2025
156 of 157 checks passed
@link04 link04 deleted the maximo/otel-metrics-api-exporter branch October 7, 2025 23:58
@github-actions github-actions bot added this to the vNext-v3 milestone Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:opentelemetry OpenTelemetry support docker_image_artifacts Use to label PRs for which you would need a Docker Image created for.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants