[OTEL] Metrics API Support - Collection Implementation #7511

link04 · 2025-09-13T02:08:09Z

Summary of changes

This PR implements the metrics collection foundation for OpenTelemetry Metrics API supportKey additions include:

MetricReader: Component for collecting instrument measurements and aggregating them
MetricReaderHandler: High-performance measurement processing with proper aggregation (Sum, Last Value, Histogram)
MetricPoint: Thread-safe data structure for metric aggregation with histogram bucket support
Testing: Unit test covering all 7 instrument types (4 sync + 3 async) with exact OTLP value verification

This is the second of a series of 3 PRs to complete the OpenTelemetry Metrics API support.

Reason for change

The [RFC] OpenTelemetry Metrics API support requires implementing metrics collection that's compatible with the OpenTelemetry data model. This PR provides the collection foundation that will enable customers to use .NET's System.Diagnostics.Metrics API with Datadog as a drop-in replacement for the OpenTelemetry SDK.

Business Impact: Enables the collection phase of OpenTelemetry Metrics API support, laying the groundwork for OTLP export in the next PR.

Implementation details

✅ Metrics Collection Components:

MetricReader - RFC-compliant component for initialization and async collection
MetricReaderHandler - Measurement processing and aggregation
MetricPoint - Aggregation data structure with histogram bucket support

✅ Aggregation Mapping:

Counter/Asynchronous Counter → Sum Aggregation → Sum Metric Point ✅
UpDownCounter/Asynchronous UpDownCounter → Sum Aggregation → Sum Metric Point ✅
Gauge/Asynchronous Gauge → Last Value Aggregation → Gauge Metric Point ✅
Histogram → Explicit Bucket Histogram Aggregation → Histogram Metric Point ✅

Test coverage

Added unit test MeterListenerCapturesAllMetrics() that verifies:

All 7 instrument types (Counter, Histogram, Gauge, UpDownCounter + 3 Async variants)
Value verification matching OTLP snapshot expectations
Histogram bucket logic verification
Sync + Async instrument collection
Tag capture and temporality settings

Other details

Jira: APMAPI-1572

dd-trace-dotnet-ci-bot · 2025-09-13T18:22:02Z

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

Welch test with statistical test for significance of 5%
Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (72ms)  : 71, 73
     .   : milestone, 72,
    master - mean (72ms)  : 71, 73
     .   : milestone, 72,

    section Baseline
    This PR (7511) - mean (69ms)  : 65, 72
     .   : milestone, 69,
    master - mean (68ms)  : 66, 71
     .   : milestone, 68,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,044ms)  : 1002, 1086
     .   : milestone, 1044,
    master - mean (1,052ms)  : 990, 1115
     .   : milestone, 1052,

gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (106ms)  : 105, 107
     .   : milestone, 106,
    master - mean (106ms)  : 105, 107
     .   : milestone, 106,

    section Baseline
    This PR (7511) - mean (105ms)  : 103, 108
     .   : milestone, 105,
    master - mean (106ms)  : 102, 109
     .   : milestone, 106,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (740ms)  : 721, 758
     .   : milestone, 740,
    master - mean (743ms)  : 714, 771
     .   : milestone, 743,

gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (101ms)  : 99, 102
     .   : milestone, 101,
    master - mean (101ms)  : 100, 102
     .   : milestone, 101,

    section Baseline
    This PR (7511) - mean (100ms)  : 97, 102
     .   : milestone, 100,
    master - mean (100ms)  : 98, 102
     .   : milestone, 100,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (770ms)  : 724, 815
     .   : milestone, 770,
    master - mean (779ms)  : 730, 828
     .   : milestone, 779,

gantt
    title Execution time (ms) FakeDbCommand (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (93ms)  : 92, 93
     .   : milestone, 93,
    master - mean (93ms)  : 92, 94
     .   : milestone, 93,

    section Baseline
    This PR (7511) - mean (92ms)  : 90, 95
     .   : milestone, 92,
    master - mean (92ms)  : 90, 94
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (655ms)  : 639, 672
     .   : milestone, 655,
    master - mean (661ms)  : 645, 678
     .   : milestone, 661,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (208ms)  : 198, 218
     .   : milestone, 208,
    master - mean (208ms)  : 199, 216
     .   : milestone, 208,

    section Baseline
    This PR (7511) - mean (204ms)  : 194, 214
     .   : milestone, 204,
    master - mean (204ms)  : 195, 212
     .   : milestone, 204,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,218ms)  : 1141, 1295
     .   : milestone, 1218,
    master - mean (1,216ms)  : 1157, 1275
     .   : milestone, 1216,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (295ms)  : 283, 307
     .   : milestone, 295,
    master - mean (300ms)  : 277, 323
     .   : milestone, 300,

    section Baseline
    This PR (7511) - mean (294ms)  : 282, 305
     .   : milestone, 294,
    master - mean (297ms)  : 280, 315
     .   : milestone, 297,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (971ms)  : 933, 1010
     .   : milestone, 971,
    master - mean (993ms)  : 953, 1034
     .   : milestone, 993,

gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (297ms)  : 287, 307
     .   : milestone, 297,
    master - mean (303ms)  : 281, 324
     .   : milestone, 303,

    section Baseline
    This PR (7511) - mean (296ms)  : 285, 306
     .   : milestone, 296,
    master - mean (300ms)  : 282, 318
     .   : milestone, 300,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,030ms)  : 978, 1082
     .   : milestone, 1030,
    master - mean (1,046ms)  : 989, 1103
     .   : milestone, 1046,

gantt
    title Execution time (ms) HttpMessageHandler (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (286ms)  : 274, 298
     .   : milestone, 286,
    master - mean (287ms)  : 271, 302
     .   : milestone, 287,

    section Baseline
    This PR (7511) - mean (286ms)  : 273, 298
     .   : milestone, 286,
    master - mean (286ms)  : 268, 304
     .   : milestone, 286,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (894ms)  : 836, 951
     .   : milestone, 894,
    master - mean (907ms)  : 834, 980
     .   : milestone, 907,

zacharycmontoya · 2025-09-15T21:16:19Z

tracer/src/Datadog.Trace/OTelMetrics/MetricReader.cs

+                return;
+            }
+
+            var meterListener = new System.Diagnostics.Metrics.MeterListener();


This is interesting to directly reference the MeterListener since we can target the .NET 6.0 target framework. I think this decision is fine for this initial set of PR's so that we can develop this feature with first-class support for .NET 6.0+ where the API's are shipped in the runtime, but we'll want to revisit this later and do reflection/duck-typing so that applications on older runtimes that use the API's via the System.Diagnostics.DiagnosticSource NuGet package can also emit metrics.

tracer/src/Datadog.Trace/OTelMetrics/MetricReader.cs

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

tracer/src/Datadog.Trace/OTelMetrics/MetricPoint.cs

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

tracer/test/Datadog.Trace.ClrProfiler.IntegrationTests/OpenTelemetrySdkTests.cs

zacharycmontoya · 2025-09-16T00:09:46Z

tracer/test/Datadog.Trace.ClrProfiler.IntegrationTests/OpenTelemetrySdkTests.cs

+
+            // Trigger async collection and get metrics for testing
+            OTelMetrics.MetricReader.CollectObservableInstruments();
+            var capturedMetrics = OTelMetrics.MetricReaderHandler.GetCapturedMetricsForTesting();


Note: When you implement the Exporter work, I recommend replacing this GetCapturedMetricsForTesting method with an InMemoryExporter so you're continuing to test the Export processing. We probably shouldn't expose it without a good use case, but you can make it internal to Datadog.Trace

tracer/test/Datadog.Trace.ClrProfiler.IntegrationTests/OpenTelemetrySdkTests.cs

zacharycmontoya

I took a first pass at this, and I left comments with the following high-level feedback:

The TemporalityPreference to Instrument.Temporality needs to be updated for correctness. And then the system-tests I've started should identify any other correctness issues.
In general, we can use enums rather than strings since we have a well-defined set of Instruments
Performance: We can add state to the measurement event callbacks, so let's use that to improve the access to individual metric points when we need to update their values.

Note: I still have several configurations I need to test in system-tests so let's sync offline as you progress on that testing.

tracer/src/Datadog.Trace/OTelMetrics/MetricState.cs

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

tracer/test/Datadog.Trace.Tests/MeterListenerTests.cs

zacharycmontoya

The new MetricState looks much improved over the original single dictionary! The main feedback after reviewing the latest changes is that we need to output separate metrics streams when individual measurements have separate tags, so you'll need to modify your implementation to account for that. I'm sorry I don't have tests ready to show you what that looks like, I'll turn my attention to that ASAP so I have a good reference for you to follow.

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

tracer/src/Datadog.Trace/OTelMetrics/MetricState.cs

zacharycmontoya · 2025-09-23T21:11:48Z

tracer/src/Datadog.Trace/OTelMetrics/TagSet.cs

+            var sorted = tags.ToArray();
+            Array.Sort(sorted, (a, b) => string.CompareOrdinal(a.Key, b.Key));
+
+            var key = string.Join(";", sorted.Select(kv => $"{kv.Key}={kv.Value}"));


This TagSet implementation and usage looks good. I was trying to think of a more efficient way to create a unique key from the ReadOnlySpan<KeyValuePair<string, object?>> tags and off the top of my head I can't think of one. Perhaps we could intern the resulting string because if we get the same tags over and over again, we could save on memory allocations, but then that string value will not be garbage-collected quickly, so it's not a clear win.

Perhaps at a later time we could create a central cache that maps unique ReadOnlySpan<KeyValuePair<string, object?>> tags objects => TagSet objects by creating a trie data structure. But not worth spending more time on it without microbenchmarking

Arguably a cached stringbuilder with a foreach over the tags would make sense here, e.g. something like this?

var sb = StringBuilderCache.Acquire(); foreach(var pair in sorted) { sb.Append(kv.Key); sb.Append('='); sb.Append(kv.Value); sb.Append(';'); } var key = StringBuilderCache.GetStringAndRelease(sb);

Will address this in the next PR: #7514

tracer/test/Datadog.Trace.Tests/MeterListenerTests.cs

zacharycmontoya

LGTM! I've left a couple of small nits but overall this looks great and it's organized really well. Thanks for working through my detailed feedback

andrewlock

LGTM in general, nice job!

Some overall comments:

Lose the [MethodImpl(MethodImplOptions.AggressiveInlining)]. There's a lot of places where that is likely going to be harmful overall, and there's no real need for it unless benchmarking shows it's definitely going to be useful IMO.
The use of statics makes testing things more complicated. I would consider converting to instance objects and then creating a single static instance for the lifetime management.
We may need to be careful with the logging in "hot paths" (e.g. recording values). We will likely have to add some sort of rate-limiting/gating to avoid flooding the logs
There's a couple of places that might need "fixing" for correctness, or at least adding comments acknowledging/working around the issue e.g. overflow

tracer/missing-nullability-files.csv

tracer/src/Datadog.Trace/OTelMetrics/AggregationTemporality.cs

tracer/src/Datadog.Trace/OTelMetrics/MetricPoint.cs

andrewlock · 2025-09-25T10:30:49Z

tracer/src/Datadog.Trace/OTelMetrics/MetricPoint.cs

+
+            lock (_histogramLock)
+            {
+                unchecked


FWIW, unchecked is technically the default, but it doesn't hurt to add this in case someone has explicitly opted in to checked. However, is unchecked roll over what you actually want (where a big number suddenly becomes negative?!)? Or do you actually want to clamp to max in this case? Or should we reject the point entirely?

That's maybe a question for @zacharycmontoya tbh

Looking at the equivalent history in opentelemetry-dotnet, they currently implement a similar unchecked section:

unchecked { this.runningValue.AsLong++; this.histogramBuckets.RunningSum += number; this.histogramBuckets.RunningBucketCounts[i]++; }

This unchecked was added in this PR so that they could avoid throwing an exception on a possible increment operation since they are hold a critical section lock.

In terms of correctness, I'm not aware of any mandates for handling integer overflows. I think it would be OK to accept this, and it would be consistent with opentelemetry-dotnet

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

andrewlock · 2025-09-25T11:02:24Z

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

+        // For testing only - reset all captured metrics
+        internal static void ResetForTesting()
+        {
+            CapturedMetrics.Clear();
+            MetricStreamNames.Clear();
+            Interlocked.Exchange(ref _metricCount, 0);
+        }


nit: this is another sign for me that this whole type shouldn't be static. This is the sort of thing that's super easy to test and reason about if you can just do new MetricReaderHandler(). It also can make testing faster, as you can run all those tests in parallel if desired.

Will remove static methods in next PR: #7514

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs

andrewlock · 2025-09-25T11:10:34Z

tracer/src/Datadog.Trace/OTelMetrics/MetricState.cs

+            var tagDict = new Dictionary<string, object?>();
+            for (int i = 0; i < tags.Length; i++)
+            {
+                var kv = tags[i];
+                tagDict[kv.Key] = kv.Value;
+            }


Why can't we do this? 🤔

var tagDict = new Dictionary<string, object?>(tags);

Also, we're creating this new dictionary every time you try to get a point, even if it already exists. I would suggest moving the construction into the lambda as a minimum, e.g.:

return _points.GetOrAdd(tagSet, _ => new MetricPoint( _identity.InstrumentName, _identity.MeterName, _identity.InstrumentType, MetricReaderHandler.GetTemporality(_identity.InstrumentType), new Dictionary<string, object?>(tags)));

Also, this will still allocate a closure every time you call this method... you might want to consider a non-allocating TryGetValue first:

private MetricPoint GetOrCreatePoint(ReadOnlySpan<KeyValuePair<string, object?>> tags) { var tagSet = TagSet.FromSpan(tags); if (_points.TryGetValue(tagset, out var value)) { return value; } // fallback, have to allocate return _points.GetOrAdd(tagSet, _ => new MetricPoint( _identity.InstrumentName, _identity.MeterName, _identity.InstrumentType, MetricReaderHandler.GetTemporality(_identity.InstrumentType), tagDict)); }

+1, I think in a previous commit the dict was constructed in the lambda. Let's make sure to move the dictionary construction into the lambda so it's only calculated when creating a new metric point

## Summary of changes This PR implements the OTLP Metrics Exporter for OpenTelemetry Metrics API support, completing the final piece of the RFC implementation. Key additions include: - OTLP Metrics Exporter: Implemented `MetricExporter` abstract class and `OtlpExporter` concrete implementation initially with only ` HTTP/protobuf` transport support ATM. - Performance Optimizations: Custom protobuf serialization without external dependencies, async operations with ConfigureAwait(false) - Retry Logic: Exponential backoff with Retry-After header handling for HTTP 429, 502, 503, 504 responses - Telemetry Metrics: Internal metrics tracking (`metrics_export_attempts`, `metrics_export_successes`, `metrics_export_failures`) with protocol tag - Sample App Integration: Enhanced Samples.OpenTelemetrySdk to support both DD and OTel SDK approaches for comparative snapshot testing This completes the series of 3 PRs for OpenTelemetry Metrics API support. 1. [[OTEL] Metrics API Support - Configurations & Telemetry #7420](#7420) 2. [[OTEL] Metrics API Support - Collection Implementation #7511](#7511) 3. [[OTEL] Metrics API Support - Metrics Reader & OTLP Exporter #7514](#7514) <- You are here. ## Reason for change The [RFC] OpenTelemetry Metrics API support requires the .NET tracer to act as a drop-in replacement for the OpenTelemetry SDK. This PR implements the OTLP export functionality, enabling customers to migrate from OTel SDK to Datadog SDK with zero friction. Business Impact: Enables customers using OpenTelemetry Metrics API to switch to Datadog without code changes, supporting our strategic initiative to be OTel-native. ## Implementation details :white_check_mark: OTLP Metrics Exporter Components: - MetricExporter - Abstract base class defining export contract - OtlpExporter - Concrete implementation with HTTP/protobuf transport - OtlpMetricsSerializer - Custom protobuf serialization for performance - MetricReader - Updated to use the new exporter architecture :white_check_mark: Key Features: - HTTP/protobuf transport support - Resource attribute precedence following RFC (DD_* → OTEL_* → defaults) - Retry logic with exponential backoff and proper HTTP status handling - Telemetry metrics with protocol tags (protocol:http/protobuf, protocol:grpc, protocol:http/json) - Performance optimizations with async operations and custom serialization :white_check_mark: Sample App Enhancements: - Conditional OTel SDK exporter via OTEL_METRICS_EXPORTER_ENABLED environment variable for testing - Comparative testing between DD and OTel SDK approaches - Proper MeterProvider lifecycle management with disposal ## Test coverage Added comprehensive integration test `SubmitsOtlpMetrics` that: - Tests both approaches: DD enabled/OTel disabled vs DD disabled/OTel enabled - Verifies identical results: Both approaches produce the same OTLP metric requests - Cross-version testing: Runs across all OpenTelemetry package versions - Snapshot verification: Ensures RFC compliance and data model compatibility **NOTE**: Hoping to add Grpc and HTTP/Json tests along with the current `SubmitsOtlpMetrics` protobuf one and also benchmarks if possible to compare OTEL vs Datadog overhead. ## Other details  Jira: [APMAPI-1573](https://datadoghq.atlassian.net/browse/APMAPI-1573)  [APMAPI-1573]: https://datadoghq.atlassian.net/browse/APMAPI-1573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

link04 added the docker_image_artifacts Use to label PRs for which you would need a Docker Image created for. label Sep 13, 2025

link04 added 3 commits September 13, 2025 13:37

Initial Metric listener collects and logs metric

8439ac1

Adding support for other metrics, missing async

1231ba1

Collecting and capturing all metrics types

8ee89bd

link04 force-pushed the maximo/otel-metrics-api-collection branch from ea31ebc to 8ee89bd Compare September 13, 2025 17:37

Merge branch 'master' into maximo/otel-metrics-api-collection

df692b6

link04 removed the docker_image_artifacts Use to label PRs for which you would need a Docker Image created for. label Sep 13, 2025

This was referenced Sep 13, 2025

[OTEL] Metrics API Support - Metrics Reader & OTLP Exporter #7514

Merged

[OTEL] Metrics API Support - Configurations & Telemetry #7420

Merged

link04 added the area:opentelemetry OpenTelemetry support label Sep 13, 2025

link04 marked this pull request as ready for review September 15, 2025 20:03

link04 requested review from a team as code owners September 15, 2025 20:03

zacharycmontoya reviewed Sep 15, 2025

View reviewed changes

tracer/src/Datadog.Trace/OTelMetrics/MetricReader.cs Outdated Show resolved Hide resolved

zacharycmontoya reviewed Sep 15, 2025

View reviewed changes

tracer/src/Datadog.Trace/OTelMetrics/MetricReaderHandler.cs Outdated Show resolved Hide resolved

zacharycmontoya reviewed Sep 15, 2025

View reviewed changes

tracer/src/Datadog.Trace/OTelMetrics/MetricPoint.cs Outdated Show resolved Hide resolved