Skip to content

Conversation

@link04
Copy link
Contributor

@link04 link04 commented Sep 13, 2025

Summary of changes

This PR implements the metrics collection foundation for OpenTelemetry Metrics API supportKey additions include:

  • MetricReader: Component for collecting instrument measurements and aggregating them
  • MetricReaderHandler: High-performance measurement processing with proper aggregation (Sum, Last Value, Histogram)
  • MetricPoint: Thread-safe data structure for metric aggregation with histogram bucket support
  • Testing: Unit test covering all 7 instrument types (4 sync + 3 async) with exact OTLP value verification

This is the second of a series of 3 PRs to complete the OpenTelemetry Metrics API support.

  1. [OTEL] Metrics API Support - Configurations & Telemetry #7420
  2. [OTEL] Metrics API Support - Collection Implementation #7511 <- You are here.
  3. [OTEL] Metrics API Support - Metrics Reader & OTLP Exporter #7514

Reason for change

The [RFC] OpenTelemetry Metrics API support requires implementing metrics collection that's compatible with the OpenTelemetry data model. This PR provides the collection foundation that will enable customers to use .NET's System.Diagnostics.Metrics API with Datadog as a drop-in replacement for the OpenTelemetry SDK.

Business Impact: Enables the collection phase of OpenTelemetry Metrics API support, laying the groundwork for OTLP export in the next PR.

Implementation details

✅ Metrics Collection Components:

  • MetricReader - RFC-compliant component for initialization and async collection
  • MetricReaderHandler - Measurement processing and aggregation
  • MetricPoint - Aggregation data structure with histogram bucket support

✅ Aggregation Mapping:

  • Counter/Asynchronous Counter → Sum Aggregation → Sum Metric Point ✅
  • UpDownCounter/Asynchronous UpDownCounter → Sum Aggregation → Sum Metric Point ✅
  • Gauge/Asynchronous Gauge → Last Value Aggregation → Gauge Metric Point ✅
  • Histogram → Explicit Bucket Histogram Aggregation → Histogram Metric Point ✅

Test coverage

Added unit test MeterListenerCapturesAllMetrics() that verifies:

  • All 7 instrument types (Counter, Histogram, Gauge, UpDownCounter + 3 Async variants)
  • Value verification matching OTLP snapshot expectations
    Histogram bucket logic verification
  • Sync + Async instrument collection
  • Tag capture and temporality settings

Other details

Jira: APMAPI-1572

@link04 link04 added the docker_image_artifacts Use to label PRs for which you would need a Docker Image created for. label Sep 13, 2025
@link04 link04 force-pushed the maximo/otel-metrics-api-collection branch from ea31ebc to 8ee89bd Compare September 13, 2025 17:37
@link04 link04 removed the docker_image_artifacts Use to label PRs for which you would need a Docker Image created for. label Sep 13, 2025
@dd-trace-dotnet-ci-bot
Copy link

dd-trace-dotnet-ci-bot bot commented Sep 13, 2025

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (72ms)  : 71, 73
     .   : milestone, 72,
    master - mean (72ms)  : 71, 73
     .   : milestone, 72,

    section Baseline
    This PR (7511) - mean (69ms)  : 65, 72
     .   : milestone, 69,
    master - mean (68ms)  : 66, 71
     .   : milestone, 68,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,044ms)  : 1002, 1086
     .   : milestone, 1044,
    master - mean (1,052ms)  : 990, 1115
     .   : milestone, 1052,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (106ms)  : 105, 107
     .   : milestone, 106,
    master - mean (106ms)  : 105, 107
     .   : milestone, 106,

    section Baseline
    This PR (7511) - mean (105ms)  : 103, 108
     .   : milestone, 105,
    master - mean (106ms)  : 102, 109
     .   : milestone, 106,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (740ms)  : 721, 758
     .   : milestone, 740,
    master - mean (743ms)  : 714, 771
     .   : milestone, 743,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (101ms)  : 99, 102
     .   : milestone, 101,
    master - mean (101ms)  : 100, 102
     .   : milestone, 101,

    section Baseline
    This PR (7511) - mean (100ms)  : 97, 102
     .   : milestone, 100,
    master - mean (100ms)  : 98, 102
     .   : milestone, 100,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (770ms)  : 724, 815
     .   : milestone, 770,
    master - mean (779ms)  : 730, 828
     .   : milestone, 779,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (93ms)  : 92, 93
     .   : milestone, 93,
    master - mean (93ms)  : 92, 94
     .   : milestone, 93,

    section Baseline
    This PR (7511) - mean (92ms)  : 90, 95
     .   : milestone, 92,
    master - mean (92ms)  : 90, 94
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (655ms)  : 639, 672
     .   : milestone, 655,
    master - mean (661ms)  : 645, 678
     .   : milestone, 661,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (208ms)  : 198, 218
     .   : milestone, 208,
    master - mean (208ms)  : 199, 216
     .   : milestone, 208,

    section Baseline
    This PR (7511) - mean (204ms)  : 194, 214
     .   : milestone, 204,
    master - mean (204ms)  : 195, 212
     .   : milestone, 204,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,218ms)  : 1141, 1295
     .   : milestone, 1218,
    master - mean (1,216ms)  : 1157, 1275
     .   : milestone, 1216,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (295ms)  : 283, 307
     .   : milestone, 295,
    master - mean (300ms)  : 277, 323
     .   : milestone, 300,

    section Baseline
    This PR (7511) - mean (294ms)  : 282, 305
     .   : milestone, 294,
    master - mean (297ms)  : 280, 315
     .   : milestone, 297,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (971ms)  : 933, 1010
     .   : milestone, 971,
    master - mean (993ms)  : 953, 1034
     .   : milestone, 993,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (297ms)  : 287, 307
     .   : milestone, 297,
    master - mean (303ms)  : 281, 324
     .   : milestone, 303,

    section Baseline
    This PR (7511) - mean (296ms)  : 285, 306
     .   : milestone, 296,
    master - mean (300ms)  : 282, 318
     .   : milestone, 300,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,030ms)  : 978, 1082
     .   : milestone, 1030,
    master - mean (1,046ms)  : 989, 1103
     .   : milestone, 1046,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (286ms)  : 274, 298
     .   : milestone, 286,
    master - mean (287ms)  : 271, 302
     .   : milestone, 287,

    section Baseline
    This PR (7511) - mean (286ms)  : 273, 298
     .   : milestone, 286,
    master - mean (286ms)  : 268, 304
     .   : milestone, 286,

    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (894ms)  : 836, 951
     .   : milestone, 894,
    master - mean (907ms)  : 834, 980
     .   : milestone, 907,

Loading

@link04 link04 added the area:opentelemetry OpenTelemetry support label Sep 13, 2025
@link04 link04 marked this pull request as ready for review September 15, 2025 20:03
@link04 link04 requested review from a team as code owners September 15, 2025 20:03
return;
}

var meterListener = new System.Diagnostics.Metrics.MeterListener();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting to directly reference the MeterListener since we can target the .NET 6.0 target framework. I think this decision is fine for this initial set of PR's so that we can develop this feature with first-class support for .NET 6.0+ where the API's are shipped in the runtime, but we'll want to revisit this later and do reflection/duck-typing so that applications on older runtimes that use the API's via the System.Diagnostics.DiagnosticSource NuGet package can also emit metrics.


// Trigger async collection and get metrics for testing
OTelMetrics.MetricReader.CollectObservableInstruments();
var capturedMetrics = OTelMetrics.MetricReaderHandler.GetCapturedMetricsForTesting();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: When you implement the Exporter work, I recommend replacing this GetCapturedMetricsForTesting method with an InMemoryExporter so you're continuing to test the Export processing. We probably shouldn't expose it without a good use case, but you can make it internal to Datadog.Trace

Copy link
Contributor

@zacharycmontoya zacharycmontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a first pass at this, and I left comments with the following high-level feedback:

  • The TemporalityPreference to Instrument.Temporality needs to be updated for correctness. And then the system-tests I've started should identify any other correctness issues.
  • In general, we can use enums rather than strings since we have a well-defined set of Instruments
  • Performance: We can add state to the measurement event callbacks, so let's use that to improve the access to individual metric points when we need to update their values.

Note: I still have several configurations I need to test in system-tests so let's sync offline as you progress on that testing.

@datadog-official

This comment has been minimized.

Copy link
Contributor

@zacharycmontoya zacharycmontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new MetricState looks much improved over the original single dictionary! The main feedback after reviewing the latest changes is that we need to output separate metrics streams when individual measurements have separate tags, so you'll need to modify your implementation to account for that. I'm sorry I don't have tests ready to show you what that looks like, I'll turn my attention to that ASAP so I have a good reference for you to follow.

var sorted = tags.ToArray();
Array.Sort(sorted, (a, b) => string.CompareOrdinal(a.Key, b.Key));

var key = string.Join(";", sorted.Select(kv => $"{kv.Key}={kv.Value}"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TagSet implementation and usage looks good. I was trying to think of a more efficient way to create a unique key from the ReadOnlySpan<KeyValuePair<string, object?>> tags and off the top of my head I can't think of one. Perhaps we could intern the resulting string because if we get the same tags over and over again, we could save on memory allocations, but then that string value will not be garbage-collected quickly, so it's not a clear win.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps at a later time we could create a central cache that maps unique ReadOnlySpan<KeyValuePair<string, object?>> tags objects => TagSet objects by creating a trie data structure. But not worth spending more time on it without microbenchmarking

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably a cached stringbuilder with a foreach over the tags would make sense here, e.g. something like this?

var sb = StringBuilderCache.Acquire();

foreach(var pair in sorted)
{
    sb.Append(kv.Key);
    sb.Append('=');
    sb.Append(kv.Value);
    sb.Append(';');
}

var key = StringBuilderCache.GetStringAndRelease(sb);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address this in the next PR: #7514

Copy link
Contributor

@zacharycmontoya zacharycmontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I've left a couple of small nits but overall this looks great and it's organized really well. Thanks for working through my detailed feedback

Copy link
Member

@andrewlock andrewlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, nice job!

Some overall comments:

  • Lose the [MethodImpl(MethodImplOptions.AggressiveInlining)]. There's a lot of places where that is likely going to be harmful overall, and there's no real need for it unless benchmarking shows it's definitely going to be useful IMO.
  • The use of statics makes testing things more complicated. I would consider converting to instance objects and then creating a single static instance for the lifetime management.
  • We may need to be careful with the logging in "hot paths" (e.g. recording values). We will likely have to add some sort of rate-limiting/gating to avoid flooding the logs
  • There's a couple of places that might need "fixing" for correctness, or at least adding comments acknowledging/working around the issue e.g. overflow


lock (_histogramLock)
{
unchecked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, unchecked is technically the default, but it doesn't hurt to add this in case someone has explicitly opted in to checked. However, is unchecked roll over what you actually want (where a big number suddenly becomes negative?!)? Or do you actually want to clamp to max in this case? Or should we reject the point entirely?

That's maybe a question for @zacharycmontoya tbh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the equivalent history in opentelemetry-dotnet, they currently implement a similar unchecked section:

unchecked
{
    this.runningValue.AsLong++;
    this.histogramBuckets.RunningSum += number;
    this.histogramBuckets.RunningBucketCounts[i]++;
}

This unchecked was added in this PR so that they could avoid throwing an exception on a possible increment operation since they are hold a critical section lock.

In terms of correctness, I'm not aware of any mandates for handling integer overflows. I think it would be OK to accept this, and it would be consistent with opentelemetry-dotnet

Comment on lines 151 to 157
// For testing only - reset all captured metrics
internal static void ResetForTesting()
{
CapturedMetrics.Clear();
MetricStreamNames.Clear();
Interlocked.Exchange(ref _metricCount, 0);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is another sign for me that this whole type shouldn't be static. This is the sort of thing that's super easy to test and reason about if you can just do new MetricReaderHandler(). It also can make testing faster, as you can run all those tests in parallel if desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove static methods in next PR: #7514

Comment on lines 119 to 124
var tagDict = new Dictionary<string, object?>();
for (int i = 0; i < tags.Length; i++)
{
var kv = tags[i];
tagDict[kv.Key] = kv.Value;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we do this? 🤔

var tagDict = new Dictionary<string, object?>(tags);

Also, we're creating this new dictionary every time you try to get a point, even if it already exists. I would suggest moving the construction into the lambda as a minimum, e.g.:

return _points.GetOrAdd(tagSet, _ => new MetricPoint(
                _identity.InstrumentName,
                _identity.MeterName,
                _identity.InstrumentType,
                MetricReaderHandler.GetTemporality(_identity.InstrumentType),
                new Dictionary<string, object?>(tags)));

Also, this will still allocate a closure every time you call this method... you might want to consider a non-allocating TryGetValue first:

        private MetricPoint GetOrCreatePoint(ReadOnlySpan<KeyValuePair<string, object?>> tags)
        {
            var tagSet = TagSet.FromSpan(tags);
            if (_points.TryGetValue(tagset, out var value))
            {
                return value;
            }

            // fallback, have to allocate
            return _points.GetOrAdd(tagSet, _ => new MetricPoint(
                _identity.InstrumentName,
                _identity.MeterName,
                _identity.InstrumentType,
                MetricReaderHandler.GetTemporality(_identity.InstrumentType),
                tagDict));
        }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think in a previous commit the dict was constructed in the lambda. Let's make sure to move the dictionary construction into the lambda so it's only calculated when creating a new metric point

@link04 link04 merged commit bad5402 into master Sep 26, 2025
156 checks passed
@link04 link04 deleted the maximo/otel-metrics-api-collection branch September 26, 2025 17:56
@github-actions github-actions bot added this to the vNext-v3 milestone Sep 26, 2025
link04 added a commit that referenced this pull request Oct 7, 2025
## Summary of changes
This PR implements the OTLP Metrics Exporter for OpenTelemetry Metrics
API support, completing the final piece of the RFC implementation. Key
additions include:
- OTLP Metrics Exporter: Implemented `MetricExporter` abstract class and
`OtlpExporter` concrete implementation initially with only `
HTTP/protobuf` transport support ATM.
- Performance Optimizations: Custom protobuf serialization without
external dependencies, async operations with ConfigureAwait(false)
- Retry Logic: Exponential backoff with Retry-After header handling for
HTTP 429, 502, 503, 504 responses
- Telemetry Metrics: Internal metrics tracking
(`metrics_export_attempts`, `metrics_export_successes`,
`metrics_export_failures`) with protocol tag
- Sample App Integration: Enhanced Samples.OpenTelemetrySdk to support
both DD and OTel SDK approaches for comparative snapshot testing

This completes the series of 3 PRs for OpenTelemetry Metrics API
support.
1. [[OTEL] Metrics API Support - Configurations & Telemetry
#7420](#7420)
2. [[OTEL] Metrics API Support - Collection Implementation
#7511](#7511)
3. [[OTEL] Metrics API Support - Metrics Reader & OTLP Exporter
#7514](#7514) <- You are
here.

## Reason for change
The [RFC] OpenTelemetry Metrics API support requires the .NET tracer to
act as a drop-in replacement for the OpenTelemetry SDK. This PR
implements the OTLP export functionality, enabling customers to migrate
from OTel SDK to Datadog SDK with zero friction.
Business Impact: Enables customers using OpenTelemetry Metrics API to
switch to Datadog without code changes, supporting our strategic
initiative to be OTel-native.

## Implementation details
:white_check_mark: OTLP Metrics Exporter Components:
- MetricExporter - Abstract base class defining export contract
- OtlpExporter - Concrete implementation with HTTP/protobuf transport
- OtlpMetricsSerializer - Custom protobuf serialization for performance
- MetricReader - Updated to use the new exporter architecture

:white_check_mark: Key Features:
- HTTP/protobuf transport support
- Resource attribute precedence following RFC (DD_* → OTEL_* → defaults)
- Retry logic with exponential backoff and proper HTTP status handling
- Telemetry metrics with protocol tags (protocol:http/protobuf,
protocol:grpc, protocol:http/json)
- Performance optimizations with async operations and custom
serialization

:white_check_mark: Sample App Enhancements:
- Conditional OTel SDK exporter via OTEL_METRICS_EXPORTER_ENABLED
environment variable for testing
- Comparative testing between DD and OTel SDK approaches
- Proper MeterProvider lifecycle management with disposal

## Test coverage
Added comprehensive integration test `SubmitsOtlpMetrics` that:
- Tests both approaches: DD enabled/OTel disabled vs DD disabled/OTel
enabled
- Verifies identical results: Both approaches produce the same OTLP
metric requests
- Cross-version testing: Runs across all OpenTelemetry package versions
- Snapshot verification: Ensures RFC compliance and data model
compatibility

**NOTE**: Hoping to add Grpc and HTTP/Json tests along with the current
`SubmitsOtlpMetrics` protobuf one and also benchmarks if possible to
compare OTEL vs Datadog overhead.

## Other details
<!-- Fixes #{issue} -->
Jira: [APMAPI-1573](https://datadoghq.atlassian.net/browse/APMAPI-1573) 

<!--  ⚠️ Note:
Where possible, please obtain 2 approvals prior to merging. Unless
CODEOWNERS specifies otherwise, for external teams it is typically best
to have one review from a team member, and one review from apm-dotnet.
Trivial changes do not require 2 reviews.

MergeQueue is NOT enabled in this repository. If you have write access
to the repo, the PR has 1-2 approvals (see above), and all of the
required checks have passed, you can use the Squash and Merge button to
merge the PR. If you don't have write access, or you need help, reach
out in the #apm-dotnet channel in Slack.
-->


[APMAPI-1573]:
https://datadoghq.atlassian.net/browse/APMAPI-1573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:opentelemetry OpenTelemetry support

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants