- 
                Notifications
    
You must be signed in to change notification settings  - Fork 150
 
[OTEL] Metrics API Support - Collection Implementation #7511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ea31ebc    to
    8ee89bd      
    Compare
  
    
          Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing the following branches/commits: Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times: 
 Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (72ms)  : 71, 73
     .   : milestone, 72,
    master - mean (72ms)  : 71, 73
     .   : milestone, 72,
    section Baseline
    This PR (7511) - mean (69ms)  : 65, 72
     .   : milestone, 69,
    master - mean (68ms)  : 66, 71
     .   : milestone, 68,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,044ms)  : 1002, 1086
     .   : milestone, 1044,
    master - mean (1,052ms)  : 990, 1115
     .   : milestone, 1052,
    gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (106ms)  : 105, 107
     .   : milestone, 106,
    master - mean (106ms)  : 105, 107
     .   : milestone, 106,
    section Baseline
    This PR (7511) - mean (105ms)  : 103, 108
     .   : milestone, 105,
    master - mean (106ms)  : 102, 109
     .   : milestone, 106,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (740ms)  : 721, 758
     .   : milestone, 740,
    master - mean (743ms)  : 714, 771
     .   : milestone, 743,
    gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (101ms)  : 99, 102
     .   : milestone, 101,
    master - mean (101ms)  : 100, 102
     .   : milestone, 101,
    section Baseline
    This PR (7511) - mean (100ms)  : 97, 102
     .   : milestone, 100,
    master - mean (100ms)  : 98, 102
     .   : milestone, 100,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (770ms)  : 724, 815
     .   : milestone, 770,
    master - mean (779ms)  : 730, 828
     .   : milestone, 779,
    gantt
    title Execution time (ms) FakeDbCommand (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (93ms)  : 92, 93
     .   : milestone, 93,
    master - mean (93ms)  : 92, 94
     .   : milestone, 93,
    section Baseline
    This PR (7511) - mean (92ms)  : 90, 95
     .   : milestone, 92,
    master - mean (92ms)  : 90, 94
     .   : milestone, 92,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (655ms)  : 639, 672
     .   : milestone, 655,
    master - mean (661ms)  : 645, 678
     .   : milestone, 661,
    gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (208ms)  : 198, 218
     .   : milestone, 208,
    master - mean (208ms)  : 199, 216
     .   : milestone, 208,
    section Baseline
    This PR (7511) - mean (204ms)  : 194, 214
     .   : milestone, 204,
    master - mean (204ms)  : 195, 212
     .   : milestone, 204,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,218ms)  : 1141, 1295
     .   : milestone, 1218,
    master - mean (1,216ms)  : 1157, 1275
     .   : milestone, 1216,
    gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (295ms)  : 283, 307
     .   : milestone, 295,
    master - mean (300ms)  : 277, 323
     .   : milestone, 300,
    section Baseline
    This PR (7511) - mean (294ms)  : 282, 305
     .   : milestone, 294,
    master - mean (297ms)  : 280, 315
     .   : milestone, 297,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (971ms)  : 933, 1010
     .   : milestone, 971,
    master - mean (993ms)  : 953, 1034
     .   : milestone, 993,
    gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (297ms)  : 287, 307
     .   : milestone, 297,
    master - mean (303ms)  : 281, 324
     .   : milestone, 303,
    section Baseline
    This PR (7511) - mean (296ms)  : 285, 306
     .   : milestone, 296,
    master - mean (300ms)  : 282, 318
     .   : milestone, 300,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (1,030ms)  : 978, 1082
     .   : milestone, 1030,
    master - mean (1,046ms)  : 989, 1103
     .   : milestone, 1046,
    gantt
    title Execution time (ms) HttpMessageHandler (.NET 8) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Bailout
    This PR (7511) - mean (286ms)  : 274, 298
     .   : milestone, 286,
    master - mean (287ms)  : 271, 302
     .   : milestone, 287,
    section Baseline
    This PR (7511) - mean (286ms)  : 273, 298
     .   : milestone, 286,
    master - mean (286ms)  : 268, 304
     .   : milestone, 286,
    section CallTarget+Inlining+NGEN
    This PR (7511) - mean (894ms)  : 836, 951
     .   : milestone, 894,
    master - mean (907ms)  : 834, 980
     .   : milestone, 907,
     | 
    
| return; | ||
| } | ||
| 
               | 
          ||
| var meterListener = new System.Diagnostics.Metrics.MeterListener(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting to directly reference the MeterListener since we can target the .NET 6.0 target framework. I think this decision is fine for this initial set of PR's so that we can develop this feature with first-class support for .NET 6.0+ where the API's are shipped in the runtime, but we'll want to revisit this later and do reflection/duck-typing so that applications on older runtimes that use the API's via the System.Diagnostics.DiagnosticSource NuGet package can also emit metrics.
        
          
                tracer/test/Datadog.Trace.ClrProfiler.IntegrationTests/OpenTelemetrySdkTests.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | 
               | 
          ||
| // Trigger async collection and get metrics for testing | ||
| OTelMetrics.MetricReader.CollectObservableInstruments(); | ||
| var capturedMetrics = OTelMetrics.MetricReaderHandler.GetCapturedMetricsForTesting(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: When you implement the Exporter work, I recommend replacing this GetCapturedMetricsForTesting method with an InMemoryExporter so you're continuing to test the Export processing. We probably shouldn't expose it without a good use case, but you can make it internal to Datadog.Trace
        
          
                tracer/test/Datadog.Trace.ClrProfiler.IntegrationTests/OpenTelemetrySdkTests.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                tracer/test/Datadog.Trace.ClrProfiler.IntegrationTests/OpenTelemetrySdkTests.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a first pass at this, and I left comments with the following high-level feedback:
- The TemporalityPreference to Instrument.Temporality needs to be updated for correctness. And then the system-tests I've started should identify any other correctness issues.
 - In general, we can use enums rather than strings since we have a well-defined set of Instruments
 - Performance: We can add state to the measurement event callbacks, so let's use that to improve the access to individual metric points when we need to update their values.
 
Note: I still have several configurations I need to test in system-tests so let's sync offline as you progress on that testing.
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new MetricState looks much improved over the original single dictionary! The main feedback after reviewing the latest changes is that we need to output separate metrics streams when individual measurements have separate tags, so you'll need to modify your implementation to account for that. I'm sorry I don't have tests ready to show you what that looks like, I'll turn my attention to that ASAP so I have a good reference for you to follow.
| var sorted = tags.ToArray(); | ||
| Array.Sort(sorted, (a, b) => string.CompareOrdinal(a.Key, b.Key)); | ||
| 
               | 
          ||
| var key = string.Join(";", sorted.Select(kv => $"{kv.Key}={kv.Value}")); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TagSet implementation and usage looks good. I was trying to think of a more efficient way to create a unique key from the ReadOnlySpan<KeyValuePair<string, object?>> tags and off the top of my head I can't think of one. Perhaps we could intern the resulting string because if we get the same tags over and over again, we could save on memory allocations, but then that string value will not be garbage-collected quickly, so it's not a clear win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps at a later time we could create a central cache that maps unique ReadOnlySpan<KeyValuePair<string, object?>> tags objects => TagSet objects by creating a trie data structure. But not worth spending more time on it without microbenchmarking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably a cached stringbuilder with a foreach over the tags would make sense here, e.g. something like this?
var sb = StringBuilderCache.Acquire();
foreach(var pair in sorted)
{
    sb.Append(kv.Key);
    sb.Append('=');
    sb.Append(kv.Value);
    sb.Append(';');
}
var key = StringBuilderCache.GetStringAndRelease(sb);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will address this in the next PR: #7514
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I've left a couple of small nits but overall this looks great and it's organized really well. Thanks for working through my detailed feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general, nice job!
Some overall comments:
- Lose the 
[MethodImpl(MethodImplOptions.AggressiveInlining)]. There's a lot of places where that is likely going to be harmful overall, and there's no real need for it unless benchmarking shows it's definitely going to be useful IMO. - The use of statics makes testing things more complicated. I would consider converting to instance objects and then creating a single static instance for the lifetime management.
 - We may need to be careful with the logging in "hot paths" (e.g. recording values). We will likely have to add some sort of rate-limiting/gating to avoid flooding the logs
 - There's a couple of places that might need "fixing" for correctness, or at least adding comments acknowledging/working around the issue e.g. overflow
 
| 
               | 
          ||
| lock (_histogramLock) | ||
| { | ||
| unchecked | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, unchecked is technically the default, but it doesn't hurt to add this in case someone has explicitly opted in to checked. However, is unchecked roll over what you actually want (where a big number suddenly becomes negative?!)? Or do you actually want to clamp to max in this case? Or should we reject the point entirely?
That's maybe a question for @zacharycmontoya tbh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the equivalent history in opentelemetry-dotnet, they currently implement a similar unchecked section:
unchecked
{
    this.runningValue.AsLong++;
    this.histogramBuckets.RunningSum += number;
    this.histogramBuckets.RunningBucketCounts[i]++;
}This unchecked was added in this PR so that they could avoid throwing an exception on a possible increment operation since they are hold a critical section lock.
In terms of correctness, I'm not aware of any mandates for handling integer overflows. I think it would be OK to accept this, and it would be consistent with opentelemetry-dotnet
| // For testing only - reset all captured metrics | ||
| internal static void ResetForTesting() | ||
| { | ||
| CapturedMetrics.Clear(); | ||
| MetricStreamNames.Clear(); | ||
| Interlocked.Exchange(ref _metricCount, 0); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is another sign for me that this whole type shouldn't be static. This is the sort of thing that's super easy to test and reason about if you can just do new MetricReaderHandler(). It also can make testing faster, as you can run all those tests in parallel if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove static methods in next PR: #7514
| var tagDict = new Dictionary<string, object?>(); | ||
| for (int i = 0; i < tags.Length; i++) | ||
| { | ||
| var kv = tags[i]; | ||
| tagDict[kv.Key] = kv.Value; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we do this? 🤔
var tagDict = new Dictionary<string, object?>(tags);
Also, we're creating this new dictionary every time you try to get a point, even if it already exists. I would suggest moving the construction into the lambda as a minimum, e.g.:
return _points.GetOrAdd(tagSet, _ => new MetricPoint(
                _identity.InstrumentName,
                _identity.MeterName,
                _identity.InstrumentType,
                MetricReaderHandler.GetTemporality(_identity.InstrumentType),
                new Dictionary<string, object?>(tags)));Also, this will still allocate a closure every time you call this method... you might want to consider a non-allocating TryGetValue first:
        private MetricPoint GetOrCreatePoint(ReadOnlySpan<KeyValuePair<string, object?>> tags)
        {
            var tagSet = TagSet.FromSpan(tags);
            if (_points.TryGetValue(tagset, out var value))
            {
                return value;
            }
            // fallback, have to allocate
            return _points.GetOrAdd(tagSet, _ => new MetricPoint(
                _identity.InstrumentName,
                _identity.MeterName,
                _identity.InstrumentType,
                MetricReaderHandler.GetTemporality(_identity.InstrumentType),
                tagDict));
        }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I think in a previous commit the dict was constructed in the lambda. Let's make sure to move the dictionary construction into the lambda so it's only calculated when creating a new metric point
## Summary of changes This PR implements the OTLP Metrics Exporter for OpenTelemetry Metrics API support, completing the final piece of the RFC implementation. Key additions include: - OTLP Metrics Exporter: Implemented `MetricExporter` abstract class and `OtlpExporter` concrete implementation initially with only ` HTTP/protobuf` transport support ATM. - Performance Optimizations: Custom protobuf serialization without external dependencies, async operations with ConfigureAwait(false) - Retry Logic: Exponential backoff with Retry-After header handling for HTTP 429, 502, 503, 504 responses - Telemetry Metrics: Internal metrics tracking (`metrics_export_attempts`, `metrics_export_successes`, `metrics_export_failures`) with protocol tag - Sample App Integration: Enhanced Samples.OpenTelemetrySdk to support both DD and OTel SDK approaches for comparative snapshot testing This completes the series of 3 PRs for OpenTelemetry Metrics API support. 1. [[OTEL] Metrics API Support - Configurations & Telemetry #7420](#7420) 2. [[OTEL] Metrics API Support - Collection Implementation #7511](#7511) 3. [[OTEL] Metrics API Support - Metrics Reader & OTLP Exporter #7514](#7514) <- You are here. ## Reason for change The [RFC] OpenTelemetry Metrics API support requires the .NET tracer to act as a drop-in replacement for the OpenTelemetry SDK. This PR implements the OTLP export functionality, enabling customers to migrate from OTel SDK to Datadog SDK with zero friction. Business Impact: Enables customers using OpenTelemetry Metrics API to switch to Datadog without code changes, supporting our strategic initiative to be OTel-native. ## Implementation details :white_check_mark: OTLP Metrics Exporter Components: - MetricExporter - Abstract base class defining export contract - OtlpExporter - Concrete implementation with HTTP/protobuf transport - OtlpMetricsSerializer - Custom protobuf serialization for performance - MetricReader - Updated to use the new exporter architecture :white_check_mark: Key Features: - HTTP/protobuf transport support - Resource attribute precedence following RFC (DD_* → OTEL_* → defaults) - Retry logic with exponential backoff and proper HTTP status handling - Telemetry metrics with protocol tags (protocol:http/protobuf, protocol:grpc, protocol:http/json) - Performance optimizations with async operations and custom serialization :white_check_mark: Sample App Enhancements: - Conditional OTel SDK exporter via OTEL_METRICS_EXPORTER_ENABLED environment variable for testing - Comparative testing between DD and OTel SDK approaches - Proper MeterProvider lifecycle management with disposal ## Test coverage Added comprehensive integration test `SubmitsOtlpMetrics` that: - Tests both approaches: DD enabled/OTel disabled vs DD disabled/OTel enabled - Verifies identical results: Both approaches produce the same OTLP metric requests - Cross-version testing: Runs across all OpenTelemetry package versions - Snapshot verification: Ensures RFC compliance and data model compatibility **NOTE**: Hoping to add Grpc and HTTP/Json tests along with the current `SubmitsOtlpMetrics` protobuf one and also benchmarks if possible to compare OTEL vs Datadog overhead. ## Other details <!-- Fixes #{issue} --> Jira: [APMAPI-1573](https://datadoghq.atlassian.net/browse/APMAPI-1573) <!--⚠️ Note: Where possible, please obtain 2 approvals prior to merging. Unless CODEOWNERS specifies otherwise, for external teams it is typically best to have one review from a team member, and one review from apm-dotnet. Trivial changes do not require 2 reviews. MergeQueue is NOT enabled in this repository. If you have write access to the repo, the PR has 1-2 approvals (see above), and all of the required checks have passed, you can use the Squash and Merge button to merge the PR. If you don't have write access, or you need help, reach out in the #apm-dotnet channel in Slack. --> [APMAPI-1573]: https://datadoghq.atlassian.net/browse/APMAPI-1573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Summary of changes
This PR implements the metrics collection foundation for OpenTelemetry Metrics API supportKey additions include:
This is the second of a series of 3 PRs to complete the OpenTelemetry Metrics API support.
Reason for change
The [RFC] OpenTelemetry Metrics API support requires implementing metrics collection that's compatible with the OpenTelemetry data model. This PR provides the collection foundation that will enable customers to use .NET's
System.Diagnostics.MetricsAPI with Datadog as a drop-in replacement for the OpenTelemetry SDK.Business Impact: Enables the collection phase of OpenTelemetry Metrics API support, laying the groundwork for OTLP export in the next PR.
Implementation details
✅ Metrics Collection Components:
✅ Aggregation Mapping:
Test coverage
Added unit test
MeterListenerCapturesAllMetrics()that verifies:Histogram bucket logic verification
Other details
Jira: APMAPI-1572