Description
Component(s)
processor/tailsampling
Is your feature request related to a problem? Please describe.
The composite policy lets the user allocate a share of the total spans per second to each subpolicy. For example, one might configure "error" traces to get 90% and "normal" traces to get only 10% of the total allocated spans per second. However, as traffic increases, it's impossible to know what the effective sampling rate is (e.g., how many error traces are being sampled, how many normal traces are being sampled). This information is important for understanding the characteristics of traffic in a distributed system and for detecting when the number of errors (or other conditions) is rising.
Describe the solution you'd like
There are two places I'd like to see this information show up.
- In metrics, there should be either an additional dimension on "count_traces_sampled" called "subpolicy", or a new counter called "composite_count_traces_sampled".
- This information is also important when inspecting the trace. I'd like to have two attributes added to the root span (or first span, if root is missing) of a trace.
  a) "trace.sampling_policy" - the name of the subpolicy which triggered the sampling decision
  b) "trace.sampling_rate" - the effective sampling rate averaged over a time window (sampled_traces_for_subpolicy/total_traces_for_subpolicy)
The first will allow for monitoring of trends. The second is important when inspecting a trace to understand how often that particular type of trace shows up relative to total traffic.
Describe alternatives you've considered
There are potentially other places to record the sampling policy and sampling rate for a trace, but it seems the root span is the best option.
If traces are dropped for other reasons, such as memory constraints, both the metrics and the "trace.sampling_rate" attribute on the trace would become less accurate.
Additional context
No response