Commit f1fe602

Authored by codeboten, djaglowski, and mx-psi
Document collector's internal telemetry (#10695)
This document provides guidelines for component authors deciding what the telemetry of their components should look like, in order to give end users a consistent experience.

---------

Signed-off-by: Alex Boten <[email protected]>
Co-authored-by: Daniel Jaglowski <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
1 parent c0a846e commit f1fe602

docs/observability.md

Lines changed: 97 additions & 39 deletions
@@ -12,6 +12,101 @@ If you need to troubleshoot the Collector, see [Troubleshooting].
 Read on to learn about experimental features and the project's overall vision
 for internal telemetry.
 
+<!-- toc -->
+
+- [Goals of internal telemetry](#goals-of-internal-telemetry)
+  * [Observable elements](#observable-elements)
+  * [Impact](#impact)
+  * [Configurable level of observability](#configurable-level-of-observability)
+  * [Internal telemetry properties](#internal-telemetry-properties)
+    + [Units](#units)
+    + [Process for defining new metrics](#process-for-defining-new-metrics)
+- [Experimental trace telemetry](#experimental-trace-telemetry)
+
+<!-- tocstop -->
+
+## Goals of internal telemetry
+
+The Collector's internal telemetry is an important part of fulfilling
+OpenTelemetry's [project vision](vision.md). The following section explains the
+priorities for making the Collector an observable service.
+
+### Observable elements
+
+The following aspects of the Collector need to be observable.
+
+- [Current values]
+  - Some of the current values and rates might be calculated as derivatives of
+    cumulative values in the backend, so it's an open question whether to expose
+    them separately or not.
+- [Cumulative values]
+- [Trace or log events]
+  - For start or stop events, an appropriate hysteresis must be defined to avoid
+    generating too many events. Note that start and stop events can't be
+    detected in the backend simply as derivatives of current rates. The events
+    include additional data that is not present in the current value.
+- [Host metrics]
+  - Host metrics can help users determine if the observed problem in a service
+    is caused by a different process on the same host.
+
+### Impact
+
+The impact of these observability improvements on the core performance of the
+Collector must be assessed.
+
+### Configurable level of observability
+
+Some metrics and traces can be high volume and users might not always want to
+observe them. An observability verbosity “level” allows configuration of the
+Collector to send more or less observability data or with even finer
+granularity, to allow turning on or off specific metrics.
+
+The default level of observability must be defined in a way that has
+insignificant performance impact on the service.
+
+### Internal telemetry properties
+
+Telemetry produced by the Collector has the following properties:
+
+- metrics produced by Collector components use the prefix `otelcol_`
+- metrics produced by any instrumentation library used by Collector components will *not* be prefixed with `otelcol_`
+- code is instrumented using the OpenTelemetry API for metrics and traces. Logs are instrumented using zap. Telemetry is collected and produced via the OpenTelemetry Go SDK
+- instrumentation scope defaults to the package name of the component recording telemetry. It can be configured
+  via the `scope_name` option in mdatagen, but the recommendation is to keep the default
+- metrics are defined via `metadata.yaml` except in components that have specific cases where
+  it is not possible to do so. See the [issue](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33523)
+  which lists such components
+- whenever possible, components should leverage core components or helper libraries to capture
+  telemetry, ensuring that all components of the Collector can be consistently observed
+- telemetry produced by components should include attributes that identify specific instances
+  of the components
+
+#### Units
+
+The following units should be used for metrics emitted by the Collector
+for the purpose of its internal telemetry:
+
+| Field type                                                                 | Unit           |
+| -------------------------------------------------------------------------- | -------------- |
+| Metric counting the number of log records received, processed, or exported | `{records}`    |
+| Metric counting the number of spans received, processed, or exported       | `{spans}`      |
+| Metric counting the number of data points received, processed, or exported | `{datapoints}` |
+
+#### Process for defining new metrics
+
+Metrics in the Collector are defined via `metadata.yaml`, which is used by [mdatagen] to
+produce:
+
+- code to create metric instruments that can be used by components
+- documentation for internal metrics
+- a consistent prefix for all internal metrics
+- convenience accessors for meter and tracer
+- a consistent instrumentation scope for components
+- test methods for validating the telemetry
+
+The process to generate new metrics is to configure them via
+`metadata.yaml`, and run `go generate` on the component.
+
 ## Experimental trace telemetry
 
 The Collector does not expose traces by default, but an effort is underway to
@@ -73,45 +168,6 @@ service:
 endpoint: ${MY_POD_IP}:4317
 ```
 
-## Goals of internal telemetry
-
-The Collector's internal telemetry is an important part of fulfilling
-OpenTelemetry's [project vision](vision.md). The following section explains the
-priorities for making the Collector an observable service.
-
-### Observable elements
-
-The following aspects of the Collector need to be observable.
-
-- [Current values]
-  - Some of the current values and rates might be calculated as derivatives of
-    cumulative values in the backend, so it's an open question whether to expose
-    them separately or not.
-- [Cumulative values]
-- [Trace or log events]
-  - For start or stop events, an appropriate hysteresis must be defined to avoid
-    generating too many events. Note that start and stop events can't be
-    detected in the backend simply as derivatives of current rates. The events
-    include additional data that is not present in the current value.
-- [Host metrics]
-  - Host metrics can help users determine if the observed problem in a service
-    is caused by a different process on the same host.
-
-### Impact
-
-The impact of these observability improvements on the core performance of the
-Collector must be assessed.
-
-### Configurable level of observability
-
-Some metrics and traces can be high volume and users might not always want to
-observe them. An observability verboseness “level” allows configuration of the
-Collector to send more or less observability data or with even finer
-granularity, to allow turning on or off specific metrics.
-
-The default level of observability must be defined in a way that has
-insignificant performance impact on the service.
-
 [Internal telemetry]:
 https://opentelemetry.io/docs/collector/internal-telemetry/
 [Troubleshooting]: https://opentelemetry.io/docs/collector/troubleshooting/
@@ -132,3 +188,5 @@ insignificant performance impact on the service.
 https://opentelemetry.io/docs/collector/internal-telemetry/#events-observable-with-internal-logs
 [Host metrics]:
 https://opentelemetry.io/docs/collector/internal-telemetry/#lists-of-internal-metrics
+[mdatagen]:
+https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/mdatagen
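
The naming and unit conventions added to `docs/observability.md` in this commit (the `otelcol_` prefix and the `{records}`, `{spans}`, `{datapoints}` units) could be checked with a small helper along the following lines. This is only an illustrative sketch using the Go standard library: the `Signal` type, `UnitFor`, `HasCollectorPrefix`, and the sample metric name are hypothetical names invented for the example. They are not part of mdatagen or of the Collector's API, and in practice the documented route is to declare metrics in `metadata.yaml` and let mdatagen generate the code.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is the prefix the documentation prescribes for metrics
// produced by Collector components.
const Prefix = "otelcol_"

// Signal is a hypothetical enum for the kind of item a metric counts.
type Signal int

const (
	SignalLogs Signal = iota
	SignalTraces
	SignalMetrics
)

// UnitFor returns the unit listed in the "Units" table for a signal.
// It returns an error for values outside the three known signals.
func UnitFor(s Signal) (string, error) {
	switch s {
	case SignalLogs:
		return "{records}", nil
	case SignalTraces:
		return "{spans}", nil
	case SignalMetrics:
		return "{datapoints}", nil
	default:
		return "", fmt.Errorf("unknown signal %d", s)
	}
}

// HasCollectorPrefix reports whether name starts with the component
// prefix and has at least one character after it.
func HasCollectorPrefix(name string) bool {
	return strings.HasPrefix(name, Prefix) && len(name) > len(Prefix)
}

func main() {
	// The metric name below is an assumption chosen for illustration.
	name := "otelcol_example_processed_spans"
	unit, err := UnitFor(SignalTraces)
	if err != nil {
		panic(err)
	}
	fmt.Println(name, HasCollectorPrefix(name), unit)
}
```

A check like this would only apply to metrics produced by Collector components. Per the properties list, metrics coming from instrumentation libraries are expected not to carry the prefix.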

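
The "Observable elements" text in this diff asks for an appropriate hysteresis on start and stop events so that a value hovering near a threshold does not generate a flood of events. The document does not say how this should be implemented. One common approach, shown here purely as an assumption, is to use two thresholds: an event starts when the value reaches the high mark and stops only once it falls to the low mark. The `Detector` type, its fields, and the sample values are hypothetical.

```go
package main

import "fmt"

// Detector is a hypothetical two-threshold hysteresis detector.
// It reports "start" when an observed value reaches High while idle,
// and "stop" when the value drops to Low while active.
type Detector struct {
	Low, High float64
	active    bool
}

// Observe feeds one sample and returns "start", "stop", or "" when
// the sample does not change the detector's state.
func (d *Detector) Observe(v float64) string {
	switch {
	case !d.active && v >= d.High:
		d.active = true
		return "start"
	case d.active && v <= d.Low:
		d.active = false
		return "stop"
	default:
		return ""
	}
}

func main() {
	// Sample values are invented; think of them as a utilization ratio.
	d := &Detector{Low: 0.2, High: 0.8}
	for _, v := range []float64{0.5, 0.85, 0.75, 0.82, 0.1, 0.3} {
		if ev := d.Observe(v); ev != "" {
			fmt.Printf("%s at %.2f\n", ev, v)
		}
	}
}
```

With these samples the detector emits one start and one stop. The dip to 0.75 and the return to 0.82 produce no events because the value never falls to the low threshold in between. A single threshold at 0.8 would have produced a start, a stop, and a second start over the same samples.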