
Conversation

xiu
Contributor

@xiu xiu commented Apr 25, 2025

Description

With this commit, we add support for 128-bit TraceIDs coming from Datadog-instrumented services. This can happen when an OTel-instrumented service calls a downstream Datadog-instrumented one. Datadog instrumentation libraries store the 128-bit TraceID in two different fields:

  • TraceID: lower 64 bits of the 128-bit TraceID
  • _dd.p.tid: upper 64 bits of the 128-bit TraceID

This commit adds logic that reconstructs the 128-bit TraceID. Before this commit, only the lower 64 bits were used as the TraceID.
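
For illustration, a minimal sketch of how the two halves can be recombined into a 16-byte trace ID. The function name and exact approach are assumptions for illustration, not the receiver's actual code; the example values are taken from the discussion further down.

package main

import (
	"encoding/binary"
	"fmt"
	"strconv"
)

// reconstructTraceID (hypothetical name) combines the upper 64 bits carried in
// the _dd.p.tid tag (a hex string) with the lower 64 bits carried in the span's
// numeric TraceID field into a single 128-bit (16-byte) trace ID.
func reconstructTraceID(ddPTid string, lower uint64) ([16]byte, error) {
	var traceID [16]byte
	upper, err := strconv.ParseUint(ddPTid, 16, 64)
	if err != nil {
		return traceID, fmt.Errorf("invalid _dd.p.tid %q: %w", ddPTid, err)
	}
	binary.BigEndian.PutUint64(traceID[:8], upper) // upper 64 bits
	binary.BigEndian.PutUint64(traceID[8:], lower) // lower 64 bits
	return traceID, nil
}

func main() {
	id, _ := reconstructTraceID("681761ae00000000", 531758935570545189)
	fmt.Printf("%032x\n", id)
}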

Link to tracking issue

Fixes #36926

Testing

Tested the setup with the following chain: OTel Instrumented Service -> Datadog Instrumented Service -> OTel Instrumented Service. The TraceID was maintained across the whole chain.

Also added a unit test (TestToTraces64to128bits).

Documentation

Updated README.md in 58129d5

@xiu xiu requested review from MovieStoreGuy and a team as code owners April 25, 2025 10:03
@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch 5 times, most recently from d4a5fca to 8012d98 Compare April 25, 2025 12:00
Contributor

@dehaansa dehaansa left a comment


Thanks for taking the time to dig in to this issue!

This seems like a very useful fix, but would probably constitute a breaking change for existing users of the component. As such I think we'll need to make it opt-in using a featuregate.

You can check other components for examples of using featuregates; they're relatively easy to set up. If you register one at StageAlpha, it will be disabled by default; in later versions we will move it to Beta (enabled by default), and the gate can later be removed. The gate then needs to be configured at runtime using a flag like --feature-gates=receiver.datadogreceiver.Enable128BitTraceID

Here's an example of what defining the gate looks like in code, copied and slightly modified from another component:

var FullTraceIDFeatureGate = featuregate.GlobalRegistry().MustRegister(
	"receiver.datadogreceiver.Enable128BitTraceID",
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("<description."),
	featuregate.WithRegisterFromVersion("v0.125.0"),
)

...

if FullTraceIDFeatureGate.Enabled() {
...

@xiu
Contributor Author

xiu commented Apr 26, 2025


@dehaansa Thanks for reviewing!

I've added a featuregate in af83cf2; let me know if anything else is needed.

@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch from 58129d5 to 014c6e0 Compare April 29, 2025 09:01
Contributor

@dehaansa dehaansa left a comment


LGTM, will try to get codeowner review before marking ready for merge.

@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch from 71347ee to fa7e5d3 Compare May 1, 2025 19:29
@cyrille-leclerc
Member

cyrille-leclerc commented May 3, 2025

When testing with --feature-gates=receiver.datadogreceiver.Enable128BitTraceID activated, I get the following failure:

2025-05-03T17:17:06.934+0200    error   [email protected]/receiver.go:254        
Error converting traces 
{
   "otelcol.component.id": "datadog", 
   "otelcol.component.kind": "receiver", 
   "otelcol.signal": "traces", 
   "error": "hex encoded trace-id must have length equals to 32"
}
Full stack trace
2025-05-03T17:17:06.934+0200    error   [email protected]/receiver.go:254        Error converting traces {"otelcol.component.id": "datadog", "otelcol.component.kind": "receiver", "otelcol.signal": "traces", "error": "hex encoded trace-id must have length equals to 32"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/datadogreceiver.(*datadogReceiver).handleTraces
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/receiver.go:254
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
net/http.(*ServeMux).ServeHTTP
        net/http/server.go:2822
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
        go.opentelemetry.io/collector/config/[email protected]/compression.go:183
go.opentelemetry.io/collector/config/confighttp.(*ServerConfig).ToServer.maxRequestBodySizeInterceptor.func2
        go.opentelemetry.io/collector/config/[email protected]/confighttp.go:615
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
        go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:179
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
        go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:67
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
        go.opentelemetry.io/collector/config/[email protected]/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
        net/http/server.go:3301
net/http.(*conn).serve
        net/http/server.go:2102
2025-05-03T17:17:09.015+0200    error   [email protected]/receiver.go:254        Error converting traces {"otelcol.component.id": "datadog", "otelcol.component.kind": "receiver", "otelcol.signal": "traces", "error": "hex encoded trace-id must have length equals to 32"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/datadogreceiver.(*datadogReceiver).handleTraces
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/receiver.go:254
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
net/http.(*ServeMux).ServeHTTP
        net/http/server.go:2822
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
        go.opentelemetry.io/collector/config/[email protected]/compression.go:183
go.opentelemetry.io/collector/config/confighttp.(*ServerConfig).ToServer.maxRequestBodySizeInterceptor.func2
        go.opentelemetry.io/collector/config/[email protected]/confighttp.go:615
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
        go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:179
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
        go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:67
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
        go.opentelemetry.io/collector/config/[email protected]/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
        net/http/server.go:3301
net/http.(*conn).serve
        net/http/server.go:2102
^C2025-05-03T17:17:09.866+0200  info    [email protected]/collector.go:358 Received signal from OS {"signal": "interrupt"}
2025-05-03T17:17:09.866+0200    info    [email protected]/service.go:331   Starting shutdown...
2025-05-03T17:17:09.866+0200    info    healthcheck/handler.go:132      Health Check state change       {"otelcol.component.id": "health_check", "otelcol.component.kind": "extension", "status": "unavailable"}
2025-05-03T17:17:09.867+0200    info    [email protected]/connector.go:219  Shutting down spanmetrics connector     {"otelcol.component.id": "spanmetrics", "otelcol.component.kind": "connector", "otelcol.signal": "traces", "otelcol.signal.output": "metrics"}
2025-05-03T17:17:09.867+0200    info    [email protected]/connector.go:221  Stopping ticker {"otelcol.component.id": "spanmetrics", "otelcol.component.kind": "connector", "otelcol.signal": "traces", "otelcol.signal.output": "metrics"}
2025-05-03T17:17:09.867+0200    info    [email protected]/connector.go:114 Stopping Grafana Cloud connector        {"otelcol.component.id": "grafanacloud", "otelcol.component.kind": "connector", "otelcol.signal": "traces", "otelcol.signal.output": "metrics"}
2025-05-03T17:17:10.485+0200    info    extensions/extensions.go:69     Stopping extensions...
2025-05-03T17:17:10.485+0200    info    [email protected]/service.go:345   Shutdown complete.

I don't know how to get the details of the Datadog span that causes this problem.

Here are my findings:

@cyrille-leclerc
Member

cyrille-leclerc commented May 4, 2025

Example exception 1: postgresql.query span

The trace was initiated on the frontend service, instrumented with dd-trace-java; a span from this service is causing the exception.

2025-05-04T14:46:39.095+0200    error   [email protected]/receiver.go:254        Error converting traces 
{
   "otelcol.component.id": "datadog", 
   "otelcol.component.kind": "receiver", 
   "otelcol.signal": "traces", 
   "error": "error converting to a 128bit traceid (_dd.p.tid: 681761ae00000000 - span.TraceID: 531758935570545189): hex encoded trace-id must have length equals to 32"
}
OTel Col stack trace

2025-05-04T14:46:39.095+0200    error   [email protected]/receiver.go:254        Error converting traces {"otelcol.component.id": "datadog", "otelcol.component.kind": "receiver", "otelcol.signal": "traces", "error": "error converting to a 128bit traceid (_dd.p.tid: 681761ae00000000 - span.TraceID: 531758935570545189): hex encoded trace-id must have length equals to 32"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/datadogreceiver.(*datadogReceiver).handleTraces
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/receiver.go:254
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
net/http.(*ServeMux).ServeHTTP
        net/http/server.go:2822
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
        go.opentelemetry.io/collector/config/[email protected]/compression.go:183
go.opentelemetry.io/collector/config/confighttp.(*ServerConfig).ToServer.maxRequestBodySizeInterceptor.func2
        go.opentelemetry.io/collector/config/[email protected]/confighttp.go:615
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
        go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:179
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
        go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:67
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2294
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
        go.opentelemetry.io/collector/config/[email protected]/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
        net/http/server.go:3301
net/http.(*conn).serve
        net/http/server.go:2102

dd-trace-java logs

[dd.trace 2025-05-04 14:46:38:693 +0200] [main] DEBUG datadog.trace.agent.core.DDSpan - Started span: DDSpan [ t_id=531758935570545189, s_id=4563708466107625754, p_id=0 ] trace=frontend/database.query/database.query tags={_dd.profiling.enabled=0, _dd.trace_span_attribute_schema=0, env=production, language=jvm, process_id=97714, runtime-id=0b29790c-1684-466e-ba5e-6b0d90f95a14, thread.id=1, thread.name=main, version=1.1}, duration_ns=0, forceKeep=false, links=[]
[dd.trace 2025-05-04 14:46:38:693 +0200] [main] DEBUG datadog.trace.agent.common.writer.RemoteWriter - Enqueued for serialization: [DDSpan [ t_id=531758935570545189, s_id=4563708466107625754, p_id=0 ] trace=postgresql/postgresql.query/insert into product (id, name, picture_url, price) values (?,?, ?, ?) on conflict do nothing *measured* tags={_dd.agent_psr=1.0, _dd.profiling.enabled=0, _dd.trace_span_attribute_schema=0, _sample_rate=1, component=java-jdbc-statement, db.instance=my_shopping_cart, db.operation=insert, db.pool.name=HikariPool-1, db.type=postgresql, db.user=my_shopping_cart, env=production, language=jvm, peer.hostname=postgresql.local, process_id=97714, runtime-id=0b29790c-1684-466e-ba5e-6b0d90f95a14, span.kind=client, thread.id=1, thread.name=main, version=1.1}, duration_ns=847666, forceKeep=false, links=[]]
[dd.trace 2025-05-04 14:46:38:693 +0200] [main] DEBUG datadog.trace.agent.core.DDSpan - Finished span (WRITTEN): DDSpan [ t_id=531758935570545189, s_id=4563708466107625754, p_id=0 ] trace=postgresql/postgresql.query/insert into product (id, name, picture_url, price) values (?,?, ?, ?) on conflict do nothing *measured* tags={_dd.agent_psr=1.0, _dd.profiling.enabled=0, _dd.trace_span_attribute_schema=0, _sample_rate=1, component=java-jdbc-statement, db.instance=my_shopping_cart, db.operation=insert, db.pool.name=HikariPool-1, db.type=postgresql, db.user=my_shopping_cart, env=production, language=jvm, peer.hostname=postgresql.local, process_id=97714, runtime-id=0b29790c-1684-466e-ba5e-6b0d90f95a14, span.kind=client, thread.id=1, thread.name=main, version=1.1}, duration_ns=847666, forceKeep=false, links=[]

Example exception 2: postgresql.query span

@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch from 9d8cbdb to b3f8d1d Compare May 4, 2025 19:46
@xiu
Contributor Author

xiu commented May 4, 2025

Thanks for testing and for the help debugging the issue!
I believe I've fixed the issue with b3f8d1d; it was due to a naive string concatenation rather than working at the byte level through uInt64ToTraceID.

I've also improved error handling so that an error no longer rejects the whole ToTraces run; instead, we keep the 64-bit trace ID for the affected span.
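
For illustration only (a sketch, not the PR's actual code), one way such a failure can arise with the values from the error above: the _dd.p.tid tag is already a hex string, while the span's TraceID field is a uint64, so concatenating the tag with the TraceID rendered in decimal yields a 34-character string instead of 32 hex characters.

package main

import (
	"fmt"
	"strconv"
)

func main() {
	upperHex := "681761ae00000000"        // _dd.p.tid tag (16 hex characters)
	var lower uint64 = 531758935570545189 // span.TraceID (uint64)

	// Naive concatenation: the lower half ends up in decimal, so the result
	// is 34 characters and fails the
	// "hex encoded trace-id must have length equals to 32" check.
	naive := upperHex + strconv.FormatUint(lower, 10)
	fmt.Println(len(naive), naive)

	// Working numerically instead, e.g. formatting the lower half as a
	// zero-padded 16-character hex string (or building the 16-byte ID at the
	// byte level, as uInt64ToTraceID does), always yields 32 hex characters.
	fixed := upperHex + fmt.Sprintf("%016x", lower)
	fmt.Println(len(fixed), fixed)
}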

@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch from b3f8d1d to 0f8a876 Compare May 4, 2025 20:41
@cyrille-leclerc
Member

cyrille-leclerc commented May 4, 2025

I could successfully test the fix:

  • Upstream service (frontend) instrumented with the OTel Java auto-instrumentation
  • Downstream service (shipping) instrumented with dd-trace-java, emitting telemetry to the otelcol datadog receiver

With otelcol 0.125.0, the traces of the frontend service don't connect to the shipping service, which is an error.
With otelcol --feature-gates=receiver.datadogreceiver.Enable128BitTraceID, the traces of the frontend service do connect to the shipping service, which is the desired behaviour.


@xiu
Contributor Author

xiu commented May 5, 2025

Unrelated vulnerability addressed in #39862

@dehaansa
Contributor

dehaansa commented May 5, 2025

Could we get a codeowner review? @boostchicken @gouthamve @MovieStoreGuy

@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch from 6e567c7 to b130d80 Compare May 6, 2025 08:38
@MovieStoreGuy
Contributor

Please resolve the conflicts :)

xiu added 8 commits May 7, 2025 07:17
With this commit, we add support for 128-bit TraceIDs coming from Datadog
instrumented services. This can happen when an OTel instrumented service calls
a downstream Datadog instrumented one. Datadog instrumentation libraries store
the 128-bit TraceID in two different fields:
* TraceID: lower 64 bits of the 128-bit TraceID
* _dd.p.tid: upper 64 bits of the 128-bit TraceID

This commit adds logic that reconstructs the 128-bit TraceID. Before this
commit, only the lower 64 bits were used as the TraceID.

Fixes open-telemetry#36926

This helps with reusing trace IDs across multiple flushes. This commit also adds
a new configuration item, `trace_id_cache_size`, which is used to size the
LRU cache.

If translator.ToTraces doesn't get a cache (e.g. nil), it'll disable the
feature.

As noted by @cyrille-leclerc, some spans failed to convert their trace ID to
128 bits. This was due to a naive string concatenation instead of properly
splitting the trace ID at the byte level as in uInt64ToTraceID. Previously, we
were also rejecting the traces in the current ToTraces run. Now, we just skip
the conversion and keep the 64-bit trace ID for those spans.
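
As context for the trace_id_cache_size commit above, here is a rough sketch of how an LRU keyed by the lower 64 bits could keep the full 128-bit ID consistent across flushes. The helper name, signature, and cache library below are assumptions for illustration, not necessarily what the PR uses; the idea, presumably, is that spans arriving in later payloads without the _dd.p.tid tag still resolve to the same 128-bit ID, with trace_id_cache_size bounding memory use.

package main

import (
	"encoding/binary"
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
)

// fullTraceID (hypothetical helper) returns the 128-bit ID for a span's
// 64-bit TraceID, remembering it in the LRU so later flushes of the same
// trace map to the same 128-bit ID even without the _dd.p.tid tag.
func fullTraceID(cache *lru.Cache[uint64, [16]byte], lower, upper uint64, hasUpper bool) [16]byte {
	if cache != nil {
		if id, ok := cache.Get(lower); ok {
			return id
		}
	}
	var id [16]byte
	if hasUpper {
		binary.BigEndian.PutUint64(id[:8], upper) // upper 64 bits from _dd.p.tid
	}
	binary.BigEndian.PutUint64(id[8:], lower) // lower 64 bits from span.TraceID
	if cache != nil && hasUpper {
		cache.Add(lower, id)
	}
	return id
}

func main() {
	cache, _ := lru.New[uint64, [16]byte](1000) // sized via trace_id_cache_size
	first := fullTraceID(cache, 531758935570545189, 0x681761ae00000000, true)
	later := fullTraceID(cache, 531758935570545189, 0, false) // later flush, no _dd.p.tid
	fmt.Println(first == later)                               // true
}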
@xiu xiu force-pushed the fix/datadogreceiver_128bits_traceid branch from b130d80 to 6f904c0 Compare May 7, 2025 05:19
@xiu
Contributor Author

xiu commented May 7, 2025

@MovieStoreGuy done!

@ArthurSens ArthurSens added ready to merge Code review completed; ready to merge by maintainers and removed waiting-for-code-owners labels May 7, 2025
@atoulme atoulme merged commit 8965825 into open-telemetry:main May 8, 2025
187 of 190 checks passed
@github-actions github-actions bot added this to the next release milestone May 8, 2025
dragonlord93 pushed a commit to dragonlord93/opentelemetry-collector-contrib that referenced this pull request May 23, 2025
dd-jasminesun pushed a commit to DataDog/opentelemetry-collector-contrib that referenced this pull request Jun 23, 2025
Labels
ready to merge (Code review completed; ready to merge by maintainers), receiver/datadog
Development

Successfully merging this pull request may close these issues.

Broken trace context propagation: OTel Trace ID of DD agent spans converted by OTel Col Datadog Receiver are wrong
7 participants