Skip to content

Conversation

@fandreuz
Copy link
Contributor

@fandreuz fandreuz commented May 30, 2025

In this PR I update the OTLP version to v1.7.0.

All unit tests are green, but integration tests in pkg/test/integration/ingest_otlp_test.go aren't: .pb.bin files in testdata must be updated. Is there an automated way to run the eBPF profiler on some known programs to refresh the test data? I put together an updated version of opentelemetry-ebpf-profiler consuming open-telemetry/opentelemetry-collector#13075.

@fandreuz fandreuz requested a review from a team as a code owner May 30, 2025 00:49
@CLAassistant
Copy link

CLAassistant commented May 30, 2025

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@fandreuz fandreuz changed the title Update go.opentelemetry.io/proto/otlp to v1.7.0 Bump go.opentelemetry.io/proto/otlp to v1.7.0 May 30, 2025
@korniltsev
Copy link
Contributor

Is there an automated way to run the eBPF profiler on some known programs to refresh the test data?

no.

I made the dumps with something like this

+var dumpNo = atomic.Int64{}
+
 func (h *ingestHandler) Export(ctx context.Context, er *pprofileotlp.ExportProfilesServiceRequest) (*pprofileotlp.ExportProfilesServiceResponse, error) {
+       dump, _ := proto.Marshal(er)
+       _ = os.WriteFile(fmt.Sprintf("dump_%d", dumpNo.Add(1)), dump, 0666)

I suggest you regenerate the first testcase "unsymbolized profile from otel-ebpf-profiler" and testing.T.Skip or comment out the rest of testcases (they were generated using alloy & grafana fork of otel ebpf profiler). I will regenerate them later.

@korniltsev
Copy link
Contributor

Other than that it looks good to me. Does the new version work for asprof? 🤞 🥳

@fandreuz
Copy link
Contributor Author

I suggest you regenerate the first testcase "unsymbolized profile from otel-ebpf-profiler" and testing.T.Skip or comment out the rest of testcases (they were generated using alloy & grafana fork of otel ebpf profiler). I will regenerate them later.

Ok thanks, let me know if you're happy with the new profiles in testdata, it seems like I have many deletions: 7780f1e

Other than that it looks good to me. Does the new version work for asprof? 🤞 🥳

Works like a charm 😃 Just tested with OTLP v1.7.0 and patched Pyroscope

@korniltsev korniltsev enabled auto-merge (squash) June 1, 2025 10:01
@korniltsev
Copy link
Contributor

  1. Could you please run make go/mod and push the go mod changes so that CI is happy?
  2. I think it is a good idea to add testdata generated from asprof. I've created an issue for this Update otlp integration testdata #4215 . Feel free to submit something in a follow PR if you feel like.

auto-merge was automatically disabled June 1, 2025 12:06

Head branch was pushed to by a user without write access

@fandreuz
Copy link
Contributor Author

fandreuz commented Jun 1, 2025

Could you please run make go/mod and push the go mod changes so that CI is happy?

Sure: 75f3b60

I think it is a good idea to add testdata generated from asprof

Yeah I can do that, I'll follow up in the coming days

@korniltsev korniltsev enabled auto-merge (squash) June 1, 2025 12:14
@fandreuz
Copy link
Contributor Author

fandreuz commented Jun 1, 2025

I'm looking at the test failure, will push again soon

auto-merge was automatically disabled June 1, 2025 12:53

Head branch was pushed to by a user without write access

@fandreuz
Copy link
Contributor Author

fandreuz commented Jun 1, 2025

Fixed, I wrongly assumed at least one sample type should always be provided

@fandreuz
Copy link
Contributor Author

fandreuz commented Jun 1, 2025

As a side note, I noticed Pyroscope crashes altogether when it ingests an invalid OTLP payload. For instance, I had an off-by-one bug involving the string table in one of my profiles. This is not related to this PR, is it something that should be looked at?

@korniltsev
Copy link
Contributor

korniltsev commented Jun 1, 2025

Fixed, I wrongly assumed at least one sample type should always be provided

Do you know when and why this is not the case? How did it work before?

As a side note, I noticed Pyroscope crashes altogether when it ingests an invalid OTLP payload. For instance, I had an off-by-one bug involving the string table in one of my profiles. This is not related to this PR, is it something that should be looked at?

We better not panic and fix the panic. At the same time I thought we had a per-request panic recovery mechanisms. If we do not have them for otlp, we should add them. We don't have to do this in this PR and do a followups if needed.

res.dst.PeriodType = res.convertValueTypeBack(src.PeriodType, dictionary)

var defaultSampleType string
if len(src.SampleType) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I am confused a bit. What is the value of DefaultSampleTypeIndex if len(src.SampleType) is 0?

Do you think we should reject such profiles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know when and why this is not the case?

I think there should always be one item in Profile.sample_type for each item in Profile.sample.value. Having e.g. len(Profile.sample.value) == 1 and len(Profile.sample_type) == 0 seems like a malformed profile to me.

How did it work before?

There used to be a field called default_sample_type_strindex which would be used to index directly into the string table. However, this has been replaced with default_sample_type_index, which should be used to index into Profile.sample_type. So we didn't need to check anything, we could just take whatever value was at the first index of the string table.

What is the value of DefaultSampleTypeIndex if len(src.SampleType) is 0?

DefaultSampleTypeIndex can be unset, in which case it defaults to zero. If len(src.SampleType) == 0, accessing src.SampleType[DefaultSampleTypeIndex] will cause an out-of-bounds exception. Thus, the value of DefaultSampleTypeIndex is invalid, and the profile is malformed.

Do you think we should reject such profiles?

I could not find an indication in the OTEL profiling spec about whether profiles with an empty sample_type list should be considered malformed and rejected. I would find it logical, otherwise there should be a mandated default sample type.

I'll think about it, and maybe raise this topic at the next Profiling SIG meeting. For the time being, I think assigning a default value to defaultSampleType is a safe choice.

@korniltsev korniltsev merged commit 837db07 into grafana:main Jun 2, 2025
25 checks passed
@korniltsev
Copy link
Contributor

Thank you for working on this

@kolesnikovae
Copy link
Collaborator

The PR caused panics in our internal dev environment, where we dogfood OTLP format, so I had to revert it: #4218

@fandreuz
Copy link
Contributor Author

fandreuz commented Jun 2, 2025

Hi @kolesnikovae, sorry for that. Where can I find the configuration code for this test? I'd like to check it out locally. If it's using v1.5.0 data, it should be updated too.

Comment on lines +179 to 187
func serviceNameFromSample(sample *otelProfile.Sample, dictionary *otelProfile.ProfilesDictionary) string {
for _, attributeIndex := range sample.AttributeIndices {
attribute := p.AttributeTable[attributeIndex]
attribute := dictionary.AttributeTable[attributeIndex]
if attribute.Key == serviceNameKey {
return attribute.Value.GetStringValue()
}
}
return ""
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look to deep, but it feels like a major change in the OTel spec, where we add a Dictionary member (to share parts across resource profiles). In the code we assume that the member is always present, and we hit nil deference the first time we access it – on this line.

Although it is an experimental API in dev, let's handle it more gracefully: if specs don't say anything about the fallback, we should drop the profile/request.

@kolesnikovae
Copy link
Collaborator

kolesnikovae commented Jun 3, 2025

It's all cool @fandreuz – no worries. I think it's here https://github.com/grafana/pyroscope/pull/4212/files#r2122527475

We should add sanitization for things like broken references, semantic issues, off-by-one errors, etc. All sorts of issues are totally expected – clearly out of scope of the PR – I'll create an issue to make sure we sanitize everything and handle such failures more gracefully.

Update: created #4220

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants