
@kolesnikovae (Contributor) commented Mar 7, 2025

Hello team,

First off, thanks for the amazing work on this project! We use it extensively at Grafana, and we truly appreciate the effort that goes into maintaining it.

This PR removes unnecessary allocations in sysenc.Unmarshal caused by the binary.Read call. We propose using binary.Decode (introduced in Go 1.23) as an alternative. While the change may appear minor, its impact can be substantial depending on the specific use case.

In the flame graphs below, you can observe one of our services experiencing significant GC pressure, with around 30% of the CPU time spent in runtime.scanobject. The heap allocations profile further highlights the issue, tracing it to the sysenc.Unmarshal call within the ebpf.(*Map).Lookup -> ebpf.unmarshalPerCPUValue call chain.

CPU profile: [flame graph image]

Heap allocs profile (alloc_objects): [flame graph image]

The profiles were collected over the course of an hour from a fleet of observability agents using Pyroscope and the standard Go pprof endpoints (we also use cilium/ebpf for profiling, though not in this particular case).

Nearly half of all the allocations made by the program occur due to binary.Read in sysenc.Unmarshal. While sysenc.Unmarshal is generally well optimized and only falls back to binary.Read as a last resort, our specific use case misses most of those fast paths.

The exact locations where allocations occur: [code permalinks not shown]

Benchmarks confirm the presence of an allocation in sysenc.Unmarshal. After applying the optimization from this PR, the benchmark results show zero allocations.
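The old path's per-call allocation can be reproduced outside the repository with testing.Benchmark. This is a sketch under the assumption of a hypothetical 16-byte struct, not the PR's benchmark suite:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"testing"
)

// struc is a hypothetical 16-byte struct that takes the slow path.
type struc struct {
	A uint64
	B uint64
}

// oldPathAllocs measures heap allocations per op for the
// binary.Read-based slow path: the bytes.Reader wrapper plus
// binary.Read's internal scratch buffer.
func oldPathAllocs() int64 {
	buf := make([]byte, 16)
	var v struc
	r := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if err := binary.Read(bytes.NewReader(buf), binary.LittleEndian, &v); err != nil {
				b.Fatal(err)
			}
		}
	})
	return r.AllocsPerOp()
}

func main() {
	// At least one allocation per op is expected on the old path.
	fmt.Println(oldPathAllocs() >= 1) // true
}
```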

Benchmark Results
goos: darwin
goarch: arm64
pkg: github.com/cilium/ebpf/internal/sysenc
cpu: Apple M1 Pro
                                     │  before.txt  │              after.txt               │
                                     │    sec/op    │    sec/op     vs base                │
Unmarshal/*[1]uint64-10                 41.16n ± 0%   42.85n ±  1%   +4.11% (p=0.000 n=10)
Unmarshal/*int16-10                     19.30n ± 2%   19.58n ± 10%   +1.50% (p=0.009 n=10)
Unmarshal/*uint16-10                    19.27n ± 0%   19.35n ±  1%   +0.42% (p=0.019 n=10)
Unmarshal/*int32-10                     19.86n ± 1%   19.95n ±  1%        ~ (p=0.060 n=10)
Unmarshal/*uint32-10                    18.95n ± 0%   19.04n ±  1%        ~ (p=0.072 n=10)
Unmarshal/*int64-10                     19.58n ± 3%   19.65n ±  2%        ~ (p=0.643 n=10)
Unmarshal/*uint64-10                    19.64n ± 1%   19.58n ±  2%        ~ (p=0.255 n=10)
Unmarshal/[]uint8-10                    19.30n ± 0%   19.36n ±  1%        ~ (p=0.811 n=10)
Unmarshal/*sysenc.explicitPad-10        44.42n ± 1%   43.69n ±  2%   -1.65% (p=0.001 n=10)
Unmarshal/[]sysenc.explicitPad-10       43.81n ± 0%   42.86n ±  0%   -2.16% (p=0.000 n=10)
Unmarshal/[]sysenc.explicitPad#01-10    44.81n ± 1%   43.83n ±  1%   -2.21% (p=0.000 n=10)
Unmarshal/[]sysenc.explicitPad#02-10    44.53n ± 2%   44.77n ±  1%        ~ (p=0.127 n=10)
Unmarshal/*sysenc.struc-10              86.26n ± 0%   62.59n ±  0%  -27.44% (p=0.000 n=10)
Unmarshal/[]sysenc.struc-10             46.00n ± 1%   45.12n ±  0%   -1.91% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#01-10          95.92n ± 1%   70.00n ±  0%  -27.03% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#02-10         131.80n ± 0%   91.78n ±  0%  -30.37% (p=0.000 n=10)
geomean                                 36.10n        33.91n         -6.06%

                                     │  before.txt  │                after.txt                │
                                     │     B/op     │    B/op     vs base                     │
Unmarshal/*[1]uint64-10                0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int16-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint16-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int32-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint32-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int64-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint64-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]uint8-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.explicitPad-10       0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad-10      0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#01-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#02-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.struc-10             16.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc-10            0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.struc#01-10         16.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#02-10         24.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)


                                     │  before.txt  │                after.txt                │
                                     │  allocs/op   │ allocs/op   vs base                     │
Unmarshal/*[1]uint64-10                0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int16-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint16-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int32-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint32-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int64-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint64-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]uint8-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.explicitPad-10       0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad-10      0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#01-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#02-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.struc-10             1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc-10            0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.struc#01-10         1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#02-10         1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)

Thank you for considering this change. Please let me know if you need any additional information or clarifications.

@kolesnikovae kolesnikovae requested a review from a team as a code owner March 7, 2025 06:58
@kolesnikovae kolesnikovae force-pushed the perf-sysenc-unmarshal-binary-decode branch from 1730e57 to fefaf4a Compare March 7, 2025 07:01
@lmb (Collaborator) commented Mar 10, 2025

Looks good to me, thanks for your contribution! Can you add a test using testing.AllocsPerRun which asserts that the slow path doesn't have to allocate?
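Such an assertion can be sketched with testing.AllocsPerRun. The shape below is hypothetical (a stand-in struct, not the test ultimately added to the repository) and requires Go 1.23+ for binary.Decode:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"testing"
)

// struc is a hypothetical struct that takes the reflection-based
// slow path.
type struc struct {
	A uint64
	B uint64
}

// slowPathAllocs reports the average heap allocations per call of
// the binary.Decode-based slow path; with the PR applied it should
// be zero. AllocsPerRun runs the function once as a warm-up, so
// one-time caches do not skew the result.
func slowPathAllocs() float64 {
	buf := make([]byte, 16)
	var v struc
	return testing.AllocsPerRun(100, func() {
		if _, err := binary.Decode(buf, binary.LittleEndian, &v); err != nil {
			panic(err)
		}
	})
}

func main() {
	fmt.Println(slowPathAllocs()) // 0
}
```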

@kolesnikovae kolesnikovae force-pushed the perf-sysenc-unmarshal-binary-decode branch from 343b82e to 87a8080 Compare March 11, 2025 05:04
@kolesnikovae (Contributor, Author) commented:

@lmb Thanks for the review! I've added TestUnmarshalSlowPathZeroAllocations with a subset of test cases that previously caused an allocation in the slow path. Alternatively, we could remove the !test.zeroAllocs condition in TestUnmarshalAllocations, and the test would still pass – let me know if you'd like me to do that.

@ti-mo (Collaborator) commented Mar 12, 2025

> @lmb Thanks for the review! I've added TestUnmarshalSlowPathZeroAllocations with a subset of test cases that previously caused an allocation in the slow path. Alternatively, we could remove the !test.zeroAllocs condition in TestUnmarshalAllocations, and the test would still pass – let me know if you'd like me to do that.

Yes, I think that would be ideal! No need for a separate test, let's update the existing one to reflect the new situation. Thanks!

@kolesnikovae kolesnikovae force-pushed the perf-sysenc-unmarshal-binary-decode branch from dc9d6e6 to 7c794bb Compare March 14, 2025 06:29
@kolesnikovae (Contributor, Author) commented:

Thank you for the review @ti-mo! I've applied the requested changes. Could you please take another look?

Commit message:

Remove allocations when unmarshaling values by using the new
binary.Decode API.

Signed-off-by: Anton Kolesnikov <[email protected]>
@lmb lmb force-pushed the perf-sysenc-unmarshal-binary-decode branch from 7c794bb to d620a32 Compare March 17, 2025 11:16
@lmb (Collaborator) left a review comment:

Thanks! Added a comment to the tests and squashed the commits.

@lmb lmb merged commit 0bfb15f into cilium:main Mar 17, 2025
17 checks passed